Module 5: Date, Time & Text Functions35 min

String Functions

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Cleaning and Transforming Text

We’ll be honest: most of your time as a data analyst isn’t writing elegant analytical queries. It’s cleaning up someone else’s mess. Names with trailing spaces. Addresses stored as “city, state zip” in a single field. Phone numbers in twelve different formats.

String functions are your cleanup toolkit. Master these, and you’ll spend less time fighting data and more time analyzing it.

Dialect Comparison: String Concatenation

  • Concatenate:
    • PostgreSQL uses || or CONCAT()
    • MySQL uses CONCAT()
    • MSSQL uses + or CONCAT()
    • Oracle uses || or CONCAT()
  • With NULL:
    • PostgreSQL’s || with NULL returns NULL
    • MySQL’s CONCAT('a', NULL) returns 'a'
    • MSSQL’s 'a' + NULL returns NULL
    • Oracle’s || with NULL returns 'a'
NULL handling differs across dialects

PostgreSQL’s || with NULL returns NULL. We’ve seen this silently break reports when one field is unexpectedly NULL. Use CONCAT() or COALESCE() for safer concatenation.

Safe Concatenation with COALESCE

In production, always assume data can be NULL. Here’s the defensive pattern:

PostgreSQL
-- Handle potential NULLs (no awkward double spaces)
SELECT
  COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
FROM users;

-- Even better: TRIM the result to handle missing first OR last name
SELECT
  TRIM(
    COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
  ) AS full_name
FROM users;

Flags per Video

Use CONCAT with COALESCE to safely combine user_firstname and user_lastname, handling NULL values.

Table: user_flags
user_firstnameuser_lastnamevideo_idflag_id
RichardHassony6120QOlsfU0cazx3
MarkMayCt6BUPvE2sM1cn76u
GinaKormandQw4w9WgXcQ1i43zk
MarkMayCt6BUPvE2sM1n0vef
MarkMayjNQXAC9IVRw1sv6ib
1
Flags per Video
View solution

For each video, find how many unique users flagged it. A unique user can be identified using the combination of their first name and last name. Do not consider rows in which there is no flag ID.

Tables: user_flags

SUBSTRING: Extracting Parts of Strings

Extract a portion of a string by position:

2
Try SUBSTRING

Extract the first 3 characters of each first_name and show the full name length. Remember: SQL positions are 1-indexed.

Tables: worker
1-indexed, not 0

Position is 1-indexed in SQL (not 0 like Python or JavaScript). The first character is position 1. This trips up every programmer switching to SQL.

Dialect Comparison: Substring

  • Substring:
    • PostgreSQL uses SUBSTRING(s, start, len)
    • MySQL uses SUBSTRING(s, start, len)
    • MSSQL uses SUBSTRING(s, start, len)
    • Oracle uses SUBSTR(s, start, len)
  • Left N chars:
    • PostgreSQL uses LEFT(s, n)
    • MySQL uses LEFT(s, n)
    • MSSQL uses LEFT(s, n)
    • Oracle uses SUBSTR(s, 1, n)
  • Right N chars:
    • PostgreSQL uses RIGHT(s, n)
    • MySQL uses RIGHT(s, n)
    • MSSQL uses RIGHT(s, n)
    • Oracle uses SUBSTR(s, -n)

Case Conversion: UPPER and LOWER

Case inconsistency is the one of the most common causes of “why doesn’t my JOIN work?” bugs. Standardize before comparing:

PostgreSQL
-- Convert to lowercase for case-insensitive comparison
SELECT *
FROM products
WHERE LOWER(category) = 'electronics';

-- Format for display
SELECT
  UPPER(first_name) || ' ' || UPPER(last_name) AS name_caps
FROM employees;

Good news: UPPER(s) and LOWER(s) work identically across all major dialects.

3
Try Case-Insensitive Filter

Find all employees in the Admin department using a case-insensitive comparison.

Tables: worker

TRIM: Removing Whitespace

Invisible trailing spaces are insidious. ‘Admin’ and ‘Admin ’ look the same but won’t match. TRIM is your friend:

PostgreSQL
-- Remove leading/trailing spaces
SELECT TRIM('  hello world  ');
-- 'hello world'

-- Clean messy data before comparing
SELECT
  TRIM(product_name) AS clean_name
FROM products
WHERE TRIM(product_name) = 'Widget';

Variants: LTRIM and RTRIM

PostgreSQL
-- Left trim only
SELECT LTRIM('  hello');
-- 'hello'

-- Right trim only
SELECT RTRIM('hello  ');
-- 'hello'
Suspect whitespace when matches fail

When debugging “this should match but doesn’t” issues, always suspect whitespace. Adding TRIM() to both sides of JOINs on string columns is a good reflex.

Percentage of Shipable Orders

Use TRIM to clean address data before joining, then calculate the percentage.

Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80
Table: customers
idfirst_namelast_namecityaddressphone_number
8JohnJosephSan Francisco928-386-8164
7JillMichaelAustin813-297-0692
4WilliamDanielDenver813-368-1200
5HenryJacksonMiami808-601-7513
13EmmaIsaacMiami808-690-5201
4
Find the percentage of shipable orders
View solution

Find the percentage of shipable orders. Consider an order is shipable if the customer's address is known.

Tables: orders, customers

LENGTH: String Length

Count characters in a string:

PostgreSQL
SELECT LENGTH('Hello');
-- 5

-- Find long product names (maybe truncated in reports)
SELECT
  product_name,
  LENGTH(product_name) AS name_length
FROM products
WHERE LENGTH(product_name) > 50;

Dialect Note

  • String length:
    • PostgreSQL uses LENGTH
    • MySQL uses LENGTH
    • MSSQL uses LEN
    • Oracle uses LENGTH

Finding Substrings: POSITION

Find where a substring appears:

PostgreSQL
-- POSITION(substring IN string)
SELECT POSITION('@' IN 'user@email.com');
-- 5

-- Find emails with specific domain
SELECT email
FROM users
WHERE POSITION('@gmail.com' IN email) > 0;

Dialect Comparison: Position/Index

  • Find position:
    • PostgreSQL uses POSITION(sub IN s)
    • MySQL uses LOCATE(sub, s)
    • MSSQL uses CHARINDEX(sub, s)
    • Oracle uses INSTR(s, sub)

Extracting Parts with POSITION + SUBSTRING

Combine POSITION and SUBSTRING to extract dynamic portions:

PostgreSQL
-- Extract domain from email (everything after @)
SELECT
  email,
  SUBSTRING(email, POSITION('@' IN email) + 1) AS domain
FROM users;

Combining String Functions

Real data cleaning chains multiple functions. Here’s a pattern we use constantly:

5
Try Chaining String Functions

Create a clean_name column (uppercase trimmed first + last name), a clean_dept column (lowercase trimmed department), and sort by name_length descending.

Tables: worker

Key Takeaways

  • Use || or CONCAT() to combine strings (watch NULL behavior — it’s a silent killer)
  • SUBSTRING extracts parts of strings by position (1-indexed, not 0!)
  • UPPER/LOWER standardize case — use before comparing or joining on text
  • TRIM removes leading/trailing whitespace — suspect it whenever matches fail
  • LENGTH returns string length (LEN in MSSQL, and watch trailing spaces)
  • POSITION finds where substrings appear — combine with SUBSTRING for extraction

What’s Next

These core functions handle 80% of string work. But sometimes you need heavier machinery: splitting comma-separated values into rows, pattern matching with regular expressions, and extracting the Nth element from delimited data. That’s next.