String Functions
Progress Tracking
Log in to save this lesson and continue from where you left off.
Cleaning and Transforming Text
We’ll be honest: most of your time as a data analyst isn’t writing elegant analytical queries. It’s cleaning up someone else’s mess. Names with trailing spaces. Addresses stored as “city, state zip” in a single field. Phone numbers in twelve different formats.
String functions are your cleanup toolkit. Master these, and you’ll spend less time fighting data and more time analyzing it.
Dialect Comparison: String Concatenation
Concatenation syntax:
- PostgreSQL —
||orCONCAT() - MySQL —
CONCAT() - MSSQL —
+orCONCAT() - Oracle —
||orCONCAT()
Behavior with NULL:
- PostgreSQL —
||returnsNULL;CONCAT()ignoresNULL - MySQL —
CONCAT('a', NULL)returnsNULL; useCONCAT_WS()to skip NULLs - MSSQL —
'a' + NULLreturnsNULL;CONCAT()ignoresNULL - Oracle —
||andCONCAT()both treatNULLas empty string, returning the non-NULL value
Operators (||, +) return NULL when any argument is NULL — except in Oracle. We’ve seen this silently break reports when one field is unexpectedly NULL. Use CONCAT() (which skips NULLs in PostgreSQL/MSSQL/Oracle) or COALESCE() for safer concatenation.
Safe Concatenation with COALESCE
In production, always assume data can be NULL. Here’s the defensive pattern:
-- Handle potential NULLs (no awkward double spaces)
SELECT
COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
FROM users;
-- Even better: TRIM the result to handle missing first OR last name
SELECT
TRIM(
COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
) AS full_name
FROM users;Flags per Video
Use CONCAT with COALESCE to safely combine user_firstname and user_lastname, handling NULL values.
| user_firstname | user_lastname | video_id | flag_id |
|---|---|---|---|
| Richard | Hasson | y6120QOlsfU | 0cazx3 |
| Mark | May | Ct6BUPvE2sM | 1cn76u |
| Gina | Korman | dQw4w9WgXcQ | 1i43zk |
| Mark | May | Ct6BUPvE2sM | 1n0vef |
| Mark | May | jNQXAC9IVRw | 1sv6ib |
For each video, find how many unique users flagged it. A unique user can be identified using the combination of their first name and last name. Do not consider rows in which there is no flag ID.
SUBSTRING: Extracting Parts of Strings
Extract a portion of a string by position:
Extract the first 3 characters of each first_name and show the full name length. Remember: SQL positions are 1-indexed.
Position is 1-indexed in SQL (not 0 like Python or JavaScript). The first character is position 1. This trips up many programmers switching to SQL.
Dialect Comparison: Substring
Substring:
- PostgreSQL —
SUBSTRING(s, start, len) - MySQL —
SUBSTRING(s, start, len) - MSSQL —
SUBSTRING(s, start, len) - Oracle —
SUBSTR(s, start, len)
Left N chars:
- PostgreSQL —
LEFT(s, n) - MySQL —
LEFT(s, n) - MSSQL —
LEFT(s, n) - Oracle —
SUBSTR(s, 1, n)
Right N chars:
- PostgreSQL —
RIGHT(s, n) - MySQL —
RIGHT(s, n) - MSSQL —
RIGHT(s, n) - Oracle —
SUBSTR(s, -n)
Case Conversion: UPPER and LOWER
Case inconsistency is a common cause of “why doesn’t my JOIN work?” bugs. Standardize before comparing:
-- Convert to lowercase for case-insensitive comparison
SELECT *
FROM products
WHERE LOWER(category) = 'electronics';
-- Format for display
SELECT
UPPER(first_name) || ' ' || UPPER(last_name) AS name_caps
FROM employees;Good news: UPPER(s) and LOWER(s) work identically across all major dialects.
Find all employees in the Admin department using a case-insensitive comparison.
TRIM: Removing Whitespace
Invisible trailing spaces are insidious. ‘Admin’ and ‘Admin ’ look the same but won’t match. TRIM is your friend:
-- Remove leading/trailing spaces
SELECT TRIM(' hello world ');
-- 'hello world'
-- Clean messy data before comparing
SELECT
TRIM(product_name) AS clean_name
FROM products
WHERE TRIM(product_name) = 'Widget';Variants: LTRIM and RTRIM
-- Left trim only
SELECT LTRIM(' hello');
-- 'hello'
-- Right trim only
SELECT RTRIM('hello ');
-- 'hello'When debugging “this should match but doesn’t” issues, always suspect whitespace. Adding TRIM() to both sides of JOINs on string columns is a good reflex.
Percentage of Shipable Orders
Use TRIM to clean address data before joining, then calculate the percentage.
| id | cust_id | order_date | order_details | total_order_cost |
|---|---|---|---|---|
| 1 | 3 | 2019-03-04 | Coat | 100 |
| 2 | 3 | 2019-03-01 | Shoes | 80 |
| 3 | 3 | 2019-03-07 | Skirt | 30 |
| 4 | 7 | 2019-02-01 | Coat | 25 |
| 5 | 7 | 2019-03-10 | Shoes | 80 |
| id | first_name | last_name | city | address | phone_number |
|---|---|---|---|---|---|
| 8 | John | Joseph | San Francisco | 928-386-8164 | |
| 7 | Jill | Michael | Austin | 813-297-0692 | |
| 4 | William | Daniel | Denver | 813-368-1200 | |
| 5 | Henry | Jackson | Miami | 808-601-7513 | |
| 13 | Emma | Isaac | Miami | 808-690-5201 |
Find the percentage of shipable orders. Consider an order is shipable if the customer's address is known.
LENGTH: String Length
Count characters in a string:
SELECT LENGTH('Hello');
-- 5
-- Find long product names (maybe truncated in reports)
SELECT
product_name,
LENGTH(product_name) AS name_length
FROM products
WHERE LENGTH(product_name) > 50;Dialect Note
String length:
- PostgreSQL —
LENGTH(s) - MySQL —
LENGTH(s) - MSSQL —
LEN(s) - Oracle —
LENGTH(s)
Finding Substrings: POSITION
Find where a substring appears:
-- POSITION(substring IN string)
SELECT POSITION('@' IN 'user@email.com');
-- 5
-- Find emails with specific domain
SELECT email
FROM users
WHERE POSITION('@gmail.com' IN email) > 0;Dialect Comparison: Position/Index
Find position:
- PostgreSQL —
POSITION(sub IN s) - MySQL —
LOCATE(sub, s) - MSSQL —
CHARINDEX(sub, s) - Oracle —
INSTR(s, sub)
Extracting Parts with POSITION + SUBSTRING
Combine POSITION and SUBSTRING to extract dynamic portions:
-- Extract domain from email (everything after @)
SELECT
email,
SUBSTRING(email, POSITION('@' IN email) + 1) AS domain
FROM users;Combining String Functions
Real data cleaning chains multiple functions. Here’s a pattern we use constantly:
Create a clean_name column (uppercase trimmed first + last name), a clean_dept column (lowercase trimmed department), and sort by name_length descending.
Key Takeaways
- Use
||orCONCAT()to combine strings (watchNULLbehavior — it’s a silent killer) SUBSTRINGextracts parts of strings by position (1-indexed, not 0!)UPPER/LOWERstandardize case — use before comparing or joining on textTRIMremoves leading/trailing whitespace — suspect it whenever matches failLENGTHreturns string length (LENin MSSQL, and watch trailing spaces)POSITIONfinds where substrings appear — combine withSUBSTRINGfor extraction
What’s Next
These core functions handle most string work. But sometimes you need heavier machinery: splitting comma-separated values into rows, pattern matching with regular expressions, and extracting the Nth element from delimited data. That’s next.