String Functions
Progress Tracking
Log in to save this lesson and continue from where you left off.
Cleaning and Transforming Text
We’ll be honest: most of your time as a data analyst isn’t writing elegant analytical queries. It’s cleaning up someone else’s mess. Names with trailing spaces. Addresses stored as “city, state zip” in a single field. Phone numbers in twelve different formats.
String functions are your cleanup toolkit. Master these, and you’ll spend less time fighting data and more time analyzing it.
Dialect Comparison: String Concatenation
- Concatenate:
- PostgreSQL uses
||orCONCAT() - MySQL uses
CONCAT() - MSSQL uses
+orCONCAT() - Oracle uses
||orCONCAT()
- PostgreSQL uses
- With
NULL:- PostgreSQL’s
||withNULLreturnsNULL - MySQL’s
CONCAT('a', NULL)returns'a' - MSSQL’s
'a' + NULLreturnsNULL - Oracle’s
||withNULLreturns'a'
- PostgreSQL’s
PostgreSQL’s || with NULL returns NULL. We’ve seen this silently break reports when one field is unexpectedly NULL. Use CONCAT() or COALESCE() for safer concatenation.
Safe Concatenation with COALESCE
In production, always assume data can be NULL. Here’s the defensive pattern:
-- Handle potential NULLs (no awkward double spaces)
SELECT
COALESCE(first_name, '') || ' ' || COALESCE(last_name, '') AS full_name
FROM users;
-- Even better: TRIM the result to handle missing first OR last name
SELECT
TRIM(
COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
) AS full_name
FROM users;Flags per Video
Use CONCAT with COALESCE to safely combine user_firstname and user_lastname, handling NULL values.
| user_firstname | user_lastname | video_id | flag_id |
|---|---|---|---|
| Richard | Hasson | y6120QOlsfU | 0cazx3 |
| Mark | May | Ct6BUPvE2sM | 1cn76u |
| Gina | Korman | dQw4w9WgXcQ | 1i43zk |
| Mark | May | Ct6BUPvE2sM | 1n0vef |
| Mark | May | jNQXAC9IVRw | 1sv6ib |
For each video, find how many unique users flagged it. A unique user can be identified using the combination of their first name and last name. Do not consider rows in which there is no flag ID.
SUBSTRING: Extracting Parts of Strings
Extract a portion of a string by position:
Extract the first 3 characters of each first_name and show the full name length. Remember: SQL positions are 1-indexed.
Position is 1-indexed in SQL (not 0 like Python or JavaScript). The first character is position 1. This trips up every programmer switching to SQL.
Dialect Comparison: Substring
- Substring:
- PostgreSQL uses
SUBSTRING(s, start, len) - MySQL uses
SUBSTRING(s, start, len) - MSSQL uses
SUBSTRING(s, start, len) - Oracle uses
SUBSTR(s, start, len)
- PostgreSQL uses
- Left N chars:
- PostgreSQL uses
LEFT(s, n) - MySQL uses
LEFT(s, n) - MSSQL uses
LEFT(s, n) - Oracle uses
SUBSTR(s, 1, n)
- PostgreSQL uses
- Right N chars:
- PostgreSQL uses
RIGHT(s, n) - MySQL uses
RIGHT(s, n) - MSSQL uses
RIGHT(s, n) - Oracle uses
SUBSTR(s, -n)
- PostgreSQL uses
Case Conversion: UPPER and LOWER
Case inconsistency is the one of the most common causes of “why doesn’t my JOIN work?” bugs. Standardize before comparing:
-- Convert to lowercase for case-insensitive comparison
SELECT *
FROM products
WHERE LOWER(category) = 'electronics';
-- Format for display
SELECT
UPPER(first_name) || ' ' || UPPER(last_name) AS name_caps
FROM employees;Good news: UPPER(s) and LOWER(s) work identically across all major dialects.
Find all employees in the Admin department using a case-insensitive comparison.
TRIM: Removing Whitespace
Invisible trailing spaces are insidious. ‘Admin’ and ‘Admin ’ look the same but won’t match. TRIM is your friend:
-- Remove leading/trailing spaces
SELECT TRIM(' hello world ');
-- 'hello world'
-- Clean messy data before comparing
SELECT
TRIM(product_name) AS clean_name
FROM products
WHERE TRIM(product_name) = 'Widget';Variants: LTRIM and RTRIM
-- Left trim only
SELECT LTRIM(' hello');
-- 'hello'
-- Right trim only
SELECT RTRIM('hello ');
-- 'hello'When debugging “this should match but doesn’t” issues, always suspect whitespace. Adding TRIM() to both sides of JOINs on string columns is a good reflex.
Percentage of Shipable Orders
Use TRIM to clean address data before joining, then calculate the percentage.
| id | cust_id | order_date | order_details | total_order_cost |
|---|---|---|---|---|
| 1 | 3 | 2019-03-04 | Coat | 100 |
| 2 | 3 | 2019-03-01 | Shoes | 80 |
| 3 | 3 | 2019-03-07 | Skirt | 30 |
| 4 | 7 | 2019-02-01 | Coat | 25 |
| 5 | 7 | 2019-03-10 | Shoes | 80 |
| id | first_name | last_name | city | address | phone_number |
|---|---|---|---|---|---|
| 8 | John | Joseph | San Francisco | 928-386-8164 | |
| 7 | Jill | Michael | Austin | 813-297-0692 | |
| 4 | William | Daniel | Denver | 813-368-1200 | |
| 5 | Henry | Jackson | Miami | 808-601-7513 | |
| 13 | Emma | Isaac | Miami | 808-690-5201 |
Find the percentage of shipable orders. Consider an order is shipable if the customer's address is known.
LENGTH: String Length
Count characters in a string:
SELECT LENGTH('Hello');
-- 5
-- Find long product names (maybe truncated in reports)
SELECT
product_name,
LENGTH(product_name) AS name_length
FROM products
WHERE LENGTH(product_name) > 50;Dialect Note
- String length:
- PostgreSQL uses
LENGTH - MySQL uses
LENGTH - MSSQL uses
LEN - Oracle uses
LENGTH
- PostgreSQL uses
Finding Substrings: POSITION
Find where a substring appears:
-- POSITION(substring IN string)
SELECT POSITION('@' IN 'user@email.com');
-- 5
-- Find emails with specific domain
SELECT email
FROM users
WHERE POSITION('@gmail.com' IN email) > 0;Dialect Comparison: Position/Index
- Find position:
- PostgreSQL uses
POSITION(sub IN s) - MySQL uses
LOCATE(sub, s) - MSSQL uses
CHARINDEX(sub, s) - Oracle uses
INSTR(s, sub)
- PostgreSQL uses
Extracting Parts with POSITION + SUBSTRING
Combine POSITION and SUBSTRING to extract dynamic portions:
-- Extract domain from email (everything after @)
SELECT
email,
SUBSTRING(email, POSITION('@' IN email) + 1) AS domain
FROM users;Combining String Functions
Real data cleaning chains multiple functions. Here’s a pattern we use constantly:
Create a clean_name column (uppercase trimmed first + last name), a clean_dept column (lowercase trimmed department), and sort by name_length descending.
Key Takeaways
- Use
||orCONCAT()to combine strings (watchNULLbehavior — it’s a silent killer) SUBSTRINGextracts parts of strings by position (1-indexed, not 0!)UPPER/LOWERstandardize case — use before comparing or joining on textTRIMremoves leading/trailing whitespace — suspect it whenever matches failLENGTHreturns string length (LENin MSSQL, and watch trailing spaces)POSITIONfinds where substrings appear — combine withSUBSTRINGfor extraction
What’s Next
These core functions handle 80% of string work. But sometimes you need heavier machinery: splitting comma-separated values into rows, pattern matching with regular expressions, and extracting the Nth element from delimited data. That’s next.