Pattern Matching and String Splitting
Progress Tracking
Log in to save this lesson and continue from where you left off.
Advanced String Operations
Here’s a scenario we see constantly: someone stored categories as ‘Electronics,Gadgets,Sale’ in a single column. Or addresses as ‘City, State ZIP’. Basic SUBSTRING won’t cut it when you need the 2nd element from a comma-separated list, or when the position varies.
This lesson covers the power tools: splitting delimited strings, expanding them into rows, and pattern matching with regular expressions.
SELECT
SPLIT_PART('apple,banana,cherry', ',', 1) AS first_item,
SPLIT_PART('apple,banana,cherry', ',', 2) AS second_item,
SPLIT_PART('user@company.com', '@', 2) AS domain,
SPLIT_PART('New York, NY, 10001', ',', 1) AS city;Compare this to the POSITION + SUBSTRING approach from last lesson. SPLIT_PART is much cleaner when you’re dealing with delimited data.
SPLIT_PART is PostgreSQL-specific. If you’re interviewing somewhere that uses MySQL or MSSQL, know the equivalents below.
Dialect Comparison: String Splitting
- Split and get Nth part:
- PostgreSQL —
SPLIT_PART(s, ',', n) - MySQL —
SUBSTRING_INDEX(s, ',', n) - MSSQL uses value from
STRING_SPLIT() - Oracle —
REGEXP_SUBSTR(s, '[^,]+', 1, n)
- PostgreSQL —
- Split to rows:
- PostgreSQL —
UNNEST(STRING_TO_ARRAY()) - MySQL —
JSON_TABLE(MySQL 8) - MSSQL —
STRING_SPLIT() - Oracle uses custom function
- PostgreSQL —
The Oracle example uses [^,]+ which means “one or more characters that aren’t commas.” Replace the comma with whatever delimiter you’re splitting on. MSSQL’s STRING_SPLIT returns a table, so you SELECT from it.
SELECT UNNEST(STRING_TO_ARRAY('apple,banana,cherry', ','));Real-World Use Case: Denormalized Categories
Here’s why this matters. Say you have products with categories stored as comma-separated values (bad schema design, but incredibly common):
-- Product categories: 'Electronics,Gadgets,Sale'
SELECT
product_id,
product_name,
UNNEST(STRING_TO_ARRAY(categories, ',')) AS category
FROM products;
-- Now you can properly aggregate by category
SELECT
UNNEST(STRING_TO_ARRAY(categories, ',')) AS category,
COUNT(*) AS product_count
FROM products
GROUP BY UNNEST(STRING_TO_ARRAY(categories, ','));This transforms one row with multiple categories into multiple rows, one per category. Essential for proper aggregation when someone’s stored denormalized data.
Splitting Categories into Rows
If you run in PostgreSQL, use STRING_TO_ARRAY and UNNEST to split category strings into rows, then aggregate reviews by category.
| business_id | name | neighborhood | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G5ERFWvPfHy7IDAUYlWL2A | All Colors Mobile Bumper Repair | 7137 N 28th Ave | Phoenix | AZ | 85051 | 33.45 | -112.07 | 1 | 4 | 1 | Auto Detailing;Automotive | |
| 0jDvRJS-z9zdMgOUXgr6rA | Sunfare | 811 W Deer Valley Rd | Phoenix | AZ | 85027 | 33.68 | -112.08 | 5 | 27 | 1 | Personal Chefs;Food;Gluten-Free;Food Delivery Services;Event Planning & Services;Restaurants | |
| 6HmDqeNNZtHMK0t2glF_gg | Dry Clean Vegas | Southeast | 2550 Windmill Ln, Ste 100 | Las Vegas | NV | 89123 | 36.04 | -115.12 | 1 | 4 | 1 | Dry Cleaning & Laundry;Laundry Services;Local Services;Dry Cleaning |
| pbt3SBcEmxCfZPdnmU9tNA | The Cuyahoga Room | 740 Munroe Falls Ave | Cuyahoga Falls | OH | 44221 | 41.14 | -81.47 | 1 | 3 | 0 | Wedding Planning;Caterers;Event Planning & Services;Venues & Event Spaces | |
| CX8pfLn7Bk9o2-8yDMp_2w | The UPS Store | 4815 E Carefree Hwy, Ste 108 | Cave Creek | AZ | 85331 | 33.8 | -111.98 | 1.5 | 5 | 1 | Notaries;Printing Services;Local Services;Shipping Centers |
Calculate number of reviews for every business category. Output the category along with the total number of reviews. Order by total reviews in descending order.
The city is already extracted. Add the `state` and `zip` columns using the same splitting pattern.
Regular Expressions
When patterns get complex — “extract all numbers”, “find anything that looks like an email”, “replace all non-alphanumeric characters” — regular expressions are the answer. They’re intimidating at first, but incredibly powerful.
Pattern Matching with ~
PostgreSQL uses ~ for regex matching:
-- ~ for case-sensitive regex match
SELECT *
FROM products
WHERE product_name ~ '^[A-Z]';
-- Starts with uppercase letter
-- ~* for case-insensitive match
SELECT *
FROM users
WHERE email ~* '@gmail.com$';
-- Ends with @gmail.comREGEXP_MATCHES: Extracting Matches
-- Extract all numbers from a string
SELECT REGEXP_MATCHES('Order #123 for $45.99', '\d+', 'g');
-- Returns: {123}, {45}, {99}REGEXP_REPLACE: Find and Replace
This is the go-to for cleaning messy data:
-- Remove all non-numeric characters from phone numbers
SELECT REGEXP_REPLACE('Phone: (555) 123-4567', '[^0-9]', '', 'g');
-- Returns: '5551234567'
-- Standardize phone numbers across your entire table
SELECT
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS clean_phone
FROM customers;The ‘g’ flag means “global” — replace ALL matches, not just the first one. Forgetting this is a common bug.
Dialect Comparison: Regex Support
- Regex match:
- PostgreSQL —
~or~* - MySQL —
REGEXPorRLIKE - MSSQL has no native support
- Oracle —
REGEXP_LIKE()
- PostgreSQL —
- Regex replace:
- PostgreSQL —
REGEXP_REPLACE() - MySQL —
REGEXP_REPLACE() - MSSQL has no native support
- Oracle —
REGEXP_REPLACE()
- PostgreSQL —
- Extract match:
- PostgreSQL —
REGEXP_MATCHES() - MySQL —
REGEXP_SUBSTR() - MSSQL has no native support
- Oracle —
REGEXP_SUBSTR()
- PostgreSQL —
MSSQL has no native regex support. Your options: CLR functions (requires DBA help), LIKE patterns for simple cases, or handle regex in application code. If you’re interviewing at a Microsoft shop, be prepared to work around this.
Counting Word Frequencies
Use REGEXP_SPLIT_TO_TABLE or string splitting with LOWER() to count word frequencies.
| filename | contents |
|---|---|
| draft1.txt | The stock exchange predicts a bull market which would make many investors happy. |
| draft2.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. |
| final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
Find the number of times each word appears in the `contents` column across all rows in the `google_file_store` dataset. Output two columns: `word` and `occurrences`.
Practical Regex Patterns
Regex syntax is its own language. Here are the patterns that cover 90% of real-world cases:
^pattern— Starts with (anchor to beginning)pattern$— Ends with (anchor to end)\d+— One or more digits\w+— One or more “word” characters (letters, numbers, underscore)[A-Z]— Any uppercase letter[^0-9]— Any character that’s NOT a digit (^ inside brackets = negation).*— Any characters, any number of times (greedy)\.— Literal period (backslash escapes special characters)
Finding Patterns in File Names
Use regex functions to find and count specific patterns within text columns.
| filename | contents |
|---|---|
| draft1.txt | The stock exchange predicts a bull market which would make many investors happy. |
| draft2.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. |
| final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
Find the number of times the exact words `bull` and `bear` appear in the `contents` column. Count all occurrences, even if they appear multiple times within the same row. Matches should be case-insensitive and only count exact words, that is, exclude substrings like `bullish` or `bearing`. Output the word (`bull` or `bear`) and the corresponding number of occurrences.
When to Use What
Here’s the decision tree after years of production SQL:
- Fixed position extraction →
SUBSTRING(you know exactly where it is) - Known delimiter, need Nth element →
SPLIT_PART(cleaner thanPOSITION+SUBSTRING) - Need to expand delimited values to rows →
STRING_TO_ARRAY+UNNEST - Complex patterns or cleaning → Regular expressions
- Simple starts/ends with →
LIKEis faster and more portable than regex
Key Takeaways
SPLIT_PARTextracts the Nth piece from delimited strings — cleaner thanPOSITION+SUBSTRINGSTRING_TO_ARRAY+UNNESTtransforms comma-separated values into proper rows- Regular expressions handle complex pattern matching and text cleaning
REGEXP_REPLACEwith ‘g’ flag replaces ALL matches (don’t forget the flag!)- MSSQL lacks native regex — know your workarounds if targeting Microsoft environments
What’s Next
You now have the full string manipulation toolkit. The final lesson combines everything: date functions, string functions, and the CASE WHEN logic you learned in Module 2. We’ll tackle real-world data transformation scenarios that combine all these techniques.