Module 5: Date, Time & Text Functions•45 min

Pattern Matching and String Splitting

Progress Tracking

Advanced String Operations

Here’s a scenario we see constantly: someone stored categories as ‘Electronics,Gadgets,Sale’ in a single column. Or addresses as ‘City, State ZIP’. Basic SUBSTRING won’t cut it when you need the 2nd element from a comma-separated list, or when the position varies.

This lesson covers the power tools: splitting delimited strings, expanding them into rows, and pattern matching with regular expressions.

PostgreSQL

SELECT
  SPLIT_PART('apple,banana,cherry', ',', 1) AS first_item,
  SPLIT_PART('apple,banana,cherry', ',', 2) AS second_item,
  SPLIT_PART('user@company.com', '@', 2) AS domain,
  SPLIT_PART('New York, NY, 10001', ',', 1) AS city;

Compare this to the POSITION + SUBSTRING approach from last lesson. SPLIT_PART is much cleaner when you’re dealing with delimited data.

SPLIT_PART is PostgreSQL-specific

SPLIT_PART is PostgreSQL-specific. If you’re interviewing somewhere that uses MySQL or MSSQL, know the equivalents below.

Dialect Comparison: String Splitting

Split and get Nth part:

PostgreSQL — SPLIT_PART(s, ',', n)
MySQL — SUBSTRING_INDEX(s, ',', n)
MSSQL uses value from STRING_SPLIT()
Oracle — REGEXP_SUBSTR(s, '[^,]+', 1, n)

Split to rows:

PostgreSQL — UNNEST(STRING_TO_ARRAY())
MySQL — JSON_TABLE (MySQL 8)
MSSQL — CROSS APPLY STRING_SPLIT()
Oracle uses custom function

Oracle regex pattern

The Oracle example uses [^,]+ which means “one or more characters that aren’t commas.” Replace the comma with whatever delimiter you’re splitting on. MSSQL’s STRING_SPLIT is a table-valued function — apply it per row with CROSS APPLY: SELECT p.product_id, s.value AS category FROM products p CROSS APPLY STRING_SPLIT(p.categories, ',') s;

PostgreSQL

SELECT UNNEST(STRING_TO_ARRAY('apple,banana,cherry', ','));

Real-World Use Case: Denormalized Categories

Here’s why this matters. Say you have products with categories stored as comma-separated values (bad schema design, but incredibly common):

PostgreSQL

-- Product categories: 'Electronics,Gadgets,Sale'
SELECT
  product_id,
  product_name,
  UNNEST(STRING_TO_ARRAY(categories, ',')) AS category
FROM products;

-- Now you can properly aggregate by category
SELECT
  UNNEST(STRING_TO_ARRAY(categories, ',')) AS category,
  COUNT(*) AS product_count
FROM products
GROUP BY UNNEST(STRING_TO_ARRAY(categories, ','));

This transforms one row with multiple categories into multiple rows, one per category. Essential for proper aggregation when someone’s stored denormalized data.

Splitting Categories into Rows

If you run in PostgreSQL, use STRING_TO_ARRAY and UNNEST to split category strings into rows, then aggregate reviews by category.

Table: yelp_business

business_id	name	neighborhood	address	city	state	postal_code	latitude	longitude	stars	review_count	is_open	categories
G5ERFWvPfHy7IDAUYlWL2A	All Colors Mobile Bumper Repair		7137 N 28th Ave	Phoenix	AZ	85051	33.45	-112.07	1	4	1	Auto Detailing;Automotive
0jDvRJS-z9zdMgOUXgr6rA	Sunfare		811 W Deer Valley Rd	Phoenix	AZ	85027	33.68	-112.08	5	27	1	Personal Chefs;Food;Gluten-Free;Food Delivery Services;Event Planning & Services;Restaurants
6HmDqeNNZtHMK0t2glF_gg	Dry Clean Vegas	Southeast	2550 Windmill Ln, Ste 100	Las Vegas	NV	89123	36.04	-115.12	1	4	1	Dry Cleaning & Laundry;Laundry Services;Local Services;Dry Cleaning
pbt3SBcEmxCfZPdnmU9tNA	The Cuyahoga Room		740 Munroe Falls Ave	Cuyahoga Falls	OH	44221	41.14	-81.47	1	3	0	Wedding Planning;Caterers;Event Planning & Services;Venues & Event Spaces
CX8pfLn7Bk9o2-8yDMp_2w	The UPS Store		4815 E Carefree Hwy, Ste 108	Cave Creek	AZ	85331	33.8	-111.98	1.5	5	1	Notaries;Printing Services;Local Services;Shipping Centers

Tables: yelp_business

Regular Expressions

When patterns get complex — “extract all numbers”, “find anything that looks like an email”, “replace all non-alphanumeric characters” — regular expressions are the answer. They’re intimidating at first, but incredibly powerful.

Pattern Matching with ~

PostgreSQL uses ~ for regex matching:

PostgreSQL

-- ~ for case-sensitive regex match
SELECT *
FROM products
WHERE product_name ~ '^[A-Z]';
-- Starts with uppercase letter

-- ~* for case-insensitive match
SELECT *
FROM users
WHERE email ~* '@gmail.com$';
-- Ends with @gmail.com

REGEXP_MATCHES: Extracting Matches

PostgreSQL

-- Extract all numbers from a string
SELECT REGEXP_MATCHES('Order #123 for $45.99', '\d+', 'g');
-- Returns: {123}, {45}, {99}

REGEXP_REPLACE: Find and Replace

This is the go-to for cleaning messy data:

PostgreSQL

-- Remove all non-numeric characters from phone numbers
SELECT REGEXP_REPLACE('Phone: (555) 123-4567', '[^0-9]', '', 'g');
-- Returns: '5551234567'

-- Standardize phone numbers across your entire table
SELECT
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS clean_phone
FROM customers;

Don’t forget the g flag

The ‘g’ flag means “global” — replace ALL matches, not just the first one. Forgetting this is a common bug.

Dialect Comparison: Regex Support

Regex match:

PostgreSQL — ~ or ~*
MySQL — REGEXP or RLIKE
MSSQL has no native support
Oracle — REGEXP_LIKE()

Regex replace:

PostgreSQL — REGEXP_REPLACE()
MySQL — REGEXP_REPLACE()
MSSQL has no native support
Oracle — REGEXP_REPLACE()

Extract match:

PostgreSQL — REGEXP_MATCHES()
MySQL — REGEXP_SUBSTR()
MSSQL has no native support
Oracle — REGEXP_SUBSTR()

MSSQL has no native regex

MSSQL has no native regex support. Your options: CLR functions (requires DBA help), LIKE patterns for simple cases, or handle regex in application code. If you’re interviewing at a Microsoft shop, be prepared to work around this.

Counting Word Frequencies

Use REGEXP_SPLIT_TO_TABLE or string splitting with LOWER() to count word frequencies.

Table: google_file_store

filename	contents
draft1.txt	The stock exchange predicts a bull market which would make many investors happy.
draft2.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.

Tables: google_file_store

Practical Regex Patterns

Regex syntax is its own language. Here are the patterns that cover 90% of real-world cases:

^pattern — Starts with (anchor to beginning)
pattern$ — Ends with (anchor to end)
\d+ — One or more digits
\w+ — One or more “word” characters (letters, numbers, underscore)
[A-Z] — Any uppercase letter
[^0-9] — Any character that’s NOT a digit (^ inside brackets = negation)
.* — Any characters, any number of times (greedy)
\. — Literal period (backslash escapes special characters)

Finding Patterns in File Names

Use regex functions to find and count specific patterns within text columns.

Table: google_file_store

filename	contents
draft1.txt	The stock exchange predicts a bull market which would make many investors happy.
draft2.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.

Tables: google_file_store

When to Use What

Here’s the decision tree after years of production SQL:

Fixed position extraction → SUBSTRING (you know exactly where it is)
Known delimiter, need Nth element → SPLIT_PART (cleaner than POSITION + SUBSTRING)
Need to expand delimited values to rows → STRING_TO_ARRAY + UNNEST
Complex patterns or cleaning → Regular expressions
Simple starts/ends with → LIKE is faster and more portable than regex

Key Takeaways

SPLIT_PART extracts the Nth piece from delimited strings — cleaner than POSITION + SUBSTRING
STRING_TO_ARRAY + UNNEST transforms comma-separated values into proper rows
Regular expressions handle complex pattern matching and text cleaning
REGEXP_REPLACE with ‘g’ flag replaces ALL matches (don’t forget the flag!)
MSSQL lacks native regex — know your workarounds if targeting Microsoft environments

What’s Next

You now have the full string manipulation toolkit. The final lesson combines everything: date functions, string functions, and the CASE WHEN logic you learned in Module 2. We’ll tackle real-world data transformation scenarios that combine all these techniques.

Next upContinue →

Combining Functions

45 min