Module 5: Date, Time & Text Functions45 min

Pattern Matching and String Splitting

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Advanced String Operations

Here’s a scenario we see constantly: someone stored categories as ‘Electronics,Gadgets,Sale’ in a single column. Or addresses as ‘City, State ZIP’. Basic SUBSTRING won’t cut it when you need the 2nd element from a comma-separated list, or when the position varies.

This lesson covers the power tools: splitting delimited strings, expanding them into rows, and pattern matching with regular expressions.

PostgreSQL
SELECT
  SPLIT_PART('apple,banana,cherry', ',', 1) AS first_item,
  SPLIT_PART('apple,banana,cherry', ',', 2) AS second_item,
  SPLIT_PART('user@company.com', '@', 2) AS domain,
  SPLIT_PART('New York, NY, 10001', ',', 1) AS city;

Compare this to the POSITION + SUBSTRING approach from last lesson. SPLIT_PART is much cleaner when you’re dealing with delimited data.

SPLIT_PART is PostgreSQL-specific

SPLIT_PART is PostgreSQL-specific. If you’re interviewing somewhere that uses MySQL or MSSQL, know the equivalents below.

Dialect Comparison: String Splitting

  • Split and get Nth part:
    • PostgreSQL — SPLIT_PART(s, ',', n)
    • MySQL — SUBSTRING_INDEX(s, ',', n)
    • MSSQL uses value from STRING_SPLIT()
    • Oracle — REGEXP_SUBSTR(s, '[^,]+', 1, n)
  • Split to rows:
    • PostgreSQL — UNNEST(STRING_TO_ARRAY())
    • MySQL — JSON_TABLE (MySQL 8)
    • MSSQL — STRING_SPLIT()
    • Oracle uses custom function
Oracle regex pattern

The Oracle example uses [^,]+ which means “one or more characters that aren’t commas.” Replace the comma with whatever delimiter you’re splitting on. MSSQL’s STRING_SPLIT returns a table, so you SELECT from it.

PostgreSQL
SELECT UNNEST(STRING_TO_ARRAY('apple,banana,cherry', ','));

Real-World Use Case: Denormalized Categories

Here’s why this matters. Say you have products with categories stored as comma-separated values (bad schema design, but incredibly common):

PostgreSQL
-- Product categories: 'Electronics,Gadgets,Sale'
SELECT
  product_id,
  product_name,
  UNNEST(STRING_TO_ARRAY(categories, ',')) AS category
FROM products;

-- Now you can properly aggregate by category
SELECT
  UNNEST(STRING_TO_ARRAY(categories, ',')) AS category,
  COUNT(*) AS product_count
FROM products
GROUP BY UNNEST(STRING_TO_ARRAY(categories, ','));

This transforms one row with multiple categories into multiple rows, one per category. Essential for proper aggregation when someone’s stored denormalized data.

Splitting Categories into Rows

If you run in PostgreSQL, use STRING_TO_ARRAY and UNNEST to split category strings into rows, then aggregate reviews by category.

Table: yelp_business
business_idnameneighborhoodaddresscitystatepostal_codelatitudelongitudestarsreview_countis_opencategories
G5ERFWvPfHy7IDAUYlWL2AAll Colors Mobile Bumper Repair7137 N 28th AvePhoenixAZ8505133.45-112.07141Auto Detailing;Automotive
0jDvRJS-z9zdMgOUXgr6rASunfare811 W Deer Valley RdPhoenixAZ8502733.68-112.085271Personal Chefs;Food;Gluten-Free;Food Delivery Services;Event Planning & Services;Restaurants
6HmDqeNNZtHMK0t2glF_ggDry Clean VegasSoutheast2550 Windmill Ln, Ste 100Las VegasNV8912336.04-115.12141Dry Cleaning & Laundry;Laundry Services;Local Services;Dry Cleaning
pbt3SBcEmxCfZPdnmU9tNAThe Cuyahoga Room740 Munroe Falls AveCuyahoga FallsOH4422141.14-81.47130Wedding Planning;Caterers;Event Planning & Services;Venues & Event Spaces
CX8pfLn7Bk9o2-8yDMp_2wThe UPS Store4815 E Carefree Hwy, Ste 108Cave CreekAZ8533133.8-111.981.551Notaries;Printing Services;Local Services;Shipping Centers
1
Reviews of Categories
View solution

Calculate number of reviews for every business category. Output the category along with the total number of reviews. Order by total reviews in descending order.

Tables: yelp_business
2
Split an Address into Parts

The city is already extracted. Add the `state` and `zip` columns using the same splitting pattern.

Tables: yelp_business

Regular Expressions

When patterns get complex — “extract all numbers”, “find anything that looks like an email”, “replace all non-alphanumeric characters” — regular expressions are the answer. They’re intimidating at first, but incredibly powerful.

Pattern Matching with ~

PostgreSQL uses ~ for regex matching:

PostgreSQL
-- ~ for case-sensitive regex match
SELECT *
FROM products
WHERE product_name ~ '^[A-Z]';
-- Starts with uppercase letter

-- ~* for case-insensitive match
SELECT *
FROM users
WHERE email ~* '@gmail.com$';
-- Ends with @gmail.com

REGEXP_MATCHES: Extracting Matches

PostgreSQL
-- Extract all numbers from a string
SELECT REGEXP_MATCHES('Order #123 for $45.99', '\d+', 'g');
-- Returns: {123}, {45}, {99}

REGEXP_REPLACE: Find and Replace

This is the go-to for cleaning messy data:

PostgreSQL
-- Remove all non-numeric characters from phone numbers
SELECT REGEXP_REPLACE('Phone: (555) 123-4567', '[^0-9]', '', 'g');
-- Returns: '5551234567'

-- Standardize phone numbers across your entire table
SELECT
  REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS clean_phone
FROM customers;
Don’t forget the g flag

The ‘g’ flag means “global” — replace ALL matches, not just the first one. Forgetting this is a common bug.

Dialect Comparison: Regex Support

  • Regex match:
    • PostgreSQL — ~ or ~*
    • MySQL — REGEXP or RLIKE
    • MSSQL has no native support
    • Oracle — REGEXP_LIKE()
  • Regex replace:
    • PostgreSQL — REGEXP_REPLACE()
    • MySQL — REGEXP_REPLACE()
    • MSSQL has no native support
    • Oracle — REGEXP_REPLACE()
  • Extract match:
    • PostgreSQL — REGEXP_MATCHES()
    • MySQL — REGEXP_SUBSTR()
    • MSSQL has no native support
    • Oracle — REGEXP_SUBSTR()
MSSQL has no native regex

MSSQL has no native regex support. Your options: CLR functions (requires DBA help), LIKE patterns for simple cases, or handle regex in application code. If you’re interviewing at a Microsoft shop, be prepared to work around this.

Counting Word Frequencies

Use REGEXP_SPLIT_TO_TABLE or string splitting with LOWER() to count word frequencies.

Table: google_file_store
filenamecontents
draft1.txtThe stock exchange predicts a bull market which would make many investors happy.
draft2.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.
3
Count Occurrences Of Words In Drafts
View solution

Find the number of times each word appears in the `contents` column across all rows in the `google_file_store` dataset. Output two columns: `word` and `occurrences`.

Tables: google_file_store

Practical Regex Patterns

Regex syntax is its own language. Here are the patterns that cover 90% of real-world cases:

  • ^pattern — Starts with (anchor to beginning)
  • pattern$ — Ends with (anchor to end)
  • \d+ — One or more digits
  • \w+ — One or more “word” characters (letters, numbers, underscore)
  • [A-Z] — Any uppercase letter
  • [^0-9] — Any character that’s NOT a digit (^ inside brackets = negation)
  • .* — Any characters, any number of times (greedy)
  • \. — Literal period (backslash escapes special characters)

Finding Patterns in File Names

Use regex functions to find and count specific patterns within text columns.

Table: google_file_store
filenamecontents
draft1.txtThe stock exchange predicts a bull market which would make many investors happy.
draft2.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.
4
Counting Instances in Text
View solution

Find the number of times the exact words `bull` and `bear` appear in the `contents` column. Count all occurrences, even if they appear multiple times within the same row. Matches should be case-insensitive and only count exact words, that is, exclude substrings like `bullish` or `bearing`. Output the word (`bull` or `bear`) and the corresponding number of occurrences.

Tables: google_file_store

When to Use What

Here’s the decision tree after years of production SQL:

  • Fixed position extraction → SUBSTRING (you know exactly where it is)
  • Known delimiter, need Nth element → SPLIT_PART (cleaner than POSITION + SUBSTRING)
  • Need to expand delimited values to rows → STRING_TO_ARRAY + UNNEST
  • Complex patterns or cleaning → Regular expressions
  • Simple starts/ends with → LIKE is faster and more portable than regex

Key Takeaways

  • SPLIT_PART extracts the Nth piece from delimited strings — cleaner than POSITION + SUBSTRING
  • STRING_TO_ARRAY + UNNEST transforms comma-separated values into proper rows
  • Regular expressions handle complex pattern matching and text cleaning
  • REGEXP_REPLACE with ‘g’ flag replaces ALL matches (don’t forget the flag!)
  • MSSQL lacks native regex — know your workarounds if targeting Microsoft environments

What’s Next

You now have the full string manipulation toolkit. The final lesson combines everything: date functions, string functions, and the CASE WHEN logic you learned in Module 2. We’ll tackle real-world data transformation scenarios that combine all these techniques.