Module 5: Dates, Strings & Logic40 min

Pattern Matching and String Splitting

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Splitting Strings

Here’s a scenario you’ll encounter often: someone stored categories as "Electronics;Gadgets;Sale" in a single column. Or addresses as "City, State ZIP". You need the second element, or all elements as separate rows. .str.split() handles both — it’s the pandas equivalent of SQL’s SPLIT_PART, but more flexible because you can split into columns or explode into rows.

The .str.split() method splits by a delimiter and returns a list per row. Use expand=True to get separate columns:

Python
# Split into a list
df["tags"].str.split(",")

# Split into separate columns
df["tags"].str.split(",", expand=True)

# Get the Nth element
df["tags"].str.split(",").str[1]  # second element
Python
yelp_business["categories"].str.split(";").str[0].head(10)

Exploding Lists into Rows

When a column contains comma-separated values, .str.split() + .explode() turns each value into its own row:

Python
# One row per category
yelp_business["categories"].str.split(";").explode()

One Row Per Value

1
Explode Categories into Rows

Split the categories column by semicolon and explode into separate rows. The starter splits — add `.explode()`.

Tables: yelp_business
.str.strip() After Split

Splitting "a, b, c" by "," gives ["a", " b", " c"] with leading spaces. Chain .str.strip() after .explode() to clean them.

Regex with .str Methods

When patterns get complex — "extract all digits", "find anything that looks like an email", "remove all non-alphanumeric characters" — regular expressions are the answer. They’re intimidating at first, but pandas makes them easy: .str.contains(), .str.extract(), and .str.replace() all accept regex patterns natively. You don’t need to import re for most cases.

The .str accessor supports regex natively:

Python
# Check if matches pattern
df["email"].str.contains(r"@gmail\.com$", regex=True)

# Extract pattern groups
df["email"].str.extract(r"@(.+)$")  # domain

# Replace with regex
df["phone"].str.replace(r"[^0-9]", "", regex=True)

Reviews of Categories

Table: yelp_business
business_idnameneighborhoodaddresscitystatepostal_codelatitudelongitudestarsreview_countis_opencategories
G5ERFWvPfHy7IDAUYlWL2AAll Colors Mobile Bumper Repair7137 N 28th AvePhoenixAZ8505133.45-112.07141Auto Detailing;Automotive
0jDvRJS-z9zdMgOUXgr6rASunfare811 W Deer Valley RdPhoenixAZ8502733.68-112.085271Personal Chefs;Food;Gluten-Free;Food Delivery Services;Event Planning & Services;Restaurants
6HmDqeNNZtHMK0t2glF_ggDry Clean VegasSoutheast2550 Windmill Ln, Ste 100Las VegasNV8912336.04-115.12141Dry Cleaning & Laundry;Laundry Services;Local Services;Dry Cleaning
pbt3SBcEmxCfZPdnmU9tNAThe Cuyahoga Room740 Munroe Falls AveCuyahoga FallsOH4422141.14-81.47130Wedding Planning;Caterers;Event Planning & Services;Venues & Event Spaces
CX8pfLn7Bk9o2-8yDMp_2wThe UPS Store4815 E Carefree Hwy, Ste 108Cave CreekAZ8533133.8-111.981.551Notaries;Printing Services;Local Services;Shipping Centers
2
Reviews of Categories
View solution

Calculate number of reviews for every business category. Output the category along with the total number of reviews. Order by total reviews in descending order.

Tables: yelp_business

Count Occurrences Of Words In Drafts

Table: google_file_store
filenamecontents
draft1.txtThe stock exchange predicts a bull market which would make many investors happy.
draft2.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.
3
Count Occurrences Of Words In Drafts
View solution

Find the number of times each word appears in the `contents` column across all rows in the `google_file_store` dataset. Output two columns: `word` and `occurrences`.

Tables: google_file_store

Counting Instances in Text

Table: google_file_store
filenamecontents
draft1.txtThe stock exchange predicts a bull market which would make many investors happy.
draft2.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txtThe stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.
4
Counting Instances in Text
View solution

Find the number of times the exact words `bull` and `bear` appear in the `contents` column. Count all occurrences, even if they appear multiple times within the same row. Matches should be case-insensitive and only count exact words, that is, exclude substrings like `bullish` or `bearing`. Output the word (`bull` or `bear`) and the corresponding number of occurrences.

Tables: google_file_store

Key Takeaways

  • .str.split(delim) splits strings; .str.split(delim).str[n] gets the Nth element.
  • .str.split().explode() turns comma-separated values into rows.
  • .str.strip() after splitting to clean whitespace.
  • .str.contains(), .str.extract(), .str.replace() all support regex.
  • Use regex=True explicitly for clarity.

What’s Next

You now have the full string toolkit. The final lesson combines date extraction, string cleanup, and conditional logic to solve realistic multi-technique problems.