Pattern Matching and String Splitting
Progress Tracking
Log in to save this lesson and continue from where you left off.
Splitting Strings
Here’s a scenario you’ll encounter often: someone stored categories as "Electronics;Gadgets;Sale" in a single column. Or addresses as "City, State ZIP". You need the second element, or all elements as separate rows. .str.split() handles both — it’s the pandas equivalent of SQL’s SPLIT_PART, but more flexible because you can split into columns or explode into rows.
The .str.split() method splits by a delimiter and returns a list per row. Use expand=True to get separate columns:
# Split into a list
df["tags"].str.split(",")
# Split into separate columns
df["tags"].str.split(",", expand=True)
# Get the Nth element
df["tags"].str.split(",").str[1] # second elementyelp_business["categories"].str.split(";").str[0].head(10)Exploding Lists into Rows
When a column contains comma-separated values, .str.split() + .explode() turns each value into its own row:
# One row per category
yelp_business["categories"].str.split(";").explode()One Row Per Value
Split the categories column by semicolon and explode into separate rows. The starter splits — add `.explode()`.
Splitting "a, b, c" by "," gives ["a", " b", " c"] with leading spaces. Chain .str.strip() after .explode() to clean them.
Regex with .str Methods
When patterns get complex — "extract all digits", "find anything that looks like an email", "remove all non-alphanumeric characters" — regular expressions are the answer. They’re intimidating at first, but pandas makes them easy: .str.contains(), .str.extract(), and .str.replace() all accept regex patterns natively. You don’t need to import re for most cases.
The .str accessor supports regex natively:
# Check if matches pattern
df["email"].str.contains(r"@gmail\.com$", regex=True)
# Extract pattern groups
df["email"].str.extract(r"@(.+)$") # domain
# Replace with regex
df["phone"].str.replace(r"[^0-9]", "", regex=True)Reviews of Categories
| business_id | name | neighborhood | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | categories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G5ERFWvPfHy7IDAUYlWL2A | All Colors Mobile Bumper Repair | 7137 N 28th Ave | Phoenix | AZ | 85051 | 33.45 | -112.07 | 1 | 4 | 1 | Auto Detailing;Automotive | |
| 0jDvRJS-z9zdMgOUXgr6rA | Sunfare | 811 W Deer Valley Rd | Phoenix | AZ | 85027 | 33.68 | -112.08 | 5 | 27 | 1 | Personal Chefs;Food;Gluten-Free;Food Delivery Services;Event Planning & Services;Restaurants | |
| 6HmDqeNNZtHMK0t2glF_gg | Dry Clean Vegas | Southeast | 2550 Windmill Ln, Ste 100 | Las Vegas | NV | 89123 | 36.04 | -115.12 | 1 | 4 | 1 | Dry Cleaning & Laundry;Laundry Services;Local Services;Dry Cleaning |
| pbt3SBcEmxCfZPdnmU9tNA | The Cuyahoga Room | 740 Munroe Falls Ave | Cuyahoga Falls | OH | 44221 | 41.14 | -81.47 | 1 | 3 | 0 | Wedding Planning;Caterers;Event Planning & Services;Venues & Event Spaces | |
| CX8pfLn7Bk9o2-8yDMp_2w | The UPS Store | 4815 E Carefree Hwy, Ste 108 | Cave Creek | AZ | 85331 | 33.8 | -111.98 | 1.5 | 5 | 1 | Notaries;Printing Services;Local Services;Shipping Centers |
Calculate number of reviews for every business category. Output the category along with the total number of reviews. Order by total reviews in descending order.
Count Occurrences Of Words In Drafts
| filename | contents |
|---|---|
| draft1.txt | The stock exchange predicts a bull market which would make many investors happy. |
| draft2.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. |
| final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
Find the number of times each word appears in the `contents` column across all rows in the `google_file_store` dataset. Output two columns: `word` and `occurrences`.
Counting Instances in Text
| filename | contents |
|---|---|
| draft1.txt | The stock exchange predicts a bull market which would make many investors happy. |
| draft2.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. |
| final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
Find the number of times the exact words `bull` and `bear` appear in the `contents` column. Count all occurrences, even if they appear multiple times within the same row. Matches should be case-insensitive and only count exact words, that is, exclude substrings like `bullish` or `bearing`. Output the word (`bull` or `bear`) and the corresponding number of occurrences.
Key Takeaways
.str.split(delim)splits strings;.str.split(delim).str[n]gets the Nth element..str.split().explode()turns comma-separated values into rows..str.strip()after splitting to clean whitespace..str.contains(),.str.extract(),.str.replace()all support regex.- Use
regex=Trueexplicitly for clarity.
What’s Next
You now have the full string toolkit. The final lesson combines date extraction, string cleanup, and conditional logic to solve realistic multi-technique problems.