Working with Missing Data
Progress Tracking
Log in to save this lesson and continue from where you left off.
The Missing Data Problem
Missing values cause bugs, return unexpected results, and trip up even experienced analysts. But once you understand how pandas handles them, you can avoid the pitfalls.
What NaN Actually Means
NaN stands for "Not a Number," but it really means "missing" or "unknown." It’s not zero. It’s not an empty string. It’s the absence of a value.
Think of it this way:
0— the value is zero""(empty string) — the value is blank textNaN— we don’t know what the value is
This distinction matters because NaN behaves differently from actual values.
In pandas, both NaN (from NumPy) and None (Python’s built-in) represent missing data. Pandas treats them interchangeably in most situations. You’ll see both, and the same methods handle both.
A Real Example
Let’s look at the fintech_app_users table:
| user_id | user_name | phone_number |
|---|---|---|
| 1 | Emma Johnson | 8994954670 |
| 2 | Noah Williams | 7848177904 |
| 3 | Olivia Garcia | 921-171-1750 |
| 4 | Liam Martinez | |
| 5 | Ava Rodriguez |
Some users haven’t provided their phone numbers — those values show up as NaN. We don’t know their phone numbers. Maybe they never entered one. Maybe the data was lost. The point is: we don’t have a value.
Finding Missing Values: .isna()
Here’s the first surprise: you can’t use == to check for NaN. NaN == NaN evaluates to False in Python, so equality checks won’t find missing values. Instead, use .isna():
fintech_app_users[fintech_app_users["phone_number"].isna()].isna() returns True for every row where the value is missing, and False everywhere else.
Users Missing Phone Numbers
| user_id | user_name | phone_number |
|---|---|---|
| 1 | Emma Johnson | 8994954670 |
| 2 | Noah Williams | 7848177904 |
| 3 | Olivia Garcia | 921-171-1750 |
| 4 | Liam Martinez | |
| 5 | Ava Rodriguez |
The product team is launching a new WhatsApp notification feature and needs to identify users who haven't provided their phone numbers yet. These users will be shown a prompt to add their contact information. Find all users who have not provided a phone number. Return the `user ID` and `name`.
Excluding Missing Values: .notna()
To find rows where a value exists, use .notna() — the opposite of .isna():
Find all users who have a phone number on record.
This returns only rows where phone_number has a non-null value.
Wine Varieties Tasted by Roger Voss
| id | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 118040 | US | A bit too soft and thus lacks structure. Shows a good array of wild berry and briary, brambly flavors. Dry and spicy, and ready now. | The Convict Rocky Ridge Vineyard | 86 | 38 | California | Rockpile | Sonoma | Paradise Ridge 2006 The Convict Rocky Ridge Vineyard Zinfandel (Rockpile) | Zinfandel | Paradise Ridge | ||
| 59743 | Italy | Aromas of toasted oak, vanilla and a confectionary note lead the nose while the palate offers mature Golden Delicious apple, butterscotch and walnut skin alongside bracing acidity. | Mongris Riserva | 88 | 30 | Northeastern Italy | Collio | Kerin O'Keefe | @kerinokeefe | Marco Felluga 2012 Mongris Riserva Pinot Grigio (Collio) | Pinot Grigio | Marco Felluga | |
| 117951 | US | Heavy in alcohol and overripe, but entirely dry, and the black currant, chocolate and pepper flavors will play well against richly sauced barbecue. | The Caboose | 84 | 30 | California | Alexander Valley | Sonoma | Starry Night 2007 The Caboose Zinfandel (Alexander Valley) | Zinfandel | Starry Night | ||
| 10202 | France | Fragrant; dusty plum and rose aromas are followed on the palate by very fruity flavors of cherries and apple skins. The fine bead and rich mousse result in a creamy mouthfeel. It's all a bit obvious and simple, but undeniably good. | Brut Rosa | 87 | 83 | Champagne | Champagne | Joe Czerwinski | @JoeCz | Ruinart NV Brut Rosa (Champagne) | Champagne Blend | Ruinart | |
| 69268 | Germany | Layers of plush yellow peach and apricot flavors highlight this easy, crowd-pleasing kabinett. Sunny lemon acidity lends refreshment to the midpalate, cutting through all the ripeness and leading to a brisk finish. Drink now through 2019. | Kabinett | 88 | 15 | Rheinhessen | Anna Lee C. Iijima | Weinreich 2012 Kabinett Riesling (Rheinhessen) | Riesling | Weinreich |
Find wine varieties tasted by 'Roger Voss' and with a value in the 'region_1' column of the dataset. Output unique variety names only.
NaN in Comparisons
NaN makes comparisons tricky. Any comparison with NaN returns False:
np.nan == np.nan # False (not True!)
np.nan != np.nan # True
np.nan > 5 # False
np.nan == "text" # FalseThis means rows with NaN values are excluded from most filters, even when you might not expect it.
NaN Disappears from Both Sides of a Filter
# Filtering with == excludes NaN automatically
fintech_app_users[fintech_app_users["status"] == "active"].str.contains(), .str.startswith(), and other .str methods return NaN (not False) for missing values. This means NaN rows are silently dropped from your filter results. If you need to keep them, fill missing values first with .fillna() or combine your filter with .isna().
Whenever you’re filtering and NaN might be involved, ask yourself: should missing rows be included or excluded? Then write your filter accordingly. Being explicit about NaN handling prevents subtle bugs.
Counting Missing Values
Before cleaning data, it helps to know how much is missing. A few useful patterns:
# Count NaN values per column
fintech_app_users.isna().sum()
# Percentage missing per column
fintech_app_users.isna().mean() * 100
# Total rows with any NaN
fintech_app_users.isna().any(axis=1).sum().isna().sum() counts missing values in each column. .isna().mean() gives you the proportion. These are the first things to run when exploring a new dataset.
Handling Missing Values with .fillna()
# Replace NaN in a specific column and keep the DataFrame
fintech_app_users["phone_number"] = (
fintech_app_users["phone_number"].fillna("No phone")
)This overwrites the column in place. You can also fill different columns with different defaults:
# Fill different columns with different values
df.fillna({"phone_number": "No phone", "email": "unknown@example.com"})Replace missing phone number values in `fintech_app_users` with the text Unknown.
Dropping Missing Values with .dropna()
Sometimes missing data makes a row useless. .dropna() removes rows that contain any NaN:
# Drop rows where ANY column is NaN
fintech_app_users.dropna()
# Drop rows where a SPECIFIC column is NaN
fintech_app_users.dropna(subset=["phone_number"])The subset parameter is important — without it, .dropna() removes any row that has a NaN in ANY column, which can be too aggressive.
Removing Incomplete Rows
Remove all rows from `fintech_app_users` where the phone number is missing, then display the user ID and name.
.fillna() replaces missing values with something. .dropna() removes rows (or columns) that contain missing values entirely. Use .fillna() when you want to keep the rows; use .dropna() when missing data makes the row useless.
Common Mistakes
Using == to check for NaN
# Wrong: equality check doesn't find NaN
fintech_app_users[fintech_app_users["phone_number"] == None]
# Correct: use .isna()
fintech_app_users[fintech_app_users["phone_number"].isna()]Assuming an empty string equals NaN
# These are different:
df[df["name"] == ""] # matches empty strings
df[df["name"].isna()] # matches NaN/NoneAn empty string "" is a value. NaN is the absence of a value. They require different checks.
Forgetting that NaN propagates through math
# Any arithmetic with NaN returns NaN
# 5 + NaN = NaN
# NaN * 100 = NaN
# Use .fillna(0) before math if NaN should be treated as zero
df["revenue"].fillna(0) + df["bonus"].fillna(0)Key Takeaways
NaNmeans unknown or missing — not zero, not an empty string.- Use
.isna()and.notna(), never== Noneor== NaN. - Comparisons with
NaNreturnFalse— missing rows silently disappear from filters. - Be explicit about whether missing rows should be included in your results.
.fillna()replaces missing values with defaults..dropna()removes rows with missing values.- Always check
.isna().sum()when exploring new data.
What’s Next
You now know how to filter data in all the ways that matter. In the next lesson, you’ll learn to sort your results and grab the top or bottom N rows — essential for any ranking or “top performers” question.