Module 1: DataFrame Fundamentals•40 min

Working with Missing Data

Progress Tracking

The Missing Data Problem

Missing values cause bugs, return unexpected results, and trip up even experienced analysts. But once you understand how pandas handles them, you can avoid the pitfalls.

What NaN Actually Means

NaN stands for "Not a Number," but it really means "missing" or "unknown." It’s not zero. It’s not an empty string. It’s the absence of a value.

Think of it this way:

0 — the value is zero
"" (empty string) — the value is blank text
NaN — we don’t know what the value is

This distinction matters because NaN behaves differently from actual values.

NaN vs None

In pandas, both NaN (from NumPy) and None (Python’s built-in) represent missing data. Pandas treats them interchangeably in most situations. You’ll see both, and the same methods handle both.

A Real Example

Let’s look at the fintech_app_users table:

Table: fintech_app_users

user_id	user_name	phone_number
1	Emma Johnson	8994954670
2	Noah Williams	7848177904
3	Olivia Garcia	921-171-1750
4	Liam Martinez
5	Ava Rodriguez

Some users haven’t provided their phone numbers — those values show up as NaN. We don’t know their phone numbers. Maybe they never entered one. Maybe the data was lost. The point is: we don’t have a value.

Finding Missing Values: .isna()

Here’s the first surprise: you can’t use == to check for NaN. NaN == NaN evaluates to False in Python, so equality checks won’t find missing values. Instead, use .isna():

Python

fintech_app_users[fintech_app_users["phone_number"].isna()]

.isna() returns True for every row where the value is missing, and False everywhere else.

Users Missing Phone Numbers

Tables: fintech_app_users

Excluding Missing Values: .notna()

To find rows where a value exists, use .notna() — the opposite of .isna():

Tables: fintech_app_users

This returns only rows where phone_number has a non-null value.

Wine Varieties Tasted by Roger Voss

Table: winemag_p2

id	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
118040	US	A bit too soft and thus lacks structure. Shows a good array of wild berry and briary, brambly flavors. Dry and spicy, and ready now.	The Convict Rocky Ridge Vineyard	86	38	California	Rockpile	Sonoma			Paradise Ridge 2006 The Convict Rocky Ridge Vineyard Zinfandel (Rockpile)	Zinfandel	Paradise Ridge
59743	Italy	Aromas of toasted oak, vanilla and a confectionary note lead the nose while the palate offers mature Golden Delicious apple, butterscotch and walnut skin alongside bracing acidity.	Mongris Riserva	88	30	Northeastern Italy	Collio		Kerin O'Keefe	@kerinokeefe	Marco Felluga 2012 Mongris Riserva Pinot Grigio (Collio)	Pinot Grigio	Marco Felluga
117951	US	Heavy in alcohol and overripe, but entirely dry, and the black currant, chocolate and pepper flavors will play well against richly sauced barbecue.	The Caboose	84	30	California	Alexander Valley	Sonoma			Starry Night 2007 The Caboose Zinfandel (Alexander Valley)	Zinfandel	Starry Night
10202	France	Fragrant; dusty plum and rose aromas are followed on the palate by very fruity flavors of cherries and apple skins. The fine bead and rich mousse result in a creamy mouthfeel. It's all a bit obvious and simple, but undeniably good.	Brut Rosa	87	83	Champagne	Champagne		Joe Czerwinski	@JoeCz	Ruinart NV Brut Rosa (Champagne)	Champagne Blend	Ruinart
69268	Germany	Layers of plush yellow peach and apricot flavors highlight this easy, crowd-pleasing kabinett. Sunny lemon acidity lends refreshment to the midpalate, cutting through all the ripeness and leading to a brisk finish. Drink now through 2019.	Kabinett	88	15	Rheinhessen			Anna Lee C. Iijima		Weinreich 2012 Kabinett Riesling (Rheinhessen)	Riesling	Weinreich

Tables: winemag_p2

NaN in Comparisons

NaN makes comparisons tricky. Any comparison with NaN returns False:

Python

np.nan == np.nan    # False (not True!)
np.nan != np.nan    # True
np.nan > 5          # False
np.nan == "text"    # False

This means rows with NaN values are excluded from most filters, even when you might not expect it.

NaN Disappears from Both Sides of a Filter

Python

# Filtering with == excludes NaN automatically
fintech_app_users[fintech_app_users["status"] == "active"]

String Methods and NaN

.str.contains(), .str.startswith(), and other .str methods return NaN (not False) for missing values. This means NaN rows are silently dropped from your filter results. If you need to keep them, fill missing values first with .fillna() or combine your filter with .isna().

Always Ask: Should Missing Values Be Included?

Whenever you’re filtering and NaN might be involved, ask yourself: should missing rows be included or excluded? Then write your filter accordingly. Being explicit about NaN handling prevents subtle bugs.

Counting Missing Values

Before cleaning data, it helps to know how much is missing. A few useful patterns:

Python

# Count NaN values per column
fintech_app_users.isna().sum()

# Percentage missing per column
fintech_app_users.isna().mean() * 100

# Total rows with any NaN
fintech_app_users.isna().any(axis=1).sum()

.isna().sum() counts missing values in each column. .isna().mean() gives you the proportion. These are the first things to run when exploring a new dataset.

Handling Missing Values with .fillna()

Python

# Replace NaN in a specific column and keep the DataFrame
fintech_app_users["phone_number"] = (
    fintech_app_users["phone_number"].fillna("No phone")
)

This overwrites the column in place. You can also fill different columns with different defaults:

Python

# Fill different columns with different values
df.fillna({"phone_number": "No phone", "email": "unknown@example.com"})

Table: fintech_app_users

user_id	user_name	phone_number
1	Emma Johnson	8994954670
2	Noah Williams	7848177904
3	Olivia Garcia	921-171-1750
4	Liam Martinez
5	Ava Rodriguez

Tables: fintech_app_users

Dropping Missing Values with .dropna()

Sometimes missing data makes a row useless. .dropna() removes rows that contain any NaN:

Python

# Drop rows where ANY column is NaN
fintech_app_users.dropna()

# Drop rows where a SPECIFIC column is NaN
fintech_app_users.dropna(subset=["phone_number"])

The subset parameter is important — without it, .dropna() removes any row that has a NaN in ANY column, which can be too aggressive.

Removing Incomplete Rows

Tables: fintech_app_users

.fillna() vs .dropna()

.fillna() replaces missing values with something. .dropna() removes rows (or columns) that contain missing values entirely. Use .fillna() when you want to keep the rows; use .dropna() when missing data makes the row useless.

Common Mistakes

Using == to check for NaN

Python

# Wrong: equality check doesn't find NaN
fintech_app_users[fintech_app_users["phone_number"] == None]

# Correct: use .isna()
fintech_app_users[fintech_app_users["phone_number"].isna()]

Assuming an empty string equals NaN

Python

# These are different:
df[df["name"] == ""]      # matches empty strings
df[df["name"].isna()]     # matches NaN/None

An empty string "" is a value. NaN is the absence of a value. They require different checks.

Forgetting that NaN propagates through math

Python

# Any arithmetic with NaN returns NaN
# 5 + NaN = NaN
# NaN * 100 = NaN

# Use .fillna(0) before math if NaN should be treated as zero
df["revenue"].fillna(0) + df["bonus"].fillna(0)

Key Takeaways

NaN means unknown or missing — not zero, not an empty string.
Use .isna() and .notna(), never == None or == NaN.
Comparisons with NaN return False — missing rows silently disappear from filters.
Be explicit about whether missing rows should be included in your results.
.fillna() replaces missing values with defaults. .dropna() removes rows with missing values.
Always check .isna().sum() when exploring new data.

What’s Next

You now know how to filter data in all the ways that matter. In the next lesson, you’ll learn to sort your results and grab the top or bottom N rows — essential for any ranking or “top performers” question.

Next upContinue →

Sorting and Limiting Results

30 min