Module 1: DataFrame Fundamentals40 min

Working with Missing Data

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

The Missing Data Problem

Missing values cause bugs, return unexpected results, and trip up even experienced analysts. But once you understand how pandas handles them, you can avoid the pitfalls.

What NaN Actually Means

NaN stands for "Not a Number," but it really means "missing" or "unknown." It’s not zero. It’s not an empty string. It’s the absence of a value.

Think of it this way:

  • 0 — the value is zero
  • "" (empty string) — the value is blank text
  • NaN — we don’t know what the value is

This distinction matters because NaN behaves differently from actual values.

NaN vs None

In pandas, both NaN (from NumPy) and None (Python’s built-in) represent missing data. Pandas treats them interchangeably in most situations. You’ll see both, and the same methods handle both.

A Real Example

Let’s look at the fintech_app_users table:

Table: fintech_app_users
user_iduser_namephone_number
1Emma Johnson8994954670
2Noah Williams7848177904
3Olivia Garcia921-171-1750
4Liam Martinez
5Ava Rodriguez

Some users haven’t provided their phone numbers — those values show up as NaN. We don’t know their phone numbers. Maybe they never entered one. Maybe the data was lost. The point is: we don’t have a value.

Finding Missing Values: .isna()

Here’s the first surprise: you can’t use == to check for NaN. NaN == NaN evaluates to False in Python, so equality checks won’t find missing values. Instead, use .isna():

Python
fintech_app_users[fintech_app_users["phone_number"].isna()]

.isna() returns True for every row where the value is missing, and False everywhere else.

Users Missing Phone Numbers

Table: fintech_app_users
user_iduser_namephone_number
1Emma Johnson8994954670
2Noah Williams7848177904
3Olivia Garcia921-171-1750
4Liam Martinez
5Ava Rodriguez
1
Users Missing Phone Numbers
View solution

The product team is launching a new WhatsApp notification feature and needs to identify users who haven't provided their phone numbers yet. These users will be shown a prompt to add their contact information. Find all users who have not provided a phone number. Return the `user ID` and `name`.

Tables: fintech_app_users

Excluding Missing Values: .notna()

To find rows where a value exists, use .notna() — the opposite of .isna():

2
Find Users with a Phone Number

Find all users who have a phone number on record.

Tables: fintech_app_users

This returns only rows where phone_number has a non-null value.

Wine Varieties Tasted by Roger Voss

Table: winemag_p2
idcountrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
118040USA bit too soft and thus lacks structure. Shows a good array of wild berry and briary, brambly flavors. Dry and spicy, and ready now.The Convict Rocky Ridge Vineyard8638CaliforniaRockpileSonomaParadise Ridge 2006 The Convict Rocky Ridge Vineyard Zinfandel (Rockpile)ZinfandelParadise Ridge
59743ItalyAromas of toasted oak, vanilla and a confectionary note lead the nose while the palate offers mature Golden Delicious apple, butterscotch and walnut skin alongside bracing acidity.Mongris Riserva8830Northeastern ItalyCollioKerin O'Keefe@kerinokeefeMarco Felluga 2012 Mongris Riserva Pinot Grigio (Collio)Pinot GrigioMarco Felluga
117951USHeavy in alcohol and overripe, but entirely dry, and the black currant, chocolate and pepper flavors will play well against richly sauced barbecue.The Caboose8430CaliforniaAlexander ValleySonomaStarry Night 2007 The Caboose Zinfandel (Alexander Valley)ZinfandelStarry Night
10202FranceFragrant; dusty plum and rose aromas are followed on the palate by very fruity flavors of cherries and apple skins. The fine bead and rich mousse result in a creamy mouthfeel. It's all a bit obvious and simple, but undeniably good.Brut Rosa8783ChampagneChampagneJoe Czerwinski@JoeCzRuinart NV Brut Rosa (Champagne)Champagne BlendRuinart
69268GermanyLayers of plush yellow peach and apricot flavors highlight this easy, crowd-pleasing kabinett. Sunny lemon acidity lends refreshment to the midpalate, cutting through all the ripeness and leading to a brisk finish. Drink now through 2019.Kabinett8815RheinhessenAnna Lee C. IijimaWeinreich 2012 Kabinett Riesling (Rheinhessen)RieslingWeinreich
3
Wine varieties tasted by 'Roger Voss'
View solution

Find wine varieties tasted by 'Roger Voss' and with a value in the 'region_1' column of the dataset. Output unique variety names only.

Tables: winemag_p2

NaN in Comparisons

NaN makes comparisons tricky. Any comparison with NaN returns False:

Python
np.nan == np.nan    # False (not True!)
np.nan != np.nan    # True
np.nan > 5          # False
np.nan == "text"    # False

This means rows with NaN values are excluded from most filters, even when you might not expect it.

NaN Disappears from Both Sides of a Filter

Python
# Filtering with == excludes NaN automatically
fintech_app_users[fintech_app_users["status"] == "active"]
String Methods and NaN

.str.contains(), .str.startswith(), and other .str methods return NaN (not False) for missing values. This means NaN rows are silently dropped from your filter results. If you need to keep them, fill missing values first with .fillna() or combine your filter with .isna().

Always Ask: Should Missing Values Be Included?

Whenever you’re filtering and NaN might be involved, ask yourself: should missing rows be included or excluded? Then write your filter accordingly. Being explicit about NaN handling prevents subtle bugs.

Counting Missing Values

Before cleaning data, it helps to know how much is missing. A few useful patterns:

Python
# Count NaN values per column
fintech_app_users.isna().sum()

# Percentage missing per column
fintech_app_users.isna().mean() * 100

# Total rows with any NaN
fintech_app_users.isna().any(axis=1).sum()

.isna().sum() counts missing values in each column. .isna().mean() gives you the proportion. These are the first things to run when exploring a new dataset.

Handling Missing Values with .fillna()

Python
# Replace NaN in a specific column and keep the DataFrame
fintech_app_users["phone_number"] = (
    fintech_app_users["phone_number"].fillna("No phone")
)

This overwrites the column in place. You can also fill different columns with different defaults:

Python
# Fill different columns with different values
df.fillna({"phone_number": "No phone", "email": "unknown@example.com"})
4
Fill Missing Phone Numbers

Replace missing phone number values in `fintech_app_users` with the text Unknown.

Tables: fintech_app_users

Dropping Missing Values with .dropna()

Sometimes missing data makes a row useless. .dropna() removes rows that contain any NaN:

Python
# Drop rows where ANY column is NaN
fintech_app_users.dropna()

# Drop rows where a SPECIFIC column is NaN
fintech_app_users.dropna(subset=["phone_number"])

The subset parameter is important — without it, .dropna() removes any row that has a NaN in ANY column, which can be too aggressive.

Removing Incomplete Rows

5
Drop Rows Missing Phone Numbers

Remove all rows from `fintech_app_users` where the phone number is missing, then display the user ID and name.

Tables: fintech_app_users
.fillna() vs .dropna()

.fillna() replaces missing values with something. .dropna() removes rows (or columns) that contain missing values entirely. Use .fillna() when you want to keep the rows; use .dropna() when missing data makes the row useless.

Common Mistakes

Using == to check for NaN

Python
# Wrong: equality check doesn't find NaN
fintech_app_users[fintech_app_users["phone_number"] == None]

# Correct: use .isna()
fintech_app_users[fintech_app_users["phone_number"].isna()]

Assuming an empty string equals NaN

Python
# These are different:
df[df["name"] == ""]      # matches empty strings
df[df["name"].isna()]     # matches NaN/None

An empty string "" is a value. NaN is the absence of a value. They require different checks.

Forgetting that NaN propagates through math

Python
# Any arithmetic with NaN returns NaN
# 5 + NaN = NaN
# NaN * 100 = NaN

# Use .fillna(0) before math if NaN should be treated as zero
df["revenue"].fillna(0) + df["bonus"].fillna(0)

Key Takeaways

  • NaN means unknown or missing — not zero, not an empty string.
  • Use .isna() and .notna(), never == None or == NaN.
  • Comparisons with NaN return False — missing rows silently disappear from filters.
  • Be explicit about whether missing rows should be included in your results.
  • .fillna() replaces missing values with defaults. .dropna() removes rows with missing values.
  • Always check .isna().sum() when exploring new data.

What’s Next

You now know how to filter data in all the ways that matter. In the next lesson, you’ll learn to sort your results and grab the top or bottom N rows — essential for any ranking or “top performers” question.