NaN in Aggregations
NaN (Not a Number) is pandas’ way of saying “missing.” The tricky part: aggregation functions handle NaN differently from what you’d expect. .mean() quietly ignores NaN rows. len() counts them. .count() skips them. If you don’t know these rules, your numbers will be wrong and you won’t know why.
The Silent Data Problem
All pandas aggregate methods skip NaN by default. Usually that’s what you want. But it can produce misleading numbers if you’re not paying attention.
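To see the default skipping in action, here is a minimal sketch using a toy Series (not the lesson's dataset). Passing `skipna=False` forces NaN to propagate instead:

```python
import pandas as pd
import numpy as np

s = pd.Series([100, 200, np.nan, 400])

print(s.mean())              # 233.33... — NaN is skipped, divides by 3
print(s.sum())               # 700.0
print(s.mean(skipna=False))  # nan — NaN propagates into the result
```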
How Each Method Handles NaN
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | | 2021-09-20 |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | | 2021-04-05 |
```python
# len() counts ALL rows, including NaN
len(techcorp_workforce)

# .count() counts only non-null values
techcorp_workforce["phone_number"].count()

# .sum(), .mean(), .min(), .max() all skip NaN
techcorp_workforce["salary"].mean()
```

The gap between `len(df)` and `.count()` tells you exactly how many values are missing.
If a column has 15 rows but 5 are NaN, .mean() divides by the 10 non-null values, not all 15 rows. This is usually correct, but be aware of it when reporting averages — your sample size might be smaller than you think.
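The 15-row scenario looks like this with a hypothetical salary column (values invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical column: 15 rows, 5 of them NaN
salary = pd.Series([50000.0] * 10 + [np.nan] * 5)

print(len(salary))     # 15 — total rows
print(salary.count())  # 10 — non-null values only
print(salary.mean())   # 50000.0 — sum divided by 10, not by 15
```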
Calculating Ratios Safely
This count gap is exactly how you calculate completeness ratios. Divide .count() (non-null) by len() (total rows) to get the fraction of rows that have data. Get it wrong — use .count() for both — and you'll always get 1.0.
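A quick sketch of the ratio on a small hypothetical frame (mirroring the lesson's `phone_number` column, with invented values):

```python
import pandas as pd

# Hypothetical data: 3 of 5 rows have a phone number
df = pd.DataFrame(
    {"phone_number": ["555-0101", "555-0102", None, "555-0104", None]}
)

# Non-null count divided by total rows
ratio = df["phone_number"].count() / len(df)
print(ratio)  # 0.6
```

Note that `.count() / .count()` would collapse to 1.0 no matter how much data is missing, which is why the denominator must be `len(df)`.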
Calculate what fraction of employees have a phone number on record.
Contact Information Completeness
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | 2021-09-20 | |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | 2021-04-05 |
The data quality team is auditing employee records to assess the completeness of contact information. Calculate and return the ratio of employees who have a non-`NULL` phone number.
NaN in GROUP BY
When you group by a column that has NaN values, those rows silently disappear:
```python
# If department has NaN values, those rows are excluded
(
    techcorp_workforce
    .groupby("department")["id"]
    .count()
    .reset_index(name="emp_count")
)
```

By default, `.groupby()` drops NaN keys. If you need a group for missing values, fill them first: `df["col"].fillna("Unknown")` before grouping.
Skip NaN (default) when missing means “unknown.” Fill with a value when missing has a business meaning — for example, .fillna("N/A") for display, or .fillna(0) when missing means zero.
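A small sketch of both behaviors, using a toy frame (column names chosen for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "department": ["HR", "HR", np.nan, "IT"],
    "id": [1, 2, 3, 4],
})

# Default: the row with a NaN department silently disappears
by_dept = df.groupby("department")["id"].count()
print(by_dept)  # HR: 2, IT: 1 — only 3 of 4 rows represented

# Fill first to keep missing keys visible as their own group
filled = df.assign(department=df["department"].fillna("Unknown"))
by_dept_filled = filled.groupby("department")["id"].count()
print(by_dept_filled)  # HR: 2, IT: 1, Unknown: 1
```

An alternative to filling is `groupby("department", dropna=False)`, which keeps NaN itself as a group key.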
Key Takeaways
- All aggregate methods skip `NaN` by default.
- `len(df)` counts all rows; `.count()` counts non-null only. The gap between them tells you how much data is missing.
- `.fillna(0)` before aggregating treats missing as zero.
- `.groupby()` silently drops `NaN` group keys.
What's Next
So far, your aggregations operate on all rows in a group. But what if you need to count only certain values, or sum conditionally? That's where conditional aggregation comes in.