NaN in Aggregations
NaN (Not a Number) is pandas’ way of saying “missing.” The tricky part: aggregation functions handle NaN differently from what you’d expect. .mean() quietly ignores NaN rows. len() counts them. .count() skips them. If you don’t know these rules, your numbers will be wrong and you won’t know why.
The Silent Data Problem
All pandas aggregate methods skip NaN by default. Usually that’s what you want. But it can produce misleading numbers if you’re not paying attention.
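To see the default skipping in action, here is a minimal sketch using a toy Series (not the lesson's dataset). Passing `skipna=False` forces NaN to propagate instead:

```python
import pandas as pd
import numpy as np

s = pd.Series([100, 200, np.nan, 400])

print(s.mean())              # 233.33... — NaN is skipped, divides by 3
print(s.sum())               # 700.0
print(s.mean(skipna=False))  # nan — NaN propagates into the result
```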
How Each Method Handles NaN
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | | 2021-09-20 |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | | 2021-04-05 |
```python
# len() counts ALL rows, including NaN
len(techcorp_workforce)

# .count() counts only non-null values
techcorp_workforce["phone_number"].count()

# .sum(), .mean(), .min(), .max() all skip NaN
techcorp_workforce["salary"].mean()
```

The gap between `len(df)` and `.count()` tells you exactly how many values are missing.
If a column has 15 rows but 5 are NaN, .mean() divides by the 10 non-null values, not all 15 rows. This is usually correct, but be aware of it when reporting averages — your sample size might be smaller than you think.
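The 15-row scenario looks like this with a hypothetical salary column (values invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical column: 15 rows, 5 of them NaN
salary = pd.Series([50000.0] * 10 + [np.nan] * 5)

print(len(salary))     # 15 — total rows
print(salary.count())  # 10 — non-null values only
print(salary.mean())   # 50000.0 — sum divided by 10, not by 15
```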
Calculating Ratios Safely
This count gap is exactly how you calculate completeness ratios. Divide .count() (non-null) by len() (total rows) to get the fraction of rows that have data. Get it wrong — use .count() for both — and you'll always get 1.0.
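A quick sketch of the ratio on a small hypothetical frame (mirroring the lesson's `phone_number` column, with invented values):

```python
import pandas as pd

# Hypothetical data: 3 of 5 rows have a phone number
df = pd.DataFrame(
    {"phone_number": ["555-0101", "555-0102", None, "555-0104", None]}
)

# Non-null count divided by total rows
ratio = df["phone_number"].count() / len(df)
print(ratio)  # 0.6
```

Note that `.count() / .count()` would collapse to 1.0 no matter how much data is missing, which is why the denominator must be `len(df)`.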
Calculate what fraction of employees have a phone number on record.
Contact Information Completeness
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | 2021-09-20 | |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | 2021-04-05 |
The data quality team is auditing employee records to assess the completeness of contact information. Calculate and return the ratio of employees who have a non-`NULL` phone number.
NaN in GROUP BY
When you group by a column that has NaN values, those rows silently disappear:
```python
# If department has NaN values, those rows are excluded
(
    techcorp_workforce
    .groupby("department")["id"]
    .count()
    .reset_index(name="emp_count")
)
```

By default, `.groupby()` drops NaN keys. If you need a group for missing values, fill them first: `df["col"].fillna("Unknown")` before grouping.
Skip NaN (default) when missing means “unknown.” Fill with a value when missing has a business meaning — for example, .fillna("N/A") for display, or .fillna(0) when missing means zero.
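A small sketch of both behaviors, using a toy frame (column names chosen for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "department": ["HR", "HR", np.nan, "IT"],
    "id": [1, 2, 3, 4],
})

# Default: the row with a NaN department silently disappears
by_dept = df.groupby("department")["id"].count()
print(by_dept)  # HR: 2, IT: 1 — only 3 of 4 rows represented

# Fill first to keep missing keys visible as their own group
filled = df.assign(department=df["department"].fillna("Unknown"))
by_dept_filled = filled.groupby("department")["id"].count()
print(by_dept_filled)  # HR: 2, IT: 1, Unknown: 1
```

An alternative to filling is `groupby("department", dropna=False)`, which keeps NaN itself as a group key.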
Key Takeaways
- All aggregate methods skip `NaN` by default.
- `len(df)` counts all rows; `.count()` counts non-null only. The gap between them tells you how much data is missing.
- `.fillna(0)` before aggregating treats missing as zero.
- `.groupby()` silently drops `NaN` group keys.
What's Next
So far, your aggregations operate on all rows in a group. But what if you need to count only certain values, or sum conditionally? That's where conditional aggregation comes in.