Module 2: Aggregation & Grouping (15 min)

NaN in Aggregations


NaN (Not a Number) is pandas’ way of saying “missing.” The tricky part: aggregation functions handle NaN differently from what you’d expect. .mean() quietly ignores NaN rows. len() counts them. .count() skips them. If you don’t know these rules, your numbers will be wrong and you won’t know why.

The Silent Data Problem

All pandas aggregate methods skip NaN by default. Usually that’s what you want. But it can produce misleading numbers if you’re not paying attention.

How Each Method Handles NaN

Table: techcorp_workforce
| id | first_name | last_name | department | salary | phone_number | joining_date |
|----|------------|-----------|------------|--------|--------------|--------------|
| 1  | Sarah      | Mitchell  | HR         | 95000  | 555-0101     | 2021-03-15   |
| 2  | Michael    | Chen      | HR         | 88000  | 555-0102     | 2022-06-01   |
| 3  | Emily      | Rodriguez | HR         | 82500  |              | 2021-09-20   |
| 4  | David      | Park      | HR         | 80000  | 555-0104     | 2023-01-10   |
| 5  | Lisa       | Thompson  | HR         | 65000  |              | 2021-04-05   |
Python
# len() counts ALL rows, including NaN
len(techcorp_workforce)

# .count() counts only non-null values
techcorp_workforce["phone_number"].count()

# .sum(), .mean(), .min(), .max() all skip NaN
techcorp_workforce["salary"].mean()

The gap between len(df) and .count() tells you exactly how many values are missing.
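A minimal sketch of this gap, using a hypothetical five-row sample shaped like techcorp_workforce's phone_number column:

```python
import pandas as pd
import numpy as np

# Hypothetical sample: rows 3 and 5 have no phone number
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "phone_number": ["555-0101", "555-0102", np.nan, "555-0104", np.nan],
})

total_rows = len(df)                   # counts every row, NaN included
non_null = df["phone_number"].count()  # counts only non-null values
missing = total_rows - non_null        # the gap = number of missing values
```

Here `total_rows` is 5, `non_null` is 3, so `missing` is 2 — two employees have no phone number on record.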

.mean() with NaN

If a column has 15 rows but 5 are NaN, .mean() divides by the 10 non-null values, not all 15 rows. This is usually correct, but be aware of it when reporting averages — your sample size might be smaller than you think.
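To see the denominator difference concretely, here is a small illustrative Series (not from the lesson's table):

```python
import pandas as pd
import numpy as np

s = pd.Series([100, 200, np.nan, np.nan, 300])

# .mean() divides the sum of non-null values by the NON-NULL count
skip_nan_mean = s.mean()        # 600 / 3 = 200.0

# Dividing by len(s) would use ALL rows, NaN included
all_rows_mean = s.sum() / len(s)  # 600 / 5 = 120.0
```

The two results differ because `.mean()`'s effective sample size is 3, not 5.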

Calculating Ratios Safely

This count gap is exactly how you calculate completeness ratios. Divide .count() (non-null) by len() (total rows) to get the fraction of rows that have data. Get it wrong — use .count() for both — and you'll always get 1.0.
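A sketch of the completeness calculation, assuming a small stand-in for techcorp_workforce with two missing phone numbers:

```python
import pandas as pd
import numpy as np

techcorp_workforce = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "phone_number": ["555-0101", "555-0102", np.nan, "555-0104", np.nan],
})

# Correct: non-null count over total row count
completeness = techcorp_workforce["phone_number"].count() / len(techcorp_workforce)
# 3 / 5 = 0.6

# The mistake to avoid: count() over count() is always 1.0
always_one = (
    techcorp_workforce["phone_number"].count()
    / techcorp_workforce["phone_number"].count()
)
```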

Exercise 1: Calculate a Phone Number Ratio

Calculate what fraction of employees have a phone number on record.

Tables: techcorp_workforce

Exercise 2: Contact Information Completeness

The data quality team is auditing employee records to assess the completeness of contact information. Calculate and return the ratio of employees whose phone number is missing (NaN).

Tables: techcorp_workforce

NaN in GROUP BY

When you group by a column that has NaN values, those rows silently disappear:

Python
# If department has NaN values, those rows are excluded
(
    techcorp_workforce
    .groupby("department")["id"]
    .count()
    .reset_index(name="emp_count")
)
NaN Groups Are Silently Dropped

By default, .groupby() drops NaN keys. If you need a group for missing values, fill them first: df["col"].fillna("Unknown") before grouping.
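A sketch of both behaviors on a hypothetical frame with two NaN department keys:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "department": ["HR", "HR", np.nan, "Sales", np.nan],
    "id": [1, 2, 3, 4, 5],
})

# Default: the two NaN-department rows vanish from the result
by_dept = df.groupby("department")["id"].count()
# HR: 2, Sales: 1 -- only 3 of 5 rows are represented

# Fill first to keep missing keys as an explicit "Unknown" group
by_dept_filled = (
    df.assign(department=df["department"].fillna("Unknown"))
      .groupby("department")["id"]
      .count()
)
# HR: 2, Sales: 1, Unknown: 2

# Alternative (pandas >= 1.1): keep NaN itself as a group key
by_dept_nan = df.groupby("department", dropna=False)["id"].count()
```

`dropna=False` keeps `NaN` as a literal key; `fillna("Unknown")` gives the group a readable label, which is usually friendlier for reports.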

When to Fill vs. When to Skip

Skip NaN (default) when missing means “unknown.” Fill with a value when missing has a business meaning — for example, .fillna("N/A") for display, or .fillna(0) when missing means zero.
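The two choices can give very different numbers. An illustrative example where missing could mean "no sales that day":

```python
import pandas as pd
import numpy as np

sales = pd.Series([100.0, np.nan, 300.0])

# Skip (default): NaN means "unknown" -- average over the 2 known values
mean_skip = sales.mean()             # 400 / 2 = 200.0

# Fill: NaN means "zero sales" -- the zero pulls the average down
mean_fill = sales.fillna(0).mean()   # 400 / 3 = 133.33...
```

Which number is right depends entirely on what "missing" means in your data, not on pandas.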

Key Takeaways

  • All aggregate methods skip NaN by default.
  • len(df) counts all rows; .count() counts non-null only.
  • The gap between them tells you how much data is missing.
  • .fillna(0) before aggregating treats missing as zero.
  • .groupby() silently drops NaN group keys.

What's Next

So far, your aggregations operate on all rows in a group. But what if you need to count only certain values, or sum conditionally? That's where conditional aggregation comes in.