Module 2: Aggregation & Grouping35 min

Introduction to Aggregate Methods

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Beyond Individual Rows

In Module 1, you learned to retrieve and filter individual rows. That’s useful, but it’s not how most business questions are phrased.

Nobody asks “Show me row 47 from the orders table.” They ask:

  • How many customers do we have?
  • What’s our total revenue this quarter?
  • What’s the average order value?

These questions need you to crunch multiple rows down into a single answer. That’s what aggregate methods do.

The Methods You’ll Use Constantly

Pandas gives you aggregate methods directly on columns (Series) or entire DataFrames. Here are the five core ones.

.count() and len(): How Many?

len(df) gives you the total number of rows. .count() counts non-null values per column.

Python
# Total rows
len(techcorp_workforce)

# Non-null values per column
techcorp_workforce.count()

# Non-null values in one column
techcorp_workforce["phone_number"].count()

# Unique values in a column
techcorp_workforce["department"].nunique()
len() vs .count() vs .nunique()

len(df) counts all rows including NaN. .count() counts non-null values. .nunique() counts unique non-null values. Getting these confused is one of the most common sources of incorrect numbers in reports.

.sum(): Add It Up

Table: techcorp_workforce
idfirst_namelast_namedepartmentsalaryphone_numberjoining_date
1SarahMitchellHR95000555-01012021-03-15
2MichaelChenHR88000555-01022022-06-01
3EmilyRodriguezHR825002021-09-20
4DavidParkHR80000555-01042023-01-10
5LisaThompsonHR650002021-04-05
1
Calculate Total Payroll

Calculate the total payroll by summing all salaries.

Tables: techcorp_workforce

.sum() ignores NaN values by default. If you have salaries of 50000, 60000, and NaN, you get 110000, not an error.

.mean(): The Average

Python
techcorp_workforce["salary"].mean()
.mean() Ignores NaN

.mean() skips NaN values. If you have values 100, 200, and NaN, the mean is 150 (300 / 2), not 100 (300 / 3). This is usually what you want, but be aware of it.

They work on text too — .min() gives you the first alphabetically, .max() gives the last. And on dates: .min() is the earliest, .max() is the most recent.

Multiple Aggregates at Once with .agg()

Instead of calling each method separately, .agg() lets you run multiple aggregations in one call:

Python
orders["total_order_cost"].agg(["sum", "mean", "min", "max"])

Pass a list of method names as strings. The result is a Series with one value per aggregation.

Multiple Aggregations at Once

2
Summarize Salaries

Use `.agg()` to calculate the sum, mean, min, and max salary in one expression.

Tables: techcorp_workforce
.describe() for a Quick Summary

Remember .describe() from Module 1? It’s essentially .agg() with a preset list of statistics: count, mean, std, min, 25%, 50%, 75%, max. Use .describe() for exploration, .agg() when you need specific aggregations.

Samantha's and Lisa's Total Sales Revenue

Use .sum() to add up values for specific conditions.

Table: sales_performance
salespersonwidget_salessales_revenueid
Jim810405001
Bobby661330502
Samantha1006503003
Taylor984492004
Tom403201505
3
Calculate Samantha's and Lisa's total sales revenue
View solution

What is the total sales revenue of Samantha and Lisa?

Tables: sales_performance

Olympics Events List By Age

Combine .min(), .mean(), and .max() to summarize a column.

Table: olympics_athletes_events
idnamesexageheightweightteamnocgamesyearseasoncitysporteventmedal
3520Guillermo J. AmparanMMexicoMEX1924 Summer1924SummerParisAthleticsAthletics Men's 800 metres
35394Henry John FinchettMGreat BritainGBR1924 Summer1924SummerParisGymnasticsGymnastics Men's Rings
21918Georg Frederik Ahrensborg ClausenM28DenmarkDEN1924 Summer1924SummerParisCyclingCycling Men's Road Race Individual
110345Marinus Cornelis Dick SigmondM26NetherlandsNED1924 Summer1924SummerParisFootballFootball Men's Football
54193Thodore Tho JeitzM26LuxembourgLUX1924 Summer1924SummerParisGymnasticsGymnastics Men's Individual All-Around
4
Olympics Events List By Age
View solution

Find the lowest, average, and the highest ages of athletes across all Olympics. HINT: If athlete participated in more than one discipline at one Olympic games, consider it as a separate athlete, no need to remove such edge cases.

Tables: olympics_athletes_events

Hour Of Highest Gas Expense

Table: lyft_rides
indexweatherhourtravel_distancegasoline_cost
0cloudy724.471.13
1cloudy2323.671.99
2sunny1720.930.86
3rainy229.580.85
4rainy716.110.95
5
Hour Of Highest Gas Expense
View solution

Find the hour with the highest gasoline cost. Assume there's only 1 hour with the highest gas cost.

Tables: lyft_rides

Key Takeaways

  • .sum(), .mean(), .min(), .max(), .count() are your core aggregate methods.
  • All of them skip NaN values by default.
  • len(df) counts all rows; .count() counts non-null; .nunique() counts unique.
  • .agg([...]) runs multiple aggregations in one call.
  • Ask yourself: am I counting rows or unique entities?

What’s Next

Right now, you’re getting one number for the entire DataFrame. But what if you need revenue by region? Headcount by department? That’s where .groupby() comes in.