Module 2: Aggregation & Grouping • 35 min

Introduction to Aggregate Methods

Beyond Individual Rows

In Module 1, you learned to retrieve and filter individual rows. That’s useful, but it’s not how most business questions are phrased.

Nobody asks “Show me row 47 from the orders table.” They ask:

How many customers do we have?
What’s our total revenue this quarter?
What’s the average order value?

These questions need you to crunch multiple rows down into a single answer. That’s what aggregate methods do.

The Methods You’ll Use Constantly

Pandas gives you aggregate methods directly on columns (Series) or entire DataFrames. Here are the five core ones.

.count() and len(): How Many?

len(df) gives you the total number of rows. .count() counts non-null values per column.

Python

# Total rows
len(techcorp_workforce)

# Non-null values per column
techcorp_workforce.count()

# Non-null values in one column
techcorp_workforce["phone_number"].count()

# Unique values in a column
techcorp_workforce["department"].nunique()

len() vs .count() vs .nunique()

len(df) counts all rows, including rows that contain NaN. .count() counts non-null values. .nunique() counts unique non-null values. Getting these confused is one of the most common sources of incorrect numbers in reports.
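To see the three counts disagree on the same data, here is a minimal sketch. The four-row frame below is made up for illustration (it is not the real techcorp_workforce table): one phone number is missing and each department appears twice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing phone number, two distinct departments
df = pd.DataFrame({
    "department": ["HR", "HR", "Sales", "Sales"],
    "phone_number": ["555-0101", np.nan, "555-0103", "555-0104"],
})

total_rows = len(df)                         # every row, missing values or not
phones_on_file = df["phone_number"].count()  # non-null phone numbers only
departments = df["department"].nunique()     # distinct non-null departments

print(total_rows, phones_on_file, departments)  # 4 3 2
```

Same DataFrame, three different answers. Which one is right depends on the question being asked.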

.sum(): Add It Up

Table: techcorp_workforce

id	first_name	last_name	department	salary	phone_number	joining_date
1	Sarah	Mitchell	HR	95000	555-0101	2021-03-15
2	Michael	Chen	HR	88000	555-0102	2022-06-01
3	Emily	Rodriguez	HR	82500		2021-09-20
4	David	Park	HR	80000	555-0104	2023-01-10
5	Lisa	Thompson	HR	65000		2021-04-05

.sum() ignores NaN values by default. If you have salaries of 50000, 60000, and NaN, you get 110000, not NaN.
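A short sketch of .sum() in action. The first part rebuilds only the salary column from the five sample rows shown above; the second part is the three-value NaN case from the text. The skipna=False call is included only to show what happens when NaN is not skipped.

```python
import numpy as np
import pandas as pd

# Salary column from the five sample rows of techcorp_workforce shown above
techcorp_workforce = pd.DataFrame({"salary": [95000, 88000, 82500, 80000, 65000]})
total_salary = techcorp_workforce["salary"].sum()  # 410500

# The NaN case described in the text
salaries = pd.Series([50000, 60000, np.nan])
total = salaries.sum()                     # NaN is skipped: 110000.0
strict_total = salaries.sum(skipna=False)  # NaN propagates: nan
```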

.mean(): The Average

Python

techcorp_workforce["salary"].mean()

.mean() Ignores NaN

.mean() skips NaN values. If you have values 100, 200, and NaN, the mean is 150 (300 / 2), not 100 (300 / 3). This is usually what you want, but be aware of it.
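The arithmetic above can be checked directly. This sketch uses the same three values; the .fillna(0) line shows one way to get the other behavior, assuming a missing value really should count as zero in your data.

```python
import numpy as np
import pandas as pd

values = pd.Series([100, 200, np.nan])

mean_skipping_nan = values.mean()           # 300 / 2 = 150.0
mean_nan_as_zero = values.fillna(0).mean()  # 300 / 3 = 100.0
```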

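.min() and .max(): Smallest and Largest

.min() returns the smallest value in a column and .max() returns the largest. They work on text too: .min() gives you the first value alphabetically, .max() gives the last (uppercase letters sort before lowercase). And on dates: .min() is the earliest, .max() is the most recent.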
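Here is a small sketch of .min() and .max() on numbers, text, and dates, using three columns from the sample techcorp_workforce rows shown earlier. It assumes joining_date has been parsed into datetimes with pd.to_datetime; the real table may already store it that way.

```python
import pandas as pd

# Three columns from the five sample rows of techcorp_workforce
techcorp_workforce = pd.DataFrame({
    "first_name": ["Sarah", "Michael", "Emily", "David", "Lisa"],
    "salary": [95000, 88000, 82500, 80000, 65000],
    "joining_date": pd.to_datetime(
        ["2021-03-15", "2022-06-01", "2021-09-20", "2023-01-10", "2021-04-05"]
    ),
})

lowest_salary = techcorp_workforce["salary"].min()       # 65000
highest_salary = techcorp_workforce["salary"].max()      # 95000
first_name_az = techcorp_workforce["first_name"].min()   # "David"
last_name_az = techcorp_workforce["first_name"].max()    # "Sarah"
earliest_join = techcorp_workforce["joining_date"].min() # 2021-03-15
latest_join = techcorp_workforce["joining_date"].max()   # 2023-01-10
```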

Multiple Aggregates at Once with .agg()

Instead of calling each method separately, .agg() lets you run multiple aggregations in one call:

Python

orders["total_order_cost"].agg(["sum", "mean", "min", "max"])

Pass a list of method names as strings. The result is a Series with one value per aggregation.
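To make the shape of that result concrete, here is a minimal sketch. The orders table is not shown in this lesson, so the four order costs below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical order costs; the real orders table is not shown here
orders = pd.DataFrame({"total_order_cost": [20.0, 35.0, 50.0, 15.0]})

summary = orders["total_order_cost"].agg(["sum", "mean", "min", "max"])

# summary is a Series whose index holds the aggregation names,
# so individual results can be looked up by name
total_cost = summary["sum"]     # 120.0
average_cost = summary["mean"]  # 30.0
```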

.describe() for a Quick Summary

Remember .describe() from Module 1? For numeric columns, it’s essentially .agg() with a preset list of statistics: count, mean, std, min, 25%, 50%, 75%, max. Use .describe() for exploration, .agg() when you need specific aggregations.
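A quick side-by-side, sketched on the five sample salaries from techcorp_workforce: .describe() returns the full preset list, while .agg() returns only what you ask for.

```python
import pandas as pd

# Salaries from the five sample rows of techcorp_workforce
salary = pd.Series([95000, 88000, 82500, 80000, 65000], name="salary")

overview = salary.describe()                   # all eight preset statistics
picked = salary.agg(["count", "mean", "max"])  # only the ones requested

print(list(overview.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
```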

Samantha's and Lisa's Total Sales Revenue

Use .sum() to add up values for specific conditions.

Table: sales_performance

salesperson	widget_sales	sales_revenue	id
Jim	810	40500	1
Bobby	661	33050	2
Samantha	1006	50300	3
Taylor	984	49200	4
Tom	403	20150	5
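One possible approach, sketched on the five sample rows only: filter to the two salespeople with .isin(), then call .sum() on the revenue column. This is an illustration rather than the official solution. Lisa does not appear in the sample rows, so on this excerpt the total is just Samantha's revenue; the full table would presumably include both.

```python
import pandas as pd

# The five sample rows of sales_performance shown above
sales_performance = pd.DataFrame({
    "salesperson": ["Jim", "Bobby", "Samantha", "Taylor", "Tom"],
    "widget_sales": [810, 661, 1006, 984, 403],
    "sales_revenue": [40500, 33050, 50300, 49200, 20150],
    "id": [1, 2, 3, 4, 5],
})

# Keep only the rows for the two salespeople, then add up their revenue
is_target = sales_performance["salesperson"].isin(["Samantha", "Lisa"])
total_revenue = sales_performance.loc[is_target, "sales_revenue"].sum()

print(total_revenue)  # 50300 on this excerpt (no Lisa row in the sample)
```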

Olympics Events List By Age

Combine .min(), .mean(), and .max() to summarize a column.

Table: olympics_athletes_events

id	name	sex	age	team	noc	games	year	season	city	sport	event
3520	Guillermo J. Amparan	M		Mexico	MEX	1924 Summer	1924	Summer	Paris	Athletics	Athletics Men's 800 metres
35394	Henry John Finchett	M		Great Britain	GBR	1924 Summer	1924	Summer	Paris	Gymnastics	Gymnastics Men's Rings
21918	Georg Frederik Ahrensborg Clausen	M	28	Denmark	DEN	1924 Summer	1924	Summer	Paris	Cycling	Cycling Men's Road Race Individual
110345	Marinus Cornelis Dick Sigmond	M	26	Netherlands	NED	1924 Summer	1924	Summer	Paris	Football	Football Men's Football
54193	Thodore Tho Jeitz	M	26	Luxembourg	LUX	1924 Summer	1924	Summer	Paris	Gymnastics	Gymnastics Men's Individual All-Around
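A minimal sketch of summarizing the age column, using only the five sample rows above (two of which have no age recorded). It treats the whole column as one group; the actual exercise may ask for something more specific.

```python
import numpy as np
import pandas as pd

# Ages from the five sample rows; the first two athletes have no age recorded
olympics_athletes_events = pd.DataFrame({
    "id": [3520, 35394, 21918, 110345, 54193],
    "age": [np.nan, np.nan, 28, 26, 26],
})

# One call, three statistics; the two missing ages are skipped by all of them
age_summary = olympics_athletes_events["age"].agg(["min", "mean", "max"])

youngest = age_summary["min"]      # 26.0
average_age = age_summary["mean"]  # (28 + 26 + 26) / 3
oldest = age_summary["max"]        # 28.0
```

Notice the mean is computed over three athletes, not five, because the missing ages are excluded.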

Hour Of Highest Gas Expense

Table: lyft_rides

index	weather	hour	travel_distance	gasoline_cost
0	cloudy	7	24.47	1.13
1	cloudy	23	23.67	1.99
2	sunny	17	20.93	0.86
3	rainy	2	29.58	0.85
4	rainy	7	16.11	0.95
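The exercise text is not included here, so the following is a sketch under one assumed reading: find the largest single gasoline_cost with .max(), then use a Module 1 style filter to read off the hour of that ride. If the exercise actually wants the hour with the highest average cost, it would need grouping, which comes later.

```python
import pandas as pd

# The five sample rows of lyft_rides shown above
lyft_rides = pd.DataFrame({
    "weather": ["cloudy", "cloudy", "sunny", "rainy", "rainy"],
    "hour": [7, 23, 17, 2, 7],
    "travel_distance": [24.47, 23.67, 20.93, 29.58, 16.11],
    "gasoline_cost": [1.13, 1.99, 0.86, 0.85, 0.95],
})

# Step 1: aggregate the column down to its single largest value
highest_cost = lyft_rides["gasoline_cost"].max()

# Step 2: filter to the ride(s) with that cost and keep the hour column
peak_hours = lyft_rides.loc[lyft_rides["gasoline_cost"] == highest_cost, "hour"]

print(highest_cost, peak_hours.tolist())  # 1.99 [23]
```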

Key Takeaways

.sum(), .mean(), .min(), .max(), .count() are your core aggregate methods.
All of them skip NaN values by default.
len(df) counts all rows; .count() counts non-null values; .nunique() counts unique non-null values.
.agg([...]) runs multiple aggregations in one call.
Ask yourself: am I counting rows or unique entities?

What’s Next

Right now, you’re getting one number for the entire DataFrame. But what if you need revenue by region? Headcount by department? That’s where .groupby() comes in.
