Module 2: Aggregation & Grouping•35 min

Conditional Aggregation

Progress Tracking

Sometimes you need to count or sum only the rows that meet a condition — “what percentage of users are active?” or “how much revenue came from each region?” In SQL, you’d use CASE WHEN inside an aggregate. In pandas, you create a boolean mask and aggregate it directly. The pattern is elegant once you see it: True counts as 1, False as 0.

When Basic Aggregates Aren’t Enough

So far, your aggregates operate on all rows. But what if you need to count only certain values? Sum only specific items? Get multiple conditional counts in one result?

In SQL, you’d use CASE WHEN inside SUM(). In pandas, you use boolean masks — which are simpler and more Pythonic.

The Pattern: Boolean Mask + .sum()

A boolean comparison returns True/False. Since Python treats True as 1 and False as 0, you can .sum() a boolean mask to count matches:

Table: techcorp_workforce

id	first_name	last_name	department	salary	phone_number	joining_date
1	Sarah	Mitchell	HR	95000	555-0101	2021-03-15
2	Michael	Chen	HR	88000	555-0102	2022-06-01
3	Emily	Rodriguez	HR	82500		2021-09-20
4	David	Park	HR	80000	555-0104	2023-01-10
5	Lisa	Thompson	HR	65000		2021-04-05

Python

# Count employees in each department
hr_count = (techcorp_workforce["department"] == "HR").sum()
admin_count = (techcorp_workforce["department"] == "Admin").sum()
total = len(techcorp_workforce)

pd.DataFrame({
    "total": [total],
    "hr_count": [hr_count],
    "admin_count": [admin_count]
})

One result with multiple conditional counts — no groupby needed.

Python

high_earners = (techcorp_workforce["salary"] > 100000).sum()
total = len(techcorp_workforce)
round(100.0 * high_earners / total, 2)

Use 100.0 for Percentages

In Python 3, integer division already returns a float (3 / 4 = 0.75). But multiplying by 100.0 makes the intent explicit and readable.

Computing a Ratio

Tables: techcorp_workforce

Conditional Sums

You can sum values conditionally too, not just count:

Tables: techcorp_workforce

Pivot-Style Cross-Tabulations

For breakdowns by two dimensions, pd.crosstab() is the pandas equivalent of a pivot table:

Python

pd.crosstab(titanic["survived"], titanic["pclass"])

One line, and you’ve got survival counts broken down by passenger class.

Inspections by Risk Category

Table: sf_restaurant_health_violations

business_id	business_name	business_address	business_city	business_state	business_postal_code	business_latitude	business_longitude	business_location	inspection_id	inspection_date	inspection_score	inspection_type	violation_id	violation_description	risk_category
5800	John Chin Elementary School	350 Broadway St	San Francisco	CA	94133	37.8	-122.4	{'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'}	5800_20171017	2017-10-17	98	Routine - Unscheduled	5800_20171017_103149	Wiping cloths not clean or properly stored or inadequate sanitizer	Low Risk
64236	Sutter Pub and Restaurant	700 Sutter St	San Francisco	CA	94102	37.79	-122.41	{'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'}	64236_20170725	2017-07-25	88	Routine - Unscheduled	64236_20170725_103133	Foods not protected from contamination	Moderate Risk
1991	SRI THAI CUISINE	4621 LINCOLN Way	San Francisco	CA	94122	37.76	-122.51	{'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'}	1991_20171129	2017-11-29	86	Routine - Unscheduled	1991_20171129_103139	Improper food storage	Low Risk
3816	Washington Bakery & Restaurant	733 Washington St	San Francisco	CA	94108	37.8	-122.41	{'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'}	3816_20160728	2016-07-28	67	Routine - Unscheduled	3816_20160728_103108	Contaminated or adulterated food	High Risk
39119	Brothers Restaurant	4128 GEARY Blvd	San Francisco	CA	94118	37.78	-122.46	{'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'}	39119_20160718	2016-07-18	79	Routine - Unscheduled	39119_20160718_103133	Foods not protected from contamination	Moderate Risk

Find the number of inspections for each risk category by inspection type

View solution

Find the number of inspections that resulted in each risk category per each inspection type. Consider the records with no risk category value belongs to a separate category. Output the result along with the corresponding inspection type and the corresponding total number of inspections per that type. The output should be pivoted, meaning that each risk category + total number should be a separate column. Order the result based on the number of inspections per inspection type in descending order.

Tables: sf_restaurant_health_violations

Share of Active Users

Table: fb_active_users

user_id	name	status	country
33	Amanda Leon	open	Australia
27	Jessica Farrell	open	Luxembourg
18	Wanda Ramirez	open	USA
50	Samuel Miller	closed	Brazil
16	Jacob York	open	Australia

Tables: fb_active_users

Processed Ticket Rate By Type

Table: facebook_complaints

complaint_id	type	processed
0	0	TRUE
1	0	TRUE
2	0	FALSE
3	1	TRUE
4	1	TRUE

Tables: facebook_complaints

Key Takeaways

Boolean masks + .sum() count matching rows: (df["col"] > x).sum().
Use [condition]["col"].sum() for conditional sums.
pd.crosstab() creates pivot-style cross-tabulations in one line.
For percentages: conditional count / total count * 100.
These techniques replace SQL’s SUM(CASE WHEN ...) pattern.

What’s Next

You’ve now got all the building blocks. The final lesson pulls everything together with complex analyses that combine filtering, grouping, and conditional aggregation.

Next upContinue →

Combining Aggregation Techniques

30 min