Conditional Aggregation
Progress Tracking
Log in to save this lesson and continue from where you left off.
Sometimes you need to count or sum only the rows that meet a condition — “what percentage of users are active?” or “how much revenue came from each region?” In SQL, you’d use CASE WHEN inside an aggregate. In pandas, you create a boolean mask and aggregate it directly. The pattern is elegant once you see it: True counts as 1, False as 0.
When Basic Aggregates Aren’t Enough
So far, your aggregates operate on all rows. But what if you need to count only certain values? Sum only specific items? Get multiple conditional counts in one result?
In SQL, you’d use CASE WHEN inside SUM(). In pandas, you use boolean masks — which are simpler and more Pythonic.
The Pattern: Boolean Mask + .sum()
A boolean comparison returns True/False. Since Python treats True as 1 and False as 0, you can .sum() a boolean mask to count matches:
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | 2021-09-20 | |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | 2021-04-05 |
# Count employees in each department
hr_count = (techcorp_workforce["department"] == "HR").sum()
admin_count = (techcorp_workforce["department"] == "Admin").sum()
total = len(techcorp_workforce)
pd.DataFrame({
"total": [total],
"hr_count": [hr_count],
"admin_count": [admin_count]
})One result with multiple conditional counts — no groupby needed.
high_earners = (techcorp_workforce["salary"] > 100000).sum()
total = len(techcorp_workforce)
round(100.0 * high_earners / total, 2)In Python 3, integer division already returns a float (3 / 4 = 0.75). But multiplying by 100.0 makes the intent explicit and readable.
Computing a Ratio
Calculate what percentage of employees are in the HR department.
Conditional Sums
You can sum values conditionally too, not just count:
Calculate total payroll, then HR payroll separately. Add Admin payroll.
Pivot-Style Cross-Tabulations
For breakdowns by two dimensions, pd.crosstab() is the pandas equivalent of a pivot table:
pd.crosstab(titanic["survived"], titanic["pclass"])One line, and you’ve got survival counts broken down by passenger class.
Inspections by Risk Category
| business_id | business_name | business_address | business_city | business_state | business_postal_code | business_latitude | business_longitude | business_location | business_phone_number | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5800 | John Chin Elementary School | 350 Broadway St | San Francisco | CA | 94133 | 37.8 | -122.4 | {'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 5800_20171017 | 2017-10-17 | 98 | Routine - Unscheduled | 5800_20171017_103149 | Wiping cloths not clean or properly stored or inadequate sanitizer | Low Risk | |
| 64236 | Sutter Pub and Restaurant | 700 Sutter St | San Francisco | CA | 94102 | 37.79 | -122.41 | {'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 64236_20170725 | 2017-07-25 | 88 | Routine - Unscheduled | 64236_20170725_103133 | Foods not protected from contamination | Moderate Risk | |
| 1991 | SRI THAI CUISINE | 4621 LINCOLN Way | San Francisco | CA | 94122 | 37.76 | -122.51 | {'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 1991_20171129 | 2017-11-29 | 86 | Routine - Unscheduled | 1991_20171129_103139 | Improper food storage | Low Risk | |
| 3816 | Washington Bakery & Restaurant | 733 Washington St | San Francisco | CA | 94108 | 37.8 | -122.41 | {'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 3816_20160728 | 2016-07-28 | 67 | Routine - Unscheduled | 3816_20160728_103108 | Contaminated or adulterated food | High Risk | |
| 39119 | Brothers Restaurant | 4128 GEARY Blvd | San Francisco | CA | 94118 | 37.78 | -122.46 | {'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 39119_20160718 | 2016-07-18 | 79 | Routine - Unscheduled | 39119_20160718_103133 | Foods not protected from contamination | Moderate Risk |
Find the number of inspections that resulted in each risk category per each inspection type. Consider the records with no risk category value belongs to a separate category. Output the result along with the corresponding inspection type and the corresponding total number of inspections per that type. The output should be pivoted, meaning that each risk category + total number should be a separate column. Order the result based on the number of inspections per inspection type in descending order.
Share of Active Users
| user_id | name | status | country |
|---|---|---|---|
| 33 | Amanda Leon | open | Australia |
| 27 | Jessica Farrell | open | Luxembourg |
| 18 | Wanda Ramirez | open | USA |
| 50 | Samuel Miller | closed | Brazil |
| 16 | Jacob York | open | Australia |
Calculate the percentage of users who are both from the US and have an 'open' status, as indicated in the `fb_active_users` table.
Processed Ticket Rate By Type
| complaint_id | type | processed |
|---|---|---|
| 0 | 0 | TRUE |
| 1 | 0 | TRUE |
| 2 | 0 | FALSE |
| 3 | 1 | TRUE |
| 4 | 1 | TRUE |
Find the processed rate of tickets for each `type`. The processed rate is defined as the number of processed tickets divided by the total number of tickets for that type. Round this result to two decimal places.
Key Takeaways
- Boolean masks +
.sum()count matching rows:(df["col"] > x).sum(). - Use
[condition]["col"].sum()for conditional sums. pd.crosstab()creates pivot-style cross-tabulations in one line.- For percentages: conditional count / total count * 100.
- These techniques replace SQL’s
SUM(CASE WHEN ...)pattern.
What’s Next
You’ve now got all the building blocks. The final lesson pulls everything together with complex analyses that combine filtering, grouping, and conditional aggregation.