Grouping Data
Progress Tracking
Log in to save this lesson and continue from where you left off.
One Number Per Group
In the last lesson, you got a single total for the whole DataFrame. Useful, but limited. Real business questions need breakdowns: revenue by month, customers by region, headcount by department.
.groupby() is how you get there. It splits your data into groups, then applies an aggregate method to each group separately.
Your First .groupby()
(
orders.groupby("cust_id")["total_order_cost"]
.sum()
.reset_index(name="total_spent")
)| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | 2021-09-20 | |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | 2021-04-05 |
Group `techcorp_workforce` by department and count the employees in each.
Instead of one count for everyone, you get one count per department. Pandas groups all HR employees, counts them, then does the same for Admin, Account, and so on.
After .groupby().count(), the group column (department) becomes the DataFrame's index instead of a regular column. The index is like a row label — useful internally, but awkward for further analysis. .reset_index() moves it back to a normal column. The name= parameter renames the aggregated column: .reset_index(name="employee_count") turns the unnamed count into a descriptive column name. You'll chain .reset_index() after almost every groupby.
Every .groupby() follows the same three-step pattern:
- Split:
.groupby("column")— which column to group by - Aggregate:
.sum(),.mean(),.count(), etc. — what calculation to apply - Reset:
.reset_index()— clean up the output
Using .agg() for Multiple Aggregations
To compute several statistics per group, use .agg():
Group by department and calculate the sum and mean salary.
Sorting Grouped Results
.groupby() results come back in no particular order. Sort them:
Count employees per department and sort by count, highest first.
MacBookPro User Event Count
Group by event name and count.
| user_id | occurred_at | event_type | event_name | location | device |
|---|---|---|---|---|---|
| 6991 | 2014-06-09 18:26:54 | engagement | home_page | United States | iphone 5 |
| 18851 | 2014-08-29 13:18:38 | signup_flow | enter_info | Russia | asus chromebook |
| 14998 | 2014-07-01 12:47:56 | engagement | login | France | hp pavilion desktop |
| 8186 | 2014-05-23 10:44:16 | engagement | home_page | Italy | macbook pro |
| 9626 | 2014-07-31 17:15:14 | engagement | login | Russia | nexus 7 |
Count the number of user events performed by MacBookPro users. Output the result along with the event name. Sort the result based on the event count in the descending order.
Artist Appearance Count
Group by artist and count appearances.
| id | position | trackname | artist | streams | url | stream_date | region |
|---|---|---|---|---|---|---|---|
| 303651 | 52 | Heart Won't Forget | Matoma | 28047 | https://open.spotify.com/track/2of2DM5LqTh7ohmmVXUKsH | 2017-02-04 | no |
| 85559 | 160 | Someone In The Crowd - From "La La Land" Soundtrack | Emma Stone | 17134 | https://open.spotify.com/track/7xE4vKvjqUTtHyJ9zi0k1q | 2017-02-26 | fr |
| 1046089 | 175 | The Greatest | Sia | 10060 | https://open.spotify.com/track/7xHWNBFm6ObGEQPaUxHuKO | 2017-03-06 | cl |
| 350824 | 25 | Unforgettable | French Montana | 46603 | https://open.spotify.com/track/3B54sVLJ402zGa6Xm4YGNe | 2017-10-01 | no |
| 776822 | 1 | Bad and Boujee (feat. Lil Uzi Vert) | Migos | 1823391 | https://open.spotify.com/track/4Km5HrUvYTaSUfiSGPJeQR | 2017-01-27 | us |
Find how many times each artist appeared on the Spotify ranking list. Output the artist name along with the corresponding number of occurrences. Order records by the number of occurrences in descending order.
Top Ranked Songs
Filter first, then group.
| id | position | trackname | artist | streams | url | stream_date | region |
|---|---|---|---|---|---|---|---|
| 303651 | 52 | Heart Won't Forget | Matoma | 28047 | https://open.spotify.com/track/2of2DM5LqTh7ohmmVXUKsH | 2017-02-04 | no |
| 85559 | 160 | Someone In The Crowd - From "La La Land" Soundtrack | Emma Stone | 17134 | https://open.spotify.com/track/7xE4vKvjqUTtHyJ9zi0k1q | 2017-02-26 | fr |
| 1046089 | 175 | The Greatest | Sia | 10060 | https://open.spotify.com/track/7xHWNBFm6ObGEQPaUxHuKO | 2017-03-06 | cl |
| 350824 | 25 | Unforgettable | French Montana | 46603 | https://open.spotify.com/track/3B54sVLJ402zGa6Xm4YGNe | 2017-10-01 | no |
| 776822 | 1 | Bad and Boujee (feat. Lil Uzi Vert) | Migos | 1823391 | https://open.spotify.com/track/4Km5HrUvYTaSUfiSGPJeQR | 2017-01-27 | us |
Find songs that have ranked in the top position. Output the track name and the number of times it ranked at the top. Sort your records by the number of times the song was in the top position in descending order.
Key Takeaways
.groupby()splits data into groups before aggregating.- Pattern:
df.groupby("col")["value"].method().reset_index(). - Use
.agg([...])for multiple aggregations per group. - Always chain
.reset_index()for clean output. - Sort with
.sort_values()after grouping.
What’s Next
.groupby() creates your groups, but what if you only want some of them? Say, only departments with more than five employees. You can’t filter before grouping for this — you need to filter after. That’s next.