Module 2: Aggregation & Grouping35 min

Grouping Data

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

One Number Per Group

In the last lesson, you got a single total for the whole DataFrame. Useful, but limited. Real business questions need breakdowns: revenue by month, customers by region, headcount by department.

.groupby() is how you get there. It splits your data into groups, then applies an aggregate method to each group separately.

Your First .groupby()

Python
(
    orders.groupby("cust_id")["total_order_cost"]
    .sum()
    .reset_index(name="total_spent")
)
Table: techcorp_workforce
idfirst_namelast_namedepartmentsalaryphone_numberjoining_date
1SarahMitchellHR95000555-01012021-03-15
2MichaelChenHR88000555-01022022-06-01
3EmilyRodriguezHR825002021-09-20
4DavidParkHR80000555-01042023-01-10
5LisaThompsonHR650002021-04-05
1
Count Employees per Department

Group `techcorp_workforce` by department and count the employees in each.

Tables: techcorp_workforce

Instead of one count for everyone, you get one count per department. Pandas groups all HR employees, counts them, then does the same for Admin, Account, and so on.

.reset_index() — Why You Need It

After .groupby().count(), the group column (department) becomes the DataFrame's index instead of a regular column. The index is like a row label — useful internally, but awkward for further analysis. .reset_index() moves it back to a normal column. The name= parameter renames the aggregated column: .reset_index(name="employee_count") turns the unnamed count into a descriptive column name. You'll chain .reset_index() after almost every groupby.

Every .groupby() follows the same three-step pattern:

  • Split: .groupby("column") — which column to group by
  • Aggregate: .sum(), .mean(), .count(), etc. — what calculation to apply
  • Reset: .reset_index() — clean up the output

Using .agg() for Multiple Aggregations

To compute several statistics per group, use .agg():

2
Department Salary Stats

Group by department and calculate the sum and mean salary.

Tables: techcorp_workforce

Sorting Grouped Results

.groupby() results come back in no particular order. Sort them:

3
Sort by Headcount

Count employees per department and sort by count, highest first.

Tables: techcorp_workforce

MacBookPro User Event Count

Group by event name and count.

Table: playbook_events
user_idoccurred_atevent_typeevent_namelocationdevice
69912014-06-09 18:26:54engagementhome_pageUnited Statesiphone 5
188512014-08-29 13:18:38signup_flowenter_infoRussiaasus chromebook
149982014-07-01 12:47:56engagementloginFrancehp pavilion desktop
81862014-05-23 10:44:16engagementhome_pageItalymacbook pro
96262014-07-31 17:15:14engagementloginRussianexus 7
4
MacBookPro User Event Count
View solution

Count the number of user events performed by MacBookPro users. Output the result along with the event name. Sort the result based on the event count in the descending order.

Tables: playbook_events

Artist Appearance Count

Group by artist and count appearances.

Table: spotify_worldwide_daily_song_ranking
idpositiontracknameartiststreamsurlstream_dateregion
30365152Heart Won't ForgetMatoma28047https://open.spotify.com/track/2of2DM5LqTh7ohmmVXUKsH2017-02-04no
85559160Someone In The Crowd - From "La La Land" SoundtrackEmma Stone17134https://open.spotify.com/track/7xE4vKvjqUTtHyJ9zi0k1q2017-02-26fr
1046089175The GreatestSia10060https://open.spotify.com/track/7xHWNBFm6ObGEQPaUxHuKO2017-03-06cl
35082425UnforgettableFrench Montana46603https://open.spotify.com/track/3B54sVLJ402zGa6Xm4YGNe2017-10-01no
7768221Bad and Boujee (feat. Lil Uzi Vert)Migos1823391https://open.spotify.com/track/4Km5HrUvYTaSUfiSGPJeQR2017-01-27us
5
Artist Appearance Count
View solution

Find how many times each artist appeared on the Spotify ranking list. Output the artist name along with the corresponding number of occurrences. Order records by the number of occurrences in descending order.

Tables: spotify_worldwide_daily_song_ranking

Top Ranked Songs

Filter first, then group.

Table: spotify_worldwide_daily_song_ranking
idpositiontracknameartiststreamsurlstream_dateregion
30365152Heart Won't ForgetMatoma28047https://open.spotify.com/track/2of2DM5LqTh7ohmmVXUKsH2017-02-04no
85559160Someone In The Crowd - From "La La Land" SoundtrackEmma Stone17134https://open.spotify.com/track/7xE4vKvjqUTtHyJ9zi0k1q2017-02-26fr
1046089175The GreatestSia10060https://open.spotify.com/track/7xHWNBFm6ObGEQPaUxHuKO2017-03-06cl
35082425UnforgettableFrench Montana46603https://open.spotify.com/track/3B54sVLJ402zGa6Xm4YGNe2017-10-01no
7768221Bad and Boujee (feat. Lil Uzi Vert)Migos1823391https://open.spotify.com/track/4Km5HrUvYTaSUfiSGPJeQR2017-01-27us
6
Top Ranked Songs
View solution

Find songs that have ranked in the top position. Output the track name and the number of times it ranked at the top. Sort your records by the number of times the song was in the top position in descending order.

Tables: spotify_worldwide_daily_song_ranking

Key Takeaways

  • .groupby() splits data into groups before aggregating.
  • Pattern: df.groupby("col")["value"].method().reset_index().
  • Use .agg([...]) for multiple aggregations per group.
  • Always chain .reset_index() for clean output.
  • Sort with .sort_values() after grouping.

What’s Next

.groupby() creates your groups, but what if you only want some of them? Say, only departments with more than five employees. You can’t filter before grouping for this — you need to filter after. That’s next.