Advanced Pandas Aggregations for Data Analysts and Scientists

Enhance your data analysis toolkit with complex pandas aggregation techniques for deeper insights
Pandas aggregations are a powerful tool for data analysis, allowing you to quickly summarize data and identify trends. If you're familiar with the basics of the groupby() aggregation method, then you're off to a good start. However, as a data analyst or scientist, you'll often face more complex and customized aggregation tasks. These could include performing multiple calculations for a single metric or using different pandas aggregations across different columns.
In this blog, we'll explore the following techniques that will make your life easier as a data analyst/scientist:
- applying multiple aggregations on a single column,
- applying different aggregations to different columns,
- using custom aggregations, and
- renaming columns after an aggregation.
Performing multiple aggregations on a single column using pandas
As data analysts, we often want a more comprehensive view of the data. Sometimes, understanding the average is just not enough, so let's check out how to perform multiple aggregations on a single column through this example from Airbnb.
Interview Question Date: February 2018
You’re given a table of rental property searches by users. The table consists of search results and outputs host information for searchers. Find the minimum, average, and maximum rental prices for each host’s popularity rating. The host’s popularity rating is defined as below:
- 0 reviews: New
- 1 to 5 reviews: Rising
- 6 to 15 reviews: Trending Up
- 16 to 40 reviews: Popular
- more than 40 reviews: Hot
Tip: The id column in the table refers to the search ID. You'll need to create your own host_id by concatenating price, room_type, host_since, zipcode, and number_of_reviews.
Output host popularity rating and their minimum, average and maximum rental prices.
We are tasked with finding the minimum, average, and maximum rental prices for each host popularity rating. However, based on the table provided below, the host popularity rating and host ID are not readily available, so let's prepare this information first by:
- Creating the host ID by combining the price, room_type, host_since, zipcode, and number_of_reviews columns, and
- Assigning a popularity rating to each host based on the number of reviews received.
id | price | property_type | room_type | amenities | accommodates | bathrooms | bed_type | cancellation_policy | cleaning_fee | city | host_identity_verified | host_response_rate | host_since | neighbourhood | number_of_reviews | review_scores_rating | zipcode | bedrooms | beds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8284881 | 621.46 | House | Entire home/apt | {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Pool,Kitchen,"Free parking on premises",Gym,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace"} | 8 | 3 | Real Bed | strict | TRUE | LA | f | 100% | 2016-11-01 | Pacific Palisades | 1 | 90272 | 4 | 6 | |
8284882 | 621.46 | House | Entire home/apt | {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Pool,Kitchen,"Free parking on premises",Gym,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace"} | 8 | 3 | Real Bed | strict | TRUE | LA | f | 100% | 2016-11-01 | Pacific Palisades | 1 | 90272 | 4 | 6 | |
9479348 | 598.9 | Apartment | Entire home/apt | {"Wireless Internet","Air conditioning",Kitchen,Heating,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,Iron,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"} | 7 | 2 | Real Bed | strict | FALSE | NYC | f | 100% | 2017-07-03 | Hell's Kitchen | 1 | 60 | 10036 | 3 | 4 |
8596057 | 420.47 | House | Private room | {"Wireless Internet","Air conditioning",Pool,Kitchen,"Free parking on premises",Breakfast,"Family/kid friendly",Washer,Dryer,Essentials,Shampoo,Hangers,"Hair dryer","Self Check-In","Doorman Entry"} | 1 | 2 | Real Bed | flexible | FALSE | LA | f | 100% | 2016-04-20 | 0 | 91748 | 1 | 1 | ||
11525500 | 478.75 | Apartment | Entire home/apt | {"Wireless Internet","Air conditioning",Heating,Washer,Dryer,Essentials,"Laptop friendly workspace",Microwave,Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"Host greets you"} | 2 | 1 | Real Bed | flexible | TRUE | NYC | f | 100% | 2017-10-07 | Williamsburg | 2 | 100 | 11206 | 1 | 1 |
To get started, let's import our necessary libraries and load in our reference table in a variable called ‘df’.
Step 1: Create the host ID
To do so, we will have to convert each of the relevant columns into a string before combining them together.
import numpy as np
import pandas as pd
df = airbnb_host_searches
df['host_id'] = df['price'].map(str) + df['room_type'].map(str) + df['host_since'].map(str) + df['zipcode'].map(str)+ df['number_of_reviews'].map(str)
Now, we’ve just assigned a host ID for every property search!
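The concatenation step can be sketched on a small made-up frame (the values below are illustrative, not the real search data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for airbnb_host_searches
df = pd.DataFrame({
    'price': [621.46, 598.9],
    'room_type': ['Entire home/apt', 'Apartment'],
    'host_since': ['2016-11-01', '2017-07-03'],
    'zipcode': ['90272', '10036'],
    'number_of_reviews': [1, 1],
})

# Convert each column to string, then concatenate into a surrogate key
df['host_id'] = (df['price'].map(str) + df['room_type'].map(str)
                 + df['host_since'].map(str) + df['zipcode'].map(str)
                 + df['number_of_reviews'].map(str))
print(df['host_id'].iloc[0])  # 621.46Entire home/apt2016-11-01902721
```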
Step 2: Assign a popularity rating to each host based on the number of reviews they've received.
Because our table lists every property search, we will have duplicate host IDs whenever a property is searched more than once, so let's remove duplicates to ensure we have only one row per host ID.
df1 = df[['host_id','number_of_reviews','price']].drop_duplicates()
Then, we'll use a lambda function which uses an if-then logic to assign a popularity rating according to the number of reviews.
df1['host_popularity'] = df1['number_of_reviews'].apply(lambda x:'New' if x<1 else 'Rising' if x<=5 else 'Trending Up' if x<=15 else 'Popular' if x<=40 else 'Hot')
host_popularity |
---|
Rising |
Rising |
New |
Rising |
New |
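Here's the same binning logic as a runnable sketch on a few invented review counts, one per popularity tier:

```python
import pandas as pd

# Invented review counts covering each popularity tier
reviews = pd.Series([0, 3, 12, 25, 57])

# If-then chain inside a lambda maps a count to its rating
host_popularity = reviews.apply(
    lambda x: 'New' if x < 1
    else 'Rising' if x <= 5
    else 'Trending Up' if x <= 15
    else 'Popular' if x <= 40
    else 'Hot')
print(host_popularity.tolist())  # ['New', 'Rising', 'Trending Up', 'Popular', 'Hot']
```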
Now that we have the host popularity rating, we can perform our pandas aggregation.
Step 3: Calculate the minimum, average, and maximum rental price for each host popularity rating
We’ll select the price column and apply the min, mean, and max functions to it by providing them as a list inside the agg() function.
Info box:
To perform multiple pandas aggregations on a column, follow this syntax:
df.groupby('col')['select_col'].agg([aggfunc1, aggfunc2, aggfunc3]).reset_index()
df1.groupby('host_popularity')['price'].agg(['min', 'mean', 'max']).reset_index()
host_popularity | min | mean | max |
---|---|---|---|
Hot | 340.12 | 464.233 | 633.51 |
New | 313.55 | 515.92 | 741.76 |
Popular | 270.81 | 472.815 | 667.83 |
Rising | 355.53 | 503.847 | 717.01 |
Trending Up | 361.09 | 476.277 | 685.65 |
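You can verify the pattern on a tiny made-up frame (the prices here are arbitrary):

```python
import pandas as pd

# Arbitrary prices for two popularity tiers
df = pd.DataFrame({
    'host_popularity': ['New', 'New', 'Hot', 'Hot', 'Hot'],
    'price': [100.0, 200.0, 300.0, 400.0, 500.0],
})

# One column, three aggregations passed as a list
out = df.groupby('host_popularity')['price'].agg(['min', 'mean', 'max']).reset_index()
print(out)
```

Note that the resulting columns are simply named min, mean, and max; the next section covers assigning friendlier names.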
Assigning column names after aggregations

Renaming columns or assigning column names upon aggregation makes our results easier to understand and work with. When we aggregate data, we often end up with new column names that can be cryptic or difficult to understand. Additionally, when working with multiple data sources, we may need to rename columns to align with a standardized naming convention.
To assign column names upon aggregation, provide the desired column name and then assign the aggregation function for that column in the .agg() method.
Info box:
To assign column names upon aggregation, use the following syntax:
df.groupby('col')['select_col'].agg(col_name='aggfunc1', col_name2='agg_func2', col_name3='agg_func3')
Following from our previous result, let’s make our column names more descriptive. We’ll replace the last line of the previous code with this line:
df1.groupby('host_popularity')['price'].agg(min_price='min', avg_price='mean', max_price='max').reset_index()
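A minimal sketch of this named aggregation, again on invented prices:

```python
import pandas as pd

# Invented prices; the point is the column names on the output
df = pd.DataFrame({'host_popularity': ['New', 'New', 'Hot'],
                   'price': [100.0, 200.0, 300.0]})

# Keyword = new column name, value = aggregation function
out = df.groupby('host_popularity')['price'].agg(
    min_price='min', avg_price='mean', max_price='max').reset_index()
print(list(out.columns))  # ['host_popularity', 'min_price', 'avg_price', 'max_price']
```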
Different pandas aggregations for different columns
Now, let’s take it a step further and learn how to do the same for multiple column aggregations, which will be necessary as you work with larger datasets.
To perform different pandas aggregations on different columns, we can pass a dictionary to the .agg() method. The keys of the dictionary are the column names you want to aggregate, and the values are the pandas aggregation functions you want to apply to those columns.
Info box:
To use different aggregations on different columns, use the following syntax:
df.groupby('col1').agg({'col2': agg_func1, 'col3': agg_func2})
Let's work through an example to make this clearer. Say we have a list of employees and their salaries, and we want to find the number of male and female employees per department and their corresponding total salaries.
Interview Question Date: May 2019
Find the number of male and female employees per department and also their corresponding total salaries. Output department names along with the corresponding number of female employees, the total salary of female employees, the number of male employees, and the total salary of male employees.
We’re working with the table employee to do this.
id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | email | city | address | manager_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Max | George | 26 | M | Sales | Sales | 1300 | 200 | 150 | Max@company.com | California | 2638 Richards Avenue | 1 |
13 | Katty | Bond | 56 | F | Manager | Management | 150000 | 0 | 300 | Katty@company.com | Arizona | 1 | |
11 | Richerd | Gear | 57 | M | Manager | Management | 250000 | 0 | 300 | Richerd@company.com | Alabama | 1 | |
10 | Jennifer | Dion | 34 | F | Sales | Sales | 1000 | 200 | 150 | Jennifer@company.com | Alabama | 13 | |
19 | George | Joe | 50 | M | Manager | Management | 100000 | 0 | 300 | George@company.com | Florida | 1003 Wyatt Street | 1 |
We first group the data by department and sex, and then apply different pandas aggregations to different columns. For the 'id' column, we want to count the number of employees, while for the 'salary' column, we want the sum of salaries.
Following the syntax:
import pandas as pd
import numpy as np
result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': 'sum'})
department | sex | id | salary |
---|---|---|---|
Audit | F | 2 | 1700 |
Audit | M | 2 | 2100 |
Management | F | 2 | 350000 |
Management | M | 2 | 350000 |
Sales | F | 9 | 11600 |
The resulting column names 'id' and 'salary' don't describe the count and sum they now contain, so let's rename the columns once again. One way to do this is to specify the new column names in the .agg() method and assign a tuple to each, where the first element is the name of the column to be aggregated and the second element is the aggregate function to use.
Info box:
To perform aggregations and rename columns directly, try the following syntax:
df.groupby('col1').agg(assigned_colname1=('col2', 'agg_func1'), assigned_colname2=('col3', 'agg_func2'))
result = employee.groupby(['department', 'sex']).agg(num_employees= ('id','count'), total_salary = ('salary', 'sum')).reset_index()
department | sex | num_employees | total_salary |
---|---|---|---|
Audit | F | 2 | 1700 |
Audit | M | 2 | 2100 |
Management | F | 2 | 350000 |
Management | M | 2 | 350000 |
Sales | F | 9 | 11600 |
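The same pattern runs end to end on a made-up three-row employee frame:

```python
import pandas as pd

# Toy stand-in for the employee table
employee = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Audit'],
    'sex': ['F', 'M', 'F'],
    'id': [1, 2, 3],
    'salary': [1000, 1200, 900],
})

# Count ids, sum salaries, and name the output columns in one step
result = employee.groupby(['department', 'sex']).agg(
    num_employees=('id', 'count'),
    total_salary=('salary', 'sum')).reset_index()
print(result)
```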
Now, we’ve seen two cases:
- One column requiring multiple aggregations
- Multiple columns requiring a different aggregation
But what if we wanted to apply multiple aggregations to multiple columns? We can easily handle this by specifying a list of pandas aggregate functions in the dictionary.
Info box:
To perform multiple aggregations to multiple columns, pass a dictionary to the .agg() method:
df.groupby('col1').agg({'col2': [agg_func1, agg_func2, agg_func3], 'col3': [agg_func1, agg_func2, agg_func3]})
Say we also want to know the minimum and maximum salary per department and gender. We could use:
result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': ['min', 'max', 'sum']}).reset_index()
department | sex | (id, count) | (salary, min) | (salary, max) | (salary, sum) |
---|---|---|---|---|---|
Audit | F | 2 | 700 | 1000 | 1700 |
Audit | M | 2 | 1000 | 1100 | 2100 |
Management | F | 2 | 150000 | 200000 | 350000 |
Management | M | 2 | 100000 | 250000 | 350000 |
Sales | F | 9 | 1000 | 2000 | 11600 |
To rename the columns, we can use the same technique of providing a tuple with the column name and the aggregate function. So let’s replace these aggregate functions.
Info box:
To assign column names for multiple aggregations, use the following syntax:
df.groupby('col1').agg({'col2': [('col_name', agg_func1), ('col_name2', agg_func2), ('col_name3', agg_func3)]})
result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': [('least', 'min'), ('highest', 'max'), ('expense_total','sum')]}).reset_index()
department | sex | (id, count) | (salary, least) | (salary, highest) | (salary, expense_total) |
---|---|---|---|---|---|
Audit | F | 2 | 700 | 1000 | 1700 |
Audit | M | 2 | 1000 | 1100 | 2100 |
Management | F | 2 | 150000 | 200000 | 350000 |
Management | M | 2 | 100000 | 250000 | 350000 |
Sales | F | 9 | 1000 | 2000 | 11600 |
This gives us four columns showing the employee count and the least, highest, and total salary, with department and sex as indices of the table.
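On a toy frame, you can see that this produces a two-level column index whose second level carries the names from the tuples:

```python
import pandas as pd

# Invented data: two Audit rows, one Sales row
employee = pd.DataFrame({
    'department': ['Audit', 'Audit', 'Sales'],
    'sex': ['F', 'F', 'M'],
    'id': [1, 2, 3],
    'salary': [700, 1000, 2000],
})

# Mixing a plain function for 'id' with (name, function) tuples for 'salary'
result = employee.groupby(['department', 'sex']).agg(
    {'id': 'count',
     'salary': [('least', 'min'), ('highest', 'max'), ('expense_total', 'sum')]}
).reset_index()
print(result.columns.tolist())
```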
Let’s go back to the question. The required table needs the following columns: department, number of female employees, total salary of female employees, number of male employees, and total salary of male employees. Starting from the earlier dictionary aggregation result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': 'sum'}), we unstack the ‘sex’ level of the index so each of its unique values becomes a separate column, then reset the index so department appears as a column of the table.
result = result.unstack(level=1).reset_index()
department | idF | idM | salaryF | salaryM |
---|---|---|---|---|
Audit | 2 | 2 | 1700 | 2100 |
Management | 2 | 2 | 350000 | 350000 |
Sales | 9 | 13 | 11600 | 17800 |
Then, let’s rename and reorder the results before checking the solution.
result.columns = ['department', 'females', 'males', 'fem_sal', 'mal_sal']
result[['department', 'females', 'fem_sal', 'males', 'mal_sal']]
department | females | fem_sal | males | mal_sal |
---|---|---|---|---|
Audit | 2 | 1700 | 2 | 2100 |
Management | 2 | 350000 | 2 | 350000 |
Sales | 9 | 11600 | 13 | 17800 |
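Here's the whole pipeline on a three-row invented frame (in this sketch Sales has no male employees, so unstack leaves NaNs in those cells):

```python
import pandas as pd

# Invented data; Sales has no male employees, so unstack produces NaNs there
employee = pd.DataFrame({
    'department': ['Audit', 'Audit', 'Sales'],
    'sex': ['F', 'M', 'F'],
    'id': [1, 2, 3],
    'salary': [700, 1000, 1000],
})

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': 'sum'})
result = result.unstack(level=1).reset_index()   # sex values become columns
result.columns = ['department', 'females', 'males', 'fem_sal', 'mal_sal']
print(result[['department', 'females', 'fem_sal', 'males', 'mal_sal']])
```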
Custom pandas aggregations
Sometimes, the built-in aggregation functions in pandas are not sufficient for the task, and this is where lambda functions let us apply more complex operations to our data.
In this question, we have a dataset containing the results of restaurant health inspections in Los Angeles, and our task is to identify the quartile scores for each company, which are provided under the 'owner_name' column.
Interview Question Date: April 2018
Find the scores of 4 quartiles of each company
Output the company name along with the corresponding score of each quartile.
Order records based on the average score of all quartiles in ascending order.
serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | pe_description | program_element_pe | program_name | program_status | record_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DAQHRSETQ | 2017-06-08 | MARGARITAS CAFE | 93 | A | 1 | ROUTINE INSPECTION | EE0000006 | 5026 S CRENSHAW BLVD | LOS ANGELES | FA0023656 | CA | 90043 | OW0004133 | BAZAN, ASCENCION | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | MARGARITAS CAFE | ACTIVE | PR0011718 |
DA2GQRJOS | 2017-03-07 | LAS MOLENDERAS | 97 | A | 1 | ROUTINE INSPECTION | EE0000997 | 2635 WHITTIER BLVD | LOS ANGELES | FA0160416 | CA | 90023 | OW0125379 | MARISOL FEREGRINO | RESTAURANT (0-30) SEATS HIGH RISK | 1632 | LAS MOLENDERAS | INACTIVE | PR0148504 |
DAMQTA46T | 2016-03-22 | SANDRA'S TAMALES | 93 | A | 1 | ROUTINE INSPECTION | EE0001049 | 5390 WHITTIER BLVD | LOS ANGELES | FA0171769 | CA | 90022-4032 | OW0178828 | SANDRA'S TAMALES INC. | RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | SANDRA'S TAMALES | ACTIVE | PR0164225 |
DAXMBTIRZ | 2018-02-12 | CAFE GRATITUDE | 97 | A | 1 | ROUTINE INSPECTION | EE0000828 | 639 N LARCHMONT BLVD STE #102 | LOS ANGELES | FA0058921 | CA | 90004 | OW0005704 | CAFE GRATITUDE LARCHMONT LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | CAFE GRATITUDE | ACTIVE | PR0019854 |
DAK8TBMS0 | 2015-09-10 | THE WAFFLE | 90 | A | 1 | ROUTINE INSPECTION | EE0000709 | 6255 W SUNSET BLVD STE #105 | LOS ANGELES | FA0051830 | CA | 90028 | OW0035796 | THE WAFFLE, LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | THE WAFFLE | ACTIVE | PR0010922 |
To solve this problem, we will 1) group the dataset by owner and 2) calculate the 25th, 50th, 75th, and 100th percentile scores for each owner. As the quantile method is not readily available as a built-in function, we'll need to create a lambda function. If you're not yet familiar with lambda functions, check out this lambda functions tutorial.
But before I show you how to do it with a lambda function, let me first demonstrate what we are trying to do by calculating the quantile scores for one owner. Let’s take 'AXY ENTERPRISES' as an example.
Here are the scores for this owner:
import pandas as pd
los_angeles_restaurant_health_inspections[los_angeles_restaurant_health_inspections.owner_name == 'AXY ENTERPRISES']['score'].sort_values()
score |
---|
80 |
93 |
93 |
96 |
96 |
We can calculate the 25th percentile for this owner by using the .quantile() method like this:
los_angeles_restaurant_health_inspections[los_angeles_restaurant_health_inspections.owner_name == 'AXY ENTERPRISES']['score'].quantile(0.25)
Ultimately, we want a lambda function which will take each owner's scores and return the nth quantile.
lambda x: x.quantile(0.25)
Applying one such lambda for each quartile (as in the final code below) produces a table like this:
owner_name | q1 | q2 | q3 | q4 |
---|---|---|---|---|
110 SUNSHINE SMOOTHIES, INC | 91 | 91 | 92 | 96 |
110 SUNSHINE SMOOTHIES, INC. | 92 | 95 | 95.5 | 96 |
7111 MELROSE LLC | 94 | 94 | 94 | 94 |
9 STONES, LLC | 97 | 97 | 97 | 97 |
99 CENT ONLY STORES, LLC | 99 | 99 | 99 | 99 |
Now that we have all the quantiles, we can calculate the average of these and sort the results by their average quantile.
df['mean'] = df[['q1', 'q2', 'q3', 'q4']].mean(axis=1)
result = df.sort_values('mean')[['owner_name', 'q1', 'q2', 'q3', 'q4']]
Final code:
import pandas as pd
df = los_angeles_restaurant_health_inspections.groupby(['owner_name'])['score'].agg(
q1 = lambda x: x.quantile(0.25),
q2 = lambda x: x.quantile(0.50),
q3 = lambda x: x.quantile(0.75),
q4 = lambda x: x.quantile(1.0)).reset_index()
df['mean'] = df[['q1', 'q2', 'q3', 'q4']].mean(axis=1)
result = df.sort_values('mean')[['owner_name', 'q1', 'q2', 'q3', 'q4']]
owner_name | q1 | q2 | q3 | q4 |
---|---|---|---|---|
THE TALPA INC | 70 | 70 | 70 | 70 |
JELLY FISH INC | 72 | 72 | 72 | 72 |
ASHLIAN INC | 72 | 72 | 72 | 72 |
JI, KIL HO | 74 | 74 | 74 | 74 |
NAM, KWANG B | 75 | 75 | 75 | 75 |
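A self-contained sketch of the same approach on two invented owners (the owner names and scores below are made up):

```python
import pandas as pd

# Invented inspection scores for two hypothetical owners
insp = pd.DataFrame({
    'owner_name': ['ACME FOODS'] * 4 + ['BISTRO LLC'] * 4,
    'score': [80, 90, 90, 100, 70, 70, 70, 70],
})

# Named aggregation with one lambda per quartile
df = insp.groupby('owner_name')['score'].agg(
    q1=lambda x: x.quantile(0.25),
    q2=lambda x: x.quantile(0.50),
    q3=lambda x: x.quantile(0.75),
    q4=lambda x: x.quantile(1.0)).reset_index()
df['mean'] = df[['q1', 'q2', 'q3', 'q4']].mean(axis=1)
result = df.sort_values('mean')[['owner_name', 'q1', 'q2', 'q3', 'q4']]
print(result)
```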
Check out our post “Pandas Interview Questions” to find more such questions for your practice.
Summary
We’ve covered several helpful techniques:
Info box:
- Performing pandas aggregations
  - Multiple aggregations for a single column: df.groupby('col')['select_col'].agg([agg_func1, agg_func2, agg_func3])
  - Different aggregations for different columns: df.groupby('col1').agg({'col2': agg_func1, 'col3': agg_func2})
  - Multiple aggregations for multiple columns: df.groupby('col1').agg({'col2': [agg_func1, agg_func2], 'col3': [agg_func1, agg_func2]})
- Renaming columns upon aggregation
  - Multiple aggregations for a single column: df.groupby('col')['select_col'].agg(col_name='agg_func1', col_name2='agg_func2')
  - Multiple aggregations for multiple columns: df.groupby('col1').agg({'col2': [('col_name', agg_func1), ('col_name2', agg_func2)]})
Lastly, don't forget that you can use your defined functions or lambda expressions in place of the aggregate functions if needed! Explore more pandas functions in our “Pandas Cheat Sheet”.
