Advanced Pandas Aggregations for Data Analysts and Scientists

Advanced Pandas Aggregations for Data Analysts and Scientists


Enhance your data analysis toolkit with complex pandas aggregation techniques for deeper insights

Pandas aggregations are a powerful tool for data analysis, allowing you to quickly summarize data and identify trends. If you're familiar with the basics of the group by aggregation method, then you're off to a good start. However, as a data analyst or scientist, you'll often be faced with more complex and customized aggregation tasks. These could include performing multiple calculations for a single metric or using different pandas aggregations across different columns.

In this blog, we'll explore the following techniques that will make your life easier as a data analyst/scientist:

  • applying multiple aggregations on a single column,
  • applying different aggregations to different columns,
  • using custom aggregations, and
  • renaming columns after an aggregation.

Performing multiple aggregations on a single column using pandas

As data analysts, we find ourselves wanting to gain a more comprehensive view of the data. Sometimes, understanding the average is just not enough so let’s check out how to perform multiple aggregations on a single column through this example from Airbnb.


DataFrame: airbnb_host_searches
Expected Output Type: pandas.DataFrame

We are tasked to find the minimum, average, and maximum rental prices for each host popularity rating. However, based on the table provided below, the host popularity rating and host ID are not readily available so let’s prepare this information first by:

  1. Creating the host ID by combining the price, room_type, host_since, zipcode, and number_of_reviews columns, and
  2. Assigning popularity rating to each host based on number of reviews received.

Table: airbnb_host_searches
idpriceproperty_typeroom_typeamenitiesaccommodatesbathroomsbed_typecancellation_policycleaning_feecityhost_identity_verifiedhost_response_ratehost_sinceneighbourhoodnumber_of_reviewsreview_scores_ratingzipcodebedroomsbeds
8284881621.46HouseEntire home/apt{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Pool,Kitchen,"Free parking on premises",Gym,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace"}83Real BedstrictTRUELAf100%2016-11-01Pacific Palisades19027246
8284882621.46HouseEntire home/apt{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Pool,Kitchen,"Free parking on premises",Gym,"Hot tub","Indoor fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace"}83Real BedstrictTRUELAf100%2016-11-01Pacific Palisades19027246
9479348598.9ApartmentEntire home/apt{"Wireless Internet","Air conditioning",Kitchen,Heating,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,Iron,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}72Real BedstrictFALSENYCf100%2017-07-03Hell's Kitchen1601003634
8596057420.47HousePrivate room{"Wireless Internet","Air conditioning",Pool,Kitchen,"Free parking on premises",Breakfast,"Family/kid friendly",Washer,Dryer,Essentials,Shampoo,Hangers,"Hair dryer","Self Check-In","Doorman Entry"}12Real BedflexibleFALSELAf100%2016-04-2009174811
11525500478.75ApartmentEntire home/apt{"Wireless Internet","Air conditioning",Heating,Washer,Dryer,Essentials,"Laptop friendly workspace",Microwave,Refrigerator,Dishwasher,"Dishes and silverware","Cooking basics",Oven,Stove,"Host greets you"}21Real BedflexibleTRUENYCf100%2017-10-07Williamsburg21001120611

To get started, let's import our necessary libraries and load in our reference table in a variable called ‘df’.

Step 1: Create the host ID

To do so, we will have to convert each of the relevant columns into a string before combining them together.

import numpy as np
import pandas as pd
df=airbnb_host_searches
df['host_id'] = df['price'].map(str) + df['room_type'].map(str) + df['host_since'].map(str) + df['zipcode'].map(str)+ df['number_of_reviews'].map(str)

Now, we’ve just assigned a host ID for every property search!

Step 2: Assign a popularity rating to each host based on the number of reviews they've received.

Because our table lists every property search, we will have duplicate host IDs whenever a property is searched more than once so let’s remove duplicates to ensure we have only 1 row per host ID.

df1 = df[['host_id','number_of_reviews','price']].drop_duplicates()

Then, we'll use a lambda function which uses an if-then logic to assign a popularity rating according to the number of reviews.

df1['host_popularity'] = df1['number_of_reviews'].apply(lambda x:'New' if x<1 else 'Rising' if x<=5 else 'Trending Up' if x<=15 else 'Popular' if x<=40 else 'Hot')

All required columns and the first 5 rows of the solution are shown

number_of_reviews
Rising
Rising
New
Rising
New

Now that we have the host popularity rating, we can perform our pandas aggregation.

Step 3: Calculate the minimum, average, and maximum rental price for each host popularity rating

We’ll select the price column and perform the following functions: min, mean and max to this column by providing it as a list inside the agg() function.

Info box:
To perform multiple pandas aggregations on a column, follow this syntax:
df.groupby(‘col’)[‘select_col’].agg([aggfunc1, aggfunc2, aggfunc3]).reset_index()

df1.groupby('host_popularity')['price'].agg(['min', 'mean', 'max']).reset_index()

All required columns and the first 5 rows of the solution are shown

host_popularitymin_priceavg_pricemax_price
Hot340.12464.233633.51
New313.55515.92741.76
Popular270.81472.815667.83
Rising355.53503.847717.01
Trending Up361.09476.277685.65

Assigning column names after aggregations

Assigning column names after pandas aggregations

Renaming columns or assigning column names upon aggregation makes our results easier to understand and work with. When we aggregate data, we often end up with new column names that can be cryptic or difficult to understand. Additionally, when working with multiple data sources, we may need to rename columns to align with a standardized naming convention.

To assign column names upon aggregation, provide the desired column name and then assign the aggregation function for that column in the .agg() method.

Info box:
To assign column names upon aggregation, use the following syntax:
df.groupby(‘col’)[‘select_col’].agg(col_name = ‘aggfunc1’, col_name2 = ‘agg_func2’, col_name3 = ‘agg_func3’)

Following from our previous result, let’s make our column names more descriptive. We’ll replace the last line of the previous code with this line:

df1.groupby('host_popularity')['price'].agg(min_price='min', avg_price='mean', max_price='max').reset_index()

Different pandas aggregations for different columns

Now, let’s take it a step further and learn how to do the same for multiple column aggregations, which will be necessary as you work with larger datasets.

To perform multiple pandas aggregations on a function, we can pass a dictionary to the .agg method. In this dictionary, the keys of the dictionary represent the column names that you want to aggregate, and the value is the pandas aggregation function you want to apply to that column.

Info box:
To use different aggregations on different columns, use the following syntax:
df.groupby(‘col1’).agg({‘col2’: agg_func1, ‘col3’: aggfunc2})

Let's work through an example to make this clearer. Say we have a list of employees and their salaries, and we want to find the number of male and female employees per department and their corresponding total salaries.


DataFrame: employee
Expected Output Type: pandas.DataFrame

We’re working with the table employee to do this.

Table: employee
idfirst_namelast_nameagesexemployee_titledepartmentsalarytargetbonusemailcityaddressmanager_id
5MaxGeorge26MSalesSales1300200150Max@company.comCalifornia2638 Richards Avenue1
13KattyBond56FManagerManagement1500000300Katty@company.comArizona1
11RicherdGear57MManagerManagement2500000300Richerd@company.comAlabama1
10JenniferDion34FSalesSales1000200150Jennifer@company.comAlabama13
19GeorgeJoe50MManagerManagement1000000300George@company.comFlorida1003 Wyatt Street1

We first group the data by department and sex, and then apply different pandas aggregations to different columns. For the 'id' column, we want to count the number of employees, while for the 'salary' column, we want the sum of salaries.

Following the syntax:

import pandas as pd
import numpy as np

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': 'sum'})

All required columns and the first 5 rows of the solution are shown

idsalary
21700
22100
2350000
2350000
911600

The column names ‘count’ and ‘sum’ are quite vague so let’s try to rename the columns once again. One way to do this is to specify the column names in the .agg() method and assign a tuple to it, where the first element is the name of the column to be aggregated and the second element is the aggregate function to use.

Info box:
To perform aggregations and rename columns directly, try the following syntax:
df.groupby(‘col1’).agg(assigned_colname2 = (‘col2’, ‘agg_func2’), assigned_colname2 = (‘col3’, ‘agg_func3’))

result = employee.groupby(['department', 'sex']).agg(num_employees= ('id','count'), total_salary = ('salary', 'sum')).reset_index()

All required columns and the first 5 rows of the solution are shown

departmentsexnum_employeestotal_salary
AuditF21700
AuditM22100
ManagementF2350000
ManagementM2350000
SalesF911600

Now, we’ve seen two cases:

  1. One column requiring multiple aggregations
  2. Multiple columns requiring a different aggregation

But what if we wanted to apply multiple aggregations to multiple columns? We can easily handle this by specifying a list of pandas aggregate functions in the dictionary.

Info box:
To perform multiple aggregations to multiple columns, pass a dictionary to the .agg() method:
df.groupby(‘col1’).agg({‘col2’: [agg_func1, aggfunc2, aggfunc3], ‘col3’: [agg_func1, aggfunc2, aggfunc3]})

So let’s say you also wanted to understand what is the minimum and maximum salary per department and gender, we could use:

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': ['min', 'max', 'sum']}).reset_index()

All required columns and the first 5 rows of the solution are shown

departmentsexnum_employeestotal_salary
AuditF21700
AuditM22100
ManagementF2350000
ManagementM2350000
SalesF911600

To rename the columns, we can use the same technique of providing a tuple with the column name and the aggregate function. So let’s replace these aggregate functions.

Info box:
To assign column names for multiple aggregations, use the following syntax:
df.groupby(‘col1’).agg({‘col2’: [(‘col_name’, agg_func1),
(‘col_name2’, aggfunc2),
(‘col_name3’, aggfunc3)]})

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': [('least', 'min'), ('highest', 'max'), ('expense_total','sum')]}).reset_index()

All required columns and the first 5 rows of the solution are shown

departmentsexidcountsalaryleastsalaryhighestsalaryexpense_total
AuditF270010001700
AuditM2100011002100
ManagementF2150000200000350000
ManagementM2100000250000350000
SalesF91000200011600

This gives us two columns which show the least salary and the total salary, with department and sex as indices of the table.

Let’s go back to the question. The required table needs the following columns: department, number of female employees, the total salary of female employees, the number of male employees, and the total salary of male employees. Therefore, we will have to unstack the ‘sex’ column in the index to have each of its unique values as a separate column. Then, reset the index so we can see the department as a column of the table.

result = result.unstack(level=1).reset_index()

All required columns and the first 5 rows of the solution are shown

departmentidFidMsalaryFsalaryM
Audit2217002100
Management22350000350000
Sales9131160017800

Then, let’s rename and reorder the results before checking the solution.

result.columns = ['department', 'females', 'males', 'fem_sal', 'mal_sal']
result[['department', 'females', 'fem_sal', 'males', 'mal_sal']]

All required columns and the first 5 rows of the solution are shown

departmentfemalesfem_salmalesmal_sal
Audit2170022100
Management23500002350000
Sales9116001317800

All required columns and the first 5 rows of the solution are shown

departmentidFidMsalaryFsalaryM
Audit2217002100
Management22350000350000
Sales9131160017800

Custom pandas aggregations

Sometimes, the built-in aggregation functions in Pandas are not sufficient for the task and this is where we can use the lambda function to apply complex operations to our data.

In this question, we have a dataset containing the results of restaurant health inspections in Los Angeles, and our task is to identify the quartile scores for each company, which are provided under the 'owner_name' column.


DataFrame: los_angeles_restaurant_health_inspections
Expected Output Type: pandas.DataFrame

Table: los_angeles_restaurant_health_inspections
serial_numberactivity_datefacility_namescoregradeservice_codeservice_descriptionemployee_idfacility_addressfacility_cityfacility_idfacility_statefacility_zipowner_idowner_namepe_descriptionprogram_element_peprogram_nameprogram_statusrecord_id
DAQHRSETQ2017-06-08MARGARITAS CAFE93A1ROUTINE INSPECTIONEE00000065026 S CRENSHAW BLVDLOS ANGELESFA0023656CA90043OW0004133BAZAN, ASCENCIONRESTAURANT (61-150) SEATS HIGH RISK1638MARGARITAS CAFEACTIVEPR0011718
DA2GQRJOS2017-03-07LAS MOLENDERAS97A1ROUTINE INSPECTIONEE00009972635 WHITTIER BLVDLOS ANGELESFA0160416CA90023OW0125379MARISOL FEREGRINORESTAURANT (0-30) SEATS HIGH RISK1632LAS MOLENDERASINACTIVEPR0148504
DAMQTA46T2016-03-22SANDRA'S TAMALES93A1ROUTINE INSPECTIONEE00010495390 WHITTIER BLVDLOS ANGELESFA0171769CA90022-4032OW0178828SANDRA'S TAMALES INC.RESTAURANT (0-30) SEATS MODERATE RISK1631SANDRA'S TAMALESACTIVEPR0164225
DAXMBTIRZ2018-02-12CAFE GRATITUDE97A1ROUTINE INSPECTIONEE0000828639 N LARCHMONT BLVD STE #102LOS ANGELESFA0058921CA90004OW0005704CAFE GRATITUDE LARCHMONT LLCRESTAURANT (61-150) SEATS HIGH RISK1638CAFE GRATITUDEACTIVEPR0019854
DAK8TBMS02015-09-10THE WAFFLE90A1ROUTINE INSPECTIONEE00007096255 W SUNSET BLVD STE #105LOS ANGELESFA0051830CA90028OW0035796THE WAFFLE, LLCRESTAURANT (61-150) SEATS HIGH RISK1638THE WAFFLEACTIVEPR0010922

To solve this problem, we will 1) group the dataset by owner and 2) calculate the 25th, 50th, 75th, and 100th percentile scores for each owner. As the quantile method is not readily available as a built-in function, we'll need to create a lambda function. If you're not yet familiar with lambda functions, check out this lambda functions tutorial.

But before I show you how to do it with a lambda function, let me first demonstrate what we are trying to do by calculating the quantile scores for one owner. Let’s take 'A XY Enterprises' as an example.

Here are the scores for this owner:

import pandas as pd 
los_angeles_restaurant_health_inspections[los_angeles_restaurant_health_inspections.owner_name == 'AXY ENTERPRISES']['score'].sort_values()

All required columns and the first 5 rows of the solution are shown

score
80
93
93
96
96

We can calculate the 25th percentile for this owner by using the .quantile() like this:

los_angeles_restaurant_health_inspections[los_angeles_restaurant_health_inspections.owner_name == 'AXY ENTERPRISES']['score'].quantile(0.25)

Ultimately, we want a lambda function which will take each owner's scores and return the nth quantile.

lambda x: x.quantile(0.25)

All required columns and the first 5 rows of the solution are shown

owner_nameq1q2q3q4
110 SUNSHINE SMOOTHIES, INC91919296
110 SUNSHINE SMOOTHIES, INC.929595.596
7111 MELROSE LLC94949494
9 STONES, LLC97979797
99 CENT ONLY STORES, LLC99999999

Now that we have all the quantiles, we can calculate the average of these and sort the results by their average quantile.

df['mean'] = df.mean(axis = 1) 
result = df.sort_values('mean')[['owner_name','q1','q2','q3','q4']]

Final code:

import pandas as pd
df = los_angeles_restaurant_health_inspections.groupby(['owner_name'])['score'].agg(
q1 = lambda x: x.quantile(0.25), 
q2 = lambda x: x.quantile(0.50), 
q3 = lambda x: x.quantile(0.75), 
q4 = lambda x: x.quantile(1.0)).reset_index()
df['mean'] = df.mean(axis = 1) 
result = df.sort_values('mean')[['owner_name','q1','q2','q3','q4']]

All required columns and the first 5 rows of the solution are shown

owner_nameq1q2q3q4
THE TALPA INC70707070
JELLY FISH INC72727272
ASHLIAN INC72727272
JI, KIL HO74747474
NAM, KWANG B75757575

Check out our post “Pandas Interview Questions” to find more such questions for your practice.

Summary

We’ve covered several helpful techniques which are:

Info box:

  1. Performing pandas aggregations
    1. Multiple aggregations for a single column:
      df.groupby('col')['select_col'].agg([aggfunc1, aggfunc2, aggfunc3])
    2. Different aggregations for different columns:
      df.groupby('col1').agg({'col2': agg_func1, 'col3': aggfunc2})
    3. Multiple aggregations for multiple columns:
      df.groupby('col1').agg({'col2': [agg_func1, aggfunc2, aggfunc3], 'col3': [agg_func1, aggfunc2, aggfunc3]})
  2. Renaming columns upon aggregation
    1. Multiple aggregations for a single column
      df.groupby('col')['select_col'].agg(col_name='aggfunc1', col_name2='agg_func2', col_name3='agg_func3')
    2. Multiple aggregations for multiple columns
      df.groupby('col1').agg({'col2': [('col_name', agg_func1), ('col_name2', aggfunc2), ('col_name3', aggfunc3)]})

Lastly, don't forget that you can use your defined functions or lambda expressions in place of the aggregate functions if needed! Explore more pandas functions in our “Pandas Cheat Sheet”.

Advanced Pandas Aggregations for Data Analysts and Scientists


Become a data expert. Subscribe to our newsletter.