Advanced Pandas Aggregations for Data Analysts and Scientists

Published: August 15, 2023

Categories:

Written by:
Nathan Rosidi

Enhance your data analysis toolkit with complex pandas aggregation techniques for deeper insights

Pandas aggregations are a powerful tool for data analysis, allowing you to quickly summarize data and identify trends. If you're familiar with the basics of the group by aggregation method, then you're off to a good start. However, as a data analyst or scientist, you'll often be faced with more complex and customized aggregation tasks. These could include performing multiple calculations for a single metric or using different pandas aggregations across different columns.

In this blog, we'll explore the following techniques that will make your life easier as a data analyst/scientist:

applying multiple aggregations on a single column,
applying different aggregations to different columns,
using custom aggregations, and
renaming columns after an aggregation.

Performing multiple aggregations on a single column using pandas

As data analysts, we find ourselves wanting to gain a more comprehensive view of the data. Sometimes, understanding the average is just not enough so let’s check out how to perform multiple aggregations on a single column through this example from Airbnb.

Host Popularity Rental Prices

Last Updated: February 2018

Airbnb

MediumID 9632

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Software Engineer

You are given a table named airbnb_host_searches that contains data for rental property searches made by users. Determine the minimum, average, and maximum rental prices for each popularity-rating bucket. A popularity-rating bucket should be assigned to every record based on its number_of_reviews (see rules below).

The host’s popularity rating is defined as below: • 0 reviews: "New" • 1 to 5 reviews: "Rising" • 6 to 15 reviews: "Trending Up" • 16 to 40 reviews: "Popular" • More than 40 reviews: "Hot"

Tip: The id column in the table refers to the search ID.

Output host popularity rating and their minimum, average and maximum rental prices. Order the solution by the minimum price.

DataFrame: airbnb_host_searches

Expected Output Type: pandas.DataFrame

We are tasked to find the minimum, average, and maximum rental prices for each host popularity rating. However, based on the table provided below, the host popularity rating and host ID are not readily available so let’s prepare this information first by:

Creating the host ID by combining the price, room_type, host_since, zipcode, and number_of_reviews columns, and
Assigning popularity rating to each host based on number of reviews received.

Table: airbnb_host_searches

To get started, let's import our necessary libraries and load in our reference table in a variable called ‘df’.

Step 1: Create the host ID

To do so, we will have to convert each of the relevant columns into a string before combining them together.

import numpy as np
import pandas as pd
df=airbnb_host_searches
df['host_id'] = df['price'].map(str) + df['room_type'].map(str) + df['host_since'].map(str) + df['zipcode'].map(str)+ df['number_of_reviews'].map(str)

Now, we’ve just assigned a host ID for every property search!

Step 2: Assign a popularity rating to each host based on the number of reviews they've received.

Because our table lists every property search, we will have duplicate host IDs whenever a property is searched more than once so let’s remove duplicates to ensure we have only 1 row per host ID.

df1 = df[['host_id','number_of_reviews','price']].drop_duplicates()

Then, we'll use a lambda function which uses an if-then logic to assign a popularity rating according to the number of reviews.

df1['host_popularity'] = df1['number_of_reviews'].apply(lambda x:'New' if x<1 else 'Rising' if x<=5 else 'Trending Up' if x<=15 else 'Popular' if x<=40 else 'Hot')

Now that we have the host popularity rating, we can perform our pandas aggregation.

Step 3: Calculate the minimum, average, and maximum rental price for each host popularity rating

We’ll select the price column and perform the following functions: min, mean and max to this column by providing it as a list inside the agg() function.

Info box:
To perform multiple pandas aggregations on a column, follow this syntax:
df.groupby(‘col’)[‘select_col’].agg([aggfunc1, aggfunc2, aggfunc3]).reset_index()

df1.groupby('host_popularity')['price'].agg(['min', 'mean', 'max']).reset_index()

An error loading the results has occurred

Assigning column names after aggregations

Assigning column names after pandas aggregations

Renaming columns or assigning column names upon aggregation makes our results easier to understand and work with. When we aggregate data, we often end up with new column names that can be cryptic or difficult to understand. Additionally, when working with multiple data sources, we may need to rename columns to align with a standardized naming convention.

To assign column names upon aggregation, provide the desired column name and then assign the aggregation function for that column in the .agg() method.

Info box:
To assign column names upon aggregation, use the following syntax:
df.groupby(‘col’)[‘select_col’].agg(col_name = ‘aggfunc1’, col_name2 = ‘agg_func2’, col_name3 = ‘agg_func3’)

Following from our previous result, let’s make our column names more descriptive. We’ll replace the last line of the previous code with this line:

df1.groupby('host_popularity')['price'].agg(min_price='min', avg_price='mean', max_price='max').reset_index()

Different pandas aggregations for different columns

Now, let’s take it a step further and learn how to do the same for multiple column aggregations, which will be necessary as you work with larger datasets.

To perform multiple pandas aggregations on a function, we can pass a dictionary to the .agg method. In this dictionary, the keys of the dictionary represent the column names that you want to aggregate, and the value is the pandas aggregation function you want to apply to that column.

Info box:
To use different aggregations on different columns, use the following syntax:
df.groupby(‘col1’).agg({‘col2’: agg_func1, ‘col3’: aggfunc2})

Let's work through an example to make this clearer. Say we have a list of employees and their salaries, and we want to find the number of male and female employees per department and their corresponding total salaries.

Department Salaries

Last Updated: May 2019

Glassdoor

Apple

MediumID 9921

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the number of male and female employees per department and also their corresponding total salaries. Output department names along with the corresponding number of female employees, the total salary of female employees, the number of male employees, and the total salary of male employees.

DataFrame: employee

Expected Output Type: pandas.DataFrame

We’re working with the table employee to do this.

Table: employee

We first group the data by department and sex, and then apply different pandas aggregations to different columns. For the 'id' column, we want to count the number of employees, while for the 'salary' column, we want the sum of salaries.

Following the syntax:

import pandas as pd
import numpy as np

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': 'sum'})

The column names ‘count’ and ‘sum’ are quite vague so let’s try to rename the columns once again. One way to do this is to specify the column names in the .agg() method and assign a tuple to it, where the first element is the name of the column to be aggregated and the second element is the aggregate function to use.

Info box:
To perform aggregations and rename columns directly, try the following syntax:
df.groupby(‘col1’).agg(assigned_colname2 = (‘col2’, ‘agg_func2’), assigned_colname2 = (‘col3’, ‘agg_func3’))

result = employee.groupby(['department', 'sex']).agg(num_employees= ('id','count'), total_salary = ('salary', 'sum')).reset_index()

Now, we’ve seen two cases:

One column requiring multiple aggregations
Multiple columns requiring a different aggregation

But what if we wanted to apply multiple aggregations to multiple columns? We can easily handle this by specifying a list of pandas aggregate functions in the dictionary.

Info box:
To perform multiple aggregations to multiple columns, pass a dictionary to the .agg() method:
df.groupby(‘col1’).agg({‘col2’: [agg_func1, aggfunc2, aggfunc3], ‘col3’: [agg_func1, aggfunc2, aggfunc3]})

So let’s say you also wanted to understand what is the minimum and maximum salary per department and gender, we could use:

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': ['min', 'max', 'sum']}).reset_index()

To rename the columns, we can use the same technique of providing a tuple with the column name and the aggregate function. So let’s replace these aggregate functions.

Info box:
To assign column names for multiple aggregations, use the following syntax:
df.groupby(‘col1’).agg({‘col2’: [(‘col_name’, agg_func1),
(‘col_name2’, aggfunc2),
(‘col_name3’, aggfunc3)]})

result = employee.groupby(['department', 'sex']).agg({'id': 'count', 'salary': [('least', 'min'), ('highest', 'max'), ('expense_total','sum')]}).reset_index()

This gives us two columns which show the least salary and the total salary, with department and sex as indices of the table.

Let’s go back to the question. The required table needs the following columns: department, number of female employees, the total salary of female employees, the number of male employees, and the total salary of male employees. Therefore, we will have to unstack the ‘sex’ column in the index to have each of its unique values as a separate column. Then, reset the index so we can see the department as a column of the table.

result = result.unstack(level=1).reset_index()

Then, let’s rename and reorder the results before checking the solution.

result.columns = ['department', 'females', 'males', 'fem_sal', 'mal_sal']
result[['department', 'females', 'fem_sal', 'males', 'mal_sal']]

An error loading the results has occurred

Custom pandas aggregations

Sometimes, the built-in aggregation functions in Pandas are not sufficient for the task and this is where we can use the lambda function to apply complex operations to our data.

In this question, we have a dataset containing the results of restaurant health inspections in Los Angeles, and our task is to identify the quartile scores for each company, which are provided under the 'owner_name' column.

Find the scores of 4 quartiles of each company

Last Updated: April 2018

City of Los Angeles

HardID 9713

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the scores of 4 quartiles of each company
Output the company's owner name along with the corresponding score of each quartile. Order records based on the average score of all quartiles in ascending order.

DataFrame: los_angeles_restaurant_health_inspections

Expected Output Type: pandas.DataFrame

Table: los_angeles_restaurant_health_inspections

To solve this problem, we will 1) group the dataset by owner and 2) calculate the 25th, 50th, 75th, and 100th percentile scores for each owner. As the quantile method is not readily available as a built-in function, we'll need to create a lambda function. If you're not yet familiar with lambda functions, check out this lambda functions tutorial.

But before I show you how to do it with a lambda function, let me first demonstrate what we are trying to do by calculating the quantile scores for one owner. Let’s take 'A XY Enterprises' as an example.

Here are the scores for this owner:

import pandas as pd 
los_angeles_restaurant_health_inspections[los_angeles_restaurant_health_inspections.owner_name == 'AXY ENTERPRISES']['score'].sort_values()

We can calculate the 25th percentile for this owner by using the .quantile() like this:

los_angeles_restaurant_health_inspections[los_angeles_restaurant_health_inspections.owner_name == 'AXY ENTERPRISES']['score'].quantile(0.25)

Ultimately, we want a lambda function which will take each owner's scores and return the nth quantile.

lambda x: x.quantile(0.25)

Now that we have all the quantiles, we can calculate the average of these and sort the results by their average quantile.

df['mean'] = df.mean(axis = 1) 
result = df.sort_values('mean')[['owner_name','q1','q2','q3','q4']]

Final code:

import pandas as pd
df = los_angeles_restaurant_health_inspections.groupby(['owner_name'])['score'].agg(
q1 = lambda x: x.quantile(0.25), 
q2 = lambda x: x.quantile(0.50), 
q3 = lambda x: x.quantile(0.75), 
q4 = lambda x: x.quantile(1.0)).reset_index()
df['mean'] = df.mean(axis = 1) 
result = df.sort_values('mean')[['owner_name','q1','q2','q3','q4']]

An error loading the results has occurred

Check out our post “Pandas Interview Questions” to find more such questions for your practice.

Summary

We’ve covered several helpful techniques which are:

Info box:

Performing pandas aggregations
1. Multiple aggregations for a single column:
  df.groupby('col')['select_col'].agg([aggfunc1, aggfunc2, aggfunc3])
2. Different aggregations for different columns:
  df.groupby('col1').agg({'col2': agg_func1, 'col3': aggfunc2})
3. Multiple aggregations for multiple columns:
  df.groupby('col1').agg({'col2': [agg_func1, aggfunc2, aggfunc3], 'col3': [agg_func1, aggfunc2, aggfunc3]})
Renaming columns upon aggregation
1. Multiple aggregations for a single column
  df.groupby('col')['select_col'].agg(col_name='aggfunc1', col_name2='agg_func2', col_name3='agg_func3')
2. Multiple aggregations for multiple columns
  df.groupby('col1').agg({'col2': [('col_name', agg_func1), ('col_name2', aggfunc2), ('col_name3', aggfunc3)]})

Lastly, don't forget that you can use your defined functions or lambda expressions in place of the aggregate functions if needed! Explore more pandas functions in our “Pandas Cheat Sheet”.