Methods for Ranking in Pandas

Basics of Pandas Ranking

You can create a ranking in Python Pandas by passing the .rank() function to the column used as the basis for ranking.

Rank guests based on their ages

Airbnb

EasyID 10160

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Rank guests based on their ages. Output the guest id along with the corresponding rank. Order records by the age in descending order.

DataFrame: airbnb_guests

Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10160-rank-guests-based-on-their-ages

For example, in this simple exercise from Airbnb, we can rank the guests based on their ages by:

airbnb_guests['age'].rank()

By default, this will create a ranking based on the ascending order such that the youngest visitor gets Rank 1.

import pandas as pd
import numpy as np
airbnb_guests['rank'] = airbnb_guests['age'].rank()
airbnb_guests[['age', 'rank']].sort_values(by='age')

All required columns and the first 5 rows of the solution are shown

age	rank
19	1
21	2
23	3
24	4.5
24	4.5

However, our ranking should be in descending order, so we’ll have to specify it as a parameter.

airbnb_guests['age'].rank(ascending=False)

This reverses the ranking such that the oldest guest (i.e., highest age value) gets Rank 1.

import pandas as pd
import numpy as np
airbnb_guests['rank'] = airbnb_guests['age'].rank(ascending=False)
airbnb_guests[['age', 'rank']].sort_values(by='age', ascending=False)

If you inspect the results, you’ll notice some tied groups at rank 6.5 with 2 guests aged 27 and at rank 8.5 with 2 guests aged 24.

Observing ties is common, so you should anticipate this and determine how to deal with them. Depending on your choice, there are different methods available in Python.

Methods of Ranking in Python Pandas

1. Average Ranking in Pandas

Info box:

Ranking methods give you options for dealing with ties.

Let’s talk about the default option, which is the ‘average’ ranking. Looking back at our output, the 2 guests aged 27 occupy ranks 6 and 7. Since they are now ‘sharing’ a rank, our function returns an average of 6 and 7 - 6.5.

Similarly, ranks 8 and 9 are shared between two guests aged 24. Ranks 8 and 9 are effectively shared, and the average of 8.5 is awarded to both guests.

Info box:

‘Average’ method (default option)

Tied elements are given the same rank.
The rank for the tied group is calculated as the average of the elements’ positions.
Because an average is used, ranks may have decimal places.

Because an average is calculated, expect rankings with decimal places.

Whole number rankings like ‘1st’ and ‘2nd’ are more intuitive than ‘1.33rd’ or ‘2.5th’, so let me present you with other options that will always return an integer ranking.

2. Minimum and Maximum Ranking in Pandas

Like average ranking, minimum and maximum Pandas ranking methods will assign the same rank to elements of the same value. However, instead of taking the average of the tied elements’ positions, we take either the group's minimum or maximum rank.

Find the top 5 cities with the most 5 star businesses

Yelp

MediumID 10148

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the top 5 cities with the most 5-star businesses. Output the city name along with the number of 5-star businesses. Include both open and closed businesses. In the case of multiple cities having the same number of 5-star businesses, use the ranking function returning the lowest rank in the group and output cities with a rank smaller than or equal to 5.

DataFrame: yelp_business

Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10148-find-the-top-10-cities-with-the-most-5-star-businesses

In this example, we’re asked to return the top cities with the most 5-star businesses, where tied groups are given the lowest instead of the average rank.

import pandas as pd
import numpy as np

stars = yelp_business[yelp_business['stars'] == 5]
stars = stars.groupby(['city'])['stars'].count().to_frame('count_of_5_stars').reset_index()
stars['rank_min'] = stars['count_of_5_stars'].rank(method='min', ascending=False)
stars.sort_values(by='rank_min')

All required columns and the first 5 rows of the solution are shown

city	count_of_5_stars	rank_min
Phoenix	5	1
Las Vegas	4	2
Toronto	4	2
Scottsdale	2	4
Madison	2	4

In this Pandas ranking method, the tied elements inherit the lowest ranking in the group. The rank after this is determined by incrementing the rank by the number of tied elements. For example, if two cities (in positions 2 and 3) are tied, they will be both ranked 2, which is the minimum rank for the group. The next rank to be awarded is the previous rank (2) plus the number of tied elements (2), so 4. At rank 4, we also have 5 tied elements, so the rank after this group is the previous rank, 4, plus 5 (number of tied elements), resulting in rank 9. Hence, the rankings are 1, 2, 4, and 9.

Why don’t you try using the max method and see how it changes the values?

For comparison, here’s how rankings differ under the various methods.

Las Vegas and Toronto are in positions 2 and 3. The average of the two is 2.5 (rank_avg). The minimum is 2 (rank_min), and the maximum is 3 (rank_max)
Edinburgh, Gilbert, Madison, Scottsdale, and Urbana occupy positions 4,5,6,7 and 8. The average position is 6, the minimum is 4, and the maximum is 8.

Notice that there is a gap in the ranking across these three methods.

import pandas as pd
import numpy as np

stars = yelp_business[yelp_business['stars'] == 5]
stars = stars.groupby(['city'])['stars'].count().to_frame('count_of_5_stars').reset_index()
stars['rank_avg'] = stars['count_of_5_stars'].rank(ascending=False)
stars['rank_min'] = stars['count_of_5_stars'].rank(method='min', ascending=False)
stars['rank_max'] = stars['count_of_5_stars'].rank(method='max', ascending=False)
stars.sort_values(by=['rank_avg', 'rank_min'])

All required columns and the first 5 rows of the solution are shown

city	count_of_5_stars	rank_avg	rank_min	rank_max
Phoenix	5	1	1	1
Las Vegas	4	2.5	2	3
Toronto	4	2.5	2	3
Edinburgh	2	6	4	8
Gilbert	2	6	4	8

Info box:

‘Minimum’ and ‘Maximum’ methods (default option)

Tied elements are given the same rank.
The rank for the tied group is calculated as the minimum or maximum position of the group.
There will be ‘skipped’ ranks in the case of ties.

If skipping ranks is not your preferred route, check out these next two methods - dense or first rank.

3. Dense Ranking in Pandas

The Pandas dense ranking is a method of assigning ranks to items in which consecutive integers are used with no gaps, even if there are ties. In other words, the next rank will always be the succeeding integer value, regardless of how many elements are tied for the current rank or the rank/s before it. Therefore, you can always expect a rank 1, 2, 3, and so on without worrying about ties.

Let’s use this to solve a problem from Airbnb.

The question asks:

Ranking Hosts By Beds

Interview Question Date: July 2020

Airbnb

MediumID 10161

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Rank each host based on the number of beds they have listed. The host with the most beds should be ranked 1 and the host with the least number of beds should be ranked last. Hosts that have the same number of beds should have the same rank but there should be no gaps between ranking values. A host can also own multiple properties. Output the host ID, number of beds, and rank from highest rank to lowest.

DataFrame: airbnb_apartments

Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds

After aggregating the total number of beds from each host, we can create a new column called ‘rank_dense’ to create the rankings of hosts based on the most number of beds using the dense ranking system.

result = airbnb_apartments.groupby('host_id')['n_beds'].sum().to_frame('number_of_beds').reset_index()
result['rank_dense'] = result['number_of_beds'].rank(method = 'dense',ascending = False)

For comparison, I’ll also include the results for the ‘average’ and ‘minimum’ Pandas ranking methods. There are several tied groups:

a) At 4 beds, hosts 1, 7, and 9 are tied. They occupy positions 5, 6, and 7.

b) At 3 beds, hosts 0 and 2 are tied. They occupy positions 8 and 9.

c) At 2 beds, hosts 4, 8, and 11 are tied. They occupy positions 10, 11, and 12.

Can you try to work out how the dense, average, and minimum ranks were calculated for each of these groups?

import pandas as pd

result = airbnb_apartments.groupby('host_id')['n_beds'].sum().to_frame('number_of_beds').reset_index()
result['rank_dense'] = result['number_of_beds'].rank(method = 'dense',ascending = False)
result['rank_avg'] = result['number_of_beds'].rank(method = 'average',ascending = False)
result['rank_min'] = result['number_of_beds'].rank(method = 'min',ascending = False)
result = result.sort_values(by='rank_min')

All required columns and the first 5 rows of the solution are shown

host_id	number_of_beds	rank_dense	rank_avg	rank_min
10	16	1	1	1
3	8	2	2	2
6	6	3	3	3
5	5	4	4	4
1	4	5	6	5

Another way of deriving the dense ranks is to:

Take the unique values of the column - in this case, the total number of beds which are 16, 8, 6, 5, 4, 3, and 2.
Rank these values accordingly so the rankings will be 1, 2, 3, 4, 5, 6, and 7, respectively.
Then assign a label to each row (i.e., host) according to the rank value assigned to their ‘total number of beds’.

When ranks are determined by using averages, minimums, and maximums, rankings are effectively skipped because all groups receive the same value. Under the dense ranking method, tied groups are given the same value, but the next rank is always the next consecutive integer (regardless of how many elements are tied for the current rank or the ranks before it).

We have the first tied group ranking as 5th and the second tied group ranking as 6th, and this will be the same regardless of how many hosts tie for each of these places. In the same way, if we had several hosts tying for first or second place, hosts 1, 7, and 9 would still be ranked 5, and hosts 0 and 2 would still rank 6th. Meanwhile, under other Pandas ranking methods, the ranking of these tied groups will change depending on how many hosts are ranked above them.

Info box:

‘Dense’ Ranking methods

Tied elements are given the same rank.
Rankings use consecutive integers. In other words, no integer is ‘skipped’ during ranking.

Tied groups are assigned the same ranking for all the methods we’ve covered so far.

What if you needed each record to have a unique ranking?

4. ‘First’ Ranking in Pandas

Let’s take this interview question from Google which asks us to return a unique ranking for each user, even if multiple users have the same number of emails.

Activity Rank

Interview Question Date: July 2021

Google

MediumID 10351

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the email activity rank for each user. Email activity rank is defined by the total number of emails sent. The user with the highest number of emails sent will have a rank of 1, and so on. Output the user, total emails, and their activity rank. Order records by the total emails in descending order. Sort users with the same number of emails in alphabetical order. In your rankings, return a unique value (i.e., a unique rank) even if multiple users have the same number of emails. For tie breaker use alphabetical order of the user usernames.

DataFrame: google_gmail_emails

Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank

If you need to force a distinct ranking for every record, use method='first'.

After you’ve computed the total emails per user, derive the user ranking based on the highest total emails using the ‘first’ method.

result['rank'] = result['total_emails'].rank(method='first', ascending=False

import pandas as pd
import numpy as np

result = google_gmail_emails.groupby(
    ['from_user']).size().to_frame('total_emails').reset_index()
result['rank'] = result['total_emails'].rank(method='first', ascending=False)
result = result.sort_values(by=['total_emails', 'from_user'], ascending=[False, True])

All required columns and the first 5 rows of the solution are shown

from_user	total_emails	rank
32ded68d89443e808	19	1
ef5fe98c6b9f313075	19	2
5b8754928306a18b68	18	3
55e60cfcc9dc49c17e	16	4
91f59516cb9dee1e88	16	5

The method='first' option for ranking in Pandas assigns ranks based on the order of appearance of the values in the original list. This means that if there are tied values in the list, each value will be assigned a different rank based on its position in the list. The first value in the list will be given a rank of 1, the second value in the list will be given a rank of 2, and so on, regardless of whether there are any ties.

In our data, there are two users with 19 emails each. However, the user appearing first in the DataFrame will be ranked 1, and the other user will receive rank 2. In contrast to other ranking methods that assign an average rank to tied values, the 'first' method ensures there are no repeated ranks despite having ties at the same values.

The 'first' ranking method in Pandas is particularly useful when you want to return a fixed number of records or ensure only one row is returned per rank. By using this Pandas ranking method, you can ensure that each value is assigned a unique rank.

Info box:

‘First’ Ranking methods

Each record is given a unique rank (even in the case of ties).
Rankings are based on consecutive integers.

Conclusion

To summarize, rankings in Pandas are created by calling the .rank() function on the relevant column. By default, values are ranked in ascending order such that the lowest value is Rank 1. In the case of ties, the average ranking for the tied group is also used.

However, there are other approaches to ranking, namely:

Minimum: tied elements are ranked based on the lowest position of the group
Maximum: tied elements are ranked based on the highest position of the group
Dense: rankings are based on consecutive integers, regardless of ties
First: unique rankings are returned for each row

Watch out for our next blog, where we will focus on options for dealing with nulls and duplicates when ranking. As a bonus, we will also map out the SQL ranking function equivalents of these ranking methods, so stay tuned! Meanwhile, we recommend this “Pandas Cheat Sheet” that explains the functions any data scientist should know.

Methods for Ranking in Pandas

Categories

Basics of Pandas Ranking

Rank guests based on their ages

Methods of Ranking in Python Pandas

1. Average Ranking in Pandas

2. Minimum and Maximum Ranking in Pandas

Find the top 5 cities with the most 5 star businesses

3. Dense Ranking in Pandas

Ranking Hosts By Beds

4. ‘First’ Ranking in Pandas

Activity Rank

Conclusion

Latest Posts:

From Arrays to Lists: Converting Numpy Arrays with Python

Tips for Crafting an Impressive Data Analyst Portfolio

PySpark GroupBy Guide: Super Simple Way to Group Data

Share

Follow

Categories