Methods for Ranking in Pandas
Categories
Learn how to rank data in Python Pandas using different methods. We use the Airbnb, Google, and Yelp datasets for practical examples.
Ranking data in Pandas is one of the most commonly asked Pandas interview questions. If you’ve been preparing for interviews, you might already be familiar with ranking operations using SQL. In fact, our team has already put together an SQL guide to ranking functions.
This is the Python version, where I’ll take you through how to rank in Pandas and the various methods of ranking available.
You can also watch the video tutorial for these methods on our YouTube channel.
Basics of Pandas Ranking
You can create a ranking in Python Pandas by passing the .rank() function to the column used as the basis for ranking.
Rank guests based on their ages. Output the guest id along with the corresponding rank. Order records by the age in descending order.
Link to the question: https://platform.stratascratch.com/coding/10160-rank-guests-based-on-their-ages
For example, in this simple exercise from Airbnb, we can rank the guests based on their ages by:
airbnb_guests['age'].rank()
By default, this will create a ranking based on the ascending order such that the youngest visitor gets Rank 1.
import pandas as pd
import numpy as np
airbnb_guests['rank'] = airbnb_guests['age'].rank()
airbnb_guests[['age', 'rank']].sort_values(by='age')
age | rank |
---|---|
19 | 1 |
21 | 2 |
23 | 3 |
24 | 4.5 |
24 | 4.5 |
However, our ranking should be in descending order, so we’ll have to specify it as a parameter.
airbnb_guests['age'].rank(ascending=False)
This reverses the ranking such that the oldest guest (i.e., highest age value) gets Rank 1.
import pandas as pd
import numpy as np
airbnb_guests['rank'] = airbnb_guests['age'].rank(ascending=False)
airbnb_guests[['age', 'rank']].sort_values(by='age', ascending=False)
If you inspect the results, you’ll notice some tied groups at rank 6.5 with 2 guests aged 27 and at rank 8.5 with 2 guests aged 24.
Observing ties is common, so you should anticipate this and determine how to deal with them. Depending on your choice, there are different methods available in Python.
Methods of Ranking in Python Pandas
1. Average Ranking in Pandas
Info box:
- Ranking methods give you options for dealing with ties.
Let’s talk about the default option, which is the ‘average’ ranking. Looking back at our output, the 2 guests aged 27 occupy ranks 6 and 7. Since they are now ‘sharing’ a rank, our function returns an average of 6 and 7 - 6.5.
Similarly, ranks 8 and 9 are shared between two guests aged 24. Ranks 8 and 9 are effectively shared, and the average of 8.5 is awarded to both guests.
Info box:
‘Average’ method (default option)
- Tied elements are given the same rank.
- The rank for the tied group is calculated as the average of the elements’ positions.
- Because an average is used, ranks may have decimal places.
Because an average is calculated, expect rankings with decimal places.
Whole number rankings like ‘1st’ and ‘2nd’ are more intuitive than ‘1.33rd’ or ‘2.5th’, so let me present you with other options that will always return an integer ranking.
2. Minimum and Maximum Ranking in Pandas
Like average ranking, minimum and maximum Pandas ranking methods will assign the same rank to elements of the same value. However, instead of taking the average of the tied elements’ positions, we take either the group's minimum or maximum rank.
Find the top 5 cities with the most 5-star businesses. Output the city name along with the number of 5-star businesses. Include both open and closed businesses. In the case of multiple cities having the same number of 5-star businesses, use the ranking function returning the lowest rank in the group and output cities with a rank smaller than or equal to 5.
Link to the question: https://platform.stratascratch.com/coding/10148-find-the-top-10-cities-with-the-most-5-star-businesses
In this example, we’re asked to return the top cities with the most 5-star businesses, where tied groups are given the lowest instead of the average rank.
import pandas as pd
import numpy as np
stars = yelp_business[yelp_business['stars'] == 5]
stars = stars.groupby(['city'])['stars'].count().to_frame('count_of_5_stars').reset_index()
stars['rank_min'] = stars['count_of_5_stars'].rank(method='min', ascending=False)
stars.sort_values(by='rank_min')
city | count_of_5_stars | rank_min |
---|---|---|
Phoenix | 5 | 1 |
Las Vegas | 4 | 2 |
Toronto | 4 | 2 |
Scottsdale | 2 | 4 |
Madison | 2 | 4 |
In this Pandas ranking method, the tied elements inherit the lowest ranking in the group. The rank after this is determined by incrementing the rank by the number of tied elements. For example, if two cities (in positions 2 and 3) are tied, they will be both ranked 2, which is the minimum rank for the group. The next rank to be awarded is the previous rank (2) plus the number of tied elements (2), so 4. At rank 4, we also have 5 tied elements, so the rank after this group is the previous rank, 4, plus 5 (number of tied elements), resulting in rank 9. Hence, the rankings are 1, 2, 4, and 9.
Why don’t you try using the max method and see how it changes the values?
For comparison, here’s how rankings differ under the various methods.
- Las Vegas and Toronto are in positions 2 and 3. The average of the two is 2.5 (rank_avg). The minimum is 2 (rank_min), and the maximum is 3 (rank_max)
- Edinburgh, Gilbert, Madison, Scottsdale, and Urbana occupy positions 4,5,6,7 and 8. The average position is 6, the minimum is 4, and the maximum is 8.
Notice that there is a gap in the ranking across these three methods.
import pandas as pd
import numpy as np
stars = yelp_business[yelp_business['stars'] == 5]
stars = stars.groupby(['city'])['stars'].count().to_frame('count_of_5_stars').reset_index()
stars['rank_avg'] = stars['count_of_5_stars'].rank(ascending=False)
stars['rank_min'] = stars['count_of_5_stars'].rank(method='min', ascending=False)
stars['rank_max'] = stars['count_of_5_stars'].rank(method='max', ascending=False)
stars.sort_values(by=['rank_avg', 'rank_min'])
city | count_of_5_stars | rank_avg | rank_min | rank_max |
---|---|---|---|---|
Phoenix | 5 | 1 | 1 | 1 |
Las Vegas | 4 | 2.5 | 2 | 3 |
Toronto | 4 | 2.5 | 2 | 3 |
Edinburgh | 2 | 6 | 4 | 8 |
Gilbert | 2 | 6 | 4 | 8 |
Info box:
‘Minimum’ and ‘Maximum’ methods (default option)
- Tied elements are given the same rank.
- The rank for the tied group is calculated as the minimum or maximum position of the group.
- There will be ‘skipped’ ranks in the case of ties.
If skipping ranks is not your preferred route, check out these next two methods - dense or first rank.
3. Dense Ranking in Pandas
The Pandas dense ranking is a method of assigning ranks to items in which consecutive integers are used with no gaps, even if there are ties. In other words, the next rank will always be the succeeding integer value, regardless of how many elements are tied for the current rank or the rank/s before it. Therefore, you can always expect a rank 1, 2, 3, and so on without worrying about ties.
Let’s use this to solve a problem from Airbnb.
The question asks:
Interview Question Date: July 2020
Rank each host based on the number of beds they have listed. The host with the most beds should be ranked 1 and the host with the least number of beds should be ranked last. Hosts that have the same number of beds should have the same rank but there should be no gaps between ranking values. A host can also own multiple properties. Output the host ID, number of beds, and rank from highest rank to lowest.
Link to the question: https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds
After aggregating the total number of beds from each host, we can create a new column called ‘rank_dense’ to create the rankings of hosts based on the most number of beds using the dense ranking system.
result = airbnb_apartments.groupby('host_id')['n_beds'].sum().to_frame('number_of_beds').reset_index()
result['rank_dense'] = result['number_of_beds'].rank(method = 'dense',ascending = False)
For comparison, I’ll also include the results for the ‘average’ and ‘minimum’ Pandas ranking methods. There are several tied groups:
a) At 4 beds, hosts 1, 7, and 9 are tied. They occupy positions 5, 6, and 7.
b) At 3 beds, hosts 0 and 2 are tied. They occupy positions 8 and 9.
c) At 2 beds, hosts 4, 8, and 11 are tied. They occupy positions 10, 11, and 12.
Can you try to work out how the dense, average, and minimum ranks were calculated for each of these groups?
import pandas as pd
result = airbnb_apartments.groupby('host_id')['n_beds'].sum().to_frame('number_of_beds').reset_index()
result['rank_dense'] = result['number_of_beds'].rank(method = 'dense',ascending = False)
result['rank_avg'] = result['number_of_beds'].rank(method = 'average',ascending = False)
result['rank_min'] = result['number_of_beds'].rank(method = 'min',ascending = False)
result = result.sort_values(by='rank_min')
host_id | number_of_beds | rank_dense | rank_avg | rank_min |
---|---|---|---|---|
10 | 16 | 1 | 1 | 1 |
3 | 8 | 2 | 2 | 2 |
6 | 6 | 3 | 3 | 3 |
5 | 5 | 4 | 4 | 4 |
1 | 4 | 5 | 6 | 5 |
Another way of deriving the dense ranks is to:
- Take the unique values of the column - in this case, the total number of beds which are 16, 8, 6, 5, 4, 3, and 2.
- Rank these values accordingly so the rankings will be 1, 2, 3, 4, 5, 6, and 7, respectively.
- Then assign a label to each row (i.e., host) according to the rank value assigned to their ‘total number of beds’.
When ranks are determined by using averages, minimums, and maximums, rankings are effectively skipped because all groups receive the same value. Under the dense ranking method, tied groups are given the same value, but the next rank is always the next consecutive integer (regardless of how many elements are tied for the current rank or the ranks before it).
We have the first tied group ranking as 5th and the second tied group ranking as 6th, and this will be the same regardless of how many hosts tie for each of these places. In the same way, if we had several hosts tying for first or second place, hosts 1, 7, and 9 would still be ranked 5, and hosts 0 and 2 would still rank 6th. Meanwhile, under other Pandas ranking methods, the ranking of these tied groups will change depending on how many hosts are ranked above them.
Info box:
‘Dense’ Ranking methods
- Tied elements are given the same rank.
- Rankings use consecutive integers. In other words, no integer is ‘skipped’ during ranking.
Tied groups are assigned the same ranking for all the methods we’ve covered so far.
What if you needed each record to have a unique ranking?
4. ‘First’ Ranking in Pandas
Let’s take this interview question from Google which asks us to return a unique ranking for each user, even if multiple users have the same number of emails.
Interview Question Date: July 2021
Find the email activity rank for each user. Email activity rank is defined by the total number of emails sent. The user with the highest number of emails sent will have a rank of 1, and so on. Output the user, total emails, and their activity rank. Order records by the total emails in descending order. Sort users with the same number of emails in alphabetical order. In your rankings, return a unique value (i.e., a unique rank) even if multiple users have the same number of emails. For tie breaker use alphabetical order of the user usernames.
Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank
If you need to force a distinct ranking for every record, use method='first'.
After you’ve computed the total emails per user, derive the user ranking based on the highest total emails using the ‘first’ method.
result['rank'] = result['total_emails'].rank(method='first', ascending=False
import pandas as pd
import numpy as np
result = google_gmail_emails.groupby(
['from_user']).size().to_frame('total_emails').reset_index()
result['rank'] = result['total_emails'].rank(method='first', ascending=False)
result = result.sort_values(by=['total_emails', 'from_user'], ascending=[False, True])
from_user | total_emails | rank |
---|---|---|
32ded68d89443e808 | 19 | 1 |
ef5fe98c6b9f313075 | 19 | 2 |
5b8754928306a18b68 | 18 | 3 |
55e60cfcc9dc49c17e | 16 | 4 |
91f59516cb9dee1e88 | 16 | 5 |
The method='first' option for ranking in Pandas assigns ranks based on the order of appearance of the values in the original list. This means that if there are tied values in the list, each value will be assigned a different rank based on its position in the list. The first value in the list will be given a rank of 1, the second value in the list will be given a rank of 2, and so on, regardless of whether there are any ties.
In our data, there are two users with 19 emails each. However, the user appearing first in the DataFrame will be ranked 1, and the other user will receive rank 2. In contrast to other ranking methods that assign an average rank to tied values, the 'first' method ensures there are no repeated ranks despite having ties at the same values.
The 'first' ranking method in Pandas is particularly useful when you want to return a fixed number of records or ensure only one row is returned per rank. By using this Pandas ranking method, you can ensure that each value is assigned a unique rank.
Info box:
‘First’ Ranking methods
- Each record is given a unique rank (even in the case of ties).
- Rankings are based on consecutive integers.
Conclusion
To summarize, rankings in Pandas are created by calling the .rank() function on the relevant column. By default, values are ranked in ascending order such that the lowest value is Rank 1. In the case of ties, the average ranking for the tied group is also used.
However, there are other approaches to ranking, namely:
- Minimum: tied elements are ranked based on the lowest position of the group
- Maximum: tied elements are ranked based on the highest position of the group
- Dense: rankings are based on consecutive integers, regardless of ties
- First: unique rankings are returned for each row
Watch out for our next blog, where we will focus on options for dealing with nulls and duplicates when ranking. As a bonus, we will also map out the SQL ranking function equivalents of these ranking methods, so stay tuned! Meanwhile, we recommend this “Pandas Cheat Sheet” that explains the functions any data scientist should know.