Methods for Ranking in Pandas

Methods for Ranking in Pandas


Learn how to rank data in Python Pandas using different methods. We use the Airbnb, Google, and Yelp datasets for practical examples.

Ranking data in Pandas is one of the most commonly asked Pandas interview questions. If you’ve been preparing for interviews, you might already be familiar with ranking operations using SQL. In fact, our team has already put together an SQL guide to ranking functions.

This is the Python version, where I’ll take you through how to rank in Pandas and the various methods of ranking available.

You can also watch the video tutorial for these methods on our YouTube channel.

Basics of Pandas Ranking

You can create a ranking in Python Pandas by passing the .rank() function to the column used as the basis for ranking.


DataFrame: airbnb_guests
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10160-rank-guests-based-on-their-ages

For example, in this simple exercise from Airbnb, we can rank the guests based on their ages by:

airbnb_guests['age'].rank()

By default, this will create a ranking based on the ascending order such that the youngest visitor gets Rank 1.

import pandas as pd
import numpy as np
airbnb_guests['rank'] = airbnb_guests['age'].rank()
airbnb_guests[['age', 'rank']].sort_values(by='age')

All required columns and the first 5 rows of the solution are shown

agerank
191
212
233
244.5
244.5

However, our ranking should be in descending order, so we’ll have to specify it as a parameter.

airbnb_guests['age'].rank(ascending=False)

This reverses the ranking such that the oldest guest (i.e., highest age value) gets Rank 1.

import pandas as pd
import numpy as np
airbnb_guests['rank'] = airbnb_guests['age'].rank(ascending=False)
airbnb_guests[['age', 'rank']].sort_values(by='age', ascending=False)

If you inspect the results, you’ll notice some tied groups at rank 6.5 with 2 guests aged 27 and at rank 8.5 with 2 guests aged 24.

Observing ties is common, so you should anticipate this and determine how to deal with them. Depending on your choice, there are different methods available in Python.

Methods of Ranking in Python Pandas

Basics of Pandas Ranking

1. Average Ranking in Pandas

Info box:

  • Ranking methods give you options for dealing with ties.

Let’s talk about the default option, which is the ‘average’ ranking. Looking back at our output, the 2 guests aged 27 occupy ranks 6 and 7. Since they are now ‘sharing’ a rank, our function returns an average of 6 and 7 - 6.5.

Similarly, ranks 8 and 9 are shared between two guests aged 24. Ranks 8 and 9 are effectively shared, and the average of 8.5 is awarded to both guests.

Info box:

‘Average’ method (default option)

  • Tied elements are given the same rank.
  • The rank for the tied group is calculated as the average of the elements’ positions.
  • Because an average is used, ranks may have decimal places.

Because an average is calculated, expect rankings with decimal places.

Whole number rankings like ‘1st’ and ‘2nd’ are more intuitive than ‘1.33rd’ or ‘2.5th’, so let me present you with other options that will always return an integer ranking.

2. Minimum and Maximum Ranking in Pandas

Like average ranking, minimum and maximum Pandas ranking methods will assign the same rank to elements of the same value. However, instead of taking the average of the tied elements’ positions, we take either the group's minimum or maximum rank.


DataFrame: yelp_business
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10148-find-the-top-10-cities-with-the-most-5-star-businesses

In this example, we’re asked to return the top cities with the most 5-star businesses, where tied groups are given the lowest instead of the average rank.

import pandas as pd
import numpy as np

stars = yelp_business[yelp_business['stars'] == 5]
stars = stars.groupby(['city'])['stars'].count().to_frame('count_of_5_stars').reset_index()
stars['rank_min'] = stars['count_of_5_stars'].rank(method='min', ascending=False)
stars.sort_values(by='rank_min')

All required columns and the first 5 rows of the solution are shown

citycount_of_5_starsrank_min
Phoenix51
Las Vegas42
Toronto42
Scottsdale24
Madison24

In this Pandas ranking method, the tied elements inherit the lowest ranking in the group. The rank after this is determined by incrementing the rank by the number of tied elements. For example, if two cities (in positions 2 and 3) are tied, they will be both ranked 2, which is the minimum rank for the group. The next rank to be awarded is the previous rank (2) plus the number of tied elements (2), so 4. At rank 4, we also have 5 tied elements, so the rank after this group is the previous rank, 4, plus 5 (number of tied elements), resulting in rank 9. Hence, the rankings are 1, 2, 4, and 9.

Why don’t you try using the max method and see how it changes the values?

For comparison, here’s how rankings differ under the various methods.

  • Las Vegas and Toronto are in positions 2 and 3. The average of the two is 2.5 (rank_avg). The minimum is 2 (rank_min), and the maximum is 3 (rank_max)
  • Edinburgh, Gilbert, Madison, Scottsdale, and Urbana occupy positions 4,5,6,7 and 8. The average position is 6, the minimum is 4, and the maximum is 8.

Notice that there is a gap in the ranking across these three methods.

import pandas as pd
import numpy as np

stars = yelp_business[yelp_business['stars'] == 5]
stars = stars.groupby(['city'])['stars'].count().to_frame('count_of_5_stars').reset_index()
stars['rank_avg'] = stars['count_of_5_stars'].rank(ascending=False)
stars['rank_min'] = stars['count_of_5_stars'].rank(method='min', ascending=False)
stars['rank_max'] = stars['count_of_5_stars'].rank(method='max', ascending=False)
stars.sort_values(by=['rank_avg', 'rank_min'])

All required columns and the first 5 rows of the solution are shown

citycount_of_5_starsrank_avgrank_minrank_max
Phoenix5111
Las Vegas42.523
Toronto42.523
Edinburgh2648
Gilbert2648

Info box:

‘Minimum’ and ‘Maximum’ methods (default option)

  • Tied elements are given the same rank.
  • The rank for the tied group is calculated as the minimum or maximum position of the group.
  • There will be ‘skipped’ ranks in the case of ties.

If skipping ranks is not your preferred route, check out these next two methods - dense or first rank.

3. Dense Ranking in Pandas

The Pandas dense ranking is a method of assigning ranks to items in which consecutive integers are used with no gaps, even if there are ties. In other words, the next rank will always be the succeeding integer value, regardless of how many elements are tied for the current rank or the rank/s before it. Therefore, you can always expect a rank 1, 2, 3, and so on without worrying about ties.

Let’s use this to solve a problem from Airbnb.


The question asks:


DataFrame: airbnb_apartments
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds

After aggregating the total number of beds from each host, we can create a new column called ‘rank_dense’ to create the rankings of hosts based on the most number of beds using the dense ranking system.

result = airbnb_apartments.groupby('host_id')['n_beds'].sum().to_frame('number_of_beds').reset_index()
result['rank_dense'] = result['number_of_beds'].rank(method = 'dense',ascending = False)

For comparison, I’ll also include the results for the ‘average’ and ‘minimum’ Pandas ranking methods. There are several tied groups:

a) At 4 beds, hosts 1, 7, and 9 are tied. They occupy positions 5, 6, and 7.

b) At 3 beds, hosts 0 and 2 are tied. They occupy positions 8 and 9.

c) At 2 beds, hosts 4, 8, and 11 are tied. They occupy positions 10, 11, and 12.

Can you try to work out how the dense, average, and minimum ranks were calculated for each of these groups?

import pandas as pd

result = airbnb_apartments.groupby('host_id')['n_beds'].sum().to_frame('number_of_beds').reset_index()
result['rank_dense'] = result['number_of_beds'].rank(method = 'dense',ascending = False)
result['rank_avg'] = result['number_of_beds'].rank(method = 'average',ascending = False)
result['rank_min'] = result['number_of_beds'].rank(method = 'min',ascending = False)
result = result.sort_values(by='rank_min')

All required columns and the first 5 rows of the solution are shown

host_idnumber_of_bedsrank_denserank_avgrank_min
1016111
38222
66333
55444
14565

Another way of deriving the dense ranks is to:

  1. Take the unique values of the column - in this case, the total number of beds which are 16, 8, 6, 5, 4, 3, and 2.
  2. Rank these values accordingly so the rankings will be 1, 2, 3, 4, 5, 6, and 7, respectively.
  3. Then assign a label to each row (i.e., host) according to the rank value assigned to their ‘total number of beds’.

When ranks are determined by using averages, minimums, and maximums, rankings are effectively skipped because all groups receive the same value. Under the dense ranking method, tied groups are given the same value, but the next rank is always the next consecutive integer (regardless of how many elements are tied for the current rank or the ranks before it).

We have the first tied group ranking as 5th and the second tied group ranking as 6th, and this will be the same regardless of how many hosts tie for each of these places. In the same way, if we had several hosts tying for first or second place, hosts 1, 7, and 9 would still be ranked 5, and hosts 0 and 2 would still rank 6th. Meanwhile, under other Pandas ranking methods, the ranking of these tied groups will change depending on how many hosts are ranked above them.

Info box:

‘Dense’ Ranking methods

  • Tied elements are given the same rank.
  • Rankings use consecutive integers. In other words, no integer is ‘skipped’ during ranking.

Tied groups are assigned the same ranking for all the methods we’ve covered so far.

What if you needed each record to have a unique ranking?

4. ‘First’ Ranking in Pandas

Let’s take this interview question from Google which asks us to return a unique ranking for each user, even if multiple users have the same number of emails.


DataFrame: google_gmail_emails
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank

If you need to force a distinct ranking for every record, use method='first'.

After you’ve computed the total emails per user, derive the user ranking based on the highest total emails using the ‘first’ method.

result['rank'] = result['total_emails'].rank(method='first', ascending=False

import pandas as pd
import numpy as np

result = google_gmail_emails.groupby(
    ['from_user']).size().to_frame('total_emails').reset_index()
result['rank'] = result['total_emails'].rank(method='first', ascending=False)
result = result.sort_values(by=['total_emails', 'from_user'], ascending=[False, True])

All required columns and the first 5 rows of the solution are shown

from_usertotal_emailsrank
32ded68d89443e808191
ef5fe98c6b9f313075192
5b8754928306a18b68183
55e60cfcc9dc49c17e164
91f59516cb9dee1e88165

The method='first' option for ranking in Pandas assigns ranks based on the order of appearance of the values in the original list. This means that if there are tied values in the list, each value will be assigned a different rank based on its position in the list. The first value in the list will be given a rank of 1, the second value in the list will be given a rank of 2, and so on, regardless of whether there are any ties.

In our data, there are two users with 19 emails each. However, the user appearing first in the DataFrame will be ranked 1, and the other user will receive rank 2. In contrast to other ranking methods that assign an average rank to tied values, the 'first' method ensures there are no repeated ranks despite having ties at the same values.

The 'first' ranking method in Pandas is particularly useful when you want to return a fixed number of records or ensure only one row is returned per rank. By using this Pandas ranking method, you can ensure that each value is assigned a unique rank.

Info box:

‘First’ Ranking methods

  • Each record is given a unique rank (even in the case of ties).
  • Rankings are based on consecutive integers.

Conclusion

To summarize, rankings in Pandas are created by calling the .rank() function on the relevant column. By default, values are ranked in ascending order such that the lowest value is Rank 1. In the case of ties, the average ranking for the tied group is also used.

However, there are other approaches to ranking, namely:

  • Minimum: tied elements are ranked based on the lowest position of the group
  • Maximum: tied elements are ranked based on the highest position of the group
  • Dense: rankings are based on consecutive integers, regardless of ties
  • First: unique rankings are returned for each row

Watch out for our next blog, where we will focus on options for dealing with nulls and duplicates when ranking. As a bonus, we will also map out the SQL ranking function equivalents of these ranking methods, so stay tuned! Meanwhile, we recommend this “Pandas Cheat Sheet” that explains the functions any data scientist should know.

Methods for Ranking in Pandas


Become a data expert. Subscribe to our newsletter.