Ranking in Python and SQL

Parameters of the ranking function

The Pandas rank() function has several parameters that allow you to customize your ranking based on your dataset. These parameters include:

pct: to return percentile rankings
na_option: to control how nulls are dealt with during ranking
axis: to rank rows or columns.
numeric_only: to rank numeric or non-numeric columns

pandas.DataFrame.rank

DataFrame.rank(axis=0, method=’average’, numeric_only=false, na_option=’keep’, ascending=True, pct=False)

Pct = False

Did you know that in pandas, you can obtain percentage rankings instead of integer rankings? By default, the .rank() method sets ‘pct’ to False and returns integer rankings. However, when set to True, .rank() computes percentage rankings that are normalized from 0 to 1.

Percentage rankings are useful when you want to compare rankings across different columns, regardless of their actual values. They provide a standardized way of measuring an individual's performance or status relative to the rest of the group. This is why percentage rankings are commonly used in fields like education and healthcare.

So, what exactly is a percentage ranking and how is it calculated?

What is percentile ranking?

percentage of values in the distribution that are less than or equal to the given value or have a lower rank

Essentially, the percentile ranking tells you the percentage of values in the distribution that are less than or equal to the given value or have a lower rank. For example, if a student scored 80 out of 100 on a test and the distribution of scores in the class is such that 60% of the students scored below 80, then the percentile ranking of the student's score is 60%.

Let’s look at a question example from General Assembly.

Find the 80th percentile of hours studied

General Assembly

Kaplan

Google

MediumID 9611

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the 80th percentile of hours studied. Output hours studied value at specified percentile.

DataFrame: sat_scores

Expected Output Type: pandas.DataFrame

Table: sat_scores

school	teacher	student_id	sat_writing	sat_verbal	sat_math	hrs_studied	id	average_sat
Washington HS	Frederickson	1	583	307	528	190	1	583
Washington HS	Frederickson	2	401	791	248	149	2	401
Washington HS	Frederickson	3	523	445	756	166	3	523
Washington HS	Frederickson	4	306	269	327	137	4	306
Washington HS	Frederickson	5	300	539	743	115	5	300

The question asks us to find the 80th percentile of hours studied. In other words, “80% of the students studied less than x hours?”

But let’s expand this requirement. To show how ranking works, we’ll list the hours studied, rank them, and show the percentile rank.

To solve this, we will have to use the ranking function, setting pct to True.

# Import your libraries
import pandas as pd

# Start writing code
sat_scores.head()

sat_scores['percentile_rank'] = sat_scores['hrs_studied'].rank(axis = 0 , pct = True)
sat_scores['rank'] = sat_scores['hrs_studied'].rank(axis = 0)
sat_scores[['hrs_studied', 'rank', 'percentile_rank']].sort_values(by='percentile_rank', ascending=False)

An error loading the results has occurred

The above output shows the first 28 rows, which is enough to demonstrate everything.

As you can see, the top value (i.e. 200 hours) gets a percentile rank of 1 because 100% of the values are ranking below this (while the lowest will get 0 because it is already the lowest rank).

In case of ties here, rankings are determined by the average percentile rank of the tied group.

Regarding the 80th percentile, it’s at 163 hours. The next amount of hours studied (162) is below the 80th percentile.

na_option = ‘keep’

The ‘na_option’, which allows you to control how nulls are dealt with during ranking. By default, na_option is set to 'keep', which means that there is no ranking returned for these records. However, you can set the na_option to 'bottom' or 'top' which assigns either the lowest or highest rank to these rows.

# Import your libraries
import pandas as pd
import numpy as np 
# Start writing code
sat_scores.head()

sat_scores['rank_default'] = sat_scores['hrs_studied'].rank()
sat_scores['rank_natop'] = sat_scores['hrs_studied'].rank(na_option='top')
sat_scores['rank_nabottom'] = sat_scores['hrs_studied'].rank(na_option='bottom')

sat_scores.sort_values(by='rank_natop')

An error loading the results has occurred

As you can see here, the default option leaves the rank null but we can change this easily to give it the top or bottom rank.

axis = 0

The axis tells you whether to rank rows or columns. Most of the time we are ranking rows and the default option caters to this. However, if you ever find yourself with another dataset type, like a time series of KPIs, you might want to rank columns instead to identify the trends over time for each KPI (represented by a column). To do this, set the axis to 1 instead of 0.

numeric_only = True

Again, with the ‘numeric_only’ option, the default option is set to the usual use case which is ranking based on numeric values. However, if you want to rank both numeric and non-numeric values like arranging records by name as well, you can do so by setting this to False.

Ranking in SQL and Python

Data analysts and scientists typically must be proficient in SQL and Python so now that you know the variations of ranking in Python, let’s discover their equivalent functions in SQL!

Percentile Ranking

In SQL, you can achieve this by using the percent_rank() window function. Why don’t you try it on the same question example?

Find the 80th percentile of hours studied

General Assembly

Kaplan

Google

MediumID 9611

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the 80th percentile of hours studied. Output hours studied value at specified percentile.

Table: sat_scores

SELECT *,
percent_rank() OVER (ORDER BY hrs_studied) as study_percentile
from sat_scores
Order by study_percentile

An error loading the results has occurred

By default, nulls are left at the bottom and therefore, get the largest ranking (i.e. close to 1). This is unlike the Python rank function where nulls were not given any ranking by default unless we change the na_option parameter.

To rank nulls, you can specify this in the order by clause. For example:

SELECT *,
percent_rank() OVER (ORDER BY hrs_studied NULLS FIRST) as study_percentile
from sat_scores
Order by study_percentile

Why don’t you try this?

Ranking (method=’min’)

Using this guest dataset from Airbnb, we will rank guests according to their age with the eldest receiving the top rank.

Rank guests based on their ages

Airbnb

EasyID 10160

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Rank guests based on their ages. Output the guest id along with the corresponding rank. Order records by the age in descending order.

Table: airbnb_guests

guest_id	nationality	gender	age
0	Mali	M	21
1	China	F	23
2	Mali	F	27
3	Australia	F	24
4	Luxembourg	M	19

Recall that in Python, we used the .rank() function and set ascending to False.

airbnb_guests['rank'] = airbnb_guests['age'].rank(method='min', ascending = False)

An error loading the results has occurred

In SQL, we’ll use the rank window function and specify that age should be ordered in descending order.

SELECT 
    guest_id,
    RANK() OVER(ORDER BY age DESC)
FROM airbnb_guests

An error loading the results has occurred

Look at the ties and notice that the RANK() window function in SQL is using the minimum method to rank tied groups. In case you haven’t watched our previous ranking video, the minimum method gives the lowest possible rank to the tied values. For example, guests with ages 27 occupy positions 6 and 7. Because of the tie, 6 is now used as the rank of both guests.

Ranking (method=’dense’)

In Python, we use the rank() function with method='dense' to achieve a ranking with no gaps between ranks. In SQL, we use the dense_rank() window function to achieve the same result.

Let’s attempt this question in SQL.

Ranking Hosts By Beds

Interview Question Date: July 2020

Airbnb

MediumID 10161

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Rank each host based on the number of beds they have listed. The host with the most beds should be ranked 1 and the host with the least number of beds should be ranked last. Hosts that have the same number of beds should have the same rank but there should be no gaps between ranking values. A host can also own multiple properties. Output the host ID, number of beds, and rank from highest rank to lowest.

Table: airbnb_apartments

host_id	apartment_id	apartment_type	n_beds	n_bedrooms	country	city
0	A1	Room	1	1	USA	New York
0	A2	Room	1	1	USA	New Jersey
0	A3	Room	1	1	USA	New Jersey
1	A4	Apartment	2	1	USA	Houston
1	A5	Apartment	2	1	USA	Las Vegas

To start with, let’s get the number of beds listed by each host.

SELECT 
    host_id, 
    sum(n_beds) as number_of_bedsFROM airbnb_apartments
GROUP BY host_id

Then, let’s add in the ranking:

SELECT 
    host_id, 
    sum(n_beds) as number_of_beds,
    DENSE_RANK() OVER(ORDER BY sum(n_beds) DESC) as rank
FROM airbnb_apartments
GROUP BY host_id
ORDER BY number_of_beds desc

An error loading the results has occurred

And this achieves the ranking method we want, which is to not have gaps in the ranking.

Ranking (method=’first’)

In Python, we use the rank() function with method='first' to assign a unique rank to every record. In SQL, we use the row_number() window function to achieve the same result.

Activity Rank

Interview Question Date: July 2021

Google

MediumID 10351

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the email activity rank for each user. Email activity rank is defined by the total number of emails sent. The user with the highest number of emails sent will have a rank of 1, and so on. Output the user, total emails, and their activity rank. Order records by the total emails in descending order. Sort users with the same number of emails in alphabetical order. In your rankings, return a unique value (i.e., a unique rank) even if multiple users have the same number of emails. For tie breaker use alphabetical order of the user usernames.

Table: google_gmail_emails

id	from_user	to_user	day
0	6edf0be4b2267df1fa	75d295377a46f83236	10
1	6edf0be4b2267df1fa	32ded68d89443e808	6
2	6edf0be4b2267df1fa	55e60cfcc9dc49c17e	10
3	6edf0be4b2267df1fa	e0e0defbb9ec47f6f7	6
4	6edf0be4b2267df1fa	47be2887786891367e	1

First, let’s get the total number of emails sent out by each user.

SELECT  from_user, 
        count(*) as total_emails
FROM google_gmail_emails 
    GROUP BY from_user

Then, let’s use this as a basis for our ranking. We can use row_number() in SQL to create the unique rankings.

SELECT  from_user, 
        count(*) as total_emails, 
        row_number() OVER ( order by count(*) desc, from_user asc)
    FROM google_gmail_emails 
    GROUP BY from_user
    order by 2 DESC, 1

An error loading the results has occurred

And there you have it - ranking methods in Python and SQL.

Summary

In this post, we've covered other parameters of the Pandas rank function, such as pct, na_option, axis, and numeric_only. We also looked at several ranking methods in Python and discovered their equivalents in SQL. Here’s a cheat sheet for you:

Tip!:

If you need to change the way nulls are being ranked, specify this in the ORDER BY() clause. For example, RANK() OVER(ORDER BY col NULLS FIRST)

Ranking helps identify patterns and trends in datasets so it’s an important tool for data analysis. To master ranking in SQL, check out these two articles:

Ranking in Python and SQL

Categories

Parameters of the ranking function

Pct = False

Find the 80th percentile of hours studied

na_option = ‘keep’

axis = 0

numeric_only = True

Ranking in SQL and Python

Percentile Ranking

Find the 80th percentile of hours studied

Ranking (method=’min’)

Rank guests based on their ages

Ranking (method=’dense’)

Ranking Hosts By Beds

Ranking (method=’first’)

Activity Rank

Summary

Latest Posts:

How Does Active Learning Machine Learning Work?

How to Replace a Character in a Python String

Mastering Loop Iterations: Python For Loop Index Explained

Share

Follow

Categories