Spotify Advanced SQL Interview Question on PARTITION BY Clause

Published: January 10, 2023

Categories:

Written by:
Nathan Rosidi

A detailed solution walkthrough to a hard Spotify SQL interview question involving Joins, Aggregations, Case Statements, and Partition By Clause.

In this article, we’ll walk you through one of the Advanced SQL interview questions. This question is a Hard level problem and will test your advanced SQL skills such as SQL Joins, Aggregations, Partition By clauses, and Case statements. Follow along by clicking on the link to the question provided below. Let us solve this problem using our 3-step framework that you can use to solve any coding question anytime.

Spotify Advanced SQL Interview Question

Days At Number One

Spotify

HardID 10173

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the number of days a US track has stayed in the 1st position for both the US and worldwide rankings on the same day. Output the track name and the number of days in the 1st position. Order your output alphabetically by track name.

If the region 'US' appears in dataset, it should be included in the worldwide ranking.

Tables: spotify_daily_rankings_2017_us, spotify_worldwide_daily_song_ranking

Video Solution

The question is entitled “Days at Number One” and asks us to find the number of days a US soundtrack has stayed in the first position in the Spotify daily rankings tables for both the US and worldwide rankings. The question further clarifies that if the region ‘US’ appears in the worldwide ranking table, it counts as a worldwide track.

In the output, we are expected to display two columns, viz., track name and the number of days at number one. Now, let us work backward to the approach. But first, let us explore the dataset provided.

1. Exploring the Dataset

Spotify has provided us with two tables, namely, spotify_daily_rankings_2017_us and spotify_worldwide_daily_song_ranking.

The first dataset, spotify_daily_rankings_2017_us, contains the following columns:

Missing data

We can see from this table that it only contains tracks that have held the first position in the US daily rankings on various dates.

The second table is named spotify_worldwide_daily_song_ranking, and it has the following schema:

Missing data

One look into this table, and we can see that there are tracks of various positions in this table, and not just the number one tracks as we observed from the US rankings table. Another difference to notice from this table is that there is a column for the region the track belongs to. As the question clarified, any track that hails from the US will also be a part of the worldwide rankings table.

2. Writing Out the Approach

Once you are familiar with the datasets provided, it is time to write out the approach you are about to take to solve the problem.

Going back to the question, the key to finding the number of days a US track has stayed in that position in both the tables is the word “both”. Instinctively, you’ll go for a join, which is absolutely correct.

Step 1: Merge the two tables in an Inner Join on the track name and date columns.

In our case, we will specifically use an inner join in order to filter out the tracks that are US tracks present in both tables and on the same dates. So, the inner join must be made on the two common columns, track name and date.

Step 2: Filter out the US tracks that are in position #1.

Once we have identified the tracks, we will then filter out the tracks that were in the number one position in the US rankings table. A simple WHERE clause is apt for this.

Step 3: Define a new column using SUM() OVER (PARTITION BY ) clauses

Next, we will create a subset of the latest result table containing only the US track names. A tricky little thing we will be using to achieve it is using an OVER (PARTITION BY) clause.

An OVER (PARTITION BY) clause helps us specify the columns on which we will perform windows functions. In our case, we are going to use a SUM function to aggregate the data. We will perform aggregation on the track name column so that we can find the sum of the number of times a US number one track has been number one worldwide as well.

Before moving on to the next step, let us break this step apart into smaller steps.

3.1 Check if the US #1 track is also #1 in the worldwide rankings.

A CASE statement will do the trick. CASE statements are basically if-then statements. When the WHEN condition goes through, the THEN parameter is returned, otherwise, the ELSE parameter is returned.

We will need to put a condition on the ‘position’ column of the worldwide rankings table in that it will return the value ‘1’ when its ranking is indeed number 1; else we will return the value ‘0’. We are using numerical values in this CASE statement since we will end up adding them to get the total number of days the track has stayed in that position.

3.2 Get the sum of the number of times it has occurred.

We will save the value of the SUM() function as a new column, say ‘n_days_on_n1_position’. As a result, we will have a sum value of the number of tracks a number one US track has been number one in both the tables on the same dates. We will use this query as a temporary table before proceeding toward displaying the final table.

Step 4: Select the track name and perform the MAX() function on the temp table.

In this step, we will basically select the maximum value of the ‘n_days_on_n1_position’ column as well as the corresponding track name to be displayed in the final result table.

Also, since we are using an aggregate function, we’ll couple it with a GROUP BY clause at the end, and we are grouping by track name in our case. This way, we only have one row per track with the maximum value displayed beside it.

Step 5: Order the track name column alphabetically.

As the question has suggested, we will order the output table by track name alphabetically.

3. Coding the Solution

Let’s get right into coding without further ado.

Step 1: Merge the two tables in an Inner Join on the track name and date columns.

Let us begin with selecting the track names from the US rankings table and seeing it in the console.

SELECT trackname
FROM spotify_daily_rankings_2017_us

Let us now use an inner join on the common columns - trackname and date - to merge the two tables.

SELECT us.trackname
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date

The merged table looks like this:

An error loading the results has occurred

Step 2: Filter out the US tracks that are in position #1.

We can add a WHERE clause at the end of the query to filter out the tracks that were in the first position alone.

SELECT us.trackname
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1

And the output appears to be the same, meaning that all four tracks that were common in both tables are all number one tracks.

An error loading the results has occurred

Step 3: Define a new column using SUM() OVER (PARTITION BY ) clauses

First, let us add a new column to the table named ‘n_days_on_n1_position’.

SELECT us.trackname,
       n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1

3.1 Check if the US #1 track is also #1 in the worldwide rankings.

Write out the skeleton of the OVER (PARTITION BY ) clause first, and then we will add the conditions and parameters slowly.

SELECT us.trackname,
     	(CASE 
	WHEN 		THEN 
END)		
 OVER(PARTITION BY )  AS n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1

3.2 Get the sum of the number of times it has occurred.

Insert windows function SUM() around the CASE statement.

SELECT us.trackname,
       SUM(CASE 
		WHEN 	THEN 
END) OVER(PARTITION BY )  AS n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1

We’ve got the skeleton ready to add our logic in. So, when the position is number 1 from the worldwide rankings table, then we will return 1, or else we will return 0. Let us now insert these parameters into our CASE statement.

SELECT us.trackname,
       SUM(CASE
       WHEN world.position = 1 THEN 1
       ELSE 0
       END) OVER(PARTITION BY )  AS n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1

Also, since we are performing these functions on the US track names that we had selected in the first line of the query, we will insert the column name within the PARTITION BY clause.

SELECT us.trackname,
       SUM(CASE
       WHEN world.position = 1 THEN 1
       ELSE 0
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1

Now, we can run this query, and the output is as below:

An error loading the results has occurred

You can find multiple entries for the track ‘HUMBLE.’ because we have used the OVER(PARTITION BY column_name ) clause, which returns all the records upon which the function was performed, including the duplicates.

Step 4: Select the track name and perform the MAX() function on the temp table.

We definitely need to use a MAX() function on the ‘n_days_on_n1_position’ column to pick only the largest value for each track name, denoting the number of days in the number one position. But first, we need to convert the query we’ve drafted so far into a temporary table named ‘tmp’.

In addition to that, since we have used an aggregate query, we will group the table by track name.

SELECT tmp.trackname,
        MAX(n_days_on_n1_position) AS n_days_on_n1_position
FROM 
(SELECT us.trackname,
       SUM(CASE
       WHEN world.position = 1 THEN 1
       ELSE 0
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1) tmp
GROUP BY trackname

Let us run the query now and take a look at the output.

An error loading the results has occurred

Step 5: Order the track name column alphabetically.

Finally, we will order the table by the track name alphabetically, as the question suggests we do.

SELECT tmp.trackname,
        MAX(n_days_on_n1_position) AS n_days_on_n1_position
FROM 
(SELECT us.trackname,
       SUM(CASE
       WHEN world.position = 1 THEN 1
       ELSE 0
       END) OVER(PARTITION BY us.trackname)  AS n_days_on_n1_position
FROM spotify_daily_rankings_2017_us us
INNER JOIN spotify_worldwide_daily_song_ranking world
ON world.trackname=us.trackname
AND world.date = us.date
WHERE us.position = 1) tmp
GROUP BY trackname
ORDER BY trackname

The final result is as shown below:

From the final result, we can infer that there are two tracks that were in the number one position in both the US and worldwide rankings tables. The first track, ‘Bad and Boujee (feat. Lil Uzi Vert)’, was number one for a day in both lists, and the second track, ‘HUMBLE.’, had three days of glory as number one.

Conclusion

It was a very interesting problem where we used multiple advanced SQL clauses such as windows function OVER (PARTITION BY) and CASE statements, join, and aggregations using SUM() and MAX() functions. I hope you enjoyed working on the problem as well. Explore our platform for more Data Science related SQL interview questions and walkthroughs.

Also, check out our article "Spotify Data Science Interview Questions" if you want to practice more questions from Spotify.