Spotify Data Science Interview Question Walkthrough

Spotify Data Science Interview Question

The Best Artist

Spotify

MediumID 9744

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the number of times an artist has been on the billboard top 100 in the past 20 years. Output the result alongside the artist's name and order records based on the founded count in descending order.

Table: billboard_top_100_year_end

Link to the Question: https://platform.stratascratch.com/coding/9744-artist-of-the-decade

Follow along with us at StrataScratch as we walk through how to first approach and then solve this Spotify data science interview question along with Frederik Müller. Please note that this interview question has been modified since the date of this recording. The approach is still the same, but the output accounts for the last 20 years of history instead of 10.

Skills Used in this Spotify Data Science Interview Question

The following skills will be used in the approach to solving this problem:

Arithmetic operations
COUNT() / GROUP BY
Built-in date functions

Approach

The first approach to any interview data science interview question involves a few initial steps:

Gathering and reviewing all data provided
Noting the expected output, and
Clarifying whether there are any uniquities or challenges that must be addressed in the solution

The data for this problem is all present in one table: billboard_top_100_year_end. This table provides a list of all songs that have hit the billboard top 100 along with the year, the artist name, the group name, and the song’s ranking.

Table: billboard_top_100_year_end

year	year_rank	group_name	artist	song_name	id
1956	1	Elvis Presley	Elvis Presley	Heartbreak Hotel	1
1956	2	Elvis Presley	Elvis Presley	Don't Be Cruel	2
1956	3	Nelson Riddle	Nelson Riddle	Lisbon Antigua	3
1956	4	Platters	Platters	My Prayer	4
1956	5	Gogi Grant	Gogi Grant	The Wayward Wind	5

Our expected output is a list of artist names along with the total number of times they showed up in the billboard top 100 chart within the last 20 years. The list should be ordered by the number of appearances, highest to lowest. In the data set provided, there is already a column of artist names present, which we will use as our first returned column. Because it’s possible that an artist may be part of multiple groups in addition to a solo career, this is the reason for choosing the artist column for our query instead of the group_name column. The second column returned by the query will be the number of times each artist has appeared in the list. This will require use of the COUNT() function. Here is the base knowledge we will use to build up our query.

Building the Query

Solving a Spotify Data Science Interview Question

As the query is built below, follow along with each step by running the code provided to see how the query evolves.

We first find all artists in the table and return the total number of times that each artist is present. The brief also tells us to order the results by frequency of appearance, so let’s order by column 2, which is the result of COUNT(*) :

SELECT artist,
       COUNT(*) as count_20yrs
FROM billboard_top_100_year_end
GROUP BY artist
ORDER BY 2 DESC

Since this Spotify data science interview question asks only for the results from the last 20 years*, a WHERE clause is added to filter the results. At this point the goal is to test for accuracy–the query will be refined later.

SELECT artist,
       COUNT(*) AS count_20yrs
FROM billboard_top_100_year_end
WHERE YEAR >= 2002
GROUP BY artist
ORDER BY 2 DESC

*Please note the year in the example is reliant upon the release date of this article.

Here are the results of the query as it currently exists:

Solution Output for Spotify Data Science Interview Question

Refining the Query

Now that the query has an accurate solution, it is time to refine the WHERE clause for better applicability. The goal of writing an effective query is to ensure that the code is “future-proof”, i.e. the query should work as expected as time moves forward. Our output should only include data from the last 20 years of billboard history, so we’ll isolate the current year in the query using DATE_PART() and CURRENT_DATE:

DATE_PART(‘YEAR’, CURRENT_DATE)

To find results from the last 20 years, we subtract the value in the year column from the current year using the - operator and only display the row if the result is <= 20. Finally, best practices include specifying the column name instead of the asterisk(*) [all columns] wherever possible. Though COUNT(*) works in this instance because all rows in the table are unique, specifying the artist column helps with readability and processing time:

SELECT artist,
       COUNT(artist) AS count_20yrs
FROM billboard_top_100_year_end
WHERE DATE_PART('year', CURRENT_DATE) - YEAR <= 20
GROUP BY artist
ORDER BY COUNT(artist) DESC

As we can see below, the output here is the same as the previous, unrefined code, though this modification ensures the results will remain accurate as time passes.

Solution for Spotify Data Science Interview Question

Conclusion

Through the course of this article and the affiliated video, we’ve walked through the steps taken to evaluate this Spotify data science interview question, the data provided, assumptions about those data, and worked through the resolution. Once the question was solved, we walked back through the solution and refined the query for future readability. We used COUNT() to find a count of all of the artists, DATE_PART() to isolate the year value from CURRENT_DATE, and the subtraction (-) and comparison (<=) operators to produce the final result. Check out our post "Data Science Interview Questions From Top Companies" to practice more such questions from other top companies.

Spotify Data Science Interview Question Walkthrough

Categories

Spotify Data Science Interview Question

The Best Artist

Skills Used in this Spotify Data Science Interview Question

Approach

Building the Query

Refining the Query

Conclusion

Latest Posts:

From Arrays to Lists: Converting Numpy Arrays with Python

Tips for Crafting an Impressive Data Analyst Portfolio

PySpark GroupBy Guide: Super Simple Way to Group Data

Share

Follow

Categories