YELP Data Science Interview Question
This SQL data science interview question was asked by Yelp.
The SQL solution for this Yelp data science interview question itself is easy to implement but there’s a tricky assumption most people get wrong. Can you spot it?
Usually, when trying to find the top result of something, most people will end their SQL solution with an ORDER BY and LIMIT 1 to get the top result. But there is an edge case where the top result is actually tied so there are many “top results”. A LIMIT 1 would not work in this case and an INNER JOIN is required to get to the right solution. This question covers concepts that are commonly found in data science interviews at Yelp, Facebook, and Google.
YELP Data Science Interview Question
Top Cool Votes
Find the business and the review_text that received the highest number of 'cool' votes. Output the business name along with the review text.
Below is the link to YELP Data Science Interview if you want to follow along with me
Exploring the Data
The first thing we should do is take a look at the underlying data.
And if we preview the data, we get something that looks like below:
If you look at the column names, you have the business name and the review_text which is what we want to output in our answer. And what we're trying to do is find the businesses that have the most 'cool' votes.
If we scroll down, we'll see that most of the businesses have zero, there's a few that have two, but there's probably businesses in the dataset that have many cool votes.
So there's an assumption to this Yelp data science interview question.
Assumptions and Tricky Edge Case
There could be multiple businesses with max 'cool' votes. There could be a tie across multiple businesses. So, what we want to do is ensure that we capture all of those businesses.
A lot of people will do an ORDER BY and then LIMIT 1, which gives you the top business with the most 'cool' votes but in this case, you can have multiple businesses with the same number of 'cool' votes. The control for that, we'll just use an INNER JOIN and I'll show you how to do that.
Developing the Solution Framework
So what we want to write first is the assumption that says that there could be multiple businesses with the same number of cool votes. Then we'll be using a subquery to identify the max number of cool votes first and then we're going to use an inner join to filter out businesses that don't have the max number of cool votes.
-- assumption: multiple businesses with the same number of cool votes. -- use subquery to identify the max number of cool votes first -- use an inner join to filter out businesses that don't have the max number of cool votes
SUBQUERY - SQL Query to Identify Max Cool Votes
To identify the maximum number of cool votes in our dataset, this is the SQL query that we're going to implement.
SELECT max(cool) as max_cool FROM yelp_reviews
It's as simple as this! If we run this code, we're going to see that the max cool boat is 10.
So, businesses with 10 cool votes have the highest number of cool votes compared to all of the other businesses in the dataset. This will be our subquery.
INNER JOIN to Filter Max Cool Votes
The next thing is to use an inner join to filter out businesses that don't have the max number of cool votes or conversely filter businesses that actually match 10 cool votes. So, now we'll implement the inner join:
FROM yelp_reviews yr INNER JOIN ( SELECT max(cool) as max_cool FROM yelp_reviews) mc
The INNER JOIN as I mentioned will be the subquery. We alias the subquery table with MC.
Now to complete the INNER JOIN, we'll specify the keys. What we need to do is join on basically the amount of cool votes. In this case, we have 10 as output, therefore we have to filter for businesses that have 10 cool votes.
FROM yelp_reviews yr INNER JOIN ( SELECT max(cool) as max_cool FROM yelp_reviews) mc ON yr.cool = mc.max_cool
In the above code, COOL is the column, and then in the MC (the subquery table), max_cool is the value 10. That's how we will be filtering out businesses.
Now as we have our JOIN clause done, all we have to do is display and output the business name along with the review text. So, we'll put the business_name along with the review_text.
SELECT business_name, review_text FROM yelp_reviews yr INNER JOIN ( SELECT max(cool) as max_cool FROM yelp_reviews) mc ON yr.cool = mc.max_cool
Let's run this line of code.
We got two businesses and their review texts. If we go to check the solution, it says our solution is correct.
If we want to double-check everything, what we could also do is output the number of cool votes. That could be:
SELECT business_name, review_text, yr.cool FROM yelp_reviews yr INNER JOIN ( SELECT max(cool) as max_cool FROM yelp_reviews) mc ON yr.cool = mc.max_cool
And then I see that the number of cool votes does in fact match at 10.
And the number of cool votes is in fact 10, which matches the max(cool) function where we showed that the output was 10.
It's a difficult question mainly because it's not obvious that you would want to use an INNER JOIN to filter out businesses, but because there are multiple businesses that could in fact have the same amount of max cool votes. You need to use an INNER JOIN so that you can filter out businesses that don't match that criteria and just keep the multiple businesses that do.