Yelp Data Scientist Interview Question Walkthrough
Helping you through the Yelp SQL interview question.
If you’ve ever visited a new town and wanted to find a good restaurant, you probably know about Yelp. As a business, Yelp is focused on creating and maintaining products that summarize a wealth of information to help customers make informed decisions. It tracks the public profile of thousands of different businesses from all corners of the world.
Organizing all this data is not an easy task, so Yelp is always looking for promising data scientists to join its ranks.
Data scientists at Yelp have to write queries of varying difficulty. To succeed at their job, data scientist candidates must have a strong foundational knowledge of basic and intermediate SQL concepts. Let’s look at the specific concepts you should know to answer technical interview questions.
Basic to Intermediate Concepts Tested in Interviews
The best candidates write queries that answer the question with the least lines of code. This is usually impossible without a deep knowledge of SQL concepts, statements, and clauses.
Going into an interview with solid foundational knowledge can help you land a job. More importantly, mastery over practical and theoretical SQL can help you keep your job once you have it. Employers want someone who writes fool-proof code. Yelp is not an exception to that rule. To write such queries, you need a deep understanding of SQL.
Here are the concepts you should know to answer data scientist interview questions of this type:
The SELECT and FROM statements are usually the first few concepts you learn when you start learning SQL. We use the FROM statement to specify the table we’ll be working with. We use the SELECT statement to view specific columns from the table.
These two statements are easy to master. SELECT and FROM are essential for writing any SQL query, regardless of the difficulty. So, if you have no working knowledge of SQL, these two statements are a good place to start.
The users of Yelp often want to see the most popular businesses. Filtering this data is a big part of Data Scientists' job at Yelp.
The WHERE statement allows you to set a condition. When writing SQL queries, its primary purpose is to ensure that only the records that satisfy the specified criterion will be returned.
There are many operators you can use with the WHERE statement. Specifically, you can use: equals(=), not equals(!=), greater than(>), less than(<), and other variations of the comparison operators. It's easy to use comparison operators for number values.
You should also know how to compare the non-numerical values (using the = or != comparison operators). To write the condition, you must use proper syntax rules. For instance, to compare the column values with a text value, you need single quotes ‘ ‘. To write the condition, you must use proper syntax rules. For instance, to compare the column values with a text value, you need single quotes ‘ ‘. Also, you should know how to sort based on alphabetical order, using letters and different operators.
For instance, you might need to remove the values that start with the letters A, B, or C. The WHERE statement in SQL can be very versatile and applied to many different tasks.
These and similar questions often test your knowledge of intermediate SQL concepts, such as MIN() and MAX() aggregation functions in SQL. In our example, the candidate must find the business with the most ‘cool’ votes. This is the kind of task where the MAX() aggregate function could be useful. It returns the row with the highest value in the specified column. MIN() returns the lowest.
For numerical values, working with these aggregation functions is intuitive. Also, it’s useful to know how MIN() and MAX() work when applied to non-numerical values. For instance, the MIN() function can be used to find the earliest date in a column or the words that start with A or other early letters of the alphabet. The MAX() can do the opposite. Learn more about aggregate functions by reading this article "basic SQL interview questions".
It is common to get SQL interview questions that ask you to find rows with the highest value in a specific column. Understanding this concept is essential to answer the majority of SQL questions asked in data science interviews today. ORDER BY allows you to specify the criteria for the arrangement of rows. This clause is often used with two additional keywords: DESC and ASC. These keywords help you specify whether you want the results to be ordered in an ascending or descending order.
When dealing with numbers, the default behavior of the ORDER BY clause is easy to predict: it’ll sort the values from the smallest number to the highest. A good candidate for a data scientist position should also know how to use ORDER BY with letters and date values.
JOINs are an integral concept in SQL. Any aspiring data scientist should master SQL JOINs before moving on to more advanced concepts. Because of their utility, INNER JOINs can be used to answer a wide variety of SQL questions.
First and foremost, candidates should understand the syntax of writing SQL queries. For instance, the fact that you can select columns from the combined table, not just the one after the FROM statement. Also, they should be able to demonstrate how to SELECT columns from just one table, if necessary.
Data scientists with the actual knowledge of writing SQL queries should understand the importance of writing aliases as well. They should also be familiar with the syntax for writing aliases in SQL, and understand their role in making SQL queries more readable.
Another important aspect of writing INNER JOINs is in the ON statement. Proficient data scientists should understand what it's for and be able to correctly map the values from one table to another. Choosing the correct Join Keys to achieve the desired result is also important.
Candidates should be able to explain the differences between multiple types of JOINs. What happens when the records in two tables don’t have any common values? What does INNER JOIN do? Depending on the level of seniority, a candidate should be able to differentiate between INNER and OUTER JOINs and choose the right type to solve the question at hand.
In this article, we’ll review a question and we'll use the key features of INNER JOIN to arrive at the answer.
Yelp Data Scientist Interview Question Walkthrough
In this article, we will focus on the question asked of the candidates for the data scientist position at Yelp.
Top Cool Votes
This question is currently marked as ‘Medium’ difficulty on the StrataScratch platform. The conditions are fairly simple and clear. The candidate must write a query to find a record with the highest number of ‘cool’ votes, and return it in a specific format.
In this question, the candidates have to work with a single table with 9 columns. They must write the SQL query to find the record (or records) with the highest integer value in the cool column. Then the interviewee must output the respective values of two columns, business_name, and review_text.
The task is simple and there are many ways to get the desired result. So your focus should be to write an optimized SQL query. As long as you understand the principles and statements outlined in this article, you should be able to find a solution that works well and isn’t overly verbose.
The first, and probably the most important step of solving any SQL question is studying available data.
The first thing to do is to scan the column names and their corresponding data type. You can use this opportunity to form a first impression and get into the mind-space to solve the question. If some things are unclear, you can ask the questions to check your assumptions.
Whenever possible, also try to look at the actual records from the table. Seeing the actual data can help you better understand what kind of values you’ll be working with. In addition to the actual data, paying attention to column names and the data types will give you enough information to formulate your plan.
If you’ve analyzed the data but things are still unclear, don’t be afraid to ask to clarify some points. You can get more information by asking specific, indicative questions.
Depending on the formulation of the question, you should be ready to cast the values, format them or manipulate data in any other way. Once you’re done analyzing the table, you should also have a good idea of which columns are important and which can be safely ignored.
Let’s look at the columns of the yelp_reviews table:
- The final solution of this question should return the value in the business_name column. We’ll use the SELECT statement to view this column.
- Our question mentions nothing about identifying every review, so we can ignore the review_id column.
- The question doesn’t require us to identify the users, so we can ignore the user_id column
- We don’t have to use the number of ‘stars’ as a criterion, so the stars column can be ignored.
- Since chronological order is not important to the solution, we can ignore the review_date column
- The final output must include the values from the review_text column, so we are going to need it.
- The question asks us to identify the business with the most ‘cool’ votes, so we will need to use the cool column. The other two columns - funny and useful can be ignored.
This specific question can be solved in many different ways. If you are limited to one solution, but can’t decide between a few good options, you should ask the interviewer. Sometimes the interviewer might prefer standard SQL over many of its dialects.
Once you’ve digested the question and the table with all of its data, you can easily answer even the difficult SQL questions like this one. First, we must formulate the approach for answering the question:
- According to the question description, we are required to output two columns - business_name and review_text. First, we must use the SELECT statement to view them.
- Then we must write the SELECT statement, but this time to view the record with the highest number of votes. We can find this record using the MAX() aggregate function.
- Finally, we write the ON statement to filter out all the businesses that don’t have the highest number of ‘cool’ votes.
You must have a good working knowledge of basic and intermediate SQL concepts to answer this Yelp data scientist interview question. In the first step, we’ll need to use the combination of SELECT and FROM statements. This is the most basic building block of any SQL query, so you should go through this step with ease.
In the next step, we use the MAX() aggregate function to find what is the highest number value in the cool column. The name of the function is self-descriptive: it returns the highest value from the column.
Please note that when performing any type of JOIN, it’s always a good idea to give your tables aliases. This way, you don’t have to type out the table name every time. Aliases can be assigned after each table name. You leave out one space and write the alias name for that table.
Looking at this question, one might wonder: what if two or more businesses share the highest number of ‘cool’ votes? In our solution, we use INNER JOINs to keep the businesses with the highest votes, and filter out the rest. If there are three businesses that share the highest number of votes, our final result will include all three of them.
Common Wrong Solution
As we previously mentioned, knowledgeable data scientists can find multiple ways to solve a problem. One of the most common mistakes is to use ORDER BY and LIMIT statements. With this approach, we write a query to order the values in the ‘cool’ column in decreasing order and use the LIMIT statement to display the first one.
The problem with this approach is that most likely, there will be multiple businesses with the highest number of ‘cool’ votes. According to the question, we need to return all of these businesses, not just one. The LIMIT statement is practically useless if we don’t know how many businesses share the highest number of votes.
SELECT business_name, review_text FROM yelp_reviews ORDER BY cool DESC LIMIT 1
The Right Solution Approach
Write The Subquery to Find Highest Number of ‘cool’ Votes
To begin with, we have to write a subquery that will return the highest value in the ‘cool’ column. We use the AS statement, so we can refer to this value as ‘max_cool’.
SELECT max(cool) AS max_cool FROM yelp_reviews
Running this code will return the following output:
This step helped us solve one piece of the puzzle - we know the highest possible number of ‘cool’ votes for businesses in the yelp_reviews table.
Grabbing the columns of interest
First, we must select the two fields included in our final output. We should return the values from two columns in the yelp_reviews table - these are the business_name and review_text columns. Once finished, the SQL code will look like this:
SELECT business_name, review_text FROM yelp_reviews
This query is going to return all the businesses with the corresponding review_text values. Let’s take a look:
Use INNER JOIN to Filter Out The Businesses
In the final step, we use the INNER JOIN to filter out the businesses that don’t have the maximum number of ‘cool’ votes.
To achieve this, we will INNER JOIN the yelp_reviews table with the previously created subquery.
We will give each table alias to make the query more readable. ‘yr’ for the yelp_reviews table, and ‘mc’ for the subquery.
SELECT business_name, review_text FROM yelp_reviews yr INNER JOIN ( SELECT max(cool) as max_cool FROM yelp_reviews) mc ON yr.cool = mc.max_cool
Finally, we must choose Join Keys and write the condition for the ON statement. We must refer to the cool column from the yelp_reviews table and set it equal to the maximum value.
This is the final answer. If we run the code, the output will look like this:
These two businesses each have the highest number of ‘cool’ votes in the table.
Another Right Solution
Using INNER JOINs to filter out the businesses with less than maximum number of votes is straightforward, but there’s another solution that could be considered even more simple.
The logic is more or less the same: we select the columns we want to view, and conditionally return the record with the highest number of ‘cool’ votes. Just like the solution with INNER JOIN, we use the max() aggregate function to find the highest number of votes. The main difference is that with this approach, we use the WHERE statement to filter the businesses.
This solution answers this Yelp data scientist interview question and handles the edge cases as well. Running this query will return any number of businesses with the highest number of votes. One could argue that since this solution involves writing less lines of code, it’s a more optimized solution.
SELECT business_name, review_text FROM yelp_reviews WHERE cool = (SELECT max(cool) FROM yelp_reviews)
Companies like Yelp are looking for data scientists who write simple, yet effective SQL queries. Preparing well for the interview can significantly increase your chances of getting the data science job. We recommend our post "Best Practices to Write SQL Queries" if you want to know how to structure your code, in particular when it comes to performance and readability.
In this article, we tackled an interesting question asked to the candidates interviewing for data scientist positions at Yelp. It’s not the easiest question to solve, but interviewers use it to gauge the depth of a candidate's SQL knowledge.