Data Science SQL Interview Questions from FAANG Companies
Key concepts and ideas you need to know to prepare for data science SQL interview questions, with a focus on FAANG companies
Structured Query Language, more commonly referred to as SQL, is one of the most powerful data tools out there, giving us the ability to work with and analyze large amounts of data at scale. As such, any company that uses large amounts of data benefits from employees with strong SQL fundamentals. SQL sees widespread use at every level of a company, and being able to use it establishes basic competence when working with data. In any role even tangentially related to data, basic SQL capability creates additional value in a candidate: being able to query data for your own needs is a valuable skill, even in less data-focused roles such as product manager or business analyst. Meanwhile, in any Data Science role, strong SQL skills are almost a hard requirement.
Of course, there are many other methods to store and retrieve data en masse, including Spark, Hadoop/Hive, Snowflake, the Python library Pandas, and even Excel and Google Sheets. However, each comes with its own set of limitations, and none has surpassed SQL in flexibility, scalability, and ease of use as the Data Science industry standard.
A wide range of question types can show up in an interview setting, but SQL is a staple you can almost always expect to be tested on, particularly for data-focused roles. We're going to go over the key concepts and ideas you need to know to prepare for these data science interview questions, with a focus on FAANG companies, before applying them as we walk through a few SQL questions asked in actual interviews.
Let's get into it.
Data Science SQL Interview Concepts and Questions
The Big Tech companies, commonly referred to as FAANG (Facebook, Apple, Amazon, Netflix, Google), are some of the biggest employers of Data Science roles, and anyone interested in working in the data industry will naturally look at roles at these companies. But how are the SQL questions asked in FAANG interviews different from those asked at other companies? Generally, FAANG companies organize their interview questions as case studies that test your analytical skills in the context of their product. For example, Google might ask SQL questions analyzing usage of its search engine or email accounts, while Facebook might cover platform activity such as user comments and friend relationships.
At any company you interview with, it's important to come in at least moderately comfortable with that company's platforms and products. For FAANG companies, with their product-focused case studies, that is doubly true. While you may be able to reasonably answer a standard SQL interview question on sheer knowledge of the code, understanding the context you are working in, and thus being able to discuss the scenario, trade-offs, and even potential edge cases in your solution, creates a much stronger impression in your interview.
Types of Data Science SQL Interview Questions
Let's drill down into the SQL-specific questions asked in interviews. SQL interview questions can generally be broken down into three categories: basic definitions, reporting, and abstract problem solving.
Basic SQL Definitions
The most basic SQL questions asked in interviews are definition-based. These are straightforward questions that test your knowledge of SQL basics, such as using common functions like SUM, MIN/MAX, and COUNT. Commonly, these questions ask for basic definitions of SQL functionality, such as differentiating between the different types of joins, or explaining how joins differ from unions. They can also ask you to write short queries, pulling and compiling basic information from a small set of tables with simple functions.
Oftentimes, these early questions walk you through creating the basic queries that become the framework for more complicated questions further down the line.
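For example, a definition question about join types is often easiest to answer with a quick sketch. The users and orders tables below are hypothetical, purely for illustration:

-- An INNER JOIN keeps only users that have at least one matching order:
SELECT u.id, u.name, o.amount
FROM users u
INNER JOIN orders o ON o.user_id = u.id;

-- A LEFT JOIN keeps every user, returning NULL amounts for users with no orders:
SELECT u.id, u.name, o.amount
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;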
Reporting and Insights
Next up are SQL reporting questions, which test your ability to write the queries you would need in day-to-day data analysis work: queries that identify and pull key metrics, such as common reporting Key Performance Indicators (KPIs) like daily active users (DAU), monthly active users (MAU), growth rates, percentage changes over time, and retention.
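For instance, a DAU query can be as simple as counting distinct users per day. A minimal sketch, assuming a hypothetical events table with user_id and event_date columns:

-- Daily active users: the number of distinct users seen on each day
SELECT event_date,
       COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY event_date
ORDER BY event_date;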
The earlier basic SQL definitions, including aggregations such as MIN and MAX, will be required, as will the ability to join datasets and create your own views with Common Table Expressions (CTEs) and subqueries. At a more advanced level, reporting questions often test percentiles and rankings, with functions such as NTILE and RANK, usually coupled with window partitions.
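As a quick illustration, here is a sketch over a hypothetical sales table with region, seller_id, and revenue columns, ranking sellers within each region and splitting them into revenue quartiles:

SELECT region,
       seller_id,
       revenue,
       -- rank sellers within their own region, highest revenue first
       RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank,
       -- split each region's sellers into four revenue quartiles
       NTILE(4) OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_quartile
FROM sales;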
Depending on the structure of the interview, these can build on the earlier basic SQL definition questions, layering complexity onto those previously covered concepts. Here, it is important to understand what data you are given and what output you are looking for. Multiple tables will require joins, while limiting factors in the question mean you need to work out which filters to apply.
In a previous blog post, we went more in-depth into the types of SQL Interview Questions to prepare for. We also previously wrote about Types of Window Functions in SQL asked by some popular companies.
Problem Solving with SQL
At the highest, most advanced level are the problem-solving SQL questions. These are open-ended, often ambiguous questions that test how you would approach solving a problem with SQL. They resemble the reporting questions above, but with the added complexity of ambiguity: you need to consider not only the SQL output you want to report, but also exactly what data you need and how to pull it.
These types of interview questions can often be further subcategorized as ETL and database design questions. ETL stands for "extract, transform, load": a general data processing procedure of copying data from one source into a different source or context, with the requisite extraction, transformation, and loading steps. Database design is the organization of data according to a database model, created by classifying the data and identifying the relationships between those data points.
ETL can be the start of building a data science infrastructure, with the first step of collecting data and transforming it into a usable format. While these concepts are most often tested in data engineering specific interviews, most data scientists are still expected to have solid fundamentals on how to, at minimum, create, insert, and update tables. Working with APIs would be an additional step in skill development, and lays the groundwork for the next step in the data science workflow of creating models and dashboards.
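As a refresher on those table fundamentals, here is a minimal sketch of the create/insert/update cycle; the daily_metrics and events tables are hypothetical:

-- Create a destination table for a daily metric
CREATE TABLE daily_metrics (
    metric_date date,
    active_users integer
);

-- Load a transformed result into it (the "L" in ETL)
INSERT INTO daily_metrics (metric_date, active_users)
SELECT event_date, COUNT(DISTINCT user_id)
FROM events
GROUP BY event_date;

-- Correct a value after the fact
UPDATE daily_metrics
SET active_users = 0
WHERE metric_date = '2020-01-01';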
Problem-solving SQL questions are an order of magnitude more complicated than the reporting questions: beyond querying and returning data within established parameters, you must also take additional steps to transform the data between different sources and contexts, sometimes even designing an entire data pipeline. Of course, given the limited timeframe of an interview, the complexity of any answer will be naturally limited, covering only general steps or broad theoretical strokes.
Check out our video on Solving Complex SQL Problems. We also talk about the Data Science infrastructure pipeline in our video on The One and Only Project You Need.
Finally, remember that there are also non-SQL questions asked throughout the interview process. These include personal or behavioral questions, product sense and business cases, data analysis and coding in other languages such as R and Python, and artificial intelligence and modeling questions. We cover the overarching interview process at FAANG companies in our Ultimate Guide to the Top 5 FAANG Companies.
Data Science SQL Interview Questions Asked at Individual Companies of FAANG
Obviously, each company has different focuses and emphases depending on its values and products, not to mention the specific team and role. We will walk through a few of the questions asked in interviews at FAANG, some of the largest data science employers in the industry. While we go over the solutions for some of the questions, be sure to try them out for yourself, or at least take a minute to fully consider how you would approach the problem, before diving into the solutions.
Facebook
The social media conglomerate asks a lot of product sense questions in addition to its SQL questions. The various social media platforms under the Facebook umbrella mean the interview questions focus on metrics vital to those platforms, particularly user activity. Let's take a look:
Number of Comments Per User in Past 30 days
Interview Question Date: January 2021
Return the total number of comments received for each user in the 30 or less days before 2020-02-10. Don't output users who haven't received any comment in the defined time period.
Link to the question: https://platform.stratascratch.com/coding/2004-number-of-comments-per-user-in-past-30-days
This data science SQL interview question from Facebook tests your ability to filter data based on date conditionals. Check out the video below for some tips on how to approach the solution.
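If you want to attempt it first, one possible sketch (not necessarily the official solution) sums comments per user over the 30 days leading up to 2020-02-10, using the fb_comments_count table that also appears in a later question; the exact window boundaries are worth confirming with your interviewer:

SELECT user_id,
       SUM(number_of_comments) AS number_of_comments
FROM fb_comments_count
-- users with no comments in the window have no rows here,
-- so they are excluded automatically
WHERE created_at BETWEEN DATE '2020-02-10' - 30 AND DATE '2020-02-10'
GROUP BY user_id;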
Share of Active Users
Interview Question Date: February 2021
Output share of US users that are active. Active users are the ones with an "open" status in the table.
Link to the question: https://platform.stratascratch.com/coding/2005-share-of-active-users
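This one can be as short as a single aggregate. One possible sketch, reading "share" as the fraction of US users whose status is open (the fb_active_users table is shown below):

SELECT AVG(CASE WHEN status = 'open' THEN 1.0 ELSE 0 END) AS active_users_share
FROM fb_active_users
WHERE country = 'USA';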
Rank Variance Per Country
Interview Question Date: February 2021
Which countries have risen in the rankings based on the number of comments between Dec 2019 vs Jan 2020? Hint: Avoid gaps between ranks when ranking countries.
Link to the question: https://platform.stratascratch.com/coding/2007-rank-variance-per-country
First, let's take a look at the tables we are given:
fb_comments_count:

| user_id | created_at | number_of_comments |
|---------|------------|--------------------|
| 18 | 2019-12-29 | 1 |
| 25 | 2019-12-21 | 1 |
| 78 | 2020-01-04 | 1 |
| 37 | 2020-02-01 | 1 |
| 41 | 2019-12-23 | 1 |
fb_active_users:

| user_id | name | status | country |
|---------|------|--------|---------|
| 33 | Amanda Leon | open | Australia |
| 27 | Jessica Farrell | open | Luxembourg |
| 18 | Wanda Ramirez | open | USA |
| 50 | Samuel Miller | closed | Brazil |
| 16 | Jacob York | open | Australia |
Is there anything we would need to clarify in order to answer this data science SQL interview question? The phrasing can be a little confusing, so in an interview setting, if you are even a little bit unsure, ask clarifying questions. "Risen in the rankings based on the number of comments" between two months means a country whose users made more comments in the later month than in the earlier one. For example, Australian users made 5 comments in December 2019 but 6 in January 2020, so Australia's rank would rise. (Check the full tables out for yourself on the StrataScratch platform!)
Aside from that, consider if there are any edge cases that we might need to deal with. One big issue might be if there isn't any location data for a user, resulting in null values for the country column. We can handle this by filtering out rows where that is the case. There may also be countries that had no comments at all in the former month (December 2019), but did have comments in the latter (January 2020). Depending on how we create the joins, this will create an additional null value that we will need to account for in the final steps.
Now, let's think about what the output should look like. We're looking for a simple list of countries, but that list will be based on hidden numbers that filter for the condition of having "risen in the rankings based on the number of comments" between two months. So what do we need to get there?
First, we would need to isolate the number of comments made in each individual month. Fortunately, the number of comments has already been aggregated for us in the table fb_comments_count, so this initial step will be a straightforward join between the two tables. Since the individual users don't matter, we will group by country and sum together the number of comments.
SELECT
country,
sum(number_of_comments) as number_of_comments_dec
FROM fb_active_users as a
LEFT JOIN fb_comments_count as b
on a.user_id = b.user_id
WHERE country IS NOT NULL
GROUP BY country
Note that given the phrasing of the question, and for simplicity's sake, we will assign rankings to each country based on its summed number of comments. The hint tells us to avoid gaps between ranks, so instead of the RANK function we will use DENSE_RANK. Both functions assign tied values the same rank, but after a tie, DENSE_RANK continues with the next consecutive ranking value, while RANK skips ahead by the number of tied rows. For example, RANK will go [1, 2, 2, 4, 5], while DENSE_RANK will go [1, 2, 2, 3, 4].
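To see the difference concretely, here is a quick sketch against a hypothetical scores table holding the values 10, 9, 9, 8, and 7:

SELECT score,
       RANK() OVER (ORDER BY score DESC) AS score_rank,              -- 1, 2, 2, 4, 5
       DENSE_RANK() OVER (ORDER BY score DESC) AS score_dense_rank   -- 1, 2, 2, 3, 4
FROM scores;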
Let's add that to the query now.
SELECT
country,
sum(number_of_comments) as number_of_comments_dec,
dense_rank() over(order by sum(number_of_comments) DESC) as country_rank
FROM fb_active_users as a
LEFT JOIN fb_comments_count as b
on a.user_id = b.user_id
WHERE country IS NOT NULL
GROUP BY country
However, this current query returns an aggregate of all user comment activity, while we actually want to compare two separate timeframes. Next up, we will separate the comments of December 2019 and January 2020. We can use the WITH clause to create two separate tables of the individual months' user comment activity. For example, the December 2019 activity would look like:
SELECT
country,
sum(number_of_comments) as number_of_comments_dec,
dense_rank() over(order by sum(number_of_comments) DESC) as country_rank
FROM fb_active_users as a
LEFT JOIN fb_comments_count as b
on a.user_id = b.user_id
WHERE created_at <= '2019-12-31' and created_at >= '2019-12-01'
AND country IS NOT NULL
GROUP BY country
The next step is to carry the above code from December 2019 over to January 2020 by adjusting the date values in our WHERE statement, and then combine the two segments. Before we get there, we want to build out the full skeleton of our query, so we will temporarily use placeholder pseudocode for the individual months' WITH statements for readability. Note that the December and January tables will have the same columns and can be joined on the unique country name. In addition, we are looking for cases where the January 2020 ranking is higher than the December 2019 ranking, which in numerical terms means a smaller value.
WITH dec_summary as (CODE BLOCK),
jan_summary as (CODE BLOCK)
SELECT j.country
FROM jan_summary j
LEFT JOIN dec_summary d on d.country = j.country
WHERE (j.country_rank < d.country_rank)
Now one last step before we put all the fully written code back together: as we mentioned earlier, we also want to capture the edge case where there is no user comment activity at all in the earlier month, December 2019. Since we start FROM the jan_summary table, the December values that we LEFT JOIN onto the January summary will be null for any such country. Therefore, we must add that null condition as an OR, which results in:
WITH dec_summary as (CODE BLOCK),
jan_summary as (CODE BLOCK)
SELECT j.country
FROM jan_summary j
LEFT JOIN dec_summary d on d.country = j.country
WHERE (j.country_rank < d.country_rank)
OR d.country is NULL
Finally, let's put that all together!
with dec_summary as (
SELECT
country,
sum(number_of_comments) as number_of_comments_dec,
dense_rank() over(order by sum(number_of_comments) DESC) as country_rank
FROM fb_active_users as a
LEFT JOIN fb_comments_count as b
on a.user_id = b.user_id
WHERE created_at <= '2019-12-31' and created_at >= '2019-12-01'
AND country IS NOT NULL
GROUP BY country
),
jan_summary as (
SELECT
country,
sum(number_of_comments) as number_of_comments_jan,
dense_rank() over(order by sum(number_of_comments) DESC) as country_rank
FROM fb_active_users as a
LEFT JOIN fb_comments_count as b
on a.user_id = b.user_id
WHERE created_at <= '2020-01-31' and created_at >= '2020-01-01'
AND country IS NOT NULL
GROUP BY country
)
SELECT j.country
FROM jan_summary j
LEFT JOIN dec_summary d on d.country = j.country
WHERE (j.country_rank < d.country_rank)
OR d.country is NULL
Amazon
The multinational technology company is said to put a strong emphasis on modeling in its interview questions, such as capturing customer behavior and retention. Amazon also has a wide range of products under its company umbrella, though of course the questions you would realistically be asked in an interview will vary depending on the exact team or position you are applying for. That said, let's take a look at a few data science SQL interview questions on Amazon's best-known product area, eCommerce:
Total Cost Of Orders
Interview Question Date: July 2020
Find the total cost of each customer's orders. Output customer's id, first name, and the total order cost. Order records by customer's first name alphabetically.
Link to the question: https://platform.stratascratch.com/coding/10183-total-cost-of-orders
First, let's take a look at the tables we are given:
customers:

| id | first_name | last_name | city | address | phone_number |
|----|------------|-----------|------|---------|--------------|
| 8 | John | Joseph | San Francisco | | 928-386-8164 |
| 7 | Jill | Michael | Austin | | 813-297-0692 |
| 4 | William | Daniel | Denver | | 813-368-1200 |
| 5 | Henry | Jackson | Miami | | 808-601-7513 |
| 13 | Emma | Isaac | Miami | | 808-690-5201 |
orders:

| id | cust_id | order_date | order_details | total_order_cost |
|----|---------|------------|---------------|------------------|
| 1 | 3 | 2019-03-04 | Coat | 100 |
| 2 | 3 | 2019-03-01 | Shoes | 80 |
| 3 | 3 | 2019-03-07 | Skirt | 30 |
| 4 | 7 | 2019-02-01 | Coat | 25 |
| 5 | 7 | 2019-03-10 | Shoes | 80 |
As always, start off by asking yourself some key questions: What should the output of this question look like? Are there any clarifications we want to double-check? The question is explicit about what it wants: the customer ID, customer first name, and total cost of their orders, sorted alphabetically. This should be a straightforward join of the customers and orders tables we are given, where we can group by the customer ID and name, and then sum the total cost of each customer's orders.
Let's put together the join first:
SELECT customers.id,
customers.first_name
FROM orders
JOIN customers ON customers.id = orders.cust_id
Next, we want to aggregate the total cost of each customer's orders, which here is the SUM of total_order_cost. Thus, we group by the other relevant columns and grab the SUM of our order cost field.
SELECT customers.id,
customers.first_name,
SUM(total_order_cost)
FROM orders
JOIN customers ON customers.id = orders.cust_id
GROUP BY customers.id,
customers.first_name
And finally, the question asks us to order by the customer's first name alphabetically, which requires a simple ORDER BY at the end of the query:
SELECT customers.id,
customers.first_name,
SUM(total_order_cost)
FROM orders
JOIN customers ON customers.id = orders.cust_id
GROUP BY customers.id,
customers.first_name
ORDER BY customers.first_name ASC;
And there we have it!
Take a look at a few additional data science SQL interview questions from Amazon below:
Finding User Purchases
Interview Question Date: December 2020
Write a query that'll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.
Link to the question: https://platform.stratascratch.com/coding/10322-finding-user-purchases
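If you want a hint at the shape of one solution: a self-join can pair every purchase with every other purchase by the same user, keeping only pairs that fall within a 7-day window. A sketch, assuming the purchases live in a table like amazon_transactions with id, user_id, and created_at columns:

SELECT DISTINCT a.user_id
FROM amazon_transactions a
JOIN amazon_transactions b
    ON a.user_id = b.user_id
    -- a different purchase, possibly on the same day
    AND a.id <> b.id
    -- that falls within 7 days of the first one
    AND b.created_at BETWEEN a.created_at AND a.created_at + INTERVAL '7 days'
ORDER BY a.user_id;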
Marketing Campaign Success [Advanced]
You have a table of in-app purchases by user. Users that make their first in-app purchase are placed in a marketing campaign where they see call-to-actions for more in-app purchases. Find the number of users that made additional in-app purchases due to the success of the marketing campaign.
The marketing campaign doesn't start until one day after the initial in-app purchase so users that only made one or multiple purchases on the first day do not count, nor do we count users that over time purchase only the products they purchased on the first day.
Link to the question: https://platform.stratascratch.com/coding/514-marketing-campaign-success-advanced
This data science SQL interview question involves handling complex logic across multiple scenarios and edge cases. Check out the video below to see how to work through this advanced SQL interview question.
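As a rough sketch of one possible approach (assuming a marketing_campaign table with user_id, created_at, and product_id columns): window functions can compare each user's overall first purchase date against their first purchase date per product, counting users who first bought some product strictly after their first day:

SELECT COUNT(DISTINCT user_id) AS converted_users
FROM (
    SELECT user_id,
           -- the user's overall first purchase date
           MIN(created_at) OVER (PARTITION BY user_id) AS first_purchase,
           -- the user's first purchase date of this specific product
           MIN(created_at) OVER (PARTITION BY user_id, product_id) AS first_product_purchase
    FROM marketing_campaign
) t
-- a new product bought after day one means the campaign reached this user
WHERE first_product_purchase > first_purchase;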
Check out our article on Amazon Data Scientist Interview Questions. We also talk about the broader interview process in our Amazon Interview Guide.
Google
This multinational technology company is also said to ask many modeling questions, with additional product questions leaning more into the business side of things. Again, the exact type and focus of the questions will vary according to the team and position you are applying for.
Let's take a look at a few SQL data science interview questions from Google.
Total AdWords Earnings
Interview Question Date: July 2020
Find the total AdWords earnings for each business type. Output the business types along with the total earnings.
Link to the question: https://platform.stratascratch.com/coding/10164-total-adwords-earnings
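This one is a straightforward aggregation. A minimal sketch, assuming a table like google_adwords_earnings with business_type and adwords_earnings columns:

SELECT business_type,
       SUM(adwords_earnings) AS total_earnings
FROM google_adwords_earnings
GROUP BY business_type;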
User Email Labels
Interview Question Date: April 2020
Find the number of emails received by each user under each built-in email label. The email labels are: 'Promotion', 'Social', and 'Shopping'. Output the user along with the promotion, social, and shopping mail counts.
Link to the question: https://platform.stratascratch.com/coding/10068-user-email-labels
Again, let's start off by looking at the tables provided:

google_gmail_emails:

| id | from_user | to_user | day |
|----|-----------|---------|-----|
| 0 | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10 |
| 1 | 6edf0be4b2267df1fa | 32ded68d89443e808 | 6 |
| 2 | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10 |
| 3 | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6 |
| 4 | 6edf0be4b2267df1fa | 47be2887786891367e | 1 |
google_gmail_labels:

| email_id | label |
|----------|-------|
| 0 | Shopping |
| 1 | Custom_3 |
| 2 | Social |
| 3 | Promotion |
| 4 | Social |
As usual, we start by understanding what output we are looking for, then getting any clarifications we might need and considering potential edge cases. Here, we want to count emails by their labels, which are stored in a separate table, so we begin with a simple join. As we are focusing on each user and counting the number of emails they received under each label, we can group by our given columns of user and label, and count within each group:
SELECT mails.to_user,
labels.label,
COUNT(*) AS cnt
FROM google_gmail_emails mails
INNER JOIN google_gmail_labels labels ON mails.id = labels.email_id
GROUP BY mails.to_user,
labels.label
Notice that we use COUNT(*) here rather than summing a column: we are counting rows (emails), not adding up values. This joined query can then be used as the table we SELECT from, and we can separately break out the counts by individual label. We want the numbers for each built-in label, which from some basic exploration or general familiarity with Gmail we know are "Promotion", "Social", and "Shopping". From here, we can put together individual counts for each of the different labels, which requires an additional outer GROUP BY similar to the above. We will use some pseudocode for readability before we fully incorporate the above code:
SELECT
to_user,
SUM(CASE
WHEN label = 'Promotion' THEN cnt
ELSE 0
END) AS promotion_count,
SUM(CASE
WHEN label = 'Social' THEN cnt
ELSE 0
END) AS social_count,
SUM(CASE
WHEN label = 'Shopping' THEN cnt
ELSE 0
END) AS shopping_count
FROM (CODE BLOCK)
GROUP BY to_user
ORDER BY to_user
And finally, we write in our initial joined table.
SELECT
to_user,
SUM(CASE
WHEN label = 'Promotion' THEN cnt
ELSE 0
END) AS promotion_count,
SUM(CASE
WHEN label = 'Social' THEN cnt
ELSE 0
END) AS social_count,
SUM(CASE
WHEN label = 'Shopping' THEN cnt
ELSE 0
END) AS shopping_count
FROM (SELECT mails.to_user,
labels.label,
COUNT(*) AS cnt
FROM google_gmail_emails mails
INNER JOIN google_gmail_labels labels ON mails.id = labels.email_id
GROUP BY mails.to_user,
labels.label) base
GROUP BY to_user
ORDER BY to_user
There we have it!
Activity Rank
Interview Question Date: July 2021
Find the email activity rank for each user. Email activity rank is defined by the total number of emails sent. The user with the highest number of emails sent will have a rank of 1, and so on. Output the user, total emails, and their activity rank. Order records by the total emails in descending order. Sort users with the same number of emails in alphabetical order. In your rankings, return a unique value (i.e., a unique rank) even if multiple users have the same number of emails. For tie breaker use alphabetical order of the user usernames.
Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank
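One possible approach: since the question demands unique ranks even for ties, ROW_NUMBER is a better fit than RANK or DENSE_RANK, with the alphabetical tiebreak expressed directly in the window's ORDER BY. Using the google_gmail_emails table from the previous question:

SELECT from_user,
       COUNT(*) AS total_emails,
       -- ROW_NUMBER guarantees unique ranks; ties on the count
       -- are broken alphabetically by sender
       ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC, from_user ASC) AS activity_rank
FROM google_gmail_emails
GROUP BY from_user
ORDER BY total_emails DESC, from_user ASC;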
Check out our Ultimate Guide to Become a Data Scientist at Google. We also talk about the broader interview process in our Google Interview Guide.
We've also covered other advanced SQL questions asked at popular companies in another blog post, Advanced SQL Questions.
Final Reminders
Now that we've gone over a few data science SQL interview questions by individual FAANG companies, let's briefly go over best practices for working on your SQL capabilities in preparation for your big interviews. It's vital to build up your SQL skills, as they can be important for any role that even tangentially handles data. The ability to read and understand other people's code, or even pull your own data, is immeasurably powerful.
A good first step to building up those SQL abilities is being here! Getting practical experience with SQL is immensely helpful, as you practice working with SQL beyond the theoretical framework you built in whatever classroom or bootcamp you came from.
As you practice SQL data science interview questions, start by trying to build out solutions without the code runner. Spend a good 15 minutes or so thinking through the raw code you've written, and do your best to ensure that there aren't any syntax errors or bugs. You can even reinforce the process by physically writing your code on paper instead of working in the code editor. Take a quick look at the hints if you get stuck.
In a practical work environment, you can generally rerun any code you want to test over and over, solving each error as it comes up. Practicing without that safety net is a good way to truly solidify your SQL knowledge, training yourself to produce the most well-rounded first-try answer. This matters because in an interview setting, it's important to give a good, solid answer on your first try. While it doesn't need to be 100% perfect, code littered with small syntax errors doesn't make a good impression on your interviewer.
It's also worth noting that in extreme cases, rerunning code over and over may not even be practical, such as when the database you are working with is so large that each query takes a significant amount of time. Having a large query take a few hours to finish running, only to realize you accidentally did an incorrect join, can waste a lot of time.
Next up, once you are as sure as you can reasonably be in the short timeframe you are working with that there are no more errors in what you've written, do the actual run in the code editor. Look at any errors or issues that pop up that you may have missed in the previous step and make the necessary changes. It's also good practice to take notes on what you miss; as you work through multiple practice SQL interview questions, see if there are patterns that reveal gaps in your knowledge, problematic habits, or other common issues.
Finally, check your work against the other available solutions, which you can use to compare approaches and learn about potential optimizations. Start with the official solution, asking yourself a few questions:
- Are there any ways you could have optimized your code further?
- Are there any inefficiencies in your code?
- How are alternative approaches to the problem different from yours? Are they better methods that you can learn things from?
Furthermore, read through the discussion threads of each problem to identify any topics related to the question, such as edge cases or alternative approaches, or even similar questions to the given scenario.
From there on out, you're well set to ace any SQL question that comes up in a data science interview.