Data Science Interview Guide - Questions from 80 Different Companies
Data science interview guide that includes 900+ real interview questions from 80 different companies in 2020 and 2021
Being called a data scientist is becoming a mark of prestige, and the pool of data scientist roles grows every year. Back in 2012, Harvard Business Review called data scientist the sexiest job of the 21st century, and the growth in industry roles seems to confirm that statement. But how does one pass the rigorous interview process to get a job as a data scientist? We did some research for this data science interview guide to find out.
The data scientist interview process can be very broad and complex. Since your role can incorporate so many areas (depending on the company you work for), the questions asked in interviews are quite diverse. For example, you can go to an interview and be asked questions on statistics, modeling and algorithms, or questions on coding, system design and product. Because the questions are so diverse, we decided to analyze them to help you better prepare for your future interview.
The goal of this data science interview guide is to look at a repository of real interview questions from real companies that we have collected over the years. These questions have been used to conduct an analysis of what an interview consists of at a company. We have reviewed all the applicable questions and present our findings in this article.
Description and Methodology of the Analysis
The research in this data science interview guide identifies how often each type of question is asked in data science interviews, as well as the relationship between companies and the types of questions they ask. It also uses descriptive statistics to examine significant trends among companies, question types and the questions themselves.
The data we have gathered comes from various job search boards and websites as well as company review platforms such as Glassdoor, Indeed, Reddit and Blind App. For the purpose of this research, we have collected 903 different questions over the past 4 years. The 3 most important data points we have gathered from our sources that we will use for this analysis are company name, question type and description of the question(s) asked.
The question type data in our research has been produced by sectioning questions into pre-determined categories. These categories have been produced by an expert analysis of the interview experience description taken from our sources. The categories produced are: algorithms, business case, coding, modeling, probability, product, statistics, system design and technical. We will go into more detail on each category in the section on most tested technical concepts in order to get an understanding of the categorization method.
What Kind of Questions are Being Asked on Data Science Interviews?
Our analysis of 903 different data science interview questions has shown some meaningful insights.
Broken down by category, coding and modeling questions are the most dominant types of questions asked in data science interviews, with more than half of all the questions we analyzed coming from these two areas; we can therefore conclude that data science interviews lean towards demonstrating practical skills. Coding questions are especially prominent, making up more than one third of all questions. This finding is no surprise considering that these are probably the two most important skills a data scientist should master before interviewing. Furthermore, theoretical question types such as algorithms and statistics are asked to a certain extent; 24% of all questions come from these two categories. The other categories are less represented, which is reasonable considering the nature of those question types and of the data scientist role.
Breaking down the questions by the company that asked them gives us further insights. Facebook clearly dominates, with over 20% of all questions coming from this company; no other company comes close to 100 questions, whereas Facebook is only 7 questions short of 200. Furthermore, Facebook has more questions (193) than the next 4 top companies combined (190). Amazon is second on this list with 71 questions, and it is the only company other than Facebook with more than 50 questions. Following Amazon are Goldman Sachs, Google, IBM and Microsoft. The conclusion from this analysis is that big tech companies are generally leading the growth in data science, with Facebook standing out in terms of the number of roles it is hiring for. Note that not all companies from our data set are included in this graph, for ease of reading; however, all the excluded companies had significantly fewer questions than the leaders.
Analysis of FAANG Companies
Due to their size, innovation capabilities and industry leadership in data science as well as tech overall, we will cover Facebook, Amazon, Apple, Netflix and Google in more depth; after all, they would not have their own acronym if they had not been the drivers of change in technology.
When we break down the question categories and percentage of questions appearing from each category, and we separate the results between FAANG and non-FAANG companies, we can see one very clear difference: the tech giants put a lot more emphasis on coding. Eighteen percent more, to be exact. However, non-FAANG companies ask a lot more modeling questions; seventeen percent more. There are no significant variations in any of the other categories.
If we analyze Facebook separately, we can see that it follows a similar trend to the FAANG vs non-FAANG comparison: more coding and less modeling than average. However, Facebook also asks twice as many product questions as the average, which makes knowledge of how its social media platforms work that much more valuable.
When we break down Amazon in a similar fashion, we see a slightly different picture. On top of putting a high emphasis on coding like other FAANG companies, Amazon also puts a lot of emphasis on modeling (24%). Where it lags behind other FAANG members is product questions: while the rest of the companies average 10% of questions from this category, Amazon has none.
Due to the low number of questions we have gathered from Apple (11), this company has questions in only 4 categories. It is interesting to note that even with a smaller sample, the trend of coding making up around 50% of all questions is as true for Apple as it is for FAANG overall.
Google’s breakdown resembles the categorization of all questions more than it resembles the breakdown for FAANG companies. Google asks fewer coding questions but more modeling questions than its FAANG peers. Furthermore, it asks half as many product questions and more than double the business case questions. This could potentially be explained by the diversity of Google’s business operations, where certain roles and organizational structures would require data scientists with a different set of skills.
Due to the low number of questions gathered from Netflix, this company is not further analyzed in this section.
Most Tested Technical Concepts on Data Science Interviews
Here, in this data science interview guide, we will cover the categorization method we used to structure the questions for analysis. Furthermore, we will analyze each category in depth and offer a real image of industry requirements for data science interviews. Finally, we will go through the most tested technical concepts for each of the question type categories we used to structure our research and offer some real-world examples of those concepts.
Coding questions have been identified as all questions that require some sort of data manipulation (through code) to identify insights. For example, a question asking a candidate to write SQL joins would be considered a coding question. Coding questions are designed to test the interviewee’s coding ability, problem solving skills and creativity, usually demonstrated on a computer or a whiteboard. The importance of coding questions in data science interviews cannot be overstated, as the vast majority of data science roles involve coding on a regular basis.
If we look at the graph above, we can see that the emphasis on coding questions varies widely across the industry. Airbnb is the absolute champion, with 94% of all questions in our analysis from this company being related to coding. Large tech giants such as Amazon, Apple and Facebook follow suit, although far below Airbnb. Companies such as Walmart (11%) and Goldman Sachs (15%) seem to put less emphasis on coding compared to our average of 34%.
When it comes to questions categorized under coding, the most prominent concept tested was writing SQL queries with emphasis on writing join statements. With SQL being the most utilized tool in data science, it makes perfect sense why these types of questions are asked most often. For example, a question about joins asked on a Facebook interview was: “What is the difference between left join and right join?”
An answer to this question could be something like: “The main difference between a left join and a right join is which table’s non-matched rows are included. A LEFT JOIN returns all rows from the left table plus the matched rows from the right table, while a RIGHT JOIN returns all rows from the right table plus the matched rows from the left table. Columns from the table without a match are filled with NULL.”
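As a rough illustration of that answer, the sketch below runs both joins with Python's built-in sqlite3 module. The tables users and orders and their rows are made up for this example and do not come from the interview question. Because SQLite versions before 3.39 lack RIGHT JOIN, the right join is written here as the equivalent LEFT JOIN with the two tables swapped.

```python
import sqlite3

# Hypothetical tables: Bob has no order, and order 11 has no matching user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 3);
""")

# LEFT JOIN: every user is kept; users without an order get NULL (None).
left_join = conn.execute("""
    SELECT u.name, o.order_id
    FROM users AS u
    LEFT JOIN orders AS o ON u.user_id = o.user_id
    ORDER BY u.user_id
""").fetchall()

# "users RIGHT JOIN orders" expressed as "orders LEFT JOIN users":
# every order is kept; orders without a user get NULL (None).
right_join = conn.execute("""
    SELECT u.name, o.order_id
    FROM orders AS o
    LEFT JOIN users AS u ON u.user_id = o.user_id
    ORDER BY o.order_id
""").fetchall()

print(left_join)
print(right_join)
```

The left join keeps Bob with an empty order, while the right join keeps order 11 with an empty user name.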
Technical questions have been categorized as all questions that ask for an explanation of various data science technical concepts. Although some of the principles tested are similar to coding questions, technical questions are theoretical and require knowledge of the technology you will be using at the company. For example, a technical question would be to explain the process of creating a table in R without using external files. Knowing the theory behind what you are doing is quite important, which is why technical questions can come up often in interviews.
Due to the lower number of technical questions in our research data (48 questions, or around 5%), not all companies from our analysis had questions categorized under technical. We can see that LinkedIn puts above-average emphasis on technical questions, with 14% of their questions coming from this category, compared to the total average of 6%.
In terms of questions categorized as technical, the most tested area is theoretical knowledge of Python and SQL. With these two languages being dominant in the field of data science (along with R to complement Python), it is no surprise that most interviewers want to test theoretical knowledge in these areas. An example of a real-world technical question from Amazon would be: “What is the difference between a list and an array?”
You could answer this question with the following statement: “The main difference between a list and an array is the type of data they can hold, which in turn affects the operations you can perform on them. Lists serve as containers for elements of different data types, while arrays store elements of only one data type.”
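A small sketch of that difference in Python, assuming the question refers to the built-in list and the standard library's array module (it could equally be about NumPy arrays, which the source does not specify):

```python
from array import array

# A list can mix element types freely.
mixed = [1, "two", 3.0]

# An array is created with a type code ("i" = signed int) and
# only accepts elements of that type.
ints = array("i", [1, 2, 3])
ints.append(4)

try:
    ints.append("five")  # wrong element type
    rejected = False
except TypeError:
    rejected = True

print(mixed, ints.tolist(), rejected)
```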
System design questions are all questions related to designing technology systems. These questions are asked in order to analyze the candidate’s process in solving problems and creating (and designing) systems to help customers/clients. For example, you could be asked to show how you would design a data warehouse for one of the other departments. Knowing system design can be quite important for a data scientist; even if your role is not to design a system, you will most likely play a role in an established system and need to know how it works in order to do your work.
For the same reason as questions categorized as technical (system design comprises 3% of all questions), only a few companies had questions from this area. Walmart is the only organization putting above-average emphasis on system design, with 6% of all its interview questions coming from this category.
Questions categorized under system design cover numerous, completely different topics and tasks, but when it comes to technical concepts tested, the one that stands out is building a database. Since data scientists deal heavily with databases on an everyday basis, it makes sense to ask this question and verify whether a candidate can build a database from scratch. Here is one example question from Facebook uncovered in our research: “Explain the process of designing a relational database for a ride-sharing app.” Since there is such a variety of approaches to answering this question, we will leave you to come up with your own way of designing one.
Statistics interview questions have been categorized as all questions that require knowledge of statistical theory and associated principles. These questions are asked in order to test the interviewee’s knowledge of the foundational theoretical principles used in data science processes. Examples of questions categorized as statistics would be to calculate a sample size or to explain Bayes’ theorem. These questions are especially significant, since every interviewer will appreciate a candidate who understands the theoretical and mathematical background of the analyses being done.
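To make the sample size example concrete, here is a minimal sketch using the common formula for estimating a proportion, n = z² · p(1 − p) / e². The 95% confidence level, 5% margin of error and worst-case proportion p = 0.5 are assumptions chosen for illustration; an interviewer may have a different setup in mind.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence=0.95, margin_of_error=0.05, p=0.5):
    """Smallest n so that a proportion estimate stays within the margin of error."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


n = sample_size_for_proportion()
print(n)
```

With these assumed defaults the function returns 385, the figure often quoted for survey sample sizes.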
Although questions from this category make up about 10% of interview questions on average, there are significant variations among companies when it comes to this topic. Netflix and Lyft lead the pack here, with 33% and 31% of questions coming from this area respectively. Microsoft (24%) and Twitter (22%) are other companies that ask more than double the average share of questions from this category. It is interesting to note that two of the FAANG tech giants, Amazon (7%) and Facebook (6%), are below average in this category.
When it comes to statistics questions, the most mentioned technical concept is sampling and distribution. This is one of the most basic and most commonly used statistical principles, and one that data scientists apply on a daily basis. For example, an interview question from IBM asks: “What is an example of a data type with a non-Gaussian distribution?”
To answer this question, we first need to know what a Gaussian distribution is. Also known as the normal distribution, it is a symmetric, bell-shaped distribution in which a known percentage of the data falls within each number of standard deviations from the mean (about 68% within one and about 95% within two). So, to answer this question, you can mention any kind of data that does not follow a normal distribution. Examples include data following an exponential distribution (such as waiting times) or a binomial distribution (such as the number of successes in a fixed number of trials).
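One quick way to see the difference is to compare skewness: a normal sample is roughly symmetric (skewness near 0), while an exponential sample has a long right tail (theoretical skewness 2). The sketch below uses simulated data with an arbitrary seed and sample size, both of which are assumptions for illustration.

```python
import math
import random


def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(50_000)]
exponential_sample = [rng.expovariate(1.0) for _ in range(50_000)]

normal_skew = skewness(normal_sample)
exponential_skew = skewness(exponential_sample)
print(round(normal_skew, 2), round(exponential_skew, 2))
```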
Probability interview questions are all questions that require theoretical knowledge of probability concepts only. Interviewers ask these questions in order to get a deep understanding of your knowledge of the methods and uses of probability in the complex data studies usually performed in the workplace. For example, you could be asked to determine the probability that two cards drawn from the same deck have the same suit.
Along with system design, probability was the category with the lowest number of questions in our research data, comprising only 3% of all questions. It is therefore no surprise that only 3 companies from our analysis have questions from this area. Goldman Sachs is the only notable outlier here, with 8% of all of their interview questions coming from this category.
Questions related to probability clearly have one technical concept tested the most: the probability of getting a certain card or number from a deck of cards or a roll of dice. This seems to be the most common line of questioning for the majority of companies in our research, as many of them have asked these types of questions. An example of such a probability question, from Facebook: “What is the probability of getting a pair by drawing 2 cards separately in a 52-card deck?”
Here is how you can answer this: “The first card you draw can be any card, so it does not affect the result other than leaving one card fewer in the deck. Once the first card is drawn, 3 of the remaining 51 cards have the same rank and would complete a pair. So, the chance of matching your first card is 3 out of 51. This means that the probability of this event occurring is 3/51 = 1/17, or about 5.88%.”
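The 3/51 result can be double-checked by brute force. The sketch below enumerates every ordered two-card draw from a 52-card deck and counts those with matching ranks; the rank and suit encoding is an arbitrary choice for this example.

```python
from fractions import Fraction
from itertools import permutations

# A deck as (rank, suit) tuples: 13 ranks x 4 suits = 52 cards.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]

# Every ordered way to draw two different cards: 52 * 51 outcomes.
draws = list(permutations(deck, 2))

# A pair means both cards share the same rank.
pairs = sum(1 for first, second in draws if first[0] == second[0])

probability = Fraction(pairs, len(draws))
print(probability, float(probability))
```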
Product interview questions have been categorized as all questions related to evaluating the performance of a product/service through data. An example of a product question would be to explain the design of an A/B test on the new metric in order to see if it captures meaningful social interactions better. Being able to answer questions about a product is significant as it tests your knowledge on being able to adapt and use data science principles in any environment, as is the case with daily work.
Not all companies from our analysis had product questions, as we can see from the smaller graph; most of them did, however, even though product questions comprise only 7% of all interview questions on average. Lyft (25%) and Twitter (22%) are the leaders here, with Facebook and Uber following suit (15% each). It is interesting to note that two of these are ride sharing companies and the other two are social media companies. Goldman Sachs is the only notable underperformer in this category, with only 2% of their questions being related to product.
In terms of questions categorized under product, the most prominent technical concept, repeated in questions from multiple companies, is to identify a company’s product and propose improvements from a data scientist’s perspective. The high variance in technical concepts tested on the product side can be explained by the nature of product questions and the higher level of creativity that is usually required to answer them. An example of a product improvement question would be: “What is your favourite Facebook product and how would you improve it?” Due to the nature of the question, we will let you answer this one on your own as well.
Business case questions have been identified as questions involving case studies as well as generic questions related to the business that would test a data science skill. An example of a business case question would be to determine how many windows there are in New York City, or to use the GPS data from a car to determine the quality of the driver. Knowing how to answer these questions can be enormously important, as some interviewers want candidates to know how to apply data science principles to a company’s specific problems before hiring them.
Since the business case category accounts for only about 4% of all questions, it is no surprise that plenty of companies do not have questions from this area. However, Uber puts an exceptionally high value on this category; 25% of all questions in Uber interviews come from it, more than six times the total average! Twitter is the only other company with a double-digit percentage in this area.
Due to the nature of the question type, we could not really identify a single technical concept which stands out. Since most of the questions categorized here are case studies, each of them is unique in a certain way. However, here is an example of a business case question from Google which is not related to the company, but would test your data science skills: “How many cans of blue paint were sold in the United States last year?”
Answer: “There are 300 million people in the US. Say there are 100 million households, of which 1% need painting; that is 1 million. Say only 1% of those want to paint their houses blue; that is 10,000 houses. If each needs 6 cans, that is 60,000 cans of blue paint for residential painting. Assume there are another 100,000 commercial buildings that are painted blue, and each needs 1,000 cans; that is 100 million cans. Thus, the total would be 100 million + 60,000 cans = 100,060,000 cans.”
Modeling interview questions are categorized as all questions related to machine learning and statistical modeling (regressions). These questions require knowledge of how to use mathematical models and statistical assumptions to generate sample data and make predictions about real-world events. An example of a modeling question would be to explain the difference between L1 and L2 regularization for linear regression. For data scientists going into roles with modeling responsibilities, knowing how to answer these questions is crucial, as it will most likely be closely tied to their performance.
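Writing the estimate out as code makes each assumption explicit and easy to challenge. The sketch below only restates the figures from the answer above; the variable names are ours.

```python
# Residential estimate, using the assumptions from the answer.
households = 100_000_000
share_needing_paint = 0.01   # 1% of households need painting
share_choosing_blue = 0.01   # 1% of those choose blue
cans_per_house = 6

blue_houses = round(households * share_needing_paint * share_choosing_blue)
residential_cans = blue_houses * cans_per_house

# Commercial estimate, using the assumptions from the answer.
blue_commercial_buildings = 100_000
cans_per_building = 1_000
commercial_cans = blue_commercial_buildings * cans_per_building

total_cans = residential_cans + commercial_cans
print(f"{total_cans:,}")
```

Note how the commercial assumption dominates the total, which is the kind of sensitivity an interviewer may ask you to discuss.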
Modeling was the second largest category in our research data, with 20% of all questions coming from here. There is a lot of variation among companies when it comes to modeling questions. Walmart is the absolute leader in this area, with a staggering 56% of all their questions being categorized under modeling. Other companies above average are machine learning tech giants such as IBM, Microsoft and Netflix. It is interesting to note that Facebook does not put a high emphasis on modeling, with only 3% of their questions from this category.
When it comes to questions categorized under modeling, the most common technical concept asked on interviews is regression. Due to the nature of machine learning and how statistical modeling works, it is no surprise that there are lots of questions on regression. One example from Walmart would be the following: “What is the difference between L1 and L2 regularization for Linear regression?”
Here is how you could answer this question: “A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression. The two differ in their penalty term. Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function, whereas Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function. The practical consequence is that Lasso can shrink the coefficients of less important features all the way to zero, thus removing some features altogether, while Ridge only shrinks them towards zero.”
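The shrink-to-zero behaviour can be seen in a simplified one-coefficient setting, which is an assumption made here for illustration rather than a full regression. If b is the unpenalized estimate, minimizing 0.5·(w − b)² + λ·|w| gives the Lasso solution (soft thresholding), and minimizing 0.5·(w − b)² + 0.5·λ·w² gives the Ridge solution.

```python
import math


def lasso_coefficient(b, lam):
    """Minimizer of 0.5*(w - b)**2 + lam*abs(w): soft thresholding."""
    return math.copysign(max(abs(b) - lam, 0.0), b)


def ridge_coefficient(b, lam):
    """Minimizer of 0.5*(w - b)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return b / (1.0 + lam)


unpenalized = [3.0, 0.4]  # hypothetical strong and weak feature coefficients
lam = 0.5                 # hypothetical penalty strength

lasso = [lasso_coefficient(b, lam) for b in unpenalized]
ridge = [ridge_coefficient(b, lam) for b in unpenalized]
print(lasso, ridge)
```

In this sketch the weak coefficient becomes exactly zero under the L1 penalty but stays small and non-zero under the L2 penalty.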
Questions on algorithms are categorized as all questions which require solving a mathematical problem, mostly through code by using one of the programming languages. These questions involve a step-by-step process usually requiring adjustment or computation to produce an answer. An example of an algorithmic question would be to find a square root of a number using Python. These questions are important to test the basic knowledge of problem-solving and data manipulation which can be implemented for complex problems at work.
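For the square root example, one common approach is Newton's method, sketched below. The source does not say which method interviewers expect, so treat this as one possible answer; the tolerance and iteration cap are arbitrary choices.

```python
def newton_sqrt(x, tolerance=1e-12, max_iterations=100):
    """Approximate the square root of x with Newton's method."""
    if x < 0:
        raise ValueError("square root is undefined for negative numbers")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0  # any positive starting point works
    for _ in range(max_iterations):
        # Newton update for f(g) = g**2 - x.
        next_guess = 0.5 * (guess + x / guess)
        if abs(next_guess - guess) <= tolerance * next_guess:
            return next_guess
        guess = next_guess
    return guess


print(newton_sqrt(2.0), newton_sqrt(0.25))
```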
Questions on algorithms comprised, on average, 14% of all questions we have collected. When we look at companies that have questions from this area, we can see that Goldman Sachs is the absolute leader, with 63% of their questions falling under algorithms. Other notable companies are LinkedIn and Spotify, and those are the only three companies that are above 20%. All other organizations had scores around the mean, with ride sharing services Lyft (6%) and Uber (5%) being the poorest performers.
The technical concept tested most in questions categorized under algorithms is solving a mathematical or syntax problem with a programming language. Since the concepts tested under algorithms are intended to demonstrate problem solving of this nature, it makes sense that this is the most common topic. Here is an example: “How would you count the number of occurrences of a letter in a word using Python?”
Here is the approach you should take to answer it. To find a word in a statement or a letter in a word, you need to loop over the string. Let’s say you have the following example:
statement = " I love StrataScratch, it helped me get much better at SQL"
To find the number of occurrences of the letter ‘t’, first loop over the statement. Then compare each character of the statement to the letter ‘t’; if the character is a ‘t’, count one occurrence.
Here is the code in Python:
statement = " I love StrataScratch, it helped me get much better at SQL"
occurrence = 0
for char in statement:  # visit every character in the string
    if char == "t":  # case-sensitive comparison with the letter
        occurrence += 1
print(occurrence)
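In an interview it can also help to mention Python's built-in alternatives to the manual loop. A short sketch, reusing the example statement from above: str.count handles a single letter, and collections.Counter tallies every character at once (lowercased here, which is our own choice, to make the count case-insensitive).

```python
from collections import Counter

statement = " I love StrataScratch, it helped me get much better at SQL"

# Manual loop, written as a generator expression.
manual_count = sum(1 for char in statement if char == "t")

# Built-in string method for counting a single letter.
builtin_count = statement.count("t")

# Counter tallies all characters in one pass.
letter_counts = Counter(statement.lower())

print(manual_count, builtin_count, letter_counts["t"])
```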
This data science interview guide has been written in order to support the research undertaken to understand the types of questions being asked at a data science interview. We have taken the interview questions’ data from dozens of companies over a four-year period and compiled it for analysis. As part of the research process, the questions have been categorized under nine different question types (algorithms, business case, coding, modeling, probability, product, statistics, system design and technical questions).
Our analysis of data has resulted in some interesting findings. We saw that Facebook is a dominant company when it comes to data science interview questions, followed by Amazon. Furthermore, we found out which companies give the most emphasis on coding and algorithm question types, as well as which companies ask the most questions in other analyzed categories. Finally, we looked at the breakdown of FAANG companies and got some interesting insights there.
As part of our analysis, we talked about some of the most common technical concepts from each of the question type categories. For example, we discovered that the most asked statistics questions have to do with sampling and distribution.
This article is intended to serve as a guide, whether you just want to learn more about data science, want to brush up on your skills, or are in the middle of preparing for an interview. We hope you have gained plenty of valuable insights from our research and now feel more comfortable about the data science interview process.