How to Guarantee the Right Answers to Data Science Interview Questions
In this technical data science interview questions and answers blog, we'll go through some tips on how to approach interview questions for a data scientist job. We will work through a real interview question and apply the tip to it. Hopefully this helps you get better at interviews, and you can also apply these tips on the job to become a better data scientist, engineer, or analyst.
So, today's tip is on how to guarantee the right answer and solution every single time: clarify all of your assumptions before you write a single line of code. Doing so has three benefits.
1. You're narrowing down the scope of the solution space
Through dialogue and by clarifying your assumptions, what was once a question with many different use cases and edge cases can become one with just a few that you actually have to code up. This potentially makes the question easier for you and increases the likelihood of you getting it right.
2. You're putting accountability on the interviewer
Interviews can be passive at times. The other person might not be very interested and may just be going through the motions. But when you ask questions and try to clarify assumptions, that person becomes more present and more actively engaged in the conversation, and more accountable for keeping you on track toward the solution to any data science interview question.
3. You're showing off your communication skills
As a data scientist especially, you have many stakeholders across different departments and teams, and the better you are at communication, the better your solutions and recommendations can be.
Let's get started and apply this tip to a real data science interview question
Today we're going to use StrataScratch, an online platform that has over a thousand data science interview questions from real companies, to help you prep for a data science interview or just to practice and get better.
So, here's what the UI looks like:
Here’s the link if you want to follow along with me: https://platform.stratascratch.com/coding/10071-hosts-abroad-apartments?python=
So, today's question comes from Airbnb.
And the data science interview question is:
Find the number of hosts that have apartments in countries of which they are not citizens.
You are also given two tables, Airbnb_hosts and Airbnb_apartments, with all of their columns listed in the above screenshot.
Once I have all the information from the interviewer, I would apply today's tip before I answer this data science interview question.
Today's tip is to clarify all of your assumptions before writing a line of code. This will help you understand how to organize the solution and reduce the solution space, so that the solution is concise and answers the question.
What I would do next is go through all of my assumptions with the interviewer.
Assumptions to clarify with the interviewer before we answer any data science interview questions
The first obvious assumption is that, since I have two tables, I am most likely going to be merging them. Otherwise, why would they give me two tables? My assumption is that host_id is the common column that I'm going to use to merge them. What I like to do during an interview is write out my approach before writing a line of code.
So the first assumption I want to clarify is: is host_id the common column to use to merge these two data sets together?
I'm just going to write that down as a comment: ‘merge/join two tables using host_id’.
The obvious follow-up question is what sort of join I would need to merge these two tables together: an inner join, a left join, or a right join.
That is another question that you can ask the interviewer. You ask it not necessarily to get an answer but to understand the underlying data. My assumption is that every host owns at least one apartment.
But there could be, for whatever reason, hosts that don't have apartments: they're listed in the hosts table but have no rows in the apartments table.
Do you want to preserve those host_ids or eliminate them from the output? The answer determines whether I use an inner join or a left or right join.
The answer from the interviewer could be: yes, every host does have an apartment, so we can use an inner join.
That conversation is very important because it shows the interviewer that you understand the differences between a left join and an inner join, not just in how to implement each join but in what happens to the output once you use an inner join versus a left join.
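To make the difference in output concrete, here is a minimal sketch that runs both joins with Python's built-in sqlite3 module. The lowercase table names, the `nationality` and `country` columns, and all of the sample rows are assumptions made up for illustration; they are not the real StrataScratch data.

```python
import sqlite3

# In-memory database with tiny, made-up versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airbnb_hosts (host_id INTEGER, nationality TEXT);
    CREATE TABLE airbnb_apartments (host_id INTEGER, apartment_id TEXT, country TEXT);
    -- Host 3 is listed as a host but has no apartment.
    INSERT INTO airbnb_hosts VALUES (1, 'USA'), (2, 'France'), (3, 'Mali');
    INSERT INTO airbnb_apartments VALUES (1, 'A1', 'USA'), (2, 'A2', 'Spain');
""")

# Inner join: hosts without a matching apartment row disappear.
inner_rows = conn.execute("""
    SELECT h.host_id, a.apartment_id
    FROM airbnb_hosts h
    INNER JOIN airbnb_apartments a ON h.host_id = a.host_id
    ORDER BY h.host_id
""").fetchall()

# Left join: every host is kept; missing apartments show up as NULL (None).
left_rows = conn.execute("""
    SELECT h.host_id, a.apartment_id
    FROM airbnb_hosts h
    LEFT JOIN airbnb_apartments a ON h.host_id = a.host_id
    ORDER BY h.host_id
""").fetchall()

print(inner_rows)  # [(1, 'A1'), (2, 'A2')]
print(left_rows)   # [(1, 'A1'), (2, 'A2'), (3, None)]
```

Host 3 only survives the left join, which is exactly the kind of row the question to the interviewer is about.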
Let's just say for this case we're going to use an inner join.
For my next assumption, I'm going to ask the interviewer about the column ‘apartment_type’ because if we preview this table there are numerous apartment types.
Here, my question is: do we care about apartment type, or are we really just talking about hosts that own any type of apartment?
This is important because it narrows down the solution space even further. We know whether we need to filter by apartment type or can ignore that column altogether, in which case I just write ‘use any value in apartment type’.
Clarifying your assumptions is important not only to reduce the solution space and get to a solution; you're also bringing the interviewer into active dialogue.
This puts accountability on them, because now they have to know what the solution is to be able to answer your questions.
They are present during this data science interview to help you along and to answer any questions you have on the way to that solution. You are less likely to be misguided by the interviewer if that interviewer is paying attention and actively participating.
A third benefit of this back-and-forth with the interviewer is that it shows off your communication skills. As a data scientist or business analyst, you're not just coding all day; you're also talking to different stakeholders in different departments. It's important to be able to communicate effectively and efficiently.
Here, you're showing that skill off as you talk to the interviewer.
Now let's cover the last assumption that I have about this question. This assumption doesn't have to do with the columns of the table. It has to do with the data itself, because the data will inform how I'm going to write the solution.
I would ask the interviewer: if I'm trying to find the number of hosts that have apartments in different countries, is it safe to assume that each host has exactly one property?
That is, one host listed in the host table and one property of theirs in the apartment table. Realistically that doesn't make sense, especially at Airbnb. You can have one host with one property. You can have one host with multiple properties across multiple countries. And you can have one host with multiple passports, so multiple nationalities, and multiple apartments in different countries. Altogether there are four or five different use cases or edge cases that you need to control for.
This is the biggest reason why you want to go down your list of assumptions and get each one clarified, so that you're not going off and writing a solution for every single use case or edge case. This assumption is the trick of the question. Typically, every interview has a trick or an “Aha” moment: a curveball in how you need to write up the solution or answer the question.
What I'll do next is write down all the use cases in the editor here as comments.
This helps me just organize my thoughts and it helps me talk to the interviewer in an organized way.
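As a rough sketch of what that might look like, the block below lists the use cases as comments and then labels which one each host in a tiny dataset falls into. The rows, the `describe` helper, and the label wording are all hypothetical and only meant to illustrate the idea.

```python
# Use cases to confirm with the interviewer:
#   1. one host, one apartment
#   2. one host, multiple apartments across multiple countries
#   3. one host with multiple nationalities and apartments in different countries
#   4. hosts that are listed but have no apartment at all
from collections import defaultdict

# Hypothetical rows: (host_id, nationality) and (host_id, apartment country).
hosts = [(1, "USA"), (2, "France"), (3, "USA"), (3, "Italy"), (4, "Mali")]
apartments = [(1, "USA"), (2, "Spain"), (2, "Italy"), (3, "Italy"), (3, "Brazil")]

nationalities = defaultdict(set)
for host_id, nationality in hosts:
    nationalities[host_id].add(nationality)

countries = defaultdict(list)
for host_id, country in apartments:
    countries[host_id].append(country)


def describe(host_id):
    """Label which use case a host falls into (first matching case wins)."""
    if not countries[host_id]:
        return "no apartments"
    if len(nationalities[host_id]) > 1:
        return "multiple nationalities"
    if len(countries[host_id]) > 1:
        return "multiple apartments"
    return "one host, one apartment"


labels = {host_id: describe(host_id) for host_id in sorted(nationalities)}
for host_id, label in labels.items():
    print(host_id, label)
```

Seeing every case show up in even five or six rows is a quick way to decide which ones are worth raising out loud.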
There are probably two or three more use cases that you would have to plan for. The purpose is really just to have a conversation with the interviewer about all the different scenarios you're seeing as you develop the solution. The most important part is to have that dialogue so that they understand that you understand the data and the different scenarios that can actually happen.
One additional thing these use cases expose is the potential for duplicate rows if one host has multiple apartments. You're trying to count the number of hosts. So, in the SELECT clause, if you're using SQL, you may not want to use star or simply count host_id. You may want to count distinct host_id values so that you get the number of unique hosts.
Talking through these use cases also helps you clarify what sort of counting is needed to get an accurate answer.
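Here is a small sketch of how the three ways of counting diverge once a host has more than one apartment. It uses Python's built-in sqlite3 module; the column names `nationality` and `country` and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airbnb_hosts (host_id INTEGER, nationality TEXT);
    CREATE TABLE airbnb_apartments (host_id INTEGER, apartment_id TEXT, country TEXT);
    -- Host 1 owns two apartments abroad; host 2 owns one at home.
    INSERT INTO airbnb_hosts VALUES (1, 'USA'), (2, 'France');
    INSERT INTO airbnb_apartments VALUES
        (1, 'A1', 'Spain'), (1, 'A2', 'Italy'), (2, 'A3', 'France');
""")

# After the join, host 1 appears once per apartment, so only
# COUNT(DISTINCT ...) reports the number of unique hosts.
counts = conn.execute("""
    SELECT COUNT(*), COUNT(h.host_id), COUNT(DISTINCT h.host_id)
    FROM airbnb_hosts h
    INNER JOIN airbnb_apartments a ON h.host_id = a.host_id
    WHERE h.nationality <> a.country
""").fetchone()

print(counts)  # (2, 2, 1)
```

Only one host has apartments abroad here, yet the first two counts report 2 because they count joined rows rather than hosts.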
With that being said, we can start coding up the solution.
The reason why I like to just organize all my thoughts as comments is that from top to bottom, I know exactly what to do and how to write the code.
If I start with the ‘SELECT’, I know that I am going to count distinct host_id here. I have two tables that I want to join using an inner join. In the ‘WHERE’ clause, I know that I can ignore the values in apartment type, but just for completeness I want to remove all the nulls, so I can say ‘a.apartment_type is not null’. And of course, the nationality is not equal to the apartment country, which was the goal of this question in general.
Now I just alias the host_id count and run this code.
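Putting those pieces together, the query might look something like the sketch below. It is run here through Python's built-in sqlite3 module against made-up rows. The column names `nationality` and `country`, the alias `n_hosts`, and the sample data are assumptions; check them against the actual table preview before relying on them.

```python
import sqlite3

# The query as described above: distinct hosts, inner join on host_id,
# non-null apartment type, nationality different from apartment country.
QUERY = """
SELECT COUNT(DISTINCT h.host_id) AS n_hosts
FROM airbnb_hosts h
INNER JOIN airbnb_apartments a
    ON h.host_id = a.host_id
WHERE a.apartment_type IS NOT NULL
  AND h.nationality <> a.country
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airbnb_hosts (host_id INTEGER, nationality TEXT);
    CREATE TABLE airbnb_apartments (host_id INTEGER, apartment_type TEXT, country TEXT);
    INSERT INTO airbnb_hosts VALUES (1, 'USA'), (2, 'France'), (3, 'Mali');
    INSERT INTO airbnb_apartments VALUES
        (1, 'Room', 'USA'),         -- same country: not counted
        (2, 'Apartment', 'Spain'),  -- abroad
        (2, 'House', 'Italy'),      -- abroad again, same host: counted once
        (3, NULL, 'Brazil');        -- abroad, but dropped by the NULL filter
""")

n_hosts = conn.execute(QUERY).fetchone()[0]
print(n_hosts)  # 1
```

Note that the null filter removes host 3 even though that apartment is abroad, so whether to keep that filter is itself an assumption worth confirming with the interviewer.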
Why discuss all these assumptions before you answer any data science interview questions
The fact that the solution is correct and the syntax is good is not the point of the interview. The point is really to discuss with the interviewer all the assumptions you need to clarify to come up with a solution like this. They want to understand not only whether you can write code but whether you can break down a problem in an organized way and come up with the solution.
So, that was today's tip! I hope that it helps you prepare for interviews or helps you understand how to approach problems as you prepare for your interviews.