Facebook Data Science Interview Questions
Recent Facebook data science interview questions solved using Python
Facebook controls some of the top social networks across the world. Besides the eponymous app, it also offers Messenger, Instagram, WhatsApp, Oculus, and Giphy, among others. Along with Google, Apple, Microsoft, and Amazon, it is considered one of the Big Five companies in U.S. information technology and has a market cap of over US$ 1 trillion.
Data Science Roles at Facebook
Data Scientists at Facebook work on a variety of large-scale quantitative research projects. They conduct research to gain deep insights into how people interact with each other and with the world. Data scientists at Facebook use a variety of methods, including machine learning, field experiments, surveys, and information visualization, to accomplish their goals. The role will therefore vary based on the business unit and the function you are interviewing for.
Concepts Tested in Facebook Data Science Interview Questions
The main areas and concepts tested in the Facebook Data Science Interview Questions include:
- Indexing and Slicing DataFrames
- Boolean Masking
- apply() method
You can practice these and more such Facebook data science interview questions on the StrataScratch platform and become interview ready.
Check out our previous article on the Facebook Interview Process, which can give you insight into the whole process.
Facebook Data Science Interview Questions
Facebook has developed a search algorithm that will parse through user comments and present the results of the search to a user. To evaluate the performance of the algorithm, we have a table that has information about query position and whether the user has clicked on the search result.
The higher the position, the better, because the user was immediately shown what they were searching for. Write a query that assigns ratings to queries in the following fashion:
- if no results were clicked on, assign the query with rating=1
- if the query has a click on a search result, but only for the one(s) below the top 3 positions, assign it with the rating=2
- if a query has a click in the top 3 positions give it a 3 rating
For multiple searches with the same id, aggregate using average
Output search_id along with the calculated rating
You can solve this question here: https://platform.stratascratch.com/coding/10350-algorithm-performance?python=1
Since you typically will not have access to the underlying data in a data science interview, you will have to use a mixture of business logic and your understanding of data storage to make reasonable assumptions about the variables. You must ensure that your solution boundaries are reasonably well defined.
So let us try to figure out what the data might look like. Please ensure that you confirm the validity of the assumptions that you make with the interviewer so that you do not veer away from the solution path. This will also give you a chance to showcase your ability to visualize the tables and data structures. Most interviewers will be more than happy to help you at this stage.
Assumptions on the Data and the Table:
search_id: This appears to be the identifying field for the search. However, this may not be a unique key – since the problem also mentions
For multiple searches with the same id, aggregate using average
query: This is the search text entered by the user. For this problem, we can safely ignore this field.
has_clicked: This field appears to be an indicator of whether the user clicked on the search result. This field is required for the final analysis. Further, since the data type for this field is object and not Boolean, we need to check with the interviewer regarding the values it takes.
position: This too is required for the final analysis and appears to be a field that denotes the rank of the query in search results.
notes: By looking at the data type, we can infer that this field contains comments or narration text. For this analysis, we can ignore this field.
Before we proceed towards drafting a solution for this problem, it is highly recommended that you confirm your assumptions with the interviewer, so that you can refine them and ensure that any edge cases are handled in the solution.
This is the table that we will be working with –
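The actual table is not reproduced here, but based on the assumptions above, a small hypothetical sample of fb_search_events (the values below are made up purely for illustration) might look like this:

```python
import pandas as pd

# Hypothetical sample of fb_search_events -- the real values are not
# given in the problem, only the column names and data types.
fb_search_events = pd.DataFrame({
    'search_id': [1, 2, 2, 3],
    'query': ['dogs', 'cats', 'cats', 'birds'],
    'has_clicked': ['yes', 'no', 'yes', 'no'],
    'position': [2, 5, 4, 1],
    'notes': ['', '', '', ''],
})

# Inspect the assumed schema
print(fb_search_events.dtypes)
```

Note that has_clicked is stored as a string ('yes'/'no') rather than a Boolean, which is exactly the kind of detail worth confirming with the interviewer.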
The biggest challenge in this problem is to create the ratings column. Once that is done, the query is relatively straightforward since the query parameters are already provided. Let us visualize this.
1. We need to work with two columns: has_clicked and position.
- If has_clicked is not “yes” then set the rating as 1
- Else if position is 3 or lower, then set the rating as 3
- Else set the rating as 2
2. Once we have the rating, we can aggregate it on search_id taking an average (mean).
Now that we have our logic, let us begin coding this in Python.
1. We start by creating the ratings column as described above by applying a conditional statement on two columns. There are many ways to accomplish this; we look at two of the most efficient methods.
a) Boolean Mask: Boolean masks apply a conditional to the entire data frame and return a Boolean output for each row. So we can create three Boolean masks, one for each of the three ratings.
# Import your libraries
import pandas as pd

# Start writing code
# Mask 01
fb_search_events['rating1'] = fb_search_events['has_clicked'] != 'yes'
# Mask 02
fb_search_events['rating2'] = (fb_search_events['has_clicked'] == 'yes') & (fb_search_events['position'] > 3)
# Mask 03
fb_search_events['rating3'] = (fb_search_events['has_clicked'] == 'yes') & (fb_search_events['position'] <= 3)
fb_search_events[['search_id', 'has_clicked', 'position', 'rating1', 'rating2', 'rating3']]
Let us verify that our masks are working fine. We will be checking if there are any overlaps in the ratings (there should not be any).
fb_search_events[['rating1', 'rating2', 'rating3', 'search_id']].groupby(by = ['rating1', 'rating2', 'rating3'], as_index = False).count()
b) Now that the masks are working fine, we can create the rating field. For this, we use the loc method.
# Import your libraries
import pandas as pd

# Start writing code
# Mask 01
rating1 = fb_search_events['has_clicked'] != 'yes'
# Mask 02
rating2 = (fb_search_events['has_clicked'] == 'yes') & (fb_search_events['position'] > 3)
# Mask 03
rating3 = (fb_search_events['has_clicked'] == 'yes') & (fb_search_events['position'] <= 3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
# Verify
fb_search_events[['search_id', 'has_clicked', 'position', 'rating']]
2. Alternatively, we can use the apply() method in Pandas with a lambda function to do all this in one step. The apply() method can be used to apply a function along an axis of a DataFrame. The lambda function is used to create a user defined function on the fly.
# Import your libraries
import pandas as pd

# Start writing code
# With axis=1, each x is a row, so we index into it by column name
fb_search_events['rating'] = fb_search_events[['has_clicked', 'position']].apply(
    lambda x: 1 if x['has_clicked'] != 'yes' else (3 if x['position'] <= 3 else 2),
    axis=1
)
# Verify
fb_search_events[['search_id', 'has_clicked', 'position', 'rating']]
3. Once we have the ratings field, we can easily summarize the data frame using the groupby() method.
fb_search_events[['search_id', 'rating']].groupby(by = ['search_id'], as_index = False).mean()
NumPy forms the basis of the Python Pandas library. These libraries are specifically designed to perform vectorized operations very quickly. In simple terms, instead of iterating item by item using a for loop, NumPy, and by extension Pandas, can perform the same operation over an entire column in one go. Think of it as creating a formula in a spreadsheet and then copying it down the entire column.
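To make the vectorization idea concrete, here is a minimal sketch (the column and variable names are illustrative) comparing an element-by-element loop with the equivalent vectorized Pandas comparison:

```python
import pandas as pd

df = pd.DataFrame({'position': [1, 4, 2, 7]})

# Loop version: visits each value one at a time in Python
loop_result = [p <= 3 for p in df['position']]

# Vectorized version: one comparison applied to the whole column at once
vectorized_result = df['position'] <= 3

# Both produce the same Booleans: [True, False, True, False]
print(loop_result)
print(list(vectorized_result))
```

On a four-row frame the difference is invisible, but on millions of rows the vectorized form is typically orders of magnitude faster because the loop runs in compiled NumPy code rather than in the Python interpreter.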
For our solution:
- We used Boolean Masking in order to speed up filtering rows
- Alternatively, we can use the apply() method with a lambda function to apply a conditional statement over the entire data frame.
Additional Facebook Data Science Interview Questions
Facebook Data Science Interview Question #1: Find whether the number of senior workers at Facebook is higher than the number of USA-based employees
Find whether the number of senior workers (i.e., more experienced) at Facebook is higher than its number of USA based employees.
If the number of seniors is higher then output as 'More seniors'. Otherwise, output as 'More USA-based'.
You can solve this Facebook data science interview question here: https://platform.stratascratch.com/coding-question?id=10065&python=1
This is one of the easy-level Facebook Data Science Interview questions. We can solve this in multiple ways using the built-in len() function. We will then have to output the data in the form of a dataframe. For that, we need to create a new dataframe with the output, which can be done using the DataFrame() constructor in pandas.
- Find the count of the number of seniors
- Find the count of the number of employees based in the USA
- Compare the counts and output the result into a DataFrame.
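The three steps above can be sketched as follows. The data here is a made-up stand-in for the facebook_employees dataset, and the column names ('is_senior', 'location') and the output column name are assumptions to be confirmed against the actual schema:

```python
import pandas as pd

# Illustrative stand-in for facebook_employees; the real column names
# ('is_senior', 'location') are assumptions for this sketch.
facebook_employees = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'location': ['USA', 'UK', 'USA', 'India', 'USA'],
    'is_senior': [True, False, False, True, False],
})

# Step 1: count the seniors
n_seniors = len(facebook_employees[facebook_employees['is_senior']])

# Step 2: count the USA-based employees
n_usa = len(facebook_employees[facebook_employees['location'] == 'USA'])

# Step 3: compare and wrap the answer in a DataFrame
result = pd.DataFrame({
    'winner': ['More seniors' if n_seniors > n_usa else 'More USA-based']
})
print(result)
```

With this toy data there are 2 seniors and 3 USA-based employees, so the sketch outputs 'More USA-based'.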
Facebook Data Science Interview Question #2: Clicked Vs Non-Clicked Search Results
The 'position' column represents the position of the search results, and the 'has_clicked' column represents whether the user has clicked on this result. Calculate the percentage of clicked search results, compared to those not clicked, that were in the top 3 positions (with respect to the total number of records).
You can solve this Facebook data science interview problem here: https://platform.stratascratch.com/coding-question?id=10288&python=1
This Facebook Data Science Interview question uses the same fb_search_events dataset we saw earlier. We can solve this problem using the built-in len() function. We will then have to output the data in the form of a dataframe. For that, we need to create a new dataframe with the output, which can be done using the DataFrame() constructor in pandas. One approach to solving this is presented below.
- Calculate the number of results that were clicked (filtering by the ‘has_clicked’ field) and in the top three results (filtering on the ‘position’ field) as a percentage of the total number of query results. This will be the clicked percentage.
- Calculate the not_clicked results in a similar manner by changing the filter on the has_clicked field.
- Output the clicked and not_clicked values in a Pandas data frame.
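A sketch of those three steps, using a small made-up sample of fb_search_events and assumed output column names ('clicked', 'not_clicked'):

```python
import pandas as pd

# Small illustrative sample of fb_search_events (values are made up)
fb_search_events = pd.DataFrame({
    'search_id': [1, 2, 3, 4],
    'has_clicked': ['yes', 'no', 'yes', 'no'],
    'position': [2, 1, 5, 3],
})

total = len(fb_search_events)
top3 = fb_search_events['position'] <= 3
clicked = fb_search_events['has_clicked'] == 'yes'

# Step 1: clicked results in the top 3, as a % of all records
clicked_pct = len(fb_search_events[clicked & top3]) / total * 100

# Step 2: same calculation with the has_clicked filter inverted
not_clicked_pct = len(fb_search_events[~clicked & top3]) / total * 100

# Step 3: output both values in a Pandas data frame
result = pd.DataFrame({'clicked': [clicked_pct],
                       'not_clicked': [not_clicked_pct]})
print(result)
```

On this toy sample, 1 of 4 records is a clicked top-3 result (25%) and 2 of 4 are non-clicked top-3 results (50%).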
Facebook Data Science Interview Question #3: Popularity of Hack
Facebook has developed a new programming language called Hack. To measure the popularity of Hack, they ran a survey with their employees. The survey included data on previous programming familiarity as well as the number of years of experience, age, gender, and, most importantly, satisfaction with Hack. Due to an error, location data was not collected, but your supervisor demands a report showing the average popularity of Hack by office location. Luckily, the user IDs of employees completing the surveys were stored.
Based on the above, find the average popularity of Hack per office location. Output the location along with the average popularity.
You can solve this Facebook data science interview question here: https://platform.stratascratch.com/coding-question?id=10061&python=1
This problem uses the facebook_employees dataset that we used earlier, along with an additional facebook_hack_survey dataset.
This Facebook Data Science Interview question can be solved by merging the two data sets using the merge() and groupby() methods. One such approach is presented here.
- Merge the two datasets. The join keys are the id column in the facebook_employees dataset and the employee_id column in the facebook_hack_survey dataset.
- Calculate the average popularity, aggregating on the location column.
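The merge-then-aggregate approach can be sketched as below. The data is illustrative only, and the column name 'popularity' in the survey table is an assumption; the join keys (id and employee_id) come from the problem description:

```python
import pandas as pd

# Illustrative stand-ins for the two datasets
facebook_employees = pd.DataFrame({
    'id': [1, 2, 3],
    'location': ['Menlo Park', 'London', 'Menlo Park'],
})
facebook_hack_survey = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'popularity': [4, 2, 6],  # assumed name for the satisfaction score
})

# Step 1: merge on id (employees) = employee_id (survey)
merged = facebook_employees.merge(
    facebook_hack_survey, left_on='id', right_on='employee_id')

# Step 2: average popularity per location
result = merged[['location', 'popularity']].groupby(
    'location', as_index=False).mean()
print(result)
```

Here Menlo Park averages (4 + 6) / 2 = 5.0 and London 2.0. By default merge() performs an inner join, so employees who did not complete the survey would be dropped, which matches the requirement of averaging over completed surveys.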
Check out our article Facebook Data Scientist Questions to find more questions from Facebook interviews.
In this article, we have discussed in detail an approach to solving one of the real-life Facebook Data Science interview questions using Python. The question was not too tough. Besides getting the right answer, the final evaluation would also take into account your optimization skills and your knowledge of specific Pandas features, such as Boolean masking and the apply() method. Expertise in Python in general, and in the Pandas library for data science in particular, can only be gained by practicing a variety of problems. Join the StrataScratch platform and practice more such data science interview questions from Facebook and other top companies like Amazon, Apple, Microsoft, Netflix, and more.