Facebook Data Science Interview Questions

Published: August 11, 2021

Written by:
Vivek Sankaran

In this article, we have discussed the approach to solving real-life Facebook Data Science interview questions in detail using Python.

Facebook controls some of the top social networks across the world. Besides the eponymous app, it also offers Messenger, Instagram, WhatsApp, Oculus, and Giphy, among others. Along with Google, Apple, Microsoft, and Amazon, it is considered one of the Big Five companies in U.S. information technology and has a market cap of over US$1 trillion.

Data Science Roles at Facebook

Data Scientists work on various large-scale quantitative research projects at Facebook. They conduct research to gain deep insights into how people interact with each other and with the world. Data scientists at Facebook work with a variety of methods, including machine learning, field experiments, surveys, and information visualization, to accomplish their goals. The roles at Facebook will therefore vary with the business unit and the function that you are interviewing for.

Concepts Tested in Facebook Data Science Interview Questions

The main areas and concepts tested in the Facebook Data Science Interview Questions include:

Pandas
groupby
Indexing and Slicing DataFrames
Boolean Masking
apply() method

You can practice these and more such Facebook data science interview questions on the StrataScratch platform and become interview ready.

Check out our previous article on the Facebook interview process, which gives you an insight into the whole process.

Facebook Data Science Interview Questions

Algorithm Performance

Last Updated: July 2021

Company: Meta
Difficulty: Hard
ID: 10350
Roles: Data Engineer, Data Scientist, BI Analyst, Data Analyst, ML Engineer, Software Engineer

Meta/Facebook is developing a search algorithm that will allow users to search through their post history. You have been assigned to evaluate the performance of this algorithm.

We have a table with the user's search term, search result positions, and whether or not the user clicked on the search result.

Write a query that assigns ratings to the searches in the following way:
• If the search was not clicked for any term, assign the search a rating=1
• If the search was clicked but the top position of clicked terms was outside the top 3 positions, assign the search a rating=2
• If the search was clicked and the top position of a clicked term was in the top 3 positions, assign the search a rating=3

As a search ID can contain more than one search term, select the highest rating for that search ID. Output the search ID and its highest rating.

Example: The search_id 1 was clicked (clicked = 1) and its position is outside of the top 3 positions (search_results_position = 5), therefore its rating is 2.

DataFrame: fb_search_events

Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10350-algorithm-performance

Dataset

Table: fb_search_events

Assumptions

Since you typically will not have access to the underlying data in the data science interview, you will have to use a mixture of business logic and your understanding of data storage to make assumptions about the variables. You must ensure that the boundaries of your solution are reasonably well defined.

So let us try to figure out what the data might look like. Please ensure that you confirm the validity of the assumptions that you make with the interviewer so that you do not veer away from the solution path. This will also give you a chance to showcase your ability to visualize the tables and data structures. Most interviewers will be more than happy to help you at this stage.

Assumptions on the Data and the Table:

search_id: This appears to be the identifying field for the search. However, it may not be a unique key, since the problem also mentions: "As a search ID can contain more than one search term, select the highest rating for that search ID."

search_term: This is the search text entered by the user. For this problem, we can safely ignore this field.

clicked: This field appears to be an indicator of whether the user has previously clicked on the search result. This field is required for the final analysis. Further, since the data type for this field is int64, we might need to check with the interviewer regarding the values it takes.

search_results_position: This, too, is required for the final analysis and appears to be a field that denotes the position of the result in the search results.

Before drafting a solution for this problem, confirm your assumptions with the interviewer. This lets you refine them and makes sure that any edge cases are handled in the solution.
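When the data is available, a few of these assumptions can also be checked directly. The snippet below is a minimal sketch on hypothetical sample rows (the values are made up for illustration; only the column names come from the problem statement). It checks whether search_id is unique, which values clicked takes, and whether the two columns used for the rating contain missing values.

```python
import pandas as pd

# Hypothetical sample rows; the real fb_search_events table is not shown here.
fb_search_events = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3],
    "search_term": ["rabbit", "airline", "airline", "quality", "hotel"],
    "clicked": [1, 0, 1, 0, 0],
    "search_results_position": [5, 4, 2, 1, 7],
})

# Is search_id a unique key, or can one search span several rows?
search_id_is_unique = fb_search_events["search_id"].is_unique

# Which values does the clicked flag actually take?
clicked_values = sorted(fb_search_events["clicked"].unique().tolist())

# Are there missing values in the two columns the rating depends on?
missing_counts = fb_search_events[["clicked", "search_results_position"]].isna().sum()

print(search_id_is_unique)
print(clicked_values)
print(missing_counts)
```

On this sample, search_id repeats and clicked only takes the values 0 and 1, which matches the assumptions listed above.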

Logic

The biggest challenge in this problem is to create the rating column. Once that is done, the query is relatively straightforward since the query parameters are already provided. Let us visualize this.

We need to work with two columns: clicked and search_results_position.
1. If clicked is not 1, then set the rating as 1
2. Else if the position is 3 or less, then set the rating as 3
3. Else set the rating as 2
Once we have the rating, we can aggregate it on search_id, taking the highest rating.
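Before moving to Pandas, the rating rules from the problem statement can be sketched as a plain Python function. The helper names rate_row and rate_search are hypothetical and only serve to illustrate the logic; they are not part of the final solution.

```python
def rate_row(clicked: int, position: int) -> int:
    """Return the rating for a single search-result row."""
    if clicked != 1:
        return 1  # not clicked
    if position <= 3:
        return 3  # clicked and in the top 3 positions
    return 2      # clicked but outside the top 3 positions


def rate_search(rows):
    """Return the highest rating over all (clicked, position) rows of one search."""
    return max(rate_row(clicked, position) for clicked, position in rows)


# The example from the problem statement: clicked, position 5.
print(rate_row(1, 5))

# A search with two rows: one not clicked, one clicked at position 2.
print(rate_search([(0, 4), (1, 2)]))
```

The Pandas solution applies the same rules to whole columns at once instead of calling a function row by row.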

Now that we have our logic, let us begin coding this in Python.

Solution

1. We start by creating the rating column as described above by applying a conditional statement on two columns. There are many ways to accomplish this. We look at two of them.

a) Boolean Mask: A Boolean mask evaluates a condition over an entire column and returns a Boolean Series with one True or False value per row. So we can create three Boolean masks, one for each of the three ratings.

# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
fb_search_events['rating1'] = fb_search_events['clicked'] != 1
# Mask 02
fb_search_events['rating2'] = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
fb_search_events['rating3'] = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <=3)

Let’s see what the data looks like.

fb_search_events[['search_id', 'clicked', 'search_results_position','rating1', 'rating2', 'rating3']]

Let us verify that our masks are working fine. We will be checking if there are any overlaps in the ratings (there should not be any).

fb_search_events[['rating1', 'rating2', 'rating3', 'search_id']].groupby(by = 
['rating1', 'rating2', 'rating3'], as_index = False).count()
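Another way to run the same check, shown here as a sketch on hypothetical sample rows, is to count how many of the three masks are True in each row. If every row has exactly one True, the masks neither overlap nor leave gaps.

```python
import pandas as pd

# Hypothetical sample rows, used only for illustration.
fb_search_events = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3],
    "clicked": [1, 0, 1, 0, 0],
    "search_results_position": [5, 4, 2, 1, 7],
})

rating1 = fb_search_events["clicked"] != 1
rating2 = (fb_search_events["clicked"] == 1) & (fb_search_events["search_results_position"] > 3)
rating3 = (fb_search_events["clicked"] == 1) & (fb_search_events["search_results_position"] <= 3)

# Booleans convert to 0/1, so this counts the True masks in each row.
masks_per_row = rating1.astype(int) + rating2.astype(int) + rating3.astype(int)

# Exactly one mask per row means no overlaps and no gaps.
all_rows_covered_once = bool((masks_per_row == 1).all())
print(all_rows_covered_once)
```

Note that a clicked row with a missing position would match neither rating2 nor rating3, because comparisons with NaN evaluate to False. That is one more reason to ask the interviewer about missing values.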

b) Now that the masks are working fine, we can create the rating field. For this, we use the loc indexer, which lets us assign a value only to the rows selected by a mask.

# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
rating1 = fb_search_events['clicked'] != 1
# Mask 02
rating2 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
rating3 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <=3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
# Verify
fb_search_events[['search_id', 'clicked', 'search_results_position','rating']]

c) The last step is to return the highest rating for each search ID. We use the max() function to do that, and the solution to this interview question looks like this.
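A minimal sketch of this last step, run on hypothetical sample rows, could look as follows. Initializing the rating column to 1 before overwriting it is a choice made for this sketch: it keeps the column integer-typed, whereas assigning through loc into a column that does not exist yet produces floats.

```python
import pandas as pd

# Hypothetical sample rows, used only for illustration.
fb_search_events = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3],
    "clicked": [1, 0, 1, 0, 0],
    "search_results_position": [5, 4, 2, 1, 7],
})

rating2 = (fb_search_events["clicked"] == 1) & (fb_search_events["search_results_position"] > 3)
rating3 = (fb_search_events["clicked"] == 1) & (fb_search_events["search_results_position"] <= 3)

# Start every row at rating 1 (not clicked), then overwrite the clicked rows.
fb_search_events["rating"] = 1
fb_search_events.loc[rating2, "rating"] = 2
fb_search_events.loc[rating3, "rating"] = 3

# Highest rating per search ID.
result = fb_search_events.groupby("search_id")["rating"].max().reset_index()
print(result)
```

Here search 2 has one unclicked row and one row clicked at position 2, so its highest rating is 3.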

2. Alternatively, we can use the apply() method in Pandas with a lambda function to do all this in one step.

a) The apply() method can be used to apply a function along an axis of a DataFrame. With axis = 1, the function receives one row at a time. The lambda function is used to create a user-defined function on the fly.

# Import your libraries
import pandas as pd
# Start writing code
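# Access the row values by column label; positional x[0] / x[1] is deprecated in recent pandas
fb_search_events['rating'] = fb_search_events[['clicked', 'search_results_position']].apply(
    lambda x: 1 if x['clicked'] != 1 else 3 if x['search_results_position'] <= 3 else 2, axis=1)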
# Verify
fb_search_events[['search_id', 'clicked', 'search_results_position','rating']]

b) Once we have the rating field, we can easily summarize the data frame using the groupby() and max() methods.

The final code is given below.

# Import your libraries
import pandas as pd
# Start writing code
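# Access the row values by column label; positional x[0] / x[1] is deprecated in recent pandas
fb_search_events['rating'] = fb_search_events[['clicked', 'search_results_position']].apply(
    lambda x: 1 if x['clicked'] != 1 else 3 if x['search_results_position'] <= 3 else 2, axis=1)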
result = fb_search_events.groupby('search_id')['rating'].max().reset_index()

Optimization

Pandas is built on top of NumPy. Both libraries are designed to perform vectorized operations quickly. In simple terms, instead of iterating item by item using a for loop, NumPy and, by extension, Pandas can perform the same operation over an entire column in one go. Think of it as creating a formula in a spreadsheet and then copying it along the entire column.
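To make the difference concrete, the sketch below computes the ratings twice on hypothetical sample rows: once item by item in a Python loop, and once with NumPy's np.select, which picks a value per row based on the first condition that matches. np.select is not used elsewhere in this article; it is shown here only as one possible vectorized formulation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, used only for illustration.
fb_search_events = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3],
    "clicked": [1, 0, 1, 0, 0],
    "search_results_position": [5, 4, 2, 1, 7],
})

# Item-by-item version: a Python-level loop over the rows.
loop_ratings = []
for c, p in zip(fb_search_events["clicked"], fb_search_events["search_results_position"]):
    loop_ratings.append(1 if c != 1 else 3 if p <= 3 else 2)

# Vectorized version: conditions are evaluated on whole columns at once.
clicked = (fb_search_events["clicked"] == 1).to_numpy()
top3 = (fb_search_events["search_results_position"] <= 3).to_numpy()
vectorized_ratings = np.select([~clicked, clicked & top3], [1, 3], default=2)

print(loop_ratings)
print(vectorized_ratings.tolist())
```

Both versions produce the same ratings. On a handful of rows the speed difference is negligible, but it tends to grow with the size of the table.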

For our solution:

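We used Boolean Masking, which evaluates each condition over a whole column at once, to speed up filtering rows.
Alternatively, we can use the apply() method with a lambda function to apply a conditional statement over the entire data frame. With axis = 1, apply() calls the function once per row, so on large tables the masking approach is usually the faster of the two.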

# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
rating1 = fb_search_events['clicked'] != 1
# Mask 02
rating2 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
rating3 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <=3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
result = fb_search_events.groupby('search_id')['rating'].max().reset_index()

Additional Facebook Data Science Interview Questions

Facebook Data Science Interview Question #1: Find whether the number of seniors works at Facebook is higher than its number of USA based employees

Last Updated: April 2020

Facebook Data Science Interview Question #2: Clicked Vs Non-Clicked Search Results

The 'position' column represents the position of the search results, and the 'has_clicked' column represents whether the user has clicked on this result. Calculate the percentage of clicked search results, compared to those not clicked, that were in the top 3 positions (with respect to the total number of records).

Link to the question: https://platform.stratascratch.com/coding/10288-clicked-vs-non-clicked-search-results

Dataset

This Facebook Data Science Interview question uses the same fb_search_events dataset we had seen earlier. We can solve this problem using the built-in len() function. We then have to output the result as a data frame, which we can build with the pandas DataFrame() constructor. One approach to solving this is presented below.

Approach

Calculate the number of results that were clicked (filtering by the ‘has_clicked’ field) and in the top three results (filtering on the ‘position’ field) as a percentage of the total number of query results. This will be the clicked percentage.
Calculate the not_clicked results in a similar manner by changing the filter on the has_clicked field.
Output the clicked and not_clicked values in a Pandas data frame.
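The three steps above can be sketched as follows. The column names position and has_clicked come from the question text. The frame name search_df, the sample values, the output column names, and the assumption that has_clicked is stored as 1/0 are all hypothetical and should be confirmed against the actual dataset.

```python
import pandas as pd

# Hypothetical sample rows; has_clicked is assumed to be stored as 1/0.
search_df = pd.DataFrame({
    "position": [1, 2, 5, 3, 7, 1, 4, 2],
    "has_clicked": [1, 0, 1, 1, 0, 0, 0, 1],
})

total = len(search_df)
top3 = search_df["position"] <= 3

# Step 1: clicked results in the top 3, as a share of all records.
clicked_top3 = len(search_df[top3 & (search_df["has_clicked"] == 1)])

# Step 2: the same filter with the click condition reversed.
not_clicked_top3 = len(search_df[top3 & (search_df["has_clicked"] != 1)])

# Step 3: output both percentages in a one-row data frame.
result = pd.DataFrame({
    "top_3_clicked": [clicked_top3 / total * 100],
    "top_3_notclicked": [not_clicked_top3 / total * 100],
})
print(result)
```

In this sample, 3 of the 8 records are clicked top-3 results and 2 of the 8 are unclicked top-3 results.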

Facebook Data Science Interview Question #3: Popularity of Hack

Last Updated: March 2020

Conclusion

In this article, we have discussed an approach to solving one of the real-life Facebook Data Science interview questions in detail using Python. The question was not too tough. Besides getting the right answer, the final evaluation would also take into consideration your optimization skills and knowledge of specific Pandas features like Boolean Masking and the use of the apply() method. Expertise in Python in general, and in the Pandas library for Data Science in particular, comes only from practice in solving a variety of problems. Join the StrataScratch platform and practice more such data science interview questions from Facebook and other top companies like Amazon, Apple, Microsoft, Netflix and more.
