Facebook Data Science Interview Questions

Recent Facebook data science interview questions solved using Python

Facebook controls some of the top social networks in the world. Besides the eponymous app, it also offers Messenger, Instagram, WhatsApp, Oculus, and Giphy, among others. Along with Google, Apple, Microsoft, and Amazon, it is considered one of the Big Five companies in U.S. information technology and has a market cap of over US$ 1 trillion.

Data Science Roles at Facebook

Data Scientists work on various large-scale quantitative research projects at Facebook. They conduct research to gain deep insights into how people interact with each other and with the world. People in data scientist positions at Facebook work with a variety of methods, including machine learning, field experiments, surveys, and information visualization, to accomplish their goals. The roles at Facebook will therefore vary with the business unit and the function that you are interviewing for.

Concepts Tested in Facebook Data Science Interview Questions

The main areas and concepts tested in the Facebook Data Science Interview Questions include:

  • Pandas
  • groupby
  • Indexing and Slicing DataFrames
  • Boolean Masking
  • apply() method

You can practice these and more such Facebook data science interview questions on the StrataScratch platform and become interview ready.

Check out our previous article on the Facebook Interview Process for an insight into the whole process.

Facebook Data Science Interview Questions


DataFrame: fb_search_events
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10350-algorithm-performance

Dataset

Table: fb_search_events
search_id  search_term  clicked  search_results_position
1          rabbit       1        5
2          airline      1        4
2          quality      1        5
3          hotel        1        1
3          scandal      1        4

Assumptions

Since you typically will not have access to the underlying data in the data science interview, you will have to use a mixture of business logic and your understanding of data storage to infer what each variable means. You must ensure that the boundaries of your solution are reasonably well defined.

So let us try to figure out what the data might look like. Please ensure that you confirm the validity of the assumptions that you make with the interviewer so that you do not veer away from the solution path. This will also give you a chance to showcase your ability to visualize the tables and data structures. Most interviewers will be more than happy to help you at this stage.

Assumptions on the Data and the Table:

search_id: This appears to be the identifying field for the search. However, it may not be a unique key, since the problem also mentions:

As a search ID can contain more than one search term, select the highest rating for that search ID.

search_term: This is the search text entered by the user. For this problem, we can safely ignore this field.

clicked: This field appears to be an indicator of whether the user has previously clicked on the search result. This field is required for the final analysis. Further, since the data type for this field is int64, we might need to check with the interviewer regarding the values it takes.

search_results_position: This, too, is required for the final analysis and appears to be a field that denotes the rank of the result in the search results.

Before drafting a solution for this problem, it is highly recommended that you confirm your assumptions with the interviewer, so that you can refine them and handle any edge cases in the solution.

Logic

The biggest challenge in this problem is to create the rating column. Once that is done, the rest is relatively straightforward, since the rating rules are already provided. Let us visualize this.

  1. We need to work with two columns: clicked and search_results_position.
    1. If clicked is not 1, then set the rating as 1
    2. Else if the position is 3 or less, then set the rating as 3
    3. Else set the rating as 2
  2. Once we have the rating, we can aggregate it on search_id, taking the highest rating.

Now that we have our logic let us begin coding this in Python.

Solution

1. We start by creating the rating column as described above by applying a conditional statement on two columns. There are many ways to accomplish this. We look at two common methods.

a) Boolean Mask: A Boolean mask applies a condition to an entire column and returns a Boolean Series, aligned with the data frame's index, that is True wherever the condition holds. So we can create three Boolean masks, one for each of the three ratings.

# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
fb_search_events['rating1'] = fb_search_events['clicked'] != 1
# Mask 02
fb_search_events['rating2'] = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
fb_search_events['rating3'] = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <=3)

Let’s see what the data looks like.

fb_search_events[['search_id', 'clicked', 'search_results_position','rating1', 'rating2', 'rating3']]

All required columns and the first 5 rows of the solution are shown

search_id  clicked  search_results_position  rating1  rating2  rating3
1          1        5                        FALSE    TRUE     FALSE
2          1        4                        FALSE    TRUE     FALSE
2          1        5                        FALSE    TRUE     FALSE
3          1        1                        FALSE    FALSE    TRUE
3          1        4                        FALSE    TRUE     FALSE

Let us verify that our masks are working fine. We will be checking if there are any overlaps in the ratings (there should not be any).

fb_search_events[['rating1', 'rating2', 'rating3', 'search_id']].groupby(by = 
['rating1', 'rating2', 'rating3'], as_index = False).count()

All required columns and the first 5 rows of the solution are shown

rating1  rating2  rating3  search_id
FALSE    FALSE    TRUE     25
FALSE    TRUE     FALSE    20
TRUE     FALSE    FALSE    30
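The same check can also be made directly on the masks. The following is a minimal sketch, assuming a small hypothetical sample of fb_search_events (the last row is invented for illustration so that all three ratings occur). It asserts that every row satisfies exactly one of the three masks.

```python
import pandas as pd

# Hypothetical sample; the real fb_search_events table is larger.
fb_search_events = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3, 4],
    "clicked": [1, 1, 1, 1, 1, 0],
    "search_results_position": [5, 4, 5, 1, 4, 2],
})

# The three masks, defined as in the article.
rating1 = fb_search_events["clicked"] != 1
rating2 = (fb_search_events["clicked"] == 1) & (fb_search_events["search_results_position"] > 3)
rating3 = (fb_search_events["clicked"] == 1) & (fb_search_events["search_results_position"] <= 3)

# Count how many masks are True for each row; it should always be exactly one.
masks_per_row = rating1.astype(int) + rating2.astype(int) + rating3.astype(int)
assert (masks_per_row == 1).all()
```

If the assertion fails, at least one row is either unrated or matched by more than one mask, which usually points to missing values or an unexpected value in clicked.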

b) Now that the masks are working fine, we can create the rating field. For this, we use the loc method.

# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
rating1 = fb_search_events['clicked'] != 1
# Mask 02
rating2 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
rating3 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <=3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
# Verify
fb_search_events[['search_id', 'clicked', 'search_results_position','rating']]

All required columns and the first 5 rows of the solution are shown

search_id  clicked  search_results_position  rating
1          1        5                        2
2          1        4                        2
2          1        5                        2
3          1        1                        3
3          1        4                        2

c) The last step is to return the highest rating for each search ID. We use the max() function to do that, and the solution to this interview question looks like this.

All required columns and the first 5 rows of the solution are shown

search_id  rating
1          2
2          2
3          3
5          3
6          3
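The aggregation step described above can be sketched as follows. This is a minimal example that assumes the rating column has already been computed; the sample rows are the first five rated rows shown earlier, not the full table.

```python
import pandas as pd

# First five rated rows from the walkthrough (illustrative subset only).
rated = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3],
    "rating": [2, 2, 2, 3, 2],
})

# Keep the highest rating for each search_id.
result = rated.groupby("search_id")["rating"].max().reset_index()
print(result)
```

The reset_index() call turns search_id back into a regular column, so the result is a two-column data frame rather than a Series indexed by search_id.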

2. Alternatively, we can use the apply() method in Pandas with a lambda function to do all this in one step.

a) The apply() method can be used to apply a function along an axis of a DataFrame. A lambda expression defines a small anonymous function inline.

# Import your libraries
import pandas as pd
# Start writing code
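# Access the row values by column label; positional access such as x[0]
# on a labelled row is deprecated in recent pandas versions
fb_search_events['rating'] = fb_search_events[['clicked', 
'search_results_position']].apply(lambda x : 1 if x['clicked'] != 1 else 3 if 
x['search_results_position'] <= 3 else 2 , axis = 1)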
# Verify
fb_search_events[['search_id', 'clicked', 'search_results_position','rating']]

All required columns and the first 5 rows of the solution are shown

search_id  clicked  search_results_position  rating
1          1        5                        2
2          1        4                        2
2          1        5                        2
3          1        1                        3
3          1        4                        2

b) Once we have the rating field, we can easily summarize the data frame using the groupby() and max() methods.

The final code is given below.

# Import your libraries
import pandas as pd
# Start writing code
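# Rate each row, accessing the row values by column label
fb_search_events['rating'] = fb_search_events[['clicked', 
'search_results_position']].apply(lambda x : 1 if x['clicked'] != 1 else 3 if 
x['search_results_position'] <= 3 else 2 , axis = 1)
# Keep the highest rating for each search_id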
result = fb_search_events.groupby('search_id')['rating'].max().reset_index()

All required columns and the first 5 rows of the solution are shown

search_id  rating
1          2
2          2
3          3
5          3
6          3

Optimization

Pandas is built on top of NumPy. Both libraries are designed for vectorized operations: instead of iterating item by item with a for loop, they apply the same operation to an entire column in one go, which is usually much faster. Think of it as creating a formula in a spreadsheet and then copying it along the entire column.

For our solution:

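  1. We used Boolean masking, which is vectorized, to assign the ratings without looping over the rows.
  2. Alternatively, we can use the apply() method with a lambda function to apply a conditional statement to every row. Note that apply() with axis = 1 calls the Python function once per row, so it is usually slower than Boolean masking on large data frames.

The mask-based solution is given below.
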
# Import your libraries
import pandas as pd
# Start writing code
# Mask 01
rating1 = fb_search_events['clicked'] != 1
# Mask 02
rating2 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] > 3)
# Mask 03
rating3 = (fb_search_events['clicked'] == 1) & (fb_search_events['search_results_position'] <=3)
# Calculate Ratings
fb_search_events.loc[rating1, 'rating'] = 1
fb_search_events.loc[rating2, 'rating'] = 2
fb_search_events.loc[rating3, 'rating'] = 3
result = fb_search_events.groupby('search_id')['rating'].max().reset_index()
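
Another vectorized option, not used in the solution above, is numpy.select, which assigns all three ratings in a single call. The following is a minimal sketch under the same rating rules; the sample data is hypothetical (the last row is invented so that rating 1 also occurs).

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real fb_search_events table is larger.
fb_search_events = pd.DataFrame({
    "search_id": [1, 2, 2, 3, 3, 4],
    "clicked": [1, 1, 1, 1, 1, 0],
    "search_results_position": [5, 4, 5, 1, 4, 2],
})

clicked = fb_search_events["clicked"] == 1
top3 = fb_search_events["search_results_position"] <= 3

# Conditions are checked in order and the first match wins;
# rows matching neither condition get the default rating of 2.
conditions = [~clicked, clicked & top3]
choices = [1, 3]
fb_search_events["rating"] = np.select(conditions, choices, default=2)

# Keep the highest rating for each search_id.
result = fb_search_events.groupby("search_id")["rating"].max().reset_index()
print(result)
```

Compared with three separate loc assignments, this keeps the rules in one place and produces an integer rating column directly.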

Additional Facebook Data Science Interview Questions

Facebook Data Science Interview Question #1: Find whether the number of seniors works at Facebook is higher than its number of USA based employees


DataFrame: facebook_employees
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10065-find-whether-the-number-of-seniors-works-at-facebook-is-higher-than-its-number-of-usa-based-employees

Dataset

Table: facebook_employees
id  location  age  gender  is_senior
0   USA       24   M       FALSE
1   USA       31   F       TRUE
2   USA       29   F       FALSE
3   USA       33   M       FALSE
4   USA       36   F       TRUE

This is one of the easy-level Facebook Data Science Interview questions. We can solve it in multiple ways, for example by using the built-in len() function to count the filtered rows. We then have to output the result as a data frame, which means creating a new data frame with the pandas DataFrame() constructor.

Approach

  1. Find the count of the number of seniors
  2. Find the count of the number of employees based in USA
  3. Compare the counts and output the result into a DataFrame.
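
The steps above can be sketched as follows. This is a minimal example, assuming that is_senior is stored as a Boolean column and using only the five sample rows shown above; the output column name more_seniors is a hypothetical choice, since the expected column name is not given here.

```python
import pandas as pd

# The five sample rows shown above; the real table is larger.
facebook_employees = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "location": ["USA", "USA", "USA", "USA", "USA"],
    "age": [24, 31, 29, 33, 36],
    "gender": ["M", "F", "F", "M", "F"],
    "is_senior": [False, True, False, False, True],  # assumed Boolean
})

# Step 1: count the seniors.
n_seniors = len(facebook_employees[facebook_employees["is_senior"]])
# Step 2: count the USA-based employees.
n_usa = len(facebook_employees[facebook_employees["location"] == "USA"])

# Step 3: compare the counts and wrap the answer in a DataFrame.
# 'more_seniors' is a hypothetical column name.
result = pd.DataFrame({"more_seniors": [n_seniors > n_usa]})
print(result)
```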

Facebook Data Science Interview Question #2: Clicked Vs Non-Clicked Search Results

The 'position' column represents the position of the search results, and 'has_clicked' column represents whether the user has clicked on this result. Calculate the percentage of clicked search results, compared to those not clicked, that were in the top 3 positions (with respect to total number of records)

Link to the question: https://platform.stratascratch.com/coding/10288-clicked-vs-non-clicked-search-results

Dataset

This Facebook Data Science Interview question uses the same fb_search_events dataset we saw earlier. We can solve this problem using the built-in len() function to count the filtered rows. We then have to output the result as a data frame, which means creating a new data frame with the pandas DataFrame() constructor. One approach to solving this is presented below.

Approach

  1. Calculate the number of results that were clicked (filtering by the ‘has_clicked’ field) and in the top three results (filtering on the ‘position’ field) as a percentage of the total number of query results. This will be the clicked percentage.
  2. Calculate the not_clicked results in a similar manner by changing the filter on the has_clicked field.
  3. Output the clicked and not_clicked values in a Pandas data frame.
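
The approach above can be sketched as follows. This is a minimal example on hypothetical data: the column names position and has_clicked come from the question statement, while the table name search_results, the sample rows, and the 0/1 encoding of has_clicked are assumptions to confirm with the interviewer.

```python
import pandas as pd

# Hypothetical data; has_clicked is assumed to be 1 (clicked) or 0 (not clicked).
search_results = pd.DataFrame({
    "position": [1, 2, 5, 3, 4, 1, 2, 6],
    "has_clicked": [1, 0, 1, 1, 0, 0, 1, 0],
})

total = len(search_results)
top3 = search_results["position"] <= 3

# Step 1: clicked results in the top 3, as a percentage of all records.
clicked = len(search_results[top3 & (search_results["has_clicked"] == 1)]) / total * 100
# Step 2: the same calculation for results that were not clicked.
not_clicked = len(search_results[top3 & (search_results["has_clicked"] == 0)]) / total * 100

# Step 3: output both values in a DataFrame.
result = pd.DataFrame({"clicked": [clicked], "not_clicked": [not_clicked]})
print(result)
```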

Facebook Data Science Interview Question #3: Popularity of Hack


DataFrames: facebook_employees, facebook_hack_survey
Expected Output Type: pandas.Series

Link to the question: https://platform.stratascratch.com/coding/10061-popularity-of-hack

Dataset

This problem uses the facebook_employees dataset that we used earlier, along with an additional facebook_hack_survey dataset.

Table: facebook_employees
id  location  age  gender  is_senior
0   USA       24   M       FALSE
1   USA       31   F       TRUE
2   USA       29   F       FALSE
3   USA       33   M       FALSE
4   USA       36   F       TRUE

Table: facebook_hack_survey
employee_id  age  gender  popularity
0            24   M       6
1            31   F       4
2            29   F       0
3            33   M       7
4            36   F       6

This Facebook Data Science Interview question can be solved by joining the two datasets with the merge() method and then aggregating with groupby(). One such approach is presented here.

Approach

  1. Merge the two datasets. The join keys are – id column in the facebook_employees dataset and employee_id column in facebook_hack_survey dataset.
  2. Calculate the average popularity, aggregating on the location column.
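
The two steps above can be sketched as follows. This is a minimal example that uses only the five sample rows shown for each table, so every employee is in the USA and the result has a single entry; the real tables are larger.

```python
import pandas as pd

# Sample rows shown above (illustrative subset only).
facebook_employees = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "location": ["USA", "USA", "USA", "USA", "USA"],
    "age": [24, 31, 29, 33, 36],
    "gender": ["M", "F", "F", "M", "F"],
    "is_senior": [False, True, False, False, True],
})
facebook_hack_survey = pd.DataFrame({
    "employee_id": [0, 1, 2, 3, 4],
    "age": [24, 31, 29, 33, 36],
    "gender": ["M", "F", "F", "M", "F"],
    "popularity": [6, 4, 0, 7, 6],
})

# Step 1: join employees to their survey answers on id = employee_id.
merged = pd.merge(
    facebook_employees,
    facebook_hack_survey,
    left_on="id",
    right_on="employee_id",
)

# Step 2: average popularity per location; the result is a pandas Series.
result = merged.groupby("location")["popularity"].mean()
print(result)
```

Because age and gender exist in both tables, merge() adds the suffixes _x and _y to them. That does not affect this calculation, since location and popularity each appear in only one table.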

Check out our article Facebook Data Scientist Interview Questions to find more questions from Facebook interviews.

Conclusion

In this article, we discussed in detail an approach to solving one of the real-life Facebook Data Science interview questions using Python. The question was not too tough. Besides getting the right answer, the final evaluation would also take into consideration your optimization skills and your knowledge of specific Pandas features such as Boolean masking and the apply() method. Expertise in Python in general, and in the Pandas library for data science in particular, comes only from practice in solving a variety of problems. Join the StrataScratch platform and practice more such data science interview questions from Facebook and other top companies like Amazon, Apple, Microsoft, Netflix and more.
