Facebook Data Scientist Analytics Interview Questions

This is one of the medium difficulty data scientist analytics interview questions from Facebook. This question requires writing a function that will check whether a search term is found in a column. We’ll use both a SQL wildcard % and concat() function to implement this logic check. We’ll then write a CASE WHEN statement to evaluate the performance of the search algorithm. This question covers concepts that are commonly found in data science interviews at Facebook and Google.

Facebook data scientist analytics interview question

Algorithm Performance
Facebook has developed a search algorithm that will parse through user comments and present the results of the search to a user. To evaluate the performance of the algorithm, we are given a table that consists of the search result the user clicked on ('notes' column), the user's search query, and the resulting search position that was returned for the specific comment.
The higher the position, the better, since these comments were exactly what the user was searching for. Write a query that evaluates the performance of the search algorithm against each user query. Refer to the hint section for more specifics on how to write the query.

If you want to follow along with me, below is the link to StrataScratch:
https://platform.stratascratch.com/coding-question?id=10287&python=

Breaking Down the Question

If we hit the hint section, we get more information on the question.

We got some information about the column names:

- The 'query' column is the search term the user was searching for
- The 'notes' column is the specific comment the user clicked on as a result of performing the search (i.e., this is the comment the user was searching for)
- The 'position' column is the search result position of the specific comment
- Output only the result_id, query, and rating
- To evaluate the performance of the search algorithm, you'll need to rate each search result position.
- When the 'position' is between 1 and 3, the search algorithm performed well. Give a rating of 5.
- When the 'position' is between 4 and 5, give a rating of 4
- When the 'position' is between 6 and 10, give a rating of 3
- When the 'position' is beyond 11, give a rating of 2
- If the query itself is not found in the 'notes' column, give a rating of 1, since it means that the search algorithm provided a false positive. You can assume that the search algorithm is case insensitive (for example, 'Cat' is equal to 'cat').

This will make much more sense once we dive into the data itself, so let's do that now.

Exploring the Data

Now, let's hit this 'Preview' button.

In the output, we have the 'query' which is the search term. We have the 'notes' column, which is actually the comment. And we have the 'position' in which the comment actually appeared when the user performed this search. And then lastly, we have the 'result_id', which is just a primary key to keep track of all of these results.

The last thing to keep note of is that the comments under the 'note' column were all clicked by the user. For example, the first raw shows that the user searching for 'employee' clicked on the comment, which appeared in position seven. So, we know that this user that searched for 'employee' wanted to find this comment, and that's why they clicked on it.
What we're trying to do is grade the search algorithm that a developer at Facebook created.

Next, let's write our approach down before we start coding. The most important point that I found was actually in the hint section. At the very bottom, it says, "If the query itself is not found in the notes column, give a rating of one."
Basically, it's saying that if the search term was not found in the comment itself, give a rating of one. This is probably the most important thing to solve for first, before classifying the positions with giving it a rating of five, four, three, or two.

Let's solve for the bottom part first, and then we will solve for the upper part afterwards.

Building Skeleton of Solution

What we'll write in our approach is to check if query is in notes. It means to check if the search term is in the comment itself.

-- check if query is in notes

Then we have to categorize the position based on the criteria in the hints section. And we can also write down what that criteria is.

-- check if query is in notes
-- categorize the position based on the criteria in the hints section
  -- 'position' is between 1 and 3, give a rating of 5
  -- 'position' is between 4 and 5, give a rating of 4
  -- 'position' is between 6 and 10, give a rating of 3
  -- 'position' is beyond 11, give a rating of 2
  -- query not found in the 'notes' column, give a rating of 1

Now, we have to check if the query is in the notes. Let's take an example to understand this:

The search term was 'query' and the comments are in the 'notes' section. We can see that employee is found in the 'notes' section. But we need SQL to be able to find that. You'll actually take the 'employee' search term and try to find it in the comment section, but there's a bunch of symbols, texts and space to the right of 'employee' and to the left of 'employee'. So in order to compensate for that, you will have to utilize a wildcard.

Using Wildcards and concat(%query%)

Now, we need to use a wildcard.

-- %employee%

The wildcard in SQL is a percentage sign. What it's saying is that for the text or word "employee," the wildcard means you can ignore everything that comes to the right of it and you can ignore anything that comes to the left of it.

The next question is, how do I actually implement this programmatically where it will go row by row for each one of these query terms or search terms. Because the term itself is different each and every time. Like in our case, we have 'employee', 'organization', 'office' and more. So now, we need to put 'query' in the column name, and then wildcard symbols to the left and right of it. And we can do that by leveraging a concat function. It could look something like:

-- concat(%query%)

We used a wildcard symbol and we used the concat function to be able to evaluate whether or not the search term is found in the comments itself. So, now we'll put this at the very top.

-- check if query is in notes using concat(%query%)
-- categorize the position based on the criteria in the hints section
   using CASE WHEN
  -- 'position' is between 1 and 3, give a rating of 5
  -- 'position' is between 4 and 5, give a rating of 4
  -- 'position' is between 6 and 10, give a rating of 3
  -- 'position' is beyond 11, give a rating of 2
  -- query not found in the 'notes' column, give a rating of 1

Now, let's start coding!

First, we'll SELECT the table and then we have to output all of the columns and then implement the concat query.

SELECT
  query,
  result_id,
  position,
  notes,
FROM fb_search_results

Now, implement the concat function with the wildcard. What you have to do is enclose the wildcard symbol or the percentage symbol in quotes, and then put in the column name (the column name is query). And then again, use the another wildcard.

SELECT
  query,
  result_id,
  position,
  notes,
  concat('%', query, '%')
FROM fb_search_results

You have the wildcard symbol and employee text (search term). This function will go row by row and it will concat the search term with wildcard symbols, just like as you see above.

So, let's test this. I'm going to name this column as "Check" and let's just test to see what happens if we run this query.

SELECT
  query,
  result_id,
  position,
  notes,
  concat('%', query, '%') as check
FROM fb_search_results

Let's run this query now.

We get the wildcards and the search term, but it's not telling us whether or not search term (for e.g. 'employee') is in the notes. It's not telling us if the search term is in the comment itself. For that, we're going to have turned this into a true or false.

ilike Implementation for TRUE/FALSE

What we have to do is we call the notes column, because the note column contains the comment itself and then ilike. And then we'll see whether or not that search term is actually in the 'notes' column. That should give us a true or false. It should tell us if the search term is in the comment itself.

SELECT
  query,
  result_id,
  position,
  notes,
  (notes ilike concat('%', query, '%')) as check
FROM fb_search_results

I'm using an "ilike" here. You could use "like" if you'd like. It depends what you see in the data and some of your assumptions. My assumption is that this search algorithm is case insensitive, so it doesn't matter if it's, for example, "Employee" or "employee", the search algorithms should view those words or terms exactly the same. So, in that case, I want to use an "ilike". But if I want to be able to differentiate between an 'Employee' with a capital E and 'employee' with the lowercase e, I probably have to use a "like". But in our case, this doesn't matter.

Let's run this now!

We now get trues and falses. We can see that 'employee' is in the comment and it's 'TRUE'. But we're getting a 'FALSE' on the second one, that is 'organisation'. So the answer is no, because 'organisation' is spelled with an S and we have 'organization' that is spelled with a Z. So, this makes a lot of sense. The chapter is right and It should return a 'FALSE'. And then when we implement the CASE WHEN we're going to give it a rating of 1. This is basically done.

Now, we have to start categorizing the position based on the criteria that we found and read in the hints column, using a CASE WHEN.

Implementing a CASE WHEN

This is going to be a subquery in my categorization. We'll write FROM and alias that with a t. And then in our SELECT clause, we'll start writing the categorization using the CASE WHEN. So, we will output the result_id and then the query itself. And then we will start on the CASE WHEN.

SELECT
  t.result_id,
  t.query,
  CASE WHEN t.check = FALSE THEN 1

FROM (
SELECT
  query,
  result_id,
  position,
  notes,
  (notes ilike concat('%', query, '%')) as check
FROM fb_search_results) t

This is the first criteria, "If the query is not found in the notes column, "then give it a rating of 1."

Now, let's just start layering on the other ratings.

SELECT
  t.result_id,
  t.query,
  CASE WHEN t.check = FALSE THEN 1
       WHEN t.check = TRUE AND t.position >= 11 THEN 2

Now, we have, "If the checker says that it's TRUE, meaning the search term was found in the comment itself, and the position was equal or greater than 11, give it a rating of two."

And the third criteria is:

SELECT
  t.result_id,
  t.query,
  CASE WHEN t.check = FALSE THEN 1
       WHEN t.check = TRUE AND t.position >= 11 THEN 2
       WHEN t.check = TRUE AND (t.position between 6 AND 10) THEN 3

The third criteria is, "If the check says TRUE, meaning the search term was found in the comment, and the position was between 6 and 10, give it a rating of 3".

What's important to note is that we're using 'between' rather than 'inequalities'. It is the same thing. What's important to also note is that the 'between' is inclusive, so it's going to count the 6 and 10, and it's going to give it a rating of 3. If you don't want this behavior, don't use 'between', and use 'inequalities' instead.

Now, it will be the same thing for a rating of four and five.

SELECT
  t.result_id,
  t.query,
  CASE WHEN t.check = FALSE THEN 1
       WHEN t.check = TRUE AND t.position >= 11 THEN 2
       WHEN t.check = TRUE AND (t.position between 6 AND 10) THEN 3
       WHEN t.check = TRUE AND (t.position between 4 AND 5) THEN 4
       WHEN t.check = TRUE AND t.position <= 3 THEN 5

If the position is between 4 and 5, we're giving it a rating of 4. And if the position is less than or equal to 3, give it a rating of 5.

For the most part, this finishes our CASE WHEN. All we need to do is end the CASE WHEN and name the column.

Solution

SELECT
  t.result_id,
  t.query,
  CASE WHEN t.check = FALSE THEN 1
       WHEN t.check = TRUE AND t.position >= 11 THEN 2
       WHEN t.check = TRUE AND (t.position between 6 AND 10) THEN 3
       WHEN t.check = TRUE AND (t.position between 4 AND 5) THEN 4
       WHEN t.check = TRUE AND t.position <= 3 THEN 5
  END as rating
FROM (
SELECT
  query,
  result_id,
  position,
  notes,
  (notes ilike concat('%', query, '%')) as check
FROM fb_search_results) t

So, we are done with the query. We have all the logic now and implemented in the SQL query. Now, let's run this query to see the output.

So, you can see that we do have an output. We have the 'result_id' which is the primary key, the search term ('query') itself, and then the 'performance rating' of the search algorithm.

Now, let's check the solution.

So, we can see that it actually works.