Facebook Data Science Interview Practice - Calculate the Distribution of Comments

Want to practice Facebook data science interview questions? Here we get you a real data science interview question from Facebook.

The question is a hard question because it requires several manipulations of the data that results in 2 SUBQUERIES that help create new aggregate columns. You’re tested on whether or not you understand what type of JOIN to use (in this case we use an INNER JOIN) and if you can filter data based on the correct dates. The question also tests on whether you know how to transpose data columns and re-aggregate already aggregate data using SUBQUERIES. This question covers concepts that are commonly found in data science interviews at Facebook and Google.

Practice The Real Facebook Data Science Interview Question

Question

Our suggestion is to try to solve this Facebook data scientist interview question first before reading the solution. You can try to solve the interview question here (https://platform.stratascratch.com/coding-question?id=10297&python=).

Write a query to calculate the distribution of comments by the count of users that joined Facebook between 2018 and 2020, for the month of January 2020.

The output should contain a count of comments and the corresponding number of users that made that number of comments in Jan-2020. For example, you'll be counting how many users made 1 comment, 2 comments, 3 comments, 4 comments, etc in Jan-2020. Your left column in the output will be the number of comments while your right column in the output will be the number of users. Sort the output from the least number of comments to the highest.

Isn't it wordy and slightly confusing?

So, what you could do is press the expected output button and you'll get the required columns that are needed for the correct solution and then the first five rows of the output.

If there are more than five rows it will not show all of the rows. It will just show the first five. But what we have is a better understanding now. We have a number of comments and how many users actually made that number of comments.

Assumptions

It shows that six users made one comment in January 2020, 11 users made two comments in January 2020 and then so on and so forth. So, I have a better understanding now of what this question is really asking me to do. Now, what I want to do next is look at the tables. I can use the preview table buttons to look at the underlying data.

After pressing the preview button, if we look at Facebook users what we have are five different columns. The first column ID is definitely going to be needed because we'll be counting users.

So, I can use the id column or I can probably use the name column. But the problem with using the name column is there are probably duplicate names like people have identical names so using a primary key column like id would definitely be much better.

Then joined_at is the next column I need because I want to filter out users that joined Facebook before 2018 or after 2020.

The next table we'll take a look at is Facebook comments (fb_comments).

These are just basically the comments. The first column user_id is going to connect to my id column under fb_users table. So, I know how to join these tables.

Then the created_at I need to filter and look up for January 2020 because I only care about those comments.

Approach

Here is how you can approach this Facebook data science interview question.

The first thing I have to do is join these two tables together. So, I write:

join fb_users and fb_comments by the id and id_user key

Once we perform that join what we will get is a list of users that made comments in January 2020 and also joined Facebook between 2018 and 2020.

The next thing to do is to count the number of comments made by users.

count the number of comments made by users

This will give me a comment count. And after that, I want to count the number of users that made the corresponding number of comments.

count the number of users that made the corresponding number
of comments

It seems a confusing part! So, what I want to do is count the number of users that made that many comments. For example, 6 users made 1 comment and 11 users made 2 comments in January 2020.

One thing more to add is actually the logic to just include users that join Facebook between 2018 and 2020 and then count comments that were made in January 2020.

-- join fb_users and fb_comments by the id and id_user key
  -- add logic for join date of 2018 to 2020
  -- add logic for comment date of jan-2020
-- count the number of comments made by users
-- count the number of users that made the corresponding number
   of comments

Now, l will join the two tables together and then add the logic of the dates.

SELECT

FROM fb_users a
JOIN fb_comments b ON a.id = b.user_id

I ignored the select clause real quick and just join the two tables together on the key.

One thing to note is I am doing an inner join so I'll be more explicit and actually type out inner join. The reason why I know this is going to be an inner join is that I really only care about all the comments that users make. The Facebook users table is probably theoretically much larger than the comments in terms of how many users that are included in the comments table because probably not all users make comments on Facebook. So, if I only care about comments and counting comments then I'm going to use an inner join.

SELECT

FROM fb_users a
INNER JOIN fb_comments b ON a.id = b.user_id

Now, we'll add the two logical statements in the WHERE clause.

SELECT

FROM fb_users a
INNER JOIN fb_comments b ON a.id = b.user_id
WHERE a.joined_at BETWEEN '01-01-2018' AND '12-31-2020'

This is saying that a user needs to join between 2018 and 2020. I'm using the full dates because I want the first date of 2018 and the last day of 2020.

Now, I'm going to add parentheses. That way I can add the second logic which is comment date of January 2020.

SELECT

FROM fb_users a
INNER JOIN fb_comments b ON a.id = b.user_id
WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
  AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')

The tables we have are not very big. There are not a lot of columns but if you are working at a company in the industry you might encounter tables that have a lot of columns or a lot of data. So, it's good not to always put an asterisk and just to really isolate the data that you need.

The data that we need is the id, user_id, created_at and joined_at.

We need basically four columns:

SELECT
  a.id,
  a.joined_at,
  b.user_id,
  b.created_at
FROM fb_users a
INNER JOIN fb_comments b ON a.id = b.user_id
WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
  AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')

These are the four columns that we need. Now, let's run this code to make sure we wrote it correctly.

And that's exactly what we are expecting!

The next thing we have to do is count the number of comments made by users. So, the query is going to become a subquery.

Now, I want to count the id and I can either use id or the user_id. It doesn't really matter. I'm just going to pick id then I'm going to perform a count where I am counting the number of rows. For count, I can really just pick any column. And I will pick the user_id and rename it as comment_cnt.

SELECT
  t.id,
  count(t.user_id) as comment_cnt
FROM (
    SELECT
      a.id,
      a.joined_at,
      b.user_id,
      b.created_at
    FROM fb_users a
    INNER JOIN fb_comments b ON a.id = b.user_id
    WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
      AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')) t

Then we need a GROUP BY.

SELECT
  t.id,
  count(t.user_id) as comment_cnt
FROM (
    SELECT
      a.id,
      a.joined_at,
      b.user_id,
      b.created_at
    FROM fb_users a
    INNER JOIN fb_comments b ON a.id = b.user_id
    WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
      AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')) t
GROUP BY t.id

If we run this query, we get the id and the number of comments that the user has made in January 2020.

So, it's starting to look how we want it to become. Now, what we need to do is transpose this. I am going to count the number of users that made the corresponding number of comments.

Now, the whole block of code is going to be a subquery.

SELECT
FROM (
    SELECT
      t.id,
      count(t.user_id) as comment_cnt
    FROM (
        SELECT
          a.id,
          a.joined_at,
          b.user_id,
          b.created_at
        FROM fb_users a
        INNER JOIN fb_comments b ON a.id = b.user_id
        WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
          AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')) t
    GROUP BY t.id) c

Now, in the SELECT class, I want the comment_cnt. Then we have to count the number of users that made the corresponding comment count. What I mean by that is count the id and rename it as user_cnt. Then we need to group this with GROUP BY comment_cnt.

SELECT
  comment_cnt,
  count(id) as user_cnt
FROM (
    SELECT
      t.id,
      count(t.user_id) as comment_cnt
    FROM (
        SELECT
          a.id,
          a.joined_at,
          b.user_id,
          b.created_at
        FROM fb_users a
        INNER JOIN fb_comments b ON a.id = b.user_id
        WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
          AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')) t
    GROUP BY t.id) c
GROUP BY comment_cnt

Then there's also an order that says "Sort the output from the least number of comments to highest".

Now, we'll do:

ORDER BY comment_cnt ASC

Solution

-- join fb_users and fb_comments by the id and id_user key
  -- add logic for join date of 2018 to 2020
  -- add logic for comment date of jan-2020
-- count the number of comments made by users
-- count the number of users that made the corresponding number
   of comments

SELECT
  comment_cnt,
  count(id) as user_cnt
FROM (
    SELECT
      t.id,
      count(t.user_id) as comment_cnt
    FROM (
        SELECT
          a.id,
          a.joined_at,
          b.user_id,
          b.created_at
        FROM fb_users a
        INNER JOIN fb_comments b ON a.id = b.user_id
        WHERE (a.joined_at BETWEEN '01-01-2018' AND '12-31-2020')
          AND (b.created_at BETWEEN '01-01-2020' AND '01-31-2020')) t
    GROUP BY t.id) c
GROUP BY comment_cnt
ORDER BY comment_cnt ASC

If we run this code we get the expected output.

And if we go and check the solution, it says that it's correct.

So, the entire query makes sense! We are basically joining the two tables together and filtering for users that join Facebook between 2018 and 2020 and then users that made a comment in January 2020. We are using an inner join because we are trying to preserve the number of comments or all of the comments that users make. And through making the entire thing a subquery, we can count the number of comments made by users and that's what we have done. Then we have created a distribution of comments so that the output looks like we have the number of comments and then the number of users that made that number of comments. That means that we have another subquery. So, the whole thing is a subquery to the outer select where we have the comment count and then we are counting the number of users that made the corresponding number of comments.