Data Science Coding Interview Questions with 5 Technical Concepts

Data Science Coding Interview Questions


Understanding 5 coding concepts of data science coding interview questions that companies will test you on and preparing for them effectively

A data science interview involves multiple rounds and one of such rounds involves coding interview questions. The purpose of these data science coding interview questions is to check if a candidate can program and knows the required coding languages such as SQL and Python. We’ll cover the coding questions you may receive during your data science technical interview round.

In this article, we'll cover the 5 coding concepts that companies will test you on in your data science interviews. As you guys know, we’re always tracking and collecting data science coding interview questions from various companies to better understand what technical concepts you need to know to succeed in interviews and what data science skills you need to have to succeed on the job. So these are coding questions that I’ve collected and are as real-time as can be.

The companies that we’re sourcing these data science coding interview questions from today are Facebook, Postmates, Credit Karma, Ring Central, and Twitch. We’ll introduce the coding concept first to understand and then show you a real interview question we collected from the company.

Technical Concepts of Data Science Coding Interview Questions

This guide of technical concepts contains almost all types of data science coding interview questions you should expect when interviewing for a data scientist position.

From this list of data science coding interview questions, an interviewee should be able to prepare for:

  • The complex and tough questions
  • What answers will positively resonate with an employer
  • The different concepts that companies will test you on
  • Developing the confidence to ace the interview

We’ve broken these data science coding interview questions into five different concepts:

  1. Concept #1: Categorizations, aggregations, ratios
  2. Concept #2: LEFT JOINs and Subqueries and CTEs
  3. Concept #3: Subqueries in the WHERE clause
  4. Concept #4: Window functions
  5. Concept #5: Date manipulations

1. Categorizations, Aggregations, Ratios

Let’s start simple -- categorizations, aggregations, ratios. Aren’t these 3 different concepts? Yes, but they’re almost always tested together in one coding question. And it’s probably the most commonly tested concept on data science coding interviews, especially in the beginning rounds.

By categorizations, we mean CASE statements in SQL. We’ll stick with SQL technical concepts because most people use SQL for data science coding interviews.

Aggregations like taking the sum or average are always tested and usually are applied after the categorization takes place.

But they can also take the form of a ratio.

Here is an example coding interview question by Facebook


Table: fb_active_users

Link to the question: https://platform.stratascratch.com/coding/2005-share-of-active-users

Here’s the solution to this coding interview question:

SELECT active_users /total_users::float AS active_users_share
FROM
  (SELECT count(user_id) total_users,
          count(CASE
                    WHEN status = 'open' THEN 1
                    ELSE NULL
                END) AS active_users
   FROM fb_active_users
   WHERE country = 'USA') subq

If you read the above solution and it makes sense to you then congrats, you’re on the right track.

To solve this question, you need to understand how to properly write a case statement, properly aggregate the results of the case statement, and then take a ratio from the result of the aggregation itself. All in one query.

The tricky part isn’t even those 3 technical concepts. It’s organizing your code into logical steps -- as you can see there’s a subquery to split out the logic.

And there’s also a data type conversion from an integer to a float so that your output can be between 0 and 1.

These are the little nuances you need to know in addition to these 3 concepts. Lots of people forget these small things, especially if you’re out of practice or nervous.

2. LEFT JOINs and Subqueries and CTEs

The second concept of data science coding interview questions includes JOINs, subqueries, and CTEs. Again 3 different concepts? But they’re always tested together to solve one problem.

Let's have an example. This question is from Credit Karma Interview on Submissions.


Tables: loans, submissions

Link to the question: https://platform.stratascratch.com/coding/2003-recent-refinance-submissions

And here’s the SQL code solution for this coding interview question:

SELECT l.user_id, balance
FROM
  (SELECT DISTINCT id, user_id, created_at,
                   max(created_at) OVER (PARTITION BY user_id,
                                                      TYPE) most_recent
   FROM loans
   WHERE TYPE = 'Refinance') l
INNER JOIN submissions s ON l.id = s.loan_id
WHERE most_recent = created_at

It might look pretty complicated. It even includes other advanced concepts we’re not even covering. But you can see there are a JOIN statement and a subquery (which could be re-written as a CTE, if you wanted to do that).

What’s being tested is if you can blend two sources of data together and aggregate the data from one table, “balance” with a variable from another table “user_id”. Remember, a successful interviewee will talk through the entire logic and why they’re doing what they’re doing.

3. Subqueries in the WHERE Clause

While we’re talking about subqueries, they can be found in multiple places – not just at the JOIN statement like in the last question. Subqueries in the WHERE clause are also common concepts tested and actually, it’s a question that most people get wrong.

Here is a question from Ring Central Interview on Subqueries


Tables: rc_calls, rc_users

Link to the question: https://platform.stratascratch.com/coding/2017-paid-users-in-april-2020

And here’s the solution:

SELECT COUNT(DISTINCT user_id)
FROM rc_calls
WHERE date BETWEEN '2020-04-01' AND '2020-04-30'
  AND user_id IN
    (SELECT user_id
     FROM rc_users
     WHERE status = 'paid');

Subqueries in the WHERE clause allow you to quickly grab information from one table and use it as a filtering condition in the main query like we are seeing here. So in this example, we’re grabbing the user_id of paid users from the user table and using that as a condition to find how many calls they placed in the calls table.

4. Window Functions

The 4th coding concept tested on data science coding interviews so far in 2021 is window functions. Window functions are heavily utilized in data science work. We’ve talked about them extensively in the past articles - SQL Window Functions. One example of this is a question from Twitch interview using Window functions.


Table: twitch_sessions

Link to the question: https://platform.stratascratch.com/coding/2012-viewers-turned-streamers

Here’s the solution to this coding question:

SELECT user_id, count(*) n_sessions
FROM twitch_sessions
WHERE session_type = 'streamer'
  AND user_id in
    (SELECT user_id
     FROM
       (SELECT user_id,
               session_type,
               rank() OVER (PARTITION BY user_id
                            ORDER BY session_start) streams_order
        FROM twitch_sessions) s1
     WHERE streams_order =1
       AND session_type = 'viewer')
GROUP BY user_id
ORDER BY n_sessions DESC,
         user_id ASC

This one’s insane! And yes it did get asked in a data science interview. But don’t worry, in the upcoming weeks, StrataScratch will be stepping through the solution for each of the data science interview questions we just covered.

These questions take a lot of time to solve and they’re typically found in the later rounds of your data science interviews because there is so much code and it just takes time to talk through it.

5. Date Manipulations

The last technical concept of data science coding interview questions is date manipulations. This one is always asked on data science coding interviews because everyone in analytics from an analyst to a data scientist works with dates. It would be weird not to. Because you’re usually given a complete day, year-month-day-timestamp, but asked to aggregate by weeks, months, or years, you’ll need to know how to manipulate dates.

Here is one example from Postmates Interview with Date manipulations


Table: postmates_orders

Link to the question: https://platform.stratascratch.com/coding/2014-hour-with-the-highest-order-volume

And here’s the solution to this question:

WITH SUMMARY AS
  (SELECT HOUR,
          avg(n_orders) avg_orders
   FROM
     (SELECT date_part('hour', order_timestamp_utc) AS HOUR,
             order_timestamp_utc::date,
             count(id) n_orders
      FROM postmates_orders
      GROUP BY date_part('hour', order_timestamp_utc),
               order_timestamp_utc::date) sq
   GROUP BY HOUR)
SELECT *
FROM SUMMARY
WHERE avg_orders =
    (SELECT max(avg_orders)
     FROM SUMMARY)
ORDER BY avg_orders DESC

This one requires you to extract the hour from the DateTime in order to find the highest order volume. So it’s just a matter of knowing these DateTime functions, like date_part().

Conclusion

So these are the 5 technical concepts that data science companies will most likely to test you on in your interviews in 2021. Did you think these data science coding interview questions were difficult to answer? If so, we, at StrataScratch, will step through all the solutions in the upcoming weeks.

Remember, these are coding interview questions that were asked in the first few weeks/months of 2021, so they’re definitely requiring you to master these 5 technical concepts.

Data Science Coding Interview Questions


Become a data expert. Subscribe to our newsletter.