Uber Data Scientist Interview Questions

Uber Data Scientist Interview Questions


Solving recent Uber interview questions for the Data Scientist position, and some tips on how to solve similar questions.

On a cold winter evening in Paris, Travis Kalanick and Garrett Camp could not get a ride. That is when the idea for Uber was born. Uber has now grown to become a USD 89 Billion publicly listed company operating across 900 different locations in over 60 different countries. Each day Uber completes over 10 million trips and has till date completed over 10 billion. Uber is the poster child for the gig economy and has been consistently ranked as one of the best places to work.

In this article, we discuss real-life Uber Data Scientist interview questions with detailed solutions. We look at the process behind solving these questions, and some tips on how to solve similar questions. Find more real-life data scientist questions from Uber and other top tech companies here "data scientist interview questions".

Data Science Role at Uber

Data Science & Analytics teams at Uber handle multiple functions as one would expect from a large multinational corporation. There are dedicated teams that cover various aspects of the business –

  • Safety and Insurance
  • Rides
  • Risk
  • Eats
  • Policy
  • Platform
  • Marketing

There are different roles in these functions ranging from Product Analyst, Data Analyst, Data Scientist and Leadership.

Technical Concepts Tested in Uber Data Scientist Interview Questions

  • Aggregation / Sorting
  • Subqueries
  • Lag and Lead
  • Window Functions
  • CTE

Uber Data Scientist Interview Questions

Here you can practice real-life interview questions asked by Uber in their interviews for the data scientist position.

Uber Data Scientist Interview Question #1: Maximum Number of Employees Reached


Table: uber_employees

You can solve this question here: https://platform.stratascratch.com/coding-question?id=2046&python=

Dataset

Table: uber_employees
first_namelast_nameidhire_datetermination_datesalary
BobSmith12009-02-032016-01-0110000
JoeJarrod22009-02-0320000
NancySoley32009-02-0330000
KeithWidjaja42009-04-152014-01-0120000
KellySmalls52009-02-0320000

Assumptions

This step not only gives you thinking time, but also showcases your ability to visualize data, understand the business and plot a way to the solution. This also saves a lot of time later should you make an error and end up on the wrong path. Since you typically will not have access to the underlying data in the interview, you will have to ensure that your solution boundaries are reasonably well defined.

What are the assumptions on the data and the table?

first_name, last_name and id: These three seem to be employee name and ID fields. While you can expect id to be the unique identifier, you can confirm that an employee who left the organization and joined it again had a different id. This will tell you if you need to deduplicate the data or aggregate it.

salary field can be ignored as that is not needed in the query at all.

You confirm your assumption that if the employee is still with the organization, then the termination date is null or if any special dates are posted on the system like 31 Dec 2199.

Logic

Let us break this problem down.

1. To find out the number of employees anyone worked with, it would be simple if we had a table where we had the number of employees each day starting from the inception of the company to present date. Something like this –

Logic for Uber Data Scientist Interview Questions

2. Then we can query the number of employees for each employee based on the hire_date and the termination_date and manipulate this sub table to get the maximum number of employees and the date when this maximum number was reached.

The table however represents snapshot data – the situation as on a particular date. If tomorrow Joe Jarod (id 2) leaves the company, the termination_date will not be updated. So, we need to transform this table.

If we combine the hire_date and termination_date fields and a counter 1 for hiring and -1 for termination, we can convert this into a daily transaction type data. Once we have that, we can query in the aforementioned manner, and get the output.

At this point, you might want to check with the interviewer if they agree with your approach. They might provide you with some edge case that you might want to incorporate to get the final output. Once you have agreed on the logic, let's begin coding.

Coding - Solution

We can do this in a step-by-step manner.

Step 1- Tag each hiring and termination date with a count (1 in case of hire and -1 in case of termination):

SELECT hire_date AS dt, 1 AS num_emp FROM uber_employees
UNION ALL
SELECT termination_date AS dt, -1 AS num_emp FROM uber_employees WHERE termination_date IS NOT NULL

All required columns and the first 5 rows of the solution are shown

dtnum_emp
2009-02-031
2009-02-031
2009-02-031
2009-04-151
2009-02-031

Step 2- Calculate the cumulative sum of employees for each date using a window function:

SELECT dt, SUM(num_emp) OVER (ORDER BY dt) AS num_emp
FROM (
    -- Step 1 code here
) d

All required columns and the first 5 rows of the solution are shown

num_empdt
42009-02-03
12009-04-15
12009-06-20
12009-07-20
12009-08-13

This step calculates the cumulative sum of employee counts for each date, giving the total number of employees working under a given employee up to that date.

Step 3- Join the main table with the cumulative employee count table:

FROM uber_employees e
JOIN (
    -- Step 2 code here
) d2
ON d2.dt >= e.hire_date AND d2.dt < COALESCE(e.termination_date, current_date)

All required columns and the first 5 rows of the solution are shown

num_empdt
42009-02-03
52009-04-15
62009-06-20
72009-07-20
82009-08-13

This step joins the main table (uber_employees) with the cumulative employee count table based on the employee's hire date and termination date (or the current date if the termination date is NULL).

Step 4 - Filter the results based on the maximum number of employees working under each employee during their tenure:

WHERE d2.num_emp = (
    SELECT MAX(num_emp)
    FROM (
        -- Step 2 code repeated here
    ) d3
    WHERE d3.dt >= e.hire_date AND d3.dt < COALESCE(e.termination_date, current_date)
)

All required columns and the first 5 rows of the solution are shown

idnum_empdt
1572015-04-10
1572015-07-20
1572015-09-10
1562015-05-01
1562016-01-01

This step keeps only the rows where the employee count is equal to the maximum employee count under each employee during their tenure.

Step 5 - Deduplicate the results to get each employee's ID, maximum employees, and the first date when they had the maximum employees working under them:

SELECT DISTINCT ON (e.id) e.id, d2.num_emp, d2.dt
-- Steps 3 and 4 code here
ORDER BY e.id, d2.dt

This final step deduplicates the results based on the employee's ID and sorts the output by the employee's ID and the date, showing each employee's ID, the maximum number of employees they had working under them, and the first date when they had that maximum count.

Here is the full code.

FROM uber_employees e
JOIN (
    SELECT dt, SUM(num_emp) OVER (ORDER BY dt) AS num_emp
    FROM (
        SELECT hire_date AS dt, 1 AS num_emp FROM uber_employees
        UNION ALL
        SELECT termination_date AS dt, -1 AS num_emp FROM uber_employees WHERE termination_date IS NOT NULL
    ) d
) d2
ON d2.dt >= e.hire_date AND d2.dt < COALESCE(e.termination_date, current_date)
WHERE d2.num_emp = (
    SELECT MAX(num_emp)
    FROM (
        SELECT dt, SUM(num_emp) OVER (ORDER BY dt) AS num_emp
        FROM (
            SELECT hire_date AS dt, 1 AS num_emp FROM uber_employees
            UNION ALL
            SELECT termination_date AS dt, -1 AS num_emp FROM uber_employees WHERE termination_date IS NOT NULL
        ) d
    ) d3
    WHERE d3.dt >= e.hire_date AND d3.dt < COALESCE(e.termination_date, current_date)
)
ORDER BY e.id, d2.dt;

All required columns and the first 5 rows of the solution are shown

idmaxempdateaq
1572015-04-10
2582017-02-03
3582017-02-03
4482013-12-23
5582017-02-03

And there we have our desired output.

Optimization

An extremely popular follow-up data science question is -- Can you optimize the code?

What should you look out for? The main thing is condensing CTEs and subqueries or removing JOINs and using a CASE statement instead. Those are the low hanging fruits.

So, for our solution:

  1. Let us take a look at all the queries to see if we can condense/collapse some.
  2. Can we condense the identification of users that were first viewers? You can’t because you need to find their 1st streams regardless of if they were streamers or viewers. Then you need to create another query to work on top of your view. So you need to have these subqueries within subqueries.
    1. What are other ways to write this? You can use a CTE or a temp table.
    2. You then might get asked, which one performs better?
      1. https://dba.stackexchange.com/questions/13112/whats-the-difference-between-a-cte-and-a-temp-table/13117#13117
      2. Temp tables if you are going to use this for other queries.
      3. Temp tables can be indexed so this itself can make the query faster.

Uber Data Scientist Interview Question #2: ​Days Without Hiring/Termination


Table: uber_employees

You can solve this Uber data scientist interview question here https://platform.stratascratch.com/coding/2045-days-without-hiringtermination

Dataset

Table: uber_employees
first_namelast_nameidhire_datetermination_datesalary
BobSmith12009-02-032016-01-0110000
JoeJarrod22009-02-0320000
NancySoley32009-02-0330000
KeithWidjaja42009-04-152014-01-0120000
KellySmalls52009-02-0320000

This dataset is identical to the one we discussed above. This is one of the Uber Data Scientist interview questions that test your understanding of Lag and Lead functions. You can solve this interview question using CTE as well, although it is not needed here as a sub-query will suffice.

Approach:

  1. Get the difference in days between two consecutive hiring dates using the lag / lead functions.
  2. Get the maximum of this difference.
  3. Repeat the process with termination dates as well.

Uber Data Scientist Interview Question #3: ​Employees' Without Annual Review


Tables: uber_employees, uber_annual_review

You can solve this Uber data scientist interview question here - https://platform.stratascratch.com/coding/2043-employees-without-annual-review

Dataset

Table: uber_employees
first_namelast_nameidhire_datetermination_datesalary
BobSmith12009-02-032016-01-0110000
JoeJarrod22009-02-0320000
NancySoley32009-02-0330000
KeithWidjaja42009-04-152014-01-0120000
KellySmalls52009-02-0320000
Table: uber_annual_review
idemp_idreview_date
112013-01-20
212015-01-20
322013-02-03
422015-02-03
522017-02-03

This additional dataset contains the date of review by employee id. Note: The emp_id field needs to be used to join with the uber_employees table, not the id field.

This question from a real-life Uber Data Scientist interview is relatively simple. It tests your SQL data manipulation skills using joins, sorting and subsetting using the select statement.

Approach

  1. Remove all employees with an annual review using a subquery and join.
  2. You can then output the required fields sorted by hire_date

Uber Data Scientist Interview Question #4: ​Percentage Of Revenue Loss


Table: uber_orders

You can solve this problem here - https://platform.stratascratch.com/coding/2048-percentage-of-revenue-loss

Dataset

Table: uber_orders
order_datenumber_of_ordersstatus_of_ordermonetary_valueservice_name
2016-01-01180Cancelled76076000Uber_BOX
2016-01-0110Cancelled1960140Uber_CLEAN
2016-01-0128380Cancelled489898500Uber_FOOD
2016-01-010Cancelled728000Uber_GLAM
2016-01-0110Cancelled136500Uber_KILAT

Besides the order_date field, all the other fields will be needed in this problem.

While the dataset is straightforward, the problem is a little lengthy because of the metrics that need to be calculated. This is one of the medium difficulty level Uber Data Science interview questions that test your skills with SQL Conditional calculations. This question can be solved using multiple approaches - using a CTE, FILTER or CASE WHEN statements.

Approach

  1. To solve this we first need to calculate the Value per order for each service_name for Completed orders.
  2. Using this average, we need to assign these averages to the Cancelled orders and estimate the loss.
  3. Finally calculated the Revenue loss basis of the value of Cancelled orders as a % of the total possible value (Completed + Cancelled)

Uber Data Scientist Interview Question #5: ​Total Monetary Value Per Month/Service


Table: uber_orders

You can solve this Uber data scientist interview question here - https://platform.stratascratch.com/coding/2047-total-monatery-value-per-monthservice

Dataset

This problem uses the same uber_orders dataset used in the previous problems.

This problem is arguably easier than the previous problem with the same dataset. This real-life Uber Data Scientist interview question can be solved with CASE WHEN or FILTER functions in SQL.

Approach

  1. We start with Completed orders
  2. Then we get the month and year from the order_date field.
  3. Finally using conditionals CASE WHEN or FILTER functions we can get the Total Monetary Value for each service_name and aggregate them on a monthly basis.

Uber Data Scientist Interview Question #6: ​Total Order Per Status Per Service


Table: uber_orders

You can solve this Uber data scientist interview question here - https://platform.stratascratch.com/coding/2049-total-order-per-status-per-service

Dataset

This problem too uses the uber_orders dataset used in the previous two problems

This problem is perhaps the easiest of the lot and a nice way to round off the practice problems. This real-life Uber Data Scientist Interview question can be solved with a simple SUM + GROUP BY combination.

Approach

Simply Sum the number_of_orders grouped by service_name and status_of_order

Conclusion

In this article, we have discussed in detail the approach to solving the real-life Uber Data Scientist interview questions. You can use a similar step-by-step approach to solving more complex problems.

Problems like this separate those with a passing knowledge of SQL from those who can understand the intricacies of data. The key to solving this problem lay in creating the transaction type sub-table. If you're wondering what other SQL interview questions are out there you might be asked in Uber interviews, check out our post Uber SQL interview questions.

Uber Data Scientist Interview Questions


Become a data expert. Subscribe to our newsletter.