# Uber Data Scientist Interview Questions

##### Categories

*Recent Uber interview questions for Data Scientist position*

On a cold winter evening in Paris, Travis Kalanick and Garrett Camp could not get a ride. That is when the idea for Uber was born. Uber has now grown to become a USD 89 Billion publicly listed company operating across 900 different locations in over 60 different countries. Each day Uber completes over 10 million trips and has till date completed over 10 billion. Uber is the poster child for the gig economy and has been consistently ranked as one of the best places to work.

In this article, we discuss real-life Uber Data Scientist interview questions with detailed solutions. We look at the process behind solving these questions, and some tips on how to solve similar questions. Find more real-life data scientist questions from Uber and other top tech companies here "*data scientist interview questions*".

## Data Science Role at Uber

Data Science & Analytics teams at Uber handle multiple functions as one would expect from a large multinational corporation. There are dedicated teams that cover various aspects of the business –

- Safety and Insurance
- Rides
- Risk
- Eats
- Policy
- Platform
- Marketing

There are different roles in these functions ranging from Product Analyst, Data Analyst, Data Scientist and Leadership.

## Technical Concepts Tested in Uber Data Scientist Interview Questions

- Aggregation / Sorting
- Subqueries
- Lag and Lead
- Window Functions
- CTE

## Uber Data Scientist Interview Question

Maximum Number of Employees Reached

*"Write a query that returns every employee that has ever worked for the company. For each employee, calculate the greatest number of employees that worked for the company during their tenure and the first date that number was reached.Your output should have the employee ID, the greatest number of employees that worked for the company during the employee's tenure, and the first date that number was reached."*

You can solve this question here: https://platform.stratascratch.com/coding-question?id=2046&python=

**Table Schema**

first_name | varchar |

last_name | varchar |

id | int |

hire_date | datetime |

termination_date | datetime |

salary | int |

**Assumptions**

This step not only gives you thinking time, but also showcases your ability to visualize data, understand the business and plot a way to the solution. This also saves a lot of time later should you make an error and end up on the wrong path. Since you typically will not have access to the underlying data in the interview, you will have to ensure that your solution boundaries are reasonably well defined.

*What are the assumptions on the data and the table?*

*first_name*, *last_name* and *id*: These three seem to be employee name and ID fields. While you can expect id to be the unique identifier, you can confirm that an employee who left the organization and joined it again had a different id. This will tell you if you need to deduplicate the data or aggregate it.

*salary* field can be ignored as that is not needed in the query at all.

You confirm your assumption that if the employee is still with the organization, then the termination date is null or if any special dates are posted on the system like 31 Dec 2199.

This is the table that we have to work with –

**Logic**

Let us break this problem down.

1. To find out the number of employees anyone worked with, it would be simple if we had a table where we had the number of employees each day starting from the inception of the company to present date. Something like this –

2. Then we can query the number of employees for each employee based on the hire_date and the termination_date and manipulate this sub table to get the maximum number of employees and the date when this maximum number was reached.

The table however represents snapshot data – the situation as on a particular date. If tomorrow Joe Jarod (id 2) leaves the company, the termination_date will not be updated. So, we need to transform this table.

*If we combine the hire_date and termination_date fields and a counter 1 for hiring and -1 for termination, we can convert this into a daily transaction type data. Once we have that, we can query in the aforementioned manner, and get the output.*

At this point, you might want to check with the interviewer if they agree with your approach. They might provide you with some edge case that you might want to incorporate to get the final output. Once you have agreed on the logic, let's begin coding.

**Coding - Solution**

We can do this in a step-by-step manner.

**1**. Tag each hiring and termination date with a count (1 in case of hire and -1 in case of termination)

```
select hire_date as dt, 1 as num_emp from uber_employees
UNION ALL
select termination_date as dt, -1 as num_emp from uber_employees
where termination_date IS NOT NULL
```

**2**. Summarize for each date. The value now represents the change in the number of employees for that day. Remember to exclude the termination dates that are NULL

```
select sum(num_emp) as num_emp, dt from
(
select hire_date as dt, 1 as num_emp from uber_employees
UNION ALL
select termination_date as dt, -1 as num_emp from uber_employees
where termination_date IS NOT NULL
) d
group by dt
order by dt
```

**3**. Use a window function to get a cumulative sum.

```
select sum(num_emp) over (order by dt) as num_emp, dt from
(
select sum(num_emp) as num_emp, dt from
(
select hire_date as dt, 1 as num_emp from uber_employees
UNION ALL
select termination_date as dt, -1 as num_emp from uber_employees
where termination_date IS NOT NULL
) d
group by dt
)d2
```

**4**. Merge with the main table to get the value for each employee. Sort in descending order of employee count and ascending order of date. *Note: We use the COALESCE function to set the termination date to the current date in case terminate_date is empty.*

```
select id, num_emp, dt from uber_employees e
join
(
select sum(num_emp) over (order by dt) as num_emp, dt from
(
select sum(num_emp) as num_emp, dt from
(
select hire_date as dt, 1 as num_emp from
uber_employees
UNION ALL
select termination_date as dt, -1 as num_emp from
uber_employees
where termination_date IS NOT NULL
) d
group by dt
) d2
) d3
on d3.dt between e.hire_date and coalesce(e.termination_date,
current_date)
order by id, num_emp desc, dt
```

**5**. Deduplicate to get the maximum employees and the first date.

```
select distinct on (id) id, num_emp, dt from uber_employees e
join
(
select sum(num_emp) over (order by dt) as num_emp, dt from
(
select sum(num_emp) as num_emp, dt from
(
select hire_date as dt, 1 as num_emp from uber_employees
UNION ALL
select termination_date as dt, -1 as num_emp from
uber_employees
where termination_date IS NOT NULL
) d
group by dt
)d2
)d3
on d3.dt between e.hire_date and coalesce(e.termination_date, current_date)
order by id, num_emp desc, dt desc
```

And there we have our desired output.

**Optimization**

An extremely popular follow-up data science question is -- Can you optimize the code?

What should you look out for? The main thing is condensing CTEs and subqueries or removing JOINs and using a CASE statement instead. Those are the low hanging fruits.

So, for our solution:

- Let us take a look at all the queries to see if we can condense/collapse some.
- Can we condense the identification of users that were first viewers? You can’t because you need to find their 1st streams regardless of if they were streamers or viewers. Then you need to create another query to work on top of your view. So you need to have these subqueries within subqueries.
- What are other ways to write this? You can use a CTE or a temp table.
- You then might get asked, which one performs better?
- https://dba.stackexchange.com/questions/13112/whats-the-difference-between-a-cte-and-a-temp-table/13117#13117
- Temp tables if you are going to use this for other queries.
- Temp tables can be indexed so this itself can make the query faster.

### Additional Uber Data Scientist Interview Questions

Days Without Hiring/Termination

*"Write a query to calculate the longest period (in days) that the company has gone without hiring anyone. Also, calculate the longest period without firing anyone."*

You can solve this Uber data scientist interview question here -

https://platform.stratascratch.com/coding-question?id=2045&python=

**Dataset**

*Uber_Employees*

This dataset is identical to the one we discussed above. This is one of the Uber Data Scientist interview questions that test your understanding of Lag and Lead functions. You can solve this interview question using CTE as well, although it is not needed here as a sub-query will suffice.

**Approach:**

- Get the difference in days between two consecutive hiring dates using the lag / lead functions.
- Get the maximum of this difference.
- Repeat the process with termination dates as well.

Employees' Without Annual Review

*"Return all employees who have never had an annual review. Your output should include the employee's first name, last name, hiring date, and termination date. List the most recently hired employees first."*

You can solve this Uber data scientist interview question here -

https://platform.stratascratch.com/coding-question?id=2043&python=

**Dataset**

Besides the Uber_Employees from above, this problem has an additional dataset - uber_annual_review

This additional dataset contains the date of review by employee id. *Note: The emp_id field needs to be used to join with the uber_employees table, not the id field.*

This question from a real-life Uber Data Scientist interview is relatively simple. It tests your SQL data manipulation skills using joins, sorting and subsetting using the select statement.

**Approach**

- Remove all employees with an annual review using a subquery and join.
- You can then output the required fields sorted by hire_date

Percentage Of Revenue Loss

*"For each service, calculate the percentage of incomplete orders along with the revenue loss. Your output should include the name of the service, percentage of incomplete orders, and total revenue loss from the incomplete orders."*

You can solve this problem here -

https://platform.stratascratch.com/coding-question?id=2048&python=

**Dataset**

*uber_orders*

Besides the order_date field, all the other fields will be needed in this problem.

While the dataset is straightforward, the problem is a little lengthy because of the metrics that need to be calculated. This is one of the medium difficulty level Uber Data Science interview questions that test your skills with SQL Conditional calculations. This question can be solved using multiple approaches - using a CTE, FILTER or CASE WHEN statements.

**Approach**

- To solve this we first need to calculate the Value per order for each service_name for Completed orders.
- Using this average, we need to assign these averages to the Cancelled orders and estimate the loss.
- Finally calculated the Revenue loss basis of the value of Cancelled orders as a % of the total possible value (Completed + Cancelled)

Total Monetary Value Per Month/Service

*"Find the total monetary value for completed orders by service type for every month. Output your result as a pivot table where there is a column for month and columns for each service type."*

You can solve this Uber data scientist interview question here -

https://platform.stratascratch.com/coding-question?id=2047&python=

**Dataset**

This problem uses the same uber_orders dataset used in the previous problems.

This problem is arguably easier than the previous problem with the same dataset. This real-life Uber Data Scientist interview question can be solved with CASE WHEN or FILTER functions in SQL.

**Approach**

- We start with Completed orders
- Then we get the month and year from the order_date field.
- Finally using conditionals CASE WHEN or FILTER functions we can get the Total Monetary Value for each service_name and aggregate them on a monthly basis.

Total Order Per Status Per Service

Uber is interested in identifying gaps in its business. Calculate the count of orders for each status of each service. Your output should include the service name, status of the order, and the number of orders.

You can solve this Uber data scientist interview question here -

https://platform.stratascratch.com/coding-question?id=2049&python=

**Dataset**

This problem too uses the uber_orders dataset used in the previous two problems

This problem is perhaps the easiest of the lot and a nice way to round off the practice problems. This real-life Uber Data Scientist Interview question can be solved with a simple SUM + GROUP BY combination.

**Approach**

Simply Sum the number_of_orders grouped by service_name and status_of_order

*Are you wondering what other SQL interview questions are out there that you might be asked in any data science interview? Check out our ultimate guide to SQL interview questions.*

### Conclusion

In this article, we have discussed in detail the approach to solving the real-life Uber Data Scientist interview questions. You can use a similar step-by-step approach to solving more complex problems.

Problems like this separate those with a passing knowledge of SQL from those who can understand the intricacies of data. The key to solving this problem lay in creating the transaction type sub-table. To be able to reach the level of comfort with data, you need to practice. Join our platform to practice more such interview questions from companies like Facebook, Amazon, Microsoft, Netflix and more. We have a community of over 20,000 aspiring data scientists seeking to improve their coding skills, prepare for interviews, and jump start their careers.