LinkedIn Data Scientist Interview Questions

Data Scientist Position at LinkedIn

LinkedIn Data Scientist Positions typically work for the Data Science team which belongs to the larger Data engineering division. As a member of the Data Science team, you will leverage data from over 700 million members to provide insights to product, sales, marketing, economics, infrastructure, and other teams. These data insights will drive strategy, apply statistical inference and algorithms for optimization, and build engineered solutions to scale LinkedIn’s ongoing business objectives and long-term company vision.

Data Scientist Positions at LinkedIn require skills with SQL, structured and unstructured data, distributed data systems, programming languages such as R or Python, data workflows, and more. For additional information on the Data Science team at LinkedIn, here's their official webpage.

Concepts Tested in LinkedIn Data Scientist Interview Questions

The primary SQL skills you’d show in a LinkedIn Data Scientist Interview challenge include:

Use of basic arithmetic operations on columns of data
Using subqueries
Use of JOIN to combine tables of data
How to use the CEILING function
How to cast integer data into floating point data
Using WHERE to filter out results

The problem we’re looking at next requires knowledge of several of these concepts, and you’ll be able to apply them in your own code on the StrataScratch platform to better understand them and become interview-ready.

LinkedIn Data Scientist Interview Question

Risky Projects

This question was heard in an interview for a data scientist position at LinkedIn. It’s titled “Risky Projects”, and the key challenge is finding which LinkedIn projects are over budget using data from disparate sets.

Risky Projects

Interview Question Date: November 2020

MediumID 10304

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Identify projects that are at risk for going overbudget. A project is considered to be overbudget if the cost of all employees assigned to the project is greater than the budget of the project.

You'll need to prorate the cost of the employees to the duration of the project. For example, if the budget for a project that takes half a year to complete is $10K, then the total half-year salary of all employees assigned to the project should not exceed $10K. Salary is defined on a yearly basis, so be careful how to calculate salaries for the projects that last less or more than one year.

Output a list of projects that are overbudget with their project name, project budget, and prorated total employee expense (rounded to the next dollar amount).

HINT: to make it simpler, consider that all years have 365 days. You don't need to think about the leap years.

Tables: linkedin_projects, linkedin_emp_projects, linkedin_employees

Link to the question: https://platform.stratascratch.com/coding/10304-risky-projects

Ultimately, the question is asking us to use employee salaries, project length, and project budget to determine which projects exceed expectations.

This LinkedIn data scientist interview question is really more along the lines of medium-difficult than medium since, as we will see, the answer requires subqueries, several column calculations, and joining multiple data tables. While there are a number of ways to solve this question, we’re going to focus on a flexible solution that shows our knowledge of several important SQL concepts.

Framework to Solve the Problem

One of the most important principles when answering data science problems is to start with a framework. We’ll outline a three-step framework which builds a logical process for us to approach and solve this problem through code. We’re adapting this framework to the question at hand, but, with a few simple adjustments, you can frame any other question. Here are our steps:

1. Understand your data:

First look at all the columns the question gives you and make some assumptions about the data. If you’re given multiple tables, take note of which data each of the tables are giving you and how you might need it for your answer.
If you fail to understand the schema, see if you can examine some example data. Look at the first few rows and attempt to match why values correlate to their respective columns. Don’t hesitate to ask the LinkedIn interviewer for example values if you don’t receive any initially. Seeing sample data can help you establish limits to your solution or determine if you need to broaden it for edge cases.

2. Formulate your approach:

Begin writing down logical programming or code steps. It doesn’t need to be in order first. In fact, the solution we provide here won’t be in order either as we query the project dates before we grab salary in this problem.
It’s helpful to identify the main functions you have to use to perform your calculations. Attempt to translate the problem statement and data into a series of SQL functions.
Try to not be silent when figuring out the solution. Your LinkedIn interviewer will want to understand how you approach problems. You can ask them for clarification and can expect them to specify if you have to write everything from scratch.

3. Code Execution:

It’s important to build code in such a way as to avoid presenting an oversimplified or overcomplicated solution. Take advantage of blank space between subqueries to keep your solution neat and organized. There’s never any harm in commenting out a code block you want to return to later.
Follow the steps you outlined in the beginning. This will help you avoid confusion and ensure you’re answering all of the questions.
Most importantly, think through your functions and operations. Using these well can produce efficient and generic solutions.
Speak through your code with the LinkedIn interviewer. They’re going to be evaluating your problem-solving abilities as you begin writing.

Understand Your Data

The first step in our framework is to examine our data. Typically, you won’t have access to actual data nor will you be able to execute code during interviews like those at LinkedIn. Usually, you’ll have to understand the data and instead make assumptions based on the schema and information the interviewer gives you.

In the case of this LinkedIn data scientist interview question, the question provides us with the schema for three tables, so we need to start by looking at what’s in each table.

Table: linkedin_projects

id	title	budget	start_date	end_date
1	Project1	29498	2018-08-31	2019-03-13
2	Project2	32487	2018-01-27	2018-12-13
3	Project3	43909	2019-11-05	2019-12-09
4	Project4	15776	2018-06-28	2018-11-20
5	Project5	36268	2019-03-13	2020-01-02

This linkedin_projects table tells us how long a project has been running, its budget, and its id.

Table: linkedin_emp_projects

emp_id	project_id
10592	1
10593	2
10594	3
10595	4
10596	5

The linkedin_emp_projects table maps out which employees work on which projects.

Table: linkedin_employees

id	first_name	last_name	salary
10592	Jennifer	Roberts	20204
10593	Haley	Ho	33154
10594	Eric	Mccarthy	32360
10595	Gina	Martinez	46388
10596	Jason	Fields	12348

Linkedin_employees tells us the salary of each employee and their id. Since we need to know the salary, employee and project ids, project length, and project budget to calculate which projects are over budget, we’ll have to join data from all three tables using the ids. We can also at this point conclude we’ll be able to throw out some irrelevant columns such as project titles and employee first and last names.

Solution:

Formulate Approach

According to our framework, we’ll want to outline some general steps to translate into code. We’re going to keep it high-level and get more specific later on, but outlining these steps initially will make writing the code easier. Here are the general steps:

Query linkedin_projects and use subtraction and floating point type casting to get total project durations in years
Put our first query to the side while we JOIN our linkedin_emp_projects and linkedin_employees tables on the employee id to get a yearly salary for each project id
JOIN our initial query to the project yearly salary on the project id
Calculate the prorated employee expense by multiplying project yearly salary by project duration and round out our output using the CEILING function
Use a WHERE clause to filter for projects over budget

Calculate Project Durations

Let’s translate the general steps we laid out into functional SQL code. For the first step, we start by querying the linkedin_projects table and subtracting the project end_date from the project start_date. We don’t have to worry about odd differences here since end_date will always be after start_date. Since our salary data is in years, we must also convert our project length into a fraction of years and will need to cast it as a floating point decimal to avoid results rounding down to 0.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       end_date - start_date AS project_duration
FROM linkedin_projects

All required columns and the first 5 rows of the solution are shown

id	title	budget	start_date	end_date	project_duration
1	Project1	29498	2018-08-31	2019-03-13	194
2	Project2	32487	2018-01-27	2018-12-13	320
3	Project3	43909	2019-11-05	2019-12-09	34
4	Project4	15776	2018-06-28	2018-11-20	145
5	Project5	36268	2019-03-13	2020-01-02	295

We have project duration in days, so we’ll want to divide it by 365 to get it in years to compare to the yearly salaries:

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365 AS project_duration
FROM linkedin_projects

All required columns and the first 5 rows of the solution are shown

id	title	budget	start_date	end_date
1	Project1	29498	2018-08-31	2019-03-13
2	Project2	32487	2018-01-27	2018-12-13
3	Project3	43909	2019-11-05	2019-12-09
4	Project4	15776	2018-06-28	2018-11-20
5	Project5	36268	2019-03-13	2020-01-02

Because we divided integer by integer, our calculation rounds down to 0 since all projects are less than year. Let’s cast the 365 denominator as a float to get the decimal value of project duration in years.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365::float AS project_duration
FROM linkedin_projects

All required columns and the first 5 rows of the solution are shown

id	title	budget	start_date	end_date	project_duration
1	Project1	29498	2018-08-31	2019-03-13	0.532
2	Project2	32487	2018-01-27	2018-12-13	0.877
3	Project3	43909	2019-11-05	2019-12-09	0.093
4	Project4	15776	2018-06-28	2018-11-20	0.397
5	Project5	36268	2019-03-13	2020-01-02	0.808

Now we have a project duration as a decimal portion of a year. We can later use this value and the collective yearly salaries of all employees on the project to get the total project expenditure.

Calculate the Yearly Salary for Employees per Project

Feel free to comment out your existing code while we build our next query. Before we go any further, here’s an opportunity to interactively test out your project duration calculation from the previous section and the yearly salary per project query for this section:

SELECT *
FROM linkedin_emp_projects ep
JOIN linkedin_employees e ON ep.emp_id = e.id

Since we need to know the yearly salary for all employees on a project, our next step includes combining the linked_emp_projects showing which employees work on what projects with employee salary data from linkedin_employees to get salary per project. We see both tables have a column for employee ids, so it’s fairly obvious to JOIN them on these ids.

Keep in mind we need to sum the salaries per project, so we’ll also GROUP BY project id.

SELECT project_id,
       SUM(salary) AS project_yearly_salary
FROM linkedin_emp_projects ep
JOIN linkedin_employees e ON ep.emp_id = e.id
GROUP BY project_id

All required columns and the first 5 rows of the solution are shown

project_id	project_yearly_salary
42	59710
29	61732
4	77167
34	51416
41	35229

We now have a yearly salary by project, and it’s pretty clear our next step is to combine the project yearly salary dataset with the linkedin_projects dataset.

Joining All The Datasets

We’re going to uncomment our original code and JOIN the datasets on the project_id. Note we encase and give a name to our project yearly salary dataset in order to complete the join.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id

All required columns and the first 5 rows of the solution are shown

id	title	budget	start_date	end_date	project_duration	project_yearly_salary
1	Project1	29498	2018-08-31	2019-03-13	0.532	68283
2	Project2	32487	2018-01-27	2018-12-13	0.877	60304
3	Project3	43909	2019-11-05	2019-12-09	0.093	78363
4	Project4	15776	2018-06-28	2018-11-20	0.397	77167
5	Project5	36268	2019-03-13	2020-01-02	0.808	29748

This completes the joining of all our datasets. On one table, we have all the information we need for any given project to calculate if it’s over budget or not. From this point, we can discard some of the columns such as the id and dates to simplify our table.

SELECT title,
       budget,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary

Calculate Prorated Employee Expense

Before we wrap up the last two steps of this LinkedIn data scientist interview question, here’s another opportunity to practice this question interactively. See if you can finish the expense per project calculation and determine which projects are over budget:

SELECT id,
       budget,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id

Prorated employee expense is a fairly simple calculation only requiring us to multiply our project duration by our project yearly salary.

SELECT title,
       budget,
       (end_date - start_date)/365::float * project_yearly_salary AS prorated_employee_expense
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary

Our output isn’t rounded, so we apply a CEILING function to get a prorated_employee_expense in a format where we can better compare it to the budget. Remember a CEILING function will return the smallest integer value which is larger than or equal to the input, so it’s always rounding up any floating point not equal to a whole number.

SELECT title,
       budget,
       CEILING((end_date - start_date)/365::float* project_yearly_salary) AS prorated_employee_expense
FROM linkedin_projects

Our expenses are rounded up and look more comparable to the budget numbers. To finalize our answer, all we have to do is compare the budget to the prorated employee expense.

Filter for Projects Which Are Over Budget

We’re essentially comparing two integers at this point representing the project’s budget and the project’s prorated employee expense. Since the question asks us to only present projects which are over budget, we can use a WHERE clause to filter out any projects below budget. Keep in mind, we’ll have to repeat our prorated employee expense calculation in the WHERE clause due to SQL syntax limitations. Here’s what our complete solution looks like:

SELECT title,
       budget,
       CEILING((end_date - start_date)/365::float * project_yearly_salary) AS prorated_employee_expense
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id
WHERE ((end_date - start_date)/365::float*project_yearly_salary) > budget

All required columns and the first 5 rows of the solution are shown

title	budget	prorated_employee_expense
Project1	29498	36293
Project2	32487	52870
Project4	15776	30656
Project6	41611	63230
Project9	32341	44691

This new output shows we are missing several projects which correlate to ones which are under budget. While the result contains more than these five projects, you’ll see all the results have a prorated_employee_expense value higher than the project’s budget.

We have now solved for the correct answer. The solution ended up being a bit complex with multiple joins, arithmetic operations, and type manipulations, but it’s still flexible enough to handle various datasets which follow this schema. Additionally, it displays a variety of different SQL skills your interviewer may appreciate.

Conclusion

In this article, we tackled one of the medium-difficult LinkedIn data scientist interview questions with complex but robust code. This isn’t the only way to answer this question, so we recommend attempting to see if you can come up with other ways which are either more or less efficient.

On the StrataScratch platform, you can try answering other SQL interview questions to improve your skills and prepare for data science interviews. No matter how simple or complicated your solution is, make sure to post your ideas and answers to see what other users think. If you’re stuck on any problem or want a fresh perspective, you can also take a look at their solutions for some guidance!

LinkedIn Data Scientist Interview Questions

Categories

Data Scientist Position at LinkedIn

Concepts Tested in LinkedIn Data Scientist Interview Questions

LinkedIn Data Scientist Interview Question

Risky Projects

Framework to Solve the Problem

Understand Your Data

Solution:

Conclusion

Latest Posts:

From Arrays to Lists: Converting Numpy Arrays with Python

Tips for Crafting an Impressive Data Analyst Portfolio

PySpark GroupBy Guide: Super Simple Way to Group Data

Share

Follow

Categories