LinkedIn Data Scientist Interview Questions

Published: October 30, 2023

Categories:

Written by:
Nathan Rosidi

In this article, we’ll work through one of the common LinkedIn Data Scientist Interview questions requiring a wide range of SQL skills.

The LinkedIn data scientist interview question we’re solving has a variety of ways to reach a solution, but we’re going to simplify the problem-solving process by working through it in small steps. Keep reading to see how we leverage a framework you could apply to any data scientist interview questions and how we break down this medium-level LinkedIn data scientist interview question into more manageable chunks.

LinkedIn started in 2003 with the goal of connecting the world’s professionals and experts to enhance their career success. With over 750 million members worldwide, LinkedIn has more professionals than any other social network ranging from those who just started their career to executives from all the Fortune 500 companies. Operationally, LinkedIn is a global company with a wide variety of revenue models involving marketing, talent, sales, and membership solutions.

Data Scientist Position at LinkedIn

LinkedIn Data Scientist Positions typically work for the Data Science team which belongs to the larger Data engineering division. As a member of the Data Science team, you will leverage data from over 700 million members to provide insights to product, sales, marketing, economics, infrastructure, and other teams. These data insights will drive strategy, apply statistical inference and algorithms for optimization, and build engineered solutions to scale LinkedIn’s ongoing business objectives and long-term company vision.

Data Scientist Positions at LinkedIn require skills with SQL, structured and unstructured data, distributed data systems, programming languages such as R or Python, data workflows, and more. For additional information on the Data Science team at LinkedIn, here's their official webpage.

Concepts Tested in LinkedIn Data Scientist Interview Questions

The primary SQL skills you’d show in a LinkedIn Data Scientist Interview challenge include:

Use of basic arithmetic operations on columns of data
Using subqueries
Use of JOIN to combine tables of data
How to use the CEILING function
How to cast integer data into floating point data
Using WHERE to filter out results

The problem we’re looking at next requires knowledge of several of these concepts, and you’ll be able to apply them in your own code on the StrataScratch platform to better understand them and become interview-ready.

LinkedIn Data Scientist Interview Question

Risky Projects

This question was heard in an interview for a data scientist position at LinkedIn. It’s titled “Risky Projects”, and the key challenge is finding which LinkedIn projects are over budget using data from disparate sets.

Risky Projects

Last Updated: November 2020

MediumID 10304

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

You are given a set of projects and employee data. Each project has a name, a budget, and a specific duration, while each employee has an annual salary and may be assigned to one or more projects for particular periods. The task is to identify which projects are overbudget. A project is considered overbudget if the prorated cost of all employees assigned to it exceeds the project’s budget.

To solve this, you must prorate each employee's annual salary based on the exact period they work on a given project, relative to a full year. For example, if an employee works on a six-month project, only half of their annual salary should be attributed to that project. Sum these prorated salary amounts for all employees assigned to a project and compare the total with the project’s budget.

Your output should be a list of overbudget projects, where each entry includes the project’s name, its budget, and the total prorated employee expenses for that project. The total expenses should be rounded up to the nearest dollar. Assume all years have 365 days and disregard leap years.

Tables: linkedin_projects, linkedin_emp_projects, linkedin_employees

Link to the question: https://platform.stratascratch.com/coding/10304-risky-projects

Ultimately, the question is asking us to use employee salaries, project length, and project budget to determine which projects exceed expectations.

This LinkedIn data scientist interview question is really more along the lines of medium-difficult than medium since, as we will see, the answer requires subqueries, several column calculations, and joining multiple data tables. While there are a number of ways to solve this question, we’re going to focus on a flexible solution that shows our knowledge of several important SQL concepts.

Framework to Solve the Problem

One of the most important principles when answering data science problems is to start with a framework. We’ll outline a three-step framework which builds a logical process for us to approach and solve this problem through code. We’re adapting this framework to the question at hand, but, with a few simple adjustments, you can frame any other question. Here are our steps:

1. Understand your data:

First look at all the columns the question gives you and make some assumptions about the data. If you’re given multiple tables, take note of which data each of the tables are giving you and how you might need it for your answer.
If you fail to understand the schema, see if you can examine some example data. Look at the first few rows and attempt to match why values correlate to their respective columns. Don’t hesitate to ask the LinkedIn interviewer for example values if you don’t receive any initially. Seeing sample data can help you establish limits to your solution or determine if you need to broaden it for edge cases.

2. Formulate your approach:

Begin writing down logical programming or code steps. It doesn’t need to be in order first. In fact, the solution we provide here won’t be in order either as we query the project dates before we grab salary in this problem.
It’s helpful to identify the main functions you have to use to perform your calculations. Attempt to translate the problem statement and data into a series of SQL functions.
Try to not be silent when figuring out the solution. Your LinkedIn interviewer will want to understand how you approach problems. You can ask them for clarification and can expect them to specify if you have to write everything from scratch.

3. Code Execution:

It’s important to build code in such a way as to avoid presenting an oversimplified or overcomplicated solution. Take advantage of blank space between subqueries to keep your solution neat and organized. There’s never any harm in commenting out a code block you want to return to later.
Follow the steps you outlined in the beginning. This will help you avoid confusion and ensure you’re answering all of the questions.
Most importantly, think through your functions and operations. Using these well can produce efficient and generic solutions.
Speak through your code with the LinkedIn interviewer. They’re going to be evaluating your problem-solving abilities as you begin writing.

Understand Your Data

The first step in our framework is to examine our data. Typically, you won’t have access to actual data nor will you be able to execute code during interviews like those at LinkedIn. Usually, you’ll have to understand the data and instead make assumptions based on the schema and information the interviewer gives you.

In the case of this LinkedIn data scientist interview question, the question provides us with the schema for three tables, so we need to start by looking at what’s in each table.

Table: linkedin_projects

This linkedin_projects table tells us how long a project has been running, its budget, and its id.

Table: linkedin_emp_projects

The linkedin_emp_projects table maps out which employees work on which projects.

Table: linkedin_employees

Linkedin_employees tells us the salary of each employee and their id. Since we need to know the salary, employee and project ids, project length, and project budget to calculate which projects are over budget, we’ll have to join data from all three tables using the ids. We can also at this point conclude we’ll be able to throw out some irrelevant columns such as project titles and employee first and last names.

Solution:

Formulate Approach

According to our framework, we’ll want to outline some general steps to translate into code. We’re going to keep it high-level and get more specific later on, but outlining these steps initially will make writing the code easier. Here are the general steps:

Query linkedin_projects and use subtraction and floating point type casting to get total project durations in years
Put our first query to the side while we JOIN our linkedin_emp_projects and linkedin_employees tables on the employee id to get a yearly salary for each project id
JOIN our initial query to the project yearly salary on the project id
Calculate the prorated employee expense by multiplying project yearly salary by project duration and round out our output using the CEILING function
Use a WHERE clause to filter for projects over budget

Calculate Project Durations

Let’s translate the general steps we laid out into functional SQL code. For the first step, we start by querying the linkedin_projects table and subtracting the project end_date from the project start_date. We don’t have to worry about odd differences here since end_date will always be after start_date. Since our salary data is in years, we must also convert our project length into a fraction of years and will need to cast it as a floating point decimal to avoid results rounding down to 0.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       end_date - start_date AS project_duration
FROM linkedin_projects

We have project duration in days, so we’ll want to divide it by 365 to get it in years to compare to the yearly salaries:

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365 AS project_duration
FROM linkedin_projects

Because we divided integer by integer, our calculation rounds down to 0 since all projects are less than year. Let’s cast the 365 denominator as a float to get the decimal value of project duration in years.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365::float AS project_duration
FROM linkedin_projects

Now we have a project duration as a decimal portion of a year. We can later use this value and the collective yearly salaries of all employees on the project to get the total project expenditure.

Calculate the Yearly Salary for Employees per Project

Feel free to comment out your existing code while we build our next query. Before we go any further, here’s an opportunity to interactively test out your project duration calculation from the previous section and the yearly salary per project query for this section:

SELECT *
FROM linkedin_emp_projects ep
JOIN linkedin_employees e ON ep.emp_id = e.id

Since we need to know the yearly salary for all employees on a project, our next step includes combining the linked_emp_projects showing which employees work on what projects with employee salary data from linkedin_employees to get salary per project. We see both tables have a column for employee ids, so it’s fairly obvious to JOIN them on these ids.

Keep in mind we need to sum the salaries per project, so we’ll also GROUP BY project id.

SELECT project_id,
       SUM(salary) AS project_yearly_salary
FROM linkedin_emp_projects ep
JOIN linkedin_employees e ON ep.emp_id = e.id
GROUP BY project_id

We now have a yearly salary by project, and it’s pretty clear our next step is to combine the project yearly salary dataset with the linkedin_projects dataset.

Joining All The Datasets

We’re going to uncomment our original code and JOIN the datasets on the project_id. Note we encase and give a name to our project yearly salary dataset in order to complete the join.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id

This completes the joining of all our datasets. On one table, we have all the information we need for any given project to calculate if it’s over budget or not. From this point, we can discard some of the columns such as the id and dates to simplify our table.

SELECT title,
       budget,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary

Calculate Prorated Employee Expense

Before we wrap up the last two steps of this LinkedIn data scientist interview question, here’s another opportunity to practice this question interactively. See if you can finish the expense per project calculation and determine which projects are over budget:

SELECT id,
       budget,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id

Prorated employee expense is a fairly simple calculation only requiring us to multiply our project duration by our project yearly salary.

SELECT title,
       budget,
       (end_date - start_date)/365::float * project_yearly_salary AS prorated_employee_expense
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary

Our output isn’t rounded, so we apply a CEILING function to get a prorated_employee_expense in a format where we can better compare it to the budget. Remember a CEILING function will return the smallest integer value which is larger than or equal to the input, so it’s always rounding up any floating point not equal to a whole number.

SELECT title,
       budget,
       CEILING((end_date - start_date)/365::float* project_yearly_salary) AS prorated_employee_expense
FROM linkedin_projects

Our expenses are rounded up and look more comparable to the budget numbers. To finalize our answer, all we have to do is compare the budget to the prorated employee expense.

Filter for Projects Which Are Over Budget

We’re essentially comparing two integers at this point representing the project’s budget and the project’s prorated employee expense. Since the question asks us to only present projects which are over budget, we can use a WHERE clause to filter out any projects below budget. Keep in mind, we’ll have to repeat our prorated employee expense calculation in the WHERE clause due to SQL syntax limitations. Here’s what our complete solution looks like:

SELECT title,
       budget,
       CEILING((end_date - start_date)/365::float * project_yearly_salary) AS prorated_employee_expense
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id
WHERE ((end_date - start_date)/365::float*project_yearly_salary) > budget

This new output shows we are missing several projects which correlate to ones which are under budget. While the result contains more than these five projects, you’ll see all the results have a prorated_employee_expense value higher than the project’s budget.

We have now solved for the correct answer. The solution ended up being a bit complex with multiple joins, arithmetic operations, and type manipulations, but it’s still flexible enough to handle various datasets which follow this schema. Additionally, it displays a variety of different SQL skills your interviewer may appreciate.

Conclusion

In this article, we tackled one of the medium-difficult LinkedIn data scientist interview questions with complex but robust code. This isn’t the only way to answer this question, so we recommend attempting to see if you can come up with other ways which are either more or less efficient.

On the StrataScratch platform, you can try answering other SQL interview questions to improve your skills and prepare for data science interviews. No matter how simple or complicated your solution is, make sure to post your ideas and answers to see what other users think. If you’re stuck on any problem or want a fresh perspective, you can also take a look at their solutions for some guidance!