LinkedIn Data Scientist Interview Questions

LinkedIn Data Scientist Interview Questions


In this article, we’ll work through one of the common LinkedIn Data Scientist Interview questions requiring a wide range of SQL skills.

The LinkedIn data scientist interview question we’re solving has a variety of ways to reach a solution, but we’re going to simplify the problem-solving process by working through it in small steps. Keep reading to see how we leverage a framework you could apply to any data scientist interview questions and how we break down this medium-level LinkedIn data scientist interview question into more manageable chunks.

LinkedIn started in 2003 with the goal of connecting the world’s professionals and experts to enhance their career success. With over 750 million members worldwide, LinkedIn has more professionals than any other social network ranging from those who just started their career to executives from all the Fortune 500 companies. Operationally, LinkedIn is a global company with a wide variety of revenue models involving marketing, talent, sales, and membership solutions.

Data Scientist Position at LinkedIn

LinkedIn Data Scientist Positions typically work for the Data Science team which belongs to the larger Data engineering division. As a member of the Data Science team, you will leverage data from over 700 million members to provide insights to product, sales, marketing, economics, infrastructure, and other teams. These data insights will drive strategy, apply statistical inference and algorithms for optimization, and build engineered solutions to scale LinkedIn’s ongoing business objectives and long-term company vision.

Data Scientist Positions at LinkedIn require skills with SQL, structured and unstructured data, distributed data systems, programming languages such as R or Python, data workflows, and more. For additional information on the Data Science team at LinkedIn, here's their official webpage.

Concepts Tested in LinkedIn Data Scientist Interview Questions

LinkedIn Data Scientist Interview Questions

The primary SQL skills you’d show in a LinkedIn Data Scientist Interview challenge include:

  • Use of basic arithmetic operations on columns of data
  • Using subqueries
  • Use of JOIN to combine tables of data
  • How to use the CEILING function
  • How to cast integer data into floating point data
  • Using WHERE to filter out results

The problem we’re looking at next requires knowledge of several of these concepts, and you’ll be able to apply them in your own code on the StrataScratch platform to better understand them and become interview-ready.

LinkedIn Data Scientist Interview Question

Risky Projects

This question was heard in an interview for a data scientist position at LinkedIn. It’s titled “Risky Projects”, and the key challenge is finding which LinkedIn projects are over budget using data from disparate sets.


Tables: linkedin_projects, linkedin_emp_projects, linkedin_employees

Link to the question: https://platform.stratascratch.com/coding/10304-risky-projects

Ultimately, the question is asking us to use employee salaries, project length, and project budget to determine which projects exceed expectations.

This LinkedIn data scientist interview question is really more along the lines of medium-difficult than medium since, as we will see, the answer requires subqueries, several column calculations, and joining multiple data tables. While there are a number of ways to solve this question, we’re going to focus on a flexible solution that shows our knowledge of several important SQL concepts.

Framework to Solve the Problem

One of the most important principles when answering data science problems is to start with a framework. We’ll outline a three-step framework which builds a logical process for us to approach and solve this problem through code. We’re adapting this framework to the question at hand, but, with a few simple adjustments, you can frame any other question. Here are our steps:

1. Understand your data:

  1. First look at all the columns the question gives you and make some assumptions about the data. If you’re given multiple tables, take note of which data each of the tables are giving you and how you might need it for your answer.
  2. If you fail to understand the schema, see if you can examine some example data. Look at the first few rows and attempt to match why values correlate to their respective columns. Don’t hesitate to ask the LinkedIn interviewer for example values if you don’t receive any initially. Seeing sample data can help you establish limits to your solution or determine if you need to broaden it for edge cases.

2. Formulate your approach:

  1. Begin writing down logical programming or code steps. It doesn’t need to be in order first. In fact, the solution we provide here won’t be in order either as we query the project dates before we grab salary in this problem.
  2. It’s helpful to identify the main functions you have to use to perform your calculations. Attempt to translate the problem statement and data into a series of SQL functions.
  3. Try to not be silent when figuring out the solution. Your LinkedIn interviewer will want to understand how you approach problems. You can ask them for clarification and can expect them to specify if you have to write everything from scratch.

3. Code Execution:

  1. It’s important to build code in such a way as to avoid presenting an oversimplified or overcomplicated solution. Take advantage of blank space between subqueries to keep your solution neat and organized. There’s never any harm in commenting out a code block you want to return to later.
  2. Follow the steps you outlined in the beginning. This will help you avoid confusion and ensure you’re answering all of the questions.
  3. Most importantly, think through your functions and operations. Using these well can produce efficient and generic solutions.
  4. Speak through your code with the LinkedIn interviewer. They’re going to be evaluating your problem-solving abilities as you begin writing.

Understand Your Data

The first step in our framework is to examine our data. Typically, you won’t have access to actual data nor will you be able to execute code during interviews like those at LinkedIn. Usually, you’ll have to understand the data and instead make assumptions based on the schema and information the interviewer gives you.

In the case of this LinkedIn data scientist interview question, the question provides us with the schema for three tables, so we need to start by looking at what’s in each table.

Table: linkedin_projects
idtitlebudgetstart_dateend_date
1Project1294982018-08-312019-03-13
2Project2324872018-01-272018-12-13
3Project3439092019-11-052019-12-09
4Project4157762018-06-282018-11-20
5Project5362682019-03-132020-01-02

This linkedin_projects table tells us how long a project has been running, its budget, and its id.

Table: linkedin_emp_projects
emp_idproject_id
105921
105932
105943
105954
105965

The linkedin_emp_projects table maps out which employees work on which projects.

Table: linkedin_employees
idfirst_namelast_namesalary
10592JenniferRoberts20204
10593HaleyHo33154
10594EricMccarthy32360
10595GinaMartinez46388
10596JasonFields12348

Linkedin_employees tells us the salary of each employee and their id. Since we need to know the salary, employee and project ids, project length, and project budget to calculate which projects are over budget, we’ll have to join data from all three tables using the ids. We can also at this point conclude we’ll be able to throw out some irrelevant columns such as project titles and employee first and last names.

Solution:

Formulate Approach

According to our framework, we’ll want to outline some general steps to translate into code. We’re going to keep it high-level and get more specific later on, but outlining these steps initially will make writing the code easier. Here are the general steps:

  1. Query linkedin_projects and use subtraction and floating point type casting to get total project durations in years
  2. Put our first query to the side while we JOIN our linkedin_emp_projects and linkedin_employees tables on the employee id to get a yearly salary for each project id
  3. JOIN our initial query to the project yearly salary on the project id
  4. Calculate the prorated employee expense by multiplying project yearly salary by project duration and round out our output using the CEILING function
  5. Use a WHERE clause to filter for projects over budget

Calculate Project Durations

Let’s translate the general steps we laid out into functional SQL code. For the first step, we start by querying the linkedin_projects table and subtracting the project end_date from the project start_date. We don’t have to worry about odd differences here since end_date will always be after start_date. Since our salary data is in years, we must also convert our project length into a fraction of years and will need to cast it as a floating point decimal to avoid results rounding down to 0.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       end_date - start_date AS project_duration
FROM linkedin_projects

All required columns and the first 5 rows of the solution are shown

idtitlebudgetstart_dateend_dateproject_duration
1Project1294982018-08-312019-03-13194
2Project2324872018-01-272018-12-13320
3Project3439092019-11-052019-12-0934
4Project4157762018-06-282018-11-20145
5Project5362682019-03-132020-01-02295

We have project duration in days, so we’ll want to divide it by 365 to get it in years to compare to the yearly salaries:

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365 AS project_duration
FROM linkedin_projects

All required columns and the first 5 rows of the solution are shown

idtitlebudgetstart_dateend_dateproject_duration
1Project1294982018-08-312019-03-130
2Project2324872018-01-272018-12-130
3Project3439092019-11-052019-12-090
4Project4157762018-06-282018-11-200
5Project5362682019-03-132020-01-020

Because we divided integer by integer, our calculation rounds down to 0 since all projects are less than year. Let’s cast the 365 denominator as a float to get the decimal value of project duration in years.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365::float AS project_duration
FROM linkedin_projects

All required columns and the first 5 rows of the solution are shown

idtitlebudgetstart_dateend_dateproject_duration
1Project1294982018-08-312019-03-130.532
2Project2324872018-01-272018-12-130.877
3Project3439092019-11-052019-12-090.093
4Project4157762018-06-282018-11-200.397
5Project5362682019-03-132020-01-020.808

Now we have a project duration as a decimal portion of a year. We can later use this value and the collective yearly salaries of all employees on the project to get the total project expenditure.

Calculate the Yearly Salary for Employees per Project

Feel free to comment out your existing code while we build our next query. Before we go any further, here’s an opportunity to interactively test out your project duration calculation from the previous section and the yearly salary per project query for this section:

SELECT *
FROM linkedin_emp_projects ep
JOIN linkedin_employees e ON ep.emp_id = e.id

Since we need to know the yearly salary for all employees on a project, our next step includes combining the linked_emp_projects showing which employees work on what projects with employee salary data from linkedin_employees to get salary per project. We see both tables have a column for employee ids, so it’s fairly obvious to JOIN them on these ids.

Keep in mind we need to sum the salaries per project, so we’ll also GROUP BY project id.

SELECT project_id,
       SUM(salary) AS project_yearly_salary
FROM linkedin_emp_projects ep
JOIN linkedin_employees e ON ep.emp_id = e.id
GROUP BY project_id

All required columns and the first 5 rows of the solution are shown

project_idproject_yearly_salary
4259710
2961732
477167
3451416
4135229

We now have a yearly salary by project, and it’s pretty clear our next step is to combine the project yearly salary dataset with the linkedin_projects dataset.

Joining All The Datasets

We’re going to uncomment our original code and JOIN the datasets on the project_id. Note we encase and give a name to our project yearly salary dataset in order to complete the join.

SELECT id,
       title,
       budget,
       start_date,
       end_date,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id

All required columns and the first 5 rows of the solution are shown

idtitlebudgetstart_dateend_dateproject_durationproject_yearly_salary
1Project1294982018-08-312019-03-130.53268283
2Project2324872018-01-272018-12-130.87760304
3Project3439092019-11-052019-12-090.09378363
4Project4157762018-06-282018-11-200.39777167
5Project5362682019-03-132020-01-020.80829748

This completes the joining of all our datasets. On one table, we have all the information we need for any given project to calculate if it’s over budget or not. From this point, we can discard some of the columns such as the id and dates to simplify our table.

SELECT title,
       budget,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary

Calculate Prorated Employee Expense

Before we wrap up the last two steps of this LinkedIn data scientist interview question, here’s another opportunity to practice this question interactively. See if you can finish the expense per project calculation and determine which projects are over budget:

SELECT id,
       budget,
       (end_date - start_date)/365::float AS project_duration,
       project_yearly_salary
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id

Prorated employee expense is a fairly simple calculation only requiring us to multiply our project duration by our project yearly salary.

SELECT title,
       budget,
       (end_date - start_date)/365::float * project_yearly_salary AS prorated_employee_expense
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary

Our output isn’t rounded, so we apply a CEILING function to get a prorated_employee_expense in a format where we can better compare it to the budget. Remember a CEILING function will return the smallest integer value which is larger than or equal to the input, so it’s always rounding up any floating point not equal to a whole number.

SELECT title,
       budget,
       CEILING((end_date - start_date)/365::float* project_yearly_salary) AS prorated_employee_expense
FROM linkedin_projects

Our expenses are rounded up and look more comparable to the budget numbers. To finalize our answer, all we have to do is compare the budget to the prorated employee expense.

Filter for Projects Which Are Over Budget

We’re essentially comparing two integers at this point representing the project’s budget and the project’s prorated employee expense. Since the question asks us to only present projects which are over budget, we can use a WHERE clause to filter out any projects below budget. Keep in mind, we’ll have to repeat our prorated employee expense calculation in the WHERE clause due to SQL syntax limitations. Here’s what our complete solution looks like:

SELECT title,
       budget,
       CEILING((end_date - start_date)/365::float * project_yearly_salary) AS prorated_employee_expense
FROM linkedin_projects
JOIN
  (SELECT project_id,
          SUM(salary) AS project_yearly_salary
   FROM linkedin_emp_projects ep
   JOIN linkedin_employees e ON ep.emp_id = e.id
   GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id
WHERE ((end_date - start_date)/365::float*project_yearly_salary) > budget

All required columns and the first 5 rows of the solution are shown

titlebudgetprorated_employee_expense
Project12949836293
Project23248752870
Project41577630656
Project64161163230
Project93234144691

This new output shows we are missing several projects which correlate to ones which are under budget. While the result contains more than these five projects, you’ll see all the results have a prorated_employee_expense value higher than the project’s budget.

We have now solved for the correct answer. The solution ended up being a bit complex with multiple joins, arithmetic operations, and type manipulations, but it’s still flexible enough to handle various datasets which follow this schema. Additionally, it displays a variety of different SQL skills your interviewer may appreciate.

Conclusion

In this article, we tackled one of the medium-difficult LinkedIn data scientist interview questions with complex but robust code. This isn’t the only way to answer this question, so we recommend attempting to see if you can come up with other ways which are either more or less efficient.

On the StrataScratch platform, you can try answering other SQL interview questions to improve your skills and prepare for data science interviews. No matter how simple or complicated your solution is, make sure to post your ideas and answers to see what other users think. If you’re stuck on any problem or want a fresh perspective, you can also take a look at their solutions for some guidance!

LinkedIn Data Scientist Interview Questions


Become a data expert. Subscribe to our newsletter.