LinkedIn Data Scientist Interview Questions
In this article, we’ll work through one of the common LinkedIn Data Scientist Interview questions requiring a wide range of SQL skills.
The LinkedIn data scientist interview question we’re solving has a variety of ways to reach a solution, but we’re going to simplify the problem-solving process by working through it in small steps. Keep reading to see how we leverage a framework you could apply to any data scientist interview questions and how we break down this medium-level LinkedIn data scientist interview question into more manageable chunks.
LinkedIn started in 2003 with the goal of connecting the world’s professionals and experts to enhance their career success. With over 750 million members worldwide, LinkedIn has more professionals than any other social network ranging from those who just started their career to executives from all the Fortune 500 companies. Operationally, LinkedIn is a global company with a wide variety of revenue models involving marketing, talent, sales, and membership solutions.
Data Scientist Position at LinkedIn
LinkedIn Data Scientist Positions typically work for the Data Science team which belongs to the larger Data engineering division. As a member of the Data Science team, you will leverage data from over 700 million members to provide insights to product, sales, marketing, economics, infrastructure, and other teams. These data insights will drive strategy, apply statistical inference and algorithms for optimization, and build engineered solutions to scale LinkedIn’s ongoing business objectives and long-term company vision.
Data Scientist Positions at LinkedIn require skills with SQL, structured and unstructured data, distributed data systems, programming languages such as R or Python, data workflows, and more. For additional information on the Data Science team at LinkedIn, here's their official webpage.
Concepts Tested in LinkedIn Data Scientist Interview Questions
The primary SQL skills you’d show in a LinkedIn Data Scientist Interview challenge include:
- Use of basic arithmetic operations on columns of data
- Using subqueries
- Use of JOIN to combine tables of data
- How to use the CEILING function
- How to cast integer data into floating point data
- Using WHERE to filter out results
The problem we’re looking at next requires knowledge of several of these concepts, and you’ll be able to apply them in your own code on the StrataScratch platform to better understand them and become interview-ready.
LinkedIn Data Scientist Interview Question
This question was heard in an interview for a data scientist position at LinkedIn. It’s titled “Risky Projects”, and the key challenge is finding which LinkedIn projects are over budget using data from disparate sets.
Interview Question Date: November 2020
Identify projects that are at risk for going overbudget. A project is considered to be overbudget if the cost of all employees assigned to the project is greater than the budget of the project.
You'll need to prorate the cost of the employees to the duration of the project. For example, if the budget for a project that takes half a year to complete is $10K, then the total half-year salary of all employees assigned to the project should not exceed $10K. Salary is defined on a yearly basis, so be careful how to calculate salaries for the projects that last less or more than one year.
Output a list of projects that are overbudget with their project name, project budget, and prorated total employee expense (rounded to the next dollar amount).
HINT: to make it simpler, consider that all years have 365 days. You don't need to think about the leap years.
Link to the question: https://platform.stratascratch.com/coding/10304-risky-projects
Ultimately, the question is asking us to use employee salaries, project length, and project budget to determine which projects exceed expectations.
This LinkedIn data scientist interview question is really more along the lines of medium-difficult than medium since, as we will see, the answer requires subqueries, several column calculations, and joining multiple data tables. While there are a number of ways to solve this question, we’re going to focus on a flexible solution that shows our knowledge of several important SQL concepts.
Framework to Solve the Problem
One of the most important principles when answering data science problems is to start with a framework. We’ll outline a three-step framework which builds a logical process for us to approach and solve this problem through code. We’re adapting this framework to the question at hand, but, with a few simple adjustments, you can frame any other question. Here are our steps:
1. Understand your data:
- First look at all the columns the question gives you and make some assumptions about the data. If you’re given multiple tables, take note of which data each of the tables are giving you and how you might need it for your answer.
- If you fail to understand the schema, see if you can examine some example data. Look at the first few rows and attempt to match why values correlate to their respective columns. Don’t hesitate to ask the LinkedIn interviewer for example values if you don’t receive any initially. Seeing sample data can help you establish limits to your solution or determine if you need to broaden it for edge cases.
2. Formulate your approach:
- Begin writing down logical programming or code steps. It doesn’t need to be in order first. In fact, the solution we provide here won’t be in order either as we query the project dates before we grab salary in this problem.
- It’s helpful to identify the main functions you have to use to perform your calculations. Attempt to translate the problem statement and data into a series of SQL functions.
- Try to not be silent when figuring out the solution. Your LinkedIn interviewer will want to understand how you approach problems. You can ask them for clarification and can expect them to specify if you have to write everything from scratch.
3. Code Execution:
- It’s important to build code in such a way as to avoid presenting an oversimplified or overcomplicated solution. Take advantage of blank space between subqueries to keep your solution neat and organized. There’s never any harm in commenting out a code block you want to return to later.
- Follow the steps you outlined in the beginning. This will help you avoid confusion and ensure you’re answering all of the questions.
- Most importantly, think through your functions and operations. Using these well can produce efficient and generic solutions.
- Speak through your code with the LinkedIn interviewer. They’re going to be evaluating your problem-solving abilities as you begin writing.
Understand Your Data
The first step in our framework is to examine our data. Typically, you won’t have access to actual data nor will you be able to execute code during interviews like those at LinkedIn. Usually, you’ll have to understand the data and instead make assumptions based on the schema and information the interviewer gives you.
In the case of this LinkedIn data scientist interview question, the question provides us with the schema for three tables, so we need to start by looking at what’s in each table.
This linkedin_projects table tells us how long a project has been running, its budget, and its id.
The linkedin_emp_projects table maps out which employees work on which projects.
Linkedin_employees tells us the salary of each employee and their id. Since we need to know the salary, employee and project ids, project length, and project budget to calculate which projects are over budget, we’ll have to join data from all three tables using the ids. We can also at this point conclude we’ll be able to throw out some irrelevant columns such as project titles and employee first and last names.
According to our framework, we’ll want to outline some general steps to translate into code. We’re going to keep it high-level and get more specific later on, but outlining these steps initially will make writing the code easier. Here are the general steps:
- Query linkedin_projects and use subtraction and floating point type casting to get total project durations in years
- Put our first query to the side while we JOIN our linkedin_emp_projects and linkedin_employees tables on the employee id to get a yearly salary for each project id
- JOIN our initial query to the project yearly salary on the project id
- Calculate the prorated employee expense by multiplying project yearly salary by project duration and round out our output using the CEILING function
- Use a WHERE clause to filter for projects over budget
Calculate Project Durations
Let’s translate the general steps we laid out into functional SQL code. For the first step, we start by querying the linkedin_projects table and subtracting the project end_date from the project start_date. We don’t have to worry about odd differences here since end_date will always be after start_date. Since our salary data is in years, we must also convert our project length into a fraction of years and will need to cast it as a floating point decimal to avoid results rounding down to 0.
SELECT id, title, budget, start_date, end_date, end_date - start_date AS project_duration FROM linkedin_projects
We have project duration in days, so we’ll want to divide it by 365 to get it in years to compare to the yearly salaries:
SELECT id, title, budget, start_date, end_date, (end_date - start_date)/365 AS project_duration FROM linkedin_projects
Because we divided integer by integer, our calculation rounds down to 0 since all projects are less than year. Let’s cast the 365 denominator as a float to get the decimal value of project duration in years.
SELECT id, title, budget, start_date, end_date, (end_date - start_date)/365::float AS project_duration FROM linkedin_projects
Now we have a project duration as a decimal portion of a year. We can later use this value and the collective yearly salaries of all employees on the project to get the total project expenditure.
Calculate the Yearly Salary for Employees per Project
Feel free to comment out your existing code while we build our next query. Before we go any further, here’s an opportunity to interactively test out your project duration calculation from the previous section and the yearly salary per project query for this section:
SELECT * FROM linkedin_emp_projects ep JOIN linkedin_employees e ON ep.emp_id = e.id
Since we need to know the yearly salary for all employees on a project, our next step includes combining the linked_emp_projects showing which employees work on what projects with employee salary data from linkedin_employees to get salary per project. We see both tables have a column for employee ids, so it’s fairly obvious to JOIN them on these ids.
Keep in mind we need to sum the salaries per project, so we’ll also GROUP BY project id.
SELECT project_id, SUM(salary) AS project_yearly_salary FROM linkedin_emp_projects ep JOIN linkedin_employees e ON ep.emp_id = e.id GROUP BY project_id
We now have a yearly salary by project, and it’s pretty clear our next step is to combine the project yearly salary dataset with the linkedin_projects dataset.
Joining All The Datasets
We’re going to uncomment our original code and JOIN the datasets on the project_id. Note we encase and give a name to our project yearly salary dataset in order to complete the join.
SELECT id, title, budget, start_date, end_date, (end_date - start_date)/365::float AS project_duration, project_yearly_salary FROM linkedin_projects JOIN (SELECT project_id, SUM(salary) AS project_yearly_salary FROM linkedin_emp_projects ep JOIN linkedin_employees e ON ep.emp_id = e.id GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id
This completes the joining of all our datasets. On one table, we have all the information we need for any given project to calculate if it’s over budget or not. From this point, we can discard some of the columns such as the id and dates to simplify our table.
SELECT title, budget, (end_date - start_date)/365::float AS project_duration, project_yearly_salary
Calculate Prorated Employee Expense
Before we wrap up the last two steps of this LinkedIn data scientist interview question, here’s another opportunity to practice this question interactively. See if you can finish the expense per project calculation and determine which projects are over budget:
SELECT id, budget, (end_date - start_date)/365::float AS project_duration, project_yearly_salary FROM linkedin_projects JOIN (SELECT project_id, SUM(salary) AS project_yearly_salary FROM linkedin_emp_projects ep JOIN linkedin_employees e ON ep.emp_id = e.id GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id
Prorated employee expense is a fairly simple calculation only requiring us to multiply our project duration by our project yearly salary.
SELECT title, budget, (end_date - start_date)/365::float * project_yearly_salary AS prorated_employee_expense FROM linkedin_projects JOIN (SELECT project_id, SUM(salary) AS project_yearly_salary FROM linkedin_emp_projects ep JOIN linkedin_employees e ON ep.emp_id = e.id GROUP BY project_id) yearly_salary
Our output isn’t rounded, so we apply a CEILING function to get a prorated_employee_expense in a format where we can better compare it to the budget. Remember a CEILING function will return the smallest integer value which is larger than or equal to the input, so it’s always rounding up any floating point not equal to a whole number.
SELECT title, budget, CEILING((end_date - start_date)/365::float* project_yearly_salary) AS prorated_employee_expense FROM linkedin_projects
Our expenses are rounded up and look more comparable to the budget numbers. To finalize our answer, all we have to do is compare the budget to the prorated employee expense.
Filter for Projects Which Are Over Budget
We’re essentially comparing two integers at this point representing the project’s budget and the project’s prorated employee expense. Since the question asks us to only present projects which are over budget, we can use a WHERE clause to filter out any projects below budget. Keep in mind, we ’ll have to repeat our prorated employee expense calculation in the WHERE clause due to SQL syntax limitations. Here’s what our complete solution looks like:
SELECT title, budget, CEILING((end_date - start_date)/365::float * project_yearly_salary) AS prorated_employee_expense FROM linkedin_projects JOIN (SELECT project_id, SUM(salary) AS project_yearly_salary FROM linkedin_emp_projects ep JOIN linkedin_employees e ON ep.emp_id = e.id GROUP BY project_id) yearly_salary ON linkedin_projects.id = yearly_salary.project_id WHERE ((end_date - start_date)/365::float*project_yearly_salary) > budget
This new output shows we are missing several projects which correlate to ones which are under budget. While the result contains more than these five projects, you’ll see all the results have a prorated_employee_expense value higher than the project’s budget.
We have now solved for the correct answer. The solution ended up being a bit complex with multiple joins, arithmetic operations, and type manipulations, but it’s still flexible enough to handle various datasets which follow this schema. Additionally, it displays a variety of different SQL skills your interviewer may appreciate.
In this article, we tackled one of the medium-difficult LinkedIn data scientist interview questions with complex but robust code. This isn’t the only way to answer this question, so we recommend attempting to see if you can come up with other ways which are either more or less efficient.
On the StrataScratch platform, you can try answering other SQL interview questions to improve your skills and prepare for data science interviews. No matter how simple or complicated your solution is, make sure to post your ideas and answers to see what other users think. If you’re stuck on any problem or want a fresh perspective, you can also take a look at their solutions for some guidance!