Microsoft Data Scientist SQL Interview Questions
We will review a sample interview question to help prepare you in applying for a role as a Microsoft data scientist, selected from recent data analyst interviews.
Data analyst roles at Microsoft can expect to design and build data models to provide meaningful data and models for their team. Below is an example of a real-world scenario that you can expect to be presented with in an interview at Microsoft. Below we will show you how to break down the problem presented, step through the logic used to solve it, and discuss ways the resolution could be optimized for better performance.
Let's review one of the interesting data scientist interview questions that we took from Microsoft.
Concepts Tested in this Microsoft Data Scientist SQL Interview Question
The main concepts that are tested in the following Microsoft data scientist interview question include:
- Aggregate Functions - Sum/Count/Group By
Microsoft Data Scientist SQL Interview Question
Employee Project Budgets
Find the top five most expensive projects by the amount of budget allocated to each employee on the project. Exclude projects with 0 employees. Assume each employee works on only one project. The output should be the project title and the budget that’s allocated to each employee (i.e., a budget-to-employee ratio). Display the top 5 projects with the highest budget-to-employee ratio first.
**Please note that this question has since been updated on the StrataScratch platform, available here. In this article and video, we discuss a different version of the question.
Taking clues from the question above, the following initial assumptions are made:
- One project can have multiple employees, but each employee only works on one project
- Only the top 5 results need to be displayed, so results need to be ordered by cost and then limited to 5 most expensive projects
Our dataset provides the following breakdown:
- id - individual id of each project
- title - name of the project
- budget - total cost of each project
- emp_id - integer value of the employee’s id
- project_id - unique id for each project, same value as id in ms_projects table
Upon reviewing the budget column in ms_projects, one can see that we are working with whole numbers, therefore the assumption that we are working with integer division is also made.
The desired SELECT statement needs to return the title of the project (so it can easily be identified) and the budget per employee on the project. The beginning of the select is simple:
SELECT title AS project,
But now we must decide how to return the cost of the project per employee. For this approach, we’ll be using integer division. First, we’ll test the theory that each project may have multiple employees. To test this, we use the COUNT() function. Slap this into a quick query (the join shown here will be explained further down), and the results show this assumption is correct:
SELECT title AS project, COUNT(emp_id) AS emps_per_project FROM ms_projects p INNER JOIN ms_emp_projects e ON p.id = e.project_id GROUP BY 1, budget
To calculate the cost of employee per project, we divide the budget column from ms_projects by the total count of employees per project. This is accomplished using COUNT(emp_id):
SELECT title AS project, budget/COUNT(emp_id) AS budget_emp_ratio
In order to gather all columns needed for this SELECT statement, the ms_projects and ms_emp_projects tables must be joined together.
Much like the solution above, in order to produce the desired results, a JOIN statement must be used in this query. Here we will conduct an INNER JOIN on ms_projects and ms_emp_projects where the project_id column in ms_emp_projects is equivalent to the id column on ms_projects:
SELECT title AS project, budget/COUNT(emp_id) AS budget_emp_ratio FROM ms_projects p INNER JOIN ms_emp_projects e ON p.id = e.project_id
Since COUNT() is an aggregate function, it must be used in conjunction with the GROUP BY clause on both the project title and the budget.
SELECT title AS project, budget/COUNT(emp_id) AS budget_emp_ratio FROM ms_projects p INNER JOIN ms_emp_projects e ON p.id = e.project_id GROUP BY 1, budget
Here are the results of this query as it currently stands:
Finally, since our Microsoft data scientist interview question is asking for the top 5 most expensive projects, we sort them in descending order and a LIMIT clause is added so only the top 5 rows are returned in the result set.
SELECT title AS project, budget/COUNT(emp_id) AS budget_emp_ratio FROM ms_projects p INNER JOIN ms_emp_projects e ON p.id = e.project_id GROUP BY 1, budget ORDER BY 2 DESC LIMIT 5
Here is the final result:
Commonly, an interviewee will be asked whether a solution can be optimized. If not, what makes the solution optimal? The query here has been optimized, so we will concentrate on why there are no suggestions for improvement:
- The solution meets expectations
- The join method used
- Limiting the results returned
Our solution here provides results that meet the requirements of the brief: nothing more, nothing less. One of the first questions that should be asked during optimization is whether the current solution produces desired and accurate results.
Next, the join method used here was an INNER JOIN. An inner join reviews the two tables being joined and only returns rows where the join condition from the left table is satisfied on the right. This excludes unnecessary data and handles null entries so they don’t need to be accounted for later in the query. Had a LEFT JOIN been used in this instance, the optimization suggestion would have been to change the join method to INNER. Why? A left join actually performs two joins: first an inner join is completed; then for the remaining rows in the table on the left side of the argument that have no match on the right, a row is added to the resultset that contains the data from the left row and inserts NULL values for the right. Not only does this take additional time to run the query, the null values must also be accounted for in the query itself. So, here an INNER JOIN was the correct approach.
Finally, limiting the results returned by the query not only satisfies the parameters of the assignment, but it also saves processing time. It is important to note that limiting the results is only optimized if results are first ordered using an ORDER BY clause. This is because, without first ordering the results, the built-in postgreSQL query optimizer won’t have an ordering constraint to hold onto when generating its query plan, which means your query will produce unordered, unpredictable results.
You’ve now seen how common SQL interview questions can be broken down into their base concepts, the logic used to work through the solutions, and techniques that could be used to optimise those results. We concentrated on JOINS and aggregate functions, which are used in day-to-day life as a data scientist or analyst. You’ve also seen how that data is organized using COUNT and ORDER BY. In many situations, you will be presented with multiple tables from which to pull and organize data, so familiarity with JOINS and their different techniques is a must. If you’ve enjoyed this article, check out our previous articles "Microsoft SQL interview questions" and "Microsoft data science interview questions" which can help you prepare for a few more concepts that frequently arise in interviews.