Microsoft Data Scientist Interview Questions

This is one of the hard Microsoft data scientist interview questions that has a complex solution that involves multiple SUB QUERIES and a LEFT JOIN to calculate a percentage. You’ll be expected to cast data types and implement a WHERE clause to finalize the solution as well. This is definitely one of the more complex questions Microsoft asks in interviews. This data science question covers concepts that are commonly found in data science interviews at Microsoft, Facebook, and Google.

Microsoft Data Scientist Interview Question

Valuable Departments
Select the top 3 departments by the highest percentage of employees making over $100K in salary and have at least 10 employees. Output the department name and percentage of employees making over $100K. Sort by department with the highest percentage to the lowest.

Our suggestion is to try to solve this Microsoft data scientist interview question first before reading the solution. You can try to solve the interview question here:
https://platform.stratascratch.com/coding-question?id=10298&python=

Exploring the Data and Understanding Assumptions

The first thing we have to do is explore some of the data, understand the columns, and then understand the assumptions that we just read in the question.

We have two data tables: ms_employee and ms_department.

Under ms-employee, we have the name of the employee, their salary, and the department ID. And then ms_department gives the name of the department by department_id.

If we hit the preview button for ms_employee, we get the below data:

And if we hit the preview button for ms_department, we get the department IDs and their corresponding names:

Now, we have to join the department table with the employee table so that we can map the name, the salary, with the name of the department.

Next, we have to focus on the assumptions in this data scientist interview question by Microsoft along with the approach that we have to take to write a query to answer this question. The important things we have to do is keep the employees that make 100,000 in salary and then keep the departments with at least 10 employees.

After that, we can take a percentage of employees that make over $100K divided by all of the employees in that department.

Let's write out what our approach is step by step.

Developing the Solution Framework

The first thing we want to do is identify employees making over $100k, and we know that we're going to use the employee table, and then we're going to map that employee to its department name. So, the important thing is really the logic to identify employees that make over $100k.

So, we have to write:

-- Identify employees making over $100k in salary and mapping
   employee to department name using an INNER JOIN

We write an INNER JOIN because we're assuming that every employee in the data set is mapped to a department. If that's not necessarily the case, then I would ask the interviewer whether or not I have to filter out employees that don't have a department listed or mapped to, or keep those employees and somehow deal with them in another way. But in our case, and for this Microsoft data scientist interview question, let's just assume that all employees are mapped to a department. So, in this case, we can use an INNER JOIN.

Now we want a list of all employees mapped to their department. This will serve as the denominator of our percentage calculations.

-- Identify employees making over $100k in salary and mapping
   employee to department name using an INNER JOIN (This serves as the
   numerator)
-- Map all employees with their department (This serves as the
   denominator)

Once we have these two tables, we will join these two tables together so that we can take a percentage calculation, which is numerator divided by denominator, multiplied by 100 to give a percentage. In this case, it's going to be the numerator. So, we'll call it numerator.

-- Identify employees making over $100k in salary and mapping
   employee to department name using an INNER JOIN (This serves as the
   numerator)
-- Map all employees with their department (This serves as the
   denominator)
-- numerator / denominator using a LEFT JOIN

We are taking a percentage, which is just a ratio, so we need to use a LEFT JOIN because we want to preserve all of the real counts. That's the most important part.

Now, what we have to do is filter out all departments that have less than 10 employees. We can use a WHERE clause to do this.

-- Identify employees making over $100k in salary and mapping
   employee to department name using an INNER JOIN (This serves as the
   numerator)
-- Map all employees with their department (This serves as the 
   denominator)
-- numerator / denominator using a LEFT JOIN
-- Use a WHERE clause to filter departments with less than
   10 employees

Now, we have all the employees that make over $100k and we have all the departments that have 10 or more employees. What we want to do after that is sort the departments.

-- Identify employees making over $100k in salary and mapping
   employee to department name using an INNER JOIN (This serves as the
   numerator)
-- Map all employees with their department (This serves as the
   denominator)
-- numerator / denominator using a LEFT JOIN
-- Use a WHERE clause to filter departments with less than
   10 employees
-- Sort department with highest percentage to lowest
-- Take top 3 departments

There will be a few other things that we'll add to this, but it's going to be more obvious once we start writing code, so let's do that now.

1st SUBQUERY - Employees Making Over $100K

For the first part what we want to write is actually the numerator. We need to identify employees making over $100k salary, and then mapping the employee to the department. And this will end up being a subquery.

What we will do is just start with the table. It's ms_employee, and then alias it with an a. Then we'll JOIN that with the department table, alias that with b on the department_id.

-- Identify employees making over $100k in salary and mapping 
   employee to department name using an INNER JOIN (This serves as the
   numerator)
-- Map all employees with their department (This serves as the 
   denominator)
-- numerator / denominator using a LEFT JOIN
-- Use a WHERE clause to filter departments with less than
   10 employees
-- Sort department with highest percentage to lowest
-- Take top 3 departments

FROM ms_employee a
JOIN ms_department b ON a.department_id = b.department_id

Now, let's handle the SELECT clause. We want the name of the department, and then a count of that employee. So, we'll use the employee ID. We'll just call this column - "employees that basically make $100k". This is to count the number of employees by department that make more than $100k in salary.

SELECT
  name,
  count(id) as emp_xs100K
FROM ms_employee a
JOIN ms_department b ON a.department_id = b.department_id

We have to add now in the logic of employees making over $100k. So that's going to happen in the WHERE clause, and WHERE, it could be salary over 100,000.

SELECT
  name,
  count(id) as emp_xs100K
FROM ms_employee a
JOIN ms_department b ON a.department_id = b.department_id
WHERE salary > 100000

What we're going to get out of this query is a list of departments and a count of employees that make over $100k in salary.

The next thing we need to do is add a GROUP_BY.

SELECT
  name,
  count(id) as emp_xs100K
FROM ms_employee a
JOIN ms_department b ON a.department_id = b.department_id
WHERE salary > 100000
GROUP BY name

Now, let's run this code and see if it works.

We get the department name and then the number of employees making over $100k.

2nd SUBQUERY - All Employees

What we have to do next is come up with the denominator, which is going to be a list of departments, and a count of all the employees within that department. It's going to look very similar to the numerator or the SQL query we just wrote. Let's start with the SELECT clause.

SELECT
  name,
  count(id) as emp_all
FROM ms_employee a
JOIN ms_department b on a.department_id = b.department_id
GROUP BY name

We have the name, and the count, the name of the department, and then the count of all of the employees within that department. And then we're grouping by the department name. It looks almost identical to the above query, except the logic of employees having a salary over $100k. We have these two done. And now, we have to join these two queries together using a LEFT JOIN.

LEFT JOIN Both SUBQUERIES

FROM (
    SELECT
      name,
      count(id) as emp_xs100K
    FROM ms_employee a
    JOIN ms_department b ON a.department_id = b.department_id
    WHERE salary > 100000
    GROUP BY name) t
LEFT JOIN (
    SELECT
      name,
      count(id) as emp_all
    FROM ms_employee a
    JOIN ms_department b on a.department_id = b.department_id
    GROUP BY name) s

So, we have our LEFT JOIN, and now we need a key. The key will be the department name. We're going to match these two tables together just by department name.

FROM (
    SELECT
      name,
      count(id) as emp_xs100K
    FROM ms_employee a
    JOIN ms_department b ON a.department_id = b.department_id
    WHERE salary > 100000
    GROUP BY name) t
LEFT JOIN (
    SELECT
      name,
      count(id) as emp_all
    FROM ms_employee a
    JOIN ms_department b on a.department_id = b.department_id
    GROUP BY name) s
  ON t.name = s.name

Creating the Percentage in SELECT & Casting Data Type

We have the two subqueries left joined together with the key. Now, we'll start with the SELECT, have the department name, and take the calculation.

SELECT
  t.name,
  (t.emp_xs100k / s.emp_all)
FROM (
    SELECT
      name,
      count(id) as emp_xs100K
    FROM ms_employee a
    JOIN ms_department b ON a.department_id = b.department_id
    WHERE salary > 100000
    GROUP BY name) t
LEFT JOIN (
    SELECT
      name,
      count(id) as emp_all
    FROM ms_employee a
    JOIN ms_department b on a.department_id = b.department_id
    GROUP BY name) s
  ON t.name = s.name

We have the count of employees that make over $100k divided by the count of all employees. These are two integers. In order to make a non-integer as an output, we need to convert one of these counts or numbers to a FLOAT or a decimal.

  (t.emp_xs100k / s.emp_all::FLOAT)

And then to change that into a percentage, we'll multiply by 100. Then we will name this column PCT, for percentage.

(t.emp_xs100k / s.emp_all::FLOAT)*100 as emp_xs100k_pct

But we're not done yet! To solve this Microsoft data scientist interview question, we still need to apply WHERE, ORDER BY and LIMIT.

WHERE, ORDER BY, LIMIT

Now, we need to apply the WHERE clause to filter departments with less than 10 employees, and then we need to sort the department from highest to lowest percentage. We will use an ORDER BY, and then a limit to take the top three.

SELECT
  t.name,
  (t.emp_xs100k / s.emp_all::FLOAT)*100 as emp_xs100k_pct
FROM (
    SELECT
      name,
      count(id) as emp_xs100K
    FROM ms_employee a
    JOIN ms_department b ON a.department_id = b.department_id
    WHERE salary > 100000
    GROUP BY name) t
LEFT JOIN (
    SELECT
      name,
      count(id) as emp_all
    FROM ms_employee a
    JOIN ms_department b on a.department_id = b.department_id
    GROUP BY name) s
  ON t.name = s.name
WHERE s.emp_all >= 10
ORDER BY emp_xs100k_pct DESC
LIMIT 3

We have the WHERE clause where we're taking the employee all, so the count of all employees. We also want to make sure that that's equal or greater than 10. And then we order this percentage calculation in descending order so the highest is listed first. And then we take the top three departments with the highest percentage.

Let's run this code, now.

We got the department names and the percentages. And it is in descending order, and you only have three. And these are the top three.