Microsoft Data Science Interview Questions - Employee Salaries (ETL Error)
This data science interview question was asked by Microsoft. We’ll cover both the question and answer with a detailed explanation of the approach. We’ll walk through each step of my answer, assumptions, approach, and explain every line of code I write.
Microsoft Data Science Interview Question
This is one of easy level Microsoft data science interview questions that mainly tests if you understand how to de-duplicate data. It’s not obvious how to de-duplicate the data but once you figure out that there are many duplicate employees and that the maximum salary is actually the current salary, then the SQL query is a matter of using the max() function and GROUP BY to de-duplicate employee names to generate a list of unique employee names with their current salary. This question may be easy but it’s a very common data science interview question asked by Microsoft, Facebook, and Google.
Employee Salaries (ETL Error)
Due to an ETL error, the `employee_salary` table instead of updating the salaries every year when doing compensation adjustments, did an insert for some employees instead. Now there are a few records where there are duplicate employees but different salaries. The head of HR still needs the current salary of each employee. Assuming that salaries increase every year due to promotions or to keep up with the cost of living, find the current salary for each employee and output their id, first name, last name, department ID, and current salary. Order your list by employee ID in ascending order.
Also, find a link below to follow along with me. This link will take you to the StrataScratch platform:
Exploring the Data and Understanding Assumptions
There's a lot of background to this question which helps us with defining assumptions. Just taking a look at the data and if we click on preview, what we see essentially are duplicates in the employee table.
If we see the output, we have two Todds, two Justins, two Patricias and two Natashas. Essentially, there are different salaries for each of them. One salary is probably from the last year and another salary is their current salary.
Tip: Quickly Find Duplicates
Hypothetically, it isn't very obvious that there are duplicates in the table. So, how would we double-check just to ensure that there are duplicates in employee names?
One way to just double-check to see if there are duplicates in this table, I can write the following SQL statement.
SELECT id, count(id) as cnt FROM ms_employee_salary GROUP BY id ORDER BY cnt DESC
In this SQL query, it's taking the id column which we're assuming is the employee ID and then it's using a count function to count how many IDs that it sees according to the above id column, and then applying a GROUP BY, and then an ORDER BY count in descending order.
The highest count is listed first followed by the second-highest count all the way down to the lowest count.
If we run this block of code, what we get are the following IDs that are actually duplicated in the dataset.
Employee ID 75 is in the dataset five times. So, that is a programmatic way to check if there are duplicate employee IDs.
Now, let's get back to answering the question.
Developing The Framework For The Solution
There's an assumption in the question that the employee salaries increase every year. And this is actually the most important assumption of the question itself.
So, we'll write:
-- assumption is that employees salaries increase each year so the current salary is the highest salary found in the table
This means that we can use a max function to find the highest salary which equates to their current salary.
Now, we'll write - use max function in SELECT clause to find the highest salary of the employee.
-- use max() in SELECT clause to find highest salary of employee
The next part of the question is to output their ID, first name, last name, department ID, and current salary.
So, we'll write - output ID, first name, last name, department ID and max salary, which is the current salary. Now, we also know that all of this will go into the SELECT clause.
-- output id, first name, last name, department ID, max(salary) in SELECT clause
So with these points that we've written, we have everything we need to start coding.
Creating the SQL Solution
Our SQL query would look like the following:
SELECT id, first_name, last_name, department_id, max(salary) as current_salary FROM ms_employee_salary GROUP BY id, first_name, last_name, department_id ORDER BY id ASC
So, we have the max (salary), which is the current_salary. And above we have the other columns that the question wants us to display as well. But we also have to put in a GROUP BY and then we put the columns that we're grouping by. Then the last part of the question is asking us to order by id in ascending order.
If we run this query, we have the following output:
The output is the id, first_name, last_name, department_id, and the employee's current_salary which is the max salary found in the data.
All other salaries are removed from the output of this dataset and all employees are unique. There are no duplicate employees now.
That's it for the solution. There's one more thing that I wanted to talk about in terms of the GROUP BY statement. In this GROUP BY statement, we write out the column names like id, first_name, last_name, department_id and a lot of times though, you can get away with just using numbers and these are basically the column positions.
SELECT id, first_name, last_name, department_id, max(salary) as current_salary FROM ms_employee_salary GROUP BY --id, first_name, last_name, department_id 1, 2, 3, 4 ORDER BY id ASC
Here, 1 would correspond to id, 2 would correspond to first_name and so on.
And if we run this SQL query, we get the exact same output.
The reason why just really depends on the type of database that you're using, this is a Postgres database which allows you to do 1, 2, 3, and 4 and so on. MySQL databases allow you to do the same thing but there are newer databases out there like Hive databases and I believe that you're not actually allowed to use numbers, you have to use column names.
I've just been in the habit these days to actually write out the column names because I use Hive at work and a lot of other companies have started using these newer databases like Hive and Snowflake and so forth.
If it doesn't really matter to you, you can use numbers or you can actually write out the column names.