Get data science interview questions delivered to your inbox
Please enter a valid email address.

Microsoft SQL Interview Questions for Data Science Position

Recent Microsoft SQL interview questions for the Data Science position

In this article, we discuss real-life Microsoft SQL interview questions with detailed solutions for the Data Science position. We solve these SQL interview questions step by step and we will also provide you with a process to solve similar problems and optimize code once you get the solution. You can always practice real-life questions from Uber and other top tech companies on our platform.

Data Science Role at Microsoft

Microsoft operates in a range of technology domains serving both B2B and B2C segments. The Data Science role at Microsoft is therefore dependent on the business unit that you are interviewing for as well as the seniority of the role.

Areas Tested in Microsoft SQL Interviews

The main concepts tested in the Microsoft SQL Data Science Interviews include.

  • Subquery
  • Window Functions
  • DENSE_RANK()
  • CTE
  • Group By
  • Joins
  • Datetime functions

You can practice these and more such questions on the StrataScratch platform and become interview ready.

Microsoft SQL Interview Question

Bottom 2 Companies By Mobile Usage

Write a query that returns a list of the bottom 2 companies by mobile usage. Mobile usage is defined as the number of events registered on a mobile client_id. Order the result by the number of events ascending.
In the case where there are multiple companies tied for the bottom ranks (rank 1 or 2), return all the companies.

You can solve this question here: https://platform.stratascratch.com/coding-question?id=2026&python=

Dataset

fact_events

idint
time_iddatetime
user_idvarchar
customer_idvarchar
client_idvarchar
event_typevarchar
event_idint

Assumptions

The last statement of this Microsoft SQL interview question should tell you that there are going to be edge cases. Therefore, you need to incorporate that in your assumptions. Further, it is advisable that you mention it while confirming your assumptions to showcase that you have read the problem properly.

Since you typically will not have access to the underlying data in the interview, you will have to ensure that your solution boundaries are reasonably well defined. The first thing that strikes you when you look at the schema is the number of id fields that are present in the data. It will not be a bad idea to confirm with the interviewer regarding the assumptions that you make and their validity.

What are the assumptions on the data and the table?

Let us look at each of the id fields.

  • id, time_id and user_id appear to be the serial number, timestamp, and user id fields. They can be safely ignored for solving this problem.
  • customer_id appears to be the company identifier. This is our grouper field for the data. We will aggregate the data, grouping by this field.
  • client_id this field represents the platform identifier – mobile, desktop, etc. We need to subset the data based on this field, since we want to extract only the mobile client_id as per the problem.
  • event_id would be the identifier for an event and event_type will describe the type of event.

Again, please clarify these assumptions with the interviewer to ensure that you do not end up on the wrong track.

This is the table that we will be working with –

Logic

Once we get a handle on the data, this Microsoft SQL interview problem appears quite straightforward. A prime example of how the right assumptions can make your life quite easy.

  1. We need to subset the events based on client_id = ‘mobile’
  2. Then, count the number of events grouped by customer_id
  3. Rank the companies (customer_id) in the ascending order of the number of events.
  4. Select the lowest two (in this case companies ranked 1 and 2) from the data.

Wait!! There is a sting in the tail.

We were required to choose the appropriate ranking algorithm for this. It is mentioned that in case of a tie, list all the companies with the same rank. So our plain vanilla RANK() function will not suffice. We need to use the DENSE_RANK() function. Let me illustrate the difference with a simplified dataset.

Suppose we have a table of time taken to complete a task by a set of students. A simple rank function will give us results akin to this.

If you observe carefully, three girls – Amy, Bianca and Cathy are tied for the first spot. Hence, the next rank is 4. Our interview question however requires us to list the bottom two companies by number of events. In case there is a tie for the first spot, there will not be any second spot. However, there will be a second lowest event and we are required to report that. We will need something like this.

To accomplish this in SQL, we use the function DENSE_RANK(). As per the documentation, DENSE_RANK() ranks the values without leaving any gaps. For most cases, RANK() and DENSE_RANK() will return the same values. However, in case of a tie, there will be dissimilarities. This fine distinction is the difference between the complete solution and an incomplete one.


Now that we have fixed our logic, let us move forward to solving this Microsoft SQL interview question for the Data Science position. We will combine as many steps as possible to make the code more optimized.

Solution:

1. Subset events based on client_id = ‘mobile’

select * from fact_events where client_id = 'mobile';

2. Count the number of events grouped by customer_id

select customer_id, count(*) as num_events from fact_events
where client_id = 'mobile'
group by customer_id;

3. Rank the companies (customer_id) in the ascending order of the number of events. Use DENSE_RANK()instead of RANK()

select customer_id, count(*) as num_events, 
dense_rank() over (order by count(*)) as rank from fact_events
where client_id = 'mobile'
group by customer_id;

4. Finally, select the lowest two (in this case companies ranked 1 and 2 by the number of events) from the data

select customer_id, num_events from 
         (
         select customer_id, count(*) as num_events, 
         dense_rank() over (order by count(*)) as rank 
         from fact_events
         where client_id = 'mobile'
         group by customer_id) q1
where rank <=2
order by num_events asc;

Optimization

This is a popular follow-up question. What should one look for? Low hanging fruits include condensing CTEs and subqueries or removing JOINs and using a CASE statement instead.

For our solution:

  1. Let us look at all the queries to see if we can condense/collapse some.
  2. Can we do away with the subquery? We cannot. Window functions are not supported in the HAVING or WHERE statement. So, you need a subquery.
    1. What are other ways to write this? You can use a CTE or a temp table.
    2. You then might get asked, which one perform's better?
      1. https://dba.stackexchange.com/questions/13112/whats-the-difference-between-a-cte-and-a-temp-table/13117#13117
      2. Temp tables if you are going to use this for other queries.
      3. Temp tables can be indexed so this itself can make the query faster.

Additional Microsoft SQL Interview Questions for the Data Science Position

Microsoft SQL Interview Question #1: Users Exclusive Per Client

Users Exclusive Per Client

Write a query that returns a number of users who are exclusive to only one client. Output the client_id and number of exclusive users.

You can solve this problem here

https://platform.stratascratch.com/coding-question?id=2025&python=

Dataset

This Microsoft SQL interview problem uses the same fact_events dataset used in the previous problem. This problem is roughly the same level of difficulty as the one we discussed. However, the manipulations needed are slightly different. This real-life Microsoft SQL Data Science interview question can be solved using multiple DISTINCT conditions and subqueries.

Approach

  1. Identify the users with only one client_id
  2. From this group of users, query the respective client_id

Microsoft SQL Interview Question #2: Top Company Where Users Use Desktop Only

Top Company Where Users Use Desktop Only

Write a query that returns the top company in terms of events where users use desktop only.

You can solve this problem here

https://platform.stratascratch.com/coding-question?id=2027&python=

Dataset

This problem uses the same fact_events dataset used in the previous problems. This problem can be thought of as a combination of the previous two problems. This real life Microsoft SQL Data Science problem can be solved using RANK() statements and subqueries.

Approach

  1. Identify the Desktop users who use only one client.
  2. Aggregate these events attended by these users by company (customer_id)
  3. Find the company with the highest number of users using the RANK() function. Note here we will get the same result irrespective whether we use the RANK() or the DENSE_RANK() functions. Can you figure out why?

Microsoft SQL Interview Question #3: New And Existing Users

New And Existing Users

Calculate the share of new and existing users. Output the month, share of new users, and share of existing users as a ratio. New users are defined as users who started using services in the current month. Existing users are users who started using services in the current month and used services in any previous month. Assume that the dates are all from the year 2020.

You can solve this problem here

https://platform.stratascratch.com/coding-question?id=2028&python=

Dataset

This is one of the toughest Microsoft SQL interview questions. This too uses the same fact_events dataset. To solve this real-life Microsoft Data Science problem, you will need a combination of CTE, date time functions and JOINs. You can check out our ultimate guide on SQL interview questions to learn more about these and other data manipulation questions using SQL.

Approach

  1. Extract month from time_id field. You can use the DATE_PART() or EXTRACT() functions to accomplish this.
  2. Aggregate the total number of active users per month.
  3. Find the number of New Users for each month.
    1. To do this find the first time_id and then extract the month from that time_id
  4. Merge the queries or CTEs and calculate the new and existing user shares for each month

Check out our post Microsoft Data Scientist Interview Questions which is focused exclusively on Python perspective to solve this question.

Conclusion

In this article of Microsoft SQL interview questions for Data Science position, we have discussed in detail an approach to solving a real-life Microsoft Data Science interview question. The question was not too tough, however we needed to understand the data and appreciate the fine difference between the RANK() and the DENSE_RANK() functions.

This can be accomplished only with practice of solving a variety of problems. Join our platform to practice more such interview questions from data science companies like Google, Facebook, Amazon, Microsoft, Netflix and more. We have a community of over 20,000 aspiring data scientists seeking to improve their coding skills, prepare for interviews, and jump start their career.