Microsoft SQL Interview Questions
We'll discuss one of the interesting Microsoft SQL interview questions to help prepare you for applying for a role as a Microsoft data scientist.
Being one of the top tech conglomerates in the world, Microsoft has a huge reach with a large diversity in job opportunities and descriptions. Below is a selection of a few real-world scenarios that one can expect to be presented with in an interview with Microsoft. In the below sections, we will show you how to break down the problems presented, step through the logic used to solve them, and discuss ways each resolution could be optimized for better performance.
Concepts Tested in This Microsoft SQL Interview Question
The main concepts that are tested in the following Microsoft SQL interview question include:
- Aggregate Functions - Sum/Group By/Order by
Microsoft SQL Interview Question
Find the total number of downloads for paying and non-paying users by date. Include only records where non-paying customers have more downloads than paying customers. The output should be sorted by earliest date first and contain 3 columns date, non-paying downloads, paying downloads.
This Microsoft SQL interview question can be found here on StrataScratch so you can follow along with this article or come up with your own solution.
In this case, this Microsoft SQL interview question is relatively straightforward. We know that downloads are recorded by date and are documented as paying and non-paying customers. The tables track the customers by customer and account IDs. We assume that each user may potentially be associated with multiple accounts, which will be addressed later.**
In reviewing the data provided, we can break down the columns from each of the tables:
- user_id - we assume each user is assigned to a unique ID
- acc_id - much like the user_id element, each account appears to be assigned to a unique ID
- paying_customer - as this column is assigned to the varchar data type, we assume that the value in this field will be “yes” or “no” per each acc_id shown in the ms_acc_dimension table
- date - a datetime value
- downloads - we assume this column will track the total number of downloads per date, per user_id, based on the columns in use on the ms_download_facts table
As such from the table breakdown, we can assume the following about these three tables:
- The ms_user_dimension table tracks which user is assigned to which account ID
- The ms_acc_dimension table tracks if each account ID is a paying customer or not
- Finally, in ms_download_facts, the data indicates that each user’s download activity is summed and tracked daily
Our approach involves breaking down our actions into smaller, more manageable chunks. These steps will involve identifying the columns to join and the join method, preparing the select statement to gather the desired data, and filtering that data.
1. Joining Tables
First, we concentrate on the method by which we will be joining the table values. In this case, both a LEFT JOIN and an INNER JOIN would be appropriate. In the example below and in the embedded video, an INNER JOIN is used. You may note that the code below does not specify that we are using an INNER JOIN: when no specific method is selected for a JOIN, INNER is the default. More will be touched on the use of INNER JOIN vs LEFT JOIN further on. For readers that are following around in the StrataScratch platform, the suggested solution does show a LEFT JOIN method. Read below for notes on the differences you see here.
The breakdown of our JOIN statement is as follows:
- Since the ms_user_dimension table assigns each user id to an account ID, this will be our base (left) table
- We need to be able to connect whether each individual is a paying customer or not (from ms_acc_dimension)
- Finally, we need to track downloads and date from ms_download_facts
Therefore, we first will join ms_user_dimension to ms_acc_dimension on the common column, acc_id.
FROM ms_user_dimension u JOIN ms_acc_dimension c ON u.acc_id = c.acc_id
Here is what the data looks like after the first join:
We’re still missing the date of the downloads and the download count itself. To retrieve this data, a join must be completed on the ms_download_facts table, as well. We join to ms_user_dimension on the user_id column.
FROM ms_user_dimension u JOIN ms_download_facts f ON u.user_id = f.user_id
Shown below, our join now returns data from ms_user_dimension (yellow), ms_acc_dimension (orange) where the acc_ids match, and ms_download_facts (red) where the user_ids match. This is what the complete JOIN statement looks like now:
FROM ms_user_dimension u JOIN ms_acc_dimension c ON u.acc_id = c.acc_id JOIN ms_download_facts f ON u.user_id = f.user_id
**As promised above, a note on the INNER JOIN vs LEFT JOIN: in cases where there are NULL values in the right-hand side of the argument, an INNER JOIN may trigger an exception. This is because the INNER JOIN returns all values on the left and right side of the argument, even when the matching data on the right does not exist. To completely avoid this risk, a LEFT JOIN may be used. A LEFT JOIN returns all values on the left side of the argument, and only values on the right side of the argument where a match exists. Since we’re specifying in the query, however, that only results where a paying customer is specified with a ‘yes’ or ‘no’ value in the paying_customer column, the NULL value exception is mitigated.
2. Creating the SELECT Statement
Now that we have a visualized data set, the SELECT statement can be prepared. We know our data needs to return three columns: the date, the number of non-paying downloads for that day, and the number of paying downloads for that day.
SELECT date, [non-paying downloads], [paying downloads]
As can be seen from the table above, we can see that multiple users downloaded on the same day, so the SUM() function will be used to sum the download count. Since SUM() is an aggregate function, it must also be used in conjunction with the GROUP BY clause to group like rows. We want to keep the non-paying downloads and paying downloads results separate, so a CASE statement will be embedded SUM() function to produce only results for non-paying and paying customers respectively.
GROUP BY can be used with the column name itself or the column number. Since the date column is column 1, GROUP by 1 is used. We will also order the results by date with ORDER BY. Ascending (ASC) is the default when not specified for the order.
SELECT date, SUM(CASE WHEN paying_customer= ‘no’ THEN downloads END ) AS non_paying, SUM(CASE WHEN paying_customer=’yes’ THEN downloads END ) AS paying GROUP BY 1 ORDER BY 1
Add the SELECT and JOIN statements together, and this is our new code, along with the query results:
SELECT date, SUM(CASE WHEN paying_customer='no' THEN downloads END) AS non_paying, SUM(CASE WHEN paying_customer='yes' THEN downloads END) AS nonpaying FROM ms_user_dimension u JOIN ms_acc_dimension c ON u.acc_id = c.acc_id JOIN ms_download_facts f ON u.user_id = f.user_id GROUP BY 1 ORDER BY 1
3. Filtering Results from the SELECT Statement
If we were just tasked with displaying the total number of downloads on any one day, we would be finished; but in this case a final step is needed: filtering the data. This is where a CTE (common table expression) comes in handy. Use of a CTE is beneficial in terms of readability, and it is easy to filter the desired data returned from resultset. A CTE is created using a WITH clause (shown below).
WITH temp AS( SELECT statement here… )
Adding the filter is now as simple as selecting all rows from the CTE and using a WHERE clause to return only rows where the non_paying column’s value is larger than the paying column.
WITH temp AS( SELECT statement here… ) SELECT * FROM temp WHERE non_paying > paying
That’s it! Since the data in the CTE was already ordered by date, the SELECT statement will already return organized results and no other action is needed. Here is the final code and the results:
WITH temp AS( SELECT date, SUM(CASE WHEN paying_customer='no' THEN downloads END) AS non_paying, SUM(CASE WHEN paying_customer='yes' THEN downloads END) AS nonpaying FROM ms_user_dimension u JOIN ms_acc_dimension c ON u.acc_id = c.acc_id JOIN ms_download_facts f ON u.user_id = f.user_id GROUP BY 1 ORDER BY 1 ) SELECT * from temp WHERE non_paying > paying
It is common that an interviewer will ask if the answer can be optimized. When this question arises, ask yourself the following questions:
- Does this solution meet all of the requirements?
- Can this solution be optimized to run faster?
- Can this solution be optimized to use less resources?
In this case, the results are already fairly optimized and straightforward. The results returned exactly what was requested. In our initial SELECT statement, we defined fields to return instead of using SELECT *. Specifying column names instead of using the asterisk is always a good practice to save time and resources.
Next, the use of a CTE in this solution was faster than a temp table would have been. CTEs are automatically disposed of after the query completes and [typically] do not reside in memory, thereby keeping resources open for other processes. The use of CTEs vs temp tables is highly debated upon. In cases where the dataset is not particularly large and when the reset of the query is not affected by the data returned by the CTE, the CTE tends to be faster. As the size of your data increases, however, or if the results need to be referred to multiple times in your query, a temp table would be more appropriate. For more information on this discussion, here is a great resource.
Finally, let’s examine the use of the LEFT JOIN in StrataScratch’s suggested solution, versus the INNER JOIN that was used in the code shown above and in the video linked at the top of this article. Traditionally speaking, an INNER JOIN processes faster than a LEFT JOIN. This is because the LEFT JOIN first completes an INNER JOIN, then finds all rows on the left side of the argument that did not have a match and inserts those rows into the result set with a NULL value in the columns from the right-hand side. In StrataScratch’s solution, the LEFT JOIN is used to account for edge cases where a particular user has not been assigned to an account or they have made no downloads. Typically, you would want to identify possible edge cases and clarify those expectations with an interviewer before beginning. The INNER JOIN would ignore those rows, omitting them from the query. If the interviewer wanted those values included in the query, then the use of INNER JOIN would fall short.
If you’ve made it to this conclusion section, you’ve seen how common interview questions can be broken down into their base concepts, the logic used to work through the solutions, and techniques that could be used to optimise those results. In this Microsoft SQL interview question, we’ve reviewed JOINS and aggregate functions, much of which are used in day-to-day life as a data scientist or analyst. You will often be presented with multiple tables from which to pull data, so familiarity with JOINS and their different techniques along with edge case handling is a must. The skills needed to filter through these tables–-sub-queries, CTEs, and case statements were examples shown today—and the ability to know when to use them is highly valued in the field. Sharpening these skills is always beneficial for all, whether you are just now preparing for the data science interview process or have been on the job for years.