Top Five SQL Window Functions for Data Science Interviews
Study smart, not hard.
SQL is the universal language in the data world and is the most important skill to nail down as a data professional.
The reason SQL is so important is that it is the main skill that is required during the data wrangling phase. A lot of data exploration, data manipulation, pipeline development, and dashboard creation is done through SQL.
What separates great data scientists from good data scientists is that great data scientists can wrangle data as much as the capabilities of SQL allow. A big part of being able to fully use all that SQL has to offer is by knowing how to use window functions. We also recommend checking out our ultimate guide to SQL Window Functions.
Let’s dive into it!
1) Deltas (Current vs Previous)
LEAD() and LAG() are mostly used when comparing one period of time with the previous period of time for a given metric. To give a few examples…
- You can get the delta between each year’s sales and the previous year’s sales
- You can get the delta in the number of sign-ups/conversions/website visits on a month to month basis
- You can compare user churn on a monthly basis
The following query shows how you can query the monthly percent change in costs
with monthly_costs as ( SELECT date , monthlycosts , LEAD(monthlycosts) OVER (ORDER BY date) as previousCosts FROM costs ) SELECT date , (monthlycosts - previousCosts) / previousCosts * 100 AS costPercentChange FROM monthly_costs
2) Cumulative Sums
Calculating running totals can be simply done through a windows function that starts with SUM() or COUNT(). This is a powerful tool when you want to show the growth of a particular metric over time. More specifically, it’s useful in the following circumstances:
- Get a running total of revenue and costs over time
- Get a running total of time-spent-on-app per user
- Get a running total of conversions over time
The following example shows how you can include a cumulative sum column of monthly costs:
SELECT date , monthlycosts , SUM(monthlycosts) OVER (ORDER BY date) as cumCosts FROM cost_table
3) Moving Averages
AVG() is really powerful in windows functions as it allows you to compute moving averages over time.
Moving averages are a simple, yet effective, way to forecast values in the short term. They’re also extremely useful at smoothing out volatile curves on a graph. Generally, moving averages are used to gauge the general direction of where things are moving.
- They can be used to get the general trend of weekly sales (is the average going up over time?). This would indicate growth as a company.
- They can likewise be used to get the general trend of weekly conversions or website visits.
The following query is an example of getting the 10 day moving average for conversions.
SELECT Date , dailyConversions , AVG(dailyConversions) OVER (ORDER BY Date ROWS 10 PRECEDING) AS 10_dayMovingAverage FROM conversions
ROW_NUMBER() is particularly useful when you want to get the first or last record. For example, if you have a table of when gym members came to the gym and you want to get the date of the first day that they came to the gym, you can PARTITION BY customer (name/id) and ORDER BY purchase date. Then, in order to get the first row, you can simply filter for the rows with rowNumber equal to one.
This example shows how you can use ROW_NUMBER() to get the first date of when each member (user) visited.
with numbered_visits as ( SELECT memberId , visitDate , ROW_NUMBER() OVER (PARTITION BY customerId ORDER BY purchaseDate) as rowNumber FROM gym_visits ) SELECT * FROM numbered_visits WHERE rowNumber = 1
To recap, if you ever need to get the first or last record, ROW_NUMBER() is a great way to achieve that.
5) Record Ranking
DENSE_RANK() is similar to ROW_NUMBER() except that it returns the same rank for equal values. Dense ranking is quite useful when it comes to retrieving the top records, for example:
- If you want to pull the top 10 most-watched Netflix shows this week
- If you want to get the top 100 users based on dollars spent
- If you want to see the behaviour of the 1000 least active users
If you wanted to rank your top customers by total sales, DENSE_RANK() would be an appropriate function to use.
SELECT customerId , totalSales , DENSE_RANK() OVER (ORDER BY totalSales DESC) as rank FROM customers