Top Five SQL Window Functions for Data Science Interviews

Study smart, not hard.

SQL is the universal language of the data world and arguably the most important skill to nail down as a data professional.

SQL matters so much because it is the main skill required during the data wrangling phase. A lot of data exploration, data manipulation, pipeline development, and dashboard creation is done through SQL.

What separates great data scientists from good ones is that they can wrangle data as far as the capabilities of SQL allow. A big part of using everything SQL has to offer is knowing how to use window functions. We also recommend checking out our ultimate guide to SQL Window Functions.

Let’s dive into it!

1) Deltas (Current vs Previous)

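LEAD() and LAG() are mostly used when comparing one period of time with the previous period of time for a given metric. LAG() returns a value from an earlier row in the window, and LEAD() returns a value from a later row. To give a few examples…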

  • You can get the delta between each year’s sales and the previous year’s sales
  • You can get the delta in the number of sign-ups/conversions/website visits on a month-to-month basis
  • You can compare user churn on a monthly basis

Example:
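The following query shows how you can get the monthly percent change in costs:

with monthly_costs as (
    SELECT
        date
      , monthlycosts
        -- LAG() reads the previous row; LEAD() would read the next one
      , LAG(monthlycosts) OVER (ORDER BY date) as
        previousCosts
    FROM
        costs
)
SELECT
    date
    -- 100.0 forces decimal division; the first month has no previous row, so its result is NULL
  , 100.0 * (monthlycosts - previousCosts) / previousCosts AS
    costPercentChange
FROM monthly_costs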
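
To see which neighboring row each function reads, here is a small self-contained sketch. The three months of costs are made up for illustration, the inline UNION ALL rows stand in for a real costs table, and the nextCosts column is an addition of ours. It is written with engines such as PostgreSQL, MySQL 8+, and SQLite in mind.

```sql
-- Made-up sample data standing in for the costs table (an assumption)
WITH costs AS (
    SELECT '2023-01-01' AS date, 100 AS monthlycosts
    UNION ALL SELECT '2023-02-01', 120
    UNION ALL SELECT '2023-03-01', 90
)
SELECT
    date
  , monthlycosts
    -- value from the row before the current one (NULL for the first month)
  , LAG(monthlycosts) OVER (ORDER BY date) AS previousCosts
    -- value from the row after the current one (NULL for the last month)
  , LEAD(monthlycosts) OVER (ORDER BY date) AS nextCosts
    -- percent change against the previous month
  , 100.0 * (monthlycosts - LAG(monthlycosts) OVER (ORDER BY date))
          / LAG(monthlycosts) OVER (ORDER BY date) AS costPercentChange
FROM costs
ORDER BY date
```

With these numbers, the percent change is NULL for January (there is no previous month), 20 for February, and -25 for March.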

2) Cumulative Sums

Calculating running totals can be done simply with a window function that starts with SUM() or COUNT(). This is a powerful tool when you want to show the growth of a particular metric over time. More specifically, it’s useful in the following circumstances:

  • Get a running total of revenue and costs over time
  • Get a running total of time-spent-on-app per user
  • Get a running total of conversions over time

Example:
The following example shows how you can include a cumulative sum column of monthly costs:

SELECT
    date
  , monthlycosts
  , SUM(monthlycosts) OVER (ORDER BY date) as cumCosts
FROM
    costs
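
For the per-user case mentioned in the list above, a running total usually needs a PARTITION BY so that the sum restarts for each user. The sketch below assumes a hypothetical app_sessions table (userId, sessionDate, minutesSpent) with invented rows, and spells out the window frame explicitly rather than relying on the default.

```sql
-- Hypothetical table and made-up rows, for illustration only
WITH app_sessions AS (
    SELECT 1 AS userId, '2023-01-01' AS sessionDate, 30 AS minutesSpent
    UNION ALL SELECT 1, '2023-01-02', 45
    UNION ALL SELECT 2, '2023-01-01', 10
    UNION ALL SELECT 2, '2023-01-03', 20
)
SELECT
    userId
  , sessionDate
  , minutesSpent
    -- the running total restarts for every userId
  , SUM(minutesSpent) OVER (
        PARTITION BY userId
        ORDER BY sessionDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumMinutes
FROM app_sessions
ORDER BY userId, sessionDate
```

Here cumMinutes comes out as 30 and 75 for user 1, then 10 and 30 for user 2.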

3) Moving Averages

AVG() is really powerful in window functions because it lets you compute moving averages over time.

Moving averages are a simple, yet effective, way to forecast values in the short term. They’re also extremely useful for smoothing out volatile curves on a graph. Generally, moving averages are used to gauge the general direction in which things are moving.

More specifically…

  • They can be used to get the general trend of weekly sales (is the average going up over time?). A rising average would indicate that the company is growing.
  • They can likewise be used to get the general trend of weekly conversions or website visits.

Example:
The following query is an example of getting the 10-day moving average for conversions.

SELECT
    Date
  , dailyConversions
    -- 9 preceding rows + the current row = 10 rows (assumes one row per day)
  , AVG(dailyConversions) OVER (ORDER BY Date ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS
    movingAverage10Day
FROM
    conversions
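
A common interview follow-up is how many rows the frame actually covers. The sketch below uses a shorter 3-day window over four made-up days so the arithmetic is easy to check by hand. The sample rows and the rowsInWindow column are assumptions added for illustration, and multiplying by 1.0 is there to avoid integer averaging on engines that would otherwise truncate.

```sql
-- Made-up sample data standing in for the conversions table (an assumption)
WITH conversions AS (
    SELECT '2023-01-01' AS Date, 10 AS dailyConversions
    UNION ALL SELECT '2023-01-02', 20
    UNION ALL SELECT '2023-01-03', 30
    UNION ALL SELECT '2023-01-04', 40
)
SELECT
    Date
  , dailyConversions
    -- 2 preceding rows + the current row = a 3-row window
  , AVG(1.0 * dailyConversions) OVER (
        ORDER BY Date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS movingAverage3Day
    -- how many rows the frame actually contains for this row
  , COUNT(*) OVER (
        ORDER BY Date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rowsInWindow
FROM conversions
ORDER BY Date
```

The averages are 10, 15, 20, and 30, computed over 1, 2, 3, and 3 rows respectively, so the first rows of a moving average are based on fewer data points. Note also that ROWS counts rows, not calendar days: if a day is missing from the table, the window reaches further back in time.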

4) ROW_NUMBER()

ROW_NUMBER() is particularly useful when you want to get the first or last record. For example, if you have a table of when gym members came to the gym and you want to get the date of the first day that each member came, you can PARTITION BY member (name/id) and ORDER BY visit date. Then, to get the first row, you can simply filter for the rows with rowNumber equal to one.

Example:
This example shows how you can use ROW_NUMBER() to get the date of each member’s (user’s) first visit.

with numbered_visits as (
    SELECT
        memberId
      , visitDate
      , ROW_NUMBER() OVER (PARTITION BY memberId ORDER BY
        visitDate) as rowNumber
    FROM
        gym_visits
)
SELECT
    *
FROM
    numbered_visits
WHERE 
    rowNumber = 1

To recap, if you ever need to get the first or last record, ROW_NUMBER() is a great way to achieve that. For the last record, sort in descending order and keep the same filter on rowNumber.
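
As a sketch of the "last record" case, the query below flips the sort order to descending so that row number one is each member's most recent visit. The inline rows are made up and stand in for a real gym_visits table, and lastVisitDate is a name chosen here for illustration.

```sql
-- Made-up sample data standing in for the gym_visits table (an assumption)
WITH gym_visits AS (
    SELECT 1 AS memberId, '2023-01-05' AS visitDate
    UNION ALL SELECT 1, '2023-02-10'
    UNION ALL SELECT 2, '2023-01-20'
),
numbered_visits AS (
    SELECT
        memberId
      , visitDate
        -- DESC puts the most recent visit first within each member
      , ROW_NUMBER() OVER (PARTITION BY memberId ORDER BY visitDate DESC) AS rowNumber
    FROM
        gym_visits
)
SELECT
    memberId
  , visitDate AS lastVisitDate
FROM
    numbered_visits
WHERE
    rowNumber = 1
ORDER BY
    memberId
```

This returns 2023-02-10 for member 1 and 2023-01-20 for member 2.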

5) Record Ranking

DENSE_RANK() is similar to ROW_NUMBER() except that it returns the same rank for equal values and leaves no gaps in the ranking after a tie. Dense ranking is quite useful when it comes to retrieving the top records, for example:

  • If you want to pull the top 10 most-watched Netflix shows this week
  • If you want to get the top 100 users based on dollars spent
  • If you want to see the behaviour of the 1000 least active users

Example:
If you wanted to rank your top customers by total sales, DENSE_RANK() would be an appropriate function to use.

SELECT
    customerId
  , totalSales
  , DENSE_RANK() OVER (ORDER BY totalSales DESC) as salesRank
FROM
    customers
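
To make the difference between the ranking functions concrete, the sketch below runs ROW_NUMBER(), RANK(), and DENSE_RANK() side by side on three made-up customers, two of whom are tied. The sample rows and the column aliases are assumptions for illustration; customerId is added as a tiebreaker for ROW_NUMBER() only so that its output is deterministic.

```sql
-- Made-up sample data standing in for the customers table (an assumption)
WITH customers AS (
    SELECT 'A' AS customerId, 500 AS totalSales
    UNION ALL SELECT 'B', 500
    UNION ALL SELECT 'C', 300
)
SELECT
    customerId
  , totalSales
    -- always unique: ties are broken by customerId here
  , ROW_NUMBER() OVER (ORDER BY totalSales DESC, customerId) AS rowNumber
    -- ties share a rank, and the next rank is skipped
  , RANK() OVER (ORDER BY totalSales DESC) AS salesRank
    -- ties share a rank, and the next rank is not skipped
  , DENSE_RANK() OVER (ORDER BY totalSales DESC) AS salesDenseRank
FROM
    customers
ORDER BY
    rowNumber
```

Customers A and B both get rank 1 from RANK() and DENSE_RANK(), while customer C gets 3 from RANK() and 2 from DENSE_RANK(). One consequence is that filtering a dense rank to the top 10 can return more than 10 rows when there are ties.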
