Types of Window Functions in SQL and Questions Asked by Airbnb, Netflix, Twitter, and Uber
Window functions in SQL allow us to access data in the records before or/and after the current record. In this blog, we will learn what these window functions are, different types of window functions, and data science interview questions asked by Airbnb, Netflix, Twitter, and Uber.
What are Window Functions in SQL?
Window functions in SQL are a group of functions that performs calculations across a set of rows that are related to your current row. They are also considered as advanced SQL functions and are often asked in data science interviews. A data scientist or business analyst uses these functions on a daily basis to solve many different types of data problems.
Let’s summarize the 4 most popular types of window functions in SQL and cover the ‘why’ and ‘when’ you can use them.
4 Types of Window Functions in SQL
- Regular Aggregate Functions: (i) Aggregates like AVG, MIN/MAX, COUNT, SUM, (ii) These functions are used to aggregate your data and group it by another column like month or year
- Ranking Functions: (i) ROW_NUMBER, RANK, RANK_DENSE, (ii) These are the functions that help you rank your data. You can either rank your entire dataset or rank them by groups like by month or country, (iii) Extremely useful to generate ranking indexes within groups
- Generating Statistics: (i) These are useful if you want to generate simple statistics like NTILE (percentiles, quartiles, medians), (ii) This can be used for your entire dataset or by group
- Handling Time Series Data: (i) A very common window function in SQL especially if you need to calculate trends like a month-over-month rolling average or a growth metric, (ii) LAG and LEAD are the two functions that allow you to do this.
For a video tutorial on window functions in SQL and how to work through each example of these window functions, head over to this Youtube video:
1. Regular Aggregate Functions
Regular aggregate functions are the functions like average, count, sum, min/max that are applied to columns. The goal of these functions is to apply the aggregate function if you want to apply aggregations to different groups in the dataset, like month.
This is similar to the type of calculation that can be done with an aggregate function that you’d find in the SELECT clause, but unlike regular aggregate functions, window functions do not group several rows into a single output row, they are grouped together or retain their own identities, depending on how you find them.
Avg() Function Example:
Let’s take a look at one of the avg() window functions in SQL examples implemented to answer a data analytics question. You can view the question and write your code in the link below:
This is a perfect example of using a SQL window function and applying an avg() to a month group. In this window function example, we are trying to calculate the average distance per dollar by the month. This is hard to do in SQL without the avg() window function. We have applied the avg() window function to the 3rd column where we have found the average value for the month-year for every month-year in the dataset. We can use this metric to calculate the difference between the month and date average for each request date in the table.
Let's write the code now to implement the SQL window function:
SELECT a.request_date, a.dist_to_cost, AVG(a.dist_to_cost) OVER(PARTITION BY a.request_mnth) AS avg_dist_to_cost FROM (SELECT *, to_char(request_date::date, 'YYYY-MM') AS request_mnth, (distance_to_travel/monetary_cost) AS dist_to_cost FROM uber_request_logs) a ORDER BY request_date
2. Ranking Functions
Ranking functions are an important utility for data scientists. You always rank and index your data to better understand which rows are the best in your dataset. Window functions in SQL give you three ranking utilities - RANK(), DENSE_RANK(), ROW_NUMBER() - depending on your exact use case. These ranking functions will help you list your data in order and groups based on what you desire.
Rank() Function Example:
Now, let’s take a look at one ranking() window function example to see how we can rank data within groups using window functions in SQL. Follow along interactively with this link: https://platform.stratascratch.com/coding-question?id=9898&python=
In this example, we have to find the top salaries by department and we can’t just find the top 3 salaries without a window function because it will just give us the top 3 salaries across all departments. So, we need to rank the salaries by departments individually. This can be done by rank() and partitioned by department. From there it’s easy to filter for top 3 across all departments.
Let's write the code to output this table. You can also copy and paste in the SQL editor (link above) and see the same output.
SELECT department, salary, RANK() OVER (PARTITION BY a.department ORDER BY a.salary DESC) AS rank_id FROM (SELECT department, salary FROM twitter_employee GROUP BY department, salary ORDER BY department, salary) a ORDER BY department, salary DESC
3. Generating Statistics - NTILE
NTILE is a very useful window function for data analysts, business analysts, and data scientist. When there is a deadline with statistical data, you often need to create robust statistics such as quartile, quintile, median, decile in your daily job, and NTILE makes it easy to generate these outputs.
NTILE function takes an argument of the number of bins (or basically how many buckets you want to split your data into), and then creates this number of bins by dividing your data into that many number of bins. You set how the data is ordered and partitioned, if you want additional groupings.
In this example, we will learn how to use NTILE to categorize the data into percentiles. Follow along interactively through the link here: https://platform.stratascratch.com/coding-question?id=10303&python=
Here you have to identify the top 5 percent of claims based on a score an algorithm outputs. But you can’t just find the top 5% and do an ORDER BY because you want to find the top 5% by state. So, we have one way to do this and that is using an NTILE() ranking function and then PARTITION by the state. You can then apply a filter in the WHERE clause to get the top 5%.
Here’s the code to get output the entire table above. Use this code to paste it in the link above.
SELECT policy_num, state, claim_cost, fraud_score, percentile FROM (SELECT *, NTILE(100) OVER(PARTITION BY state ORDER BY fraud_score DESC) AS percentile FROM fraud_score) a WHERE percentile <=5
4. Handling Time Series Data
LAG and LEAD are two SQL window functions that are useful for dealing with time series data. LEAD() is similar to LAG() but the only difference between LAG and LEAD window functions is whether you want to grab from the previous rows or following rows, almost like sampling from previous data or future data.
You can use LAG and LEAD window functions to calculate month-over-month growth or rolling averages. As a data scientist and business analyst, you always need to deal with time-series data and creating the time metrics.
In this example, we have to find the percentage growth year-over-year, which is a very common question that data scientists and business analysts deal with on a daily basis. The problem statement, data, and SQL editor is in the following link, we suggest you to try to code the solution on your own: https://platform.stratascratch.com/coding-question?id=9637&python=
What is hard about this problem is the data is set up -- you need to use the previous row’s value in your metric. But SQL isn’t built to do that. SQL is built to calculate anything you want as long as the values are on the same row. So we can use the lag() or lead() window function which will take the previous or subsequent rows and put it in your current row which is what this question is doing.
Here’s the code to output this table. You can also copy and paste the following code in the SQL editor:
SELECT year, current_year_host, prev_year_host, round(((current_year_host - prev_year_host) /(cast(prev_year_host AS numeric)))*100) estimated_growth FROM (SELECT year, current_year_host, LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host FROM (SELECT extract(year FROM host_since::date) AS year, count(id) current_year_host FROM airbnb_search_details WHERE host_since IS NOT NULL GROUP BY extract(year FROM host_since::date) ORDER BY year) t1) t2
Window functions in SQL are very useful as a data scientist on your daily job and are often asked by the interviewers in interviews. These SQL window functions are used to solve rankings and calculating growth problems and these window functions in SQL make those problems much easier than if you didn’t have these functions.