SQL Window Functions Interview Questions by Airbnb
Solving Time Series SQL Window Functions In Data Science Interviews
SQL Window Functions Interview Questions are common in data science and analytics interviews as data scientists handle time series data on a daily basis. It's the required part of a data science job to be able to manipulate and analyse all kinds of data but time series data is a data type that's almost always part of the analysis due to the rich nature of the insights that can come from analyzing users longitudinal journey or aggregating user behavior over time ranges.
In this blog, through a SQL window function interview question, we will learn what time series data is, when we should use them, and how we can implement SQL window functions to help manage time series data.
What is Time Series Data?
Time series data is the data that we collect at different points in time or we can say these are the variables within your data that have a time component. In such data, each value has either a date value or time value, and in some cases, they have both.
Now, let's have a look at some time series data examples:
- The daily stock prices is an example of time series data because each stock price is associated with a specific day.
- Forecasting at what hour during the day there is a peak consumption of electricity.
- Unique visits to a website over a month
- Monthly sales and revenue
- Daily logins or registrations for an app or a platform
LAG and LEAD SQL Window Functions
To handle time-series data, we need to have the future date or the previous date and their associated values. To accomplish this, we need these two SQL window functions: LAG AND LEAD. LAG and LEAD window functions are immensely useful in many situations to deal with time-related data.
Difference between LAG and LEAD
A LAG() function is used to get the data from the previous rows and LEAD() function does just the opposite, it is used to get the data from the following rows.
As a data analytics professional, you should be able to use these SQL window functions efficiently as you're very likely to work on time-related data. Your efficiency in LAG() and LEAD() functions can increase your performance and productivity.
A Data Science Interview Question That Requires a SQL Window Function
Let's go through an advanced SQL window functions interview question that was asked by Airbnb. SQL window functions is a common part of interview questions as well as your daily work. So, it's very important to know how you can use them.
SQL Window Functions Interview Question
Growth of Airbnb
Estimate the growth of Airbnb each year using the number of hosts registered as the growth metric. The rate of growth is calculated by taking ((number of hosts registered in the current year - number of hosts registered in the previous year) / the number of hosts registered in the previous year) * 100.
Output the year, number of hosts in the current year, number of hosts in the previous year, and the rate of growth. Round the rate of growth to the nearest percent and order the result in the ascending order based on the year. Assume that the dataset consists only of unique hosts, meaning there are no duplicate hosts listed.
Our approach to solving this SQL window functions interview question has three steps:
1. Count the host for the current year
The first thing is to count the number of hosts by year because we need to extract the year from the date values.
SELECT extract(year FROM host_since::date) as year, count(id) as current_year_host FROM airbnb_search_details WHERE host_since IS NOT NULL GROUP BY extract(year FROM host_since::date) ORDER BY year asc
2. Count the host for the previous year - SQL Window Function LAG()
The next step is where we will use the LAG SQL window function. We have to create a view for the year, number of hosts for that current year, and then number of hosts for that previous year. We'll learn here how to use the LAG window function to find the number of hosts for the previous year.
By using a LAG function and taking the last year's value and putting it in the same row as the year's count, we'll have three columns in the view - year, current year host count, and last year host count. The LAG() window function let us easily pull the last year host count in the row. This makes it easy to implement metrics like growth rate as we have all the values we need on one row.
SELECT year, current_year_host, LAG(current_year_host, 1) OVER (ORDER BY year) as prev_year_host FROM ( SELECT extract(year FROM host_since::date) as year, count(id) as current_year_host FROM airbnb_search_details WHERE host_since IS NOT NULL GROUP BY extract(year FROM host_since::date) ORDER BY year asc) a
3. Implement the growth metric
Now, we'll apply the growth rate calculation:
round(((current_year_host - prev_year_host)/prev_year_host::FLOAT*100))
SELECT year, current_year_host, prev_year_host, round(((current_year_host - prev_year_host)/prev_year_host:: FLOAT*100)) as rate_of_growth FROM ( SELECT year, current_year_host, LAG(current_year_host, 1) OVER (ORDER BY year) as prev_year_host FROM ( SELECT extract(year FROM host_since::date) as year, count(id) as current_year_host FROM airbnb_search_details WHERE host_since IS NOT NULL GROUP BY extract(year FROM host_since::date) ORDER BY year asc) a ) b
Run that in the SQL IDE and see what you get in the output.
This was one of extremely useful and common SQL window functions interview questions that you'd get in an interview or a question that you'd try to solve for on your day job. Specifically LAG and LEAD functions allows you to make your time series data manipulation much easier and allows you to easily solve problems like finding growth rates or year-over-year rolling averages. Such problems are dealt with every day by data scientists and business analysts.