The Ultimate Guide to SQL Window Functions

SQL Window Functions Guide
  • Author Avatar
    Written by:

    Tihomir Babic

This is a complete overview of the SQL window functions. Learn everything you need to know about them to excel at your job interview and an actual job.

SQL window functions commonly feature in data scientists' and analysts' day-to-day work. They use them for complex calculations that are not possible with other ‘regular’ SQL functions.

Because of that, interviewers at most companies expect job candidates to use window functions in their SQL code whenever possible.

Before doing that, you need to understand the basics of window functions.

What Are SQL Window Functions?

Formally speaking, window functions use values from multiple rows to calculate values for each row separately. In other words, they perform calculations on the set of rows related to the current row. That set of rows is known as a ‘window’, hence the name ‘window functions’.

This might sound like aggregate functions, but there is one significant difference. Window functions, unlike aggregate functions (SUM(), AVG(), etc.), keep individual rows while displaying calculated values.

There are several important aspects of SQL window functions you need to understand when writing a window function.

What Are SQL Window Functions

Why SQL Window Functions Matter in Interviews

Here’s why: window functions are one of the most common ways interviewers separate junior candidates from senior ones.

Technically, you don’t need window functions. Most SQL problems can be solved by using subqueries, self-joins, and GROUP BY. But with window functions, you solve the same problems more elegantly and in fewer code lines. That’s what interviewers are looking for. Anyone can write working but bloated SQL code that returns the expected result.

But if you know when to use window functions and which ones to use, without the interviewer explicitly asking for them? That shows the interviewer you’ve moved beyond simply memorizing the window function syntax and started thinking in patterns. You know down to the smallest detail what a particular window function does and why it’s suitable for a certain pattern.

The 6 SQL Window Function Patterns You Need for Interviews

SQL Window Function Patterns You Need for Interviews

I’ll now show you how these work in practice by solving scenario-based interview questions.

Interview Pattern #1: Find the Latest Row per User

The question from Amazon and Etsy tasks you with finding the callers whose first and last calls were to the same person on a given day.

Last Updated: October 2022

MediumID 2132

Given a phone log table that has information about callers' call history, find out the callers whose first and last calls were to the same person on a given day. Output the caller ID, recipient ID, and the date called.

Go to the Question

Dataset

There’s a table named caller_history.

Table: caller_history
Loading Dataset

As you can see, the grain level is a call.

Solution

1. Finding the first and the last call

The most straightforward window-function approach is to use FIRST_VALUE(). As you probably suspected, that’s a window function that returns the first value – i.e., the first row – within the window. 

I’ll use it to get the first call and also to get the last call, by sorting the window in descending order. 

(You might ask why not use LAST_VALUE() in the latter case. Be patient, we’ll come to that. But trust me when I say that FIRST_VALUE() is your safest option.)

The data is partitioned by caller ID and call date. I apply DATE() because without it, each call would be a separate partition, since they are timestamps. 

WITH first_and_last AS
  (SELECT *,
          FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called) ORDER BY date_called ASC) AS first_call,
          FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE (date_called) ORDER BY date_called DESC) AS last_call
FROM caller_history)

Be aware that window functions keep all the individual rows. So, the above SQL CTE would produce rows for every call made on a given day.

Let’s query it with a simple SELECT, so you know what I mean.

WITH first_and_last AS
  (SELECT *,
          FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE(date_called) ORDER BY date_called ASC) AS first_call,
          FIRST_VALUE(recipient_id) OVER(PARTITION BY caller_id, DATE (date_called) ORDER BY date_called DESC) AS last_call
FROM caller_history)

SELECT *
FROM first_and_last;

Here’s the output.

caller_idrecipient_iddate_calledfirst_calllast_call
122022-01-01 09:00:0024
132022-01-01 17:00:0024
142022-01-01 23:00:0024
252022-07-05 09:00:0053
252022-07-05 17:00:0053
232022-07-05 23:00:0053
252022-07-06 17:00:0055
232022-08-01 09:00:0033
232022-08-01 17:00:0033
242022-08-02 09:00:0044
252022-08-02 10:00:0044
242022-08-02 11:00:0044

2. Finding the first call by the caller

As you can see, we’re still on an analytical level. What we need to do here is find a unique combination of the columns required by the question: caller_id, recipient_id, and date_called. We again use DATE() on the date_called, because that’s also a requirement.

In the WHERE clause, we keep only the callers whose first call was to the same recipient as the last call.

PostgreSQL

Run the code above to see the output.

The Dangers of Using LAST_VALUE()

The thing is that, when you add ORDER BY to a window function, SQL quietly applies a default frame: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that the window is all the rows before the current row + the current row.

That’s fine when you use FIRST_VALUE(); the window frame includes all the preceding rows, so the first value really is the first value of the partition.

But for LAST_VALUE() – with that default window definition – the last value is actually the current row.

That’s where you can easily fail your interview, because you think just replacing FIRST_VALUE() with LAST_VALUE() to find the last call, then sorting the partition in the opposite (ascending) order, will do the trick.

PostgreSQL

No, it won’t. You can run the code above and see that the solution is incorrect.

If you really insist on using LAST_VALUE() – which I don’t recommend, as it’s easy to forget about the window frame – then you should explicitly define the window frame as UNBOUNDED FOLLOWING`. (More about window frames in the section about Interview Pattern #5.)

WITH first_and_last AS (
    SELECT *,
           FIRST_VALUE(recipient_id) OVER (PARTITION BY caller_id, DATE(date_called) ORDER BY date_called ASC) AS first_call,
           LAST_VALUE(recipient_id) OVER (PARTITION BY caller_id, DATE(date_called) ORDER BY date_called ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_call
    FROM caller_history
)

SELECT DISTINCT caller_id,
                first_call AS recipient,
                DATE(date_called)
FROM first_and_last
WHERE first_call = last_call;

Safer Alternative: ROW_NUMBER()

Another way of solving this problem is to use ROW_NUMBER() instead of FIRST_VALUE().

While FIRST_VALUE() is a more elegant solution, you should also know how to use ROW_NUMBER() in this pattern. And when. Which is when you need the full row, need to handle ties deliberately, or need to operate on the first and last groups as separate datasets.

Here’s the interview question solution rewritten with ROW_NUMBER().

This window function assigns unique integers to each row within a partition. You rank them from the first to the last call, then from the last to the first call.

Then, in separate CTEs, keep only rows where the row number is 1; those are your first and last calls. 

Finally, join those two CTEs on the caller_id, call_date, and recipient_id, and there you have it.

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY caller_id, DATE(date_called)
                              ORDER BY date_called ASC)  AS rn_first,
           ROW_NUMBER() OVER (PARTITION BY caller_id, DATE(date_called)
                              ORDER BY date_called DESC) AS rn_last
    FROM caller_history
),
first_calls AS (
    SELECT caller_id, recipient_id, DATE(date_called) AS call_date
    FROM ranked
    WHERE rn_first = 1
),
last_calls AS (
    SELECT caller_id, recipient_id, DATE(date_called) AS call_date
    FROM ranked
    WHERE rn_last = 1
)
SELECT f.caller_id,
       f.recipient_id,
       f.call_date
FROM first_calls f
JOIN last_calls l
    ON  f.caller_id  = l.caller_id
    AND f.call_date  = l.call_date
    AND f.recipient_id = l.recipient_id;

Interview Pattern #2: Rank Customers, Products, or Events

Ranking is another popular interview pattern and typically calls for using one of these functions.

SQL Window Ranking Functions

As for the practical use, let’s solve this interview question from the City of San Francisco.

We need to find the top two highest-paid City employees for each job title, ranked by total pay and benefits. The output should include the job title and the names of the highest- and second-highest-paid employees.

HardID 9989

Find the top 2 highest paid City employees for each job title. Use totalpaybenefits column for their ranking. Output the job title along with the corresponding highest and second-highest paid employees.

Go to the Question

Dataset

We’re working with the table named sf_public_salaries.

Table: sf_public_salaries
Loading Dataset

Solution

1. Ranking employees on their total pay and benefits

We must use ROW_NUMBER() in the solution. Why? We now know that it assigns unique integers, even if there are rows with the same value, i.e., duplicates. 

OK, but the question never explicitly says which ranking window function to use. Of course it doesn’t; it’s an interview question! Do you think your boss will hold your hand and tell you which function to use? No, ‘course not! You have to decide on your own. 

The key to this interview question is in the expectation that exactly one employee lands in position 1 and exactly one in position 2. Different positions, even if they have the same salary + benefits? That sounds like a job for ROW_NUMBER().

We use it to query employees, their job titles and total benefits, and to rank them from highest to lowest total benefits. The ranking is for each job title separately; that’s why I partitioned the data. 

SELECT employeename,
       jobtitle,
       totalpaybenefits,
       ROW_NUMBER() OVER (PARTITION BY jobtitle ORDER BY totalpaybenefits DESC) AS pos
FROM sf_public_salaries

As a standalone query, it gives you this output. You can see that employees within each job title are ranked separately.

employeenamejobtitletotalpaybenefitspos
PATRICIA JACKSONCAPTAIN III (POLICE DEPARTMENT)297608.921
ANNA BROWNCAPTAIN III (POLICE DEPARTMENT)238551.882
DOUGLAS MCEACHERNCAPTAIN III (POLICE DEPARTMENT)196494.143
JOHN LOFTUSCAPTAIN III (POLICE DEPARTMENT)192951.374
TERESA BARRETTCAPTAIN III (POLICE DEPARTMENT)192914.55
Lance A Obtinalla JrDeputy Sheriff137519.281
Arnel MarachaDeputy Sheriff137235.942
Patrick T TruongDeputy Sheriff136478.653
Yvette R GayDeputy Sheriff135221.294
Petra HahnDeputy Sheriff135073.935
Donald V OrtizDeputy Sheriff134987.816
Theresa CourtneyDeputy Sheriff133147.677
Stephen KendallDeputy Sheriff131854.18
Patrick CraneDeputy Sheriff129599.949
Rizaldy T TabadaDeputy Sheriff129071.1610
Ceasar GarciaDeputy Sheriff127283.3511
Otha CottonDeputy Sheriff126856.8412
Patrick T TruongDeputy Sheriff125480.9513
Aaron SchmidtDeputy Sheriff124012.6314
Sarah SilvaDeputy Sheriff123079.6715
Richard KendallDeputy Sheriff122344.816
Robert J LobergDeputy Sheriff119079.5917
Brian E OconnorDeputy Sheriff11175.9118
Kenia C CoronadoEligibility Worker55429.491
Tretha T StroughterEligibility Worker38346.822
Mary R CarrEligibility Worker37903.053
Barry HyunEligibility Worker35101.864
Arturo GalarzaEligibility Worker32276.855
Daniel Phillip BoutoteEligibility Worker24378.446
Heather E GutierresEligibility Worker6339.277
Mandisa MabreyEligibility Worker3572.428
Melinda KuochEligibility Worker2439.069
Claire M LefloreEligibility Worker2013.9410
Shickola RicksEligibility Worker1634.6911
Alexander M LamondEMT/Paramedic/Firefighter139818.11
Teresa L CavanaughEMT/Paramedic/Firefighter139524.662
Jared F CooperEMT/Paramedic/Firefighter132354.453
Michael MasonEMT/Paramedic/Firefighter131679.74
Graham P HoffmanEMT/Paramedic/Firefighter128504.145
Thomas RoEMT/Paramedic/Firefighter125213.26
Ryan J JamisonEMT/Paramedic/Firefighter117370.717
Jason L LandivarEMT/Paramedic/Firefighter116936.658
Patrick S RenshawEMT/Paramedic/Firefighter114467.019
James LockhartEMT/Paramedic/Firefighter79674.5510
Nicole J Lafata SharkEMT/Paramedic/Firefighter56376.8611
Christina A CouchEMT/Paramedic/Firefighter48397.6912
Neal K RodilEMT/Paramedic/Firefighter47402.3813
Emily AndersonEMT/Paramedic/Firefighter45310.0114
Sherry MahoneyEMT/Paramedic/Firefighter43766.8915
Seaborn ChilesEMT/Paramedic/Firefighter43628.5616
Andrew G ChenEstate Investigator113849.011
Grace LinEstate Investigator113066.82
Patrick MartinezEstate Investigator110648.973
Denise AlexanderEstate Investigator92145.74
Gregory B BovoFirefighter33739.711
Ryan C CrowFirefighter33739.712
Ernest E HaylesFirefighter15752.323
Elizabeth M LeahyFirefighter1886.384
Kari A JohnsonFirefighter832.15
NATHANIEL FORDGENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY567595.431
EDWARD REISKINGENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY230827.122
Dennis J GerbinoIS Programmer Analyst96972.321
Clark BellIS Programmer Analyst-Senior95399.51
Melinda WongJunior Administrative Analyst81039.651
Natalia Y Castillo VillegasJunior Administrative Analyst80020.192
Francis J MonsadaJunior Administrative Analyst79948.23
LEA YAMAGATALIBRARY PAGE17064.421
Nina DavisPS Aide Health Services9694.341
Ray TorresPublic Service Trainee1307.051
Alondra Correa AlmanzaPublic Service Trainee1117.72
Jack O'SullivanPublic Service Trainee1079.423
Eric J ChenPublic Service Trainee844.144
Patrick F McpartlandPublic Service Trainee506.145
Darrel A LachapellePublic Service Trainee323.26
Sophia WuPublic Service Trainee298.367
Jimmy TangPublic Service Trainee184.418
Pecola JonesPublic Svc Aide-Public Works4972.341
Roy HillPublic Svc Aide-Public Works2684.012
Alicia BrownPublic Svc Aide-Public Works2597.323
Kevin ClarkPublic Svc Aide-Public Works2542.054
Antionette L RileyPublic Svc Aide-Public Works2420.185
Clarence WalkerPublic Svc Aide-Public Works2375.726
Maiza PadillaPublic Svc Aide-Public Works2371.017
Willie CollinsPublic Svc Aide-Public Works2363.918
Mario ArmbristerPublic Svc Aide-Public Works2250.069
Jesus J ArominPublic Svc Aide-Public Works2144.710
Alan K TolbertPublic Svc Aide-Public Works2090.9911
Orlando M GonzalezPublic Svc Aide-Public Works2006.9612
Fernando BarajasPublic Svc Aide-Public Works1763.6713
Arnitial DonelyPublic Svc Aide-Public Works1623.1914
Ivan Y CastilloPublic Svc Aide-Public Works1613.8815
David Ayers JrPublic Svc Aide-Public Works1560.1116
Iesha M JonesPublic Svc Aide-Public Works1558.417
Glenn DanielsPublic Svc Aide-Public Works1138.818
Jose F GranadosPublic Svc Aide-Public Works667.8119
Marcus DutyPublic Svc Aide-Public Works578.1920
Silas D Moultrie JrPublic Svc Aide-Public Works322.2121
Andre ThomasPublic Svc Aide-Public Works172.1322
Brighton M LeungPublic Svc Aide-Public Works108.4323
Sara ParedesRegistered Nurse89513.971
Emilia C PatrickRegistered Nurse87357.272
Kathryn L FowlerRegistered Nurse86194.273
Raffaella V WilsonRegistered Nurse78763.724
Adina M DiamondRegistered Nurse77879.375
Mary C HustonRegistered Nurse77834.886
Karen M GomezRegistered Nurse68013.877
Ying Ying HuiRegistered Nurse60078.48
Christopher WilcoxRegistered Nurse49366.559
Lilly FungRegistered Nurse38692.4510
Beverly BagdorfRegistered Nurse32809.6511
Mo Ching WanRegistered Nurse29364.7712
Jeanne M D'ArcyRegistered Nurse29212.9513
Audrey NgoRegistered Nurse23237.1614
Margarita HerreraRegistered Nurse22320.1315
Liane AngusRegistered Nurse11499.0416
Robert MartinezRegistered Nurse9969.1117
Joann G SiobalRegistered Nurse7959.1818
Leanne Abrigo JohnsonRegistered Nurse4891.0719
Linda PizzornoRegistered Nurse1025.6520
Renato C GurionRegistered Nurse7.2421
Frankie JohnsonSenior Eligibility Worker100795.021
YEVA JOHNSONSENIOR PHYSICIAN SPECIALIST178760.581
Mark F ObrochtaSergeant 3195887.061
Kathryn WaalandSergeant 3189864.252
Lawrence ChanSergeant 3189689.573
Joseph M SalazarSergeant 3186630.424
Melonee AlvarezSergeant 3186535.525
Steven StockerSergeant 3186479.796
Omar BuenoSergeant 3185627.057
Patrick KennedySergeant 3177603.948
Ernanie RasqueroSpecial Nurse2739.941
Paul J OrtizSpecial Nurse2463.162
Elizabeth J DayritSpecial Nurse2334.23
Emilia C PatrickSpecial Nurse2198.094
Yolanda RamirezSpecial Nurse2129.095
Ayala MirandeSpecial Nurse1743.186
David FlemingSpecial Nurse1729.847
Jamie J DwyerSpecial Nurse1673.388
Jessica H LeeSpecial Nurse1517.169
Russell Patrick MangahasSpecial Nurse1431.5310
Justina DizonSpecial Nurse1368.0611
Juliet PalarcaSpecial Nurse1265.3212
John DonnellySpecial Nurse1232.3213
Alma Rosa GarciaSpecial Nurse1069.0414
David A FlemingSpecial Nurse777.1815
Reuben ReyesSpecial Nurse726.5316
Carla R GreenblattSpecial Nurse652.217
Catheryn WilliamsSpecial Nurse474.5418
Graciela M ArevaloSpecial Nurse350.3919
Connie Love-MilesSpecial Nurse332.2320
Blesilda P HuypungcoSpecial Nurse137.9421
Vanessa E AlmaguerSpecial Nurse88.1822
Grace SaludSpecial Nurse46.2723
Merter BozkurtTransit Operator43295.641
Matthew LuTransit Operator40864.942
Terry L JohnsonTransit Operator38919.753
Sedrick M McarthurTransit Operator36883.764
Alvin SosaTransit Operator31625.885
Pelzie L SmithTransit Operator28403.956
Linda EdwardsTransit Operator28298.637
Dhakir R ZakiTransit Operator27572.328
Miguel J Gonzalez JrTransit Operator25922.19
Wallina C PelletteTransit Operator24211.8110
Nicole FreeneyTransit Operator22228.3911
Annette KessTransit Operator19767.1812
Emmanuel R BorjaTransit Operator1911013
Victor N YeungTransit Operator18175.2814
Tim C GhigliazzaTransit Operator17821.3515
Tara A AmadoTransit Operator17271.7516
Niem TranTransit Operator14982.8417
George L Abrams JrTransit Operator1371318
Marie E MonsenTransit Operator13544.0319
Luana D Beavers-DeloachTransit Operator13142.6520
Larry J DavisTransit Operator13036.0521
Terri L MathisTransit Operator12023.922
Michaela T WomackTransit Operator11943.2923
Willie Daigle JrTransit Operator11916.0824
Monique C JacobsTransit Operator11808.5125
Samer BouriTransit Operator8794.5326
Ben D MaloneTransit Operator8681.4227
Carolyn E RobinsonTransit Operator8675.2128
Lawrence C Blakes JrTransit Operator8562.4429
Jarrett L LouieTransit Operator8554.7130
Robert E Lee JrTransit Operator8505.1131
John HuntTransit Operator8493.2732
George FudgeTransit Operator8308.3133
Ira NewmanTransit Operator8074.934
Tracy Y HigginsTransit Operator8014.3435
Georgina M PinedaTransit Operator7959.1836
Gwen FerdinandTransit Operator7751.4737
Jacquelyn VassarTransit Operator6953.8338
Wing C KwanTransit Operator5617.4339
Lenora HamiltonTransit Operator3993.8740
Kennis GrantTransit Operator3618.6641
Vivian A ShowersTransit Operator3038.4442
Darryl L ArmstrongTransit Operator2500.8643
Sen Cheong LaiTransit Operator1164.9144
Iris J LettTransit Operator1150.7145
Joannie KeysTransit Operator1127.2246
Maria E ZunigaTransit Operator114.5447

Scroll down a little until you reach firefighters. You’ll see ROW_NUMBER() doing its magic: Gregory B Bovo and Ryan C Crow – those must be real names! – have the same total compensation, but they have different ranks. How does it decide which gets which rank? It’s up to the database engine to decide, meaning ROW_NUMBER() is non-deterministic. Yes, run the query the second time, and you might get different results. For this task, it doesn’t really matter. It seems our engine decided to rank alphabetically when total benefits were the same; that’s perfectly fine.

2. Selecting only employees ranked as 1 and 2

In the next step, we turn the above query into a subquery. The outer query uses it to select only the first- and second-ranked employees, then uses CASE WHEN to assign a separate column to each employee. I named those columns primus and secundum. Why not flex your Latin?

SELECT jobtitle,
       CASE
           WHEN pos = 1 THEN employeename
           ELSE NULL
       END AS primus,
       CASE
           WHEN pos = 2 THEN employeename
           ELSE NULL
       END AS secundum
FROM
   (SELECT employeename,
           jobtitle,
           totalpaybenefits,
           ROW_NUMBER() OVER (PARTITION BY jobtitle ORDER BY totalpaybenefits DESC) AS pos
      FROM sf_public_salaries) one
WHERE pos <= 2

Here’s what the output looks like, so you get the feeling of what we’re doing here. That output is basically it; all the info we need is there. Only, we have to make it more readable, as there’s no need to have duplicate job title rows.

jobtitleprimussecundum
CAPTAIN III (POLICE DEPARTMENT)PATRICIA JACKSON
CAPTAIN III (POLICE DEPARTMENT)ANNA BROWN
Deputy SheriffLance A Obtinalla Jr
Deputy SheriffArnel Maracha
Eligibility WorkerKenia C Coronado
Eligibility WorkerTretha T Stroughter
EMT/Paramedic/FirefighterAlexander M Lamond
EMT/Paramedic/FirefighterTeresa L Cavanaugh
Estate InvestigatorAndrew G Chen
Estate InvestigatorGrace Lin
FirefighterGregory B Bovo
FirefighterRyan C Crow
GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITYNATHANIEL FORD
GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITYEDWARD REISKIN
IS Programmer AnalystDennis J Gerbino
IS Programmer Analyst-SeniorClark Bell
Junior Administrative AnalystMelinda Wong
Junior Administrative AnalystNatalia Y Castillo Villegas
LIBRARY PAGELEA YAMAGATA
PS Aide Health ServicesNina Davis
Public Service TraineeRay Torres
Public Service TraineeAlondra Correa Almanza
Public Svc Aide-Public WorksPecola Jones
Public Svc Aide-Public WorksRoy Hill
Registered NurseSara Paredes
Registered NurseEmilia C Patrick
Senior Eligibility WorkerFrankie Johnson
SENIOR PHYSICIAN SPECIALISTYEVA JOHNSON
Sergeant 3Mark F Obrochta
Sergeant 3Kathryn Waaland
Special NurseErnanie Rasquero
Special NursePaul J Ortiz
Transit OperatorMerter Bozkurt
Transit OperatorMatthew Lu

3. Grouping the data and finding the first- and second-ranked employees

Now, the final step: all the previous code becomes another subquery, embedded in an outer query. I use MAX() and GROUP BY to collapse two rows into a single row. Don’t worry, MAX() doesn’t perform any meaningful “maximum” calculation here. It’s just a trick I use to extract the single non-NULL value per group; MIN() would work, too. 

Here’s the whole code.

PostgreSQL
Go to the question on the platformTables: sf_public_salaries

Run the code above to see the output.

Why Do RANK() & DENSE_RANK() Fail Here?

Let’s go back to our first subquery, the one where all the ranking occurs. I’ve changed it slightly by adding RANK() and DENSE_RANK() and keeping only the firefighter job title (only for educational purposes); that’s where there are employees with the same total salary.

SELECT employeename,
       jobtitle,
       totalpaybenefits,
       ROW_NUMBER() OVER (PARTITION BY jobtitle ORDER BY totalpaybenefits DESC) AS row_number_pos,
       RANK() OVER (PARTITION BY jobtitle ORDER BY totalpaybenefits DESC) AS rank_pos,
       DENSE_RANK() OVER (PARTITION BY jobtitle ORDER BY totalpaybenefits DESC) AS dense_rank_pos   
FROM sf_public_salaries
WHERE jobtitle = 'Firefighter'

Here is what you get.

employeenamejobtitletotalpaybenefitsrow_number_posrank_posdense_rank_pos
Gregory B BovoFirefighter33739.71111
Ryan C CrowFirefighter33739.71211
Ernest E HaylesFirefighter15752.32332
Elizabeth M LeahyFirefighter1886.38443
Kari A JohnsonFirefighter832.1554

With RANK(), those two employees with the same total salary will both be ranked as 1. Even if we pick the correct first-ranked employee, this method doesn’t assign a rank 2, so the second_best column in the final code will be empty. That’s, of course, incorrect. 

As for DENSE_RANK(), the first part of the problem is the same; both employees are ranked as 1. However, there is an employee ranked as 2, which is great. But, it’s Ernest E Hayles, which is, again, incorrect.

You see where I’m going with this? Let’s see the final code with either method, and let’s keep only firefighters.

Here’s RANK()

SELECT jobtitle,
       MAX(primus) AS best,
       MAX(secundum) AS second_best
FROM
  (SELECT jobtitle,
          CASE
              WHEN pos = 1 THEN employeename
              ELSE NULL
          END AS primus,
          CASE
              WHEN pos = 2 THEN employeename
              ELSE NULL
          END AS secundum
   FROM
     (SELECT employeename,
             jobtitle,
             totalpaybenefits,
             RANK() OVER (PARTITION BY jobtitle
                                ORDER BY totalpaybenefits DESC) AS pos
                                
      FROM sf_public_salaries
      WHERE jobtitle = 'Firefighter') one
   WHERE pos <= 2) two
GROUP BY jobtitle;

As I predicted, the second_best column is empty. This is why using RANK() here would make you fail the interview. 

jobtitlebestsecond_best
FirefighterRyan C Crow

Here’s the same thing, but with DENSE_RANK().

SELECT jobtitle,
       MAX(primus) AS best,
       MAX(secundum) AS second_best
FROM
  (SELECT jobtitle,
          CASE
              WHEN pos = 1 THEN employeename
              ELSE NULL
          END AS primus,
          CASE
              WHEN pos = 2 THEN employeename
              ELSE NULL
          END AS secundum
   FROM
     (SELECT employeename,
             jobtitle,
             totalpaybenefits,
             DENSE_RANK() OVER (PARTITION BY jobtitle
                                ORDER BY totalpaybenefits DESC) AS pos
                                
      FROM sf_public_salaries
      WHERE jobtitle = 'Firefighter') one
   WHERE pos <= 2) two
GROUP BY jobtitle;

You can see in the output that the best and the second-best employees are incorrect. Gregory B Bovo and Ryan C Crow should be there. Instead, we got Ernest E Hayles. That’s the reason why you can’t use DENSE_RANK(), either. 

jobtitlebestsecond_best
FirefighterRyan C CrowErnest E Hayles

Interview Pattern #3: Calculate Running Totals and Cumulative Metrics

Running totals and cumulative metrics in general are calculated using the SUM() window function with the ORDER BY clause in OVER().

SQL Window Functions Interview Pattern

I’ll demonstrate this with the interview question from Goldman Sachs and Deloitte.

Last Updated: December 2021

HardID 2082

You are given a day worth of scheduled departure and arrival times of trains at one train station. One platform can only accommodate one train from the beginning of the minute it's scheduled to arrive until the end of the minute it's scheduled to depart. Find the minimum number of platforms necessary to accommodate the entire scheduled traffic.

Go to the Question

You’re given one day’s data about trains’ departures and arrivals at one train station. You need to find the minimum number of platforms necessary to accommodate the entire scheduled traffic. All that knowing that one platform can only accommodate one train at a time, i.e., from its arrival until its departure time.

Dataset

There are two tables; the first one is train_arrivals.

Table: train_arrivals
Loading Dataset

The second table is, as expected, train_departures.

Table: train_departures
Loading Dataset

Solution

In the CTE, we chain two SELECT statements: the first one marks the train’s arrival as 1, the second one marks its departure as -1. With UNION ALL, we merge the results of both statements into a single dataset.

1. Combine arrivals and departures in one dataset

The result is sorted by time in ascending order and by mark in descending order; in case of ties, arrivals are considered before departures.

WITH timetable AS(
    (SELECT train_id,
            arrival_time AS time,
            1 AS mark
     FROM train_arrivals
     
     UNION ALL 
     
     SELECT train_id,
            departure_time AS time,
            -1 AS mark
     FROM train_departures
     )
    ORDER BY time ASC, mark DESC),

Here’s what the data looks like now.

train_idtimemark
108:001
208:051
308:051
408:101
508:101
208:10-1
108:15-1
508:20-1
308:20-1
408:25-1
612:151
712:201
812:251
712:25-1
812:30-1
613:00-1
1015:001
1115:001
915:001
915:05-1
1215:061
1015:10-1
1115:15-1
1215:15-1
1320:001
1420:101
1320:15-1
1420:15-1

2. Calculating the cumulative sum

Next, we add another CTE where we calculate the running totals. In this case, it’s not cumulative revenue, but a cumulative number of trains waiting at the station at the same time. This will give us the answer to the interview question.

The cumulative is calculated using the SUM() window function and summing marks (i.e., departures and arrivals) from the earliest to the latest time. This is the core principle of cumulative metrics. You can’t sum just any random time periods one after the other; the time data has to be sorted in ascending order.

In addition, we sort the window by mark in descending order. This is to handle the tie-breaking edge case, when a train arrives, and another train departs at the exact same time.

Let’s clarify this. Here’s what happens in the edge case if we first process the departure.

SQL Window Functions Interview Pattern

This says that we need only one platform to accommodate these two trains. But that doesn’t make sense, because the first train hadn’t yet left when the second train should have already arrived. We know that one platform can accommodate only one train.

Therefore, we have to process arrivals first, which will increase the cumulative number of trains like this.

SQL Window Functions Interview Pattern

The cumulative sum peaks at two. That’s correct, we need two platforms for those two trains.

So, back to our query. By adding the cumsum CTE…

cumsum AS
  (SELECT *,
          SUM(mark) OVER (ORDER BY time ASC, mark DESC) AS trains_at_same_time
   FROM timetable
   )

…we get this output. It’s an overview of the number of platforms for each train’s departure and arrival.

train_idtimemarktrains_at_same_time
108:0011
208:0513
308:0513
408:1015
508:1015
208:10-14
108:15-13
508:20-11
308:20-11
408:25-10
612:1511
712:2012
812:2513
712:25-12
812:30-11
613:00-10
1015:0013
1115:0013
915:0013
915:05-12
1215:0613
1015:10-12
1115:15-10
1215:15-10
1320:0011
1420:1012
1320:15-10
1420:15-10

3. Aggregation and the final output

The only thing remaining is to aggregate that into a single number using MAX() to find the maximum cumulative value, which is – I know this might sound contradictory – the minimum number of platforms necessary to accommodate the entire scheduled traffic.

So, here’s the complete solution.

PostgreSQL
Go to the question on the platformTables: train_arrivals, train_departures

Here’s the output.

min_platforms
5

Interview Pattern #4: Compare the Current Row to the Previous or Next Row

For this pattern, the two main window functions you’ll need are LAG() and LEAD().

LAG and LEAD in SQL Window Functions

This question by ESPN is a great example of this pattern in practice.

HardID 9957

Find how the average male height changed between each Olympics from 1896 to 2016. Output the Olympics year, average height, previous average height, and the corresponding average height difference. Order records by the year in ascending order.

If avg height for some year is not found, assume that the average height of athletes for that year is 172.73.

Go to the Question

You need to find how the average male height changed between the Olympics from 1896 to 2016. The output should consist of the Olympics year, the average height, the previous Olympics’ average height, and the height difference between the two years.

The question also states that the average height is 172.73 if there’s no average height for that year.

Dataset

You’re given the olympics_athletes_events table.

Table: olympics_athletes_events
Loading Dataset

Solution

How would you solve this problem? By reading the question’s requirements, you should notice that we’re comparing the current year with the previous year. That should signal to you that LAG() is appropriate.

Let’s approach this systematically. 

The only non-calculated column that’s required in the output is year, so we select it. 

Next, we calculate the average height and round the result to two decimal places. 

Now, for each Olympics, we have to access the previous Olympics’ calculated average height. This is what we want to access, therefore AVG(height)::NUMERIC in LAG(). This accesses the previous row; to ensure that the previous row is actually the previous Olympics, we sorted the data within the window from the earliest to the latest year, i.e., ORDER BY year.

In addition, we use COALESCE() to ensure that the average of 172.73 is imputed in case there’s no height data for that year, and, again, round everything to two decimal places.

The third calculation is a simple subtraction of the second calculation from the first one, i.e., the previous Olympics’ average height from the current Olympics’ average height. 

The final touch includes selecting only male athletes in WHERE whose height is not NULL, grouping by year, and sorting the output by year in ascending order.

PostgreSQL
Go to the question on the platformTables: olympics_athletes_events

Run the code above to see the output.

What About LEAD()?

I won’t spend much time on this. Once you know LAG(), you know LEAD(), and vice versa. 

You know that LEAD() accesses the following row’s value. So, you can rewrite the code above by replacing LAG() with LEAD() and simply reversing the ORDER BY direction in OVER().

SELECT year,
       ROUND(AVG(height)::NUMERIC, 2) AS avg_height,
       ROUND(COALESCE(LEAD(AVG(height)::NUMERIC) OVER (ORDER BY year DESC), 172.73), 2) AS prev_avg_height,
       ROUND(ROUND(AVG(height)::NUMERIC, 2) - COALESCE(LEAD(AVG(height)::NUMERIC) OVER (ORDER BY year DESC), 172.73), 2) AS avg_height_diff
FROM olympics_athletes_events
WHERE sex = 'M'
  AND height IS NOT NULL
GROUP BY year
ORDER BY year;

The Old School Corner: How Data Scientists Wrote This in 2008

Do you think LAG() and LEAD() seem quite complex? If you do, I’ll change your mind in a couple of seconds. 

Before window functions became widely available in major databases, old-school data scientists solved that problem by using a self join.

Here’s what that code would’ve looked like in the olden days, when saber-toothed tigers roamed the Earth and window functions were even further away than a junior analyst’s first promotion. 

WITH agg AS (
    SELECT year, AVG(height) AS avg_height
    FROM olympics_athletes_events
    WHERE sex = 'M' AND height IS NOT NULL
    GROUP BY year
)
SELECT
    curr.year,
    ROUND(curr.avg_height::NUMERIC, 2) AS avg_height,
    ROUND(COALESCE(prev.avg_height::NUMERIC, 172.73), 2) AS prev_avg_height,
    ROUND(ROUND(curr.avg_height::NUMERIC, 2) - ROUND(COALESCE(prev.avg_height::NUMERIC, 172.73), 2), 2) AS avg_height_diff
FROM agg curr
LEFT JOIN agg prev
    ON prev.year = (
        SELECT MAX(year)
        FROM agg
        WHERE year < curr.year
    )
ORDER BY curr.year;

Now, compare this to the LAG() solution. What do you say now? I think you’ll be thankful now for the privilege of living in the age of window functions.

Interview Pattern #5: Rolling Averages and Time-Window Analysis

This pattern is used to smooth out noise, making it great for revealing the underlying trend. You can often find it in analyzing revenue trends, DAU/MAU, and product usage metrics

The window function used here is AVG(). Unlike in a simple average, a rolling or moving average doesn’t take into account all data points, but only a specified number of the preceding and/or following data points. Using preceding data points is more common. A centered moving average – the one that uses the current and the following rows – is less common, as those following rows might not exist yet; moving averages are typically used on live or recent data.  

You could calculate the 3-day, 7-day, 30-day, 12-month averages, or whatever variation you desire.

For example, a 7-day rolling average includes the 6 previous days + the current day = 7 days. You calculate like that for each day. As you go, the size of the calculation window stays the same (7 days); it just moves day by day.

Rolling Averages and Time Window Analysis in SQL Window Functions

We’re Going Advanced: The Frame Clause in Window Functions

To calculate rolling averages and perform a time window analysis in window functions, you need to define the frame clause.

The generic syntax is given below. 

<window_function>() OVER (
    [PARTITION BY column_name]
    [ORDER BY column_name]
    [ROWS | RANGE BETWEEN <frame_start> AND <frame_end>]
)

Here are the components of the frame clause.

Rolling Averages and Time Window Analysis in SQL Window Functions

All these components can be combined in many different ways, which, in turn, results in different frame definitions. Here are the most common combinations.

Rolling Averages and Time Window Analysis in SQL Window Functions

Now, let’s demonstrate this by solving one of the Amazon SQL interview questions.

Last Updated: December 2020

HardID 10314

Find the 3-month rolling average of total revenue from purchases given a table with users, their purchase amount, and date purchased. Do not include returns which are represented by negative purchase values. Output the year-month (YYYY-MM) and 3-month rolling average of revenue, sorted from earliest month to latest month.

A 3-month rolling average is defined by calculating the average total revenue from all user purchases for the current month and previous two months. The first two months will not be a true 3-month rolling average since we are not given data from last year. Assume each month has at least one purchase.

Go to the Question

We have to calculate a 3-month rolling average of the revenue from purchases. The negative purchase values – i.e., the returns – shouldn’t be included in the calculation. The output should show YYYY-MM and the 3-month rolling average, sorted from the earliest to the latest month.

Dataset 

There’s one table named amazon_purchases.

Table: amazon_purchases
Loading Dataset

Solution

Let me first explain the subquery, then we’ll finish it off with the rolling average calculation.

1. Purchase sum by month

The subquery’s purpose is to sum all the purchases by month. We do that by using TO_CHAR to convert the created_at column to the YYYY-MM format, then applying SUM() to calculate monthly revenue. 

The output is grouped by and sorted by month. 

SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
       SUM(purchase_amt) AS monthly_revenue
FROM amazon_purchases
WHERE purchase_amt>0
GROUP BY TO_CHAR(created_at::DATE, 'YYYY-MM')
ORDER BY TO_CHAR(created_at::DATE, 'YYYY-MM')

Here’s its output. 

monthmonthly_revenue
2020-0126292
2020-0220695
2020-0329620
2020-0421933
2020-0524700
2020-0627687
2020-0725309
2020-0823496
2020-0924827
2020-1015310

2. Calculate the moving average

Now, we embed it into the outer query. The outer query calculates the moving average on the data provided by the subquery. The window function used is – expectedly – AVG(), with the window defined as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. This window definition means that the current row and the two before it are included in the calculation. Each row is one month; in other words, we calculate the moving average on the current and the two previous months. 

Of course, the calculation has to run from the earliest to the latest month to make sense, so we sort the data in the window by month in ascending order in the OVER() clause. 

PostgreSQL
Go to the question on the platformTables: amazon_purchases

Interview Pattern #6: Retention, Reactivation, and Session Patterns

This is probably the hardest pattern. Not because of any especially difficult window functions, but because it requires chaining multiple window functions.

This pattern encompasses three distinct problem types.

Retention and Reactivation in SQL Window Functions

All three are instances of the classic islands-and-gaps problem.

Streak and session detection identify the periods of consecutive activity, i.e., the islands.

Reactivation focuses on gaps, i.e., detecting when a user’s silence exceeds a threshold before a new island begins.

This technique works in three steps.

Retention and Reactivation in SQL Window Functions

Let’s now apply this technique to an actual business problem.

Here’s an interview question from LinkedIn and Meta SQL interview questions

Last Updated: October 2022

HardID 2131

Provided a table with user ID and the dates they visited the platform, find the top 3 users with the longest continuous streak of visiting the platform up to August 10, 2022. Output the user ID and the length of the streak.

In case of a tie, display all users with the top three longest streak lengths.

Go to the Question

We need to find the top three users with the longest platform visit streaks up to 10 August 2022. The output must show the user ID and the streak length in days.

Dataset

We’ll use the user_streaks table.

Table: user_streaks
Loading Dataset

Solution

The solution chains several CTEs.

1. unique_visits CTE: Uses DISTINCT to find unique daily visits per user up to '2022-08-10'.

WITH unique_visits AS (
    SELECT DISTINCT user_id,
                    date_visited
    FROM user_streaks
    WHERE date_visited <= DATE '2022-08-10'
),

2. streak_flags CTE: In this CTE, we reference the previous CTE and use CASE WHEN to mark the new streak with 1. We calculate it by first using LAG() to access the date of the previous visit, then subtracting that date from the current session date. If the difference is 1 – one day – that’s a continuation of the streak, so we mark it as 0. Otherwise, it’s marked as 1.

streak_flags AS (
    SELECT *,
           CASE
               WHEN date_visited - LAG(date_visited) OVER (PARTITION BY user_id ORDER BY date_visited) = 1
               THEN 0
               ELSE 1
           END AS new_streak
    FROM unique_visits
),

3. streak_ids CTE: Here, we use another window function, namely SUM(), to calculate the cumulative sum of the streaks by user. The cumulative sum becomes a streak ID, assigned to each user’s date of visit. That way, we can identify the number of streaks and the number of visits within each streak.

streak_ids AS (
    SELECT *,
           SUM(new_streak) OVER (PARTITION BY user_id ORDER BY date_visited) AS streak_id
    FROM streak_flags
),

4. streak_lengths CTE: Now, we use the info from the previous CTE to COUNT(*) each user’s each streak’s length.

streak_lengths AS (
    SELECT user_id,
           streak_id,
           COUNT(*) AS streak_length
    FROM streak_ids
    GROUP BY user_id, streak_id
),

5. longest_per_user CTE: As a next step, we use MAX() to find the longest streak per user. 

longest_per_user AS (
    SELECT user_id,
           MAX(streak_length) AS streak_length
    FROM streak_lengths
    GROUP BY user_id
),

6. ranked_lengths CTE: Now, we use DENSE_RANK() to rank each user’s streaks by their length, from the longest to shortest.

ranked_lengths AS (
    SELECT DISTINCT streak_length,
           DENSE_RANK() OVER (ORDER BY streak_length DESC) AS len_rank
    FROM longest_per_user
),

7. top_lengths CTE: In this final CTE, we simply keep only the three top longest streaks. 

top_lengths AS (
    SELECT streak_length
    FROM ranked_lengths
    WHERE len_rank <= 3
)

Now, in the final step, everything comes together. We join the longest_per_user and top_lengths CTEs to connect the three longest streaks to their users, and show it as the output. Here’s the complete code. 

PostgreSQL

Why the Interviewers Love This Pattern

To apply this pattern, you first need to recognize that the problem can’t be solved in one pass. That’s already a plus in the interviewer's notebook. Then, you demonstrate thinking in layers, as I showed you above: detecting, grouping, counting, ranking – each CTE has a clear, distinct purpose.

That way of thinking is what distinguishes candidates who understand data (and the problem) from those who just know syntax.

Common SQL Window Function Mistakes in Interviews

There are several typical window function mistakes candidates make in interviews. Often, these mistakes are silent: the query runs, it returns results, but nothing tells you the result is wrong.

Common SQL Window Functions Mistakes

Practice SQL Window Function Interview Questions

The time has come for you to stop reading and start actively practicing what this article has taught you. Here are several hand-picked StrataScratch questions for each window function pattern.

1. Practice: Finding the Latest Row per User

Question #1: Finding Updated Records

Last Updated: November 2020

EasyID 10299

We have a table with employees and their salaries; however, some of the records are old and contain outdated salary information. Since there is no timestamp, assume salary is non-decreasing over time. You can consider the current salary for an employee is the largest salary value among their records. If multiple records share the same maximum salary, return any one of them. Output their id, first name, last name, department ID, and current salary. Order your list by employee ID in ascending order.

Go to the Question

Dataset

Table: ms_employee_salary
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: ms_employee_salary

Question #2: Streamer Sessions by Initial Viewers

Last Updated: January 2021

HardID 2012

Return the number of streamer sessions for each user whose very first session was as a viewer.

Include the user ID and count of streamer sessions for users whose earliest session (by session_start) was a 'viewer' session, regardless of whether they ever had a streamer session later. Sort the results by streamer session count in descending order, then by user ID in ascending order.

Go to the Question

Dataset

Table: twitch_sessions
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: twitch_sessions

Question #3: Highest Salary in the Department

Last Updated: April 2019

MediumID 9897

Find the employee with the highest salary per department. Output the department name, employee's first name along with the corresponding salary.

Go to the Question

Dataset

Table: employee
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL

2. Practice: Ranking Customers, Products, or Events

Question #1: Ranking Most Active Guests

MediumID 10159

Identify the most engaged guests by ranking them according to their overall messaging activity. The most active guest, meaning the one who has exchanged the most messages with hosts, should have the highest rank. If two or more guests have the same number of messages, they should have the same rank. Importantly, the ranking shouldn't skip any numbers, even if many guests share the same rank. Present your results in a clear format, showing the rank, guest identifier, and total number of messages for each guest, ordered from the most to least active.

Go to the Question

Dataset

Table: airbnb_contacts
Loading Dataset

Solution

Write, run, and check your solution in the code editor.

PostgreSQL
Go to the question on the platformTables: airbnb_contacts

Question #2: Best Selling Item

Last Updated: July 2020

HardID 10172

Find the best-selling item for each month (no need to separate months by year). The best-selling item is determined by the highest total sales amount, calculated as: total_paid = unitprice * quantity. A negative quantity indicates a return or cancellation (the invoice number begins with 'C'. To calculate sales, ignore returns and cancellations. Output the month, description of the item, and the total amount paid.

Go to the Question

Dataset

Table: online_retail
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL

Question #3: Rank Variance per Country 

Last Updated: February 2021

HardID 2007

Compare the total number of comments made by users in each country during December 2019 and January 2020.

For each month, rank countries by their total number of comments in descending order. Countries with the same total should share the same rank, and the next rank should increase by one (without skipping numbers).

Return the names of the countries whose rank improved from December to January (that is, their rank number became smaller).

Go to the Question

Dataset

Table: fb_comments_count
Loading Dataset
Table: fb_active_users
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: fb_comments_count, fb_active_users

3. Practice: Calculating Running Totals & Cumulative Metrics

Question #1: Maximum Number of Employees Reached

Last Updated: June 2021

HardID 2046

Write a query that returns every employee that has ever worked for the company. For each employee, calculate the greatest number of employees that worked for the company during their tenure and the first date that number was reached. The termination date of an employee should not be counted as a working day.

Your output should have the employee ID, greatest number of employees that worked for the company during the employee's tenure, and first date that number was reached.

Go to the Question

Dataset

Table: uber_employees
Loading Dataset

Solution

Write, run, and check your solution in the code editor.

PostgreSQL

Question #2: Customer Tracking

Last Updated: October 2022

HardID 2136

Given users' session logs, calculate how many hours each user was active in total across all recorded sessions.

Note: The session starts when state=1 and ends when state=0.

Go to the Question

Dataset

Table: cust_tracking
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL

4. Practice: Comparing Current Row to Previous or Next Row

Question #1: Monthly Percentage Difference

Last Updated: December 2020

HardID 10319

Given a table of purchases by date, calculate the month-over-month percentage change in revenue. The output should include the year-month date (YYYY-MM) and percentage change, rounded to the 2nd decimal point, and sorted from the beginning of the year to the end of the year. The percentage change column will be populated from the 2nd month forward and can be calculated as ((this month's revenue - last month's revenue) / last month's revenue)*100.

Go to the Question

Dataset

Table: sf_transactions
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: sf_transactions

Question #2: Growth of Airbnb

Last Updated: February 2018

HardID 9637

Calculate Airbnb's annual growth rate using the number of registered hosts as the key metric. The growth rate is determined by:

Growth Rate = ((Number of hosts registered in the current year - number of hosts registered in the previous year) / number of hosts registered in the previous year) * 100

Output the year, number of hosts in the current year, number of hosts in the previous year, and the growth rate. Round the growth rate to the nearest percent. Sort the results in ascending order by year.

Assume that the dataset consists only of unique hosts, meaning there are no duplicate hosts listed.

Go to the Question

Dataset

Table: airbnb_search_details
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: airbnb_search_details

Question #3: Finding User Purchases

Last Updated: December 2020

MediumID 10322

Identify returning active users by finding users who made a second purchase within 1 to 7 days after their first purchase. Ignore same-day purchases. Output a list of these user_ids.

Go to the Question

Dataset

Table: amazon_transactions
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: amazon_transactions

5. Practice: Rolling Averages and Time-Window Analysis

Question #1: Distance per Dollar

Last Updated: November 2020

HardID 10302

You’re given a dataset of Uber rides with the traveling distance (distance_to_travel) and cost (monetary_cost) for each ride. First, find the difference between the distance-per-dollar for each ride and the monthly distance-per-dollar for that year-month.

Distance-per-dollar for each ride is defined as the distance traveled divided by the cost of the ride. Monthly distance-per-dollar is defined as the total distance traveled in that month divided by the total cost for that month.

Use the calculated difference on each date to calculate absolute average difference in distance-per-dollar metric on monthly basis (year-month).

The output should include the year-month (YYYY-MM) and the absolute average difference in distance-per-dollar (Absolute value to be rounded to the 2nd decimal).

You should also count both success and failed request_status as the distance and cost values are populated for all ride requests. Also, assume that all dates are unique in the dataset. Order your results by earliest request date first.

Go to the Question

Dataset

Table: uber_request_logs
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: uber_request_logs

Question #2: Product Engagement Momentum Shifts

Last Updated: September 2024

HardID 10564

Identify all products that experienced a turnaround in user engagement: at least 3 consecutive months of declining monthly active users followed by at least 3 consecutive months of growth.

For each product that matches this pattern, return the product name, the month when the decline started, the month when growth resumed, and the growth ratio from the lowest point to the most recent peak, calculated as: (peak_users - lowest_users) / lowest_users.

Go to the Question

Dataset

Table: product_engagement
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: product_engagement

Question #3: Search Click Success Rate by User Segment

Last Updated: October 2025

HardID 10566

Calculate the search success rate for new users versus existing users. A successful search is one where the first click event occurs within 30 seconds of the search event.

Group all users into two segments: • new (registered within the last 30 days covered by the dataset — that is, on or after 30 days before the most recent date in the dataset) • existing (registered earlier).

Return one row per user segment with total searches, successful searches, and success rate.

Go to the Question

Dataset

Table: search_events
Loading Dataset
Table: accounts
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: search_events, accounts

6. Practice: Retention, Reactivation, and Session Patterns

Question #1: Consecutive Days

Last Updated: July 2021

HardID 2054

Find all the users who were active for 3 consecutive days or more.

Go to the Question

Dataset

Table: sf_events
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL

Question #2: Users by Average Session Time

Last Updated: July 2021

MediumID 10352

Calculate each user's average session time, where a session is defined as the time difference between a page_load and a page_exit. Assume each user has only one session per day. If there are multiple page_load or page_exit events on the same day, use only the latest page_load and the earliest page_exit. Only consider sessions where the page_load occurs before the page_exit on the same day. Output the user_id and their average session time.

Go to the Question

Dataset

Table: facebook_web_log
Loading Dataset

Solution

Write, run, and check your solution in the code editor. 

PostgreSQL
Go to the question on the platformTables: facebook_web_log

Question #3: First Day Retention Rate

Last Updated: February 2022

HardID 2090

Calculate the first-day retention rate of a group of video game players. The first-day retention occurs when a player logs in 1 day after their first-ever log-in. Return the proportion of players who meet this definition divided by the total number of players.

Go to the Question

Dataset

Table: players_logins
Loading Dataset

Solution

Write, run, and check your solution in the code editor.

PostgreSQL

Final Takeaway

This guide gave you a framework for recognizing which problem calls for which window function. I’ve shown you six patterns that cover the vast majority of SQL interview questions where you’ll encounter window functions.

Let’s recap them.

SQL Window Functions

The best way to build a real understanding of these patterns is to work through the questions. Write the code yourself, get it wrong, understand why, and fix it. This is the only way to stop thinking about the window function syntax and start thinking about business problem-solving. It sets you on the path to an interview and job success.

FAQs

1. What is the difference between GROUP BY and a window function?

GROUP BY collapses individual rows and returns one row per group. Window functions don’t do that; they perform a calculation along the individual rows, show the result, and keep the individual rows. 

2. Which SQL window functions are most common in interviews?

  • ROW_NUMBER(): Probably the most common one. Used in deduplication, finding the first or last row per group, and pivoting ranked results into columns. 
  • RANK() & DENSE_RANK(): They show up in top-N problems. 
  • LAG() & LEAD(): We use them for row-to-row comparisons, such as month-over-month change, time between events, and comparing a value to the previous period. 
  • SUM() OVER, AVG() OVER, COUNT() OVER: These are aggregate window functions. Used to calculate group-level metrics at the row level, most commonly running totals and moving averages. 
  • FIRST_VALUE() & LAST_VALUE(): Used to pull a value from the first or the last row of a partition. They are a common alternative to ROW_NUMBER() when you only need one column, not the whole row

3. When should I use ROW_NUMBER() vs RANK()?

The difference only matters when there are ties, i.e., two or more rows with the same value in the ORDER BY clause in OVER()

ROW_NUMBER() always assigns a unique integer to every row. So, it ignores ties and breaks them nondeterministically. You should use this window function whenever the logic requires WHERE rank = 1, i.e., returning a single row. Common examples are deduplication and pivoting positions into columns. 

RANK() should be used when tied rows should share a position and all should appear in the results, because this function assigns the same rank to tied rows, then skips the next rank. 

DENSE_RANK() is a third option, which assigns the same rank to tied rows – just like RANK() – but without skipping the next rank. Use it when you want the Nth-highest value always to be ranked as N, regardless of how many rows share rank N-1. 

4. How do I calculate a running total in SQL?

By using SUM() as a window function with ORDER BY inside OVER().

SELECT date,
       revenue,
       SUM(revenue) OVER (ORDER BY date) AS running_total
FROM daily_revenue;

Such use of the ORDER BY clause activates a default frame of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means the sum includes all rows from the beginning of the result up to and including the current row. This is a textbook definition of a running or cumulative total.

You can, of course, do the same for each group by using the PARTITION BY clause. In the example below, running totals are calculated per user.

SELECT date,
       revenue,
       SUM(revenue) OVER (PARTITION BY user_id ORDER BY date)
FROM daily_revenue;

Note that ORDER BY in OVER() is what makes the SUM() window function calculate cumulatives. If you use the SUM() OVER() version, that gives the same grand total in every row. 

5. What do LAG() and LEAD() do?

LAG() lets you access the value from the previous row on N rows, ordered by the ORDER BY clause in OVER().

LEAD() does the same, only for the next row on N rows.

6. How do rolling windows work in SQL?

Rolling windows within window functions limit the calculation to a fixed number of preceding and or following rows, rather than all rows from the start. 

The window frame clause has two parts. The first part is the unit, i.e., ROWS or RANGE . ROWS counts physical rows, RANGE determines the included number of rows by value.

The second part of the window frame clause is the bounds BETWEEN … AND …, which define how far the window extends in each direction from the current window. Each bound endpoint – a start or an end – can be one of these five values:

  • UNBOUNDED PRECEDING – the first row of the partition
  • N PRECEDING – N rows before the current row
  • CURRENT ROW – the current row itself
  • N FOLLOWING – N rows after the current row
  • UNBOUNDED FOLLOWING – the last row of the partition

For example, a rolling window defined as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW means the window function performs a calculation on 2 rows preceding the current row and the current row, which makes it a 3-row rolling window.

Share