Module 6: Window Operations25 min

Advanced Window Patterns

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Deduplication with Ranking

Data deduplication is a frequent real-world use of ranking. Your database has multiple versions of the same record (address changed, salary updated, status modified), and you need only the latest. The pattern: rank by date within each entity, keep rank 1. This works on any "keep the most recent" problem, and it’s cleaner than the groupby-max-then-merge alternative.

Python
# Rank by date within each entity, keep rank 1
df["rnk"] = df.groupby("entity_id")["updated_at"].rank(
    method="first", ascending=False
)
latest = df[df["rnk"] == 1].drop(columns="rnk")

Smoothing with .rolling()

Table: amazon_transactions
iduser_iditemcreated_atrevenue
1109milk2020-03-03123
2139biscuit2020-03-18421
3120milk2020-03-18176
4108banana2020-03-18862
5130milk2020-03-28333
1
Add a Rolling Average

The starter builds the pipeline. Add a 2-purchase rolling average of days gap per user. Output `user_id`, `created_at`, `days_gap`, and `avg_gap`.

Tables: amazon_transactions

Combining Everything

2
Full Analysis Pipeline

For each user: rank purchases by date, calculate days since previous purchase, and flag purchases with gaps over 5 days. Return user id, created at, purchase number, days gap, and long gap.

Tables: amazon_transactions

Rank Variance Per Country

Table: fb_comments_count
user_idcreated_atnumber_of_comments
182019-12-291
252019-12-211
782020-01-041
372020-02-011
412019-12-231
Table: fb_active_users
user_idnamestatuscountry
33Amanda LeonopenAustralia
27Jessica FarrellopenLuxembourg
18Wanda RamirezopenUSA
50Samuel MillerclosedBrazil
16Jacob YorkopenAustralia
3
Rank Variance Per Country
View solution

Compare the total number of comments made by users in each country during December 2019 and January 2020. For each month, rank countries by their total number of comments in descending order. Countries with the same total should share the same rank, and the next rank should increase by one (without skipping numbers). Return the names of the countries whose rank improved from December to January (that is, their rank number became smaller).

Tables: fb_comments_count, fb_active_users

Best Selling Item

Table: online_retail
invoicenostockcodedescriptionquantityinvoicedateunitpricecustomeridcountry
54458621890S/6 WOODEN SKITTLES IN COTTON BAG32011-02-212.9517338United Kingdom
54110484509GSET OF 4 FAIRY CAKE PLACEMATS32011-01-133.29United Kingdom
56077222499WOODEN UNION JACK BUNTING32011-07-204.96United Kingdom
55515022488NATURAL SLATE RECTANGLE CHALKBOARD52011-05-313.29United Kingdom
57052121625VINTAGE UNION JACK APRON32011-10-116.9512371Switzerland
4
Best Selling Item
View solution

Find the best-selling item for each month (no need to separate months by year). The best-selling item is determined by the highest total sales amount, calculated as: `total_paid = unitprice * quantity`. A negative `quantity` indicates a return or cancellation (the invoice number begins with `'C'`. To calculate sales, ignore returns and cancellations. Output the month, description of the item, and the total amount paid.

Tables: online_retail

Consecutive Days

Table: sf_events
record_dateaccount_iduser_id
2021-01-01A1U1
2021-01-01A1U2
2021-01-06A1U3
2021-01-02A1U1
2020-12-24A1U2
5
Consecutive Days
View solution

Find all the users who were active for 3 consecutive days or more.

Tables: sf_events

Key Takeaways

  • Deduplication: rank by date within groups, keep rank 1.
  • Chain techniques: sort → rank → shift → cumsum → flag.
  • .rolling(n) for moving averages within groups (watch the MultiIndex).
  • Always sort before any positional operation.

Your learning journey starts here

Complete lessons to track your progress through the path.

0%

What You Can Do Now

  1. Filter, sort, and aggregate data across grouped categories
  2. Merge multiple DataFrames to answer cross-table questions
  3. Clean messy strings, extract date parts, and apply custom logic
  4. Compare rows to their group averages and their neighbors
  5. Build ranked leaderboards, running totals, and period-over-period reports
  6. Chain multi-step analysis pipelines from filter to final output