Module 6: Window Operations • 25 min

Advanced Window Patterns

Deduplication with Ranking

Deduplication is one of the most common real-world uses of ranking. Your database holds multiple versions of the same record (an address changed, a salary was updated, a status was modified), and you need only the latest one. The pattern: rank by date within each entity, then keep rank 1. This works for any "keep the most recent" problem, and it’s cleaner than the groupby-max-then-merge alternative because it needs no second join back to the original table.

Python

# Rank by date within each entity, keep rank 1
df["rnk"] = df.groupby("entity_id")["updated_at"].rank(
    method="first", ascending=False
)
latest = df[df["rnk"] == 1].drop(columns="rnk")
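To see the pattern end to end, here is a minimal sketch on a made-up three-row table. The `salary` column and all values are assumptions for illustration, not part of the lesson's data. It also runs the groupby-max-then-merge alternative so you can compare the two.

```python
import pandas as pd

# Hypothetical data: two versions of entity 1, one version of entity 2
df = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-15"]),
    "salary": [50000, 55000, 60000],
})

# Ranking approach: newest row per entity gets rank 1
ranked = df.copy()
ranked["rnk"] = ranked.groupby("entity_id")["updated_at"].rank(
    method="first", ascending=False
)
latest = ranked[ranked["rnk"] == 1].drop(columns="rnk")

# Alternative: find the max date per entity, then merge back to recover the rows
max_dates = df.groupby("entity_id", as_index=False)["updated_at"].max()
latest_merge = df.merge(max_dates, on=["entity_id", "updated_at"])

print(latest)
print(latest_merge)
```

Both approaches return the same rows here. They can differ on ties: `method="first"` keeps exactly one row per entity, while the merge keeps every row that shares the maximum date.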

Smoothing with .rolling()

Table: amazon_transactions

id	user_id	item	created_at	revenue
1	109	milk	2020-03-03	123
2	139	biscuit	2020-03-18	421
3	120	milk	2020-03-18	176
4	108	banana	2020-03-18	862
5	130	milk	2020-03-28	333
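The preview above has one row per user, so a rolling window has nothing to smooth. The sketch below uses a made-up frame with the same column names (the rows and the window size of 2 are assumptions) to show one way to compute a per-user moving average and get it back onto the original rows.

```python
import pandas as pd

# Hypothetical rows shaped like amazon_transactions (values are made up)
tx = pd.DataFrame({
    "user_id": [109, 109, 109, 120, 120],
    "created_at": pd.to_datetime(
        ["2020-03-03", "2020-03-10", "2020-03-17", "2020-03-18", "2020-03-25"]
    ),
    "revenue": [100, 200, 300, 50, 150],
})

# Rolling is positional, so sort within each user first
tx = tx.sort_values(["user_id", "created_at"])

# groupby().rolling() returns a MultiIndex of (user_id, original row index)
rolled = tx.groupby("user_id")["revenue"].rolling(2, min_periods=1).mean()

# Drop the user_id level so the result aligns with tx's own index
tx["rolling_avg"] = rolled.reset_index(level=0, drop=True)

print(tx)
```

With `min_periods=1`, the first row of each user gets its own revenue as the average instead of NaN. Drop that argument if you prefer to leave incomplete windows empty.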

Combining Everything

Tables: amazon_transactions
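As a rough illustration of chaining these techniques, the sketch below works through a hypothetical task: number each user's purchases, measure the gap to the previous purchase, and group purchases into "bursts" separated by more than 7 days. The task, the 7-day threshold, the helper column names, and the sample rows are all assumptions, not the lesson's exercise.

```python
import pandas as pd

# Hypothetical rows shaped like amazon_transactions (values are made up)
tx = pd.DataFrame({
    "user_id": [109, 109, 109, 120, 120],
    "created_at": pd.to_datetime(
        ["2020-03-03", "2020-03-05", "2020-03-20", "2020-03-18", "2020-03-19"]
    ),
    "revenue": [123, 80, 40, 176, 60],
})

# 1. Sort: every later step depends on row order within each user
tx = tx.sort_values(["user_id", "created_at"]).reset_index(drop=True)

# 2. Rank: purchase number within each user
tx["purchase_no"] = (
    tx.groupby("user_id")["created_at"].rank(method="first").astype(int)
)

# 3. Shift: previous purchase date, then the gap in days (NaN for first purchase)
tx["prev_date"] = tx.groupby("user_id")["created_at"].shift(1)
tx["gap_days"] = (tx["created_at"] - tx["prev_date"]).dt.days

# 4. Cumsum: a new burst starts on the first purchase or after a gap over 7 days
tx["new_burst"] = tx["gap_days"].isna() | (tx["gap_days"] > 7)
tx["burst_id"] = tx["new_burst"].astype(int).groupby(tx["user_id"]).cumsum()

# 5. Flag: purchases made within 7 days of the previous one
tx["within_7_days"] = tx["gap_days"] <= 7

print(tx)
```

Each step adds one column, so you can inspect the intermediate result before moving on, which makes it easier to spot the step where a chain goes wrong.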

Rank Variance Per Country

Table: fb_comments_count

user_id	created_at	number_of_comments
18	2019-12-29	1
25	2019-12-21	1
78	2020-01-04	1
37	2020-02-01	1
41	2019-12-23	1

Table: fb_active_users

user_id	name	status	country
33	Amanda Leon	open	Australia
27	Jessica Farrell	open	Luxembourg
18	Wanda Ramirez	open	USA
50	Samuel Miller	closed	Brazil
16	Jacob York	open	Australia

Best Selling Item

Table: online_retail

invoiceno	stockcode	description	quantity	invoicedate	unitprice	customerid	country
544586	21890	S/6 WOODEN SKITTLES IN COTTON BAG	3	2011-02-21	2.95	17338	United Kingdom
541104	84509G	SET OF 4 FAIRY CAKE PLACEMATS	3	2011-01-13	3.29		United Kingdom
560772	22499	WOODEN UNION JACK BUNTING	3	2011-07-20	4.96		United Kingdom
555150	22488	NATURAL SLATE RECTANGLE CHALKBOARD	5	2011-05-31	3.29		United Kingdom
570521	21625	VINTAGE UNION JACK APRON	3	2011-10-11	6.95	12371	Switzerland

Consecutive Days

Table: sf_events

record_date	account_id	user_id
2021-01-01	A1	U1
2021-01-01	A1	U2
2021-01-06	A1	U3
2021-01-02	A1	U1
2020-12-24	A1	U2

Key Takeaways

Deduplication: rank by date within groups, keep rank 1.
Chain techniques: sort → rank → shift → cumsum → flag.
.rolling(n) for moving averages within groups (watch the MultiIndex: groupby().rolling() adds the group key as an index level, so drop it before assigning back).
Always sort before any positional operation.
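To make the last takeaway concrete, here is a small sketch of what goes wrong when you skip the sort. The frame borrows column names from sf_events, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows, deliberately out of date order
events = pd.DataFrame({
    "user_id": ["U1", "U1", "U1"],
    "record_date": pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-02"]),
})

# Without sorting, shift() pairs each row with whatever row happens to sit above it
prev_unsorted = events.groupby("user_id")["record_date"].shift(1)
gaps_unsorted = (events["record_date"] - prev_unsorted).dt.days

# After sorting, "previous row" really means "previous date"
ordered = events.sort_values(["user_id", "record_date"])
prev_sorted = ordered.groupby("user_id")["record_date"].shift(1)
gaps_sorted = (ordered["record_date"] - prev_sorted).dt.days

print(gaps_unsorted.tolist())
print(gaps_sorted.tolist())
```

The unsorted version produces a negative gap, which is a quick sanity check worth adding whenever you compute differences between neighboring rows.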

What You Can Do Now

Filter, sort, and aggregate data across grouped categories
Merge multiple DataFrames to answer cross-table questions
Clean messy strings, extract date parts, and apply custom logic
Compare rows to their group averages and their neighbors
Build ranked leaderboards, running totals, and period-over-period reports
Chain multi-step analysis pipelines from filter to final output