Advanced Window Patterns
Progress Tracking
Log in to save this lesson and continue from where you left off.
Deduplication with Ranking
Data deduplication is a frequent real-world use of ranking. Your database has multiple versions of the same record (address changed, salary updated, status modified), and you need only the latest. The pattern: rank by date within each entity, keep rank 1. This works on any "keep the most recent" problem, and it’s cleaner than the groupby-max-then-merge alternative.
# Rank by date within each entity, keep rank 1
df["rnk"] = df.groupby("entity_id")["updated_at"].rank(
method="first", ascending=False
)
latest = df[df["rnk"] == 1].drop(columns="rnk")Smoothing with .rolling()
| id | user_id | item | created_at | revenue |
|---|---|---|---|---|
| 1 | 109 | milk | 2020-03-03 | 123 |
| 2 | 139 | biscuit | 2020-03-18 | 421 |
| 3 | 120 | milk | 2020-03-18 | 176 |
| 4 | 108 | banana | 2020-03-18 | 862 |
| 5 | 130 | milk | 2020-03-28 | 333 |
The starter builds the pipeline. Add a 2-purchase rolling average of days gap per user. Output `user_id`, `created_at`, `days_gap`, and `avg_gap`.
Combining Everything
For each user: rank purchases by date, calculate days since previous purchase, and flag purchases with gaps over 5 days. Return user id, created at, purchase number, days gap, and long gap.
Rank Variance Per Country
| user_id | created_at | number_of_comments |
|---|---|---|
| 18 | 2019-12-29 | 1 |
| 25 | 2019-12-21 | 1 |
| 78 | 2020-01-04 | 1 |
| 37 | 2020-02-01 | 1 |
| 41 | 2019-12-23 | 1 |
| user_id | name | status | country |
|---|---|---|---|
| 33 | Amanda Leon | open | Australia |
| 27 | Jessica Farrell | open | Luxembourg |
| 18 | Wanda Ramirez | open | USA |
| 50 | Samuel Miller | closed | Brazil |
| 16 | Jacob York | open | Australia |
Compare the total number of comments made by users in each country during December 2019 and January 2020. For each month, rank countries by their total number of comments in descending order. Countries with the same total should share the same rank, and the next rank should increase by one (without skipping numbers). Return the names of the countries whose rank improved from December to January (that is, their rank number became smaller).
Best Selling Item
| invoiceno | stockcode | description | quantity | invoicedate | unitprice | customerid | country |
|---|---|---|---|---|---|---|---|
| 544586 | 21890 | S/6 WOODEN SKITTLES IN COTTON BAG | 3 | 2011-02-21 | 2.95 | 17338 | United Kingdom |
| 541104 | 84509G | SET OF 4 FAIRY CAKE PLACEMATS | 3 | 2011-01-13 | 3.29 | United Kingdom | |
| 560772 | 22499 | WOODEN UNION JACK BUNTING | 3 | 2011-07-20 | 4.96 | United Kingdom | |
| 555150 | 22488 | NATURAL SLATE RECTANGLE CHALKBOARD | 5 | 2011-05-31 | 3.29 | United Kingdom | |
| 570521 | 21625 | VINTAGE UNION JACK APRON | 3 | 2011-10-11 | 6.95 | 12371 | Switzerland |
Find the best-selling item for each month (no need to separate months by year). The best-selling item is determined by the highest total sales amount, calculated as: `total_paid = unitprice * quantity`. A negative `quantity` indicates a return or cancellation (the invoice number begins with `'C'`. To calculate sales, ignore returns and cancellations. Output the month, description of the item, and the total amount paid.
Consecutive Days
| record_date | account_id | user_id |
|---|---|---|
| 2021-01-01 | A1 | U1 |
| 2021-01-01 | A1 | U2 |
| 2021-01-06 | A1 | U3 |
| 2021-01-02 | A1 | U1 |
| 2020-12-24 | A1 | U2 |
Find all the users who were active for 3 consecutive days or more.
Key Takeaways
- Deduplication: rank by date within groups, keep rank 1.
- Chain techniques: sort → rank → shift → cumsum → flag.
.rolling(n)for moving averages within groups (watch the MultiIndex).- Always sort before any positional operation.
Your learning journey starts here
What You Can Do Now
- Filter, sort, and aggregate data across grouped categories
- Merge multiple DataFrames to answer cross-table questions
- Clean messy strings, extract date parts, and apply custom logic
- Compare rows to their group averages and their neighbors
- Build ranked leaderboards, running totals, and period-over-period reports
- Chain multi-step analysis pipelines from filter to final output