Advanced Window Patterns
Progress Tracking
Log in to save this lesson and continue from where you left off.
Deduplication with Ranking
Data deduplication is a frequent real-world use of ranking. Your database has multiple versions of the same record (address changed, salary updated, status modified), and you need only the latest. The pattern: rank by date within each entity, keep rank 1. This works on any "keep the most recent" problem, and it’s cleaner than the groupby-max-then-merge alternative.
# Rank by date within each entity, keep rank 1
df["rnk"] = df.groupby("entity_id")["updated_at"].rank(
method="first", ascending=False
)
latest = df[df["rnk"] == 1].drop(columns="rnk")Smoothing with .rolling()
The starter builds the pipeline. Add a 2-purchase rolling average of days gap per user. Output `user_id`, `created_at`, `days_gap`, and `avg_gap`.
Combining Everything
For each user: rank purchases by date, calculate days since previous purchase, and flag purchases with gaps over 5 days.
Rank Variance Per Country
| hotel_address | additional_number_of_scoring | review_date | average_score | hotel_name | reviewer_nationality | negative_review | review_total_negative_word_counts | total_number_of_reviews | positive_review | review_total_positive_word_counts | total_number_of_reviews_reviewer_has_given | reviewer_score | tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 Western Gateway Royal Victoria Dock Newham London E16 1AA United Kingdom | 359 | 2017-07-05 | 8.5 | Novotel London Excel | United Kingdom | coffee and tea at breakfast were not particularly hot Otherwise everything else was fine | 16 | 1158 | we were allocated the newly refurbished rooms and so everything was fresh and the bed was very comfortable the hotel is ideally situated near City Airport although eventually we travelled by train | 34 | 2 | 10 | [' Leisure trip ', ' Family with young children ', ' Standard Double Room with Two Single Beds ', ' Stayed 2 nights ', ' Submitted from a mobile device '] | 29 days | 51.51 | 0.02 |
| 35 Charles Street Mayfair Westminster Borough London W1J 5EB United Kingdom | 252 | 2015-08-29 | 9.1 | The Chesterfield Mayfair | Israel | No Negative | 0 | 1166 | We liked everything The hotel is simply a boutique the staff were all polite and helpfull The room was clean and been serviced daily Wifi was completely free Breakfast was simply great I so much want to get back | 41 | 8 | 10 | [' Leisure trip ', ' Couple ', ' Classic Double Room ', ' Stayed 4 nights ', ' Submitted from a mobile device '] | 705 day | 51.51 | -0.15 |
| 14 Rue Stanislas 6th arr 75006 Paris France | 40 | 2017-05-23 | 9.1 | Hotel Le Six | United States of America | There is currently utility construction taking place on the street in front of the hotel so a little noisy at times and barriers in place | 27 | 177 | Neat boutique hotel Some of the most comfortable hotel beds I have ever come across Staff was wonderful Loved the location Not too touristy Luxembourg gardens close by and a great place for a morning run walk | 39 | 3 | 9.2 | [' Leisure trip ', ' Family with young children ', ' Deluxe Double Room ', ' Stayed 4 nights ', ' Submitted from a mobile device '] | 72 days | 48.84 | 2.33 |
| Gran V a De Les Corts Catalanes 570 Eixample 08011 Barcelona Spain | 325 | 2016-08-25 | 8.2 | Sunotel Central | United Kingdom | Coffee at breakfast could be better When you spend this amount in a hotel I expect better coffee in the morning | 22 | 3870 | Great bed nice to have a coffee machine in the room love the air conditioning and basically loved the attitude of the staff Really great | 26 | 2 | 9.2 | [' Leisure trip ', ' Group ', ' Comfort Double or Twin Room ', ' Stayed 1 night ', ' Submitted from a mobile device '] | 343 day | 41.38 | 2.16 |
| Rathausstra e 17 01 Innere Stadt 1010 Vienna Austria | 195 | 2015-09-17 | 8.5 | Austria Trend Hotel Rathauspark Wien | United Kingdom | A bit out of the way location wise | 9 | 1884 | Clean modern rooms and bathroom well equipped | 9 | 2 | 7.5 | [' Leisure trip ', ' Couple ', ' Comfort Room ', ' Stayed 2 nights ', ' Submitted from a mobile device '] | 686 day | 48.21 | 16.36 |
Compare the total number of comments made by users in each country during December 2019 and January 2020. For each month, rank countries by their total number of comments in descending order. Countries with the same total should share the same rank, and the next rank should increase by one (without skipping numbers). Return the names of the countries whose rank improved from December to January (that is, their rank number became smaller).
Best Selling Item
| product_id | promotion_id | cost_in_dollars | customer_id | date_sold | units_sold |
|---|---|---|---|---|---|
| 1 | 1 | 2 | 1 | 2022-04-01 | 4 |
| 3 | 3 | 6 | 3 | 2022-05-24 | 6 |
| 1 | 2 | 2 | 10 | 2022-05-01 | 3 |
| 1 | 2 | 3 | 2 | 2022-05-01 | 9 |
| 2 | 2 | 10 | 2 | 2022-05-01 | 1 |
Find the best-selling item for each month (no need to separate months by year). The best-selling item is determined by the highest total sales amount, calculated as: `total_paid = unitprice * quantity`. A negative `quantity` indicates a return or cancellation (the invoice number begins with `'C'`. To calculate sales, ignore returns and cancellations. Output the month, description of the item, and the total amount paid.
Consecutive Days
| record_date | account_id | user_id |
|---|---|---|
| 2021-01-01 | A1 | U1 |
| 2021-01-01 | A1 | U2 |
| 2021-01-06 | A1 | U3 |
| 2021-01-02 | A1 | U1 |
| 2020-12-24 | A1 | U2 |
Find all the users who were active for 3 consecutive days or more.
Key Takeaways
- Deduplication: rank by date within groups, keep rank 1.
- Chain techniques: sort → rank → shift → cumsum → flag.
.rolling(n)for moving averages within groups (watch the MultiIndex).- Always sort before any positional operation.
Your learning journey starts here
Complete lessons to track your progress through the path.
What You Can Do Now
- Filter, sort, and aggregate data across grouped categories
- Merge multiple DataFrames to answer cross-table questions
- Clean messy strings, extract date parts, and apply custom logic
- Compare rows to their group averages and their neighbors
- Build ranked leaderboards, running totals, and period-over-period reports
- Chain multi-step analysis pipelines from filter to final output
Where to Go from Here
- Practice is what turns knowledge into fluency.StrataScratch has hundreds of pandas questions from real company interviews — start with the ones tagged at your level and work up.
- If you haven’t already, try the SQL learning path as well. Most data roles expect both, and the concepts map closely: groupby is GROUP BY, merge is JOIN, transform is a window function. Knowing both makes you faster in each.
- The best next step is a real project. StrataScratch Data Projects give you guided, end-to-end analyses on real datasets — pick one that interests you and put your skills to work. That’s where learning becomes craft.