Amazon SQL Interview Questions

Written by: Sara Nobrega

Amazon SQL Interview Questions: The Real Patterns, Traps, and Query Skills Amazon Actually Tests.

What Amazon SQL Interviews Actually Test

Because knowledge of SQL is a key requirement for both data scientists and data analysts at Amazon, many interview questions require writing solutions in this language.

In fact, Amazon asks far fewer questions about theoretical concepts and its products than other tech companies. This is why it’s crucial to practice solving SQL interview questions before an Amazon interview.

Amazon SQL interview questions test a wide range of SQL concepts, but some come up more often than others. Nearly 70% of Amazon SQL interview questions involve data stored in multiple tables and require merging it using JOIN clauses, Common Table Expressions, or subqueries.

Another prevalent concept is data aggregation using SQL functions such as COUNT() or SUM() in combination with the GROUP BY clause, or using the more complex window functions. Many questions also require filtering and sorting data with the WHERE and ORDER BY clauses.

Amazon-Style Dataset Schema

Before writing SQL, lock in the grain of each table. Most Amazon-style questions are simple once you stop mixing row levels (transaction rows vs user rows vs order rows).

In Amazon-style datasets, similar archetypes show up again and again. Let’s explore some of them.

Customers / Users table (dimension)

This is your customers-type table.

  • Typical columns: user_id (or id), name fields, city, country, signup date (sometimes)
  • Grain: one row per user
  • What it’s for: segmentation (“by city”), cohort cuts (“new users”), deduping
Table: customers

Orders table (order header or order line)

This is your orders-type table.

  • Typical columns: order_id, user_id, order_date, maybe total_order_cost, maybe order_details
  • Grain warning: if there’s an order_details / item / product field and it varies per row, it’s often one row per order line, not one row per order.
Table: orders

Transactions / Purchases table (fact table)

This is the amazon_transactions / amazon_purchases vibe.

  • Typical columns: transaction_id (or id), user_id, created_at, revenue or purchase_amt
  • Grain: one row per purchase event (or per transaction / line item)
  • Common twist: money can be negative (returns/refunds)
Table: amazon_transactions

Since amazon_transactions is one row per transaction, “monthly revenue” means grouping created_at by month and summing revenue. If you switch to a per-user question (like repeat buyers), you’ll usually dedupe to one row per user per day or per order first.
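As a sketch of that idea against this archetype (using the column names listed above), monthly revenue is a one-step rollup:

```sql
-- Roll transaction rows up to one row per calendar month
SELECT TO_CHAR(created_at, 'YYYY-MM') AS month,
       SUM(revenue) AS monthly_revenue
FROM amazon_transactions
GROUP BY 1
ORDER BY 1;
```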

Amazon SQL Interview Questions (with Solutions)

Let’s start with the easier ones.

Easy Interview Questions

Amazon Easy SQL Interview Questions

Amazon SQL Interview Questions #1: Workers With The Highest Salaries

Last Updated: July 2021

Easy | ID 10353

Management wants to analyze only employees with official job titles. Find the job titles of the employees with the highest salary. If multiple employees have the same highest salary, include all their job titles.

Go to the Question

Data View

Table: worker
Table: title

A quick note on what matters here:

  • worker.salary holds the salary value we’re trying to maximize.
  • title.worker_title indicates whether someone has an “official” job title (we only consider workers who appear in this table with a non-null title).
  • We want job titles, not employee names, and we want all titles tied to the maximum salary.

Interview Framing (How It’s Asked + Typical Follow-Ups)

In interviews, this usually isn’t phrased as “find the max salary.” It’s framed as “Find the highest-paid employee(s) among the population that qualifies for analysis, then report a related attribute.” 

The follow-up is often about scope: "Does the maximum come from all workers or only titled workers?" and "What if multiple workers tie for the maximum?" If the interviewer wants to push, they'll ask about the affected_from column: "If titles change, which title should you use?"

A senior-sounding answer acknowledges the decision point: "I'll compute the max salary over only workers who have an official title, because the prompt explicitly restricts the analysis population."

Common Mistakes (And How To Avoid Them)

A common mistake is computing MAX(salary) from worker first, then joining with title afterward. That can silently drop the max-paid worker if they don’t have a title row, and you end up returning the wrong max for the restricted population.

Another mistake is using MAX(worker_title) instead of MAX(salary), or returning only one title by using LIMIT 1, which fails the "include all ties" requirement. If the data allowed multiple title rows per worker, failing to think about "current vs. historical title" could also create duplicates or mismatched titles.

Evaluation Criteria / Rubric (How Interviewers Score It)

A strong solution makes the population restriction explicit, preserves ties, and keeps the logic readable.

A great answer computes the maximum salary over the correctly filtered population (workers with titles), returns all titles tied at that salary, and doesn’t rely on DISTINCT to patch join mistakes.

An okay answer returns the right output on clean data, but doesn’t clearly justify whether the max is computed before or after applying the “official title” constraint. A weak answer computes the max salary from all workers, drops records in the join, or returns only one title when ties exist.

Solution

1) Find the maximum salary among workers who have an official title

This subquery is the key filter that enforces the “official title only” requirement. We join worker to title so untitled employees are excluded from consideration. Then we compute MAX(w.salary) over that restricted set.

PostgreSQL
Tables: worker, title
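A sketch of this subquery, assuming the worker table's key is worker_id and the title table references it as worker_ref_id (the join-key names aren't shown in the prompt):

```sql
-- Max salary over the titled population only
-- (assumed join keys: worker.worker_id, title.worker_ref_id)
SELECT MAX(w.salary) AS max_salary
FROM worker w
JOIN title t ON t.worker_ref_id = w.worker_id;
```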

2) Pull all job titles whose worker salary equals that maximum

Now we join worker and title again so we can return the actual titles. We keep only rows where the worker’s salary equals the maximum salary computed above. If multiple employees share the same highest salary, this filter naturally includes all of them (and therefore all of their titles).

PostgreSQL
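Putting the two steps together, the complete query might look like this (same assumed join keys as above):

```sql
-- All titles tied at the maximum salary among titled workers
SELECT t.worker_title AS best_paid_title
FROM worker w
JOIN title t ON t.worker_ref_id = w.worker_id
WHERE w.salary = (
    SELECT MAX(w2.salary)
    FROM worker w2
    JOIN title t2 ON t2.worker_ref_id = w2.worker_id
);
```

Because the filter is an equality against the max, every worker tied at that salary passes, and so do all of their titles.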

Output

best_paid_title
Asst. Manager
Manager

Amazon SQL Interview Questions #2: Total Cost Of Orders

Last Updated: July 2020

Easy | ID 10183

Find the total cost of each customer's orders. Output customer's id, first name, and the total order cost. Order records by customer's first name alphabetically.

Go to the Question

Data View

Table: customers
Table: orders

This is a "looks trivial" aggregation, but interviewers are usually checking whether you notice the grain shift. The orders table has multiple rows per customer, so the only correct way to get "total cost per customer" is to collapse to one row per customer with a GROUP BY.

Also, notice the prompt says “each customer’s orders,” which implies you’re summarizing at the customer level, not per order, not per day.

Grain (what one output row means): one row per customer who has at least one order in orders (with this solution).

Interview Framing (How It’s Asked + Typical Follow-Ups)

In interviews, this often comes as “roll up the fact table to the dimension grain,” and the follow-up is almost always about inclusivity: “Do you want customers with zero orders?”

If they do, you'll need a LEFT JOIN from customers and COALESCE(SUM(...), 0). Another common follow-up is about what you're allowed to group by: "Is first_name unique?" If not, you'd group by customers.id (and select the name as a dependent attribute).
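A sketch of that zero-order variant (this answers the follow-up, not the original prompt):

```sql
-- Keep every customer; customers with no orders get a 0 total
SELECT c.id, c.first_name,
       COALESCE(SUM(o.total_order_cost), 0) AS total_cost
FROM customers c
LEFT JOIN orders o ON c.id = o.cust_id
GROUP BY c.id, c.first_name
ORDER BY c.first_name;
```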

Common Mistakes (And How To Avoid Them)

The most common mistake is grouping by first_name only. That merges different customers who share a name and produces a total that looks reasonable but is wrong.

Another mistake is selecting columns that aren’t grouped or aggregated (or “fixing” it with DISTINCT, which hides the grain problem instead of solving it). People sometimes also order by the sum instead of by first_name, which does not follow the prompt.

Validation Checks

A quick check is that the output should have one row per customer (for customers with orders), so the number of rows should match the number of distinct cust_id values present in orders.

Another check is that the total of your per-customer sums should equal the total revenue in orders over the same rows; if those don't match, you likely duplicated rows via a bad join.

Finally, spot-check a customer you can compute mentally (like a customer with one order): their total should equal that single total_order_cost.

Solution

1) Start by joining orders to customers

We first need the customer details (like first_name) alongside each order. The join happens on customers.id = orders.cust_id. At this stage, we’re not aggregating yet, just creating the combined dataset.

PostgreSQL
Tables: customers, orders
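A sketch of this first step (column names taken from the prompt and the checks above):

```sql
-- One row per order, with customer details attached (no aggregation yet)
SELECT customers.id, customers.first_name, orders.total_order_cost
FROM customers
JOIN orders ON customers.id = orders.cust_id;
```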

2) Aggregate order costs per customer

Now that each order is linked to a customer, we can sum up spending per customer. We group by the customer columns we want to return (id and first_name). This produces one row per customer with their total cost.

PostgreSQL
Tables: customers, orders
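Adding the rollup, the query becomes:

```sql
-- Collapse order rows to one row per customer
SELECT customers.id, customers.first_name, SUM(orders.total_order_cost)
FROM customers
JOIN orders ON customers.id = orders.cust_id
GROUP BY customers.id, customers.first_name;
```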

3) Final query: add sorting by first name (complete solution)

The question asks us to order records alphabetically by the customer’s first name. So we keep the aggregation exactly the same and simply add ORDER BY customers.first_name ASC. This final query is the full solution.

PostgreSQL
Go to the question on the platform
Tables: customers, orders
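The complete query, with the sort added, might look like this:

```sql
-- Total order cost per customer, sorted alphabetically by first name
SELECT customers.id, customers.first_name, SUM(orders.total_order_cost)
FROM customers
JOIN orders ON customers.id = orders.cust_id
GROUP BY customers.id, customers.first_name
ORDER BY customers.first_name ASC;
```

Leaving the SUM unaliased is why the output column below is simply called "sum"; in an interview, giving it a descriptive alias reads better.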

Output

id | first_name | sum
12 | Eva | 205
3 | Farida | 440
5 | Henry | 80
7 | Jill | 535
1 | Mark | 275
15 | Mia | 540
4 | William | 140

Amazon SQL Interview Questions #3: Workers by Department Since April

Easy | ID 9847

Find the number of workers by department who joined on or after April 1, 2014.

Output the department name along with the corresponding number of workers.

Sort the results based on the number of workers in descending order.

Go to the Question

Data View

Table: worker

This is a single-table aggregation with a date filter.

The “scenario” part is simple: you’re counting headcount by department for a time-bounded cohort (“joined on or after”). The main thing to get right is the boundary condition and the output grain.

Grain (what one output row means): one row per department in the result set.

Interview Framing (How It’s Asked + Typical Follow-Ups)

In interviews, this is often framed as “give me a breakdown by category for a cohort,” and the follow-up is usually about the boundary: “Is April 1 inclusive?” (the prompt says “on or after,” so yes).

Another common follow-up is whether departments with zero workers should appear; in this question, they won’t, because we’re grouping only over workers who meet the filter.

Common Mistakes (What Candidates Do Wrong + How To Avoid)

The most common mistake is using > instead of >= and accidentally excluding workers who joined exactly on 2014-04-01.

Another is grouping by too much (like department, joining_date), which changes the grain into daily counts. People also sometimes sort alphabetically by department instead of sorting by the computed count, which fails the prompt.

Evaluation Criteria / Rubric (How Interviewers Score It)

A strong answer applies the correct inclusive date filter, groups at the correct grain (department), and sorts by count in descending order.

An okay answer gets the right departments but doesn’t follow the sorting requirement, or uses a CTE when it isn’t needed (not wrong, just extra).

A weak answer changes the grain by grouping on unnecessary columns, or misinterprets the date boundary, and returns the wrong counts.

Solution

1) Filter only workers who joined on or after April 1, 2014

We start by narrowing the dataset to only the workers we care about. This keeps the next steps clean and avoids mixing in earlier join dates. Using a CTE here makes the query easier to read.

PostgreSQL
Tables: worker
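A sketch of this filtering step (the CTE name recent_hires is illustrative; joining_date is assumed to be a date column):

```sql
-- Keep only workers who joined on or after April 1, 2014 (inclusive boundary)
WITH recent_hires AS (
    SELECT *
    FROM worker
    WHERE joining_date >= '2014-04-01'
)
SELECT * FROM recent_hires;
```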

2) Count workers per department

Now that we have only qualifying workers, we group them by department. COUNT(worker_id) gives us how many workers are in each department. At this stage, we’re producing one row per department.

PostgreSQL
Tables: worker
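Building on that filter, the aggregation step might look like this:

```sql
-- Headcount per department among the filtered cohort
WITH recent_hires AS (
    SELECT * FROM worker WHERE joining_date >= '2014-04-01'
)
SELECT department, COUNT(worker_id) AS num_workers
FROM recent_hires
GROUP BY department;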

3) Final query: add sorting by number of workers (complete solution)

The final requirement is to sort departments by worker count in descending order. We keep the same aggregation and simply add ORDER BY num_workers DESC. This gives the final ranked result.

PostgreSQL
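The complete query, with the descending sort added:

```sql
-- Department headcounts for the cohort, largest first
WITH recent_hires AS (
    SELECT * FROM worker WHERE joining_date >= '2014-04-01'
)
SELECT department, COUNT(worker_id) AS num_workers
FROM recent_hires
GROUP BY department
ORDER BY num_workers DESC;
```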

Output

department | num_workers
Admin | 4
Account | 1
HR | 1

Medium Interview Questions

Amazon Medium SQL Interview Questions

Amazon SQL Interview Questions #4: Finding User Purchases

Last Updated: December 2020

Medium | ID 10322

Identify returning active users by finding users who made a second purchase within 1 to 7 days after their first purchase. Ignore same-day purchases. Output a list of these user_ids.

Go to the Question

Data View

Table: amazon_transactions

This table is an event log at the transaction level. A single user can have multiple purchases on the same day, which matters because the prompt is about “first purchase date” and “second purchase date,” not “first two rows in the table.”

  • Grain (what one output row means): one row per qualifying user_id.
  • What the table implies: multiple rows per user per day are possible, so you need to decide whether “purchase” means “transaction” or “purchase day.” The solution treats it as distinct purchase dates, which matches “ignore same-day purchases.”

Interview Framing (How It’s Asked + Typical Follow-Ups)

This is typically framed as a retention metric: “Who comes back within a week?” 

The follow-ups are usually about definitions: “Are multiple purchases on day 1 considered returning?” (prompt says ignore same-day, so no), “Do we care about the second transaction or the second day with activity?”, and “What if the second purchase is 20 days later but there’s also one on day 3?” In other words, interviewers are testing whether you can lock the definition of “second purchase” before you start coding.

Trade-Offs / Decision Rules (Why This Approach vs Alternatives)

The key decision is to dedupe to one row per user-day first. That makes “ignore same-day purchases” automatic, because multiple transactions collapse into a single purchase_date.

From there, using ROW_NUMBER() to label the first and second purchase dates is straightforward and readable in an interview. 

An alternative is MIN() for the first date and then MIN(purchase_date) FILTER (WHERE purchase_date > first_date) for the second date (dialect-dependent), or a self join to find the earliest later purchase. 

The window approach is usually easiest to explain out loud: “I’m ranking purchase days and grabbing the first two.”

Edge Cases (Assumptions + What Breaks the Solution)

If a user has two purchases on the same calendar day only, they should not qualify: deduping to user-day enforces that. 

If a user has purchases on many days, you still only care about the second distinct day, not "any purchase within 7 days." This is subtle: if your query checks for any pair of purchases 1-7 days apart instead of the gap between the first and second purchase days, a user who buys on days 1, 10, and 12 would incorrectly qualify.

Finally, timezone can matter if created_at is a timestamp; casting to date assumes the business definition of “day” matches the database timezone.

Solution

1) Remove same-day duplicates by keeping unique purchase dates per user

First, we convert created_at into a date and use DISTINCT so multiple same-day purchases count as a single purchase day. This is important because the prompt explicitly says to ignore same-day purchases. The result is a clean set of (user_id, purchase_date) pairs.

PostgreSQL
Tables: amazon_transactions
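A sketch of this dedupe step (casting created_at to a date, per the timezone caveat above):

```sql
-- One row per (user, purchase day); same-day duplicates collapse
SELECT DISTINCT user_id, created_at::date AS purchase_date
FROM amazon_transactions;
```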

2) Rank purchase dates per user to identify first vs second purchase

Now we assign an order to each purchase date using ROW_NUMBER(). The earliest purchase date per user gets rn = 1, and the next one gets rn = 2. This makes it easy to pick out the first two purchase days for each user.

PostgreSQL
Tables: amazon_transactions
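Layering the window function on top of the deduped set:

```sql
-- Rank each user's distinct purchase days chronologically
SELECT user_id, purchase_date,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS rn
FROM (
    SELECT DISTINCT user_id, created_at::date AS purchase_date
    FROM amazon_transactions
) AS days;
```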

3) Final query: extract first/second dates and filter to 1-7 day returning users (complete solution)

Next, we pivot the first two ranked rows into two columns: first_date and second_date. We filter out users who don’t have a second purchase day (second_date is NULL). Finally, we compute (second_date - first_date) and keep only those between 1 and 7 days.

PostgreSQL
Go to the question on the platform
Tables: amazon_transactions
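A complete query following those three steps (in PostgreSQL, subtracting two dates yields a day count):

```sql
WITH days AS (
    -- Step 1: one row per user per purchase day
    SELECT DISTINCT user_id, created_at::date AS purchase_date
    FROM amazon_transactions
),
ranked AS (
    -- Step 2: rank purchase days per user
    SELECT user_id, purchase_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS rn
    FROM days
)
-- Step 3: pivot the first two days into columns and apply the 1-7 day window
SELECT user_id
FROM (
    SELECT user_id,
           MAX(CASE WHEN rn = 1 THEN purchase_date END) AS first_date,
           MAX(CASE WHEN rn = 2 THEN purchase_date END) AS second_date
    FROM ranked
    WHERE rn <= 2
    GROUP BY user_id
) AS pivoted
WHERE second_date IS NOT NULL
  AND second_date - first_date BETWEEN 1 AND 7;
```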

Output

user_id
100
103
105
109
111
114
117
122
130
131
133
141
143

Amazon SQL Interview Questions #5: Top-Rated Support Employees

Last Updated: May 2023

Medium | ID 10554

You're analyzing employee performance at a customer support center. Management wants to identify which support agents are providing the best customer experience based on satisfaction scores.

Rank employees by their average customer satisfaction score for resolved tickets. Return the top 3 ranks, where ranks should be consecutive and should not skip numbers even if there are ties. For example, if the scores are [4.9, 4.7, 4.7, 4.5], the rankings would be [1, 2, 2, 3].

Return the employee ID, employee name, average satisfaction score, and employee rank.

Go to the Question

Data View

Table: amazon_support_tickets

This is a ticket-level table. Each row is a support ticket with a status and an optional satisfaction score. The prompt quietly enforces two filters: you only score employees on resolved tickets, and you must ignore null satisfaction scores. The output grain is employee-level.

  • Grain (what one output row means): one row per employee, with their average satisfaction over qualifying tickets, plus a rank.

Interview Framing (How It’s Asked + Typical Follow-Ups)

This usually shows up as “Who are our top performers?” but the follow-ups are where the question becomes interview-grade.

Common follow-ups include: “Do unresolved or escalated tickets count?” (here, no, only resolved), “What do you do with missing satisfaction scores?” (exclude them), and “How do you rank with ties without skipping rank numbers?”

That last line is the real requirement: they’re pushing you toward DENSE_RANK() rather than RANK().

Trade-Offs / Decision Rules (Why This Approach vs Alternatives)

The clean decision rule is to separate the problem into two stages: compute a stable employee-level metric first (average satisfaction), then rank those averages. That keeps the grain controlled and avoids ranking ticket rows by accident.

For ranking, DENSE_RANK() is the right fit because the prompt explicitly says ranks should not skip numbers under ties. RANK() would skip (e.g., 1, 2, 2, 4), which violates the requirement. ROW_NUMBER() would force a unique ordering even when averages tie, which is also not what they asked for.

Evaluation Criteria / Rubric (How Interviewers Score It)

A strong answer filters to resolved tickets, excludes null satisfaction values, aggregates to employee-level correctly, uses DENSE_RANK() (not RANK()), and returns all employees whose rank is within the top 3 ranks (including ties at rank 3). 

An okay answer gets the averages right but uses the wrong ranking function, or returns only three employees instead of “top 3 ranks.” A weak answer ranks ticket rows directly, includes unresolved/escalated tickets, or treats null satisfaction as zero, and drags averages down silently.

Solution

1) Compute average satisfaction per employee (resolved tickets only)

We first filter the table to include only tickets that were resolved. Then we exclude null satisfaction scores so they don’t distort the average. Finally, we group by employee and compute AVG(customer_satisfaction).

PostgreSQL
Tables: amazon_support_tickets
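A sketch of this step (the status literal 'Resolved' and the column names are assumed from the prompt wording and the output table):

```sql
-- Average satisfaction per employee, resolved tickets only,
-- excluding null satisfaction scores
SELECT employee_id, employee_name,
       AVG(customer_satisfaction) AS avg_satisfaction
FROM amazon_support_tickets
WHERE status = 'Resolved'
  AND customer_satisfaction IS NOT NULL
GROUP BY employee_id, employee_name;
```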

2) Rank employees by average satisfaction using DENSE_RANK()

Now that we have one row per employee, ranking becomes straightforward. We use DENSE_RANK() so ties share the same rank and the next rank doesn’t skip numbers. This matches the requirement’s example: [4.9, 4.7, 4.7, 4.5] -> [1, 2, 2, 3].

PostgreSQL
Tables: amazon_support_tickets
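Wrapping the employee-level averages in a CTE and ranking them:

```sql
WITH avg_scores AS (
    SELECT employee_id, employee_name,
           AVG(customer_satisfaction) AS avg_satisfaction
    FROM amazon_support_tickets
    WHERE status = 'Resolved'              -- assumed status literal
      AND customer_satisfaction IS NOT NULL
    GROUP BY employee_id, employee_name
)
-- DENSE_RANK keeps rank numbers consecutive under ties
SELECT employee_id, employee_name, avg_satisfaction,
       DENSE_RANK() OVER (ORDER BY avg_satisfaction DESC) AS employee_rank
FROM avg_scores;
```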

3) Final query: return only top 3 ranks (complete solution)

Finally, we filter to keep only employees whose dense rank is 1, 2, or 3. This automatically includes all employees tied within those ranks. The output includes exactly what the prompt asks for: id, name, average score, and rank.

PostgreSQL
Go to the question on the platform
Tables: amazon_support_tickets
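The complete query, filtering to the top 3 ranks:

```sql
WITH avg_scores AS (
    SELECT employee_id, employee_name,
           AVG(customer_satisfaction) AS avg_satisfaction
    FROM amazon_support_tickets
    WHERE status = 'Resolved'              -- assumed status literal
      AND customer_satisfaction IS NOT NULL
    GROUP BY employee_id, employee_name
),
ranked AS (
    SELECT employee_id, employee_name, avg_satisfaction,
           DENSE_RANK() OVER (ORDER BY avg_satisfaction DESC) AS employee_rank
    FROM avg_scores
)
-- "Top 3 ranks" means rank <= 3, which keeps all ties at rank 3
SELECT employee_id, employee_name, avg_satisfaction, employee_rank
FROM ranked
WHERE employee_rank <= 3;
```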

Output

employee_id | employee_name | avg_satisfaction | employee_rank
EMP-201 | Alice Johnson | 5 | 1
EMP-207 | Grace Hill | 4.5 | 2
EMP-203 | Carol Lee | 3 | 3
EMP-209 | Ivy Scott | 3 | 3

Hard Interview Questions

Amazon Hard SQL Interview Questions

Amazon SQL Interview Questions #6: Monthly Percentage Difference

Last Updated: December 2020

Hard | ID 10319

Given a table of purchases by date, calculate the month-over-month percentage change in revenue. The output should include the year-month date (YYYY-MM) and percentage change, rounded to the 2nd decimal point, and sorted from the beginning of the year to the end of the year. The percentage change column will be populated from the 2nd month forward and can be calculated as ((this month's revenue - last month's revenue) / last month's revenue)*100.

Go to the Question

Data View

Table: sf_transactions

This is transaction-level revenue. Each row is a purchase event with a created_at date and a value. The key move is that the metric is at the month level, so you need to roll up to one row per month first, then compare each month to the prior month.

Grain (what one output row means): one row per month.

Interview framing (how it’s asked + typical follow-ups)

In interviews, this is typically framed as “compute MoM growth” or “build a monthly KPI trend with deltas.” 

The follow-ups are usually about definitions: “Do we bucket by calendar month in UTC or local time?” and “What happens in the first month?” (no prior month, so NULL). If the interviewer wants to push, they’ll ask what you do when the prior month's revenue is zero, because the formula would divide by zero.

Trade-Offs / Decision Rules (Why This Approach vs Alternatives)

The decision rule is: aggregate monthly revenue first, then use a window function (LAG()) on the monthly totals. That keeps the logic readable and avoids self-joining the monthly table to itself.

You can compute the month key as a string (TO_CHAR) or as a real date (first day of month via date_trunc('month', created_at)), then format at the end. Using a date-month key is often safer because it preserves true chronological ordering, while strings can bite you if formatting is inconsistent. 

The provided solution is ordered by the same YYYY-MM string it groups by, which works because the format is fixed-width and lexicographically sortable.

Evaluation Criteria / Rubric (How Interviewers Score It)

A strong solution produces exactly one row per month, uses LAG() to access prior month revenue, applies the formula correctly, and returns NULL for the first month. It also sorts chronologically and rounds to 2 decimals. 

An okay solution gets the numbers right but has fragile ordering (e.g., ordering by month name or by a non-sortable string), or it repeats the same LAG(...) call multiple times without making the intent clear. 

A weak solution computes daily deltas instead of monthly, compares the wrong months, or calculates percent change incorrectly (common sign errors or dividing by the current month instead of the prior month).

Common Mistakes (What Candidates Do Wrong + How To Avoid)

A common mistake is applying LAG() directly on raw transaction rows. That gives you the previous transaction, not the previous month’s total, so the percent change becomes meaningless. 

Another mistake is grouping by MONTH(created_at) without including the year, which merges January 2019 and January 2020 into the same bucket. People also often forget to handle the first month (they try to force a value instead of leaving it NULL), or they write the percent formula as (this - last) / this, which produces a different metric.

Edge Cases (Assumptions + What Breaks the Solution)

If the prior month’s revenue is zero, the percent change is undefined (division by zero). In real pipelines, you’d usually return NULL, infinity, or define a business rule (e.g., treat as 100% if this month > 0). 

Missing months are another edge case: if a month has no transactions, it won’t appear unless you generate a calendar table. The window function will then compare non-adjacent months (e.g., March compared to January), which may or may not match the business definition of “month-over-month.”

Solution

1) Aggregate revenue to the monthly level (one row per month)

PostgreSQL
Tables: sf_transactions
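We first bucket every transaction into its calendar month and sum the revenue. A sketch (the amount column is assumed to be named value; the YYYY-MM string is safe to sort because it's fixed-width):

```sql
-- One row per month with that month's total revenue
SELECT TO_CHAR(created_at, 'YYYY-MM') AS year_month,
       SUM(value) AS monthly_revenue
FROM sf_transactions
GROUP BY TO_CHAR(created_at, 'YYYY-MM');
```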

2) Bring in last month’s revenue using LAG() (window function)

Now we add a window function so each month can “see” the revenue from the previous month. LAG(monthly_revenue) shifts the monthly revenue down by one row in chronological order. This step sets us up to calculate month-over-month change cleanly without a self-join.

PostgreSQL
Tables: sf_transactions
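Adding the window function on top of the aggregate (PostgreSQL allows window functions over aggregates in the same query level):

```sql
-- Each month now "sees" the previous month's total via LAG
SELECT TO_CHAR(created_at, 'YYYY-MM') AS year_month,
       SUM(value) AS monthly_revenue,
       LAG(SUM(value)) OVER (ORDER BY TO_CHAR(created_at, 'YYYY-MM')) AS prev_month_revenue
FROM sf_transactions
GROUP BY TO_CHAR(created_at, 'YYYY-MM');
```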

3) Final query: calculate the MoM % change and round (complete solution)

Finally, we compute the percentage change using the formula in the prompt. We round it to 2 decimals using ROUND(..., 2). The first month will naturally have a NULL percentage change because there’s no “previous month” to compare against.

PostgreSQL
Go to the question on the platform
Tables: sf_transactions
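The complete query, applying the prompt's formula and rounding (the 100.0 literal keeps the division in numeric rather than integer arithmetic):

```sql
-- MoM % change: ((this month - last month) / last month) * 100, rounded to 2 dp
SELECT TO_CHAR(created_at, 'YYYY-MM') AS year_month,
       ROUND(
           (SUM(value) - LAG(SUM(value)) OVER (ORDER BY TO_CHAR(created_at, 'YYYY-MM')))
           * 100.0
           / LAG(SUM(value)) OVER (ORDER BY TO_CHAR(created_at, 'YYYY-MM')),
           2
       ) AS revenue_diff_pct
FROM sf_transactions
GROUP BY TO_CHAR(created_at, 'YYYY-MM')
ORDER BY year_month;
```

The first month's LAG is NULL, so its revenue_diff_pct is NULL without any special handling.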

Output

year_month | revenue_diff_pct
2019-01 |
2019-02 | -28.56
2019-03 | 23.35
2019-04 | -13.84
2019-05 | 13.49
2019-06 | -2.78
2019-07 | -6
2019-08 | 28.36
2019-09 | -4.97
2019-10 | -12.68
2019-11 | 1.71
2019-12 | -2.11

Amazon SQL Interview Questions #7: Exclusive Amazon Products

Hard | ID 9608

Find products which are exclusive to only Amazon and therefore not sold at Top Shop and Macy's. Your output should include the product name, brand name, price, and rating.

Two products are considered equal if they have the same product name and same maximum retail price (mrp column).

Go to the Question

Data View

Table: innerwear_macys_com
Table: innerwear_topshop_com
Table: innerwear_amazon_com

The important modeling detail is the equality rule: "same product" is defined by the pair (product_name, mrp), not by URL, not by price, not by brand. That means you're doing a set-difference problem on a composite key.

  • Grain (what one output row means): one row per Amazon product that does not have an equivalent entry (by product_name + mrp) in TopShop or Macy’s.

Interview Framing (How It’s Asked + Typical Follow-Ups)

In interviews, this is usually framed as “find exclusives” or “dedupe across sources,” and the follow-ups are where people lose points: “What does ‘same product’ mean?”, “Do we compare by name only or name + MRP?”, and “If Macy’s lists the same product twice, does it matter?” 

Those questions are a hint that the core skill is choosing the right join key/comparison key before you write SQL.

A common follow-up is portability: "Would you solve this with NOT IN, NOT EXISTS, or an anti-join?", because correctness depends on null behavior and the key definition.

Trade-Offs / Decision Rules (Why This Approach vs Alternatives)

The provided solution uses a tuple NOT IN against a union of competitor keys. That’s concise and reads like the business requirement: “keep Amazon rows whose (name, mrp) pair is not present elsewhere.”

In practice, many people prefer NOT EXISTS (or a left anti-join) because NOT IN can behave unexpectedly if the subquery contains nulls. If product_name or mrp can be null in the competitor catalogs, a single null can make the NOT IN filter out everything, depending on dialect and query shape. 

So the decision rule is: NOT IN is fine if the key columns are guaranteed non-null; otherwise, NOT EXISTS is the safer interview choice.
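For reference, a sketch of the NOT EXISTS variant, which is null-safe on the key columns:

```sql
-- Amazon rows with no (product_name, mrp) match in either competitor catalog
SELECT a.product_name, a.brand_name, a.price, a.rating
FROM innerwear_amazon_com a
WHERE NOT EXISTS (
    SELECT 1 FROM innerwear_topshop_com t
    WHERE t.product_name = a.product_name AND t.mrp = a.mrp
)
AND NOT EXISTS (
    SELECT 1 FROM innerwear_macys_com m
    WHERE m.product_name = a.product_name AND m.mrp = a.mrp
);
```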

Edge Cases (Assumptions + What Breaks the Solution)

The biggest edge case is nulls in the comparison key. If any competitor row has product_name or mrp null, the tuple NOT IN can become unreliable. 

Another edge case is normalization: if two retailers format names differently (“Calvin Klein Women’s…” vs “Calvin Klein Womens…”), they won’t match even if they are conceptually the same product. 

This question explicitly avoids fuzzy matching by defining equality as exact product_name + mrp. Finally, duplicates don’t change the logic here because you’re comparing set membership, not counts, but you still want DISTINCT on the competitor key list to keep the subquery smaller.

Validation Checks (Fast Sanity Tests)

A quick correctness check is to take a couple of returned Amazon products and search for the same (product_name, mrp) pair in the TopShop and Macy’s tables; there should be zero matches. 

Then pick a known overlapping product from Macy’s or TopShop and confirm it does not appear in the Amazon-exclusive output. If the result set is unexpectedly empty, the first thing to test is whether competitor keys contain nulls: rewrite using NOT EXISTS and compare counts to diagnose null-related behavior.

Solution

Let’s code our logic.

1) Build the list of product keys sold at Macy’s or TopShop

We first extract only the comparison keys: product_name and mrp. UNION ALL combines both retailer lists into one dataset (duplicates don’t really hurt, but we can DISTINCT later). This gives us “everything that exists outside Amazon,” according to the equality definition.

PostgreSQL
Tables: innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com
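A sketch of this union step:

```sql
-- All (product_name, mrp) keys that exist outside Amazon
SELECT product_name, mrp FROM innerwear_macys_com
UNION ALL
SELECT product_name, mrp FROM innerwear_topshop_com;
```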

2) Deduplicate that competitor key list (so matching is clean)

Now we wrap the union and select DISTINCT to get a unique set of (product_name, mrp) pairs. This avoids repeated keys showing up multiple times if the same product appears in multiple rows. The result is the exact list we want to exclude from Amazon.

PostgreSQL
Tables: innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com
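Deduplicating the combined key list:

```sql
-- Unique competitor keys to exclude from Amazon
SELECT DISTINCT product_name, mrp
FROM (
    SELECT product_name, mrp FROM innerwear_macys_com
    UNION ALL
    SELECT product_name, mrp FROM innerwear_topshop_com
) AS competitor_products;
```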

3) Final query: return Amazon-only products (complete solution)

Finally, we select products from Amazon and exclude anything that appears in the competitor key set. The (product_name, mrp) NOT IN (...) check matches the question's equality rule exactly. Then we return the required Amazon columns: product_name, brand_name, price, and rating.

PostgreSQL
Go to the question on the platform
Tables: innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com
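The complete query with the tuple NOT IN filter (safe here only under the assumption that product_name and mrp are non-null in the competitor tables, per the trade-offs above):

```sql
SELECT product_name, brand_name, price, rating
FROM innerwear_amazon_com
WHERE (product_name, mrp) NOT IN (
    SELECT DISTINCT product_name, mrp
    FROM (
        SELECT product_name, mrp FROM innerwear_macys_com
        UNION ALL
        SELECT product_name, mrp FROM innerwear_topshop_com
    ) AS competitor_products
);
```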

Output

| product_name | brand_name | price | rating |
| --- | --- | --- | --- |
| Calvin Klein Women's Bottoms Up Hipster Panty | Calvin-Klein | $11.00 | 4.5 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4 |
| Calvin Klein Women's Carousel 3 Pack Thong | Calvin-Klein | $19.99 | 4 |
| b.tempt'd by Wacoal Women's Lace Kiss Bralette | b-temptd | $11.65 | 4 |
| Wacoal Women's Front Close T-Back Bra | Wacoal | $46.00 | 4.2 |
| Calvin Klein Women's Modern Cotton Bralette and Bikini Set | Calvin-Klein | $44.00 | 4.6 |
| Calvin Klein Women's 3 Pack Invisibles Hipster Panty | Calvin-Klein | $29.75 | 3.9 |
| Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $11.95 | 4.2 |
| Wacoal Women's Underwire Sport Bra | Wacoal | $65.00 | 4.3 |
| Hanky Panky Women's Vikini Panty | Hanky-Panky | $30.00 | 4 |
| Wacoal Women's Halo Underwire Bra | Wacoal | $48.00 | 4.4 |
| Wacoal Women's Basic Beauty Contour Bra | Wacoal | $55.00 | 4 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.5 |
| Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $51.52 | 4.2 |
| Wacoal Women's Body By Wacoal Underwire Bra | Wacoal | $43.00 | 4.4 |
| Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $60.00 | 4.1 |
| Wacoal Women's Retro Chic Contour Bra | Wacoal | $65.00 | 4.2 |
| Wacoal Women's Halo Strapless Bra | Wacoal | $46.00 | 4.2 |
| Calvin Klein Women's Modern Cotton Boyshort Panty | Calvin-Klein | $19.49 | 4.4 |
| Wacoal Women's Underwire Sport Bra | Wacoal | $52.67 | 4.3 |
| b.tempt'd by Wacoal Womens Ciao Bella Tanga Panty | b-temptd | $19.00 | 4.1 |
| Wacoal Women's Embrace Lace Bra | Wacoal | $44.50 | 4.3 |
| Calvin Klein Women's Naked Glamour Strapless Push Up Bra | Calvin-Klein | $28.27 | 3.9 |
| Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $41.40 | 4.3 |
| Wacoal Women's Amazing Assets Contour Bra | Wacoal | $65.00 | 3.8 |
| Wacoal Women's Sport Zip Front Contour Bra | Wacoal | $69.84 | 3.8 |
| Wacoal Women's Awareness Underwire Bra | Wacoal | $65.00 | 4.4 |
| Wacoal Women's Bodysuede Underwire Bra | Wacoal | $60.00 | 4.4 |
| Wacoal Women's Underwire Sport Bra | Wacoal | $45.50 | 4.3 |
| Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $60.00 | 4.1 |
| Wacoal Embrace Lace Bikini Panty | Wacoal | $27.00 | 4.3 |
| Wacoal Women's Halo Underwire Bra | Wacoal | $34.95 | 4.3 |
| Wacoal Women's Lace Affair Bikini Panty | Wacoal | $22.00 | 4.8 |
| Wacoal Women's Awareness Underwire Bra | Wacoal | $65.00 | 4.4 |
| Wacoal Women's Basic Beauty Contour Bra | Wacoal | $55.00 | 4 |
| Calvin Klein Women's Naked Glamour Strapless Push Up Bra | Calvin-Klein | $29.99 | 3.9 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $48.00 | 4.4 |
| Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $46.00 | 4.3 |
| Calvin Klein Women's 3 Pack Carousel Thong Panty | Calvin-Klein | $20.99 | 4.7 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $48.00 | 4.4 |
| Calvin Klein Women's Sheer Marquisette Demi Unlined Bra | Calvin-Klein | $36.00 | 4.6 |
| Wacoal Women's Basic Beauty Front Close Contour Bra | Wacoal | $55.00 | 4.2 |
| Wacoal Women's Retro Chic Contour Bra | Wacoal | $52.97 | 4.2 |
| Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $40.48 | 4.3 |
| Wacoal Women's Body By Wacoal Underwire Bra | Wacoal | $39.46 | 4.4 |
| Calvin Klein Women's Standard Radiant Cotton Bikini Panty | Calvin-Klein | $9.88 | 4.4 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4 |
| b.tempt'd by Wacoal Women's Ciao Bella Balconette Bra | b-temptd | $38.00 | 4.3 |
| Wacoal Women's Bodysuede Underwire Bra | Wacoal | $60.00 | 4.4 |
| Wacoal Women's Halo Underwire Bra | Wacoal | $48.00 | 4.4 |
| Wacoal Women's Embrace Lace Bra | Wacoal | $35.00 | 4.3 |
| Calvin Klein Women's Naked Glamour Strapless Push Up Bra | Calvin-Klein | $29.99 | 3.8 |
| Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $54.77 | 4.2 |
| Wacoal Women's Embrace Lace Bikini Panty | Wacoal | $27.00 | 4.3 |
| Calvin Klein Women's Sheer Marquisette Demi Unlined Bra | Calvin-Klein | $26.19 | 4.5 |
| Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2 |
| Calvin Klein Women's Everyday Lightly Lined Demi Bra | Calvin-Klein | $38.00 | 3.8 |
| Calvin Klein Women's Bottoms Up Bikini Panty | Calvin-Klein | $8.00 | 4.3 |
| Calvin Klein Women's ID Wide Waistband Unlined Triangle Cotton Bralette | Calvin-Klein | $19.01 | 3.8 |
| Wacoal Women's Body By Wacoal Underwire Bra | Wacoal | $46.00 | 4.4 |
| Calvin Klein Women's Sheer Marquisette Demi Unlined Bra | Calvin-Klein | $27.00 | 4.6 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $69.99 | 4.4 |
| b.tempt'd by Wacoal Women's Ciao Bella Balconette Bra | b-temptd | $38.00 | 4.3 |
| Wacoal Women's Red Carpet Strapless Bra | Wacoal | $65.00 | 4.4 |
| Calvin Klein Women's Seductive Comfort Lift Strapless Multiway Bra | Calvin-Klein | $39.60 | 4.1 |
| Wacoal Women's Slimline Seamless Minimizer Bra | Wacoal | $65.00 | 4.3 |
| Wacoal Women's Awareness Underwire Bra | Wacoal | $55.25 | 4.4 |
| Wacoal Women's Embrace Lace Bra | Wacoal | $50.00 | 4.3 |
| Hanky Panky Women's Bare Godiva Thong Panty | Hanky-Panky | $25.00 | 4.2 |
| Calvin Klein Women's Seductive Comfort Lift Strapless Multiway Bra | Calvin-Klein | $39.60 | 3.9 |
| Wacoal Women's Halo Underwire Bra | Wacoal | $48.00 | 4.4 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $48.00 | 4.4 |
| Calvin Klein Women's 4 Pack Stretch Lace Bikini Panty | Calvin-Klein | $28.01 | 4.1 |
| Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $34.50 | 4.3 |
| Wacoal Women's Red Carpet Strapless Bra | Wacoal | $65.00 | 4.4 |
| Calvin Klein Women's ID Tanga Wide Waistband Cotton Panty | Calvin-Klein | $16.97 | 4.3 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4 |
| Calvin Klein Women's Ombre 5 Pack Thong | Calvin-Klein | $59.99 | 5 |
| Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $60.00 | 4.2 |
| Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2 |
| Wacoal Women's Underwire Sport Bra | Wacoal | $42.90 | 4.3 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4 |
| Wacoal Women's Retro Chic Underwire Bra | Wacoal | $54.60 | 4.4 |
| Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2 |
| Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2 |
| Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $41.40 | 4.3 |
| Calvin Klein Women's Modern Cotton Bikini | Calvin-Klein | $11.36 | 4.7 |
| Wacoal Women's Sport Contour Bra | Wacoal | $47.40 | 4.2 |
| b.tempt'd by Wacoal Women's Ciao Bella Balconette Bra | b-temptd | $21.89 | 4.3 |
| Wacoal Women's Sport Contour Bra | Wacoal | $65.96 | 4.2 |
| Wacoal Women's Bodysuede Underwire Bra | Wacoal | $60.00 | 4.3 |
| Wacoal Women's Halo Strapless Bra | Wacoal | $46.00 | 4.2 |

Amazon SQL Interview Questions #8: First Day Retention Rate

Last Updated: February 2022

Difficulty: Hard (ID 2090)

Calculate the first-day retention rate of a group of video game players. First-day retention occurs when a player logs in 1 day after their first-ever log-in. Return the proportion of players who meet this definition, i.e., the number of retained players divided by the total number of players.


Data View

Table: players_logins

This dataset looks simple (just player_id and login_date), but the interview goal is whether you translate “first-ever login” into the correct baseline per player, and then avoid double-counting players who logged in multiple times.

Grain (what one output row means): one row total, a single retention_rate for the whole population.

What the table implies: multiple rows per player_id (a player can log in many times), so any join back to the raw logins can multiply rows and inflate counts if you don’t use COUNT DISTINCT.

Interview Framing (How It’s Asked + Typical Follow-Ups)

Interviewers usually phrase this as “day-1 retention” or “D1 retention,” then follow up with clarifications like “Is it exactly one day after, or within 24 hours?” and “Do multiple logins count multiple times?” 

In SQL interviews, the expected interpretation is usually a calendar-day difference (since the column is a date here), and retention is a per-player event (a player is either retained or not).

A common follow-up is: “What if someone logs in multiple times on day 1?”, which is basically a test of whether you count players, not sessions.

Trade-Offs / Decision Rules (Why This Approach vs Alternatives)

This CTE approach is clean because it separates the baseline computation (first login date) from the retention check (did they show up on first_day + 1). It’s also easy to audit step-by-step.

Alternatives that are also valid:

  • A correlated EXISTS (“does a day+1 login exist for this player?”) can be simpler to read, and it avoids join-multiplication issues naturally.
  • Pre-aggregating to distinct (player_id, login_date) first can protect you if the raw data has duplicates.

If you’re unsure about duplicates, default to COUNT(DISTINCT player_id) in numerator/denominator, or use EXISTS.
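A sketch of the EXISTS alternative (assuming the players_logins schema from the question): because the derived table has exactly one row per player, averaging a 1/0 retention flag gives the rate with no deduplication needed.

```sql
-- One row per player in f, so AVG of a 1/0 flag is the
-- retention rate -- no COUNT(DISTINCT) or join needed.
SELECT AVG(
         CASE WHEN EXISTS (
             SELECT 1
             FROM players_logins AS l
             WHERE l.player_id = f.player_id
               AND l.login_date = f.first_day + 1  -- date + integer = date in PostgreSQL
         ) THEN 1.0 ELSE 0 END
       ) AS retention_rate
FROM (
    SELECT player_id, MIN(login_date) AS first_day
    FROM players_logins
    GROUP BY player_id
) AS f;
```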

Evaluation Criteria / Rubric (How Interviewers Score It)

A strong answer usually hits these points: it correctly defines each player’s first-ever login date, checks for a login exactly one day after that baseline, counts retained players uniquely (not per row), keeps the denominator as all unique players, and returns a non-integer proportion (float/decimal), not truncated integer division.

Common Mistakes (What Candidates Do Wrong + How To Avoid)

A frequent mistake is counting rows instead of players after joining back to logins, which overstates retention for players with multiple logins. 

Another common slip is using BETWEEN 0 AND 1 (which accidentally includes same-day logins) even though the prompt explicitly says “1 day after.” People also sometimes compute “day-1 after each login” instead of after the first-ever login, which answers a different question.

Edge Cases (Assumptions + What Breaks the Solution)

If a player's first login falls near the end of the dataset window, they may have had no opportunity to log in the next day. This affects interpretation, but since the question doesn't mention censoring, we include such players in the denominator.

If timestamps (not dates) were used, you’d need to define whether “1 day” means “next calendar day” or “within 24 hours.” Also, if the table can contain duplicate (player_id, login_date) rows, you’ll want a distinct day-level table (or keep COUNT(DISTINCT ...)) to prevent inflation.

Solution

Let’s code our logic.

1) Find each player’s first-ever login date

We start by grouping by player_id and taking the minimum login_date. This gives us a single first_day anchor date per player. Once we have that, we can compare all other logins against it.

PostgreSQL
Tables: players_logins
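In PostgreSQL, this step might look like:

```sql
-- One row per player: their first-ever login date.
SELECT player_id,
       MIN(login_date) AS first_day
FROM players_logins
GROUP BY player_id;
```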

2) Identify players who logged in exactly 1 day after their first login

Now we join the first login dates back to the raw login table. This lets us compute login_date - first_day for each login event per player. We then filter to keep only events where the difference equals 1 day, which matches the retention definition.

PostgreSQL
Tables: players_logins
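One way to write this step, reusing the first-login CTE:

```sql
WITH first_logins AS (
    SELECT player_id, MIN(login_date) AS first_day
    FROM players_logins
    GROUP BY player_id
)
-- Players with at least one login exactly 1 day after first_day.
-- DISTINCT guards against counting a player twice if they had
-- multiple logins on that day.
SELECT DISTINCT l.player_id
FROM players_logins AS l
JOIN first_logins AS f
  ON l.player_id = f.player_id
WHERE l.login_date - f.first_day = 1;  -- date - date = integer days in PostgreSQL
```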

3) Final query: compute retention rate as retained / total (complete solution)

Finally, we count how many distinct players satisfy the first-day retention condition and divide by the total number of distinct players. We cast the numerator to FLOAT so we get a decimal result instead of integer division. The final output is a single value: the first-day retention rate.

PostgreSQL
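A sketch of the complete solution:

```sql
WITH first_logins AS (
    SELECT player_id, MIN(login_date) AS first_day
    FROM players_logins
    GROUP BY player_id
),
retained AS (
    SELECT DISTINCT l.player_id
    FROM players_logins AS l
    JOIN first_logins AS f
      ON l.player_id = f.player_id
    WHERE l.login_date - f.first_day = 1
)
-- Cast the numerator to FLOAT so the division doesn't
-- truncate to an integer (e.g., 1/2 = 0).
SELECT CAST((SELECT COUNT(*) FROM retained) AS FLOAT)
       / (SELECT COUNT(DISTINCT player_id) FROM players_logins)
       AS retention_rate;
```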

Output

retention_rate
0.5

How to Solve Amazon SQL Questions (Interview Framework)


Common Amazon SQL Mistakes (and How to Avoid Them)

Amazon-style SQL interviews are designed to test whether you can work with data the way it actually exists in production: messy, full of edge cases, and requiring careful thought. Here are the most common mistakes candidates make:


1) Losing track of table grain (then patching it with DISTINCT)

This happens when you join a one-row-per-user table to a many-rows-per-user fact table without thinking through the implications. Suddenly, one customer's revenue gets multiplied by however many transactions they have, and your metrics are wildly inflated.

The solution: Always state the grain out loud before writing your joins: "this table has one row per purchase, this one has one row per user." Aggregate to the grain you need before joining, or be very deliberate about when and how you group after the join.

2) Forgetting about returns, refunds, and cancellations

Negative values in your data can throw off revenue calculations significantly. If the prompt asks you to exclude returns but you forget to filter them out, your numbers will be off or even negative.

Handle this upfront: decide whether you're including or excluding these transactions, and filter explicitly (for example, WHERE purchase_amount >= 0). Amazon prompts typically call this out directly, so read the requirements carefully.

3) Treating dates as strings

This is a subtle but common error. When you group or sort dates that have been converted to strings too early, you end up with alphabetical sorting instead of chronological, 2024-02 appearing after 2024-10, for instance.

Always use proper date/timestamp functions for grouping and sorting, and only format dates for display purposes at the very end of your query. This is especially important when output requirements specify formats such as YYYY-MM.
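For instance, a monthly rollup over a hypothetical orders table (column names borrowed from the schema section earlier) might group and sort on the truncated date and format only in the output:

```sql
-- Group/sort on the real date value; format only for display.
SELECT TO_CHAR(DATE_TRUNC('month', order_date), 'YYYY-MM') AS month,
       SUM(total_order_cost) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY DATE_TRUNC('month', order_date);  -- chronological, never alphabetical
```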

4) Rolling metrics without accounting for partial windows

When calculating rolling averages or other window metrics, your first few time periods often don't have complete data windows. If you're not careful, these partial windows get treated as if they're complete, skewing your results.

Follow the prompt's definition precisely. If your data starts midstream, acknowledge that early windows are incomplete and either keep them labeled as partial or filter them out, depending on what the question actually requires.

5) Window functions that appear correct but have logical errors

Forgetting to include PARTITION BY, or using a default window frame that doesn't match the question, is extremely common. You might think you're calculating a metric per user when you're actually calculating it across all users.

Always specify what you're partitioning by (user, product, time period, etc.), order by the appropriate column, and verify your logic by tracing through one example manually. Amazon interview guidance specifically highlights missing partitions as a frequent pitfall.

6) LEFT JOINs accidentally becoming INNER JOINs

This happens when you write a LEFT JOIN to preserve all rows from your left table, then add a WHERE clause that filters on the right table. That filter drops all the non-matching rows, and converts your LEFT JOIN into an INNER JOIN.

If you want to filter the right table while keeping unmatched left rows, put those conditions in the ON clause instead of WHERE. This is one of the classic interview mistakes.
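A before/after sketch, using the customers and orders archetypes from the schema section (the date filter is a hypothetical example):

```sql
-- BROKEN: the WHERE clause on the right table silently turns the
-- LEFT JOIN into an INNER JOIN -- customers with no 2024 orders
-- vanish instead of showing a count of 0.
SELECT c.id, COUNT(o.order_id) AS orders_2024
FROM customers AS c
LEFT JOIN orders AS o ON o.user_id = c.id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.id;

-- FIXED: the right-table filter lives in ON, so unmatched
-- customers survive with NULL order rows and COUNT gives 0.
SELECT c.id, COUNT(o.order_id) AS orders_2024
FROM customers AS c
LEFT JOIN orders AS o
  ON o.user_id = c.id
 AND o.order_date >= '2024-01-01'
GROUP BY c.id;
```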

7) NULL handling errors

WHERE column = NULL doesn't work, it returns nothing. NULLs also behave unexpectedly in counts, comparisons, and aggregations, often leading to incorrect results.

Use IS NULL and IS NOT NULL for comparisons, and think through what NULL actually represents in your context: unknown, not applicable, or zero. Amazon interview materials specifically call out NULL behavior as a frequent trap.
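A quick illustration over the hypothetical orders table from earlier:

```sql
-- Wrong: = NULL is never TRUE, so this matches no rows.
SELECT * FROM orders WHERE total_order_cost = NULL;

-- Right: use IS NULL / IS NOT NULL.
SELECT * FROM orders WHERE total_order_cost IS NULL;

-- Also note: COUNT(column) skips NULLs, while COUNT(*) counts all rows.
SELECT COUNT(*)               AS all_rows,
       COUNT(total_order_cost) AS rows_with_cost
FROM orders;
```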

8) Confusion between WHERE and HAVING in grouped queries

Using WHERE to filter aggregate results generates an error, while filtering row-level data in HAVING can change your metrics in unexpected ways.

The rule is straightforward: row-level filters belong in WHERE, and aggregate filters belong in HAVING. This is one of those "easy to miss" mistakes that Amazon interview guides consistently highlight.
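Both clauses in one query, again using the hypothetical orders columns from the schema section (the date and threshold are illustrative):

```sql
-- WHERE filters individual rows BEFORE grouping;
-- HAVING filters the aggregated groups AFTER.
SELECT user_id,
       SUM(total_order_cost) AS total_spent
FROM orders
WHERE order_date >= '2024-01-01'      -- row-level filter
GROUP BY user_id
HAVING SUM(total_order_cost) > 100;   -- aggregate filter
```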

2-Week Amazon SQL Prep Plan

This plan assumes you’re practicing with Amazon-style, scenario-driven SQL questions where the real test is translating a question prompt into a clean definition (grain + constraints), then picking a pattern that survives duplicates, ties, and missing rows.

How to use this plan day-to-day

Aim for 60-90 minutes a day (or more, if you are able).

For each question, force yourself to write down (mentally or on paper) the grain and the “trap” before you touch SQL. After you solve it, do a 2-minute validation pass: row counts, spot checks, and “did I accidentally change the grain with joins?”

Week 1: Fundamentals + Amazon patterns (Easy to Medium)


Day 1: Grain discipline + basic aggregation
Do “Total Cost of Orders” (ID 10183). Focus on grouping at the correct grain (customer) and validating totals.

Day 2: Filtering cohorts by date + group and sort
Do “Workers by Department Since April” (ID 9847). Treat it as “cohort + breakdown,” and be strict about inclusive boundaries (>=).

Day 3: Max-with-constraints + ties
Do “Workers With The Highest Salaries” (ID 10353). The key is computing the max over the qualified population (titled workers) and returning all ties.

Day 4: Retention-style definition (second purchase within 1-7 days)
Do “Finding User Purchases” (ID 10322). Lock the definition of “second purchase” (second distinct day, not second row), then implement with a clear stepwise approach.

Day 5: Ranking done right (ties + top N ranks)
Do “Top-Rated Support Employees” (ID 10554). Practice choosing DENSE_RANK vs RANK vs ROW_NUMBER and explaining why.

Day 6: Review day (speed + explanation)
Re-do Day 1-5 questions under a timer. Your goal is not new SQL, it’s cleaner reasoning: state grain, state trap, then code. If you rely on DISTINCT to “fix” duplicates, stop and re-check joins/grain.

Day 7: Mock interview mini-loop
Pick any 2 questions from the week and “talk through” your approach out loud. If you can’t explain why you chose a pattern (CTE vs EXISTS vs window), you’re not interview-ready yet.

Week 2: Advanced patterns (Medium to Hard)


Day 8: Time-series KPI shaping (monthly rollup + MoM change)
Do “Monthly Percentage Difference” (ID 10319). The must-have habit: aggregate to the month first, then LAG the monthly totals.

Day 9: Set difference across sources (anti-join / NOT EXISTS / NOT IN)
Do “Exclusive Amazon Products” (ID 9608). Practice arguing the trade-off: NOT IN is concise but can be risky with NULLs; NOT EXISTS is safer.

Day 10: Retention rate as a ratio (dedupe + denominator correctness)
Do “First Day Retention Rate” (ID 2090). Your focus is counting players, not login rows, and preventing join multiplication.

Day 11: Mixed practice set (3 questions, back-to-back)
Do one easy + one medium + one hard from the list above. This simulates interview context switching. Keep your walkthrough structure consistent: state assumptions, pin the grain, name the trap, solve, then validate.

Day 12: Failure-mode day (break your own query)
Pick 2 hard-ish questions and actively try to break your solution: add a tie, add duplicates, add NULLs, remove a month, add multiple rows per key. If your query changes meaning, rewrite it in a more robust pattern.

Day 13: Full mock (45-60 minutes)
Do 2 questions under time pressure: one ranking/window question and one set logic question. Afterward, rewrite each solution to be more readable (fewer moving parts, clearer grain control).

Day 14: Final consolidation checklist
You’re ready if you can do all of this reliably: define grain first, pre-aggregate before ranking, preserve ties intentionally, choose the right anti-join pattern, and validate outputs with quick sanity checks (counts, totals, spot checks).

Practical rule of thumb for progression

Start with the Easy questions to lock in grain + aggregation discipline, move to Medium for window functions and definition traps, then finish with Hard for set logic and time-based analysis. If you’re getting stuck, it’s usually not “SQL syntax”; it's that the data definition isn’t pinned down yet.

More Big Tech SQL Interview Guides (Meta, DoorDash, Uber, etc.)

If you’re prepping for Amazon, it helps to practice on other Big Tech question styles too. The datasets change, but the patterns repeat: user activity logs, orders/transactions, time windows, and window functions.

FAQs:

How hard is the Amazon SQL interview?

Amazon’s SQL questions can range from easy to medium and hard, and they tend to be practical coding problems rather than product/theory trivia. 

For example, many multi-table questions sit at the medium level, while rolling-average problems are typically rated hard because they require a clear understanding of data manipulation and time-based calculations.

So, difficulty mostly depends on your comfort with multi-step business queries and careful requirements.

What SQL topics does Amazon ask most?

Nearly 70% of Amazon SQL questions involve data in multiple tables and require JOINs, CTEs, or subqueries. Aggregation (COUNT()/SUM() with GROUP BY) and the more complicated window functions are the other highly prevalent concepts.

For data engineering roles, EDA/validation, aggregation/metrics, joins, and text/datetime manipulation are also hot topics.

Do I need window functions for Amazon?

Yes! Plan to know window functions.

Amazon questions often use them, and interviewers generally expect candidates to reach for window functions when possible. A typical walkthrough, for instance, is built around DENSE_RANK() with PARTITION BY to rank sellers within each product category and filter to the top three per group.

What’s the best way to practice Amazon SQL?

Do targeted, repetitive practice on real interview-style questions.

It is crucial to practice solving SQL interview questions before an Amazon interview. You can build your own solution based on the framework explained in this article, try other approaches, and then deliberately cover edge cases. 

How long should I spend preparing?

If you already know SQL fundamentals, plan from 2 to 4 weeks of focused practice; if you’re still learning joins/CTEs/subqueries, budget closer to 2 to 3 months to reach an interview-ready intermediate level. 

That longer window matters for Amazon because its SQL questions often involve multi-table merging (JOIN/CTE/subquery) and frequently use window functions.
