Wine Magazine SQL Interview Question on RANK function

SQL Interview Question on RANK function from Wine Magazine


An exhaustive walkthrough to a hard-level SQL question involving windows functions, PARTITION BY clauses, CASE statements, and Common Table Expressions.

Are you a wine enthusiast? If you are, then you’ve probably heard of Wine Magazine. Let us solve one of the advanced SQL interview questions on RANK function. The question is taken from Wine Magazine that involves clauses like PARTITION BY, UNION  ALL, CASE statements, and window functions. We will use a 3-step framework to solve the problem that can be used to solve any coding problem.

Interview Question: Find the Cheapest and the Most Expensive Wine


Table: winemag_p1

Link to the question: https://platform.stratascratch.com/coding/10041-most-expensive-and-cheapest-wine

This is the question we also made a video for, so make sure you take a look.

The goal of the problem is to display the most expensive and cheapest wine for each region mentioned in the dataset. Speaking of the dataset, let us take a peek at the dataset itself.

1. Exploring the Dataset

The schema and the preview of the dataset are as follows:

Table: winemag_p1
idcountrydescriptiondesignationpointspriceprovinceregion_1region_2varietywinery
126576USRich and round, this offers plenty of concentrated blackberry notes enveloped in warm spices and supple oak. There's a hint of green tomato leaves throughout, but the lush fruit combined with sturdy grape tannins and high acidity would pair well with fatty short ribs or braised pork.Estate Club8732VirginiaVirginiaMerlotVeramar
127077ItalyThis luminous sparkler offers measured aromas of yellow fruit and honey and delivers a broad, wide approach in the mouth. The wine's texture is creamy and full and there is a spicy point of effervescence on the palate.Extra Dry8519VenetoProsecco di ValdobbiadeneProseccoVaraschin
143029USThere are some pretty features to this wine, namely the lovely tannins and rich fruit, but it's just too soft and sweet. Tastes like blackberry jam, with extra white sugar spinkled on top.Signature8345CaliforniaPaso RoblesCentral CoastCabernet SauvignonByington
98813USLovely nose, which runs from violets and cherries right into coffee and sweet chocolate. In the mouth the wine coats the palate with Napa-like intensity, only buttressed with more vivid acidity. It's both smooth and deep. The aromas are more detailed, interesting and really clean, loaded with dark flavors of soy, coffee, molasses and black cherry liqueur.Champoux & Loess Vineyards Reserve93100WashingtonWashingtonWashington OtherCabernet SauvignonBoudreaux Cellars
43172USTastes sweeter and fruitier than the 2009, which was dry, so if you like orange marmalade, pineapple tart and vanilla cookie sugariness in your Sauvignon Blancs, it's for you.Les Pionniers8640CaliforniaSonoma CountySonomaSauvignon BlancCapture

The table winemag_p1 contains columns describing the wine, its designation, the price, the province, the variety, and the region from which it comes. You must note that there are two region columns. From closer inspection, the two regions, in some cases, seem to be interrelated. For instance, Virginia is listed as the region in both columns for the Merlot from the Veramar winery. For someone who has not understood the question, this can cause confusion.

Looking back at the question, it is clear that the price applies to both regions, which means that they are not interchangeable. They are two separate regions to which the wine and its price apply. In situations like these, it is best to communicate with the interviewer and clarify your assumptions about the dataset and your approach.

There also seem to be NULL values in the columns designation, price, region_1, and region_2. These need to be filtered out while coding, especially when performing windows functions over these columns. So, it is crucial to keep an eye out for those pesky NULL values.

2. Writing out the Approach

Before beginning to code, getting the approach figured out is a smart way to go about solving coding problems. It gives you an opportunity to think about and structure your code effectively. What’s more, you can discuss your approach with the interviewer and clarify any questions you might have regarding the dataset and how it is formed. Communicating such intricacies with the interviewer demonstrates your ability to solve the problem in a well-rounded manner.

As for this problem, here is the approach we will take.

Step 1: Merge the two region columns into one.

The first obstacle in solving this problem is the region conundrum. The dataset has provided us with two columns. The wine and its price apply to both the regions mentioned in the dataset. So, it only makes sense to merge the two columns into one. Here, we will make use of UNION ALL to merge the two subsets of data with respect to the regions.

Step 2: Use CTEs or Common Table Expressions for each subquery.

Although using CTEs might look like a lot of extra lines of code, it makes the code cascade seamlessly and is easy on the eyes. Any other developer or even you would find going back to such well-organized code easy to understand and maintain. So, we will section the subqueries into CTEs as we work on the resultant table of the latest CTE.

Step 3: Rank the most expensive and the cheapest wine.

Now, we get to the nitty-gritty of the problem. Ranking. Our objective is to find the most expensive and cheapest wine for each region. So, we will use two markers or columns named ‘expensive_rank’ and ‘cheap_rank’ to rank the wine appropriately for each region from the newly merged region column.

RANK() is a windows function that we will perform over the region column using the OVER(PARTITION BY) clause. Moreover, we will perform an order on the price column, either ascending or descending, in the same OVER(PARTITION BY) clause in order to rank the wine.

Step 4: Filter out the NULL values.

If you recall seeing NULL values in the dataset preview, this is where they will appear. Since we are ranking the regions based on price in both ascending and descending orders, rows with NULL values will inevitably occupy the resultant dataset. So, the obvious next step is to exclude the rows with NULL values.

Step 5: Use CASE statements to select the 1st rank in each of the ranking columns.

Once the NULL values are removed, we have a viable dataset to work with. Now, we can use the rank columns and pick the most expensive and cheapest wine variety for each region using CASE statements.

Step 6: Select the required output columns using the Max function.

Finally, we can select the required columns, as demanded by the question, with the help of the MAX() function. We would see two rows for each region from the latest resultant table since that was a CASE statement, not an aggregation function.

So, now it is time to use an aggregate function like MAX() or MIN() so that we can group them in the end to display just one row per region. In this case, we will use MAX() function.

3. Coding the Solution

Now, let’s get our hands dirty!

Step 1: Merge the two region columns into one.

We will select region_1, variety, and the price columns in one subset and region_2 and the same two columns in another subset and perform UNION ALL. This way, we will merge the two region columns into one, along with the duplicates.

SELECT region_1 AS region,
       variety,
       price
FROM winemag_p1
UNION ALL
SELECT region_2 AS region,
       variety,
       price
FROM winemag_p1

Let us run the query and check the output.

All required columns and the first 5 rows of the solution are shown

regionvarietyprice
VirginiaMerlot32
Prosecco di ValdobbiadeneProsecco19
Paso RoblesCabernet Sauvignon45
WashingtonCabernet Sauvignon100
Sonoma CountySauvignon Blanc40


Our base table is ready.

Step 2: Use CTEs or Common Table Expressions for each subquery.

Let us turn this query into a CTE (Common Table Expression) such that it acts a temporary result set named ‘cte1’.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1)


Step 3: Rank the most expensive and the cheapest wine.

From cte1, we will create two new columns named ‘expensive_rank’ and ‘cheap_rank’. We will use these columns to rank the wine for each region in terms of price. The column ‘expensive_rank’ will sort in descending order of price, and ‘cheap_rank’ will sort in ascending order. We are using the OVER(PARTITION BY) clause here.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1)
SELECT region,
       variety,
       price,
       RANK() OVER(PARTITION BY region
                   ORDER BY price DESC) AS expensive_rank,
       RANK() OVER(PARTITION BY region
                   ORDER BY price) AS cheap_rank
FROM cte1


Run the query.

All required columns and the first 5 rows of the solution are shown

regionvarietypriceexpensive_rankcheap_rank
Alexander ValleyMerlot1931
Alexander ValleyCabernet Sauvignon3522
Alexander ValleyCabernet Sauvignon4513
Anderson ValleyPinot Noir5211
BarbarescoNebbiolo2431


We can see that the rows with NULL values for region and price have been arranged at the top of the result set. Let us remove those in the following step.

Step 4: Filter out the NULL values.

Let us filter out rows that contain NULL values in the region and price columns.

WHERE region IS NOT NULL
  AND price IS NOT NULL

Let us run the query and look at the difference in output.

All required columns and the first 5 rows of the solution are shown

regionvarietypriceexpensive_rankcheap_rank
Alexander ValleyMerlot1931
Alexander ValleyCabernet Sauvignon3522
Alexander ValleyCabernet Sauvignon4513
Anderson ValleyPinot Noir5211
BarbarescoNebbiolo2431

Here we are. Let us move on to picking the highest ranks in each of the ranking columns.

Step 5: Use CASE statements to select the 1st rank in each of the ranking columns.

Firstly, we need to make the current query into a CTE named ‘cte2’. Now, we can designate the most expensive and the cheapest wine of each region using CASE statements. And then, we will name the columns accordingly, as you can see in the query below.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1),
     cte2 AS
  (SELECT region,
          variety,
          price,
          RANK() OVER(PARTITION BY region
                      ORDER BY price DESC) AS expensive_rank,
          RANK() OVER(PARTITION BY region
                      ORDER BY price) AS cheap_rank
   FROM cte1
   WHERE region IS NOT NULL
     AND price IS NOT NULL)
SELECT region,
       CASE
           WHEN expensive_rank =1 THEN variety
       END AS most_expensive,
       CASE
           WHEN cheap_rank =1 THEN variety
       END AS cheapest
FROM cte2
WHERE expensive_rank= 1
  OR cheap_rank=1


Run the query.

All required columns and the first 5 rows of the solution are shown

regionmost_expensivecheapest
Alexander ValleyMerlot
Alexander ValleyCabernet Sauvignon
Anderson ValleyPinot NoirPinot Noir
BarbarescoNebbiolo
BarbarescoNebbiolo

Now the resultant table contains two rows for each region, with each row showing the most expensive and the cheapest wine respectively. It is time to aggregate the rows to show the expected output.

Step 6: Select the required output columns using the Max function.

The final step is to use MAX() function to pick the NON-NULL value out of the two columns and assign the rank. To perform on the previous query, we will again section that into another CTE named ‘cte3’.

In the end, we will use the GROUP BY clause since we’re using an aggregate function in this query. And we will ORDER BY 1. Here is the final query.

WITH cte1 AS
  (SELECT region_1 AS region,
          variety,
          price
   FROM winemag_p1
   UNION ALL SELECT region_2 AS region,
                    variety,
                    price
   FROM winemag_p1),
     cte2 AS
  (SELECT region,
          variety,
          price,
          RANK() OVER(PARTITION BY region
                      ORDER BY price DESC) AS expensive_rank,
          RANK() OVER(PARTITION BY region
                      ORDER BY price) AS cheap_rank
   FROM cte1
   WHERE region IS NOT NULL
     AND price IS NOT NULL),
     cte3 AS
  (SELECT region,
          CASE
              WHEN expensive_rank =1 THEN variety
          END AS most_expensive,
          CASE
              WHEN cheap_rank =1 THEN variety
          END AS cheapest
   FROM cte2
   WHERE expensive_rank = 1
     OR cheap_rank=1)
SELECT region,
       MAX(most_expensive) AS most_expensive,
       MAX(cheapest) AS cheapest
FROM cte3
GROUP BY 1
ORDER BY 1


Let us run the query and see the final solution to our problem.

All required columns and the first 5 rows of the solution are shown

regionmost_expensive_varietycheapest_variety
Alexander ValleyCabernet SauvignonMerlot
Anderson ValleyPinot NoirPinot Noir
BarbarescoNebbioloNebbiolo
Brunello di MontalcinoSangioveseSangiovese
CaliforniaPinot NoirSauvignon Blanc

Conclusion

And, there it is! We have solved the problem with advanced SQL concepts like CASE statements, PARTITION BY clauses, as well as RANK() and MAX() functions. As we mentioned earlier, once you lay down your approach, the rest is a walk in the park. Hope you had fun solving this advanced problem. See you here again!

SQL Interview Question on RANK function from Wine Magazine


Become a data expert. Subscribe to our newsletter.