The Ultimate Guide to Sorting in SQL
This ultimate guide is an overview of SQL ranking functions, their syntax, and real-life interview examples sourced from StrataScratch interview problems.
Data Scientists use SQL to query and manipulate datasets, and sorting results according to business needs are ubiquitous in real-world applications. In addition to real-world applications, SQL sorting problems frequently arise in interviews as well, providing an incentive for data scientists both budding and experienced to master them.
For example, in a business scenario data scientists may be tasked with scripting a query for the twenty customers with the largest decrease in year-over-year order volume. After identifying the firms that are less active in the current year, business leaders can develop appropriate strategies for each and improve business performance.
In data analytics or data science interviews, you may be asked to sort the most busy day for web traffic in descending order. Familiarity with SQL sorting principles will help you to write efficient and accurate queries.
In this article, we take a look at SQL sorting problems using real-world interview examples from StrataScratch’s proprietary database. We will review best practices to approach and the skills required to answer SQL sorting problems correctly.
SQL Sorting Questions
Sorting problems in SQL often require data scientists to wrangle the relevant data and sort for the top or bottom N results. While parameters and business cases may vary, the syntax of sorting queries is fairly consistent and through practice you can improve your performance on these important and common problems. Sorting interview problems often require data science candidates to understand how to identify the appropriate data and sort it according to an attribute.
In this article, we will cover how to sort pre-process data and sort the data to accurately come to a result. More specifically, we help you to develop the technical framework necessary to understand grouping observations, sorting in a specific order by a column, adding criteria to exclude irrelevant rows, and how to determine the number of rows to include in the final result. We will accomplish this by reviewing three example interview problems. These range in difficulty from relatively simple problems to more complex, but the underlying principles are the same.
- First, view your table and ensure that you understand your dataset. Look at the column names and a few values in each column if possible. If there is only one table, this will be the table referenced by the FROM command. If there are more than one, you may be required to join these tables.
- Next, identify the attributes that your question concerns. For example, if your question is referencing ‘company’ or ‘seller’, these are likely to be included in your SELECT function.
- For our third step, determine if there are any additional criteria such as ‘include results which happened in Asia’ or ‘in the past 6 months’. If so, these are addressed using WHERE and reduce the size of our data by excluding data which doesn’t meet our criteria.
- Next, identify if there are categories mentioned to group together. GROUP BY converts values to a summary of records which meet that criteria. For example, you may be asked to rank orders by fiscal month. In that case, you’d want to GROUP BY on the fiscal month attribute.
- Lastly, we specify the attribute to sort our results on. This can be performed in an ascending or descending manner, meaning from least to greatest or greatest to least. If we’re asked to find the top 5 baseball teams that hit the most home runs in 2021’s MLB season, we’re interested in ‘home runs’ and would sort descending (greatest to least) with a limit of 5. The limit is accomplished using LIMIT, and following with numerical value.
While other commands are found in sorting problems, an understanding of SELECT, FROM, WHERE, ORDER BY, GROUP BY, and LIMIT will position you well to address these common questions. Let’s begin.
Example 1 - Forbes Interview Question
In a recent Forbes interview, a candidate was asked to find the most profitable financial company, and to return both its name and continent.
Link to the Forbes interview question on most profitable company.
A possible solution is to approach this question as below - using SELECT, FROM, WHERE, ORDER BY, and LIMIT.
SELECT company, continent FROM forbes_global_2010_2014 WHERE sector = 'Financials' ORDER BY profits DESC LIMIT 1;
First, we use the SELECT command in a fairly straightforward way, as the request only requires the company’s name and its associated continent. While more complicated examples may require you to complete aggregate functions or to define variables early in a query, this one only needs the name of the appropriate fields to return.
Be sure to not miss points here. Double-check the question and confirm the relevant attributes. For phone interviews, you can follow-up with the interviewer to ensure that you understand the attributes of interest. In a work or real-world situation, be sure to ask questions to understand the problem prior to running your query. For firm’s that use cloud services, wasteful queries may result in higher service costs.
Next, we use the FROM statement in a simple manner to specify the table that we should use to run our query from. While in more complicated queries you may be required to perform JOINs, here we only need to specify the desired table. Check out our post on SQL JOIN interview questions that covers the most common JOIN interview questions.
Even in the event of a more complicated query, the general approach remains the same. FROM directs SQL on the table to perform the desired query.
For the next step, let’s remember that we were asked to include only financial firms in the result. Using WHERE, we can accomplish this easily by setting a criteria to include attributes where the ‘sector’ column’s value is equal to ‘Financials’. By selecting only the rows with only financial firms, we remove irrelevant rows from our result. It’s also worth noting that you can exclude criteria explicitly by writing the same query but instead including an exclamation point prior to the equal sign (!=). This simply means ‘is not equal to’.
ORDER BY is of particular importance to the successful completion of our query. The mechanical aspect of ‘sorting’ is accomplished with ORDER BY, as it delivers the instructions on which column to ‘sort’ the results on. For this problem, we’re most interested in the profit column.
ORDER BY is a flexible function that can be performed in either ‘ascending’ or ‘descending’ manner. Ascending order begins with the lowest value in a series, and counts upward towards greater values. By default, SQL sorts in ascending order. Descending order begins with the greatest value in a column, and sorts the series from ‘greatest-to-least’.
Lastly, LIMIT sets a maximum number of results to include. If we ran the above query without the LIMIT command, our result would include every row in the table instead of only the most profitable company. By setting a limit of one, our query returns only the first result. In the event that you are asked to include more results, adjust this number accordingly.
Now that we’ve gone through the entire query, let’s do a quick review of the logic of the problem.
From the ‘forbes_global_2010_2014’ table, we select the company name and related continent values. We filter out all instances in which the sector is financial, and then sort the remaining results by profit value beginning with the greatest number. We then set a limit to include only the top result. This process adequately answers our initial question and is easily read by any other data scientist or analyst which follows you.
Example 2 - Airbnb Interview Question
Next, we have an Airbnb question where the candidate is asked to find the top ten hotels with the most positive reviews in the summer of 2017, but including only results between June and August.
Link to the Airbnb question on top ten hotels.
Then, we are expected to output the hotel name along with the corresponding number of positive reviews. These results are to be sorted based on the number of positive reviews in descending order.
A community submitted solution took the following approach. Do you notice how most of the functions were also used in our prior example? Let’s take a look through this solution to get an idea of how to solve slightly more complex SQL sorting problems.
select hotel_name,count(positive_review) as n_positive_reviews from hotel_reviews where review_date BETWEEN '6/1/17' and '8/31/17' and positive_review != 'No Positive' group by hotel_name order by n_positive_reviews desc limit 10;
SELECT is again used to select relevant fields from our table, but in this example we also need to calculate the number of positive reviews. In order to count the number of positive reviews, we use the aggregate function ‘count’, and define an alias using AS. We call our alias ‘n_positive_reviews. Our alias is a temporarily stored variable that we define manually, and allows for calculated values to have identifiable names.
FROM again identifies the appropriate table which contains our data. Similarly to our last problem, all data is stored in a single table.
The WHERE function is used to define our parameters for both the date range and to only include positive results. This is accomplished in tandem with the BETWEEN function and defining the dates. Note that the date format in the query matches the format in the table. By specifying the dates between the first of June and the last day of August, this ensures that our result table only includes the relevant dates - the months of June, July, and August. Additionally, AND is used to reference the ‘positive_review’ column, and to exclude observations in which the value is ‘No Positive’. The ‘!=’ in the query ultimately means ‘to exclude results where the value is equal to the following ___’. More concisely, it is used to mean ‘is not equal to’, and originates in mathematics.
The GROUP BY statement here instructs the query to aggregate results based on the value in the ‘hotel_name’ column. Remember - our problem asked us to find the top ten hotels with the most positive reviews between June and August. The name of the hotels are stored in ‘hotel_name’, so make sure to include this in your result.
Next, we sort the results with the ORDER BY function. By referencing the variable that was initiated earlier (n_positive_reviews), the query will sort based on the count of positive reviews using this column. Performing the sort based on a descending order ensures that our greatest values are at the top of the table.
Finally, we LIMIT our result to include ten results, and due to our prior ORDER BY statement, our top ten greatest values.
Example 3 - Yelp Interview Question
In our final example, we are tasked with finding the 5 states with the most five star businesses.
Link to the Yelp question on five star businesses.
A possible approach is to view the problem as follows, using SELECT, FROM, the aggregate function COUNT, GROUP BY, ORDER BY, and LIMIT to address the request.
select state, five_stars from (select state, count(stars) as five_stars from yelp_business where stars=5 group by state) a order by five_stars desc limit 5;
First, we use SELECT to include the two variables that are of interest - state, and a variable that we will define later, ‘five_stars’. As the question specifies the importance of the ‘stars’ column, our variable ‘five_stars’ references ‘stars’ in its calculation.
Next, FROM is a bit more active here than in other examples. Rather than simply specifying the name of the table that we intend to reference, FROM is used to accomplish a more sophisticated task. In order to better understand what is happening, let’s read from the end of line 2 backward. First, we explicitly define the table that we are referencing - ‘yelp_business’. This is practically identical to our other practice problems. Next, we define our variable called ‘five_stars’ from our SELECT function. ‘Five_stars’ is the alias used for a table which contains ‘state’ and performs COUNT ‘stars’. Next, the code instructs to only include rows WHERE the ‘stars’ column is equal to 5, and groups on state using GROUP BY.
Let’s take a moment and see what we’ve done so far. We’ve queried the yelp_business table on its own ‘state’ column and an additional column that we created through an alias (five_stars). The variable ‘five_stars’ contains ‘state’ and pairs it with an additional column which counts the number of instances where the ‘stars’ column is equal to 5. Our initial request was to find the five states with the most five star restaurants, and by counting five star restaurants by state, we’re well on our way to completing the problem.
Now that we have a result that counts the number of five star restaurants, all that’s left is to sort the results greatest-to-least and to include the top 5.
Next, we use ORDER BY to prepare to sort by recalling our ‘five_stars’ table, and indicate to sort descendingly. Remember, this means to return the greatest value at the top of the table.
Finally, we set our LIMIT function to include only the top 5 results in accordance with our problem’s parameters.
Also, check out our ultimate guide on SQL Window Functions to have an overview of the types of SQL window functions, their syntax and real-life examples of how to use them.
Mastering sorting in SQL is useful for both amateur and advanced programmers regardless of industry. Data scientists are likely to come across many use cases of sorting, and perfecting your skills in the technique improves your ability to make an impact with your team. The general principles of sorting in SQL are consistent, and efficiency is developed through repetition. StrataScratch offers thousands of practice problems to prepare you for your next interview or to improve your data skills for your job in data science.