An In-Depth Guide to SQL SELECT DISTINCT Statement
Categories
SQL SELECT DISTINCT is a powerful statement for removing duplicates from your data. Learn how to use it in different scenarios and what its alternatives are.
Being unique is what you should be, they tell you all the time. You shouldn’t be someone’s duplicate! If only your data had listened to this advice and sorted itself out!
But, no, you need to go through every database row and decide if the row is duplicate or not. Or, at least, that’s a story you tell your boss to paint your job as being of colossal difficulty, possibly also to cause some pity and a pay raise.
As if you don’t use SELECT DISTINCT to remove duplicates in a matter of seconds. You don’t? You really go through each and every row to find duplicates. Don’t tell me; you export data to Excel and then use conditional formatting to highlight duplicates.No, no, there’s an easier way. Let me show it to you so you can spend more time doing something useful at your work: solving sudoku, drinking coffee, or watching a video of a girl burping in a car on repeat.
What is a SELECT DISTINCT Statement in SQL?
The SELECT DISTINCT statement is used to retrieve unique records from the database. It filters out duplicate values in the result set, ensuring that every row returned is different in some specified aspect. This functionality is vital in scenarios where only distinct entries are required from a large set of data.
When working with data, you’ll often have to make sure that your report or analysis includes only unique values. This is where SELECT DISTINCT can help you.
What is the DISTINCT Clause, and How Does It Modify the Behavior of the SELECT Statement
The DISTINCT clause’s primary function is to ensure that the results returned by a query are unique (or distinct) by removing duplicate rows from the result set. It acts as a filter, screening out repeated occurrences of the same data.
Under normal circumstances, you know that the SELECT statement will retrieve all rows from the specified columns in a table. But when used with DISTINCT, it will remove duplicates and return only unique rows of the column (when one column is selected) or a combination of columns (when multiple columns are selected).
The database executes the SELECT DISTINCT statement exactly in this order:
- SELECT is executed to return all the requested columns
- DISTINCT is executed to return duplicate rows from the output.
Due to that, the SELECT DISTINCT statement requires additional processing, which can be resource-intensive on large datasets or complex queries. Therefore, it’s essential to use DISTINCT judiciously, especially in environments where performance and response time are critical.
Syntax of SQL SELECT DISTINCT
The basic syntax of the SQL SELECT DISTINCT statement is very simple. You write a regular SELECT statement, only with DISTINCT after the keyword SELECT, like this. Or before the columns, depending on how you look at it.
SELECT DISTINCT column_1
FROM table;
Practical Applications of SELECT DISTINCT
Oh, boy, you’ll be using SQL SELECT DISTINCT a lot once you realize how useful it is in scenarios that revolve around data uniqueness.
Use Cases Where SELECT DISTINCT Is Essential
Here are examples and scenarios where data uniqueness is important, and SELECT DISTINCT comes in handy. These are all situations that are very common in the everyday work of most data professionals.
1. Data Reporting and Analysis
Ensuring data accuracy in your reporting and analysis is crucial, so you’ll find SELECT DISTINCT useful when:
- Generating Unique Lists: For creating reports that require a list of unique items, such as a list of all the unique products sold in a store.
- Summarizing Data: Such as finding the total number of unique customers who made purchases within a specific period.
2. Data Cleaning and Preprocessing
Data often contains duplicates, which can skew analysis results. SELECT DISTINCT helps you with:
- Removing Duplicates: By using SELECT DISTINCT before analysis, you can identify and remove duplicate entries from the dataset.
- Consistency Checks: Used for consistency checks in data, it ensures that entries like IDs or codes are unique as required.
3. Database Design and Restructuring
When designing and restructuring databases, you can use SELECT DISTINCT in:
- Identify Unique Values: This is required for creating a primary key or when wanting to understand the relationship between different tables.
- Normalization Process: In the database normalization process, SELECT DISTINCT can help identify redundant data, which can then be moved to separate tables to reduce duplication.
4. Identifying Unique Combinations
When used with multiple columns, SELECT DISTINCT can help you find unique combinations of data across multiple columns—for instance, identifying unique combinations of country and city names in a customer database.
5. Data Integrity Checks
Data integrity is an essential concept for anyone working with data. SELECT DISTINCT can be used when doing data integrity checks by:
- Checking Redundancies: Regular checks for redundancies or anomalies in the data, especially in large databases where manual checking is impractical.
- Ensuring Uniqueness Constraints: Verifying that data supposed to be unique, such as email addresses or usernames, remains unique.
6. Business Intelligence and Decision Making
For strategic business decisions, SQL SELECT DISTINCT can be used in the following ways:
- Market Analysis: Understanding the diversity of a product range, customer base, or vendor list by identifying unique entities.
- Customer Segmentation: Segmenting customers based on unique characteristics which requires identifying those unique characteristics first.
7. Performance Optimization
When wanting to optimize data performance, SELECT DISTINCT is helpful as it can reduce data load. It reduces the amount of data processed in subsequent operations, leading to more efficient query performance in some cases.
8. Unique Record Extraction for Joins
When joining two, three or more tables in SQL, you could easily end up with duplicate values when you don’t want them. With SELECT DISTINCT, you can prepare data for joins. This ensures that the data being joined does not have duplicates, which could lead to incorrect join results.
9. Generating Distinct Counts
When doing statistical analysis, you can use SELECT DISTINCT in conjunction with COUNT to find the count of unique entries in a column.
Example From Real-World Databases
Now, let’s see how SQL SELECT DISTINCT is used with real-world data and in actual code. It's probably the thing you’re most interested in, as you want to see how to ditch Excel for something much more attractive.
Here’s an interview question by Spotify. The scenario it poses is the one many data professionals can relate to, such as data engineers, data scientists, BI analysts, data analysts, and ML engineers.
Find all the songs that were top-ranked (at first position) at least once in the past 20 years
Link to the question: https://platform.stratascratch.com/coding/10283-find-the-top-ranked-songs-for-the-past-30-years
The question asks you to find all the songs that were top-ranked at least once in the past 20 years. The question is phrased this way because you need to work with Spotify data. But the task is applicable to many other similar situations, like finding a top-ranked product, salesman, customer (by number of orders or their value), the most expensive product, etc.
The question gives you one table to work with, namely billboard_top_100_year_end.
year | year_rank | group_name | artist | song_name | id |
---|---|---|---|---|---|
1956 | 1 | Elvis Presley | Elvis Presley | Heartbreak Hotel | 1 |
1956 | 2 | Elvis Presley | Elvis Presley | Don't Be Cruel | 2 |
1956 | 3 | Nelson Riddle | Nelson Riddle | Lisbon Antigua | 3 |
1956 | 4 | Platters | Platters | My Prayer | 4 |
1956 | 5 | Gogi Grant | Gogi Grant | The Wayward Wind | 5 |
Why do you think we need a SELECT DISTINCT to solve this problem?
First of all, it’s the logic of the problem. It requires you to show songs ranked first at least once in the last 20 years. Why would anyone show one song every time it ranks first?
Knowing how fast songs go in and out of fashion in pop music, you could argue that it’s highly unlikely that one song was top-ranked in more than one year. You’re probably right about that. But, here enters the second reason why you should use DISTINCT: knowing your dataset and its logic.
Here’s a snapshot of the dataset.
year | year_rank | group_name | artist | song_name | id |
---|---|---|---|---|---|
2012 | 1 | Gotye feat. Kimbra | Gotye | Somebody That I Used To Know | 6203 |
2012 | 1 | Gotye feat. Kimbra | Kimbra | Somebody That I Used To Know | 6204 |
As you can see, the song ‘Somebody That I Used To Know’ is top-ranked but appears twice. Why? Because it’s a feature song, it appears once under artist Gotye and the second time under artist Kimbra. As we want to output only the song name, this and all other songs like this would appear at least two times. You see now why it’s important to know your dataset and then use DISTINCT to avoid skewing your little report.
OK, let’s see the code.
The syntax is the same as explained earlier: write DISTINCT after SELECT and then the column you want to select. In this example, it’s the column song_name.
We also need to apply certain criteria in WHERE to filter data. The first condition is that the ranking equals one. Also, only songs that were top-ranked in the last 20 years are accepted. So, I use the DATE_PART() function to get the year part of the current date and then subtract the column year from it. That way, you get the number of years between the current year and the year of the song’s top rank. This result must be less than or equal to 20.
SELECT DISTINCT song_name
FROM billboard_top_100_year_end
WHERE year_rank = 1 AND
DATE_PART('year', CURRENT_DATE) - year <= 20;
The output is a list of songs that satisfy the requirements.
song_name |
---|
Bad Day |
Boom Boom Pow |
Irreplaceable |
Low |
Rolling In The Deep |
Now, the above example is an elementary application of DISTINCT. Of course, it’s not the only one, as DISTINCT can also be used with multiple columns as well as with other SQL clauses and functions.
SQL SELECT DISTINCT With Multiple Columns
When using SELECT DISTINCT with multiple columns, you write DISTINCT only once and then list all the columns you want.
Be careful when doing this! The output is not unique values for each column separately. The output is all unique combinations of all the columns you listed.
Also, you can’t somehow exclude one of the listed columns from the DISTINCT clause; they are all always included in the distinct combination.
I’ll solve this question by the City of San Francisco to show you how this works.
Find the business names that scored less than 50 in inspections. Output the result along with the corresponding inspection date and the score.
Link to the question: https://platform.stratascratch.com/coding/9720-find-the-business-names-that-have-inspection-scores-of-less-than-50
You need to find the names of the businesses with a score lower than 50, along with the corresponding inspection date and score.
Here’s the overview of the table sf_restaurant_health_violations we need to work with.
business_id | business_name | business_address | business_city | business_state | business_postal_code | business_latitude | business_longitude | business_location | business_phone_number | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5800 | John Chin Elementary School | 350 Broadway St | San Francisco | CA | 94133 | 37.798 | -122.403 | {'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 5800_20171017 | 2017-10-17 | 98 | Routine - Unscheduled | 5800_20171017_103149 | Wiping cloths not clean or properly stored or inadequate sanitizer | Low Risk | |
64236 | Sutter Pub and Restaurant | 700 Sutter St | San Francisco | CA | 94102 | 37.789 | -122.412 | {'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 64236_20170725 | 2017-07-25 | 88 | Routine - Unscheduled | 64236_20170725_103133 | Foods not protected from contamination | Moderate Risk | |
1991 | SRI THAI CUISINE | 4621 LINCOLN Way | San Francisco | CA | 94122 | 37.764 | -122.508 | {'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 1991_20171129 | 2017-11-29 | 86 | Routine - Unscheduled | 1991_20171129_103139 | Improper food storage | Low Risk | |
3816 | Washington Bakery & Restaurant | 733 Washington St | San Francisco | CA | 94108 | 37.795 | -122.406 | {'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 3816_20160728 | 2016-07-28 | 67 | Routine - Unscheduled | 3816_20160728_103108 | Contaminated or adulterated food | High Risk | |
39119 | Brothers Restaurant | 4128 GEARY Blvd | San Francisco | CA | 94118 | 37.781 | -122.464 | {'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'} | 39119_20160718 | 2016-07-18 | 79 | Routine - Unscheduled | 39119_20160718_103133 | Foods not protected from contamination | Moderate Risk |
Solving this question requires using SELECT DISTINCT on multiple columns.
The syntax is straightforward: write SELECT DISTINCT and then list all the columns the question asks you to, namely business_name, inspection_date, and inspection_score.
After referencing the dataset, use WHERE to apply a filtering condition. It will output only data where the inspection score is lower than 50.
SELECT DISTINCT business_name,
inspection_date,
inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;
The code returns two businesses with the relevant data.
business_name | inspection_date | inspection_score |
---|---|---|
Da Cafe | 2016-09-07 | 48 |
Lollipot | 2018-05-22 | 45 |
Now, the question is, could we have solved this question without DISTINCT? In terms of returning the names of the same two restaurants, the answer is yes. Simply remove the DISTINCT keyword, and you get this code.
SELECT business_name,
inspection_date,
inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;
Yes, this solution is also accepted; run it to see. However, you see what your output looks like, right?
business_name | inspection_date | inspection_score |
---|---|---|
Lollipot | 2018-05-22 | 45 |
Lollipot | 2018-05-22 | 45 |
Lollipot | 2018-05-22 | 45 |
Da Cafe | 2016-09-07 | 48 |
Lollipot | 2018-05-22 | 45 |
Lollipot | 2018-05-22 | 45 |
Lollipot | 2018-05-22 | 45 |
Da Cafe | 2016-09-07 | 48 |
Da Cafe | 2016-09-07 | 48 |
Da Cafe | 2016-09-07 | 48 |
Mhm, it returns Lollipot and De Cafe – the same as the earlier solution with DISTINCT – but they are both repeated several times. Why does the query do that despite the same business name, inspection date, and inspection score?
Well, again, knowing your data comes in handy. There are columns violation_id and violation_description in the dataset. There can be (and there are, obviously!) multiple violation IDs – hence, numerous violation descriptions – by each inspection. So, the values we need – business name, inspection date, and score – will be repeated whenever there’s more than one violation per inspection.
That is why using SELECT DISTINCT on multiple columns is a better option here. If nothing, the output is nicer without the redundant data.
SQL DISTINCT With Aggregate Functions
The SQL distinct can also be used with the SQL aggregate functions. The purpose is to include only the unique values in the calculation. For example, the most common use of DISTINCT is with COUNT(). That combination returns the number of unique values, i.e., counting the same data value only once.
Let’s see how this works by solving the medium-difficulty interview question and using DISTINCT with COUNT().
Interview Question Date: December 2020
You are given a table of product launches by company by year. Write a query to count the net difference between the number of products companies launched in 2020 with the number of products companies launched in the previous year. Output the name of the companies and a net difference of net products released for 2020 compared to the previous year.
Link to the question: https://platform.stratascratch.com/coding/10318-new-products
You need to count the net difference between the number of products companies launched in 2020, and the number of products the companies launched in the previous year.
The question gives you the table car_launches, where you can find the car make, model, and launch year.
year | company_name | product_name |
---|---|---|
2019 | Toyota | Avalon |
2019 | Toyota | Camry |
2020 | Toyota | Corolla |
2019 | Honda | Accord |
2019 | Honda | Passport |
The solution query is a bit more complicated, so I’ll start writing it from the subqueries.
The first subquery shows companies and car models that were launched in 2020. The filter is applied in WHERE on the column year.
SELECT company_name,
product_name AS brand_2020
FROM car_launches
WHERE year = 2020;
The second query is exactly the same, except that it shows models launched in 2019.
SELECT company_name,
product_name AS brand_2019
FROM car_launches
WHERE year = 2019;
I need to join these two subqueries in the FROM clause. I use FULL JOIN because I want all data from both subqueries. The subqueries are joined on the company name.
SELECT *
FROM
(SELECT company_name,
product_name AS product_2020
FROM car_launches
WHERE year = 2020) a
FULL OUTER JOIN
(SELECT company_name,
product_name AS product_2019
FROM car_launches
WHERE year = 2019) b ON a.company_name = b.company_name;
Here’s the sample of the above query’s output.
company_name | product_2020 | company_name | product_2019 |
---|---|---|---|
Toyota | Corolla | Toyota | Camry |
Toyota | Corolla | Toyota | Avalon |
Honda | Pilot | Honda | Civic |
Honda | Pilot | Honda | CR-V |
Honda | Pilot | Honda | Passport |
As you can see, the models launched in 2020 can appear more than once. That’s because they are, when FULL JOINed, paired with the belonging company’s every model launched in 2019. So, if there are several models launched in 2019, the models launched in 2020 will appear multiple times.
The same is true in the opposite case.
So, we need to use DISTINCT somehow and exclude this duplicate data. First, let’s replace the asterisk (*) in SELECT with the actual columns we need. First, we select the company name.
Then, we need to calculate the difference between the number of products launched in 2020 and the year before. So, we first count the number of products launched in 2020. We do that by using COUNT() and then DISTINCT with the column brand_2020. This will count each product only once.
Remember: DISTINCT goes inside COUNT(), not outside!
Then, we do the same for the products launched in 2019. After that, we subtract these two values and name the column net_products.
To get these values by company, we must group them by the company name. As a final touch, let’s sort the data alphabetically by company name using ORDER BY.
SELECT a.company_name,
(COUNT(DISTINCT a.product_2020)-COUNT(DISTINCT b.product_2019)) AS net_products
FROM
(SELECT company_name,
product_name AS product_2020
FROM car_launches
WHERE year = 2020) a
FULL OUTER JOIN
(SELECT company_name,
product_name AS product_2019
FROM car_launches
WHERE year = 2019) b ON a.company_name = b.company_name
GROUP BY a.company_name
ORDER BY company_name;
The output shows five companies and the net difference between the number of products in two consecutive years.
company_name | net_products |
---|---|
Chevrolet | 2 |
Ford | -1 |
Honda | -3 |
Jeep | 1 |
Toyota | -1 |
SQL DISTINCT With HAVING
You can also use DISTINCT in the HAVING clause. Since HAVING is used for filtering aggregated data, you can use aggregate functions in it. And this leads to the conclusion that you can also use both aggregate functions and DISTINCT in HAVING.
Let’s see how to do that! Here’s a question from the Burtch Works, Glassdoor, and Salesforce interviews.
Interview Question Date: May 2019
Find departments with at more than or equal 5 employees.
Link to the question: https://platform.stratascratch.com/coding/9911-departments-with-5-employees
The question wants you to find departments with five or more employees.
It gives you the table employee to work with. It’s a list of the company’s employees and their details.
id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | city | address | manager_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | Max | George | 26 | M | Sales | Sales | 1300 | 200 | 150 | Max@company.com | California | 2638 Richards Avenue | 1 |
13 | Katty | Bond | 56 | F | Manager | Management | 150000 | 0 | 300 | Katty@company.com | Arizona | 1 | |
11 | Richerd | Gear | 57 | M | Manager | Management | 250000 | 0 | 300 | Richerd@company.com | Alabama | 1 | |
10 | Jennifer | Dion | 34 | F | Sales | Sales | 1000 | 200 | 150 | Jennifer@company.com | Alabama | 13 | |
19 | George | Joe | 50 | M | Manager | Management | 100000 | 0 | 300 | George@company.com | Florida | 1003 Wyatt Street | 1 |
To solve this problem, let’s start by selecting the departments from the table. Then, we can group by department to show each department only once.
Now that we have the list of departments, we can find those that have five or more employees. We use HAVING to filter data. The filtering condition uses COUNT() with DISTINCT to find the count of each employee only once and returns only those rows where the result is greater than or equal to five.
SELECT department
FROM employee
GROUP BY department
HAVING COUNT(DISTINCT id) >= 5;
The output shows that the sales department is the only one with five or more employees.
department |
---|
Sales |
SQL DISTINCT With CASE WHEN
Another possible use of the DISTINCT clause is in the CASE WHEN expression. One good example of this is the interview question by Apple and Google.
Find the number of Apple product users and the number of total users with a device and group the counts by language. Assume Apple products are only MacBook-Pro, iPhone 5s, and iPad-air. Output the language along with the total number of Apple users and users with any device. Order your results based on the number of total users in descending order.
Link to the question: https://platform.stratascratch.com/coding/10141-apple-product-counts
You’re required to find the number of Apple product users and the number of total users with a device and group the counts by language.
The question gives you two tables. The first is playbook_events, with the preview shown below.
user_id | occurred_at | event_type | event_name | location | device |
---|---|---|---|---|---|
6991 | 2014-06-09 18:26:54 | engagement | home_page | United States | iphone 5 |
18851 | 2014-08-29 13:18:38 | signup_flow | enter_info | Russia | asus chromebook |
14998 | 2014-07-01 12:47:56 | engagement | login | France | hp pavilion desktop |
8186 | 2014-05-23 10:44:16 | engagement | home_page | Italy | macbook pro |
9626 | 2014-07-31 17:15:14 | engagement | login | Russia | nexus 7 |
The second table is playbook_users.
user_id | created_at | company_id | language | activated_at | state |
---|---|---|---|---|---|
11 | 2013-01-01 04:41:13 | 1 | german | 2013-01-01 | active |
52 | 2013-01-05 15:30:45 | 2866 | spanish | 2013-01-05 | active |
108 | 2013-01-10 11:04:58 | 1848 | spanish | 2013-01-10 | active |
167 | 2013-01-16 20:40:24 | 6709 | arabic | 2013-01-16 | active |
175 | 2013-01-16 11:22:22 | 4797 | russian | 2013-01-16 | active |
Let’s start solving the question by first joining the tables. The tables are joined using INNER JOIN on the column user_id. I also gave aliases to both tables, so I don ’t need to write their full names in other parts of the query.
SELECT *
FROM playbook_users u
INNER JOIN playbook_events e
ON u.user_id = e.user_id;
Now, let’s add the columns to SELECT. The first one is language, as showing data by language is the requirement.
Next, I want to calculate the number of Apple users. I’ll do that by using COUNT() and the CASE WHEN expression in it. CASE WHEN shows the IDs of the users who used MacBook-Pro, iPhone 5s, or iPad-air; the question tells us we can assume these are the only Apple products. The COUNT() function will then count these IDs, and this will represent the number of Apple users.
However, such an approach means the users will be counted every time they appear in the table. This would skew the total number of users, so we need to count each user only once. We can achieve this by adding DISTINCT before the CASE WHEN expression.
The third column counts the total number of users with DISTINCT in COUNT().
Finally, we can group the output to show all these calculations by each language separately. Also, it’s sorted by the number of total users descendingly, as required by the question.
SELECT u.language,
COUNT (DISTINCT CASE
WHEN device IN ('macbook pro',
'iphone 5s',
'ipad air') THEN u.user_id
ELSE NULL
END) AS n_apple_users,
COUNT(DISTINCT u.user_id) AS n_total_users
FROM playbook_users u
INNER JOIN playbook_events e ON u.user_id = e.user_id
GROUP BY u.language
ORDER BY n_total_users DESC;
The output shows the number of Apple and total users by each language.
language | n_apple_users | n_total_users |
---|---|---|
english | 11 | 45 |
spanish | 3 | 9 |
japanese | 2 | 6 |
french | 0 | 5 |
russian | 0 | 5 |
SQL DISTINCT With JOIN
DISTINCT can also be used with JOIN to limit the data you get from one or both tables.
I’ll show you how to do this by solving this AirBnb interview question.
Interview Question Date: May 2020
Display the average number of times a user performed a search which led to a successful booking and the average number of times a user performed a search but did not lead to a booking. The output should have a column named action with values 'does not book' and 'books' as well as a 2nd column named average_searches with the average number of searches per action. Consider that the booking did not happen if the booking date is null. Be aware that search is connected to the booking only if their check-in dates match.
Link to the question: https://platform.stratascratch.com/coding/10124-bookings-vs-non-bookings
The question asks you to show the average number of times users performed a search that led to a successful booking and the average number of times users performed a search but did not lead to a booking.
You’ll work with two tables. One is airbnb_contacts, with a data snapshot shown below.
id_guest | id_host | id_listing | ts_contact_at | ts_reply_at | ts_accepted_at | ts_booking_at | ds_checkin | ds_checkout | n_guests | n_messages |
---|---|---|---|---|---|---|---|---|---|---|
86b39b70-965b-479d-a0b0-719b195acea2 | 1dfb22ec-c20e-4bf9-b161-1607afa25c5a | d668de42-122a-45cd-b91f-91a70895f902 | 2014-04-18 09:32:23 | 2014-04-18 09:39:06 | 2014-12-31 | 2015-01-02 | 7 | 5 | ||
14f943bb-74e9-458b-be55-203dc7220688 | 3347390d-8670-4870-9dab-da30f3700141 | 14c47fb8-e831-4044-9674-9b3fd0499193 | 2014-10-06 06:55:45 | 2014-10-06 10:06:38 | 2014-10-06 10:06:38 | 2014-10-06 10:06:38 | 2014-11-03 | 2014-11-07 | 2 | 8 |
425aa1ed-82ab-4ecf-b62f-d61e1848706d | 02cafb86-5445-45cc-80f2-405291578356 | c5a4a913-a094-4a9d-82e2-0b2d4f9d9eeb | 2014-10-04 05:02:39 | 2014-10-04 23:10:01 | 2014-11-02 | 2014-11-09 | 2 | 2 | ||
bb490ede-8a70-4d61-a2e8-625855a393e2 | f49c3095-58de-4b8d-9d5b-3bfceceb47d8 | 27f4b429-d544-464f-b4b5-3c09fd5992e7 | 2014-08-31 11:46:11 | 2014-08-31 16:48:28 | 2014-11-03 | 2014-11-07 | 2 | 5 | ||
b2fda15a-89bb-4e6e-ae81-8b21598e2482 | 71f1d49e-2ff4-4d72-b8e6-fd4c67feaa74 | 95fb78ca-8e6e-436a-9830-949d995ad14f | 2014-10-08 15:07:56 | 2014-10-08 15:32:12 | 2014-10-08 15:32:12 | 2014-10-08 22:21:41 | 2014-11-06 | 2014-11-09 | 2 | 10 |
The second table is airbnb_searches.
ds | id_user | ds_checkin | ds_checkout | n_searches | n_nights | n_guests_min | n_guests_max | origin_country | filter_price_min | filter_price_max | filter_room_types | filter_neighborhoods |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-10-11 | 67aece73-e112-4e9e-9e05-8a2a94b003b9 | 5 | 1 | 1 | IT | 0 | 99 | ,Private room | ||||
2014-10-01 | 6cbb33d1-6ecc-4f74-8b6a-a43d07d484b6 | 2014-10-04 | 2014-10-07 | 11 | 3 | 3 | 3 | ES | 0 | 567 | ,Entire home/apt,Entire home/apt,Private room,Entire home/apt,Private room,Shared room | |
2014-10-03 | aa9cf5bf-5667-4212-8018-1cb8beee530e | 2014-11-14 | 2014-11-16 | 17 | 2 | 2 | 2 | GB | 0 | 171 | ,Entire home/apt | |
2014-10-09 | 3e6c2466-74fe-44c0-a6f3-dda79755d30a | 2015-02-26 | 2015-03-02 | 9 | 4 | 1 | 4 | GB | 0 | 240 | ,Entire home/apt | |
2014-10-13 | a09bf912-b21d-4859-b194-8512c30695f6 | 2014-10-18 | 2014-10-22 | 7 | 4 | 1 | 2 | GB | ,Entire home/apt |
Let’s again start by first explaining the JOIN part of the code. The question logic suggests that we should take all the searches and then calculate the average of the searches that did and did not lead to booking.
So, this means using LEFT JOIN. Our left table is airbnb_searches, and our right table is actually no table at all but a subquery.
In this subquery, I’m using DISTINCT to find the unique combinations of the guest ID, check-in date, and the booking timestamp and return only those where the booking timestamp is not NULL, i.e., there was a booking. In other words, the subquery returns all the bookings.
The table and the subquery are joined on the columns id_user and id_guest. The additional joining condition is where the columns ds_checkin are equal. This is, of course, because we want to find the same users/guests. Also, the question gives the information that the search is connected to the booking only if their check-in dates match.
SELECT *
FROM airbnb_searches s
LEFT JOIN (SELECT DISTINCT id_guest,
ds_checkin,
ts_booking_at
FROM airbnb_contacts
WHERE ts_booking_at IS NOT NULL) c
ON s.id_user = c.id_guest AND s.ds_checkin = c.ds_checkin;
The output contains all the data from the table airbnb_searches and only unique combinations of columns from the subquery.
ds | id_user | ds_checkin | ds_checkout | n_searches | n_nights | n_guests_min | n_guests_max | origin_country | filter_price_min | filter_price_max | filter_room_types | filter_neighborhoods | id_guest | ds_checkin | ts_booking_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-10-11 | 67aece73-e112-4e9e-9e05-8a2a94b003b9 | 5 | 1 | 1 | IT | 0 | 99 | ,Private room | |||||||
2014-10-01 | 6cbb33d1-6ecc-4f74-8b6a-a43d07d484b6 | 2014-10-04 | 2014-10-07 | 11 | 3 | 3 | 3 | ES | 0 | 567 | ,Entire home/apt,Entire home/apt,Private room,Entire home/apt,Private room,Shared room | ||||
2014-10-03 | aa9cf5bf-5667-4212-8018-1cb8beee530e | 2014-11-14 | 2014-11-16 | 17 | 2 | 2 | 2 | GB | 0 | 171 | ,Entire home/apt | aa9cf5bf-5667-4212-8018-1cb8beee530e | 2014-11-14 | 2014-10-03 17:51:58 | |
2014-10-09 | 3e6c2466-74fe-44c0-a6f3-dda79755d30a | 2015-02-26 | 2015-03-02 | 9 | 4 | 1 | 4 | GB | 0 | 240 | ,Entire home/apt | ||||
2014-10-13 | a09bf912-b21d-4859-b194-8512c30695f6 | 2014-10-18 | 2014-10-22 | 7 | 4 | 1 | 2 | GB | ,Entire home/apt |
Now, I can start adding the columns in SELECT.
The first column will label the search as the one resulting in booking or not. This is done with the help of the CASE WHEN expression. (You’re familiar with it already!) The search will be labeled as ‘books’ if the time of the booking is not NULL (obviously, as time shows the booking occurred) and the check-in dates from both tables are equal (again, the criteria stated in the question). Otherwise, the search will be labeled as ‘does not book’.
The second column in SELECT calculates the average number of searches.
We now simply group the output by the first column to get the average number of searches for both search categories, and we have an answer to the question.
SELECT CASE
WHEN c.ts_booking_at IS NOT NULL AND c.ds_checkin = s.ds_checkin THEN 'books'
ELSE 'does not book'
END AS action,
AVG(n_searches) AS average_searches
FROM airbnb_searches s
LEFT JOIN (SELECT DISTINCT id_guest,
ds_checkin,
ts_booking_at
FROM airbnb_contacts
WHERE ts_booking_at IS NOT NULL) c
ON s.id_user = c.id_guest AND s.ds_checkin = c.ds_checkin
GROUP BY 1;
The output shows two categories and the average number of searches.
action | average_searches |
---|---|
books | 23.333 |
does not book | 22.009 |
Optimizing Queries With SELECT DISTINCT in SQL
Since the SELECT DISTINCT query has to go through all the data and identify unique values, its performance might sometimes be inefficient, especially on a large amount of data.
Of course, this shouldn’t discourage you from using it. However, sometimes, you might want to optimize your SELECT DISTINCT query using some of these strategies.
1. Indexing
- Appropriate Indexing: If the columns in SELECT DISTINCT are indexed, this might speed up identifying the unique values.
- The Same Index and Column Selection Order: Ensure the index aligns with the order of columns in SELECT DISTINCT, as that will make the database efficiently use the indexes to find unique values. This is because the index is essentially a sorted list of rows based on those columns.
2. Column Selection
You should include only the necessary columns in DISTINCT. Don’t show your generosity here, as the more columns there are in DISTINCT, the harder the database has to work to find unique combinations.
3. Filtering Data Earlier
- Use WHERE: If you use the WHERE clause before the DISTINCT clause, this can significantly reduce the number of rows that need to be processed to find unique values.
- Use Subqueries: The subqueries can also be used for filtering data and can sometimes be more efficient than DISTINCT.
4. Pay Attention to Data Types
Wherever you can choose the data types you use. And think about whether they’ll be required in DISTINCT. The reason is that smaller data types (e.g., integers) are processed faster than larger ones (e.g., strings or varchars).
5. Understanding the Data
We already saw in some practical examples how important this is in deciding whether to use DISTINCT. If a column is already unique, there’s no point in using DISTINCT.
6. Query Execution Plan Analysis
No matter what database you use, it has some kind of query execution plan analysis tool. Use it to understand how your SELECT DISTINCT query is being executed. That way, you can learn which query part can be optimized.
7. Avoid Overusing
Too much of anything isn’t good. The same goes for DISTINCT, no matter how cool this SQL clause actually is. Use it only when necessary. In some situations, duplicate data might be acceptable or can be removed more efficiently in later stages.
8. Testing and Monitoring
You should regularly check your queries’ performance, which can change based on data volume, distribution, and updates.
Common Mistakes and Misconceptions About the DISTINCT Clause
While DISTINCT isn’t that complicated, some misconceptions about it can lead to making mistakes, especially if you’re an SQL beginner.
These are some of them.
1. Misunderstanding the DISTINCT Scope
- Row-Level Uniqueness: There’s a misconception that DISTINCT applies to individual columns. While this might seem true when there’s only one column in the clause, this misconception becomes glaring if two or more columns are selected. Actually, SELECT DISTINCT considers the uniqueness of the row based on the combination of values in all selected columns.
- Example: We mentioned this combination of values when solving a question requiring multiple columns in DISTINCT. To really bring this home to you, let me say this once again: the below query will not return all the unique first names and then all the unique last names; it will return unique combinations of the first and last names. So, if two people are named Arthur Clark, it will appear only once. But if there are several other Arthurs with other last names or Clarkes not named Arthur, you will see all of them. In other words, you’ll probably see many Arthurs and Clarkes, but only one Arthur Clarke.
SELECT DISTINCT first_name,
last_name
FROM employee;
2. Overestimating Performance Impact
- Always Assuming High Cost: Yes, I know I dedicated the whole section to optimizing SELECT DISTINCT and drawing your attention to its possible inefficiency. But the key word here is possible. Put differently, SELECT DISTINCT doesn’t always lead to (significant) performance degradation. The impact mainly depends on the data size, indexing, and the particular RDBMS you’re using.
- Avoiding DISTINCT Even When It’s Necessary: Of course, there are ways to work around DISTINCT and get unique values another way. But there are also scenarios where DISTINCT is unavoidable, so don’t insist on avoiding it just for fear of performance issues. First, test and analyze the actual impact, and then decide if this impact is significant or not.
3. Using DISTINCT as a Default
Avoiding DISTINCT at all costs is not good, but using it as a default in every query is not good either. Try to find the balance. Assess whether you need DISTINCT or not. If not, don’t use it. By avoiding unnecessary use, you also avoid unnecessary processing, especially when duplicate data is minimal or acceptable.
4. Confusing DISTINCT With GROUP BY
- Misusing GROUP BY: DISTINCT is often confused with SQL GROUP BY because they both can be used to return unique values. However, DISTINCT is a simpler and more direct method of achieving this. While you can sometimes get the same effect with GROUP BY, it is primarily used for aggregating data.
- Misusing DISTINCT: This is the mirror image of the above issue. This time, you’re using DISTINCT to aggregate data when using aggregate functions with GROUP BY is more adequate for that task.
5. Incorrect Use With Aggregate Functions
- Misusing With COUNT: It’s a common mistake to use DISTINCT with COUNT() – as in SELECT COUNT(DISTINCT column_name) – without really understanding that this will count only the unique values of the column and getting unexpected or skewed results.
- Redundant Use: Using DISTINCT with some aggregate functions, such as MIN() or MAX(), is redundant, as they already return a single value based on the entire set
6. Misinterpreting DISTINCT in JOINs
Using DISTINCT in complex joins without really understanding how it interacts with joined tables can lead to unexpected results, especially if there are duplicates in the joined tables.
7. Ignoring NULL Values
When your data has NULLs, you should be aware of the fact that DISTINCT treats all the NULLs as the same value. If a column contains multiple NULL values, SELECT DISTINCT will treat them as a unique single value.
Alternatives to SELECT DISTINCT
We learned that SELECT DISTINCT is an elegant and powerful tool for finding unique values. But there are also other ways to achieve the same result, which can sometimes be more efficient or appropriate.
Let’s see what these alternatives are.
1. GROUP BY Clause
- Use: Even though its primary use is data aggregation, GROUP BY can also be used to remove duplicates when used without aggregate functions.
- Why Use It?: There are some cases where GROUP BY is more efficient than DISTINCT, for example, when used with indexes or large data.
- Example Usage: If you’re only interested in the unique values of a single column, then this query,
SELECT column
FROM table
GROUP BY column;
can be an alternative to the one below.
SELECT DISTINCT column
FROM table;
2. Window Functions
- Use: SQL Window functions provide a way to perform complex calculations across a set of rows related to the current row. While they don't inherently remove duplicates, they can be used to achieve similar results.
- How to Use Them?: One common approach is to assign a row number to each row within a partition of data and then filter based on this row number to achieve uniqueness.
- Example Usage: The below query using the window functions returns unique values.
SELECT column_1,
column_2
FROM (SELECT column_1,
column_2,
ROW_NUMBER() OVER (PARTITION BY column_1 ORDER BY column_2) AS rn
FROM table) sub
WHERE rn = 1;
3. Subqueries
- Use: Sometimes, using a subquery to filter the data before applying SELECT DISTINCT on the outer query can be more efficient.
- How to Use It?: This approach involves nested queries where the inner query handles a part of the data manipulation, and the outer query applies the DISTINCT.
- Example Usage: This is how you can use subqueries to remove duplicates.
SELECT DISTINCT column
FROM (SELECT column
FROM table
WHERE condition) sub;
4. EXISTS Clause
- Use: The EXISTS clause is a logical operator used to test the existence of rows in a subquery. It can also be used to remove duplicate values.
- How to Use It?: You can use it in the WHERE or HAVING clauses and with a subquery.
- Example Usage: Here’s how you can use it.
SELECT *
FROM orders o
WHERE EXISTS (
SELECT 1
FROM customers c
WHERE c.customer_id = o.customer_id
AND c.country = 'USA'
);
5. SET Operators
- Use: The UNION operator is used to combine the outputs of two or more queries into one output. It inherently removes duplicates. Using UNION can sometimes be a better idea than combining multiple SELECT DISTINCT queries.
- How to Use It?: Like you would use UNION with two SELECT statements in any other scenario.
- Example Usage:
SELECT column_1
FROM table_1
UNION
SELECT column_1
FROM table_2;
6. DISTINCT ON (PostgreSQL Specific)
- Use: DISTINCT ON is a feature in PostgreSQL that allows you to get distinct values based on specific columns.
- When to Use It?: When you want selective distinctiveness. In other words, when you want to fetch rows that are distinct in one or more columns but don’t require all columns to be distinct.
- Example Use: Here’s the example of a query using DISTINCT ON to return rows that are unique with respect to column_1.
SELECT DISTINCT ON (column_1) column_1,
column2
FROM TABLE;
Conclusion
The SQL SELECT DISTINCT statement is a relatively straightforward SQL statement designed to remove data duplicates. If you know SELECT, you’ll intuitively know how to use DISTINCT in most cases.
However, the DISTINCT clause can also be used with multiple columns, in aggregate functions, or with HAVING, CASE WHEN, or JOIN. While DISTINCT doesn’t change its nature, it’s crucial – especially in those more complex use cases – that you know precisely what DISTINCT does. We mentioned some misconceptions and typical SQL errors people make when using DISTINCT.
While DISTINCT is tremendous and probably the best choice for removing duplicates in most cases, you’ll sometimes need to consider alternatives. Sometimes GROUP BY, a subquery, or any other alternative we covered earlier are more adequate and resource-efficient for finding unique data values.
You see that knowing DISTINCT is essential. But also leads to the imperative of understanding some other important SQL concepts that can give you flexibility in your daily work with data. No matter which stage of your data career, you should ensure you’re familiar with the topics covered in the SQL Interview Questions article.