An In-Depth Guide to SQL SELECT DISTINCT Statement

A Guide to SQL SELECT DISTINCT Statement


SQL SELECT DISTINCT is a powerful statement for removing duplicates from your data. Learn how to use it in different scenarios and what its alternatives are.

Being unique is what you should be, they tell you all the time. You shouldn’t be someone’s duplicate! If only your data had listened to this advice and sorted itself out!

But, no, you need to go through every database row and decide if the row is duplicate or not. Or, at least, that’s a story you tell your boss to paint your job as being of colossal difficulty, possibly also to cause some pity and a pay raise.

As if you don’t use SELECT DISTINCT to remove duplicates in a matter of seconds. You don’t? You really go through each and every row to find duplicates. Don’t tell me; you export data to Excel and then use conditional formatting to highlight duplicates.No, no, there’s an easier way. Let me show it to you so you can spend more time doing something useful at your work: solving sudoku, drinking coffee, or watching a video of a girl burping in a car on repeat.

What is a SELECT DISTINCT Statement in SQL?

The SELECT DISTINCT statement is used to retrieve unique records from the database. It filters out duplicate values in the result set, ensuring that every row returned is different in some specified aspect. This functionality is vital in scenarios where only distinct entries are required from a large set of data.

When working with data, you’ll often have to make sure that your report or analysis includes only unique values. This is where SELECT DISTINCT can help you.

What is the DISTINCT Clause, and How Does It Modify the Behavior of the SELECT Statement

The DISTINCT clause’s primary function is to ensure that the results returned by a query are unique (or distinct) by removing duplicate rows from the result set. It acts as a filter, screening out repeated occurrences of the same data.

Under normal circumstances, you know that the SELECT statement will retrieve all rows from the specified columns in a table. But when used with DISTINCT, it will remove duplicates and return only unique rows of the column (when one column is selected) or a combination of columns (when multiple columns are selected).

The database executes the SELECT DISTINCT statement exactly in this order:

  1. SELECT is executed to return all the requested columns
  2. DISTINCT is executed to return duplicate rows from the output.

Due to that, the SELECT DISTINCT statement requires additional processing, which can be resource-intensive on large datasets or complex queries. Therefore, it’s essential to use DISTINCT judiciously, especially in environments where performance and response time are critical.

Syntax of SQL SELECT DISTINCT

The basic syntax of the SQL SELECT DISTINCT statement is very simple. You write a regular SELECT statement, only with DISTINCT after the keyword SELECT, like this. Or before the columns, depending on how you look at it.

SELECT DISTINCT column_1
FROM table;

Practical Applications of SELECT DISTINCT

Oh, boy, you’ll be using SQL SELECT DISTINCT a lot once you realize how useful it is in scenarios that revolve around data uniqueness.

Use Cases Where SELECT DISTINCT Is Essential

Here are examples and scenarios where data uniqueness is important, and SELECT DISTINCT comes in handy. These are all situations that are very common in the everyday work of most data professionals.

SQL SELECT DISTINCT Use Cases

1. Data Reporting and Analysis

Ensuring data accuracy in your reporting and analysis is crucial, so you’ll find SELECT DISTINCT useful when:

  • Generating Unique Lists: For creating reports that require a list of unique items, such as a list of all the unique products sold in a store.
  • Summarizing Data: Such as finding the total number of unique customers who made purchases within a specific period.

2. Data Cleaning and Preprocessing

Data often contains duplicates, which can skew analysis results. SELECT DISTINCT helps you with:

  • Removing Duplicates: By using SELECT DISTINCT before analysis, you can identify and remove duplicate entries from the dataset.
  • Consistency Checks: Used for consistency checks in data, it ensures that entries like IDs or codes are unique as required.

3. Database Design and Restructuring

When designing and restructuring databases, you can use SELECT DISTINCT in:

  • Identify Unique Values: This is required for creating a primary key or when wanting to understand the relationship between different tables.
  • Normalization Process: In the database normalization process, SELECT DISTINCT can help identify redundant data, which can then be moved to separate tables to reduce duplication.

4. Identifying Unique Combinations

When used with multiple columns, SELECT DISTINCT can help you find unique combinations of data across multiple columns—for instance, identifying unique combinations of country and city names in a customer database.

5. Data Integrity Checks

Data integrity is an essential concept for anyone working with data. SELECT DISTINCT can be used when doing data integrity checks by:

  • Checking Redundancies: Regular checks for redundancies or anomalies in the data, especially in large databases where manual checking is impractical.
  • Ensuring Uniqueness Constraints: Verifying that data supposed to be unique, such as email addresses or usernames, remains unique.

6. Business Intelligence and Decision Making

For strategic business decisions, SQL SELECT DISTINCT can be used in the following ways:

  • Market Analysis: Understanding the diversity of a product range, customer base, or vendor list by identifying unique entities.
  • Customer Segmentation: Segmenting customers based on unique characteristics which requires identifying those unique characteristics first.

7. Performance Optimization

When wanting to optimize data performance, SELECT DISTINCT is helpful as it can reduce data load. It reduces the amount of data processed in subsequent operations, leading to more efficient query performance in some cases.

8. Unique Record Extraction for Joins

When joining two, three or more tables in SQL, you could easily end up with duplicate values when you don’t want them. With SELECT DISTINCT, you can prepare data for joins. This ensures that the data being joined does not have duplicates, which could lead to incorrect join results.

9. Generating Distinct Counts

When doing statistical analysis, you can use SELECT DISTINCT in conjunction with COUNT to find the count of unique entries in a column.

Example From Real-World Databases

Now, let’s see how SQL SELECT DISTINCT is used with real-world data and in actual code. It's probably the thing you’re most interested in, as you want to see how to ditch Excel for something much more attractive.

Here’s an interview question by Spotify. The scenario it poses is the one many data professionals can relate to, such as data engineers, data scientists, BI analysts, data analysts, and ML engineers.


Table: billboard_top_100_year_end


Link to the question: https://platform.stratascratch.com/coding/10283-find-the-top-ranked-songs-for-the-past-30-years

The question asks you to find all the songs that were top-ranked at least once in the past 20 years. The question is phrased this way because you need to work with Spotify data. But the task is applicable to many other similar situations, like finding a top-ranked product, salesman, customer (by number of orders or their value), the most expensive product, etc.

The question gives you one table to work with, namely billboard_top_100_year_end.

Table: billboard_top_100_year_end
yearyear_rankgroup_nameartistsong_nameid
19561Elvis PresleyElvis PresleyHeartbreak Hotel1
19562Elvis PresleyElvis PresleyDon't Be Cruel2
19563Nelson RiddleNelson RiddleLisbon Antigua3
19564PlattersPlattersMy Prayer4
19565Gogi GrantGogi GrantThe Wayward Wind5

Why do you think we need a SELECT DISTINCT to solve this problem?

First of all, it’s the logic of the problem. It requires you to show songs ranked first at least once in the last 20 years. Why would anyone show one song every time it ranks first?

Knowing how fast songs go in and out of fashion in pop music, you could argue that it’s highly unlikely that one song was top-ranked in more than one year. You’re probably right about that. But, here enters the second reason why you should use DISTINCT: knowing your dataset and its logic.

Here’s a snapshot of the dataset.

All required columns and the first 5 rows of the solution are shown

yearyear_rankgroup_nameartistsong_nameid
20121Gotye feat. KimbraGotyeSomebody That I Used To Know6203
20121Gotye feat. KimbraKimbraSomebody That I Used To Know6204

As you can see, the song ‘Somebody That I Used To Know’ is top-ranked but appears twice. Why? Because it’s a feature song, it appears once under artist Gotye and the second time under artist Kimbra. As we want to output only the song name, this and all other songs like this would appear at least two times. You see now why it’s important to know your dataset and then use DISTINCT to avoid skewing your little report.

OK, let’s see the code.

The syntax is the same as explained earlier: write DISTINCT after SELECT and then the column you want to select. In this example, it’s the column song_name.

We also need to apply certain criteria in WHERE to filter data. The first condition is that the ranking equals one. Also, only songs that were top-ranked in the last 20 years are accepted. So, I use the DATE_PART() function to get the year part of the current date and then subtract the column year from it. That way, you get the number of years between the current year and the year of the song’s top rank. This result must be less than or equal to 20.

SELECT DISTINCT song_name
FROM billboard_top_100_year_end
WHERE year_rank = 1 AND
  DATE_PART('year', CURRENT_DATE) - year <= 20;

The output is a list of songs that satisfy the requirements.

All required columns and the first 5 rows of the solution are shown

song_name
Bad Day
Boom Boom Pow
Irreplaceable
Low
Rolling In The Deep

Now, the above example is an elementary application of DISTINCT. Of course, it’s not the only one, as DISTINCT can also be used with multiple columns as well as with other SQL clauses and functions.

SQL SELECT DISTINCT With Multiple Columns

When using SELECT DISTINCT with multiple columns, you write DISTINCT only once and then list all the columns you want.

Be careful when doing this! The output is not unique values for each column separately. The output is all unique combinations of all the columns you listed.

Also, you can’t somehow exclude one of the listed columns from the DISTINCT clause; they are all always included in the distinct combination.

I’ll solve this question by the City of San Francisco to show you how this works.


Table: sf_restaurant_health_violations


Link to the question: https://platform.stratascratch.com/coding/9720-find-the-business-names-that-have-inspection-scores-of-less-than-50

You need to find the names of the businesses with a score lower than 50, along with the corresponding inspection date and score.

Here’s the overview of the table sf_restaurant_health_violations we need to work with.

Table: sf_restaurant_health_violations
business_idbusiness_namebusiness_addressbusiness_citybusiness_statebusiness_postal_codebusiness_latitudebusiness_longitudebusiness_locationbusiness_phone_numberinspection_idinspection_dateinspection_scoreinspection_typeviolation_idviolation_descriptionrisk_category
5800John Chin Elementary School350 Broadway StSan FranciscoCA9413337.798-122.403{'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'}5800_201710172017-10-1798Routine - Unscheduled5800_20171017_103149Wiping cloths not clean or properly stored or inadequate sanitizerLow Risk
64236Sutter Pub and Restaurant700 Sutter StSan FranciscoCA9410237.789-122.412{'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'}64236_201707252017-07-2588Routine - Unscheduled64236_20170725_103133Foods not protected from contaminationModerate Risk
1991SRI THAI CUISINE4621 LINCOLN WaySan FranciscoCA9412237.764-122.508{'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'}1991_201711292017-11-2986Routine - Unscheduled1991_20171129_103139Improper food storageLow Risk
3816Washington Bakery & Restaurant733 Washington StSan FranciscoCA9410837.795-122.406{'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'}3816_201607282016-07-2867Routine - Unscheduled3816_20160728_103108Contaminated or adulterated foodHigh Risk
39119Brothers Restaurant4128 GEARY BlvdSan FranciscoCA9411837.781-122.464{'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'}39119_201607182016-07-1879Routine - Unscheduled39119_20160718_103133Foods not protected from contaminationModerate Risk

Solving this question requires using SELECT DISTINCT on multiple columns.

The syntax is straightforward: write SELECT DISTINCT and then list all the columns the question asks you to, namely business_name, inspection_date, and inspection_score.

After referencing the dataset, use WHERE to apply a filtering condition. It will output only data where the inspection score is lower than 50.

SELECT DISTINCT business_name,
                inspection_date,
                inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;

The code returns two businesses with the relevant data.

All required columns and the first 5 rows of the solution are shown

business_nameinspection_dateinspection_score
Lollipot2018-05-2245
Da Cafe2016-09-0748

Now, the question is, could we have solved this question without DISTINCT? In terms of returning the names of the same two restaurants, the answer is yes. Simply remove the DISTINCT keyword, and you get this code.

SELECT business_name,
       inspection_date,
       inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;

Yes, this solution is also accepted; run it to see. However, you see what your output looks like, right?

All required columns and the first 5 rows of the solution are shown

business_nameinspection_dateinspection_score
Lollipot2018-05-2245
Lollipot2018-05-2245
Da Cafe2016-09-0748
Lollipot2018-05-2245
Lollipot2018-05-2245
Lollipot2018-05-2245
Lollipot2018-05-2245
Da Cafe2016-09-0748
Da Cafe2016-09-0748
Da Cafe2016-09-0748

Mhm, it returns Lollipot and De Cafe – the same as the earlier solution with DISTINCT – but they are both repeated several times. Why does the query do that despite the same business name, inspection date, and inspection score?

Well, again, knowing your data comes in handy. There are columns violation_id and violation_description in the dataset. There can be (and there are, obviously!) multiple violation IDs – hence, numerous violation descriptions – by each inspection. So, the values we need – business name, inspection date, and score – will be repeated whenever there’s more than one violation per inspection.

That is why using SELECT DISTINCT on multiple columns is a better option here. If nothing, the output is nicer without the redundant data.

SQL DISTINCT With Aggregate Functions

The SQL distinct can also be used with the SQL aggregate functions. The purpose is to include only the unique values in the calculation. For example, the most common use of DISTINCT is with COUNT(). That combination returns the number of unique values, i.e., counting the same data value only once.

Let’s see how this works by solving the medium-difficulty interview question and using DISTINCT with COUNT().


Table: car_launches


Link to the question: https://platform.stratascratch.com/coding/10318-new-products

You need to count the net difference between the number of products companies launched in 2020, and the number of products the companies launched in the previous year.

The question gives you the table car_launches, where you can find the car make, model, and launch year.

Table: car_launches
yearcompany_nameproduct_name
2019ToyotaAvalon
2019ToyotaCamry
2020ToyotaCorolla
2019HondaAccord
2019HondaPassport

The solution query is a bit more complicated, so I’ll start writing it from the subqueries.

The first subquery shows companies and car models that were launched in 2020. The filter is applied in WHERE on the column year.

SELECT company_name,
       product_name AS brand_2020
FROM car_launches
WHERE year = 2020;

The second query is exactly the same, except that it shows models launched in 2019.

SELECT company_name,
       product_name AS brand_2019
FROM car_launches
WHERE year = 2019;

I need to join these two subqueries in the FROM clause. I use FULL JOIN because I want all data from both subqueries. The subqueries are joined on the company name.

SELECT *
FROM
  (SELECT company_name,
          product_name AS product_2020
   FROM car_launches
   WHERE year = 2020) a
FULL OUTER JOIN
  (SELECT company_name,
          product_name AS product_2019
   FROM car_launches
   WHERE year = 2019) b ON a.company_name = b.company_name;

Here’s the sample of the above query’s output.

All required columns and the first 5 rows of the solution are shown

company_nameproduct_2020company_nameproduct_2019
ToyotaCorollaToyotaCamry
ToyotaCorollaToyotaAvalon
HondaPilotHondaCivic
HondaPilotHondaCR-V
HondaPilotHondaPassport

As you can see, the models launched in 2020 can appear more than once. That’s because they are, when FULL JOINed, paired with the belonging company’s every model launched in 2019. So, if there are several models launched in 2019, the models launched in 2020 will appear multiple times.

The same is true in the opposite case.

So, we need to use DISTINCT somehow and exclude this duplicate data. First, let’s replace the asterisk (*) in SELECT with the actual columns we need. First, we select the company name.

Then, we need to calculate the difference between the number of products launched in 2020 and the year before. So, we first count the number of products launched in 2020. We do that by using COUNT() and then DISTINCT with the column brand_2020. This will count each product only once.

Remember: DISTINCT goes inside COUNT(), not outside!

Then, we do the same for the products launched in 2019. After that, we subtract these two values and name the column net_products.

To get these values by company, we must group them by the company name. As a final touch, let’s sort the data alphabetically by company name using ORDER BY.

SELECT a.company_name,
       (COUNT(DISTINCT a.product_2020)-COUNT(DISTINCT b.product_2019)) AS net_products
FROM
  (SELECT company_name,
          product_name AS product_2020
   FROM car_launches
   WHERE year = 2020) a
FULL OUTER JOIN
  (SELECT company_name,
          product_name AS product_2019
   FROM car_launches
   WHERE year = 2019) b ON a.company_name = b.company_name
GROUP BY a.company_name
ORDER BY company_name;


The output shows five companies and the net difference between the number of products in two consecutive years.

All required columns and the first 5 rows of the solution are shown

company_namenet_products
Chevrolet2
Ford-1
Honda-3
Jeep1
Toyota-1

SQL DISTINCT With HAVING

You can also use DISTINCT in the HAVING clause. Since HAVING is used for filtering aggregated data, you can use aggregate functions in it. And this leads to the conclusion that you can also use both aggregate functions and DISTINCT in HAVING.

Let’s see how to do that! Here’s a question from the Burtch Works, Glassdoor, and Salesforce interviews.


Table: employee


Link to the question: https://platform.stratascratch.com/coding/9911-departments-with-5-employees

The question wants you to find departments with five or more employees.

It gives you the table employee to work with. It’s a list of the company’s employees and their details.

Table: employee
idfirst_namelast_nameagesexemployee_titledepartmentsalarytargetbonusemailcityaddressmanager_id
5MaxGeorge26MSalesSales1300200150Max@company.comCalifornia2638 Richards Avenue1
13KattyBond56FManagerManagement1500000300Katty@company.comArizona1
11RicherdGear57MManagerManagement2500000300Richerd@company.comAlabama1
10JenniferDion34FSalesSales1000200150Jennifer@company.comAlabama13
19GeorgeJoe50MManagerManagement1000000300George@company.comFlorida1003 Wyatt Street1

To solve this problem, let’s start by selecting the departments from the table. Then, we can group by department to show each department only once.

Now that we have the list of departments, we can find those that have five or more employees. We use HAVING to filter data. The filtering condition uses COUNT() with DISTINCT to find the count of each employee only once and returns only those rows where the result is greater than or equal to five.

SELECT department
FROM employee 
GROUP BY department
HAVING COUNT(DISTINCT id) >= 5;

The output shows that the sales department is the only one with five or more employees.

All required columns and the first 5 rows of the solution are shown

department
Sales

SQL DISTINCT With CASE WHEN

Another possible use of the DISTINCT clause is in the CASE WHEN expression. One good example of this is the interview question by Apple and Google.


Tables: playbook_events, playbook_users

Link to the question: https://platform.stratascratch.com/coding/10141-apple-product-counts

You’re required to find the number of Apple product users and the number of total users with a device and group the counts by language.

The question gives you two tables. The first is playbook_events, with the preview shown below.

Table: playbook_events
user_idoccurred_atevent_typeevent_namelocationdevice
69912014-06-09 18:26:54engagementhome_pageUnited Statesiphone 5
188512014-08-29 13:18:38signup_flowenter_infoRussiaasus chromebook
149982014-07-01 12:47:56engagementloginFrancehp pavilion desktop
81862014-05-23 10:44:16engagementhome_pageItalymacbook pro
96262014-07-31 17:15:14engagementloginRussianexus 7

The second table is playbook_users.

Table: playbook_users
user_idcreated_atcompany_idlanguageactivated_atstate
112013-01-01 04:41:131german2013-01-01active
522013-01-05 15:30:452866spanish2013-01-05active
1082013-01-10 11:04:581848spanish2013-01-10active
1672013-01-16 20:40:246709arabic2013-01-16active
1752013-01-16 11:22:224797russian2013-01-16active

Let’s start solving the question by first joining the tables. The tables are joined using INNER JOIN on the column user_id. I also gave aliases to both tables, so I don’t need to write their full names in other parts of the query.

SELECT *
FROM playbook_users u
INNER JOIN playbook_events e 
ON u.user_id = e.user_id;

Now, let’s add the columns to SELECT. The first one is language, as showing data by language is the requirement.

Next, I want to calculate the number of Apple users. I’ll do that by using COUNT() and the CASE WHEN expression in it. CASE WHEN shows the IDs of the users who used MacBook-Pro, iPhone 5s, or iPad-air; the question tells us we can assume these are the only Apple products. The COUNT() function will then count these IDs, and this will represent the number of Apple users.

However, such an approach means the users will be counted every time they appear in the table. This would skew the total number of users, so we need to count each user only once. We can achieve this by adding DISTINCT before the CASE WHEN expression.

The third column counts the total number of users with DISTINCT in COUNT().

Finally, we can group the output to show all these calculations by each language separately. Also, it’s sorted by the number of total users descendingly, as required by the question.

SELECT u.language,
       COUNT (DISTINCT CASE
                           WHEN device IN ('macbook pro',
                                           'iphone 5s',
                                           'ipad air') THEN u.user_id
                           ELSE NULL
                       END) AS n_apple_users,
        COUNT(DISTINCT u.user_id) AS n_total_users
FROM playbook_users u
INNER JOIN playbook_events e ON u.user_id = e.user_id
GROUP BY u.language
ORDER BY n_total_users DESC;

The output shows the number of Apple and total users by each language.

All required columns and the first 5 rows of the solution are shown

languagen_apple_usersn_total_users
english1145
spanish39
japanese26
french05
russian05

SQL DISTINCT With JOIN

DISTINCT can also be used with JOIN to limit the data you get from one or both tables.

I’ll show you how to do this by solving this AirBnb interview question.


Tables: airbnb_contacts, airbnb_searches


Link to the question: https://platform.stratascratch.com/coding/10124-bookings-vs-non-bookings

The question asks you to show the average number of times users performed a search that led to a successful booking and the average number of times users performed a search but did not lead to a booking.

You’ll work with two tables. One is airbnb_contacts, with a data snapshot shown below.

Table: airbnb_contacts
id_guestid_hostid_listingts_contact_atts_reply_atts_accepted_atts_booking_atds_checkinds_checkoutn_guestsn_messages
86b39b70-965b-479d-a0b0-719b195acea21dfb22ec-c20e-4bf9-b161-1607afa25c5ad668de42-122a-45cd-b91f-91a70895f9022014-04-18 09:32:232014-04-18 09:39:062014-12-312015-01-0275
14f943bb-74e9-458b-be55-203dc72206883347390d-8670-4870-9dab-da30f370014114c47fb8-e831-4044-9674-9b3fd04991932014-10-06 06:55:452014-10-06 10:06:382014-10-06 10:06:382014-10-06 10:06:382014-11-032014-11-0728
425aa1ed-82ab-4ecf-b62f-d61e1848706d02cafb86-5445-45cc-80f2-405291578356c5a4a913-a094-4a9d-82e2-0b2d4f9d9eeb2014-10-04 05:02:392014-10-04 23:10:012014-11-022014-11-0922
bb490ede-8a70-4d61-a2e8-625855a393e2f49c3095-58de-4b8d-9d5b-3bfceceb47d827f4b429-d544-464f-b4b5-3c09fd5992e72014-08-31 11:46:112014-08-31 16:48:282014-11-032014-11-0725
b2fda15a-89bb-4e6e-ae81-8b21598e248271f1d49e-2ff4-4d72-b8e6-fd4c67feaa7495fb78ca-8e6e-436a-9830-949d995ad14f2014-10-08 15:07:562014-10-08 15:32:122014-10-08 15:32:122014-10-08 22:21:412014-11-062014-11-09210

The second table is airbnb_searches.

Table: airbnb_searches
dsid_userds_checkinds_checkoutn_searchesn_nightsn_guests_minn_guests_maxorigin_countryfilter_price_minfilter_price_maxfilter_room_typesfilter_neighborhoods
2014-10-1167aece73-e112-4e9e-9e05-8a2a94b003b9511IT099,Private room
2014-10-016cbb33d1-6ecc-4f74-8b6a-a43d07d484b62014-10-042014-10-0711333ES0567,Entire home/apt,Entire home/apt,Private room,Entire home/apt,Private room,Shared room
2014-10-03aa9cf5bf-5667-4212-8018-1cb8beee530e2014-11-142014-11-1617222GB0171,Entire home/apt
2014-10-093e6c2466-74fe-44c0-a6f3-dda79755d30a2015-02-262015-03-029414GB0240,Entire home/apt
2014-10-13a09bf912-b21d-4859-b194-8512c30695f62014-10-182014-10-227412GB,Entire home/apt

Let’s again start by first explaining the JOIN part of the code. The question logic suggests that we should take all the searches and then calculate the average of the searches that did and did not lead to booking.

So, this means using LEFT JOIN. Our left table is airbnb_searches, and our right table is actually no table at all but a subquery.

In this subquery, I’m using DISTINCT to find the unique combinations of the guest ID, check-in date, and the booking timestamp and return only those where the booking timestamp is not NULL, i.e., there was a booking. In other words, the subquery returns all the bookings.

The table and the subquery are joined on the columns id_user and id_guest. The additional joining condition is where the columns ds_checkin are equal. This is, of course, because we want to find the same users/guests. Also, the question gives the information that the search is connected to the booking only if their check-in dates match.

SELECT *
FROM airbnb_searches s
LEFT JOIN (SELECT DISTINCT id_guest,
                           ds_checkin,
                           ts_booking_at
           FROM airbnb_contacts
           WHERE ts_booking_at IS NOT NULL) c 
ON s.id_user = c.id_guest AND s.ds_checkin = c.ds_checkin;

The output contains all the data from the table airbnb_searches and only unique combinations of columns from the subquery.

All required columns and the first 5 rows of the solution are shown

dsid_userds_checkinds_checkoutn_searchesn_nightsn_guests_minn_guests_maxorigin_countryfilter_price_minfilter_price_maxfilter_room_typesfilter_neighborhoodsid_guestds_checkints_booking_at
2014-10-1167aece73-e112-4e9e-9e05-8a2a94b003b9511IT099,Private room
2014-10-016cbb33d1-6ecc-4f74-8b6a-a43d07d484b62014-10-042014-10-0711333ES0567,Entire home/apt,Entire home/apt,Private room,Entire home/apt,Private room,Shared room
2014-10-03aa9cf5bf-5667-4212-8018-1cb8beee530e2014-11-142014-11-1617222GB0171,Entire home/aptaa9cf5bf-5667-4212-8018-1cb8beee530e2014-11-142014-10-03 17:51:58
2014-10-093e6c2466-74fe-44c0-a6f3-dda79755d30a2015-02-262015-03-029414GB0240,Entire home/apt
2014-10-13a09bf912-b21d-4859-b194-8512c30695f62014-10-182014-10-227412GB,Entire home/apt

Now, I can start adding the columns in SELECT.

The first column will label the search as the one resulting in booking or not. This is done with the help of the CASE WHEN expression. (You’re familiar with it already!) The search will be labeled as ‘books’ if the time of the booking is not NULL (obviously, as time shows the booking occurred) and the check-in dates from both tables are equal (again, the criteria stated in the question). Otherwise, the search will be labeled as ‘does not book’.

The second column in SELECT calculates the average number of searches.

We now simply group the output by the first column to get the average number of searches for both search categories, and we have an answer to the question.

SELECT CASE
           WHEN c.ts_booking_at IS NOT NULL AND c.ds_checkin = s.ds_checkin THEN 'books'
           ELSE 'does not book'
       END AS action,
       AVG(n_searches) AS average_searches
FROM airbnb_searches s
LEFT JOIN (SELECT DISTINCT id_guest,
                           ds_checkin,
                           ts_booking_at
           FROM airbnb_contacts
           WHERE ts_booking_at IS NOT NULL) c 
ON s.id_user = c.id_guest AND s.ds_checkin = c.ds_checkin
GROUP BY 1;

The output shows two categories and the average number of searches.

All required columns and the first 5 rows of the solution are shown

actionaverage_searches
books23.333
does not book22.009

Optimizing Queries With SELECT DISTINCT in SQL

Since the SELECT DISTINCT query has to go through all the data and identify unique values, its performance might sometimes be inefficient, especially on a large amount of data.

Of course, this shouldn’t discourage you from using it. However, sometimes, you might want to optimize your SELECT DISTINCT query using some of these strategies.

Optimizing the SQL SELECT DISTINCT Queries

1. Indexing

  • Appropriate Indexing: If the columns in SELECT DISTINCT are indexed, this might speed up identifying the unique values.
  • The Same Index and Column Selection Order: Ensure the index aligns with the order of columns in SELECT DISTINCT, as that will make the database efficiently use the indexes to find unique values. This is because the index is essentially a sorted list of rows based on those columns.

2. Column Selection

You should include only the necessary columns in DISTINCT. Don’t show your generosity here, as the more columns there are in DISTINCT, the harder the database has to work to find unique combinations.

3. Filtering Data Earlier

  • Use WHERE: If you use the WHERE clause before the DISTINCT clause, this can significantly reduce the number of rows that need to be processed to find unique values.
  • Use Subqueries: The subqueries can also be used for filtering data and can sometimes be more efficient than DISTINCT.

4. Pay Attention to Data Types

Wherever you can choose the data types you use. And think about whether they’ll be required in DISTINCT. The reason is that smaller data types (e.g., integers) are processed faster than larger ones (e.g., strings or varchars).

5. Understanding the Data

We already saw in some practical examples how important this is in deciding whether to use DISTINCT. If a column is already unique, there’s no point in using DISTINCT.

6. Query Execution Plan Analysis

No matter what database you use, it has some kind of query execution plan analysis tool. Use it to understand how your SELECT DISTINCT query is being executed. That way, you can learn which query part can be optimized.

7. Avoid Overusing

Too much of anything isn’t good. The same goes for DISTINCT, no matter how cool this SQL clause actually is. Use it only when necessary. In some situations, duplicate data might be acceptable or can be removed more efficiently in later stages.

8. Testing and Monitoring

You should regularly check your queries’ performance, which can change based on data volume, distribution, and updates.

Common Mistakes and Misconceptions About the DISTINCT Clause

While DISTINCT isn’t that complicated, some misconceptions about it can lead to making mistakes, especially if you’re an SQL beginner.

These are some of them.

Mistakes of SQL Select Distinct Clause

1. Misunderstanding the DISTINCT Scope

  • Row-Level Uniqueness: There’s a misconception that DISTINCT applies to individual columns. While this might seem true when there’s only one column in the clause, this misconception becomes glaring if two or more columns are selected. Actually, SELECT DISTINCT considers the uniqueness of the row based on the combination of values in all selected columns.
  • Example: We mentioned this combination of values when solving a question requiring multiple columns in DISTINCT. To really bring this home to you, let me say this once again: the below query will not return all the unique first names and then all the unique last names; it will return unique combinations of the first and last names. So, if two people are named Arthur Clark, it will appear only once. But if there are several other Arthurs with other last names or Clarkes not named Arthur, you will see all of them. In other words, you’ll probably see many Arthurs and Clarkes, but only one Arthur Clarke.
SELECT DISTINCT first_name, 
		                last_name
		FROM employee;

2. Overestimating Performance Impact

  • Always Assuming High Cost: Yes, I know I dedicated the whole section to optimizing SELECT DISTINCT and drawing your attention to its possible inefficiency. But the key word here is possible. Put differently, SELECT DISTINCT doesn’t always lead to (significant) performance degradation. The impact mainly depends on the data size, indexing, and the particular RDBMS you’re using.
  • Avoiding DISTINCT Even When It’s Necessary: Of course, there are ways to work around DISTINCT and get unique values another way. But there are also scenarios where DISTINCT is unavoidable, so don’t insist on avoiding it just for fear of performance issues. First, test and analyze the actual impact, and then decide if this impact is significant or not.

3. Using DISTINCT as a Default

Avoiding DISTINCT at all costs is not good, but using it as a default in every query is not good either. Try to find the balance. Assess whether you need DISTINCT or not. If not, don’t use it. By avoiding unnecessary use, you also avoid unnecessary processing, especially when duplicate data is minimal or acceptable.

4. Confusing DISTINCT With GROUP BY

  • Misusing GROUP BY: DISTINCT is often confused with SQL GROUP BY because they both can be used to return unique values. However, DISTINCT is a simpler and more direct method of achieving this. While you can sometimes get the same effect with GROUP BY, it is primarily used for aggregating data.
  • Misusing DISTINCT: This is the mirror image of the above issue. This time, you’re using DISTINCT to aggregate data when using aggregate functions with GROUP BY is more adequate for that task.

5. Incorrect Use With Aggregate Functions

  • Misusing With COUNT: It’s a common mistake to use DISTINCT with COUNT() – as in SELECT COUNT(DISTINCT column_name) – without really understanding that this will count only the unique values of the column and getting unexpected or skewed results.
  • Redundant Use: Using DISTINCT with some aggregate functions, such as MIN() or MAX(), is redundant, as they already return a single value based on the entire set

6. Misinterpreting DISTINCT in JOINs

Using DISTINCT in complex joins without really understanding how it interacts with joined tables can lead to unexpected results, especially if there are duplicates in the joined tables.

7. Ignoring NULL Values

When your data has NULLs, you should be aware of the fact that DISTINCT treats all the NULLs as the same value. If a column contains multiple NULL values, SELECT DISTINCT will treat them as a unique single value.

Alternatives to SELECT DISTINCT

We learned that SELECT DISTINCT is an elegant and powerful tool for finding unique values. But there are also other ways to achieve the same result, which can sometimes be more efficient or appropriate.

Let’s see what these alternatives are.

Alternatives to SQL Select Distinct

1. GROUP BY Clause

  • Use: Even though its primary use is data aggregation, GROUP BY can also be used to remove duplicates when used without aggregate functions.
  • Why Use It?: There are some cases where GROUP BY is more efficient than DISTINCT, for example, when used with indexes or large data.
  • Example Usage: If you’re only interested in the unique values of a single column, then this query,
SELECT column 
FROM table 
GROUP BY column;

can be an alternative to the one below.

SELECT DISTINCT column 
FROM table;

2. Window Functions

  • Use: SQL Window functions provide a way to perform complex calculations across a set of rows related to the current row. While they don't inherently remove duplicates, they can be used to achieve similar results.
  • How to Use Them?: One common approach is to assign a row number to each row within a partition of data and then filter based on this row number to achieve uniqueness.
  • Example Usage: The below query using the window functions returns unique values.
SELECT column_1, 
 column_2 
FROM (SELECT column_1, 
 column_2, 
 ROW_NUMBER() OVER (PARTITION BY column_1 ORDER BY column_2) AS rn 
            FROM table) sub 
      WHERE rn = 1;

3. Subqueries

  • Use: Sometimes, using a subquery to filter the data before applying SELECT DISTINCT on the outer query can be more efficient.
  • How to Use It?: This approach involves nested queries where the inner query handles a part of the data manipulation, and the outer query applies the DISTINCT.
  • Example Usage: This is how you can use subqueries to remove duplicates.
	SELECT DISTINCT column 
FROM (SELECT column 
FROM table 
WHERE condition) sub;

4. EXISTS Clause

  • Use: The EXISTS clause is a logical operator used to test the existence of rows in a subquery. It can also be used to remove duplicate values.
  • How to Use It?: You can use it in the WHERE or HAVING clauses and with a subquery.
  • Example Usage: Here’s how you can use it.
SELECT *
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM customers c
    WHERE c.customer_id = o.customer_id
    AND c.country = 'USA'
);

5. SET Operators

  • Use: The UNION operator is used to combine the outputs of two or more queries into one output. It inherently removes duplicates. Using UNION can sometimes be a better idea than combining multiple SELECT DISTINCT queries.
  • How to Use It?: Like you would use UNION with two SELECT statements in any other scenario.
  • Example Usage:
SELECT column_1
FROM table_1

UNION

SELECT column_1
FROM table_2;

6. DISTINCT ON (PostgreSQL Specific)

  • Use: DISTINCT ON is a feature in PostgreSQL that allows you to get distinct values based on specific columns.
  • When to Use It?: When you want selective distinctiveness. In other words, when you want to fetch rows that are distinct in one or more columns but don’t require all columns to be distinct.
  • Example Use: Here’s the example of a query using DISTINCT ON to return rows that are unique with respect to column_1.
SELECT DISTINCT ON (column_1) column_1,
                   		 column2
      FROM TABLE;

Conclusion

The SQL SELECT DISTINCT statement is a relatively straightforward SQL statement designed to remove data duplicates. If you know SELECT, you’ll intuitively know how to use DISTINCT in most cases.

However, the DISTINCT clause can also be used with multiple columns, in aggregate functions, or with HAVING, CASE WHEN, or JOIN. While DISTINCT doesn’t change its nature, it’s crucial – especially in those more complex use cases – that you know precisely what DISTINCT does. We mentioned some misconceptions and typical SQL errors people make when using DISTINCT.

While DISTINCT is tremendous and probably the best choice for removing duplicates in most cases, you’ll sometimes need to consider alternatives. Sometimes GROUP BY, a subquery, or any other alternative we covered earlier are more adequate and resource-efficient for finding unique data values.

You see that knowing DISTINCT is essential. But also leads to the imperative of understanding some other important SQL concepts that can give you flexibility in your daily work with data. No matter which stage of your data career, you should ensure you’re familiar with the topics covered in the SQL Interview Questions article.

A Guide to SQL SELECT DISTINCT Statement


Become a data expert. Subscribe to our newsletter.