SELECT UNIQUE in SQL: Mastering Data Retrieval with DISTINCT

Select unique in SQL


What is SELECT UNIQUE in SQL? What about SELECT DISTINCT? How do you retrieve unique data? We will answer these three and more questions in this article.

Databases contain a lot of data, but this data is not necessarily unique. With so many rows of data being stored, it’s realistic to expect that some of that data will have the same values in the columns as some other data—in short, duplicate values.

Having duplicate values goes against the requirement of many data handling tasks, which is to eliminate duplicate values. This is where the SQL clause DISTINCT comes into play.

What “Unique” Means in the Context of Database Records

In a database context, “unique” means that each database record (or row) is different from all others. It doesn’t mean that all the values (columns) need to be different from all other rows’ columns; it’s enough that only one column is different to proclaim the row unique.

So, what we’re looking at here is the unique combination of values. Take a look at this illustration.

What is SELECT UNIQUE in SQL

The red rows are not unique because they have exactly the same values in all columns. We have duplicate rows for whatever reason – maybe it’s intentional or an error.

However, the green row is unique, despite the value in one column only (first_name) being different from the other two rows. This is a record of a different person (maybe Bob’s twin brother Leo Zimmerman?) than in the other two rows.

Why Retrieving Unique Records is Crucial

There are three main reasons why getting unique data is important.

Why retrieving select unique is crucial in sql

1. Data Analysis

Many a data professional has made the mistake of including duplicate values when only unique values are required. By doing so, they made the entire data analysis and its conclusion utterly worthless. Not excluding duplicates from your data means you will artificially inflate whatever values you’re trying to calculate. This could lead to painfully wrong conclusions and business decisions. Yes, you have the power to run the businesses into the ground! Are you scared of such a responsibility? You should be.

Imagine you have a database that records each order as a new row whenever the order’s status changes.

Data analysis with select unique in sql

Naively, you just sum all the rows and report $28,200 monthly revenue—well above the monthly plan of $18,000. A record! The boss is happy, every employee gets a $1,000 bonus, and the whole company goes on holiday for five days as a reward. What happens when you come back to work and realize you included duplicate orders?

Data analysis with select unique in sql

Shit, the actual monthly revenue is $16,600, almost 50% below the monthly plan!

You’ll be happy if the boss only throws a chair your way. (With all the ‘respectful workplace’ and ‘code of conduct’ guidelines flying straight out the window!)

2. Data Integrity

The other side of the previous point is having duplicate values where there shouldn’t be. In short, data integrity. Compromise it, and you’re again risking producing data that will lead to erroneous decision-making.

For example, you now have a database where each transaction in the table should be unique. However, due to a system glitch, some transactions were erroneously recorded multiple times.

Data integrity with select unique in sql

If you checked the data, you would see that only some transaction IDs appear once. By doing so, you could report the mistake, make corrections, and initiate an investigation to find the cause of the error. On top of that, you can temporarily work with such data by simply summing the unique transaction IDs. That way, you get a correct total transaction amount of $10,500, not $19,500, as you would by including duplicate transactions.

Data integrity with select unique in sql

3. Querying Efficiency

Creating a sub-dataset that includes only unique values can significantly improve the efficiency of the queries used in the following steps of data processing. This is logical, as having only unique rows reduces the amount of data to be queried.

Take this, for example. Imagine this table tracks user login attempts.

Quering efficiency with select unique in sql

Potentially, the table could have millions of login data rows. Imagine that you’re making some report for which it’s enough for the user to successfully log in only once to be considered for the report.

If that’s the main criterion, you would wisely reduce the data by selecting only unique values. Then, you can perform any further analysis much quicker and easier.

The DISTINCT Keyword

We’ve been talking about unique values, so what the heck is DISTINCT now? Well, SQL uses the DISTINCT keyword to select unique values. There’s no such thing as UNIQUE. OK, that’s not entirely true. So, let’s clear this out.

UNIQUE is an old command for selecting unique values in the Oracle database. Later, the SQL standard defined DISTINCT as the keyword for fetching unique (or distinct) values. So, this is now used in all databases, including Oracle. However, Oracle didn’t ditch UNIQUE. Even though it completely disappeared for collective consciousness, you can still, if you insist or feel very retro, use UNIQUE instead of DISTINCT in Oracle to get unique values.

The DISTINCT keyword is the most commonly used as a part of SELECT DISTINCT queries.

Examples of SELECT DISTINCT Queries

Since there’s nothing that special with the SELECT DISTINCT syntax, I’ll immediately go to simple examples that show its two widespread uses.

I’ll just run through queries and what they do. Don’t worry; very detailed explanations of applying DISTINCT in practice will follow in the next section.

Using SELECT DISTINCT With One Column

One typical use is making a list of unique names of customers. It’s expected that many customers have the same first name, so to create a list of names, you’d need SELECT DISTINCT.

SELECT DISTINCT first_name
FROM customers;

It’s a regular SELECT statement but with the keyword DISTINCT before the selected column. This way, the database knows you’re interested in showing each customer name only once.

Using SELECT DISTINCT With Aggregate Function

Another common use of DISTINCT is with the aggregate function.

For example, you’re tracking individual customer orders. Each row records the order ID, together with the customer ID and order value. Naturally, many customers placed multiple orders.

You’re interested in finding the number of unique order amounts across all customers. You can do that using DISTINCT in the aggregate function, COUNT() in this case.

SELECT COUNT(DISTINCT order_amount) AS number_of_unique_order_amounts

FROM orders;

When using DISTINCT with an aggregate function, remember that DISTINCT goes inside the aggregate function and before the column within its parentheses.

Using DISTINCT on Multiple Columns

When used with one column, DISTINCT returns the unique values of that column. However, when used with multiple columns, DISTINCT looks for unique value combinations across the selected columns.

Two important notes here. No, you don’t write the DISTINCT keyword in front of every column in SELECT; you write it only in front of the first selected column.Second, no, you can’t selectively choose which selected columns you want to apply DISTINCT to. In other words, you either select unique combinations of all the columns or no unique columns at all. In even ‘otherer’ words, whatever columns you list in SELECT will be included in the DISTINCT clause.

Here’s an example to make it clear.

This is a slightly more complicated example than the previous two, so I’ll show you an example dataset. That way, you’ll visualize what’s going on better.

What would happen if you wrote a query like this?

SELECT DISTINCT first_name,

 last_name,

 department

FROM employee;

It would return unique combinations of the employees’ first names, last names, and departments.

Using select unique in sql on multiple columns

Let’s compare it to the original table we’re querying.

 Using select unique in sql on multiple columns

Remember, we’re talking about the combination of all columns. That’s why both Mark Johnson and Mark Wallace from Marketing appear: they have the same first name and department but different last names.

There are also two Mark Johnsons, but they are not duplicates, as they are from different departments.

As you can see, the only omitted row is the duplicate of Francesca Garibaldi from Sales. Rows 5 and 6 in the initial table are exactly the same, so they are shown only once in the query output.

Practical Applications of SELECT UNIQUE in SQL

Let’s now see how selecting unique values in SQL can be applied in the business context. There are some great examples in our coding interview questions. There it is, another practical application for you: passing the SQL job interview by practicing SQL interview questions.

Business Application: SELECT DISTINCT With One Column

This question is by Wine Magazine. It’s an example of a simple use of SELECT DISTINCT with one column.


Table: winemag_p1

Link to the question: https://platform.stratascratch.com/coding/10022-find-all-wine-varieties-which-can-be-considered-cheap-based-on-the-price

The question asks you to query the table winemag_p1 and find all the unique cheap wine varieties.

Table: winemag_p1
idcountrydescriptiondesignationpointspriceprovinceregion_1region_2varietywinery
126576USRich and round, this offers plenty of concentrated blackberry notes enveloped in warm spices and supple oak. There's a hint of green tomato leaves throughout, but the lush fruit combined with sturdy grape tannins and high acidity would pair well with fatty short ribs or braised pork.Estate Club8732VirginiaVirginiaMerlotVeramar
127077ItalyThis luminous sparkler offers measured aromas of yellow fruit and honey and delivers a broad, wide approach in the mouth. The wine's texture is creamy and full and there is a spicy point of effervescence on the palate.Extra Dry8519VenetoProsecco di ValdobbiadeneProseccoVaraschin
143029USThere are some pretty features to this wine, namely the lovely tannins and rich fruit, but it's just too soft and sweet. Tastes like blackberry jam, with extra white sugar spinkled on top.Signature8345CaliforniaPaso RoblesCentral CoastCabernet SauvignonByington
98813USLovely nose, which runs from violets and cherries right into coffee and sweet chocolate. In the mouth the wine coats the palate with Napa-like intensity, only buttressed with more vivid acidity. It's both smooth and deep. The aromas are more detailed, interesting and really clean, loaded with dark flavors of soy, coffee, molasses and black cherry liqueur.Champoux & Loess Vineyards Reserve93100WashingtonWashingtonWashington OtherCabernet SauvignonBoudreaux Cellars
43172USTastes sweeter and fruitier than the 2009, which was dry, so if you like orange marmalade, pineapple tart and vanilla cookie sugariness in your Sauvignon Blancs, it's for you.Les Pionniers8640CaliforniaSonoma CountySonomaSauvignon BlancCapture

I used distinct on the wine variety column. After that, I only needed to filter data using where to output only wines with prices between $5 and $20.

SELECT DISTINCT variety
FROM winemag_p1
WHERE price BETWEEN 5 AND 20;


Here’s the output.

All required columns and the first 5 rows of the solution are shown

variety
Kuntra
Riesling
Sangiovese
Assyrtiko
Malbec-Syrah

Just for fun, let’s see how to use SELECT UNIQUE in Oracle. I’ll take the above code, use UNIQUE instead of DISTINCT, and run it in the Oracle code editor.

SELECT UNIQUE variety
FROM winemag_p1
WHERE price BETWEEN 5 AND 20


The code works perfectly!

Business Application: SELECT DISTINCT With Aggregation

The following question was sourced from the Netflix and Google interviews.


Table: oscar_nominees

Link to the question: https://platform.stratascratch.com/coding/10128-count-the-number-of-movies-that-abigail-breslin-nominated-for-oscar

It requires you to count the number of movies for which Abigail Breslin was an Oscar nominee. The question gives you the table oscar_nominees to work with.

Table: oscar_nominees
yearcategorynomineemoviewinnerid
2006actress in a supporting roleAbigail BreslinLittle Miss SunshineFALSE1
1984actor in a supporting roleAdolph CaesarA Soldier's StoryFALSE2
2006actress in a supporting roleAdriana BarrazaBabelFALSE3
2002actor in a leading roleAdrien BrodyThe PianistTRUE4
1942actress in a supporting roleAgnes MooreheadThe Magnificent AmbersonsFALSE5

To solve this question, use DISTINCT with the COUNT() function. Then, filter the nominee using WHERE.

SELECT COUNT(DISTINCT movie) AS n_movies_by_abi
FROM oscar_nominees
WHERE nominee = 'Abigail Breslin';

Abigail was nominated for only one movie.

All required columns and the first 5 rows of the solution are shown

n_movies_by_abi
1

Business Application: DISTINCT With Multiple Columns

Here’s an example of a business needing to use DISTINCT with multiple columns in SQL. The question is by the City of San Francisco.


Table: sf_restaurant_health_violations

Link to the question: https://platform.stratascratch.com/coding/9720-find-the-business-names-that-have-inspection-scores-of-less-than-50/official-solution

You’ll work with the sf_restaurant_health_violations table.

Table: sf_restaurant_health_violations
business_idbusiness_namebusiness_addressbusiness_citybusiness_statebusiness_postal_codebusiness_latitudebusiness_longitudebusiness_locationbusiness_phone_numberinspection_idinspection_dateinspection_scoreinspection_typeviolation_idviolation_descriptionrisk_category
5800John Chin Elementary School350 Broadway StSan FranciscoCA9413337.798-122.403{'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'}5800_201710172017-10-1798Routine - Unscheduled5800_20171017_103149Wiping cloths not clean or properly stored or inadequate sanitizerLow Risk
64236Sutter Pub and Restaurant700 Sutter StSan FranciscoCA9410237.789-122.412{'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'}64236_201707252017-07-2588Routine - Unscheduled64236_20170725_103133Foods not protected from contaminationModerate Risk
1991SRI THAI CUISINE4621 LINCOLN WaySan FranciscoCA9412237.764-122.508{'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'}1991_201711292017-11-2986Routine - Unscheduled1991_20171129_103139Improper food storageLow Risk
3816Washington Bakery & Restaurant733 Washington StSan FranciscoCA9410837.795-122.406{'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'}3816_201607282016-07-2867Routine - Unscheduled3816_20160728_103108Contaminated or adulterated foodHigh Risk
39119Brothers Restaurant4128 GEARY BlvdSan FranciscoCA9411837.781-122.464{'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'}39119_201607182016-07-1879Routine - Unscheduled39119_20160718_103133Foods not protected from contaminationModerate Risk

This table records data on the health violation level. One inspection can have several health violations, so it’s highly likely that the same inspection will appear several times.

Having that in mind, as well as the requirement to output the business name, inspection date, and score, we know we need to use DISTINCT the following way.

SELECT DISTINCT business_name,
inspection_date,
inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;

This ensures that each inspection data appears only once for a particular business.

In the final code line, you also need to filter data so the output shows only inspections below the score of 50.

The output shows only two businesses and inspections.

All required columns and the first 5 rows of the solution are shown

business_nameinspection_dateinspection_score
Da Cafe2016-09-0748
Lollipot2018-05-2245

Advanced Topics in Selecting Unique Values in SQL

Now, I want to talk about some more complex (but not too much!) ways of using DISTINCT.

Business Application: SELECT DISTINCT With Aggregate and GROUP BY

In one of the previous examples, I used SELECT DISTINCT with an aggregate function. Because there was only an aggregated column in SELECT, there was no need to use GROUP BY in the query.

Let’s now see an example where GROUP BY is necessary. For instance, in this question by Airbnb.


Tables: airbnb_hosts, airbnb_units

Link to the question: https://platform.stratascratch.com/coding/10156-number-of-units-per-nationality

The question asks you to find the number of apartments by host nationality, but you should count only hosts younger than 30.

We work with two tables here. The first one is airbnb_hosts.

Table: airbnb_hosts
host_idnationalitygenderage
0USAM28
1USAF29
2ChinaF31
3ChinaM24
4MaliM30

The second is a table of accommodation units.

Table: airbnb_units
host_idunit_idunit_typen_bedsn_bedroomscountrycity
0A1Room11USANew York
0A2Room11USANew Jersey
0A3Room11USANew Jersey
1A4Apartment21USAHouston
1A5Apartment21USALas Vegas

Yes, you’re right; we will select nationality and use COUNT() with DISTINCT to find the number of units. The selected columns are from different tables, so we need to join them. (For a reminder how this works, read about types of SQL JOIN.

To get the count by nationality, it’s necessary to group the output by it. So, this is not really different from any other use of the aggregate functions. The principle is always the same: the columns you have in SELECT have to be in GROUP BY. The DISTINCT keyword doesn’t change anything in that sense; it only impacts how the aggregate function itself will behave.

Regarding our query, we need to fulfill some additional question requirements. We use WHERE to include only accommodations categorized as apartments and those owned by under-30s.

As a last step, we must sort the output by the number of apartments descendingly.

SELECT nationality,
       COUNT(DISTINCT unit_id) AS apartment_count
FROM airbnb_units apartment 
INNER JOIN airbnb_hosts host 
ON apartment.host_id = host.host_id 
WHERE host.age < 30 AND unit_type = 'Apartment'
GROUP BY host.nationality 
ORDER BY apartment_count DESC;

The code output is this.

All required columns and the first 5 rows of the solution are shown

nationalityapartment_count
USA2

Business Application: DISTINCT in HAVING

DISTINCT is not limited to SELECT; you can use it in HAVING to filter the output based on unique values.

Let’s solve the question by the City of Los Angeles to see how.


Table: los_angeles_restaurant_health_inspections

Link to the question: https://platform.stratascratch.com/coding/9694-single-facility-corporations

Here, we need to output the owners that have restaurants with a single facility.

We will query the los_angeles_restaurant_health_inspections table.

Table: los_angeles_restaurant_health_inspections
serial_numberactivity_datefacility_namescoregradeservice_codeservice_descriptionemployee_idfacility_addressfacility_cityfacility_idfacility_statefacility_zipowner_idowner_namepe_descriptionprogram_element_peprogram_nameprogram_statusrecord_id
DAQHRSETQ2017-06-08MARGARITAS CAFE93A1ROUTINE INSPECTIONEE00000065026 S CRENSHAW BLVDLOS ANGELESFA0023656CA90043OW0004133BAZAN, ASCENCIONRESTAURANT (61-150) SEATS HIGH RISK1638MARGARITAS CAFEACTIVEPR0011718
DA2GQRJOS2017-03-07LAS MOLENDERAS97A1ROUTINE INSPECTIONEE00009972635 WHITTIER BLVDLOS ANGELESFA0160416CA90023OW0125379MARISOL FEREGRINORESTAURANT (0-30) SEATS HIGH RISK1632LAS MOLENDERASINACTIVEPR0148504
DAMQTA46T2016-03-22SANDRA'S TAMALES93A1ROUTINE INSPECTIONEE00010495390 WHITTIER BLVDLOS ANGELESFA0171769CA90022-4032OW0178828SANDRA'S TAMALES INC.RESTAURANT (0-30) SEATS MODERATE RISK1631SANDRA'S TAMALESACTIVEPR0164225
DAXMBTIRZ2018-02-12CAFE GRATITUDE97A1ROUTINE INSPECTIONEE0000828639 N LARCHMONT BLVD STE #102LOS ANGELESFA0058921CA90004OW0005704CAFE GRATITUDE LARCHMONT LLCRESTAURANT (61-150) SEATS HIGH RISK1638CAFE GRATITUDEACTIVEPR0019854
DAK8TBMS02015-09-10THE WAFFLE90A1ROUTINE INSPECTIONEE00007096255 W SUNSET BLVD STE #105LOS ANGELESFA0051830CA90028OW0035796THE WAFFLE, LLCRESTAURANT (61-150) SEATS HIGH RISK1638THE WAFFLEACTIVEPR0010922

We select the owner names from the table and group the results by the owner ID and name. This ensures we have unique owners, which are repeated across multiple inspections.

Now, we need to keep only the owners with a one-facility restaurant. We can do that using HAVING, which means we filter after grouping the output. In HAVING, we use COUNT() with DISTINCT to count the unique facility IDs.

SELECT owner_name
FROM los_angeles_restaurant_health_inspections
GROUP BY owner_id, owner_name
HAVING COUNT(DISTINCT facility_id) = 1
ORDER BY owner_name;

The code outputs this list.

All required columns and the first 5 rows of the solution are shown

owner_name
110 SUNSHINE SMOOTHIES, INC.
110 SUNSHINE SMOOTHIES, INC.
7111 MELROSE LLC
99 CENT ONLY STORES, LLC
9 STONES, LLC

Business Application: DISTINCT With CASE Statement

The DISTINCT clause can also be used with the CASE statement, like with any regular column.

In this question by Salesforce and Asana, we use DISTINCT with the COUNT() function and CASE statement.


Table: sf_events

Link to the question: https://platform.stratascratch.com/coding/2052-user-growth-rate

We’ll work with a single table named sf_events.

Table: sf_events
dateaccount_iduser_id
2021-01-01A1U1
2021-01-01A1U2
2021-01-06A1U3
2021-01-02A1U1
2020-12-24A1U2

From the table, we need to find the growth rate of active users between December 2020 and January 2021 for each account.

How do we do that? Growth is defined as the number of users in January 2021 divided by the number in December 2020.

So, this is exactly what COUNT(), DISTINCT, and CASE do in the query below. The first instance (in the numerator) uses the CASE statement to find all active users between 1 January 2021 and 31 January 2021. DISTINCT ensures that only unique users are counted.

The calculation in the denominator does the same but for dates between 1 December 2020 and 31 December 2020.

In the end, the output is just grouped by account ID.

SELECT account_id,
       COUNT(DISTINCT (CASE
                           WHEN date BETWEEN '2021-01-01' AND '2021-01-31' THEN user_id
                           ELSE NULL
                       END))::float / COUNT(DISTINCT (CASE
                                                          WHEN date BETWEEN '2020-12-01' AND '2020-12-31' THEN user_id
                                                          ELSE NULL
                                                      END))::float AS growth_rate
FROM sf_events
GROUP BY 1;

Here’s the code output.

All required columns and the first 5 rows of the solution are shown

account_id?column?
A11
A22
A30.5

Best Practices and Common Pitfalls

Writing code with DISTINCT is not complicated at all. Using DISTINCT to its strength is much more difficult. In doing that, try to apply some best practices to avoid common pitfalls in using DISTINCT.

Best Practices in Using DISTINCT in SQL

The overview below shows five best practices I advise you to follow. Familiarize yourself with the list, and then I’ll explain it.

Best practices in using select unique in sql

1. Use DISTINCT Sparingly: You should use DISTINCT only when necessary. If the data in the table is already unique, don’t use DISTINCT. If you can get unique values by grouping data, use GROUP BY. DISTINCT can slow down your query, especially with large datasets, as it goes through all the data to find and eliminate duplicates.

2. Use Indexes: If the columns used in DISTINCT are indexed, this can improve your query performance.

3. Limit the Number of Columns in DISTINCT: Don’t go into an overkill with the number of columns in DISTINCT: the more columns, the slower the query. Think about what columns you need and include only them in DISTINCT.

4. Use Aggregation Instead of DISTINCT: In one of the examples, I talked about how distinct values are fetched using GROUP BY. Whenever you can do it with GROUP BY or GROUP BY + aggregation, do so.

5. Understand Data: If you know your data, you’ll also know if some is unique. This relates to the nature of the data itself. But it also refers to data integrity constraints that ensure data is unique. Being aware of them saves you from using DISTINCT, as it’s unnecessary.

You’ll get into pitfalls if you ignore these best practices: you’ll use DISTINCT all the time with too many columns on data that already is unique while completely ignoring the existence of data grouping and aggregation. As a result, you’ll write too much code that will work unnecessarily slowly.

Conclusion

Knowing how to query unique values in SQL is important for anyone working with data. Hence, mastering DISTINCT is, too.

While its syntax is nothing short of banal, knowing when and how to use DISTINCT is anything but. You’ll need some practice to get to that level. Practice and business level is what StrataScratch is all about: in our coding questions section; there are almost 150 interview questions focusing on DISTINCT.

More than enough to learn DISTINCT inside out!

Select unique in SQL


Become a data expert. Subscribe to our newsletter.