Published: April 29, 2024

Categories:

Written by:
Nathan Rosidi

What is SELECT UNIQUE in SQL? What about SELECT DISTINCT? How do you retrieve unique data? We will answer these three and more questions in this article.

Databases contain a lot of data, but this data is not necessarily unique. With so many rows of data being stored, it’s realistic to expect that some of that data will have the same values in the columns as some other data—in short, duplicate values.

Having duplicate values goes against the requirement of many data handling tasks, which is to eliminate duplicate values. This is where the SQL clause DISTINCT comes into play.

What “Unique” Means in the Context of Database Records

In a database context, “unique” means that each database record (or row) is different from all others. It doesn’t mean that all the values (columns) need to be different from all other rows’ columns; it’s enough that only one column is different to proclaim the row unique.

So, what we’re looking at here is the unique combination of values. Take a look at this illustration.

The red rows are not unique because they have exactly the same values in all columns. We have duplicate rows for whatever reason – maybe it’s intentional or an error.

However, the green row is unique, despite the value in one column only (first_name) being different from the other two rows. This is a record of a different person (maybe Bob’s twin brother Leo Zimmerman?) than in the other two rows.

Why Retrieving Unique Records is Crucial

There are three main reasons why getting unique data is important.

Why retrieving select unique is crucial in sql

1. Data Analysis

Many a data professional has made the mistake of including duplicate values when only unique values are required. By doing so, they made the entire data analysis and its conclusion utterly worthless. Not excluding duplicates from your data means you will artificially inflate whatever values you’re trying to calculate. This could lead to painfully wrong conclusions and business decisions. Yes, you have the power to run the businesses into the ground! Are you scared of such a responsibility? You should be.

Imagine you have a database that records each order as a new row whenever the order’s status changes.

Naively, you just sum all the rows and report $28,200 monthly revenue—well above the monthly plan of $18,000. A record! The boss is happy, every employee gets a $1,000 bonus, and the whole company goes on holiday for five days as a reward. What happens when you come back to work and realize you included duplicate orders?

Shit, the actual monthly revenue is $16,600, almost 50% below the monthly plan!

You’ll be happy if the boss only throws a chair your way. (With all the ‘respectful workplace’ and ‘code of conduct’ guidelines flying straight out the window!)

2. Data Integrity

The other side of the previous point is having duplicate values where there shouldn’t be. In short, data integrity. Compromise it, and you’re again risking producing data that will lead to erroneous decision-making.

For example, you now have a database where each transaction in the table should be unique. However, due to a system glitch, some transactions were erroneously recorded multiple times.

Data integrity with select unique in sql

If you checked the data, you would see that only some transaction IDs appear once. By doing so, you could report the mistake, make corrections, and initiate an investigation to find the cause of the error. On top of that, you can temporarily work with such data by simply summing the unique transaction IDs. That way, you get a correct total transaction amount of $10,500, not $19,500, as you would by including duplicate transactions.

3. Querying Efficiency

Creating a sub-dataset that includes only unique values can significantly improve the efficiency of the queries used in the following steps of data processing. This is logical, as having only unique rows reduces the amount of data to be queried.

Take this, for example. Imagine this table tracks user login attempts.

Quering efficiency with select unique in sql

Potentially, the table could have millions of login data rows. Imagine that you’re making some report for which it’s enough for the user to successfully log in only once to be considered for the report.

If that’s the main criterion, you would wisely reduce the data by selecting only unique values. Then, you can perform any further analysis much quicker and easier.

The DISTINCT Keyword

We’ve been talking about unique values, so what the heck is DISTINCT now? Well, SQL uses the DISTINCT keyword to select unique values. There’s no such thing as UNIQUE. OK, that’s not entirely true. So, let’s clear this out.

UNIQUE is an old command for selecting unique values in the Oracle database. Later, the SQL standard defined DISTINCT as the keyword for fetching unique (or distinct) values. So, this is now used in all databases, including Oracle. However, Oracle didn’t ditch UNIQUE. Even though it completely disappeared for collective consciousness, you can still, if you insist or feel very retro, use UNIQUE instead of DISTINCT in Oracle to get unique values.

The DISTINCT keyword is the most commonly used as a part of SELECT DISTINCT queries.

Examples of SELECT DISTINCT Queries

Since there’s nothing that special with the SELECT DISTINCT syntax, I’ll immediately go to simple examples that show its two widespread uses.

I’ll just run through queries and what they do. Don’t worry; very detailed explanations of applying DISTINCT in practice will follow in the next section.

Using SELECT DISTINCT With One Column

One typical use is making a list of unique names of customers. It’s expected that many customers have the same first name, so to create a list of names, you’d need SELECT DISTINCT.

SELECT DISTINCT first_name
FROM customers;

It’s a regular SELECT statement but with the keyword DISTINCT before the selected column. This way, the database knows you’re interested in showing each customer name only once.

Using SELECT DISTINCT With Aggregate Function

Another common use of DISTINCT is with the aggregate function.

For example, you’re tracking individual customer orders. Each row records the order ID, together with the customer ID and order value. Naturally, many customers placed multiple orders.

You’re interested in finding the number of unique order amounts across all customers. You can do that using DISTINCT in the aggregate function, COUNT() in this case.

SELECT COUNT(DISTINCT order_amount) AS number_of_unique_order_amounts

FROM orders;

When using DISTINCT with an aggregate function, remember that DISTINCT goes inside the aggregate function and before the column within its parentheses.

Using DISTINCT on Multiple Columns

When used with one column, DISTINCT returns the unique values of that column. However, when used with multiple columns, DISTINCT looks for unique value combinations across the selected columns.

Two important notes here. No, you don’t write the DISTINCT keyword in front of every column in SELECT; you write it only in front of the first selected column.Second, no, you can’t selectively choose which selected columns you want to apply DISTINCT to. In other words, you either select unique combinations of all the columns or no unique columns at all. In even ‘otherer’ words, whatever columns you list in SELECT will be included in the DISTINCT clause.

Here’s an example to make it clear.

This is a slightly more complicated example than the previous two, so I’ll show you an example dataset. That way, you’ll visualize what’s going on better.

What would happen if you wrote a query like this?

SELECT DISTINCT first_name,

 last_name,

 department

FROM employee;

It would return unique combinations of the employees’ first names, last names, and departments.

Using select unique in sql on multiple columns

Let’s compare it to the original table we’re querying.

Remember, we’re talking about the combination of all columns. That’s why both Mark Johnson and Mark Wallace from Marketing appear: they have the same first name and department but different last names.

There are also two Mark Johnsons, but they are not duplicates, as they are from different departments.

As you can see, the only omitted row is the duplicate of Francesca Garibaldi from Sales. Rows 5 and 6 in the initial table are exactly the same, so they are shown only once in the query output.

Practical Applications of SELECT UNIQUE in SQL

Let’s now see how selecting unique values in SQL can be applied in the business context. There are some great examples in our coding interview questions. There it is, another practical application for you: passing the SQL job interview by practicing SQL interview questions.

Business Application: SELECT DISTINCT With One Column

This question is by Wine Magazine. It’s an example of a simple use of SELECT DISTINCT with one column.

Find all wine varieties which can be considered cheap based on the price

Wine Magazine

EasyID 10022

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find all wine varieties which can be considered cheap based on the price. A variety is considered cheap if the price of a bottle lies between 5 to 20 USD. Output unique variety names only.

Table: winemag_p1

Link to the question: https://platform.stratascratch.com/coding/10022-find-all-wine-varieties-which-can-be-considered-cheap-based-on-the-price

The question asks you to query the table winemag_p1 and find all the unique cheap wine varieties.

Table: winemag_p1

I used distinct on the wine variety column. After that, I only needed to filter data using where to output only wines with prices between $5 and $20.

SELECT DISTINCT variety
FROM winemag_p1
WHERE price BETWEEN 5 AND 20;

Here’s the output.

An error loading the results has occurred

Just for fun, let’s see how to use SELECT UNIQUE in Oracle. I’ll take the above code, use UNIQUE instead of DISTINCT, and run it in the Oracle code editor.

SELECT UNIQUE variety
FROM winemag_p1
WHERE price BETWEEN 5 AND 20

The code works perfectly!

Business Application: SELECT DISTINCT With Aggregation

The following question was sourced from the Netflix and Google interviews.

Abigail Breslin Nominations

Google

Netflix

EasyID 10128

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Count the number of movies for which Abigail Breslin was nominated for an Oscar.

Table: oscar_nominees

Link to the question: https://platform.stratascratch.com/coding/10128-count-the-number-of-movies-that-abigail-breslin-nominated-for-oscar

It requires you to count the number of movies for which Abigail Breslin was an Oscar nominee. The question gives you the table oscar_nominees to work with.

Table: oscar_nominees

To solve this question, use DISTINCT with the COUNT() function. Then, filter the nominee using WHERE.

SELECT COUNT(DISTINCT movie) AS n_movies_by_abi
FROM oscar_nominees
WHERE nominee = 'Abigail Breslin';

Abigail was nominated for only one movie.

An error loading the results has occurred

Business Application: DISTINCT With Multiple Columns

Here’s an example of a business needing to use DISTINCT with multiple columns in SQL. The question is by the City of San Francisco.

Find the business names that have inspection scores of less than 50

City of San Francisco

EasyID 9720

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Software Engineer

Find the business names that scored less than 50 in inspections. Output the result along with the corresponding inspection date and the score.

Table: sf_restaurant_health_violations

Link to the question: https://platform.stratascratch.com/coding/9720-find-the-business-names-that-have-inspection-scores-of-less-than-50/official-solution

You’ll work with the sf_restaurant_health_violations table.

Table: sf_restaurant_health_violations

This table records data on the health violation level. One inspection can have several health violations, so it’s highly likely that the same inspection will appear several times.

Having that in mind, as well as the requirement to output the business name, inspection date, and score, we know we need to use DISTINCT the following way.

SELECT DISTINCT business_name,
inspection_date,
inspection_score
FROM sf_restaurant_health_violations
WHERE inspection_score < 50;

This ensures that each inspection data appears only once for a particular business.

In the final code line, you also need to filter data so the output shows only inspections below the score of 50.

The output shows only two businesses and inspections.

An error loading the results has occurred

Advanced Topics in Selecting Unique Values in SQL

Now, I want to talk about some more complex (but not too much!) ways of using DISTINCT.

Business Application: SELECT DISTINCT With Aggregate and GROUP BY

In one of the previous examples, I used SELECT DISTINCT with an aggregate function. Because there was only an aggregated column in SELECT, there was no need to use GROUP BY in the query.

Let’s now see an example where GROUP BY is necessary. For instance, in this question by Airbnb.

Number Of Units Per Nationality

Airbnb

MediumID 10156

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

We have data on rental properties and their owners. Write a query that figures out how many different apartments (use unit_id) are owned by people under 30, broken down by their nationality. We want to see which nationality owns the most apartments, so make sure to sort the results accordingly.

Tables: airbnb_hosts, airbnb_units

Link to the question: https://platform.stratascratch.com/coding/10156-number-of-units-per-nationality

The question asks you to find the number of apartments by host nationality, but you should count only hosts younger than 30.

We work with two tables here. The first one is airbnb_hosts.

Table: airbnb_hosts

The second is a table of accommodation units.

Table: airbnb_units

Yes, you’re right; we will select nationality and use COUNT() with DISTINCT to find the number of units. The selected columns are from different tables, so we need to join them. (For a reminder how this works, read about types of SQL JOIN.

To get the count by nationality, it’s necessary to group the output by it. So, this is not really different from any other use of the aggregate functions. The principle is always the same: the columns you have in SELECT have to be in GROUP BY. The DISTINCT keyword doesn’t change anything in that sense; it only impacts how the aggregate function itself will behave.

Regarding our query, we need to fulfill some additional question requirements. We use WHERE to include only accommodations categorized as apartments and those owned by under-30s.

As a last step, we must sort the output by the number of apartments descendingly.

SELECT nationality,
       COUNT(DISTINCT unit_id) AS apartment_count
FROM airbnb_units apartment 
INNER JOIN airbnb_hosts host 
ON apartment.host_id = host.host_id 
WHERE host.age < 30 AND unit_type = 'Apartment'
GROUP BY host.nationality 
ORDER BY apartment_count DESC;

The code output is this.

An error loading the results has occurred

Business Application: DISTINCT in HAVING

DISTINCT is not limited to SELECT; you can use it in HAVING to filter the output based on unique values.

Let’s solve the question by the City of Los Angeles to see how.

Single Facility Corporations

City of Los Angeles

MediumID 9694

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find all owners which have only a single facility. Output the owner_name and order the results alphabetically.

Table: los_angeles_restaurant_health_inspections

Link to the question: https://platform.stratascratch.com/coding/9694-single-facility-corporations

Here, we need to output the owners that have restaurants with a single facility.

We will query the los_angeles_restaurant_health_inspections table.

Table: los_angeles_restaurant_health_inspections

We select the owner names from the table and group the results by the owner ID and name. This ensures we have unique owners, which are repeated across multiple inspections.

Now, we need to keep only the owners with a one-facility restaurant. We can do that using HAVING, which means we filter after grouping the output. In HAVING, we use COUNT() with DISTINCT to count the unique facility IDs.

SELECT owner_name
FROM los_angeles_restaurant_health_inspections
GROUP BY owner_id, owner_name
HAVING COUNT(DISTINCT facility_id) = 1
ORDER BY owner_name;

The code outputs this list.

An error loading the results has occurred

Business Application: DISTINCT With CASE Statement

The DISTINCT clause can also be used with the CASE statement, like with any regular column.

In this question by Salesforce and Asana, we use DISTINCT with the COUNT() function and CASE statement.

User Growth Rate

Last Updated: July 2021

Asana

Salesforce

EasyID 2052

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the growth rate of active users for Dec 2020 to Jan 2021 for each account. The growth rate is defined as the number of users in January 2021 divided by the number of users in Dec 2020. Output the account_id and growth rate.

Table: sf_events

Link to the question: https://platform.stratascratch.com/coding/2052-user-growth-rate

We’ll work with a single table named sf_events.

Table: sf_events

From the table, we need to find the growth rate of active users between December 2020 and January 2021 for each account.

How do we do that? Growth is defined as the number of users in January 2021 divided by the number in December 2020.

So, this is exactly what COUNT(), DISTINCT, and CASE do in the query below. The first instance (in the numerator) uses the CASE statement to find all active users between 1 January 2021 and 31 January 2021. DISTINCT ensures that only unique users are counted.

The calculation in the denominator does the same but for dates between 1 December 2020 and 31 December 2020.

In the end, the output is just grouped by account ID.

SELECT account_id,
       COUNT(DISTINCT (CASE
                           WHEN date BETWEEN '2021-01-01' AND '2021-01-31' THEN user_id
                           ELSE NULL
                       END))::float / COUNT(DISTINCT (CASE
                                                          WHEN date BETWEEN '2020-12-01' AND '2020-12-31' THEN user_id
                                                          ELSE NULL
                                                      END))::float AS growth_rate
FROM sf_events
GROUP BY 1;

Here’s the code output.

An error loading the results has occurred

Best Practices and Common Pitfalls

Writing code with DISTINCT is not complicated at all. Using DISTINCT to its strength is much more difficult. In doing that, try to apply some best practices to avoid common pitfalls in using DISTINCT.

Best Practices in Using DISTINCT in SQL

The overview below shows five best practices I advise you to follow. Familiarize yourself with the list, and then I’ll explain it.

Best practices in using select unique in sql

1. Use DISTINCT Sparingly: You should use DISTINCT only when necessary. If the data in the table is already unique, don’t use DISTINCT. If you can get unique values by grouping data, use GROUP BY. DISTINCT can slow down your query, especially with large datasets, as it goes through all the data to find and eliminate duplicates.

2. Use Indexes: If the columns used in DISTINCT are indexed, this can improve your query performance.

3. Limit the Number of Columns in DISTINCT: Don’t go into an overkill with the number of columns in DISTINCT: the more columns, the slower the query. Think about what columns you need and include only them in DISTINCT.

4. Use Aggregation Instead of DISTINCT: In one of the examples, I talked about how distinct values are fetched using GROUP BY. Whenever you can do it with GROUP BY or GROUP BY + aggregation, do so.

5. Understand Data: If you know your data, you’ll also know if some is unique. This relates to the nature of the data itself. But it also refers to data integrity constraints that ensure data is unique. Being aware of them saves you from using DISTINCT, as it’s unnecessary.

You’ll get into pitfalls if you ignore these best practices: you’ll use DISTINCT all the time with too many columns on data that already is unique while completely ignoring the existence of data grouping and aggregation. As a result, you’ll write too much code that will work unnecessarily slowly.

Conclusion

Knowing how to query unique values in SQL is important for anyone working with data. Hence, mastering DISTINCT is, too.

While its syntax is nothing short of banal, knowing when and how to use DISTINCT is anything but. You’ll need some practice to get to that level. Practice and business level is what StrataScratch is all about: in our coding questions section; there are almost 150 interview questions focusing on DISTINCT.

More than enough to learn DISTINCT inside out!

SELECT UNIQUE in SQL: Mastering Data Retrieval with DISTINCT

What “Unique” Means in the Context of Database Records

Why Retrieving Unique Records is Crucial

1. Data Analysis

2. Data Integrity

3. Querying Efficiency

The DISTINCT Keyword

Examples of SELECT DISTINCT Queries

Using SELECT DISTINCT With One Column

Using SELECT DISTINCT With Aggregate Function

Using DISTINCT on Multiple Columns

Practical Applications of SELECT UNIQUE in SQL

Business Application: SELECT DISTINCT With One Column

Find all wine varieties which can be considered cheap based on the price

Business Application: SELECT DISTINCT With Aggregation

Abigail Breslin Nominations

Business Application: DISTINCT With Multiple Columns

Find the business names that have inspection scores of less than 50

Advanced Topics in Selecting Unique Values in SQL

Business Application: SELECT DISTINCT With Aggregate and GROUP BY

Number Of Units Per Nationality

Business Application: DISTINCT in HAVING

Single Facility Corporations

Business Application: DISTINCT With CASE Statement

User Growth Rate

Best Practices and Common Pitfalls

Best Practices in Using DISTINCT in SQL

Conclusion

Latest Posts:

How to Create a Bubble Plot with Python and Matplotlib?

What Is a Python Abstract Class? When and How to Use It

How We Oversaturated the Data Science Job Market