Counting Distinct Values in SQL: Tips and Examples
Categories
Using SQL COUNT(DISTINCT) is an easy way to count distinct values in SQL. However, it has its own quirks, which we’ll explain (with examples) in this article.
Anybody who has worked with data and anything close to reporting can confirm – you’ll be counting things more than if you were a warehouse worker.
“Give me the number of customers!” “How many months were we in positive this year?” “How many products of this-and-that did we sell?” Managers are simply obsessed with counting, and you’ll have to become, too.
(Half) joking aside, counting instances of data really is an essential part of reporting. Typically, it involves counting only unique values.
Get ready to learn how to count and select unique in SQL.
What is DISTINCT in SQL?
DISTINCT is an SQL keyword used for removing duplicates from the query output. In other words, whether the data appears once or several times in a dataset, it doesn’t matter – DISTINCT always returns it only once. In even-more-other words, SQL SELECT DISTINCT returns unique values.
When applied to one column, DISTINCT will return unique values from that column. If applied to multiple columns, DISTINCT returns a unique combination of values from those columns.
The Importance of Counting Distinct Values in Various Data Analysis Scenarios
Focusing on unique entries in datasets gives data analysis the possibility to understand data diversity, frequency, and distribution.
By doing so, you can understand patterns, anomalies, and trends that might be hidden away in redundant data.
1. Understanding Diversity
The diversity of a dataset is a key metric in many fields, which signals its richness and variety. Some examples are:
- Marketing – for understanding customer preferences’ diversity
- Ecology – for assessing biodiversity
- Human resources – for analyzing employees’ skill set diversity and developing training and development programs based on that
2. Frequency and Distribution Analysis
By counting distinct values, you can go beyond simple counts and understand the distribution of unique items within a dataset. For instance, aside from knowing that the company sold 50,000 products, you can see that these products are distributed among 1,000 distinct products, which tells you much more about product variety and customer choice.
3. Identifying Trends and Anomalies
Counting distinct values can reveal certain trends. For example, suppose you track the emergence of new distinct values, such as new products or customer segments, over time in your time-series dataset. In that case, you can discover changes in customer and market behavior.
4. Data Cleaning and Quality
Employing distinct values count is also common when you want to check data quality and clean the data. If your distinct count is higher or lower than expected, it can lead to further data exploration that will reveal duplicate data, missing data, or inconsistencies.
How to Use DISTINCT in SQL?
Using DISTINCT in SQL is really easy: simply place it after SELECT and before the column names.
Here’s a syntax.
SELECT DISTINCT column_1,
column_2,
...
FROM table;
Counting Distinct Values
Counting values is done using the COUNT() aggregate function. To count values in the specific column, simply write its name in the function’s parentheses.
SELECT COUNT(column_1) AS column_alias
FROM table;
Combined with DISTINCT, it will do what we were building to – counting distinct values.
The syntax prescribes writing DISTINCT in COUNT() before the columns that appear in the function.
SELECT COUNT(DISTINCT column_1,
column_2,
...)
FROM table;
Practical Tips for Using DISTINCT and COUNT()
I’ll now give you several practical tips for using DISTINCT in COUNT().
Handling NULL Values
One important thing to remember is that using COUNT() with a specific column name excludes NULLs from the count. So, using it with DISTINCT means that NULLs won’t be counted as a separate category.
If you want to count NULL values as a distinct category, the easiest way to do it is to use COALESCE().
SELECT COUNT(DISTINCT COALESCE(column_name, 'NULL'))
FROM table;
This function allows you to replace the NULL values with a placeholder, 'NULL', as shown in the example above.
Performance Considerations
When dealing with large datasets, using COUNT(DISTINCT) can be computationally expensive. This is because the database engine needs to scan the entire dataset, sort it, and compare values to return unique values. Only after that can the count be performed.
All this can be additionally slowed down if the columns in COUNT(DISTINCT) are not indexed.
Here are some tips on how to improve the query performance.
- Indexing: Create indexes on a column named in COUNT(DISTINCT). While they improve performance, bear in mind that indexes require additional storage space, which can have a detrimental impact on write operations, such as INSERT, UPDATE, and DELETE.
- Optimize the Query Execution Plan: Use your database’s query optimizer and query hints.
- GROUP BY With COUNT(): One alternative to COUNT(DISTINCT) is a query with GROUP BY and COUNT(). Grouping the dataset by required columns and then counting values can sometimes be more efficient than COUNT(DISTINCT). This is especially the case if the data is already well-partitioned or indexed.
- Approximate Queries: If you’re dealing with a large dataset where an exact count is not necessary, use approximate algorithms (such as APPROX_COUNT_DISTINCT() in Apache Spark) to estimate the count of distinct values.
- Break Down the Query: Use subqueries or CTEs to break down your query into smaller parts. In addition, you can use partitioning to limit the scope of COUNT(DISTINCT).
- Parallel Processing: If you have the possibility, leverage parallel query execution of your database engine or use distributed databases, such as Apache Spark or Amazon Redshift.
- Data Aggregation and Preprocessing: If a distinct count is required frequently, you could consider aggregating data during ETL processes and performing count operations on that data.
Basic Real-World Examples
It’s now time to get into using SQL COUNT(DISTINCT) in practice. We’ll help ourselves with the interview questions from our platform.
Simple Distinct Values Count
The simplest possible example of counting distinct values is this interview question by Postmates. It asks you to count the customers who placed an order and calculate the average order amount.
Interview Question Date: February 2021
How many customers placed an order and what is the average order amount?
Link to the question: https://platform.stratascratch.com/coding/2013-customer-average-orders
The question gives you the postmates_orders table. It’s a list of orders and their details.
id | customer_id | courier_id | seller_id | order_timestamp_utc | amount | city_id |
---|---|---|---|---|---|---|
1 | 102 | 224 | 79 | 2019-03-11 23:27:00 | 155.73 | 47 |
2 | 104 | 224 | 75 | 2019-04-11 04:24:00 | 216.6 | 44 |
3 | 100 | 239 | 79 | 2019-03-11 21:17:00 | 168.69 | 47 |
4 | 101 | 205 | 79 | 2019-03-11 02:34:00 | 210.84 | 43 |
5 | 103 | 218 | 71 | 2019-04-11 00:15:00 | 212.6 | 47 |
The solution to this question counts distinct customers. Why distinct? Because a customer can place more than one order. There’s no point in counting customers each time they order, as it will artificially inflate the total number of customers.
In addition, I use the AVG() function to calculate the average order amount.
SELECT COUNT(DISTINCT customer_id) AS number_of_customers,
AVG(amount) AS average_order_amount
FROM postmates_orders;
The output shows there are, in total, five customers, and their average order amount is 139.224.
count | avg |
---|---|
5 | 139.22 |
Handling NULLs With COUNT(DISTINCT)
Another question, this time by Crunchbase, will teach you how to handle NULL values.
Find the number of acquisitions that occurred in each quarter of each year. Output the acquired quarter in YYYY-Qq format along with the number of acquisitions and order results by the quarters with the highest number of acquisitions first.
Link to the question: https://platform.stratascratch.com/coding/10162-number-of-acquisitions
I’ll write a solution for a reworked problem. The code will count unique company country codes, including those companies with NULLs instead of country codes.
The table you have at your disposal is named crunchbase_acquisitions.
company_permalink | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | acquirer_permalink | acquirer_name | acquirer_category_code | acquirer_country_code | acquirer_state_code | acquirer_region | acquirer_city | acquired_at | acquired_month | acquired_quarter | acquired_year | price_amount | price_currency_code | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
/company/1form-com | 1Form.com | other | AUS | Australia - Other | /company/rea-group | REA Group | news | AUS | Victoria | Victoria | 2014-01-14 | 2014-01 | 2014-Q1 | 2014 | 15000000 | USD | 13 | |||
/company/24-7-real-media | 24/7 Real Media | advertising | USA | NY | New York | New York | /company/wpp | WPP | public_relations | USA | NY | New York | New York | 2008-05-17 | 2008-05 | 2008-Q2 | 2008 | 649000000 | USD | 21 |
/company/280-north | 280 North | software | unknown | /company/motorola-mobility | Motorola Mobility | mobile | USA | United States - Other | 2010-07-01 | 2010-07 | 2010-Q3 | 2010 | 20000000 | USD | 23 | |||||
/company/280-north | 280 North | software | unknown | /company/motorola-solutions | Motorola Solutions | enterprise | USA | IL | Chicago | Schaumburg | 2010-07-01 | 2010-07 | 2010-Q3 | 2010 | 20000000 | USD | 24 | |||
/company/2nd-story-software-inc | 2nd Story Software, Inc. | software | USA | IA | Cedar Rapids | Cedar Rapids | /company/blucora | Blucora | web | USA | WA | Seattle | Bellevue | 2012-01-09 | 2012-01 | 2012-Q1 | 2012 | 287500000 | USD | 26 |
Here’s how to solve this problem. I use the DISTINCT clause in the COUNT() function as I did in the previous example. However, this would ignore NULL values, so I must embed the COALESCE() function in it. This is the function that goes through the column company_country_code. If the value in the column is non-NULL, the function returns the actual value from that row. If the value is NULL, the function replaces it with the string 'NULL'. That way, a distinct category named 'NULL' is assigned to NULL values so it can be included in the distinct count.
SELECT COUNT(DISTINCT COALESCE(company_country_code, 'NULL')) AS number_of_countries
FROM crunchbase_acquisitions;
Based on country code counting, the output tells us there are 65 unique countries in the dataset.
number_of_countries |
---|
65 |
I can check if the code really counted the NULLs as a separate category by rewriting the code without COALESCE().
SELECT COUNT(DISTINCT company_country_code) AS number_of_countries
FROM crunchbase_acquisitions;
As you can see, there’s one country fewer compared to the previous output; this is a NULL category missing.
number_of_countries |
---|
64 |
Advanced Techniques
Now, let me talk about the more advanced usage of DISTINCT with COUNT().
GROUP BY With COUNT(DISTINCT)
The GROUP BY clause organizes query output into categories. It is typically used with aggregate functions, which allows you to show values aggregated by groups.
In the case of COUNT(DISTINCT), it means counting distinct values by group.
The general syntax is shown here.
SELECT column_1,
COUNT(DISTINCT column_2) AS column_alias
FROM table_name
GROUP BY column_1;
Example
Here’s an interview question by Spotify.
Interview Question Date: January 2024
You're tasked with analyzing a Spotify-like dataset that captures user listening habits. For each user, calculate the total listening time and the count of unique songs they've listened to. In the database duration values are displayed in seconds. Round the total listening duration to the nearest whole minute.
The output should contain three columns: 'user_id', 'total_listen_duration', and 'unique_song_count'.
Link to the question: https://platform.stratascratch.com/coding/10367-aggregate-listening-data
It wants you to calculate the total listening time per user and count the unique songs they’ve listened to.
You have to work with the listening_habits table to solve this problem.
user_id | song_id | listen_duration |
---|---|---|
101 | 5001 | 240 |
101 | 5002 | 0 |
102 | 5001 | 300 |
102 | 5003 | 0 |
101 | 5001 | 240 |
To show these calculations by a user, you need to select the user_id column.
Next, use SUM() to calculate the total listening time. In addition, divide the result by 60 to get the minutes, and then round the result to the nearest minute using ROUND().
The number of unique songs by a user is calculated using COUNT(DISTINCT).
Finally, group the output by the user ID to get all these calculations for each user separately.
SELECT user_id,
ROUND(SUM (listen_duration) / 60.0) AS total_listening_time ,
COUNT(DISTINCT song_id) AS number_of_songs
FROM listening_habits
GROUP BY user_id;
Here’s the output.
user_id | total_listen_duration | unique_song_count |
---|---|---|
101 | 8 | 2 |
102 | 5 | 2 |
103 | 6 | 1 |
104 | 6 | 2 |
105 | 4 | 1 |
Window Functions for Distinct Counts Within Partitions
The window functions allow you to perform calculations on the window of rows, i.e., rows preceding and/or following the current row. They are famous for their ability to show aggregate data while not collapsing individual rows.
In SQL, there’s an optional clause PARTITION BY used with window functions. It splits the dataset into subsets based on one or more columns, i.e., it partitions it. Doing that with the dataset means that the window function calculations will be executed for each partition separately.
It would be nice if you could use COUNT(DISTINCT) as a window function with PARTITION BY to count distinct values by partition, wouldn’t it? Unfortunately, DISTINCT is not allowed in the COUNT() window function!
The workaround depends on the particular problem you need to solve and your dataset. However, it mainly boils down to utilizing subqueries or CTEs to simulate using COUNT(DISTINCT) as a window function.
One option is to use a window function – namely DENSE_RANK() – for distinct counts within partitions. The general approach is to rank data within the partition by the column for which you want to know the distinct count. The last row in the partition (if ranked ascendingly) will contain the highest rank, which is the same as the unique count of the data in the partition.
The count will be unique because DENSE_RANK() assigns the same rank to duplicate values, so duplicates won’t artificially increase the rank (or count).
We’ll show how this works in an example.
Example
Take a look at this interview question by ESPN.
Find the median age of gold medal winners across all Olympics.
Link to the question: https://platform.stratascratch.com/coding/9958-median-age-of-gold-medal-winners
There’s a table named olympics_athletes_events.
id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3520 | Guillermo J. Amparan | M | Mexico | MEX | 1924 Summer | 1924 | Summer | Paris | Athletics | Athletics Men's 800 metres | ||||
35394 | Henry John Finchett | M | Great Britain | GBR | 1924 Summer | 1924 | Summer | Paris | Gymnastics | Gymnastics Men's Rings | ||||
21918 | Georg Frederik Ahrensborg Clausen | M | 28 | Denmark | DEN | 1924 Summer | 1924 | Summer | Paris | Cycling | Cycling Men's Road Race Individual | |||
110345 | Marinus Cornelis Dick Sigmond | M | 26 | Netherlands | NED | 1924 Summer | 1924 | Summer | Paris | Football | Football Men's Football | |||
54193 | Thodore Tho Jeitz | M | 26 | Luxembourg | LUX | 1924 Summer | 1924 | Summer | Paris | Gymnastics | Gymnastics Men's Individual All-Around |
I’ll modify the question and write a query that shows an athlete’s ID, name, and team while also showing the total number of athletes who were team members, regardless of when they competed.
The first CTE ranks athletes. It does that with DENSE_RANK() window function ranking by ID and data being partitioned by the team, meaning it will rank for each team separately. With this CTE, you get the cumulative count of athletes in each team, where the last row within the partition contains the count of athletes in a team. The count is unique because DENSE_RANK() assigns the same rank to all athletes with the same ID, i.e., the athletes that appear several times in the dataset.
WITH ranked_athletes AS (
SELECT id,
name,
team,
DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
FROM olympics_athletes_events
)
Here’s the output snapshot of that first CTE.
id | name | team | athlete_rank |
---|---|---|---|
3325 | Santiago Amat Cansino | Amolgavar | 1 |
17489 | Facundo Callioni | Argentina | 1 |
27126 | Jorge del Mazo | Argentina | 2 |
66754 | Roberto Larraz | Argentina | 3 |
100656 | Charles Riddy | Argonaut Rowing Club | 1 |
Take the Argentina team, for example. There are three unique athletes on the team, and the cumulative count (or rank) in the athlete_rank column is also three, and rightly so.
However, the list of athletes is not unique, which you can see in the example of Pierre Tolar, who represented Luxembourg two times and appears two times in the output.
WITH ranked_athletes AS (
SELECT id,
name,
team,
DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
FROM olympics_athletes_events
)
SELECT *
FROM ranked_athletes
WHERE team = 'Luxembourg';
id | name | team | athlete_rank |
---|---|---|---|
14719 | Rene Brasseur | Luxembourg | 1 |
54193 | Thodore Tho Jeitz | Luxembourg | 2 |
120786 | Pierre Tolar | Luxembourg | 3 |
120786 | Pierre Tolar | Luxembourg | 3 |
129575 | Jean Welter Sr. | Luxembourg | 4 |
The data has to be deduplicated. I’ll solve that in one of the following steps.
The second CTE takes the first one to extract the max rank – the number of unique athletes per team – using the MAX() aggregate function.
The third CTE deduplicates the first CTE, so now I have a dataset containing only the unique combinations of the athlete’s ID, name, and team.
In the final SELECT, I join the second and third CTEs, i.e., the unique count with the list of unique athletes.
WITH ranked_athletes AS (
SELECT id,
name,
team,
DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
FROM olympics_athletes_events
),
max_rank_per_team AS (
SELECT team,
MAX(athlete_rank) AS team_athlete_count
FROM ranked_athletes
GROUP BY team
),
distinct_athletes AS (
SELECT DISTINCT id,
name,
team
FROM ranked_athletes
)
SELECT da.name,
da.team,
mrt.team_athlete_count
FROM distinct_athletes da
JOIN max_rank_per_team mrt
ON da.team = mrt.team
ORDER BY da.team;
Here’s the output snapshot. I’ve achieved what I wanted, which was to show the unique athlete’s ID, name, and team, along with the count of unique athletes per team.
name | team | team_athlete_count |
---|---|---|
Santiago Amat Cansino | Amolgavar | 1 |
Roberto Larraz | Argentina | 3 |
Jorge del Mazo | Argentina | 3 |
Facundo Callioni | Argentina | 3 |
Charles Riddy | Argonaut Rowing Club | 1 |
If you want to check whether it really shows only unique athletes, check it on the Luxembourg and Pierre Tolar example using this query.
WITH ranked_athletes AS (
SELECT id,
name,
team,
DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
FROM olympics_athletes_events
),
max_rank_per_team AS (
SELECT team,
MAX(athlete_rank) AS team_athlete_count
FROM ranked_athletes
GROUP BY team
),
distinct_athletes AS (
SELECT DISTINCT id,
name,
team
FROM ranked_athletes
)
SELECT da.id,
da.name,
da.team,
mrt.team_athlete_count
FROM distinct_athletes da
JOIN max_rank_per_team mrt
ON da.team = mrt.team
WHERE da.team = 'Luxembourg'
ORDER BY da.team;
The output shows Pierre Tolar only once, confirming this is a list of unique athletes.
id | name | team | team_athlete_count |
---|---|---|---|
14719 | Rene Brasseur | Luxembourg | 4 |
54193 | Thodore Tho Jeitz | Luxembourg | 4 |
120786 | Pierre Tolar | Luxembourg | 4 |
129575 | Jean Welter Sr. | Luxembourg | 4 |
Now, the solution I showed you above may be a little too elaborate. I can write a shorter query that achieves the same result without DENSE_RANK().
I immediately go to data deduplication in the first CTE in the query below.
The second CTE counts distinct athletes per team by using COUNT(DISTINCT) and GROUP BY, and that’s it – I now only need to JOIN these two CTEs in SELECT and specify the columns I want to output.
WITH distinct_athletes_teams AS (
SELECT DISTINCT id,
name,
team
FROM olympics_athletes_events
),
team_athlete_counts AS (
SELECT team,
COUNT(DISTINCT id) AS team_athlete_count
FROM distinct_athletes_teams
GROUP BY team
)
SELECT dat.name,
dat.team,
tac.team_athlete_count
FROM distinct_athletes_teams dat
JOIN team_athlete_counts tac
ON dat.team = tac.team
ORDER BY team;
The output stays the same as with the first code.
name | team | team_athlete_count |
---|---|---|
Santiago Amat Cansino | Amolgavar | 1 |
Jorge del Mazo | Argentina | 3 |
Facundo Callioni | Argentina | 3 |
Roberto Larraz | Argentina | 3 |
Charles Riddy | Argonaut Rowing Club | 1 |
Common Mistakes and How to Avoid Them
There are three common mistakes made when using COUNT() with DISTINCT.
1. Forgetting to Handle NULLs: First, you should know your data and whether it includes NULLs. Second, you should know whether your task requires including the NULL values in the count. If it does, you’ll have to include them using techniques such as COALESCE().
2. Misunderstanding DISTINCT With Multiple Columns: It’s a common misconception, especially with beginners, that DISTINCT applied on multiple columns returns unique values for each column separately. No, it doesn’t! It returns unique value combinations across all columns. Used like that in COUNT(), you get the count of unique value combinations.
3. Neglecting Indexing: Using COUNT() with DISTINCT can be computationally expensive, especially on large datasets. This is because it first has to scan the whole dataset to find all the values in the specified columns, then remove duplicates and, finally, count the unique records.
Conclusion
I covered in this article several topics relevant to counting distinct values in SQL using COUNT(DISTINCT). The most important is knowing what DISTINCT as a clause does and the syntax of using it with COUNT().
However, I also demonstrated how to handle NULL values and gave you a couple of more complex examples of using COUNT(DISTINCT).
Along with some tips about its performance and avoiding common mistakes when counting distinct values, you’re now ready to build on that knowledge by writing the code on your own. Feel free to experiment with the above questions and solve them in other SQL dialects.
There are still many coding problems requiring COUNT(DISTINCT) knowledge in our coding interview questions section. Use them to polish your knowledge of counting distinct values in SQL.