Counting Distinct Values in SQL: Tips and Examples

Counting Distinct Values in SQL


Using SQL COUNT(DISTINCT) is an easy way to count distinct values in SQL. However, it has its own quirks, which we’ll explain (with examples) in this article.

Anybody who has worked with data and anything close to reporting can confirm – you’ll be counting things more than if you were a warehouse worker.

“Give me the number of customers!” “How many months were we in positive this year?” “How many products of this-and-that did we sell?” Managers are simply obsessed with counting, and you’ll have to become, too.

(Half) joking aside, counting instances of data really is an essential part of reporting. Typically, it involves counting only unique values.

Get ready to learn how to count and select unique in SQL.

What is DISTINCT in SQL?

DISTINCT is an SQL keyword used for removing duplicates from the query output. In other words, whether the data appears once or several times in a dataset, it doesn’t matter – DISTINCT always returns it only once. In even-more-other words, SQL SELECT DISTINCT returns unique values.

When applied to one column, DISTINCT will return unique values from that column. If applied to multiple columns, DISTINCT returns a unique combination of values from those columns.

The Importance of Counting Distinct Values in Various Data Analysis Scenarios

Focusing on unique entries in datasets gives data analysis the possibility to understand data diversity, frequency, and distribution.

By doing so, you can understand patterns, anomalies, and trends that might be hidden away in redundant data.

Why Focus on Counting Distinct Values in Sql

1. Understanding Diversity

The diversity of a dataset is a key metric in many fields, which signals its richness and variety.  Some examples are:

  • Marketing – for understanding customer preferences’ diversity
  • Ecology – for assessing biodiversity
  • Human resources – for analyzing employees’ skill set diversity and developing training and development programs based on that

2. Frequency and Distribution Analysis

By counting distinct values, you can go beyond simple counts and understand the distribution of unique items within a dataset. For instance, aside from knowing that the company sold 50,000 products, you can see that these products are distributed among 1,000 distinct products, which tells you much more about product variety and customer choice.

3. Identifying Trends and Anomalies

Counting distinct values can reveal certain trends. For example, suppose you track the emergence of new distinct values, such as new products or customer segments, over time in your time-series dataset. In that case, you can discover changes in customer and market behavior.

4. Data Cleaning and Quality

Employing distinct values count is also common when you want to check data quality and clean the data. If your distinct count is higher or lower than expected, it can lead to further data exploration that will reveal duplicate data, missing data, or inconsistencies.

How to Use DISTINCT in SQL?

Using DISTINCT in SQL is really easy: simply place it after SELECT and before the column names.

Here’s a syntax.

SELECT DISTINCT column_1,
	 column_2,
 	 ...
FROM table;

Counting Distinct Values

Counting values is done using the COUNT() aggregate function. To count values in the specific column, simply write its name in the function’s parentheses.

SELECT COUNT(column_1) AS column_alias
FROM table;

Combined with DISTINCT, it will do what we were building to – counting distinct values.

The syntax prescribes writing DISTINCT in COUNT() before the columns that appear in the function.

SELECT COUNT(DISTINCT column_1,
			    column_2,
			    ...)
FROM table;

Practical Tips for Using DISTINCT and COUNT()

I’ll now give you several practical tips for using DISTINCT in COUNT().

Handling NULL Values

One important thing to remember is that using COUNT() with a specific column name excludes NULLs from the count. So, using it with DISTINCT means that NULLs won’t be counted as a separate category.

If you want to count NULL values as a distinct category, the easiest way to do it is to use COALESCE().

SELECT COUNT(DISTINCT COALESCE(column_name, 'NULL'))
FROM table;

This function allows you to replace the NULL values with a placeholder, 'NULL', as shown in the example above.

Performance Considerations

When dealing with large datasets, using COUNT(DISTINCT) can be computationally expensive. This is because the database engine needs to scan the entire dataset, sort it, and compare values to return unique values. Only after that can the count be performed.

All this can be additionally slowed down if the columns in COUNT(DISTINCT) are not indexed.

Here are some tips on how to improve the query performance.

Improving the Count Distinct Query Performance in SQL

  1. Indexing: Create indexes on a column named in COUNT(DISTINCT). While they improve performance, bear in mind that indexes require additional storage space, which can have a detrimental impact on write operations, such as INSERT, UPDATE, and DELETE.
  2. Optimize the Query Execution Plan: Use your database’s query optimizer and query hints.
  3. GROUP BY With COUNT(): One alternative to COUNT(DISTINCT) is a query with GROUP BY and COUNT(). Grouping the dataset by required columns and then counting values can sometimes be more efficient than COUNT(DISTINCT). This is especially the case if the data is already well-partitioned or indexed.
  4. Approximate Queries: If you’re dealing with a large dataset where an exact count is not necessary, use approximate algorithms (such as APPROX_COUNT_DISTINCT() in Apache Spark) to estimate the count of distinct values.
  5. Break Down the Query: Use subqueries or CTEs to break down your query into smaller parts. In addition, you can use partitioning to limit the scope of COUNT(DISTINCT).
  6. Parallel Processing: If you have the possibility, leverage parallel query execution of your database engine or use distributed databases, such as Apache Spark or Amazon Redshift.
  7. Data Aggregation and Preprocessing: If a distinct count is required frequently, you could consider aggregating data during ETL processes and performing count operations on that data.

Basic Real-World Examples

It’s now time to get into using SQL COUNT(DISTINCT) in practice. We’ll help ourselves with the interview questions from our platform.

Simple Distinct Values Count

The simplest possible example of counting distinct values is this interview question by Postmates. It asks you to count the customers who placed an order and calculate the average order amount.


Table: postmates_orders

Link to the question: https://platform.stratascratch.com/coding/2013-customer-average-orders

The question gives you the postmates_orders table. It’s a list of orders and their details.

Table: postmates_orders
idcustomer_idcourier_idseller_idorder_timestamp_utcamountcity_id
1102224792019-03-11 23:27:00155.7347
2104224752019-04-11 04:24:00216.644
3100239792019-03-11 21:17:00168.6947
4101205792019-03-11 02:34:00210.8443
5103218712019-04-11 00:15:00212.647

The solution to this question counts distinct customers. Why distinct? Because a customer can place more than one order. There’s no point in counting customers each time they order, as it will artificially inflate the total number of customers.

In addition, I use the AVG() function to calculate the average order amount.

SELECT COUNT(DISTINCT customer_id) AS number_of_customers,
       AVG(amount) AS average_order_amount
FROM postmates_orders;

The output shows there are, in total, five customers, and their average order amount is 139.224.

All required columns and the first 5 rows of the solution are shown

countavg
5139.22

Handling NULLs With COUNT(DISTINCT)

Another question, this time by Crunchbase, will teach you how to handle NULL values.


Table: crunchbase_acquisitions

Link to the question: https://platform.stratascratch.com/coding/10162-number-of-acquisitions

I’ll write a solution for a reworked problem. The code will count unique company country codes, including those companies with NULLs instead of country codes.

The table you have at your disposal is named crunchbase_acquisitions.

Table: crunchbase_acquisitions
company_permalinkcompany_namecompany_category_codecompany_country_codecompany_state_codecompany_regioncompany_cityacquirer_permalinkacquirer_nameacquirer_category_codeacquirer_country_codeacquirer_state_codeacquirer_regionacquirer_cityacquired_atacquired_monthacquired_quarteracquired_yearprice_amountprice_currency_codeid
/company/1form-com1Form.comotherAUSAustralia - Other/company/rea-groupREA GroupnewsAUSVictoriaVictoria2014-01-142014-012014-Q1201415000000USD13
/company/24-7-real-media24/7 Real MediaadvertisingUSANYNew YorkNew York/company/wppWPPpublic_relationsUSANYNew YorkNew York2008-05-172008-052008-Q22008649000000USD21
/company/280-north280 Northsoftwareunknown/company/motorola-mobilityMotorola MobilitymobileUSAUnited States - Other2010-07-012010-072010-Q3201020000000USD23
/company/280-north280 Northsoftwareunknown/company/motorola-solutionsMotorola SolutionsenterpriseUSAILChicagoSchaumburg2010-07-012010-072010-Q3201020000000USD24
/company/2nd-story-software-inc2nd Story Software, Inc.softwareUSAIACedar RapidsCedar Rapids/company/blucoraBlucorawebUSAWASeattleBellevue2012-01-092012-012012-Q12012287500000USD26

Here’s how to solve this problem. I use the DISTINCT clause in the COUNT() function as I did in the previous example. However, this would ignore NULL values, so I must embed the COALESCE() function in it. This is the function that goes through the column company_country_code. If the value in the column is non-NULL, the function returns the actual value from that row. If the value is NULL, the function replaces it with the string 'NULL'. That way, a distinct category named 'NULL'  is assigned to NULL values so it can be included in the distinct count.

SELECT COUNT(DISTINCT COALESCE(company_country_code, 'NULL')) AS number_of_countries
FROM crunchbase_acquisitions;

Based on country code counting, the output tells us there are 65 unique countries in the dataset.

All required columns and the first 5 rows of the solution are shown

number_of_countries
65

I can check if the code really counted the NULLs as a separate category by rewriting the code without COALESCE().

SELECT COUNT(DISTINCT company_country_code) AS number_of_countries
FROM crunchbase_acquisitions;

As you can see, there’s one country fewer compared to the previous output; this is a NULL category missing.

All required columns and the first 5 rows of the solution are shown

number_of_countries
64

Advanced Techniques

Now, let me talk about the more advanced usage of DISTINCT with COUNT().

GROUP BY With COUNT(DISTINCT)

The GROUP BY clause organizes query output into categories. It is typically used with aggregate functions, which allows you to show values aggregated by groups.

In the case of COUNT(DISTINCT), it means counting distinct values by group.

The general syntax is shown here.

SELECT column_1,
	 COUNT(DISTINCT column_2) AS column_alias
FROM table_name
GROUP BY column_1;

Example

Here’s an interview question by Spotify.


Table: listening_habits

Link to the question: https://platform.stratascratch.com/coding/10367-aggregate-listening-data

It wants you to calculate the total listening time per user and count the unique songs they’ve listened to.

You have to work with the listening_habits table to solve this problem.

Table: listening_habits
user_idsong_idlisten_duration
1015001240
10150020
1025001300
10250030
1015001240

To show these calculations by a user, you need to select the user_id column.

Next, use SUM() to calculate the total listening time. In addition, divide the result by 60 to get the minutes, and then round the result to the nearest minute using ROUND().

The number of unique songs by a user is calculated using COUNT(DISTINCT).

Finally, group the output by the user ID to get all these calculations for each user separately.

SELECT user_id, 
       ROUND(SUM (listen_duration) / 60.0) AS total_listening_time , 
       COUNT(DISTINCT song_id) AS number_of_songs
FROM listening_habits
GROUP BY user_id;


Here’s the output.

All required columns and the first 5 rows of the solution are shown

user_idtotal_listen_durationunique_song_count
10182
10252
10361
10462
10541

Window Functions for Distinct Counts Within Partitions

The window functions allow you to perform calculations on the window of rows, i.e., rows preceding and/or following the current row. They are famous for their ability to show aggregate data while not collapsing individual rows.

In SQL, there’s an optional clause PARTITION BY used with window functions. It splits the dataset into subsets based on one or more columns, i.e., it partitions it. Doing that with the dataset means that the window function calculations will be executed for each partition separately.

It would be nice if you could use COUNT(DISTINCT) as a window function with PARTITION BY to count distinct values by partition, wouldn’t it? Unfortunately, DISTINCT is not allowed in the COUNT() window function!

The workaround depends on the particular problem you need to solve and your dataset. However, it mainly boils down to utilizing subqueries or CTEs to simulate using COUNT(DISTINCT) as a window function.

One option is to use a window function – namely DENSE_RANK() – for distinct counts within partitions. The general approach is to rank data within the partition by the column for which you want to know the distinct count. The last row in the partition (if ranked ascendingly) will contain the highest rank, which is the same as the unique count of the data in the partition.

The count will be unique because DENSE_RANK() assigns the same rank to duplicate values, so duplicates won’t artificially increase the rank (or count).

We’ll show how this works in an example.

Example
Take a look at this interview question by ESPN.


Table: olympics_athletes_events

Link to the question: https://platform.stratascratch.com/coding/9958-median-age-of-gold-medal-winners

There’s a table named olympics_athletes_events.

Table: olympics_athletes_events
idnamesexageheightweightteamnocgamesyearseasoncitysporteventmedal
3520Guillermo J. AmparanMMexicoMEX1924 Summer1924SummerParisAthleticsAthletics Men's 800 metres
35394Henry John FinchettMGreat BritainGBR1924 Summer1924SummerParisGymnasticsGymnastics Men's Rings
21918Georg Frederik Ahrensborg ClausenM28DenmarkDEN1924 Summer1924SummerParisCyclingCycling Men's Road Race Individual
110345Marinus Cornelis Dick SigmondM26NetherlandsNED1924 Summer1924SummerParisFootballFootball Men's Football
54193Thodore Tho JeitzM26LuxembourgLUX1924 Summer1924SummerParisGymnasticsGymnastics Men's Individual All-Around

I’ll modify the question and write a query that shows an athlete’s ID, name, and team while also showing the total number of athletes who were team members, regardless of when they competed.

The first CTE ranks athletes. It does that with DENSE_RANK() window function ranking by ID and data being partitioned by the team, meaning it will rank for each team separately. With this CTE, you get the cumulative count of athletes in each team, where the last row within the partition contains the count of athletes in a team. The count is unique because DENSE_RANK() assigns the same rank to all athletes with the same ID, i.e., the athletes that appear several times in the dataset.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
)

Here’s the output snapshot of that first CTE.

All required columns and the first 5 rows of the solution are shown

idnameteamathlete_rank
3325Santiago Amat CansinoAmolgavar1
17489Facundo CallioniArgentina1
27126Jorge del MazoArgentina2
66754Roberto LarrazArgentina3
100656Charles RiddyArgonaut Rowing Club1

Take the Argentina team, for example. There are three unique athletes on the team, and the cumulative count (or rank) in the athlete_rank column is also three, and rightly so.

However, the list of athletes is not unique, which you can see in the example of Pierre Tolar, who represented Luxembourg two times and appears two times in the output.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
)

SELECT *
FROM ranked_athletes
WHERE team = 'Luxembourg';


All required columns and the first 5 rows of the solution are shown

idnameteamathlete_rank
14719Rene BrasseurLuxembourg1
54193Thodore Tho JeitzLuxembourg2
120786Pierre TolarLuxembourg3
120786Pierre TolarLuxembourg3
129575Jean Welter Sr.Luxembourg4

The data has to be deduplicated. I’ll solve that in one of the following steps.

The second CTE takes the first one to extract the max rank – the number of unique athletes per team – using the MAX() aggregate function.

The third CTE deduplicates the first CTE, so now I have a dataset containing only the unique combinations of the athlete’s ID, name, and team.

In the final SELECT, I join the second and third CTEs, i.e., the unique count with the list of unique athletes.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
),

max_rank_per_team AS (
    SELECT team, 
           MAX(athlete_rank) AS team_athlete_count
    FROM ranked_athletes
    GROUP BY team
),

distinct_athletes AS (
    SELECT DISTINCT id, 
           name, 
           team
    FROM ranked_athletes
)

SELECT da.name, 
       da.team, 
       mrt.team_athlete_count
FROM distinct_athletes da
JOIN max_rank_per_team mrt
ON da.team = mrt.team
ORDER BY da.team;

Here’s the output snapshot. I’ve achieved what I wanted, which was to show the unique athlete’s ID, name, and team, along with the count of unique athletes per team.

All required columns and the first 5 rows of the solution are shown

nameteamteam_athlete_count
Santiago Amat CansinoAmolgavar1
Roberto LarrazArgentina3
Jorge del MazoArgentina3
Facundo CallioniArgentina3
Charles RiddyArgonaut Rowing Club1

If you want to check whether it really shows only unique athletes, check it on the Luxembourg and Pierre Tolar example using this query.

WITH ranked_athletes AS (
    SELECT id, 
           name, 
           team,
           DENSE_RANK() OVER (PARTITION BY team ORDER BY id) AS athlete_rank
    FROM olympics_athletes_events
),

max_rank_per_team AS (
    SELECT team, 
           MAX(athlete_rank) AS team_athlete_count
    FROM ranked_athletes
    GROUP BY team
),

distinct_athletes AS (
    SELECT DISTINCT id, 
           name, 
           team
    FROM ranked_athletes
)

SELECT da.id,
       da.name, 
       da.team, 
       mrt.team_athlete_count
FROM distinct_athletes da
JOIN max_rank_per_team mrt
ON da.team = mrt.team
WHERE da.team = 'Luxembourg'
ORDER BY da.team;


The output shows Pierre Tolar only once, confirming this is a list of unique athletes.

All required columns and the first 5 rows of the solution are shown

idnameteamteam_athlete_count
14719Rene BrasseurLuxembourg4
54193Thodore Tho JeitzLuxembourg4
120786Pierre TolarLuxembourg4
129575Jean Welter Sr.Luxembourg4

Now, the solution I showed you above may be a little too elaborate. I can write a shorter query that achieves the same result without DENSE_RANK().

I immediately go to data deduplication in the first CTE in the query below.

The second CTE counts distinct athletes per team by using COUNT(DISTINCT) and GROUP BY, and that’s it – I now only need to JOIN these two CTEs in SELECT and specify the columns I want to output.

WITH distinct_athletes_teams AS (
    SELECT DISTINCT id, 
           name, 
           team
    FROM olympics_athletes_events
),

team_athlete_counts AS (
    SELECT team, 
           COUNT(DISTINCT id) AS team_athlete_count
    FROM distinct_athletes_teams
    GROUP BY team
)

SELECT dat.name, 
       dat.team, 
       tac.team_athlete_count
FROM distinct_athletes_teams dat
JOIN team_athlete_counts tac
ON dat.team = tac.team
ORDER BY team;

The output stays the same as with the first code.

All required columns and the first 5 rows of the solution are shown

nameteamteam_athlete_count
Santiago Amat CansinoAmolgavar1
Jorge del MazoArgentina3
Facundo CallioniArgentina3
Roberto LarrazArgentina3
Charles RiddyArgonaut Rowing Club1

Common Mistakes and How to Avoid Them

There are three common mistakes made when using COUNT() with DISTINCT.

Common Mistakes in Using Count With Distinct in SQL

1. Forgetting to Handle NULLs: First, you should know your data and whether it includes NULLs. Second, you should know whether your task requires including the NULL values in the count. If it does, you’ll have to include them using techniques such as COALESCE().

2. Misunderstanding DISTINCT With Multiple Columns: It’s a common misconception, especially with beginners, that DISTINCT applied on multiple columns returns unique values for each column separately. No, it doesn’t! It returns unique value combinations across all columns. Used like that in COUNT(), you get the count of unique value combinations.

3. Neglecting Indexing: Using COUNT() with DISTINCT can be computationally expensive, especially on large datasets. This is because it first has to scan the whole dataset to find all the values in the specified columns, then remove duplicates and, finally, count the unique records.

Conclusion

I covered in this article several topics relevant to counting distinct values in SQL using COUNT(DISTINCT). The most important is knowing what DISTINCT as a clause does and the syntax of using it with COUNT().

However, I also demonstrated how to handle NULL values and gave you a couple of more complex examples of using COUNT(DISTINCT).

Along with some tips about its performance and avoiding common mistakes when counting distinct values, you’re now ready to build on that knowledge by writing the code on your own. Feel free to experiment with the above questions and solve them in other SQL dialects.

There are still many coding problems requiring COUNT(DISTINCT) knowledge in our coding interview questions section. Use them to polish your knowledge of counting distinct values in SQL.

Counting Distinct Values in SQL


Become a data expert. Subscribe to our newsletter.