Utilizing DENSE_RANK for Data Deduplication in SQL
Categories
Data deduplication in SQL can be done in several ways. Using DENSE_RANK() is one. Today, we’ll talk about using this function to improve your data’s quality.
Working with data often requires deduplicating it. This is a regular step in ensuring data quality and accuracy.
The most obvious method that springs to mind when talking about data deduplication is using the DISTINCT keyword.
However, using SQL DENSE_RANK() is also one of the methods, and this is the topic of our article today.
What is Data Deduplication in SQL?
Data deduplication is a task where duplicate data is identified and removed from the dataset.
Deduplicating data has several benefits in data management. The reduced size of the saved data leads to storage optimization, which, in turn, leads to cost reduction.
Another benefit of deduplication is improved data quality, which provides a foundation for more accurate data analysis.
Importance of Ranking Functions in Identifying Duplicate Records
Except for its primary purpose, which is, obviously, ranking data, the SQL rank functions are also excellent for deduplicating data.
The ranking functions are window functions, meaning they perform an operation (ranking) on the window frame, a set of rows related to the current row.
Three ranking window functions in SQL are:
- ROW_NUMBER()
- RANK()
- DENSE_RANK()
Each of these functions ranks data in a distinct manner.
ROW_NUMBER()
The ROW_NUMBER() window function ranks the rows sequentially, even if there are rows with the same value. So, there are no same ranks repeated, and there is no gap between the ranks. If there are rows with the same value, their order will be determined randomly.
Here’s the table song_streams that I’ll use as an example.
Ranking data from the highest to the lowest streams using ROW_NUMBER() will return this.
The output shows all the rows are sequentially ranked. Even when rows have the same number of streams (duplicates), ROW_NUMBER() still ranks them sequentially, i.e., rows with the same value won’t get the same rank.
This feature is useful for removing duplicates from the dataset by identifying them in each partition. The logic is that if the data is not duplicated, then there will be only one row per partition if data is partitioned by every column.
RANK()
Another ranking window function is RANK(). Its characteristic is assigning the same rank to the rows with the same values. The next row not having the same value will get a new rank but with a gap. The gap is equal to the number of rows getting the same rank.
Using the same table as earlier, this is how RANK() would rank the values.
The three rows with the same values got the same rank. The next rank is not 8, but 10 since three rows are ranked 7.
This function, too, can be used to identify duplicates. However, it’s used less often due to its rank skipping, which can, in some cases, mess up deduplication.
DENSE_RANK()
The DENSE_RANK() function ranks the rows with the duplicates the same as RANK() – they all get the same rank. However, there’s no rank skipping when the following row has a different value and requires a new rank.
Here’s how it will look in our example table.
The ranking is the same as with RANK(), only there’s no rank skipping, so the last row is ranked 8, not 10.
This can be utilized in identifying duplicate values when there should be no gap between the ranks, which is the core topic of our article.
DENSE_RANK() in Deduplication
The general principle of deduplication using DENSE_RANK() is first identifying duplicate data by ranking data. This is done by partitioning the output using the PARTITION BY window functions clause. The partitioning should be done using the column or a pair of columns that will expose the data duplication.
You also need to specify the column in ORDER BY to determine by what values the data will be ranked and in what manner (ascendingly or descendingly). The column in ORDER BY should uniquely define the data, so the ID column is usually used.
The logic behind this is that if the rows have the same values in the PARTITION BY columns, then these rows will belong to one partition. If you rank the rows within a partition, then these rows would be ranked differently, as they all have different IDs. This means you would have partitions with only one row that is ranked as one for the non-duplicate data, and partitions with more than one row will be ranked sequentially; these are duplicates.
Then, you would just need to remove ranks higher than 1 from the partitions to deduplicate data.
Doing so requires knowing the DENSE_RANK() syntax.
Syntax and Basic Usage of DENSE_RANK()
The syntax of DENSE_RANK() is as follows.
SELECT …,
DENSE_RANK() OVER (PARTITION BY column_name ORDER BY column_name [ASC | DESC])
FROM table_name;
The parentheses of the DENSE_RANK() function remain empty.
The OVER() clause is mandatory and creates a window function.
In the ranking functions, including DENSE_RANK(), the ORDER BY clause is mandatory. It defines the order in which the data will be ranked, which can be ascending or descending. ORDER BY accepts one or more columns.
The PARTITION BY clause is optional. When used, data will be partitioned (divided into subsets) based on the column referenced in PARTITION BY. If you omit this clause, then the whole output is one partition. PARTITION BY, too, accepts more than one column.
Now, we will demonstrate how this works when identifying duplicates. The table below is named movie_streams.
The code here identifies duplicates using DENSE_RANK().
SELECT id,
movie_title,
streams,
DENSE_RANK() OVER (PARTITION BY movie_title, streams ORDER BY id) AS ranking
FROM movie_streams;
The table is partitioned by the movie title and streams. These columns determine uniqueness; any rows with the same movie title and the number of streams are duplicates. PARTITION BY will put such rows in the same partition.
After that, the data is ordered by ID, which is the column by which the data will be ranked descendingly.
Here’s the output.
Wherever there’s a rank higher than one, those are duplicate values. Now, the only thing you need to do is to remove them completely from the database using the DELETE statement or simply exclude them from the output using SELECT. In both cases, you’d need to use the WHERE clause, excluding all the ranks higher than 1.
Practical Applications of DENSE_RANK() for Deduplication in SQL
It’s now time to use several SQL interview questions from our platform and show how to use DENSE_RANK() in data deduplication in SQL.
Example #1: Common Friends Script
We can use this question by Google to create a simple example of using DENSE_RANK() to deduplicate data.
Interview Question Date: January 2024
You are analyzing a social network dataset at Google. Your task is to find mutual friends between two users, Karl and Hans. There is only one user named Karl and one named Hans in the dataset.
The output should contain 'user_id' and 'user_name' columns.
Link to the question: https://platform.stratascratch.com/coding/10365-common-friends-script
There’s a table named users.
user_id | user_name |
---|---|
1 | Karl |
2 | Hans |
3 | Emma |
4 | Emma |
5 | Mike |
We can use it to check for duplicate users and remove them, as we only need unique user names for our analysis.
This is done exactly as I showed you earlier. I partition data by the columns that will determine the uniqueness; in this case, it’s only the user_name column. Then, I rank the data by the user ID.
SELECT user_id,
user_name,
DENSE_RANK() OVER (PARTITION BY user_name ORDER BY user_id) AS ranking
FROM users;
The output shows users named Emma and Lucas have duplicates.
user_id | user_name | ranking |
---|---|---|
9 | Anna | 1 |
3 | Emma | 1 |
4 | Emma | 2 |
2 | Hans | 1 |
10 | John | 1 |
1 | Karl | 1 |
6 | Lucas | 1 |
8 | Lucas | 2 |
5 | Mike | 1 |
7 | Sarah | 1 |
To remove these duplicates, I can, for example, turn the above code into CTE and then exclude all the data that is not ranked as one. Also, I actually don’t need any other data, so I can only select user_name to appear in the output.
WITH data_ranking AS (
SELECT user_id,
user_name,
DENSE_RANK() OVER (PARTITION BY user_name ORDER BY user_id) AS ranking
FROM users
)
SELECT user_name
FROM data_ranking
WHERE ranking = 1;
Here it is, the list of users without duplicates.
user_name |
---|
Anna |
Emma |
Hans |
John |
Karl |
Lucas |
Mike |
Sarah |
Example #2: Workers With The Highest Salaries
In this example, I’ll use the question from the Amazon and DoorDash interviews.
Interview Question Date: July 2021
You have been asked to find the job titles of the highest-paid employees.
Your output should include the highest-paid title or multiple titles with the same salary.
Link to the question: https://platform.stratascratch.com/coding/10353-workers-with-the-highest-salaries
The question provides a table named worker.
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
I want to show a list of departments and unique salary amounts paid in the particular departments. In the code, I partition by deparments and salary. I order data by worker_id so that the data is ranked by this column ascendingly.
SELECT department,
salary,
RANK() OVER (PARTITION BY department, salary ORDER BY worker_id) AS ranking
FROM worker;
Here’s the ranking for each partition. You can see there are duplicates in the Account department and twice in the Admin department.
department | salary | ranking |
---|---|---|
Account | 75000 | 1 |
Account | 75000 | 2 |
Account | 200000 | 1 |
Admin | 80000 | 1 |
Admin | 90000 | 1 |
Admin | 90000 | 2 |
Admin | 500000 | 1 |
Admin | 500000 | 2 |
HR | 65000 | 1 |
HR | 85000 | 1 |
HR | 100000 | 1 |
HR | 300000 | 1 |
To remove the duplicates in one step, you can rewrite the code as a CTE, like in the previous example. However, I’ll rewrite it as a subquery just to bring in some diversity.
SELECT department,
salary
FROM (SELECT department,
salary,
RANK() OVER (PARTITION BY department, salary ORDER BY worker_id) AS ranking
FROM worker
) AS rank_sq
WHERE ranking = 1;
Here’s the output.
department | salary |
---|---|
Account | 75000 |
Account | 200000 |
Admin | 80000 |
Admin | 90000 |
Admin | 500000 |
HR | 65000 |
HR | 85000 |
HR | 100000 |
HR | 300000 |
Example #3: Find the Duplicate Records in the Dataset
The last example for deduplication in SQL practice is the interview question by Google and Amazon.
Find the duplicate records in the dataset. Output the worker title, affected_from date, and the number of times the records appear in the dataset.
Link to the question: https://platform.stratascratch.com/coding/9849-find-the-duplicate-records-in-the-dataset
Like the question asks, we’ll find duplicate values. However, we won’t output the number of times the duplicate records appear; we’ll output the actual duplicate values.
There’s the table title at our disposal.
worker_ref_id | worker_title | affected_from |
---|---|---|
1 | Manager | 2016-02-20 |
2 | Executive | 2016-06-11 |
8 | Executive | 2016-06-11 |
5 | Manager | 2016-06-11 |
4 | Asst. Manager | 2016-06-11 |
The approach is similar to previous examples. There’s a SELECT statement that uses DENSE_RANK() to find duplicate records. The data is partitioned by the columns worker_title and affected_from and ordered by the worker ID.
SELECT worker_ref_id,
worker_title,
affected_from,
DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
FROM title;
I’ll write the above query as a CTE, then reference it and output only the records with the rank above 1, i.e., duplicate values.
WITH ranked_titles AS (
SELECT worker_ref_id,
worker_title,
affected_from,
DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
FROM title
)
SELECT *
FROM ranked_titles
WHERE rank > 1;
Here’s the output showing three rows that are duplicates.
worker_ref_id | worker_title | affected_from | rank |
---|---|---|---|
7 | Executive | 2016-06-11 | 2 |
8 | Executive | 2016-06-11 | 3 |
6 | Lead | 2016-06-11 | 2 |
To spice things up, let’s remove these records from the database. This is what you will often do in your job, as the point of finding duplicates is to remove them and clean your data.
Here’s the example code that would do it. The CTE part is the same. The second SELECT statement is now a subquery in the WHERE clause of a DELETE statement.
WITH ranked_titles AS (
SELECT worker_ref_id,
worker_title,
affected_from,
DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
FROM title
)
DELETE FROM title
WHERE worker_ref_id IN (
SELECT worker_ref_id
FROM ranked_titles
WHERE rank > 1
);
Optimizing Performance for Large Datasets
When you’re deduplicating large datasets, it’s important to optimize the query performance. Not doing so can make your query significantly slow and make data deduplication more tedious than it should be.
1. Indexing
Indexing relevant columns is the most important technique for performance optimization. The columns that are frequently used in WHERE, JOIN, GROUP BY, and ORDER BY should be indexed.
Another tip is to consider using composite indexes in cases where your queries frequently involve multiple columns, which I would say will be the case.
If you imagine the table title we used earlier is large, we could’ve created a composite index before finding duplicates.
CREATE INDEX idx_worker_date ON title (worker_title, affected_from);
This will help the database execute the queries that partition the dataset by worker_title and affected_from.
2. Query Structuring
The way you structure your query can also improve (or degrade) its performance. Here are several optimization tips when deduplicating data in large datasets:
- Avoid SELECT *: Instead of selecting all columns from the tables, specify only the required columns in SELECT. That way, the query won’t retrieve unnecessary data.
- Use CTEs Moderately: CTEs are a great tool for improving your query’s readability. However, they can hinder the performance of your query if used excessively.
- Filter Data: Whenever possible, filter data using WHERE so your deduplication query is fed a smaller dataset.
3. Partitioning Tables
You already learned what partitioning means. It’s not only possible to do it temporarily in window functions, but you can also create range partitions. By doing so, you would split the table into several smaller tables, which could significantly reduce the running time of your query.
For example, if we imagine that the table sales contains dates from 2016, 2017, and 2018 in the column affected_from, we could partition into three subsets when creating the table, one for each year.
CREATE TABLE title (
worker_ref_id INT,
worker_title VARCHAR(255),
affected_from DATE
)
PARTITION BY RANGE (affected_from);
CREATE TABLE title_p2016 PARTITION OF title
FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE title_p2017 PARTITION OF title
FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
CREATE TABLE title_p2018 PARTITION OF title
FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
If we, for instance, need to find and delete duplicates for all the rows from 2016, we could rewrite the query from our last interview questions like this.
WITH ranked_titles AS (
SELECT worker_ref_id,
worker_title,
affected_from,
DENSE_RANK() OVER (PARTITION BY worker_title, affected_from ORDER BY worker_ref_id) AS rank
FROM title
WHERE affected_from >= '2016-01-01' AND affected_from < '2017-01-01'
)
DELETE
FROM title
WHERE worker_ref_id IN (
SELECT worker_ref_id
FROM ranked_titles
WHERE rank > 1
AND affected_from >= '2016-01-01' AND affected_from < '2017-01-01'
);
That way, the query will look through only one partition of the table, which will be much quicker than looking through the whole table.
4. Monitoring and Tuning
Whatever RDBS you use, it has tools for monitoring query performance and database health. Use them regularly to identify bottlenecks and then remove them. For instance, query analyzers can be utilized to get detailed information about slow-running queries and optimize them.
Common Mistakes When Using DENSE_RANK() for Deduplication
The three most common mistakes in deduplicating data with DENSE_RANK() are given below.
1. Misunderstanding the PARTITION BY Clause
One of the most common mistakes is not partitioning the data the way it allows you to deduplicate data. You have to be aware that ranking is performed for each partition separately.
Also, to deduplicate data, two or more columns very commonly do partitioning. They have to be carefully chosen. In general, the combination of columns should collectively serve to represent a unique record.
2. Misunderstanding the ORDER BY Clause
First, how DENSE_RANK() ranks the data depends on ORDER BY, so you must choose between ascending and descending ranking. The distinction is important in some cases.
Second, it’s also important to choose an appropriate column by which to order the data. The column should be deterministic; otherwise, you could end up with unexpected, inconsistent, and incorrect results. You must be sure that the column you order by is for sure unique; that’s why the ID column is typically the best choice, as you’ve seen in our examples.
3. Overlooking NULL Values
If you have NULLs in your data, DENSE_RANK() will treat them as identical. In other words, all the NULL rows could form one partition and all except the first row could be marked as duplicates, even though they’re not. I don’t have to tell you this leads to wrong data deduplication.
Best Practices to Ensure Accurate and Efficient Deduplication
The best practices are virtually a Venn diagram of optimization techniques and common mistakes.
1. Have clear criteria for partitioning – the columns used in PARTITION BY must represent the uniqueness of records.
2. Use indexes – they can improve deduplication query performance, especially with large datasets
3. Test thoroughly – Check if the query’s deduplication logic does what you want it to do before you apply it to the whole database.
Conclusion
One method for data deduplication in SQL is using window functions. Very often, the choice is DENSE_RANK().
Using it to rank data within partitions makes it easy to identify duplicates – all the rows with a ranking above 1 are duplicates.
While the principles of deduplicating data using DENSE_RANK() are generally simple, you still need to be careful about by which columns you partition and order data.
This requires some code-writing practice. There’s plenty of opportunity for that in more than 1,000 SQL interview questions on our platform. Many of those questions cover window functions and, specifically, DENSE_RANK(), so help yourself.