SQL UNION vs UNION ALL: Differences You Need to Know
In this article, we’ll discuss differences and similarities between UNION vs UNION ALL, possible use cases, and compare the performance of two set operators.
There are various ways to combine data from different sources in SQL. JOINs combine data horizontally by increasing the number of columns.
Set operators like UNION and UNION ALL allow you to combine datasets vertically. They don’t change the number of columns but stack records from multiple datasets in one table.
In this article, we will explain the differences between two set operators - UNION vs UNION ALL. We’ll go over their syntax and show how to best use these SQL features. Finally, we’ll walk you through answers to actual interview questions that involve UNION and UNION ALL.
We will also refresh your memory on two set operators. The set operators section in our SQL Cheat Sheet is quite informative as well.
What is UNION?
UNION is a set operator. It is used to vertically combine multiple datasets by stacking them on top of one another.
One important feature of UNION is that it removes duplicate rows from the combined data. If you want to ensure that combined datasets don't have duplicate records, use UNION.
UNION removes duplicate rows, not duplicate values. If a row has 5 columns, all 5 values must be the same for UNION to remove that row.
What is UNION ALL?
It is a set operator that combines two datasets by stacking them on top of one another without checking if each record is unique or not.
UNION ALL is useful when:
- You’re sure that datasets you are combining don’t have duplicate rows
- Keeping duplicate records is necessary for the task at hand.
As a rule of thumb, UNION ALL is useful for creating copies of a certain dataset.
Later in this “UNION vs UNION ALL” guide, we’ll walk you through two answers to interview questions. One of them asks you to return two copies of a certain dataset, so it’s a good question to demonstrate the usefulness of UNION ALL.
UNION vs UNION ALL in SQL - Differences and Similarities
The main difference between UNION vs UNION ALL is their approach to duplicate rows.
After UNION combines datasets, it also checks for duplicate records and removes them. UNION ALL only combines them and stops there.
Because UNION takes the extra step of removing duplicate values, generally, it is considered slower than UNION ALL, but that’s not always the case. We’ll compare their performance in a later section.
Finally, the UNION ALL set operator typically produces larger datasets because it doesn’t remove duplicate records.
Both UNION and UNION ALL combine datasets vertically by stacking datasets on top of one another. Unlike JOINs, merging two datasets doesn’t increase the number of columns - but the number of rows.
UNION and UNION ALL operations also follow the same rules - for datasets to be combined, they must have the same type, number, and order of columns.
UNION and UNION ALL also have almost identical syntax.
SQL UNION vs UNION ALL: Syntax
Let’s go over the syntax rules for UNION and UNION ALL.
Minimum number of datasets is two, with a set operator (UNION or UNION ALL) placed between them.
Note that both datasets must SELECT the same number, type, and order of columns, or SQL will throw an error.
Set operators to combine identical datasets
In the previous example, we SELECT data from two different tables. Datasets can be identical as well.
In this example, we combine two identical SELECT statements.
SELECT * FROM sales UNION ALL SELECT * FROM sales;
As a result, we’ll have two copies of the same data from one table.
Set operators to combine two complex queries
Datasets you combine can be as simple as a pair of SELECT/FROM statements or very complex. It’s okay to filter, group, or aggregate data from the source table, as long as two datasets have the right number, type, and order of columns.
Let’s look at an example where we combine two filtered datasets with a WHERE clause.
SELECT * FROM domestic_sales WHERE total_revenue > 100 UNION ALL SELECT * FROM international_sales WHERE total_revenue > 300;
Two datasets don’t need to have the same column names to be combined.
As long as it’s the correct number, type, and logical order of columns, names from the first query don’t have to match those of the second one.
SELECT total_revenue, units_sold FROM domestic_sales UNION ALL SELECT aggregate_revenue, pcs_ordered FROM international_sales;
As you can see, both datasets have the same number of columns.
For two datasets to be successfully combined, total_revenue and aggregate_revenue columns need to have the same type (or implicitly convertible) values. The same applies to units_sold and pcs_ordered columns.
The combined dataset will have column names of the first query - total_revenue and units_sold.
You can specify column names for the final table using the AS keyword.
SELECT total_revenue AS total, units_sold AS sold FROM domestic_sales UNION ALL SELECT aggregate_revenue, pcs_ordered FROM international_sales;
Logical order of columns
When using UNION to combine datasets vertically, mixing up the logical order of columns can skew final data.
For example, let’s imagine you have three columns, labeled: thousands, hundreds, and tens. Each record is supposed to describe a number like 1560, where thousands is 1, hundreds is 5, and tens is 6.
Let’s look at this example:
SELECT thousands, hundreds, tens FROM table UNION ALL SELECT tens, thousands, hundreds FROM table;
In this case, UNION itself won’t throw an error because you have three columns of integer type.
However, tens values from the second dataset will be aligned with thousands values from the first. Other columns will be misaligned as well, which can seriously disrupt accuracy of your data.
SQL UNION vs UNION ALL: In Practice
In this “UNION vs UNION ALL” guide, we’ll continue the StrataScratch blog tradition of breaking down SQL Interview Questions to help you understand various SQL features.
Let’s review two interview questions where UNION and UNION ALL play an integral role in finding a solution.
UNION question: Wine Variety Revenues
In this interview question from Wine Magazine, candidates have to work with sales data to find total sales for each region.
Interview Question Date: February 2020
Find the total revenue made by each region from each variety of wine in that region. Output the region, variety, and total revenue.
Take into calculation both region_1 and region_2. Remove the duplicated rows where region, price and variety are exactly the same.
Link to the question: https://platform.stratascratch.com/coding/10033-wine-variety-revenues
Understand the question
This question can be confusing, especially after looking at the table.
The important takeaway is that we need to aggregate wine revenue for each region. Every record in the winemag_p1 table has two region columns.
We’ll need to create two datasets, each consisting of region, price and variety values. We will be aggregating total revenue for each region, so we need to avoid counting the same sales record more than once. We need to find and remove duplicate records in the table. This is why we need to use UNION instead of UNION ALL.
To solve this question, candidates need to work with the winemag_p1 table.
This table contains information about wine sales in two regions. We’ll need to separate records so that there is just one region value in each record.
|126576||US||Rich and round, this offers plenty of concentrated blackberry notes enveloped in warm spices and supple oak. There's a hint of green tomato leaves throughout, but the lush fruit combined with sturdy grape tannins and high acidity would pair well with fatty short ribs or braised pork.||Estate Club||87||32||Virginia||Virginia||Merlot||Veramar|
|127077||Italy||This luminous sparkler offers measured aromas of yellow fruit and honey and delivers a broad, wide approach in the mouth. The wine's texture is creamy and full and there is a spicy point of effervescence on the palate.||Extra Dry||85||19||Veneto||Prosecco di Valdobbiadene||Prosecco||Varaschin|
|143029||US||There are some pretty features to this wine, namely the lovely tannins and rich fruit, but it's just too soft and sweet. Tastes like blackberry jam, with extra white sugar spinkled on top.||Signature||83||45||California||Paso Robles||Central Coast||Cabernet Sauvignon||Byington|
|98813||US||Lovely nose, which runs from violets and cherries right into coffee and sweet chocolate. In the mouth the wine coats the palate with Napa-like intensity, only buttressed with more vivid acidity. It's both smooth and deep. The aromas are more detailed, interesting and really clean, loaded with dark flavors of soy, coffee, molasses and black cherry liqueur.||Champoux & Loess Vineyards Reserve||93||100||Washington||Washington||Washington Other||Cabernet Sauvignon||Boudreaux Cellars|
|43172||US||Tastes sweeter and fruitier than the 2009, which was dry, so if you like orange marmalade, pineapple tart and vanilla cookie sugariness in your Sauvignon Blancs, it's for you.||Les Pionniers||86||40||California||Sonoma County||Sonoma||Sauvignon Blanc||Capture|
The table has many columns. At this point, we can only make assumptions:
- id identifies each sale
- country, region, province, winery values probably represent the origin of the wine.
- description, designation columns give us broad information about the wine.
- points values are probably used for wine ranking at Wine Magazine
- price most likely represents the price of each unit. Most likely, we’ll have to aggregate values in this column to get the total revenue per region.
- region_1 and region_2 specify the region of the wine. We’ll have to generate datasets with the region, price, variety values for both regions.
- variety column specifies the type of wine.
Looking at the available data, you might notice that some records don’t have either region_1 or region_2 values, and some are missing price values as well. Since we are aggregating price values for each region, records without region and price values will need to be filtered out.
Plan your approach
As you can see, each record contains information about wine sales in two regions. We’ll use the available data to create two separate datasets, one for each region.
Because both datasets are sourced from the same table, some records might be duplicates.
We obviously need UNION, not UNION ALL, for this task. Because it allows us to:
- Merge two datasets into one to prepare it for aggregation
- Automatically remove identical records
Next, we need to filter. As you can see, some records in the winemag_p1 table are missing region_1, region_2, and price values. We can use the WHERE clause to make sure the combined table doesn’t have any records with empty values.
Finally, we’ll have to use the SQL GROUP BY statement and SUM() aggregate function to find the aggregate revenue for each region.
Write the code
1. UNION two datasets
Each record in the winemag_p1 table contains information about two regions.
We need to use the SELECT statement to create two datasets, one for each region. For the first dataset, we’ll SELECT values from region_1, variety, and price. For the second dataset - region_2, variety, price.
Next, we need to merge two datasets again, but this time we’ll stack them on top of one another.
We’ll aggregate combined data to find total revenue by region. So we don’t want to count each record more than once. We need to remove duplicates. That’s why we use UNION, which will vertically combine two datasets and remove any rows where all three values are the same.
SELECT region_1 AS region, variety, price FROM winemag_p1 UNION SELECT region_2 AS region, variety, price FROM winemag_p1
2. Filter table
Looking at the initial table, we can see that some rows have empty region_1, region_2, or price columns. Our datasets will be based on this table, so some of the records in the combined table will be empty as well.
|Brunello di Montalcino||Sangiovese||63|
|Columbia Valley (WA)||Merlot||23|
We need WITH and AS keywords to save and reference the combined table as CTE.
Finally, we pair SELECT with the WHERE clause to filter out records where region and price values are NULL.
WITH CTE AS( SELECT region_1 AS region, variety, price FROM winemag_p1 UNION SELECT region_2 AS region, variety, price FROM winemag_p1 ) SELECT region, variety, price FROM CTE WHERE region IS NOT NULL AND price IS NOT NULL
|Brunello di Montalcino||Sangiovese||63|
|Columbia Valley (WA)||Merlot||23|
3. Aggregate sales
We want to aggregate revenue from each type of wine in each region. The logical choice is to find an aggregate of price values.
WITH CTE AS( SELECT region_1 AS region, variety, price FROM winemag_p1 UNION SELECT region_2 AS region, variety, price FROM winemag_p1 ) SELECT region, variety, SUM(price) AS price_sum FROM CTE WHERE region IS NOT NULL AND price IS NOT NULL GROUP BY region, variety ORDER BY price_sum DESC;
We use the AS keyword to give the aggregated column a descriptive name, ‘sum’.
Finally, we’ll need a GROUP BY statement to create groups of rows so that SUM() is applied to each region, not the entire table.
Lastly, the question also tells us to order regions by highest total sales to lowest.
Final output should display the total sales for each variety of wine from each region.
|Napa Valley||Cabernet Sauvignon||925|
|Diamond Mountain District||Cabernet Sauvignon||350|
|Napa Valley||Bordeaux-style Red Blend||267|
Output when we use UNION ALL instead of UNION
Let’s use UNION ALL instead of UNION to understand the difference between the two.
WITH CTE AS( SELECT region_1 AS region, variety, price FROM winemag_p1 UNION ALL SELECT region_2 AS region, variety, price FROM winemag_p1 ) SELECT region, variety, SUM(price) AS price_sum FROM CTE WHERE region IS NOT NULL AND price IS NOT NULL GROUP BY region, variety ORDER BY price_sum DESC;
An error loading the results has occured
UNION ALL does not remove duplicates, so it produces a lot more records, which might result in inaccurate aggregate values.
UNION ALL question: HR department employee
In this Amazon interview question, candidates are explicitly asked to return duplicate results. It’s an interesting use case for UNION ALL.
You have been asked to find employees in the HR department, and then output the result with one duplicate.
The employee's first name and department should be included in your output.
Note: This dataset does not contain any duplicates.
Understand the question
We are provided with information about workers. The question asks us to find workers who belong to the HR department and output their department and first name.
There is one unusual condition - we need to return two copies of the final answer.
Available data for this question is stored in the worker table. Let’s take a closer look at values in each column:
- Integer values in the worker_id column identify each employee.
- first_name and last_name columns contain employees’ first and last names of the varchar (text) type.
- Integer values in the salary column represent workers’ salaries in dollars.
- joining_date datetime values represent the date when the worker first joined
- Text values in the department column refer to each employee’s department. Since the question asks us to find employees in the HR department, we’ll most likely have to filter records by their department value.
It looks like the department column has three possible values: ‘HR’, ‘Admin’, and ‘Account’.
Plan your approach
We need to filter employees by their department and only keep rows where the value is ‘HR’.
The question also specifies that the final output should only include first_name and department values.
Finally, the answer should consist of two copies of HR employee records. There are many ways to do this, but we’ll explore the simplest approach using UNION ALL.
Write the code
1. Find employees of the HR department
First, we should SELECT records from the workers table. Question instructions clearly tell us to output only two columns - first_name and department.
We need the WHERE clause to filter employees by their department. In this case, we need only to return employees from the ‘HR’ department.
SELECT first_name, department FROM worker WHERE department = 'HR'
2. Return two copies of the dataset
The final answer should have two copies of all HR employees.
UNION ALL vertically combines the results of two queries and does not remove duplicate rows, so it’s a perfect SQL feature to answer this question.
We already have a query that returns all HR employees. We simply need to combine it with itself.
SELECT first_name, department FROM worker WHERE department = 'HR' UNION ALL SELECT first_name, department FROM worker WHERE department = 'HR'
Final answer should return two copies of HR employees’ first names and their department.
Output with UNION instead of UNION ALL
Let’s illustrate the difference between UNION vs UNION ALL by using the former instead of the latter.
SELECT first_name, department FROM worker WHERE department = 'HR' UNION SELECT first_name, department FROM worker WHERE department = 'HR';
As you can see, UNION removed all duplicate rows. The question asks us to output duplicate rows, so we need to use UNION ALL.
SQL UNION vs UNION ALL: Performance
UNION vertically combines two data sets, finds duplicate rows, and removes them. UNION ALL also combines datasets, but it doesn’t remove any records, regardless of their uniqueness.
Because of the extra steps involved, UNION is considered to be the slower of the two, especially when applied to large datasets.
This is a simple explanation that makes sense, but it doesn’t always hold up in the real world. There may be other factors involved, like internet speed, which might make UNION ALL slower. But the best is to assess UNION vs UNION ALL choice on a case-by-case basis.
The general rule is if you are confident that the dataset has no duplicates or that keeping duplicates is necessary for the task, go with UNION ALL.
If you need to ensure there are no duplicate records, go with UNION.
SQL UNION vs UNION ALL: Which is better?
Depends on the task at hand and what information you have about each dataset.
UNION ALL is more efficient, but it is only applicable if:
- You know there won’t be any duplicate values to remove
- Keeping duplicates is necessary for the task
If you want to combine two datasets and ensure there are no duplicates in the final table, use UNION instead.
In this “UNION vs UNION ALL” guide, we described UNION and UNION ALL, their similarities, differences, and tips for using two set operators.
Having a thorough knowledge of UNION and UNION ALL can increase your chances of finding a job. For example, understand how using one set operator instead of the other can change the query's output.
The StrataScratch platform has hundreds of questions where you can practice writing queries with UNION, UNION ALL, and many other SQL features. You can explore answers from the community and the thought process behind each solution. For example, understand the decision behind using UNION instead of UNION ALL to solve a question.