Mastering the SQL Subqueries With Multiple Tables
Categories
Mastering the tools to use subqueries involving multiple tables
The complexity of business processes and vast amounts of data in organizations often mean that a single table does not contain all the information needed to make informed decisions or answer critical questions. As a result, multiple tables are often required to gain a more complete understanding of the context and to enrich data analysis. So as a data analyst, it is essential to master SQL joins and subqueries.
Subqueries allow us to use the results of one query as the input for another query and are a powerful tool for filtering and manipulating data. But subqueries can be tricky to master, especially when dealing with multiple tables.
That's why we're here to help. In today’s tutorial, we'll teach you how to use IN, NOT IN, EXISTS, and NOT EXISTS, which are common subqueries that allow you to filter data based on specific criteria.
To help you understand everything we cover here, here’s also a Youtube video from our channel.
IN vs NOT IN
‘IN’ and ‘NOT IN’ are used to filter data based on specific values.
The WHERE clause with an equals operator is commonly used to filter data. However, when filtering by a longer list of values, using the OR operator can be tedious. In such cases, we can use the IN operator to filter data. The IN operator allows us to specify a list of values to match against a given column. It is a more concise and readable way of filtering data compared to using the OR operator repeatedly.
Let’s take an example question from Amazon:
Find information on employees who do not have the first names 'Vipul' or 'Satish' or a last name that contains a 'c'.
Link to the question: https://platform.stratascratch.com/coding/9838-find-details-of-workers-excluding-those-with-the-first-name-vipul-or-satish
worker_id | first_name | last_name | salary | joining_date | department |
---|---|---|---|---|---|
1 | Monika | Arora | 100000 | 2014-02-20 | HR |
2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
For example, using this worker dataset from Amazon, we can find the workers of the departments ‘Admin’ and ‘HR’ by:
SELECT worker_id,
first_name,
last_name
FROM worker
WHERE department='Admin'
OR department='HR'
But if given a larger dataset with more values to filter on, it is a lot more readable to use the IN operator.
SELECT worker_id, first_name, last_name FROM worker WHERE department IN ('Admin', 'HR')
Sounds simple enough? Try to solve this question. Aside from IN, you’ll also have to use the NOT IN operator!
Using IN/NOT IN with subqueries
IN and NOT IN are not only useful in filtering based on a specified list of values. By using IN and NOT IN, we can also filter the main table based on a set of values from another table or subquery without having to join the tables together!
Let’s try solving this problem.
Interview Question Date: June 2020
Find countries that are in winemag_p1 dataset but not in winemag_p2. Output distinct country names. Order records by the country in ascending order.
Link to the question: https://platform.stratascratch.com/coding/10147-find-countries-that-are-in-winemag_p1-dataset-but-not-in-winemag_p2
id | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|
126576 | US | Rich and round, this offers plenty of concentrated blackberry notes enveloped in warm spices and supple oak. There's a hint of green tomato leaves throughout, but the lush fruit combined with sturdy grape tannins and high acidity would pair well with fatty short ribs or braised pork. | Estate Club | 87 | 32 | Virginia | Virginia | Merlot | Veramar | |
127077 | Italy | This luminous sparkler offers measured aromas of yellow fruit and honey and delivers a broad, wide approach in the mouth. The wine's texture is creamy and full and there is a spicy point of effervescence on the palate. | Extra Dry | 85 | 19 | Veneto | Prosecco di Valdobbiadene | Prosecco | Varaschin | |
143029 | US | There are some pretty features to this wine, namely the lovely tannins and rich fruit, but it's just too soft and sweet. Tastes like blackberry jam, with extra white sugar spinkled on top. | Signature | 83 | 45 | California | Paso Robles | Central Coast | Cabernet Sauvignon | Byington |
98813 | US | Lovely nose, which runs from violets and cherries right into coffee and sweet chocolate. In the mouth the wine coats the palate with Napa-like intensity, only buttressed with more vivid acidity. It's both smooth and deep. The aromas are more detailed, interesting and really clean, loaded with dark flavors of soy, coffee, molasses and black cherry liqueur. | Champoux & Loess Vineyards Reserve | 93 | 100 | Washington | Washington | Washington Other | Cabernet Sauvignon | Boudreaux Cellars |
43172 | US | Tastes sweeter and fruitier than the 2009, which was dry, so if you like orange marmalade, pineapple tart and vanilla cookie sugariness in your Sauvignon Blancs, it's for you. | Les Pionniers | 86 | 40 | California | Sonoma County | Sonoma | Sauvignon Blanc | Capture |
id | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
118040 | US | A bit too soft and thus lacks structure. Shows a good array of wild berry and briary, brambly flavors. Dry and spicy, and ready now. | The Convict Rocky Ridge Vineyard | 86 | 38 | California | Rockpile | Sonoma | Paradise Ridge 2006 The Convict Rocky Ridge Vineyard Zinfandel (Rockpile) | Zinfandel | Paradise Ridge | ||
59743 | Italy | Aromas of toasted oak, vanilla and a confectionary note lead the nose while the palate offers mature Golden Delicious apple, butterscotch and walnut skin alongside bracing acidity. | Mongris Riserva | 88 | 30 | Northeastern Italy | Collio | Kerin O'Keefe | @kerinokeefe | Marco Felluga 2012 Mongris Riserva Pinot Grigio (Collio) | Pinot Grigio | Marco Felluga | |
117951 | US | Heavy in alcohol and overripe, but entirely dry, and the black currant, chocolate and pepper flavors will play well against richly sauced barbecue. | The Caboose | 84 | 30 | California | Alexander Valley | Sonoma | Starry Night 2007 The Caboose Zinfandel (Alexander Valley) | Zinfandel | Starry Night | ||
10202 | France | Fragrant; dusty plum and rose aromas are followed on the palate by very fruity flavors of cherries and apple skins. The fine bead and rich mousse result in a creamy mouthfeel. It's all a bit obvious and simple, but undeniably good. | Brut Rosa | 87 | 83 | Champagne | Champagne | Joe Czerwinski | @JoeCz | Ruinart NV Brut Rosa (Champagne) | Champagne Blend | Ruinart | |
69268 | Germany | Layers of plush yellow peach and apricot flavors highlight this easy, crowd-pleasing kabinett. Sunny lemon acidity lends refreshment to the midpalate, cutting through all the ripeness and leading to a brisk finish. Drink now through 2019. | Kabinett | 88 | 15 | Rheinhessen | Anna Lee C. Iijima | Weinreich 2012 Kabinett Riesling (Rheinhessen) | Riesling | Weinreich |
To filter out duplicate countries from the winemag_p1 dataset, we can use the NOT IN operator and a subquery that selects all the countries listed in winemag_p2. This is a more efficient alternative to using the OR operator to filter by each value. The final query becomes:
SELECT DISTINCT country
FROM winemag_p1
WHERE country NOT IN (SELECT DISTINCT country
FROM winemag_p2)
So, when it comes to using IN or NOT IN in SQL, it's all about filtering data based on a specific column and a list of allowed or excluded values. To make it happen, you need to follow a few simple steps.
- First, you'll want to select the columns you want to work with from a main table.
- Then, you can use a WHERE clause to filter where a specific column is either IN or NOT IN a list of values. This list of values can either be
- specified directly, in a tuple-like format; or
- indirectly, using a subquery that returns a list of eligible values for the column
Info box to explain the options:
SELECT col1, col2 FROM table WHERE col1 IN
- (‘allowed_value1’, ‘allowed_value2’, ‘allowed_value3’)
- (SELECT col1 FROM table2 GROUP BY col1 HAVING count(*)>2)
But here's the catch: if you decide to use a subquery, it needs to return a list of values for the specific column you're filtering on. For instance, if you're filtering based on col1, the subquery should only return col1 values.
EXISTS and NOT EXISTS
Let's discuss another SQL function that enables us to filter data based on the results from another table, which is called EXISTS and NOT EXISTS. But before we delve into some examples, let's first understand the difference between IN/NOT IN and EXISTS/NOT EXISTS, as they can be easily mixed up.
Infobox:
Let me explain it to you in simple terms - both IN/NOT IN and EXISTS/NOT EXISTS are used in the WHERE clause, but their functionalities are different. IN and NOT IN filter rows by checking if a particular value is present or not, while EXISTS and NOT EXISTS filter based on the existence of rows in a subquery. So, keep this difference in mind while working with these functions.
Let's move on and explore the EXISTS and NOT EXISTS functions with this question from Meta.
Interview Question Date: December 2021
You are given the list of Facebook friends and the list of Facebook pages that users follow. Your task is to create a new recommendation system for Facebook. For each Facebook user, find pages that this user doesn't follow but at least one of their friends does. Output the user ID and the ID of the page that should be recommended to this user.
Link to the question: https://platform.stratascratch.com/coding/2081-recommendation-system
user_id | friend_id |
---|---|
1 | 2 |
1 | 4 |
1 | 5 |
2 | 1 |
2 | 3 |
user_id | page_id |
---|---|
1 | 21 |
1 | 25 |
2 | 25 |
2 | 23 |
2 | 24 |
Discovery is the aim of recommendation systems, so the pages recommended should be new. In other words, we should only return pages once (i.e., not duplicated), and these should be pages that the user does not yet follow.
First, we need to join the user_friends table to the user_pages table to identify the pages that their friends follow.
SELECT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
But since friends may follow the same pages, we need to return only distinct page recommendations for each user. Easy peasy, we just add the DISTINCT keyword to the SELECT statement.
SELECT DISTINCT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
But now comes the real magic - using the NOT EXISTS operator to filter out pages that the user is already following. We create a subquery that checks for matching user_id and page_id combinations in the user_pages table and join it to the outer query. If this subquery returns any row, the user-page combination is removed. Otherwise, it is retained.
SELECT *
FROM users_pages pg
WHERE pg.user_id = f.user_id
AND pg.page_id = p.page_id)
And just like that, we have our solution! Our final query looks like this:
SELECT DISTINCT f.user_id,
p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
WHERE NOT EXISTS
(SELECT *
FROM users_pages pg
WHERE pg.user_id = f.user_id
AND pg.page_id = p.page_id)
Conclusion
With our tutorial today, you’ll be better equipped to make more complex SQL queries using multiple tables through IN, NOT IN, EXISTS, and NOT EXISTS. In summary, the IN operator is great for when you have a specific list of values to match against, while EXISTS is more useful for complex subquery filtering.
Check out “SQL Interview Questions” to practice more such questions.