Top Most Common SQL Coding Errors in Data Science

Order of Execution of SQL Query

It is critical to understand the order of execution of your SQL query. This will help you write better/efficient queries while avoiding syntactic or semantic errors. The order must be in the format below, or your code will have errors.

Get the data (This can be FROM one table or from 2 tables when JOIN executes)
Filter the data (the WHERE clause is executed when the data is available)
Grouping (When using aggregation, GROUP BY is executed after filtering the data)
HAVING clause
SELECT statements
ORDER BY

Once you understand the above order of execution, you can avoid making some typical SQL errors in your code. Now let’s focus on some of the most common SQL coding errors beginner data science folks commit.

Most Common SQL Coding Errors

SQL Coding Error #1: Use of Reserved Keywords

Reserved keywords are SQL keywords that should not be used when writing your SQL query. These keywords have special meanings in the relational engine. For example, MAX is a reserved keyword in SQL used to compute the maximum value. ORDER is a keyword used to sort the data using the ORDER BY clause. When used in the query, such keywords throw an error if not handled correctly.

Example from StrataScratch:

Most Active Users On Messenger

Interview Question Date: November 2020

id	date	user1	user2	msg_count
1	2020-08-02	kpena	scottmartin	2
2	2020-08-02	misty19	srogers	2
3	2020-08-02	jerome75	craig23	3
4	2020-08-02	taylorhoward	johnmccann	8
5	2020-08-02	wangdenise	sgoodman	2

Code With Error

SELECT id, 
       user1 user
FROM fb_messages;

In the above example, the candidate is trying to select the ID and the user_1 field from the table but using user as an alias. User is a keyword in SQL and can’t be used like that. To avoid this error, we can use the below query.

Code Without Error

SELECT id, 
       user1 AS user
FROM fb_messages;

In the above query, we have used the AS keyword to give an alias to the column user1. Thus, we need to use AS and can’t have any shortcuts when using SQL keywords.

Run the query to see if you get the same output.

All required columns and the first 5 rows of the solution are shown

id	user
1	kpena
2	misty19
3	jerome75
4	taylorhoward
5	wangdenise

The reserved keywords should ideally be avoided in your queries, so instead of using the user as the name of the column, we can change it to username as in the below query.

SELECT id, 
       user1 AS username
FROM fb_messages;

Thus, if you want to use the reserved keywords as aliases, you must use AS for giving alias, and there shouldn’t be any shortcuts.

SQL Coding Error #2: Column as Reserved Keyword

SQL Coding Error of Column as Reserved Keyword

This is a problem similar to the previous one. This SQL coding error occurs when the column in the table is named as the reserved keyword.

Suppose you have a table named it_problems. It’s a list of IT problems categorized as internal or external.

Code With Error

If you wanted to count the number of problems by the problem type (Internal/External), you would write this code.

SELECT int, 
	  COUNT(id) AS problem_count	   
FROM it_problems
GROUP BY int;

In MySQL, this would return an error. However, the query runs without the problem in PostgreSQL and returns this output.

Code Without Error

There are two ways of avoiding this problem. Except renaming the column in the database, that is.

You could use backticks the following way.

SELECT `int`, 
	  COUNT(id) AS problem_count	   
FROM it_problems
GROUP BY `int`;

Or you could name the table before the reserved keyword column name.

SELECT it_problems.int, 
	  COUNT(id) AS problem_count	   
FROM it_problems
GROUP BY it_problems.int;

Both ways would work in MySQL.

Each DB has different reserved keywords; you can find those in the documentation.

SQL Coding Error #3: Data De-Duplication

From the vast amount of solutions we have on our platform, we identified one of the most common SQL coding errors while using a DISTINCT keyword in SQL queries.

Some questions usually ask to output a user or a product based on a certain condition where a user/product might appear in multiple rows. Many users do not use the DISTINCT keyword in the query, which results in duplicated user/product in the output. An example is shown below:

Finding User Purchases

Interview Question Date: December 2020

Amazon

MediumID 10322

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Write a query that'll identify returning active users. A returning active user is a user that has made a second purchase within 7 days of any other of their purchases. Output a list of user_ids of these returning active users.

Table: amazon_transactions

Link: https://platform.stratascratch.com/coding/10322-finding-user-purchases

Table: amazon_transactions

id	user_id	item	created_at	revenue
1	109	milk	2020-03-03	123
2	139	biscuit	2020-03-18	421
3	120	milk	2020-03-18	176
4	108	banana	2020-03-18	862
5	130	milk	2020-03-28	333

From the above question, let’s imagine you need to find the user IDs that either buy milk or bread or both and output the user IDs in ascending order.

Code With Error (Semantic - Duplication in the Data)

SELECT user_id
FROM amazon_transactions
WHERE item IN ('milk','bread')
ORDER BY user_id;

All required columns and the first 5 rows of the solution are shown

user_id
100
100
101
101
102

From the above query, the output of the code will have repeated user IDs since one user might have bought both milk and bread. Thus, in order to de-duplicate the data, we need to use the DISTINCT keyword, as in the below example.

Code Without Error

SELECT DISTINCT user_id
FROM amazon_transactions
WHERE item IN ('milk','bread')
ORDER BY user_id;

Run the query to see if you get the same output.

All required columns and the first 5 rows of the solution are shown

user_id
100
101
102
103
105

When writing your SQL queries, it’s critical to think about the question and understand whether the results need to be de-duplicated or not. If you think yes, then use the DISTINCT clause to avoid any duplicates in your output.

SQL Coding Error #4: Wrong Understanding of the DISTINCT Clause

SQL Coding Error of Wrong Understanding of the DISTINCT Clause

In the above section, we saw the importance of using DISTINCT keywords in cases where we don’t need any duplication. This DISTINCT keyword can be used for one column or can be used for all the columns that the user selects.

By analyzing the solutions on the StrataScratch platform, we realized that there is a common misconception about using the DISTINCT keyword. Data science beginners usually think that they can apply distinct keywords to only a specific column and not other columns from the select clause.

Let’s demonstrate this on this question by Airbnb.

Find matching hosts and guests in a way that they are both of the same gender and nationality

Airbnb

MediumID 10078

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find matching hosts and guests pairs in a way that they are both of the same gender and nationality. Output the host id and the guest id of matched pair.

Tables: airbnb_hosts, airbnb_guests

Link: https://platform.stratascratch.com/coding/10078-find-matching-hosts-and-guests-in-a-way-that-they-are-both-of-the-same-gender-and-nationality/discussion

Table: airbnb_hosts

host_id	nationality	gender	age
0	USA	M	28
1	USA	F	29
2	China	F	31
3	China	M	24
4	Mali	M	30

Table: airbnb_guests

guest_id	nationality	gender	age
0	Mali	M	21
1	China	F	23
2	Mali	F	27
3	Australia	F	24
4	Luxembourg	M	19

The questions asks us to find the hosts and guests pairs where they are both of the same gender and nationality.

Below is an example of an incorrect query.

Code With Error

SELECT h.host_id,
       DISTINCT g.guest_id
FROM airbnb_hosts h
INNER JOIN airbnb_guests g ON h.nationality = g.nationality
AND h.gender = g.gender;

In the above example, the code will result in an error. The DISTINCT clause should be used at the beginning of listing the columns in the select query. Also, DISTINCT cannot be applied only to one column, but it automatically applies to all the columns listed in the select statement.

Code Without Error

SELECT DISTINCT h.host_id,
                g.guest_id
FROM airbnb_hosts h
INNER JOIN airbnb_guests g ON h.nationality = g.nationality
AND h.gender = g.gender;

In the above example, the code will successfully run. The DISTINCT clause is used right after the select statement. This doesn’t mean that DISTINCT is only applied to column1 in the above example, but by default, it applies to all the columns in the select statement (column1 and column2 in the above example).

All required columns and the first 5 rows of the solution are shown

host_id	guest_id
0	9
1	5
2	1
3	7
4	0

Thus, the result of the above query will give unique values of the columns host_id and guest_id, i.e., the unique combinations. If you want one column to be only unique values and the other columns to be all values, you need two different outputs/queries/results.

SQL Coding Error #5: Incorrect Use of LIMIT in Questions Where RANK() or DENSE_RANK() Should Ideally Be Used

This is another SQL coding error that beginners do in Data Science about using LIMIT in ranking questions. This will sometimes give the correct answer, but the solution would be wrong if there are any edge cases. LIMIT is used when checking the sample data in a table. For example, if we have an employee table, we can do LIMIT 10 on that table to see the first ten rows. The RANK() function is used to rank the data based on a specific condition.

We’ll show this in an example.

Ranking Hosts By Beds

Interview Question Date: July 2020

Airbnb

MediumID 10161

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Rank each host based on the number of beds they have listed. The host with the most beds should be ranked 1 and the host with the least number of beds should be ranked last. Hosts that have the same number of beds should have the same rank but there should be no gaps between ranking values. A host can also own multiple properties. Output the host ID, number of beds, and rank from highest rank to lowest.

Table: airbnb_apartments

Link: https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds

Table: airbnb_apartments

host_id	apartment_id	apartment_type	n_beds	n_bedrooms	country	city
0	A1	Room	1	1	USA	New York
0	A2	Room	1	1	USA	New Jersey
0	A3	Room	1	1	USA	New Jersey
1	A4	Apartment	2	1	USA	Houston
1	A5	Apartment	2	1	USA	Las Vegas

Now let’s change this question slightly to understand this common SQL coding error. So the new question would be: List the top 5 host IDs based on the number of beds. If there are multiple hosts with the same number of beds, then display all host IDs.

Below is the common mistake of using LIMIT in such questions.

Code With Error (Incorrect Solution)

SELECT host_id, 
       SUM(n_beds) AS number_of_beds
FROM airbnb_apartments
GROUP BY host_id
ORDER BY number_of_beds desc
LIMIT 5;

All required columns and the first 5 rows of the solution are shown

host_id	number_of_beds
10	16
3	8
6	6
5	5
9	4

From the output, you can see the top 5 host IDs based on the number of beds. But, in reality, there are more hosts with 4 beds, and in the solution, we can only see 1 at the 5th position. Thus, we need to use DENSE_RANK() to rank all the hosts and then filter based on the rank for each host ID.

Code Without Error (Using DENSE_RANK)

SELECT *
FROM 
    (
    SELECT 
        host_id, 
        SUM(n_beds) AS number_of_beds,
        DENSE_RANK() OVER(ORDER BY SUM(n_beds) DESC) AS rank
    FROM airbnb_apartments
    GROUP BY host_id
    ORDER BY number_of_beds desc
    )A
WHERE RANK <=5

All required columns and the first 5 rows of the solution are shown

host_id	number_of_beds	rank
10	16	1
3	8	2
6	6	3
5	5	4
7	4	5
1	4	5
9	4	5

From the output of the above query, we get a total of 7 rows because there are 3 hosts with rank 5 in the dataset. Thus, using LIMIT would give us wrong results, and thus, it should be used very carefully in such questions.

The RANK() and DENSE_RANK() are the window functions. It might be a good idea to make yourself familiar with them in our ultimate guide to SQL window functions.

SQL Coding Error #6: WHERE vs HAVING

Oftentimes, beginners get confused with the WHERE clause and HAVING clause and do not understand which one to use in which situation. If we look at the order of execution at the start of this article, the WHERE clause is the first thing the SQL query executes after getting the data.

The WHERE clause is used to filter specific rows, while the HAVING clause is used to filter specific groups. The HAVING clause is used when you need to filter based on a certain aggregation in the data.

Now let’s look at an example where many users try to filter based on the aggregation using the WHERE clause, but instead, they should be using a HAVING clause.

Positive Ad Channels

Uber

HardID 10013

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find the advertising channel with the smallest maximum yearly spending that still brings in more than 1500 customers each year.

Table: uber_advertising

Link: https://platform.stratascratch.com/coding/10013-positive-ad-channels

Table: uber_advertising

year	advertising_channel	money_spent	customers_acquired
2019	celebrities	10000000	1800
2019	billboards	1000000	2000
2019	busstops	1500	400
2019	buses	70000	2500
2019	tv	300000	5000

The query tries to find the distinct advertising channels with at least 1,500 customers acquired through that channel.

Code With Error

SELECT DISTINCT advertising_channel
FROM uber_advertising
WHERE MIN(customers_acquired) > 1500;

The above code will result in an error. In the WHERE condition, the user tries to find the minimum value of the customers_acquired field using a MIN() function. This is an aggregate function and can’t be used in the WHERE condition. If we need to implement a condition based on aggregation, then HAVING must be used. Below is the correct code for such scenarios.

Code Without Error

SELECT advertising_channel
FROM uber_advertising
GROUP BY advertising_channel
HAVING MIN(customers_acquired) > 1500;

The above code will successfully run and show the following output.

All required columns and the first 5 rows of the solution are shown

advertising_channel
buses
tv
celebrities
billboards

In this case, we have used the aggregate function MIN() in the HAVING clause instead of the WHERE clause. Thus, it's crucial to read the question carefully and deduce if the condition needs to be satisfied on every occasion or only once. With practice, beginners in data science must get familiar with WHERE and HAVING clauses and when to use what.

You can find more on this topic in our Database Interview Questions article.

SQL Coding Error #7: Float Division

This is another common SQL coding error beginners make when computing a division between two integer values. Let’s take an example.

Consider you have three columns in the table sales_table – date, sales, and orders.

You need to calculate a derived column sales_per_order. The columns sales and orders are integers, but sales_per_order should be of floating type.

Code With Error (Semantic)

SELECT date, 
       sales,
       orders,
       sales/orders AS sales_per_order
FROM sales_table;

The above query will run and give an output, but the new column generated will have an integer rather than a floating value.

This is not what we wanted, and thus, we are calling it a semantic error. To get the floating type column, you need to have at least 1 column with float type. Below is the correct query where we convert one column into a floating type.

Code Without Error

SELECT date, 
       sales,
       orders,
       CAST(sales AS FLOAT)/orders AS sales_per_order
FROM sales_table;

From the above query, we will get the correct result for the derived column. In this, we first converted the sales column into float using a CAST() function and then divided it with the orders column. Even if there is only one float, the operation's output will result in a float type.

Thus, by just transforming at least one column to a float type, the result of the division is going to be a floating number. Another way to change the integer column to a float is by multiplying the value with 1.0, which is similar to casting.

SQL Coding Error #8: Entities Need to Be Associated With Two Specific Instances From a Category

It might be difficult to understand what the problem here is, but we’ll explain. It’s a common problem where you need to output entities that need to have a combination of two (or more) specific values from the same column. For example, the user needs to be both an Android and iPhone user.

Many users try to solve this problem by using the AND logical operator in the WHERE clause.

Let’s look at the example where we want to find the user IDs that have at least one ‘Refinance’ and one ‘InSchool’ submission. In other words, they need to have at submissions of both categories.

Submission Types

Interview Question Date: January 2021

MetLife

Credit Acceptance

Credit Karma

EasyID 2002

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Write a query that returns the user ID of all users that have created at least one ‘Refinance’ submission and at least one ‘InSchool’ submission.

Table: loans

Link: https://platform.stratascratch.com/coding/2002-submission-types/discussion

Table: loans

id	user_id	created_at	status	type
1	100	2017-04-21	prequal_completd_offer	Refinance
2	100	2017-04-27	offer_accepted	Refinance
3	101	2017-04-22	prequal_completd_no_offer	Refinance
4	101	2017-04-23	offer_accepted	Refinance
5	101	2017-04-25	offer_accepted	Personal

Code With Error

SELECT user_id
FROM   loans
WHERE  type = 'Refinance' AND type = 'InSchool';

This code won’t return an error per se, but the output will be empty. Why? The WHERE clause doesn’t have the context of other rows. The AND operator says the type has to be ‘Refinance’ or ‘InSchool’, but this is never true – there’s no one row with the ‘Refinance’ and ‘InSchool’ values in the same row.

All required columns and the first 5 rows of the solution are shown

user_id

Code Without Error

One way to write the correct code is to use the INTERSECT operator. It will combine two SELECT statements and return their intersections as output, i.e., the rows common to both SELECT statements.

SELECT user_id
FROM   loans
WHERE  type = 'Refinance'
INTERSECT
SELECT user_id
FROM   loans
WHERE  type = 'InSchool';

The other approach could be to write two CTE. One that returns user IDs with the ‘Refinance’ submission, the other with ‘InSchool’.

Then write the SELECT statement where you JOIN the two CTEs and return the distinct user IDs to remove duplicates.

WITH refinance AS (
  SELECT user_id 
  FROM loans 
  WHERE type = 'Refinance'
),

inschool AS (
  SELECT user_id 
  FROM loans 
  WHERE type = 'InSchool')

SELECT DISTINCT r.user_id
FROM refinance r 
JOIN inschool i 
ON r.user_id = i.user_id;

Both queries will return the user whose ID is 108.

All required columns and the first 5 rows of the solution are shown

user_id
108

Summary

In this article, we covered the top SQL coding errors that Data Science beginners make by analyzing the solutions submitted to our platform. This will help the readers understand what these mistakes are, how to avoid them in the future, or what the workarounds can be. Also, we discussed the order of execution of SQL queries to help beginners understand what part executes first and what part executes last. We hope this article will help you in your journey to become a data scientist.

Don’t be overwhelmed with the topics that we discussed today. Remember, Rome wasn't built in one day, so stick with StrataScratch, and slowly and steadily you will get to your desired position. All the best.

Top Most Common SQL Coding Errors in Data Science

Categories

Order of Execution of SQL Query

Most Common SQL Coding Errors

SQL Coding Error #1: Use of Reserved Keywords

Most Active Users On Messenger

Code With Error

Code Without Error

SQL Coding Error #2: Column as Reserved Keyword

Code With Error

Code Without Error

SQL Coding Error #3: Data De-Duplication

Finding User Purchases

Code With Error (Semantic - Duplication in the Data)

Code Without Error

SQL Coding Error #4: Wrong Understanding of the DISTINCT Clause

Find matching hosts and guests in a way that they are both of the same gender and nationality

Code With Error

Code Without Error

SQL Coding Error #5: Incorrect Use of LIMIT in Questions Where RANK() or DENSE_RANK() Should Ideally Be Used

Ranking Hosts By Beds

Code With Error (Incorrect Solution)

Code Without Error (Using DENSE_RANK)

SQL Coding Error #6: WHERE vs HAVING

Positive Ad Channels

Code With Error

Code Without Error

SQL Coding Error #7: Float Division

Code With Error (Semantic)

Code Without Error

SQL Coding Error #8: Entities Need to Be Associated With Two Specific Instances From a Category

Submission Types

Code With Error

Code Without Error

Summary

Latest Posts:

From Arrays to Lists: Converting Numpy Arrays with Python

Tips for Crafting an Impressive Data Analyst Portfolio

PySpark GroupBy Guide: Super Simple Way to Group Data

Share

Follow

Categories