Data Engineer SQL Interview Questions From Top Employers

Data Engineer SQL Interview Questions
Categories


In this article, we will list important SQL concepts and walk you through 10 data engineer SQL interview questions asked by employers like Google and Amazon.

Companies collect and maintain large volumes of data to better understand their customers and improve various facets of their business. However, raw data can be a bit messy and disorganized. Companies need data engineers to refine initially collected data so that it’s easier to work with.

Data can be used for various tasks - analysis, visualization, or fed into machine learning models, just to name a few. Data engineers prepare data according to the specifics of the task so that data scientists can utilize the data instead of fixing its inconsistencies.

To do this, data engineers need to have a specific set of skills. In this article, we will discuss SQL -  the main programming language for working with relational databases. We’ll go over the most important SQL concepts tested in SQL interview questions and walk you through data engineer SQL interview questions asked at top organizations today.

Concepts tested in data engineer SQL interview questions

Data engineers perform a wide variety of tasks. Few examples: validating and ordering data, handling edge cases, implementing consistent formatting, fixing bad data, and changing data types.

Data engineers will most likely use a combination of programming languages and tools. SQL and Python are two of the most important skills. StrataScratch blog has published an article about Python Data Engineer Interview Questions as well.

We compiled a list of SQL concepts most often used in data engineers’ day-to-day jobs. If you’re preparing for an interview, make sure to master these concepts. Data engineer SQL Interview questions often require a strong knowledge of these features.

Concepts tested in data engineer SQL interview questions

We will start with easier concepts and cover difficult ones as well.

Filtering

As a data engineer, you must know how to use the WHERE clause to filter tables. It is usually paired with a condition and returns only rows that satisfy the condition.

It’s important to know how to set up a condition for the WHERE clause and how to couple it with SELECT, UPDATE and DELETE statements.

HAVING is another important filtering feature of SQL. Unlike WHERE, it is used to filter aggregated group data. It’s important to know the differences between WHERE and HAVING clauses and when to use them.

DDL and DML

Data Definition Language (DDL) and Data Manipulation Language (DML) allow data engineers to define and modify database structure - create new tables and objects, specify their name, define columns and specify types of values in them. This is a very short list of what data engineers can do using DDL and DML.

Data Type Conversion

Sometimes, data engineers have to convert values to another type. This may be necessary to perform certain SQL operations like comparison.

To be an effective data engineer, you need to know the difference between implicit and explicit type conversions. It’s important to know how to use CAST(), and CONVERT() functions for explicit conversion, and even better if you can explain the differences between these two functions. This will set you apart from other candidates interviewing for the same job.

You should also be prepared to explain implicit conversions and when they occur. This is important because one unexpected data conversion could lead to errors that are hard to identify and resolve.

ORDER BY

Arranging data using the ORDER BY statement is a basic but important SQL skill. Sometimes complex data engineering tasks can be solved using a simple ORDER BY statement.

The basic syntax of the ORDER BY statement is simple - you need to specify column values and the order (ascending or descending) to arrange rows. Advanced data engineer SQL interview questions might ask you to order rows based on values in multiple columns.

Subqueries

Any candidate interviewing for a data engineer job should know what subqueries are and how to write them.

Sometimes subqueries are described as a query in another query. They allow us to perform complex SQL operations.

Data engineers should also know how to use subqueries with various SQL features: WHERE, HAVING, FROM, SELECT.

Aggregate functions

An essential SQL feature to summarize large volumes of data. Aggregate functions are often used with GROUP BY statements to separate rows into groups and aggregate values for each group. Other times, you’ll need to aggregate data for the entire table.

There are five main aggregate functions in SQL:

  • SUM() finds the total sum of all numeric values in a column.
  • COUNT() finds the number of rows in the entire table or a group
  • MIN and MAX() are used to find the highest and lowest values in a table or a group.
  • AVG() calculates the average of all values in the column or a group

You can use these aggregate functions within a subquery to set conditions for WHERE and HAVING clauses.

You can find many examples of SQL Interview Questions involving aggregate functions on our blog.

In this article, we’ll solve interview questions in PostgreSQL, which has a wide variety of aggregate functions. Go to PostgreSQL documentation for a full list.

NULL values

Data engineers sometimes have to do what other data scientists don’t want to do - deal with NULL values.

Removing or replacing NULL values can improve the quality of data, which can make data more accessible and easy to work with.

Before dealing with them, you must understand what a NULL value is. It is not 0 or an empty text (‘ ‘) but an absence of any value. The absence of a value is typically a challenge to performing accurate data analysis.

You can’t use normal comparison operators like > < = to work with NULL values.

You can use IS NULL and IS NOT NULL operators to find values that are (and are not) NULL. Or use functions like ISNULL() to return a specific value instead of NULL.

Finally, it’s important to know how aggregate functions and NULL values work together. For example, how you can use the COUNT() aggregate function to find the number of NULL values.

COUNT(*) returns a number of values including NULL ones, whereas COUNT(column) returns the number of non-null values. The difference between them will be the number of NULL values in the table.

Text functions

In SQL, text values belong to data types such as VARCHAR, CHAR, or TEXT, and they are very common. It’s important for data engineers to be able to use built-in text functions in SQL.

CHAR_LENGTH() returns a number of symbols in the text. LOWER() and UPPER() convert text values to lowercase or uppercase. They help you ensure consistent capitalization of letters, which is absolutely necessary if the task involves case-sensitive functions.

Other important text functions are TRIM() and SUBSTRING(). The former allows you to remove unnecessary spaces before and after text values. The latter allows you to extract a specific portion of the string.

Finally, the REPLACE() function finds substring instances and replaces them with another text.

This was a short list of the most important text functions in PostgreSQL. Refer to the documentation for a full list. The more text functions you know, the better.

JOINs

Data engineers use JOINs to combine data from multiple tables into one. There are four main types of JOINs in SQL.

  • INNER JOINs return rows with overlapping values in the shared dimension.
  • LEFT JOINs return all rows from the first table but only overlapping values from the second table.
  • RIGHT JOINs are the opposite and return all rows from the second table and only matching values from the first table.
  • FULL JOINs return combined columns of both tables. If rows do not overlap, it fills absent columns with NULL.
  • CROSS JOINs return all possible combinations of columns from both tables.

Understanding various types of JOINs will help you succeed as a data engineer. You should understand their syntax and be able to recognize which type of JOIN is needed for a given task. Knowing how to use the ON statement to define a shared dimension is also important.

Window functions

You can use window functions to create new values. For example, they can be used to assign each record a numerical rank. Or create new aggregate values of the entire table or certain group of rows.

Unlike the GROUP BY statement, aggregate values are stored in a separate column, not collapsed into one row.

Before going into an interview, practice important window functions like ROW_NUMBER(), RANK(), and DENSE(). Learn how to use PARTITION BY to create groups and apply window functions to each group, not the entire table.

In later parts of this article, we’ll see practical examples of window functions and how the aforementioned functions can solve real business challenges.

Array functions

As a data engineer, you’ll have to work with complex data structures like arrays. So it’s important to have a good grasp of array functions.

You can start with learning the ARRAY() command, which takes values and adds them to an array.

ARRAY_AGG() function aggregates all items of the array. ARRAY_APPEND() function allows you to add specific value to an array. Or join two arrays together using the ARRAY_CONCAT() function.

CASE expression

Setting up conditions for filtering and joining tables is an integral part of a data engineer’s job. CASE expression can be used to set up complex conditions for WHERE and HAVING clauses.

You should know how to use CASE expressions to chain multiple conditions. That includes WHEN, THEN, and ELSE clauses, which make up the condition.

It’s important to understand how SQL determines the outcome of a CASE expression. A chain of CASE statements will return the value of whichever condition is met first. If none of the conditions are met, it will return the value of the ELSE clause. If there is no ELSE clause, CASE will return NULL.

DISTINCT

Data engineers use the DISTINCT clause for many purposes. When used with a SELECT statement, it returns only unique values. When used with aggregate functions, it ensures that they are applied to unique values only.

DISTINCT provides a very simple yet effective way to remove (or ignore) duplicate values. For that reason, it is a useful tool at any data engineer’s disposal.

Basic Data Engineer SQL Interview Questions

Basic Data Engineer SQL Interview Questions

Now, let’s see these SQL concepts used in practice.

Question #1: Cast Values to Integer

This is a typical data engineer SQL interview question, where you are asked to cast values, remove inconsistencies, and clean up the table.


Table: yelp_reviews

Link to the question: https://platform.stratascratch.com/coding/10056-cast-stars-column-values-to-integer-and-return-with-all-other-column-values

Understand the question

This data engineer SQL interview question is fairly straightforward as long as you take the time to understand what’s being asked.

There are two clear objectives: cast stars values to integer and clear the table of rows that don’t have numeric star values. Otherwise, leave the table as it is and return all other columns as well.

Analyze data

To answer this question, we have to work with a single yelp_reviews table. Let’s look at the types of values in all columns:

Table: yelp_reviews
business_namereview_iduser_idstarsreview_datereview_textfunnyusefulcool
AutohausAZC4TSIEcazRay0qIRPeMAFgjlbPUcCRiXlMtarzi9sW5w52011-06-27Autohaus is my main source for parts for an old Mercedes that has been in the family since new that I am restoring. The old beater is truly a labor of121
Citizen Public HouseZZ0paqUsSX-VJbfodTp1cQEeCWSGwMAPzwe_c1Aumd1w42013-03-18First time in PHX. Friend recommended. Friendly waitstaff. They were understaffed and extremely busy, but we never knew. The short ribs were tende000
Otto Pizza & PastrypF6W5JOPBK6kOXTB58cYrwJG1Gd2mN2Qk7UpCqAUI-BQ52013-03-14LOVE THIS PIZZA! This is now one of my favorite pizza places in phoenix area. My husband i walked into this cute family owned business and were greet000
Giant HamburgersQBddRcflAcXwE2qhsLVv7wT90ybanuLhAr0_s99GDeeg32009-03-27ok, so I tried this place out based on other reviews. I had the cheeseburger with everything, some fries, and a chocolate shake. The burger was okay, 011
Tammie Coe CakesY8UMm_Ng9oEpJbIygoGbZQMWt24-6bfv_OHLKhwMQ0Tw32008-08-25Overrated. The treats are tasty but certainly not the best I have ever had. I would have rated this a two star but the cakes and cookies are REALLY pr132

As you can see, the stars column stores text values of the ‘varchar’ type.

Business_name, review_id, user_id, review_date, and review_text similarly contain text values.

Funny, useful, and cool columns are integers.

All columns will stay the same except for the stars column. Values in this column will be converted to an integer. We must also remove rows where the stars column contains values other than numbers.

We can see that the stars column contains whole numbers to represent the yelp rating of each business. However, there are exceptions when the value is a question sign (‘?’).

It might look like values in the stars column are already numbers and don’t need to be converted. However, it’s important to understand that the ‘4’ you see in the column is actually text. Not a number 4, but a piece of text, like ‘4 apples’.

Plan your approach

To solve this data engineer SQL interview question, we need to perform two tasks - keep rows where the value of the stars column is a number (not the ‘?’), and convert values of the stars column to the integer type.

Finally, we need to return all columns. Only values in the stars column need to be converted to integers, others are returned as they were before.

Write the code

  1. Convert stars values

Let’s start by converting stars values from ‘varchar’ to ‘integer’ type.

Let’s select all other columns as well since our query needs to return them all.

In PostgreSQL, we can use the double-semicolon syntax to easily cast the value of the stars column.

SELECT business_name,
       review_id,
       user_id,
       stars :: INTEGER,
       review_date,
       review_text,
       funny,
       useful,
       cool
FROM yelp_reviews

In MySQL, you can use either CAST() or CONVERT() functions instead.

  1. Remove rows with non-numeric star values

As we’ve seen in the previous section, some records have the stars value of ‘?’.

We can simply set a WHERE clause to only keep rows where the value of the stars column is not ‘?’.

SELECT 
    business_name,
    review_id,
    user_id,
    stars :: INTEGER,
    review_date,
    review_text,
    funny,
    useful,
    cool
FROM yelp_reviews
WHERE 
    stars <> '?'

That’s it. This query will convert the stars column to integer, and filter out rows where the value of stars column is ‘?’.

Output

Output table should look exactly like before, but the value of the stars column will be of integer type. Also, records with the  ‘?’ value in the stars column will be filtered out.

All required columns and the first 5 rows of the solution are shown

business_namereview_iduser_idstarsreview_datereview_textfunnyusefulcool
AutohausAZC4TSIEcazRay0qIRPeMAFgjlbPUcCRiXlMtarzi9sW5w52011-06-27Autohaus is my main source for parts for an old Mercedes that has been in the family since new that I am restoring. The old beater is truly a labor of121
Citizen Public HouseZZ0paqUsSX-VJbfodTp1cQEeCWSGwMAPzwe_c1Aumd1w42013-03-18First time in PHX. Friend recommended. Friendly waitstaff. They were understaffed and extremely busy, but we never knew. The short ribs were tende000
Otto Pizza & PastrypF6W5JOPBK6kOXTB58cYrwJG1Gd2mN2Qk7UpCqAUI-BQ52013-03-14LOVE THIS PIZZA! This is now one of my favorite pizza places in phoenix area. My husband i walked into this cute family owned business and were greet000
Giant HamburgersQBddRcflAcXwE2qhsLVv7wT90ybanuLhAr0_s99GDeeg32009-03-27ok, so I tried this place out based on other reviews. I had the cheeseburger with everything, some fries, and a chocolate shake. The burger was okay, 011
Tammie Coe CakesY8UMm_Ng9oEpJbIygoGbZQMWt24-6bfv_OHLKhwMQ0Tw32008-08-25Overrated. The treats are tasty but certainly not the best I have ever had. I would have rated this a two star but the cakes and cookies are REALLY pr132

Question 2: Arrange worker records based on values in two columns

In this question, aspiring data engineers are asked to arrange names in an alphabetic order.


Table: worker

Link to the question: https://platform.stratascratch.com/coding/9836-sort-workers-in-ascending-order-by-the-first-name-and-in-descending-order-by-department-name

Understand the question

The premise of this data engineer SQL interview question is fairly simple. We need to arrange records based on names (text values) in two columns.

The question only tells us to sort records in ascending and descending orders. However, since values are text, it means we need to order rows alphabetically.

There’s a twist to this question - for one column (first names), we must order in ascending order (from A to Z). For the second column (department names), we must order them in descending order (from Z to A).

Analyze data

All the necessary information is contained in a single workers table. Let’s start by looking at the types of values in each column.

Table: worker
worker_idfirst_namelast_namesalaryjoining_datedepartment
1MonikaArora1000002014-02-20HR
2NiharikaVerma800002014-06-11Admin
3VishalSinghal3000002014-02-20HR
4AmitahSingh5000002014-02-20Admin
5VivekBhati5000002014-06-11Admin

We have six columns.

  • worker_id contains integer values to identify workers. Most likely, we won’t have to work with this column.
  • The first_name column contains first name values. We will order records by looking at values in this column.
  • We don’t need to work with values in the last_name, salary, or joining_date columns.
  • Department names are stored in the department column. We will order records by looking at values in this column.

Looking at the table, all values look standard. It’s worth noting that first and last names are capitalized.

Plan your approach

This question asks us to order rows based on values in two columns - first_name and department. We don’t have to change rows or values in them.

We need to select all rows and use the ORDER BY statement to arrange them in a specific order.

It’s important that you know how to order rows according to values in two columns.

Write the code

Try to write the code yourself.

Output

The correct query should return a sorted table.

All required columns and the first 5 rows of the solution are shown

worker_idfirst_namelast_namesalaryjoining_datedepartment
9AgepiArgon900002015-04-10Admin
4AmitahSingh5000002014-02-20Admin
8GeetikaChauhan900002014-04-11Admin
12JaiPatel850002014-03-21HR
10MoeAcharya650002015-04-11HR

Question 3: Find the last five records of the dataset

In this data engineer SQL interview question, data engineers are asked to select a specific part of a large dataset.


Table: worker

Link to the question: https://platform.stratascratch.com/coding/9864-find-the-last-five-records-of-the-dataset

Understand the question

Description of this question is fairly simple.

Before attempting to answer the question, it’s a good idea to look at available data and examine values, their types, and the ordering of records.

Analyze data

There is one available table named worker. We already worked with this table in the previous question.

In this case, we are not concerned with any of the six columns or values contained in them. We simply have to find and return the last five records.

However, it’s important to look at the actual table (with data) to see if we can notice anything that will help us find the last five records.

Without further ado, let’s look at the worker table with actual data:

Table: worker
worker_idfirst_namelast_namesalaryjoining_datedepartment
1MonikaArora1000002014-02-20HR
2NiharikaVerma800002014-06-11Admin
3VishalSinghal3000002014-02-20HR
4AmitahSingh5000002014-02-20Admin
5VivekBhati5000002014-06-11Admin

As you can see, every record has its own unique worker_id value.

It’s safe to assume that records are arranged in ascending order according to worker_id values. Double-check with the interviewer to confirm this pattern. Once you do, solving this data engineer SQL interview question becomes much easier.

Plan your approach

Before we can find and extract the last five records, we need to define the criteria for identifying them.

We’ve seen that each record has its own unique identifier of integer type. As long as these numbers are unique and increasing, counting the number of all rows will give us the id of the last record.

We can use that information to get the worker_id value of the last five records and set up the condition only to keep rows with those ids.

Or we can find the worker_id of the fifth row before the last and keep records where worker_id is higher than the threshold.

Write the code

Practice your SQL skills by answering the question yourself.


Output

Your final query should return the last five records.

All required columns and the first 5 rows of the solution are shown

worker_idfirst_namelast_namesalaryjoining_datedepartment
8GeetikaChauhan900002014-04-11Admin
9AgepiArgon900002015-04-10Admin
10MoeAcharya650002015-04-11HR
11NayahLaghari750002014-03-20Account
12JaiPatel850002014-03-21HR

Question 4: Primary Key Violation

In this data engineer SQL interview question, you have to find duplicate cust_id values and return the number of times they are repeated. You can find other Amazon Data Engineer Interview Questions on the StrataScratch blog.


Table: dim_customer

Link to the question: https://platform.stratascratch.com/coding/2107-primary-key-violation

Understand the question

The question has detailed instructions, but it can be confusing. Make sure to read it multiple times to get to the essence of the question.

To solve this challenge, you must know that primary keys are unique identifiers for each record. They can not be repeated. The question refers to this as the ‘primary key constraint’.

There are also instructions for how to format the output. We need to return rows with the duplicate cust_id values and the number of times they appear.

Analyze data

Looking at the available data could help you better understand the question.

Let’s start by looking at the types of values in every column of the only available table named dim_customer:

Table: dim_customer
cust_idcust_namecust_citycust_dobcust_pin_code
C273Stephen V. CookeNew York1996-11-288235
C274Peter P. MankinMount Upton1984-06-256050
C274Juan C. ParkerMertzon1989-07-076867
C274Eve E. McClureSouthfield1995-05-187791
C275Charles J. StevensOakland1975-12-025930

To solve this data engineer SQL interview question, we’ll have to find duplicate cust_id values and the number of times they are repeated. This column contains text values, specifically of the ‘varchar’ type.

cust_name, cust_city, cust_dob and cust_pin_code values are not important.

Plan your approach

In simple words, we need to find duplicate cust_id values and the number of times they are repeated.

The simplest way is to create a group of rows for each unique cust_id value, find the number of rows in each group, and return those with more than one row in the group.

Then return the number of records in each group. This will return the number of rows with the same cust_id value.

Write the code

Now that you have a logical outline of what to do, try to match it by writing the SQL query yourself.

Output

Query should return two columns - the key and the number of times it was repeated.

All required columns and the first 5 rows of the solution are shown

cust_idn_occurences
C2762
C2812
C2743

Advanced data engineer SQL interview questions

Advanced data engineer SQL interview questions

Let’s also look at some of the more difficult SQL questions asked during data engineer interviews.

Question 5: Verify the first 4 numbers of all phone numbers

In this question, we have to validate data by checking phone number values and confirm that they begin with specific 4 numbers.


Table: sf_restaurant_health_violations

Link to the question: https://platform.stratascratch.com/coding/9737-verify-that-the-first-4-digits-are-equal-to-1415-for-all-phone-numbers

Understand the question

This data engineer SQL interview question asks us to verify that all phone numbers begin with 1415. It’s safe to assume that one of the columns contains phone number values. We will have to make sure that the first four symbols of values in this column are 1415.

It’s unclear what kind of answer we need to return. It could be binary - to show whether or not all phone numbers begin with 1415. Or we might have to return the number of phone numbers that begin with 1415 or do not begin with 1415.

In situations like this, you should ask the interviewer for specific instructions.

Analyze data

The sf_restaurant_health_violations table contains information about businesses, including their phone numbers.

Let’s take a closer look at the types of values contained in each column.

Table: sf_restaurant_health_violations
business_idbusiness_namebusiness_addressbusiness_citybusiness_statebusiness_postal_codebusiness_latitudebusiness_longitudebusiness_locationbusiness_phone_numberinspection_idinspection_dateinspection_scoreinspection_typeviolation_idviolation_descriptionrisk_category
5800John Chin Elementary School350 Broadway StSan FranciscoCA9413337.798-122.403{'longitude': '-122.403154', 'needs_recoding': False, 'latitude': '37.798358', 'human_address': '{"address":"","city":"","state":"","zip":""}'}5800_201710172017-10-1798Routine - Unscheduled5800_20171017_103149Wiping cloths not clean or properly stored or inadequate sanitizerLow Risk
64236Sutter Pub and Restaurant700 Sutter StSan FranciscoCA9410237.789-122.412{'longitude': '-122.41188', 'needs_recoding': False, 'latitude': '37.78881', 'human_address': '{"address":"","city":"","state":"","zip":""}'}64236_201707252017-07-2588Routine - Unscheduled64236_20170725_103133Foods not protected from contaminationModerate Risk
1991SRI THAI CUISINE4621 LINCOLN WaySan FranciscoCA9412237.764-122.508{'longitude': '-122.507779', 'needs_recoding': False, 'latitude': '37.764073', 'human_address': '{"address":"","city":"","state":"","zip":""}'}1991_201711292017-11-2986Routine - Unscheduled1991_20171129_103139Improper food storageLow Risk
3816Washington Bakery & Restaurant733 Washington StSan FranciscoCA9410837.795-122.406{'longitude': '-122.405845', 'needs_recoding': False, 'latitude': '37.795174', 'human_address': '{"address":"","city":"","state":"","zip":""}'}3816_201607282016-07-2867Routine - Unscheduled3816_20160728_103108Contaminated or adulterated foodHigh Risk
39119Brothers Restaurant4128 GEARY BlvdSan FranciscoCA9411837.781-122.464{'longitude': '-122.463762', 'needs_recoding': False, 'latitude': '37.781148', 'human_address': '{"address":"","city":"","state":"","zip":""}'}39119_201607182016-07-1879Routine - Unscheduled39119_20160718_103133Foods not protected from contaminationModerate Risk

The table has a lot of columns.

The only column that contains a phone number is business_phone_number, so it’s safe to assume that we need to check values in this column.

We can confirm our assumption by looking at the actual table above.

It's important to note that not all rows have values in this column. Some are empty.

Plan your approach

To solve this question, we have to filter rows by two criteria - the value in the business_phone_number column is not empty. If it has a number, it must start with 1415.

Let’s assume that the question asks us to return the number of rows where the phone number does not start with 1415.

Finally, we should return the number of rows that satisfy both criteria.

Write the code

1. Get the number of all rows

We can use the COUNT() aggregate function to get the number of all records in the table.

SELECT COUNT (*)
FROM sf_restaurant_health_violations

COUNT() will return the total number of records in the table. Output will be collapsed into one row.


2. Set up two conditions

To get the answer, we need to find the number of records that satisfy the two aforementioned criteria.

We can add a WHERE clause to remove records that do not meet the criteria.

SELECT COUNT (*)
FROM sf_restaurant_health_violations
WHERE business_phone_number IS NOT NULL
  AND LEFT(business_phone_number :: TEXT, 4) <> '1415'

The COUNT(*) aggregate function will apply to filtered records.

We use the LEFT() function to take four symbols from the left side (beginning) of the value, and compare it with 1415.

Output

Our query will return the number of phone number values that exist but do not start with 1415.

All required columns and the first 5 rows of the solution are shown

count
0

Question 6: SMS Confirmations From Users

In this data engineer SQL interview question, data engineers must identify and get the number of invalid records. It is one of many Facebook Data Engineer Interview Questions.


Tables: fb_sms_sends, fb_confirmers

Link to the question: https://platform.stratascratch.com/coding/10291-sms-confirmations-from-users

Understand the question

Read the description of this question multiple times because it’s a lot to take in at once.

In short, a table contains records that should not be there. Specifically, confirmation and friend request records. It’s our job to identify misplaced records.

Finally, we are asked to find what percentage of confirmation texts were valid on August 4.

Analyze data

You’ll have to work with two tables to find the answer.

One is the fb_sms_sends table. The question tells us that some of the records in this table are invalid.

The other fb_confirmers table contains correct confirmation records.

We need to compare these two tables to calculate the percentage of confirmed texts for a specific date.

Let’s look at columns in each of these tables:

Table: fb_sms_sends
dscountrycarrierphone_numbertype
2020-08-07ESat&t9812768911confirmation
2020-08-02ADsprint9812768912confirmation
2020-08-04SAat&t9812768913message
2020-08-02AUsprint9812768914message
2020-08-07GWrogers9812768915message

the ds column contains datetime values, most likely the date of sending a message. We will need to check this column to find sms confirmations for the 4th of August.

The question does not ask us anything about the location or network of the user, so we can ignore the country and carrier columns.

Values in the type column might help us identify whether the record represents a valid message, or false records.

The phone_number contains a phone number that receives the confirmation message.

Let’s move on to the next table, which contains correct records of confirmation messages.

Table: fb_confirmers
datephone_number
2020-08-069812768960
2020-08-039812768961
2020-08-059812768962
2020-08-029812768963
2020-08-069812768964

Here we have just two columns, values in the date column indicate when the message was sent. Phone numbers are stored in the column of the same name.

Some of the confirmation records in fb_sms_sends table represent an unsuccessful attempt. We need to check them against information in the fb_confirmers table, which contains records of successful confirmations.

Plan your approach

We need some way to identify ‘confirmation’ and ‘friend request’ records that should not be in the fb_sms_sends table. We can do that by merging two tables on shared dimensions of the phone number and the date. If records from the ‘unverified’ table are not in the valid fb_confirmers table, it means they are invalid.

We will also have to set up a condition to only keep rows for messages that were sent on the 4th of August 2020 and remove messages sent on all other days.

Finally, we need to calculate what percentage of all messages were confirmed. Once we merge the table, we can get the number of records from the fb_confirmers table that are also present in the fb_sms_sends table. Then find the number of all records of the ‘message’ type and calculate the percentage.

Write the code

1. Perform a LEFT JOIN to filter out invalid records

To compare valid and invalid confirmation records, we need to combine data from two tables. The safest bet is to perform a LEFT JOIN, which will keep all records from the fb_sms_sends (including invalid records) and only match values from the fb_confirmers table.

SELECT *
FROM fb_sms_sends a
LEFT JOIN fb_confirmers b ON a.ds = b.date
AND a.phone_number = b.phone_number

We will use the ON clause to define two shared dimensions - the date when the message was sent and the phone number.

In the combined table, records from the fb_sms_sends table that don’t share the date and phone_number values will have empty columns. This will help us identify messages that were not confirmed.

All required columns and the first 5 rows of the solution are shown

dscountrycarrierphone_numbertypedatephone_number
2020-08-07ESat&t9812768911confirmation
2020-08-02ADsprint9812768912confirmation
2020-08-04SAat&t9812768913message
2020-08-02AUsprint9812768914message
2020-08-07GWrogers9812768915message


2. Filter records by send date and their type

The question specifically tells us that we need to work with messages sent on the 4th of August in 2020. Looking at the table, we can see that dates are written in the ‘MM-DD-YYYY’ format.

We need to chain another condition to specifically select records of the ‘message’ type.

SELECT *
FROM fb_sms_sends a
LEFT JOIN fb_confirmers b ON a.ds = b.date
AND a.phone_number = b.phone_number
WHERE a.ds = '08-04-2020'
  AND a.type = 'message'

Running this query will only keep message records where the value of ds column is ‘08-04-2020’ and has a type of ‘message’.

All required columns and the first 5 rows of the solution are shown

dscountrycarrierphone_numbertypedatephone_number
2020-08-04ITat&t9812768991message2020-08-049812768991
2020-08-04SAat&t9812768913message
2020-08-04TJrogers9812768934message
2020-08-04NRrogers9812768997message
2020-08-04HNat&t9812768921message

3. Calculate the percentage of valid confirmations

We need to find the percentage of valid confirmation messages vs. all messages from the fb_sms_sends table.

Looking at the output of the previous step, we know that valid confirmation records from the fb_confirmers table have date and phone_number values.

We can use the COUNT() aggregate function to find the number of valid confirmation records and the number of all messages.

Provide phone_number or date columns as an argument to the COUNT() function to find the number of records with values that are not NULL.

Repeat the same to find the number of all message records. Only this time, the argument to the COUNT() function should be one of the five columns where all records have values.

Then divide two aggregates, and you’ll have a decimal ratio. Finally, multiply the ratio by 100 to get the percentage value.

SELECT COUNT(b.phone_number)::float / COUNT(a.phone_number) * 100 AS perc
FROM fb_sms_sends a
LEFT JOIN fb_confirmers b ON a.ds = b.date
AND a.phone_number = b.phone_number
WHERE a.ds = '08-04-2020'
  AND a.type = 'message'

The result of the COUNT() function is always a whole number. Don’t forget to convert it to a float so it can be divided by another whole number.

Finally, use the AS keyword to give the final column a descriptive name.

Output

Query should return one row with a single percentage value.

All required columns and the first 5 rows of the solution are shown

perc
20

Question 7: Find the first 50% of the dataset

In this question, candidates have to return the first half of all rows in the table.


Table: worker

Link to the question: https://platform.stratascratch.com/coding/9859-find-the-first-50-records-of-the-dataset

Understand the question

The premise is fairly simple - we need to return the first half of all rows in the dataset.

Analyze data

We are given one table named worker.

Let’s look at the types of values in every column:

Table: worker
worker_idfirst_namelast_namesalaryjoining_datedepartment
1MonikaArora1000002014-02-20HR
2NiharikaVerma800002014-06-11Admin
3VishalSinghal3000002014-02-20HR
4AmitahSingh5000002014-02-20Admin
5VivekBhati5000002014-06-11Admin

To answer this data engineer interview question, we don’t need to work with any of these columns. We just have to return the first half of all rows.

While looking at data, you might notice that every record seems to have an incremental worker_id value.

This is a limited example with only 8 rows. In reality, data engineers work with much larger datasets.

Plan your approach

To get the first 50% of all rows, first, we must find the total number of rows in the table.

Once we do, we can divide the total by 2. This will give us half of all rows, let’s call this number X. We can tell SQL to return the first X number of rows, but not more.

Rows already have worker_id values, which appear unique and incremental. We can set a condition to return records where the worker_id value is less than X (a number that represents half of all records). However, there are no guarantees that these values are unique and/or incremental.

Instead, we can use window functions to assign each row a unique incremental number and tell SQL to return records with numerical values of less than X (half of all records).

Write the code

If you like a challenge, try to write a query yourself.

Output

The query should output four records because the example data showed 8 records.

All required columns and the first 5 rows of the solution are shown

worker_idfirst_namelast_namesalaryjoining_datedepartment
1MonikaArora1000002014-02-20HR
2NiharikaVerma800002014-06-11Admin
3VishalSinghal3000002014-02-20HR
4AmitahSingh5000002014-02-20Admin
5VivekBhati5000002014-06-11Admin

Question 8: Formatting Names

This question involves cleaning up data and ensuring its consistency.


Table: worker_ws

Link to the question: https://platform.stratascratch.com/coding/9831-formatting-names

Understand the question

The task is fairly simple - remove unnecessary white spaces from the left side of first name values.

The question asks us to print (return) trimmed values.

Analyze data

Let’s start by looking at the table and dataset available for this question:

Missing data

To solve this data engineer SQL interview question, we are only concerned with values in the first_name column.

Plan your approach

We need to take first name values and trim them.

The question tells us to remove white spaces only from the left side of first name values. So the standard TRIM() function is not going to work.

Write the code

Solving this challenge would be a great practice before going into a data engineering interview.

Try to write the query yourself:


Output

The query should return only trimmed first name values and nothing else.

All required columns and the first 5 rows of the solution are shown

first_name_trim
Monika
Niharika
Vishal
Amitah
Vivek

Question 9: Rows With Missing Values

In this data engineer SQL interview question, we have to find and return records with multiple empty columns.


Table: user_flags

Link to the question: https://platform.stratascratch.com/coding/2106-rows-with-missing-values

Understand the question

The objective is clear - identify rows with more than one missing value and return them.

Analyze data

Let’s look at columns in the user_flags table:

Table: user_flags
user_firstnameuser_lastnamevideo_idflag_id
RichardHassony6120QOlsfU0cazx3
MarkMayCt6BUPvE2sM1cn76u
GinaKormandQw4w9WgXcQ1i43zk
MarkMayCt6BUPvE2sM1n0vef
MarkMayjNQXAC9IVRw1sv6ib

As you can see, all columns contain text values.

We’ll have to check all columns to find out whether they are empty or have values in them.

The last record has two empty columns. It is a prime example of the kind of records we’re looking for.

Plan your approach

The question asks us to return rows with missing columns, so we need to:

  • Use the WHERE clause to remove rows that don’t have two or more missing values.
  • Set up a complex condition (using CASE statement) to identify and keep count of empty columns for each row.

Write the code

Try to write the query yourself.


Output

The final query should return all the records with more than one missing value.

All required columns and the first 5 rows of the solution are shown

user_firstnameuser_lastnamevideo_idflag_id
CourtneydQw4w9WgXcQ
GinaKorman
Greg5qap5aO4i9A
Ct6BUPvE2sM

Question 10: Find the combinations

As a data engineer, sometimes you’ll have to prepare data for analysis. For example, extract data that meets certain criteria.


Table: transportation_numbers

Link to the question: https://platform.stratascratch.com/coding/10010-find-the-combinations

Understand the question

You might have to read this data engineer SQL interview question multiple times to understand it fully.

We are given a list of numbers and need to find and output various combinations of three numbers that add up to 8.

There is an extra condition - we can not add a number to itself. Every number in the equation must be unique.

Analyze data

The transportation_numbers table is essentially a list of numbers. It has two columns, index and number, and both of them contain integer values.

We need to go through the list and pick three that add up to 8.

Table: transportation_numbers
indexnumber
15
23
37
41
50

It looks like all values in the number column are below 8.

Plan your approach

The initial table has one column with numbers.

We need to combine the table with itself three times, so that every record contains three number values.

If you use JOIN to combine tables, you can set the ON condition to ensure that all three numbers are different from each other.

Finally, you need to set up a filter that checks if three number columns add up to 8.

Write the code

Try to solve this interesting challenge yourself.


Output

All required columns and the first 5 rows of the solution are shown

num_1num_2num_3
503
341
314
305
701

Summary

Hopefully, by now, you better understand the day-to-day responsibilities of a data engineer and what a data engineer does and the importance of their work.

To do the job successfully, data engineers need to have a very specific set of skills, one of them is SQL. In this article, we listed the most important SQL concepts and walked you through data engineer SQL interview questions asked by reputable employers.

Practice on these questions to maximize your chances of landing a data engineer job. Go to the Data Engineer Interview Questions post on the StrataScratch blog to find more questions to solve. The platform has hundreds of SQL interview questions of various levels of difficulty. You can write queries, check their correctness, get hints and guidance all within one platform. Soon enough, you’ll master SQL and be prepared to solve any interview questions with confidence.

Data Engineer SQL Interview Questions
Categories


Become a data expert. Subscribe to our newsletter.