Apple Data Scientist Interview Questions

Basic to Intermediate Concepts Tested in Apple Data Scientist Interviews

Proficient data scientists can take messy sales data and turn it into well-organized data. Having knowledge of some basic, intermediate and even advanced concepts can give you the flexibility to solve any data science interview questions at hand. To answer this specific Apple data scientist interview question, you’ll need to thoroughly understand these SQL concepts:

Basic to Intermediate Concepts Tested in Apple Data Scientist Interviews

SELECT and FROM

These two statements, in that order, are essential to compose an SQL query. These two statements are so basic that not being able to use them can instantly disqualify you from getting the job.

We use the SELECT statement to specify the data columns we’d like to view, whereas the FROM statement is used to specify the source table. You should know that SELECT allows you to view any number of columns, as long as you follow the syntax rules. Needless to say, when answering practical questions, you should understand these syntax rules and follow them.

During an interview, it’s essential to show a thorough understanding of this concept. For instance, it’s important to remember that you can only use one SELECT statement at a time. Also, you should be able to use the AS command to rename a column.

Besides writing the actual query, it’s useful to have background knowledge about the SELECT statements. For example, you should be able to explain what happens when you run it. Does it modify the data in the original table?

Also, it doesn’t hurt to understand the formatting rules for writing SELECT statements. Is it necessary to capitalize them? Why is it a good practice to use underscores instead of spaces when writing column names?

WHERE

Most of the time, the questions that involve filtering of the rows are easily solved using the WHERE statement. You should know where, when and how to use it. A good candidate should know how to set up the condition to filter the rows.

Properly setting up a condition for WHERE statement is more complicated than it might seem at first. You have to know which condition operators you can use, and how to set them up to get the desired result. A good candidate should understand the comparison of different types and choose the right approach.

Comparing numbers is easy, but comparing dates and text is a bit more complicated. An experienced candidate should be able to instinctively write these conditions to successfully filter the data. There are also syntax rules for comparing different types of data.

A good candidate should also understand and explain the underlying principles of this statement. For instance, the fact that rows in SQL are inseparable. When you use the WHERE statement, you don’t just filter out the values in one column, but if the condition after the WHERE statement is not met, the entire row is disqualified. Knowing these key details is essential, even for junior positions.

Subqueries

As a data scientist at Apple, you’ll have to write complex queries in SQL. Subqueries are a tool that can help you do that. To maximize your chances of getting a job, you should understand subqueries, their syntax, and how to use them in combination with SQL statements. In our example, we use subquery to set the condition for the WHERE statement. This is common, so a good candidate should also know how to use subqueries for setting conditions.

Sometimes writing nested subqueries is difficult. It takes some time before you can write them effortlessly. If you’re a junior candidate, you might find it easier to write inner queries first, and work your way up to outer queries.

SQL code that contains subqueries can be confusing to understand, especially if the subqueries are nested multiple levels deep. It’s important to follow the formatting rules like indentation to make your code easier to read. The common rule of thumb is to indent subqueries by one or two spaces.

LEFT JOINs

JOINs are not a basic concept, but they’re very useful for performing many different operations. There are multiple types of JOINs, all of them can serve a different purpose. Great candidates should understand all different types and explain them. More importantly, when faced with a task, a good candidate should know how to use each type of JOIN to achieve the desired result.

JOINs are sometimes used for filtering data. We use the LEFT JOIN statement to answer this particular Apple data scientist interview question, because it maps values from two tables. When faced with a question like this one, a good candidate will know the default behavior of LEFT JOINs and why they’re useful for finding records that don’t have corresponding values in another table.

JOINs are usually useless without a properly written ON statement. This includes knowing proper syntax rules for writing the actual condition and mapping the values from the two tables.

It’s important to practice writing queries using different types of JOINs, so when the need arises, you instinctively know which one you need to use. Find some most common SQL JOIN interview questions here and learn how to answer them.

Apple Data Scientist Interview Question Walkthrough

In this section, we want to walk you through one of the SQL questions recently asked to candidates who aspire to work at Apple.

Customers Without Orders

To solve this Apple data scientist interview question, a candidate must write the query that returns the name of the customers who have never placed any orders. Filtering the sales data is a common business problem at large companies like Apple.

This question is marked as being of ‘Medium’ difficulty, but as long as you know how to use the right SQL statements to view and filter the values, it can be solved easily.

Customers Without Orders

Interview Question Date: April 2019

Amazon

Apple

MediumID 9896

Data Engineer

Data Scientist

BI Analyst

Data Analyst

ML Engineer

Find customers who have never made an order. Output the first name of the customer.

Tables: customers, orders

Link to the question: https://platform.stratascratch.com/coding/9896-customers-without-orders

Interviewers at large tech manufacturers often ask questions that involve working with sales data. Even if you don’t get this exact question, the principles described in this article will help you to solve many similar type data science interview questions.

All the values necessary to solve this Apple data scientist interview question are contained in the two available tables. The challenge could be separated in two steps: write the condition to find all customers who satisfy the criteria, and filter the results to remove those who don’t. The customers table contains no information to indicate the order value, so we’ll have to work with both tables to find the answer.

Both tables have a column for unique customer identifiers. For the customers table, that is the id column, and for the orders table, it’s the cust_id column. Logically, the customers who have never placed an order should be absent from the orders table. Setting up a condition to find them and returning their corresponding name is the most logical solution to the question.

Once you’ve digested the question and understand what needs to be done, the challenge is the execution. You should use the right SQL statements and clauses to set the condition and filter the results. There are two valid solutions to this question. We’ll demonstrate both and let you decide which one’s better.

Available Dataset

Table: customers

id	first_name	last_name	city	phone_number
8	John	Joseph	San Francisco	928-386-8164
7	Jill	Michael	Austin	813-297-0692
4	William	Daniel	Denver	813-368-1200
5	Henry	Jackson	Miami	808-601-7513
13	Emma	Isaac	Miami	808-690-5201

Table: orders

id	cust_id	order_date	order_details	total_order_cost
1	3	2019-03-04	Coat	100
2	3	2019-03-01	Shoes	80
3	3	2019-03-07	Skirt	30
4	7	2019-02-01	Coat	25
5	7	2019-03-10	Shoes	80

Data Assumptions

Correctly answering interview questions like this one comes down to studying the available data. In this case we have two tables, customers and orders. We need to use values from both to arrive at the final result. The first thing to do is to decide which columns we’ll be working with, and which can be safely ignored.

Looking at the column names and their corresponding data type can give you a broad idea of the data you’ll be working with. It’s great if you can recognize columns in two tables that contain identical data. For instance, to solve the question at hand, the candidate must recognize that the id column from the customers table can be mapped to the cust_id column from the orders table.

At this stage, you should also closely study the question to determine whether or not you have to cast values to another data type, or format them in any way.

Previewing multiple rows of actual data can help you study the table and get in the right headspace to solve the question. Seeing the actual values in columns could help you remember their type and whether or not working with them is necessary to solve the question.

It’s great when the question description is specific and leaves no room for interpretation, but that’s rarely the case. Looking at the actual data can help you answer some of the questions you might have about the data.

For instance, we can make the assumption that if the customer didn’t order anything, her unique identifier id should be absent from the orders table. We can also assume that the total_order_cost describes the full cost, not cost per item.

When interviewing, you shouldn’t risk making the wrong assumptions. If you’ve tried everything: carefully read the question, looked at data types, column names, and the actual rows, but still can’t make up your mind, you can ask the interviewer. When checking assumptions, ask specific questions to make sure you understand the task correctly.

If you follow all these steps, you should be able to write an SQL query that handles the edge cases and requires a minimal amount of code.

To stress the importance of looking at the available data, let’s look at the columns from the customers table:

The id column is important, because we’ll use it to identify the users and to check whether or not they are in the cust_id column of the orders table.
The first_name column is essential, because once we’ve found the records that satisfy the criteria, we must return the corresponding value in the first_name column.
Our solution doesn’t involve the values in the last_name column, so we can ignore this part of the table.
Similarly, we don’t have to filter customers or group them by the city they live in, so we can ignore the following columns: city, address, phone_number.

Now, let’s look at the columns from the orders table:

We are not filtering orders by their id, or looking for them in another table, so we can safely ignore this column.
The cust_id column can be mapped to the values from the id column of the customers table. We’ll use this column to find the customers that are absent from the orders table, and therefore, did not place any orders.
We don’t have to track the total volume of the orders, arrange them by dates and get detailed information about them. So we can ignore the order_date, order_details, and total_order_cost columns.

Solution Logic

Apple Data Scientist Interview Questions and Answers

Once you’ve familiarized yourself with the question and all of the available data, you can formulate your approach. The solution to this Apple data scientist interview question should consist of three easy steps:

SELECT the columns you’d like to view in your final result. In this case, we will SELECT the values from the first_name column from the customers table.
Set up the condition to find the rows that satisfy the criteria.
Use WHERE statement, or LEFT JOINs to filter out all other rows.

Once broken down in three easy steps, the question becomes easy to solve. To perform the first step, you’ll need two of the most basic statements in SQL: SELECT and FROM. Even junior data scientists should be familiar with these two statements and how to use them to compose a query.

The next step involves setting the condition. To do this, you must first understand that the id column from the customers table maps to the cust_id column from the orders table. If you go with the second approach that involves LEFT JOINs, you will have to use the ON statement to map the columns.

The final step is to use the WHERE statement to filter the results. You can check if the unique customer identifier from the id column appears in the cust_id column of the orders table.

One additional step can make your query even more efficient. You can add the DISTINCT statement to only keep the unique cust_id values.

Solution Approach

Get the values for final output

If you look at the question description, the expected output is supposed to be the first names of customers that didn’t place an order.

SELECT first_name
FROM customers

This query is going to return all of the first_name values in the customers table. Let’s take a look at the output:

All required columns and the first 5 rows of the solution are shown

first_name
John
Jill
William
Henry
Emma

So far, everything is going according to plan. Now it’s time for the second step.

Set the condition and filter results

Now that we’ve written a query that returns the first names of all customers, all we have to do is check which of these customers DO NOT appear in the orders table, and keep their first names. We’ll use the value in the id column as a unique identifier.

SELECT first_name
FROM customers
WHERE id NOT IN
    (SELECT cust_id
     FROM orders)

We check whether or not each customer appears in the orders table using id as an identifier.

We use the SELECT statement to view all the cust_id values from the orders table, and check whether or not the id of each customer appears in the list. We use the WHERE statement coupled with a NOT IN condition to remove the customers whose id value appears in the output of the subquery.

This is the final result. If we run the code, we will see a list of first_name values for the customers who were absent from the orders table:

All required columns and the first 5 rows of the solution are shown

first_name
John
Emma
Liam
Mark
Jack

Use DISTINCT clause to remove duplicates

Even though the last step answers the question, the cust_id values it extracts are sometimes repeated. This is unnecessary and sub-optimal, so we can use the DISTINCT clause to keep only unique cust_id values from the orders table.

SELECT first_name
FROM customers
WHERE id NOT IN
    (SELECT DISTINCT cust_id
     FROM orders)

The final output is going to be the same as before, but we will have to conditionally check a smaller list of unique cust_id values from the orders table.

Another Right Solution

The solution discussed above is fairly simple, but it’s not the only right way to approach this Apple data scientist interview question. LEFT JOINs are also useful for solving questions that involve checking if a specific value from one table appears in another table.

To write this solution, you should be closely familiar with LEFT JOINs and their default behavior. In this case, using LEFT JOINs will have the following effect: for every id in customers, we check if there is a matching value in the cust_id column of the orders table. If there is, the tables will be joined with their respective values in the columns. If the values from two id columns can’t be mapped, all the column values will be assigned a NULL value.

Once the tables are joined, we can check which records have a NULL value in total_cost column (any other besides the id table will do) and identify the customers that were absent from the original orders table. Then we, once again, use the WHERE statement to filter the initial list to get final results.

SELECT first_name
FROM customers a
LEFT JOIN orders b ON a.id = b.cust_id
WHERE b.order_date IS NULL

If you are very comfortable with LEFT JOINs as a concept, this approach might be even simpler than the one described above. Taking this approach would not hurt your chances of getting the data science job. However, it remains a fact that it involves writing a bit more code.

2022 Update: Available data doesn’t contain the ‘order_quantity’ values anymore, so we have to slightly change the solution mentioned in the video. In the updated solution, we conditionally filter the results based on the value of the ‘order_date’ column instead.

Final Words

Working with sales data is a common task for data scientists at Apple. This, and similar questions where candidates have to filter, arrange or manipulate the data are a great way to gauge candidates’ readiness for the job. This specific question is good because it gives you the freedom to approach it multiple ways, as long as you arrive at the right answer.