Technical Skills Needed as a Data Analyst with 1-2 Years of Experience

An overview of the top must-have technical skills a data analyst needs to stay competitive in the job market.

Data Science jobs are not coding competitions. As with most things in life, there is a gap between hype, expectations, and reality. Data Analyst jobs are your key to entering the larger world of Data Science. In this article, we delve into the finer aspects of Data Analyst jobs - the things that separate the good from the average and the best from the good.

We discuss the following aspects from a SQL standpoint -

  • Flavors of SQL
  • Working with messy data
  • Joins
  • Window Functions
  • Subqueries, CTEs, and Temporary Tables
  • Converting from long to wide format (pivoting)

Technical Skills Needed as a Data Analyst

While real-world problems are not the ones you encountered in the interview process, you still need a few technical skills to excel as a Data Analyst. I am focusing on SQL-specific techniques; similar functions exist in Python and R as well. I will use Postgres to solve these problems, as that is the flavor I am most comfortable with; MySQL has similar features. You can refer to the article “Postgres vs MySQL: Which is Better for Analytics” to understand the differences between Postgres and MySQL.

Data Analyst Technical Skill #1: Using JOINS


Most modern database systems store data in a highly normalized form. However, to create reports, we need to aggregate data from the individual tables. To accomplish this, we use JOINs, which are among the most important data analyst technical skills. Understanding which type of join to use is the bare minimum for clearing any Data Science or Data Analyst interview, and a Data Analyst will be using JOINs on a regular basis. Let us review the types of joins using a simple problem. This one is from an Amazon interview.


Tables: customers, orders

You can solve the problem on the platform here: https://platform.stratascratch.com/coding/10183-total-cost-of-orders

This problem uses two datasets: customers

id: int
first_name: varchar
last_name: varchar
city: varchar
address: varchar
phone_number: varchar

The data in the customers dataset looks like this.

Table: customers
id | first_name | last_name | city          | address | phone_number
8  | John       | Joseph    | San Francisco |         | 928-386-8164
7  | Jill       | Michael   | Austin        |         | 813-297-0692
4  | William    | Daniel    | Denver        |         | 813-368-1200
5  | Henry      | Jackson   | Miami         |         | 808-601-7513
13 | Emma       | Isaac     | Miami         |         | 808-690-5201

and orders

id: int
cust_id: int
order_date: datetime
order_details: varchar
total_order_cost: int

that contains the following data.

Table: orders
id | cust_id | order_date | order_details | total_order_cost
1  | 3       | 2019-03-04 | Coat          | 100
2  | 3       | 2019-03-01 | Shoes         | 80
3  | 3       | 2019-03-07 | Skirt         | 30
4  | 7       | 2019-02-01 | Coat          | 25
5  | 7       | 2019-03-10 | Shoes         | 80

Solution

Here we assume that we need to report only those customers who have made a purchase. This is a simple problem, but we can use it to explore the effects of different kinds of joins. We can solve it by joining the two tables and aggregating the total_order_cost field from the orders dataset. The key to solving this problem is to identify the kind of join we need. In fact, we can solve it with any of three joins.

Using an INNER JOIN

We start by aggregating the order value for each customer from the orders table.

SELECT
orders.cust_id
, sum(orders.total_order_cost)
FROM orders
GROUP BY 1
ORDER BY 2 DESC
;

This gives the following output.

Output for Amazon data analyst interview question

We can then use an INNER JOIN to add the customer information from the customers table.

SELECT
orders.cust_id
, customers.first_name
, sum(orders.total_order_cost)
FROM orders
INNER JOIN customers
    ON orders.cust_id = customers.id
GROUP BY 1,2
ORDER BY 3 DESC
;

Using a LEFT JOIN

If we consider the orders table as the left table, we can replace the INNER JOIN in the above query with a LEFT JOIN and get the same results.

SELECT
orders.cust_id
, customers.first_name
, sum(orders.total_order_cost)
FROM orders
LEFT JOIN customers
    ON orders.cust_id = customers.id
GROUP BY 1,2
ORDER BY 3 DESC
;

Using a RIGHT JOIN

Assume that we need to keep the order of the datasets in the query - first the customers table and then the orders table. This scenario is common when you are reusing existing code, adding additional fields from the right table to the base table in each join. You can learn more about joining multiple tables in SQL here. In this case, we need to use a RIGHT JOIN, as we need to keep all the rows present in the RIGHT table (orders).

SELECT
orders.cust_id
, customers.first_name
, sum(orders.total_order_cost)
FROM customers
RIGHT JOIN orders
    ON orders.cust_id = customers.id
GROUP BY 1,2
ORDER BY 3 DESC
;

* Note: nothing will get the goat of SQL programmers more than a new analyst using a RIGHT JOIN. Even though it is a perfectly valid solution, you might end up triggering intense debates on why this is blasphemy, as can be seen here. However, all is not lost. If you must pacify these religious types, you can still use an INNER JOIN, or better yet, a LEFT JOIN like this.

SELECT
orders.cust_id
, customers.first_name
, sum(orders.total_order_cost)
FROM customers
LEFT JOIN orders
    ON orders.cust_id = customers.id
WHERE orders.cust_id IS NOT NULL    
GROUP BY 1,2
ORDER BY 3 DESC
;

What the above does is remove any rows from the final output that do not have a matching value in the right table (essentially a convoluted INNER JOIN).

Check out our post "SQL JOIN Interview Questions" to practice more such questions.

Data Analyst Technical Skill #2: Subqueries, CTEs, and TEMP Tables


As queries become more and more complex, it is not always possible to solve a problem with a single SQL statement. We need to store an intermediate result set so that we can reuse it as if it were a separate table. We can use a subquery, a CTE (Common Table Expression), or a TEMP (temporary) table for this. As with other SQL practices, this is another raging discussion point among SQL geeks. If you are writing large queries, especially for a data pipeline, my personal preference is a TEMP table, since we can iteratively test our code without having to run the entire query every single time. However, sometimes the choice is forced upon us. For example, if you are using a SQL query in a visualization tool, chances are you will not be allowed to create a TEMP table. Organizations also block the creation of TEMP tables to prevent table bloat: TEMP tables are not removed until the user session ends, so large ones can eat up resources shared with other users.

Between a CTE and a subquery, I prefer a CTE most of the time, as it is easier to read and debug, unless it is a one-liner subquery. Let us try to solve a question using each of the three approaches. This one is from a Google Data Science Interview.


Table: google_gmail_emails

You can solve this problem on the platform here: https://platform.stratascratch.com/coding/10086-email-details-based-on-sends

The problem uses the google_gmail_emails dataset with the following fields.

id: int
from_user: varchar
to_user: varchar
day: int

The data in the table looks like this.

Table: google_gmail_emails
id | from_user          | to_user            | day
0  | 6edf0be4b2267df1fa | 75d295377a46f83236 | 10
1  | 6edf0be4b2267df1fa | 32ded68d89443e808  | 6
2  | 6edf0be4b2267df1fa | 55e60cfcc9dc49c17e | 10
3  | 6edf0be4b2267df1fa | e0e0defbb9ec47f6f7 | 6
4  | 6edf0be4b2267df1fa | 47be2887786891367e | 1

Solution

This is not a difficult question to solve. However, we will use it to describe how to use each of a subquery, a CTE, and a TEMP table. The approach to the problem is very straightforward -

  • First, we count the DISTINCT number of sending users (from_user) and the DISTINCT number of receiving users (to_user) per day and keep only those days where the number of receiving users is greater than the number of sending users.
  • Then, for these days, we output all the records from the base table google_gmail_emails.

Since we cannot accomplish these in a single SQL statement, we have to use a subquery, CTE, or TEMP table.

The basic query to find the days where the number of receiving users is more than the number of sending users is

SELECT 
day
, COUNT(DISTINCT from_user) AS NUM_SENDERS
, COUNT(DISTINCT to_user) AS NUM_RECEIVERS
FROM google_gmail_emails
GROUP BY 1
HAVING COUNT(DISTINCT from_user) < COUNT(DISTINCT to_user)

Let us now use this query in each of the three ways.

Subquery

SELECT * FROM 
google_gmail_emails
WHERE day in 
    (
        SELECT day from 
            (
                SELECT 
                day
                , COUNT(DISTINCT from_user) AS NUM_SENDERS
                , COUNT(DISTINCT to_user) AS NUM_RECEIVERS
                FROM google_gmail_emails
                GROUP BY 1
                HAVING COUNT(DISTINCT from_user) < COUNT(DISTINCT to_user)
        ) AS subquery
    )

CTE

WITH cte_table AS 
    (
        SELECT 
        day
        , COUNT(DISTINCT from_user) AS NUM_SENDERS
        , COUNT(DISTINCT to_user) AS NUM_RECEIVERS
        FROM google_gmail_emails
        GROUP BY 1
        HAVING COUNT(DISTINCT from_user) < COUNT(DISTINCT to_user)
    ) 
SELECT * FROM 
google_gmail_emails
WHERE day in (SELECT day from cte_table)

TEMP Tables

DROP TABLE IF EXISTS tmp;
CREATE TEMP TABLE tmp AS
    SELECT 
    day
    , COUNT(DISTINCT from_user) AS NUM_SENDERS
    , COUNT(DISTINCT to_user) AS NUM_RECEIVERS
    FROM google_gmail_emails
    GROUP BY 1
    HAVING COUNT(DISTINCT from_user) < COUNT(DISTINCT to_user)
;        
SELECT * FROM 
google_gmail_emails
WHERE day in (SELECT day from tmp)
;

Note: the above code will not work on the StrataScratch platform, as TEMP tables are disabled there.

Data Analyst Technical Skill #3: Transforming from Long to Wide Format (Pivot tables)


One of the key reasons why people, including yours truly, love spreadsheets is the ability to easily create cross-tabulations or pivot tables. To give a simple example, let us consider this dataset of customers of different genders and age groups.

Using a pivot table skill as a data analyst

Suppose we want to see how the distribution varies across gender and age categories. With spreadsheet software, we can simply use a pivot table to convert from rows to columns easily.

Using a pivot table skill as a data analyst

While some implementations of SQL, like BigQuery and Microsoft SQL Server, have their own PIVOT functions, these are not as versatile as those in a spreadsheet program. We can overcome this by using the CASE WHEN expression. Let us illustrate this using an example. This one is from an ESPN Data Science Interview.


Table: olympics_athletes_events

You can solve this question here: https://platform.stratascratch.com/coding/9959-olympic-medals-by-chinese-athletes

The problem uses the olympics_athletes_events dataset with the following fields.

id: int
name: varchar
sex: varchar
age: float
height: float
weight: float
team: varchar
noc: varchar
games: varchar
year: int
season: varchar
city: varchar
sport: varchar
event: varchar
medal: varchar

The data is presented in the following manner.

Table: olympics_athletes_events
id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal
3520 | Guillermo J. Amparan | M | | | | Mexico | MEX | 1924 Summer | 1924 | Summer | Paris | Athletics | Athletics Men's 800 metres |
35394 | Henry John Finchett | M | | | | Great Britain | GBR | 1924 Summer | 1924 | Summer | Paris | Gymnastics | Gymnastics Men's Rings |
21918 | Georg Frederik Ahrensborg Clausen | M | 28 | | | Denmark | DEN | 1924 Summer | 1924 | Summer | Paris | Cycling | Cycling Men's Road Race Individual |
110345 | Marinus Cornelis Dick Sigmond | M | 26 | | | Netherlands | NED | 1924 Summer | 1924 | Summer | Paris | Football | Football Men's Football |
54193 | Thodore Tho Jeitz | M | 26 | | | Luxembourg | LUX | 1924 Summer | 1924 | Summer | Paris | Gymnastics | Gymnastics Men's Individual All-Around |

Solution

  • In order to solve this problem, we need to tag each row by year, creating a new indicator column for each of the years 2000 through 2016.
  • Then we aggregate these new columns by medal and output the result in descending order of total medals.

To create a new variable for each year, we use the CASE WHEN expression.

SELECT *
, CASE WHEN YEAR = 2000 THEN 1 ELSE 0 END AS MEDALS_2000
, CASE WHEN YEAR = 2004 THEN 1 ELSE 0 END AS MEDALS_2004
, CASE WHEN YEAR = 2008 THEN 1 ELSE 0 END AS MEDALS_2008
, CASE WHEN YEAR = 2012 THEN 1 ELSE 0 END AS MEDALS_2012
, CASE WHEN YEAR = 2016 THEN 1 ELSE 0 END AS MEDALS_2016
, CASE WHEN YEAR>= 2000 AND YEAR <= 2016 THEN 1 END AS TOTAL_MEDALS
FROM olympics_athletes_events
WHERE 
YEAR>= 2000 AND YEAR <= 2016
AND UPPER(TEAM) = 'CHINA'

We get an output like this.

Output for ESPN data analyst interview question

As you can see, additional columns are created representing the year in which the medal was won, along with the total. Note that we could also calculate the total separately.

Now we simply aggregate this table and get the required output.

WITH CAT_MEDALS AS (
    SELECT *
    , CASE WHEN YEAR = 2000 THEN 1 ELSE 0 END AS MEDALS_2000
    , CASE WHEN YEAR = 2004 THEN 1 ELSE 0 END AS MEDALS_2004
    , CASE WHEN YEAR = 2008 THEN 1 ELSE 0 END AS MEDALS_2008
    , CASE WHEN YEAR = 2012 THEN 1 ELSE 0 END AS MEDALS_2012
    , CASE WHEN YEAR = 2016 THEN 1 ELSE 0 END AS MEDALS_2016
    , CASE WHEN YEAR>= 2000 AND YEAR <= 2016 THEN 1 END AS TOTAL_MEDALS
    FROM olympics_athletes_events
    WHERE 
    YEAR>= 2000 AND YEAR <= 2016
    AND UPPER(TEAM) = 'CHINA'
)
SELECT 
    MEDAL
    , SUM(MEDALS_2000) AS MEDALS_2000
    , SUM(MEDALS_2004) AS MEDALS_2004
    , SUM(MEDALS_2008) AS MEDALS_2008
    , SUM(MEDALS_2012) AS MEDALS_2012
    , SUM(MEDALS_2016) AS MEDALS_2016
    , SUM(TOTAL_MEDALS) AS TOTAL_MEDALS
FROM CAT_MEDALS
GROUP BY 1
ORDER BY 7 DESC
;

The CASE WHEN expression is very helpful for simulating pivot-table-like output and can be used whenever you need to move data from long to wide format.
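As an aside, Postgres also offers the aggregate FILTER clause, which expresses the same pivot a little more compactly. A minimal sketch, assuming the same table and column names as above; it should produce the same output as the query we just wrote.

SELECT
medal
-- each filtered count keeps only the medals won in that year
, COUNT(*) FILTER (WHERE year = 2000) AS medals_2000
, COUNT(*) FILTER (WHERE year = 2004) AS medals_2004
, COUNT(*) FILTER (WHERE year = 2008) AS medals_2008
, COUNT(*) FILTER (WHERE year = 2012) AS medals_2012
, COUNT(*) FILTER (WHERE year = 2016) AS medals_2016
, COUNT(*) AS total_medals
FROM olympics_athletes_events
WHERE year BETWEEN 2000 AND 2016
AND UPPER(team) = 'CHINA'
GROUP BY 1
ORDER BY 7 DESC
;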

Data Analyst Technical Skill #4: Window Functions


As the name suggests, a window function differs from the usual SQL operations that work on the entire table. This is one of the must-know data analyst technical skills. These functions work on a part of the table (a window). With SQL window functions, we can avoid the usual aggregate-then-join sequence by simply creating a window and having the function perform its operation on that window, saving both code and computation.

The general-purpose window functions available in Postgres are ROW_NUMBER(), RANK(), DENSE_RANK(), PERCENT_RANK(), CUME_DIST(), NTILE(), LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE(). In addition, any aggregate function, such as SUM() or AVG(), can act as a window function when given an OVER clause.
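For example, using the orders table from the first problem, the window form of SUM() can attach each customer's order total to every row without collapsing the detail rows. A minimal sketch:

SELECT
id
, cust_id
, total_order_cost
-- per-customer total computed over a window, while keeping the detail rows
, SUM(total_order_cost) OVER (PARTITION BY cust_id) AS cust_total_cost
FROM orders
;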

To understand how these window functions work, let us take this example. This one appeared in a DoorDash Data Science Interview.


Table: doordash_delivery

You can solve this problem here: https://platform.stratascratch.com/coding/2036-lowest-revenue-generated-restaurants

This problem uses the doordash_delivery dataset that has the following fields.

customer_placed_order_datetime: datetime
placed_order_with_restaurant_datetime: datetime
driver_at_restaurant_datetime: datetime
delivered_to_consumer_datetime: datetime
driver_id: int
restaurant_id: int
consumer_id: int
is_new: bool
delivery_region: varchar
is_asap: bool
order_total: float
discount_amount: int
tip_amount: float
refunded_amount: float

The data is arranged in the following manner.

Table: doordash_delivery
customer_placed_order_datetime | placed_order_with_restaurant_datetime | driver_at_restaurant_datetime | delivered_to_consumer_datetime | driver_id | restaurant_id | consumer_id | is_new | delivery_region | is_asap | order_total | discount_amount | tip_amount | refunded_amount
2020-05-25 16:36:00 | 2020-05-25 18:17:00 | | 2020-05-25 18:47:00 | 257 | 154 | 190327 | TRUE | San Jose | FALSE | 36.99 | 6 | 1.42 | 0
2020-05-02 16:32:00 | 2020-05-02 17:59:00 | | 2020-05-02 18:50:00 | 168 | 297 | 12420 | FALSE | San Jose | FALSE | 32.71 | 0 | 3.67 | 0
2020-05-24 16:38:00 | 2020-05-24 20:03:00 | 2020-05-24 20:37:00 | 2020-05-24 20:49:00 | 290 | 201 | 106412 | TRUE | San Jose | FALSE | 33.35 | 6 | 1.25 | 0
2020-05-17 17:34:00 | 2020-05-17 18:01:00 | 2020-05-17 18:09:00 | 2020-05-17 18:35:00 | 232 | 188 | 104751 | FALSE | San Jose | FALSE | 33.02 | 6 | 2.48 | 0
2020-05-24 17:40:00 | 2020-05-24 18:04:00 | | 2020-05-24 18:51:00 | 290 | 135 | 13641 | TRUE | San Jose | FALSE | 24.49 | 6 | 2.45 | 0

Solution

To solve this problem,

  • We need to aggregate the order_total for the month of May 2020 for each restaurant
  • Calculate each restaurant's percentile based on the aggregated order_total
  • Output the restaurants (and their sales) whose percentile ranks fall in the bottom 2%

We start off by aggregating the totals for the month of May 2020.

SELECT
restaurant_id
, sum(order_total)
FROM doordash_delivery
WHERE delivered_to_consumer_datetime BETWEEN '2020-05-01' AND '2020-05-31'
GROUP BY 1
;

Note: We have used the full date range to get the May 2020 data. Later in this article, we will explore some of the most common Date and Time functions in SQL that a Data Analyst should know.

We get an output like this.

Output for DoorDash data analyst interview question

Now we need to assign percentiles to these restaurants. To do this, we invoke the built-in NTILE() window function.

SELECT
restaurant_id
, sum(order_total)
, NTILE(100) OVER
(
ORDER BY sum(order_total) 
) AS ntile
FROM doordash_delivery
WHERE delivered_to_consumer_datetime BETWEEN '2020-05-01' AND '2020-05-31'
GROUP BY 1
;

We get an output like this.

Output for DoorDash data analyst interview question

Now we can simply keep the rows of this dataset with an ntile value <= 2. Note that we cannot do this in the WHERE or HAVING clause, since window functions are evaluated after WHERE and HAVING. We therefore use a CTE.

WITH PCTILES AS (
SELECT
restaurant_id
, sum(order_total) AS total_order
, NTILE(100) OVER
(
ORDER BY sum(order_total)
) AS ntile
FROM doordash_delivery
WHERE delivered_to_consumer_datetime BETWEEN '2020-05-01' AND '2020-05-31'
GROUP BY 1
)
SELECT
restaurant_id
, total_order
FROM PCTILES
WHERE ntile <= 2
;

This is the power of a window function: you can work with aggregations in the same query, which saves computational time and makes for an elegant solution.
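As an aside, PERCENT_RANK() offers an alternative to NTILE() here: it returns a relative rank between 0 and 1 instead of a bucket number. A sketch under the same assumptions as the query above (ties may be bucketed slightly differently than with NTILE):

WITH PCTILES AS (
SELECT
restaurant_id
, SUM(order_total) AS total_order
-- relative rank of each restaurant's total, from 0 (lowest) to 1 (highest)
, PERCENT_RANK() OVER (ORDER BY SUM(order_total)) AS pct_rank
FROM doordash_delivery
WHERE delivered_to_consumer_datetime BETWEEN '2020-05-01' AND '2020-05-31'
GROUP BY 1
)
SELECT
restaurant_id
, total_order
FROM PCTILES
WHERE pct_rank <= 0.02
;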

Data Analyst Technical Skill #5: Date and Time Manipulation


One of the key requirements for most businesses is to understand the evolution of key metrics over time, so time-series-based analyses are quite commonplace. This makes DateTime manipulation one of the important data analyst technical skills, and a Data Analyst should be comfortable working with date and time fields. Thankfully, most modern flavors of SQL (SQLite excepted) support a range of date and time functions. Let us solve a few questions using date-time functionality, starting with the previous question. We need the percentile values for the month of May 2020, and we can get the relevant rows with the DATE_TRUNC function. As the name suggests, DATE_TRUNC truncates a date-time field to a specified precision; truncating at the monthly level maps every datetime value to the first of its month, so we can then simply filter on that single date.
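A quick illustration of the truncation (the timestamp literal is arbitrary):

SELECT DATE_TRUNC('MONTH', TIMESTAMP '2020-05-25 16:36:00') AS truncated;
-- returns 2020-05-01 00:00:00

Applying this to the full solution: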

WITH PCTILES AS (
SELECT
restaurant_id
, sum(order_total) AS total_order
, NTILE(100) OVER
(
ORDER BY sum(order_total)
) AS ntile
FROM doordash_delivery
WHERE DATE_TRUNC('MONTH', delivered_to_consumer_datetime) = '2020-05-01'::DATE
GROUP BY 1
)
SELECT
restaurant_id
, total_order
FROM PCTILES
WHERE ntile <= 2
;

We get exactly the same result as our solution using the BETWEEN operator.
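Another common option is the EXTRACT function, which pulls a single component out of a date or timestamp. A sketch of the same May 2020 aggregation using EXTRACT (equivalent in effect, though it checks two conditions instead of one):

SELECT
restaurant_id
, SUM(order_total) AS total_order
FROM doordash_delivery
-- keep only rows whose delivery timestamp falls in May 2020
WHERE EXTRACT(YEAR FROM delivered_to_consumer_datetime) = 2020
AND EXTRACT(MONTH FROM delivered_to_consumer_datetime) = 5
GROUP BY 1
;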

Let us try another problem on datetime functions. This one is from an Amazon Data Science Interview.


Table: amazon_transactions

You can solve the problem here: https://platform.stratascratch.com/coding/10322-finding-user-purchases

The problem uses the amazon_transactions dataset with the following fields.

id: int
user_id: int
item: varchar
created_at: datetime
revenue: int

The data looks like this -

Table: amazon_transactions
id | user_id | item    | created_at | revenue
1  | 109     | milk    | 2020-03-03 | 123
2  | 139     | biscuit | 2020-03-18 | 421
3  | 120     | milk    | 2020-03-18 | 176
4  | 108     | banana  | 2020-03-18 | 862
5  | 130     | milk    | 2020-03-28 | 333

Solution

The key to solving the problem is to tag those transactions that happened within 7 days of the previous transaction for the same user. To do this

  • We compare every transaction for a given user with the immediately previous transaction
  • If the time period is within 7 days, then we tag the transaction as within7days
  • Finally, we output all the users with one or more within7days tags.

To make this comparison, we create a window. Unlike the previous windows, this one must cover a single user only, so we use the PARTITION BY clause. We can then order by the timestamp (created_at) and use the LAG() function, which returns the nth previous value in the window for a given n.

SELECT 
    user_id
    , created_at
    , LAG(created_at, 1) OVER (
    PARTITION BY user_id
    ORDER BY user_id, created_at
    )
FROM amazon_transactions
;

We get this output.

Output for Amazon data analyst interview question

As you can see, the PARTITION BY clause splits the window on the basis of user_id, and the LAG function is applied only within each partition. Therefore, we never compare the timestamps of different users.

Once this is done, the rest of the problem is pretty straightforward. We compare this lagged timestamp with the current timestamp and check whether it falls within 7 days, using the INTERVAL operator.

SELECT 
    user_id
    , created_at
    , LAG(created_at, 1) OVER (
    PARTITION BY user_id
    ORDER BY user_id, created_at
    ) AS previous_timestamp
    , CASE 
    WHEN created_at <= LAG(created_at, 1) OVER (
    PARTITION BY user_id
    ORDER BY user_id, created_at
    ) + INTERVAL '7 DAYS' 
    THEN 1 ELSE 0 END as within7days
FROM amazon_transactions
;

We get this output. We have successfully tagged if a transaction happened within 7 days of the previous transaction.

Output for Amazon data analyst interview question

We can now simply put this into a CTE and use SELECT DISTINCT to get the relevant user_ids.

WITH time_lag AS (
SELECT 
    user_id
    , created_at
    , LAG(created_at, 1) OVER (
    PARTITION BY user_id
    ORDER BY user_id, created_at
    ) AS previous_timestamp
    , CASE 
    WHEN created_at <= LAG(created_at, 1) OVER (
    PARTITION BY user_id
    ORDER BY user_id, created_at
    ) + INTERVAL '7 DAYS' 
    THEN 1 ELSE 0 END as within7days
FROM amazon_transactions
)
SELECT DISTINCT user_id 
FROM time_lag 
WHERE within7days = 1
;

We have a detailed article on date-time functionalities in SQL here: “SQL Scenario Based Interview Questions and Answers”.

Data Analyst Technical Skill #6: Text Manipulation


Another important data analyst skill is the ability to handle text-based fields. Free-text fields can cause a lot of problems: without validation, we can end up with typos, grammatical errors, inconsistent values, and so on. One way of manipulating text is to use regex, but we will save that for another day. Even without regex, there are quite a few text functions we can employ - a few common ones are sketched below. We will then look at an example that came up in a Postmates Data Science Interview.
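A minimal sketch of some everyday Postgres string functions (the string literals are purely illustrative):

SELECT
LOWER('  New York  ') AS lowered                      -- '  new york  '
, TRIM('  New York  ') AS trimmed                     -- 'New York'
, REPLACE('New York', ' ', '_') AS replaced           -- 'New_York'
, SUBSTRING('New York' FROM 1 FOR 3) AS first_three   -- 'New'
;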


Tables: postmates_orders, postmates_markets, postmates_partners

You can solve the problem here: https://platform.stratascratch.com/coding/2016-pizza-partners

The problem uses three datasets.

postmates_orders

id: int
customer_id: int
courier_id: int
seller_id: int
order_timestamp_utc: datetime
amount: float
city_id: int

With the following data.

Table: postmates_orders
id | customer_id | courier_id | seller_id | order_timestamp_utc | amount | city_id
1  | 102         | 224        | 79        | 2019-03-11 23:27:00 | 155.73 | 47
2  | 104         | 224        | 75        | 2019-04-11 04:24:00 | 216.6  | 44
3  | 100         | 239        | 79        | 2019-03-11 21:17:00 | 168.69 | 47
4  | 101         | 205        | 79        | 2019-03-11 02:34:00 | 210.84 | 43
5  | 103         | 218        | 71        | 2019-04-11 00:15:00 | 212.6  | 47

postmates_markets

id: int
name: varchar
timezone: varchar

With the following data

Table: postmates_markets
id | name    | timezone
43 | Boston  | EST
44 | Seattle | PST
47 | Denver  | MST
49 | Chicago | CST

and postmates_partners

id: int
name: varchar
category: varchar

With the following data

Table: postmates_partners
id | name           | category
71 | Papa John's    | Pizza
75 | Domino's Pizza | Pizza
77 | Pizza Hut      | Pizza
79 | Papa Murphy's  | Pizza

Solution

The datasets are reasonably small. However, let us use a solution that will scale with larger datasets as well. To solve this problem we will need to -

  • Find the average order value by partner
  • Keep only those partners that have the word ‘pizza’ in their name and are located in Boston.

We start off by aggregating the postmates_orders dataset and joining it with the postmates_partners dataset to get the partner's name.

We also join the postmates_markets dataset - you'll see why in the next step.

SELECT
pp.name
, AVG(amount)
FROM postmates_orders AS po
LEFT JOIN postmates_markets as pm
    ON PO.city_id = PM.id
LEFT JOIN postmates_partners as pp
    ON po.seller_id = pp.id
GROUP BY 1
;

We get the following output.

Output for Postmates data analyst interview question

Then we subset the cities to Boston and the business names to those containing pizza. There are multiple ways to go about it. We can convert all strings to lowercase (or uppercase) and then use the LIKE operator. Alternatively, Postgres has the ILIKE operator, a case-insensitive LIKE. We use both to demonstrate how they work.

SELECT
pp.name
, AVG(amount)
FROM postmates_orders AS po
LEFT JOIN postmates_markets as pm
    ON PO.city_id = PM.id
LEFT JOIN postmates_partners as pp
    ON po.seller_id = pp.id
WHERE 
pm.name ILIKE 'boSTon'
AND LOWER(pp.name) LIKE '%pizza%'
GROUP BY 1   
;

You can learn more about SQL text and array manipulation functions here.

Data Analyst Technical Skill #7: Dealing with Messy Data


Until now, we have dealt with relatively clean datasets. However, as mentioned earlier, real-life datasets are not so spick and span. Dealing with messy data is also one of the important data analyst technical skills, and SQL comes with a variety of functions to clean the data. Let us look at a couple of very common cases.

Missing Data

Missing value imputation is one of the most common problems facing a Data Analyst. Missing data points can wreak havoc on calculations and result in a lot of funny-looking trends. One way to impute missing data points is the COALESCE function, which returns the first non-null value from a given list of arguments - a quick illustration follows. Let us then use it in practice with a problem that came up in an Asana Data Science Interview.
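A minimal illustration of COALESCE (literal values only):

SELECT COALESCE(NULL, NULL, 0) AS imputed;
-- returns 0, the first non-null argument in the list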


Tables: asana_users, asana_actions

You can solve this problem here: https://platform.stratascratch.com/coding/2096-completed-tasks

The problem uses two datasets: asana_users, which has the following fields.

user_id: int
name: varchar
surname: varchar
company: varchar

The data contained in asana_users looks like this.

Table: asana_users
user_id | name            | surname | company
161     | Donna Lee       |         | ClassPass
162     | Jennifer Silcox |         | Impact Justice
163     | Felix Ballard   |         | ClassPass
164     | Janet Ball      |         | ClassPass
165     | John            | Cuevas  | PGA

And the asana_actions dataset with the following fields

user_id: int
date: datetime
num_actions: int
action_name: varchar

The data in the asana_actions dataset looks like this -

Table: asana_actions
user_id | date       | num_actions | action_name
162     | 2022-01-06 | 1           | BeginTask
165     | 2022-01-04 | 1           | BeginTask
165     | 2022-01-06 | 1           | CompleteTask
161     | 2022-01-05 | 1           | CreateProject
164     | 2022-01-03 | 2           | BeginTask

Solution

The problem is not too difficult; we just need to ensure that we output the result in the manner specified. We start off by keeping only those users who work at the company ClassPass.

SELECT
au.user_id
FROM asana_users AS au
WHERE au.company = 'ClassPass'
;

Output for Asana data analyst interview question

We then add the actions from the asana_actions dataset. Note that we need to keep only the CompleteTask actions for January 2022, so we subset the data in the JOIN condition itself, ensuring that only the relevant rows are passed into the merge. If we subset in the main WHERE clause instead, we would remove all those users who did not have a CompleteTask action, did not have any action in January 2022, or both.

SELECT
au.user_id
, aa.num_actions
FROM asana_users AS au
LEFT JOIN asana_actions AS AA
    ON AU.user_id = AA.user_id
    AND AA.action_name = 'CompleteTask'
    AND DATE_TRUNC('MONTH', AA.date) = '2022-01-01'::DATE
WHERE au.company = 'ClassPass'
;

Output for Asana data analyst interview question

Now we can simply aggregate by user_id. For those users who did not have any qualifying actions in January 2022, we need to show 0 as the output; therefore, we use the COALESCE() function.

SELECT
au.user_id
, COALESCE(SUM(aa.num_actions),0)
FROM asana_users AS au
LEFT JOIN asana_actions AS AA
    ON AU.user_id = AA.user_id
    AND AA.action_name = 'CompleteTask'
    AND DATE_TRUNC('MONTH', AA.date) = '2022-01-01'::DATE
WHERE au.company = 'ClassPass'
GROUP BY 1
ORDER BY 1
;

Duplicates

Another challenge that a Data Analyst should be ready to face on a day-to-day basis is working with duplicates. Sometimes repeated values are legitimate - for example, the purchases of a user on an app. In other cases, we expect unique values but end up with multiple ones. If one is not careful about duplicates while merging, table sizes can inflate very quickly. These days most companies keep their data in cloud-based storage with limited query space shared among multiple users, and you do not want to be tagged in an email with over a hundred recipients by the Database Engineer asking you to kill the query that has used up the entire workspace. In reality, one learns by making mistakes, and some duplicates are pretty hard to find. Let us take an example that yours truly messed up. This one is from a Microsoft Data Science Interview.


Tables: employee, bonus

You can solve this problem here: https://platform.stratascratch.com/coding/10081-find-the-number-of-employees-who-received-the-bonus-and-who-didnt

The problem uses two datasets. The employee dataset with the following fields.

id: int
first_name: varchar
last_name: varchar
age: int
sex: varchar
employee_title: varchar
department: varchar
salary: int
target: int
bonus: int
email: varchar
city: varchar
address: varchar
manager_id: int

The data looks like this.

Table: employee
id | first_name | last_name | age | sex | employee_title | department | salary | target | bonus | email | city | address | manager_id
5 | Max | George | 26 | M | Sales | Sales | 1300 | 200 | 150 | Max@company.com | California | 2638 Richards Avenue | 1
13 | Katty | Bond | 56 | F | Manager | Management | 150000 | 0 | 300 | Katty@company.com | Arizona | | 1
11 | Richerd | Gear | 57 | M | Manager | Management | 250000 | 0 | 300 | Richerd@company.com | Alabama | | 1
10 | Jennifer | Dion | 34 | F | Sales | Sales | 1000 | 200 | 150 | Jennifer@company.com | Alabama | | 13
19 | George | Joe | 50 | M | Manager | Management | 100000 | 0 | 300 | George@company.com | Florida | 1003 Wyatt Street | 1

And the bonus dataset with the following fields

worker_ref_id: int
bonus_amount: int
bonus_date: datetime

This is how the data looks in the bonus dataset.

Table: bonus
worker_ref_id | bonus_amount | bonus_date
1             | 5000         | 2020-02-16
2             | 3000         | 2011-06-16
3             | 4000         | 2020-02-16
1             | 4500         | 2020-02-16
2             | 3500         | 2011-06-16

Solution

This is a pretty straightforward question. We need to tag the employees who received a bonus (1) and those who did not (0), and then aggregate the numbers based on this tagged field. Ideally, we should have been able to do this and get the solution.

SELECT
CASE WHEN bonus.worker_ref_id IS NULL THEN 0 ELSE 1 END AS RECEIVED_BONUS
, COUNT(ID)
FROM employee E
LEFT JOIN BONUS
ON E.ID = BONUS.worker_ref_id
GROUP BY 1
ORDER BY 1
;

However, there is a twist: there are duplicates in the bonus dataset, and it took me about an hour to figure that out. A simple GROUP BY ... HAVING query, sketched below, would have surfaced them much faster.
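A minimal duplicate check on the bonus table - group on all of its columns and keep the groups with more than one row:

SELECT
worker_ref_id
, bonus_amount
, bonus_date
, COUNT(*) AS num_rows
FROM bonus
GROUP BY 1, 2, 3
-- any group with more than one row is an exact duplicate
HAVING COUNT(*) > 1
;

Once we know the duplicates exist, the fix is pretty straightforward - count distinct employee IDs instead of raw rows.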

SELECT
CASE WHEN bonus.worker_ref_id IS NULL THEN 0 ELSE 1 END AS RECEIVED_BONUS
, COUNT(DISTINCT ID)
FROM employee E
LEFT JOIN BONUS
ON E.ID = BONUS.worker_ref_id
GROUP BY 1
ORDER BY 1
;

As mentioned earlier, some duplicates are not easy to find or obvious in the first place. It is especially important to keep checking the size of the table when merging multiple tables. This is arguably one of the reasons RIGHT JOINs are frowned upon: if we start with a base table and keep adding additional fields from the other tables, always using a LEFT JOIN, it becomes easier to notice when the number of rows starts to grow. A quick sanity check of that kind is sketched below.
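A sketch of the row-count sanity check using the tables from this problem:

-- rows in the base table
SELECT COUNT(*) FROM employee;

-- rows after the join; if this number is larger,
-- the join has fanned out on duplicate keys
SELECT COUNT(*)
FROM employee E
LEFT JOIN bonus
    ON E.id = bonus.worker_ref_id
;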

Flavors of SQL

One last piece of advice for those venturing into the world of SQL: which one should you learn first, MySQL or Postgres? We have a great article, “Postgres vs MySQL”, comparing the two most popular open-source versions of SQL. I have worked with Postgres, BigQuery, Snowflake, SQLite, and MySQL, and from my experience, all these flavors are very similar to one another. Yes, there are some functions that are not available in MySQL, but it is not as if we cannot Google the requirements. One cannot remember each and every function available in a particular variant, nor should one - that is what documentation is for. One of the most successful Data Engineers I know cannot function without the documentation and uses Google and Stack Overflow generously. The reason is quite simple: versions change and new functions get added, but the basic understanding of how to manipulate data remains the same. If you know any one flavor of SQL reasonably well, it will not take you long to transition to another.

Conclusion

In this article, we looked at how a Data Analyst's day-to-day work differs from the interview questions asked, and at the different data analyst skills one needs to build - which areas one should focus on and which ones can safely be ignored. We also looked at case studies highlighting the way real-life datasets need to be handled. They might appear difficult at first, but the key to improvement is patience, persistence, and practice. All these examples and many more are available on the StrataScratch platform. Join StrataScratch today and join an active community of aspiring Data Science professionals to make your dream of working at companies like Google, Microsoft, Amazon, and Apple a reality.
