Module 4: Multi-Step Analysis•30 min

Step-by-Step Analysis

Progress Tracking

When One Line Isn’t Enough

Most interesting analytics questions can’t be answered in one step. "Find everyone earning above average" requires knowing the average first. "Which customers have never ordered?" requires knowing who has ordered. In SQL you’d nest subqueries or chain CTEs. In pandas, you simply use intermediate variables — compute a result, store it, and use it in the next step. It’s more natural, more readable, and easier to debug.

Python

avg_salary = employee["salary"].mean()
employee[employee["salary"] > avg_salary]

Finding the Maximum

Python

max_salary = employee["salary"].max()
employee[employee["salary"] == max_salary]

Using a Scalar Result

Table: employee

id	first_name	last_name	age	sex	employee_title	department	salary	target	bonus	email	city	address	manager_id
5	Max	George	26	M	Sales	Sales	1300	200	150	Max@company.com	California	2638 Richards Avenue	1
13	Katty	Bond	56	F	Manager	Management	150000	0	300	Katty@company.com	Arizona		1
11	Richerd	Gear	57	M	Manager	Management	250000	0	300	Richerd@company.com	Alabama		1
10	Jennifer	Dion	34	F	Sales	Sales	1000	200	150	Jennifer@company.com	Alabama		13
19	George	Joe	50	M	Manager	Management	100000	0	300	George@company.com	Florida	1003 Wyatt Street	1

Tables: employee

Filtering with Lists of Values

A very common pattern: compute a list of values from one query, then use .isin() to filter another DataFrame. This is the pandas equivalent of SQL’s IN (subquery). The key insight is that .unique() gives you an array you can pass directly to .isin().

Python

# Departments that have high earners
high_depts = employee[employee["salary"] > 100000]["department"].unique()

# All employees in those departments
employee[employee["department"].isin(high_depts)]

No NOT IN Trap in Pandas

In SQL, NOT IN fails silently when the list contains NULL. In pandas, ~df["col"].isin(values) works correctly regardless of NaN.

Filtering with a Computed List

Tables: employee

Finding What’s Missing

The pattern for "find what’s missing": get the IDs that DO appear, then filter for the ones that DON’T with ~df["col"].isin(values).

Table: customers

id	first_name	last_name	city	phone_number
8	John	Joseph	San Francisco	928-386-8164
7	Jill	Michael	Austin	813-297-0692
4	William	Daniel	Denver	813-368-1200
5	Henry	Jackson	Miami	808-601-7513
13	Emma	Isaac	Miami	808-690-5201

Table: orders

id	cust_id	order_date	order_details	total_order_cost
1	3	2019-03-04	Coat	100
2	3	2019-03-01	Shoes	80
3	3	2019-03-07	Skirt	30
4	7	2019-02-01	Coat	25
5	7	2019-03-10	Shoes	80

Tables: customers, orders

Workers With The Highest Salaries

Table: worker

worker_id	first_name	last_name	salary	joining_date	department
1	Monika	Arora	100000	2014-02-20	HR
2	Niharika	Verma	80000	2014-06-11	Admin
3	Vishal	Singhal	300000	2014-02-20	HR
4	Amitah	Singh	500000	2014-02-20	Admin
5	Vivek	Bhati	500000	2014-06-11	Admin

Table: title

worker_ref_id	worker_title	affected_from
1	Manager	2016-02-20
2	Executive	2016-06-11
8	Executive	2016-06-11
5	Manager	2016-06-11
4	Asst. Manager	2016-06-11

Tables: worker, title

Most Profitable Financial Company

Table: forbes_global_2010_2014

company	sector	industry	continent	country	marketvalue	sales	profits	assets	rank
ICBC	Financials	Major Banks	Asia	China	215.6	148.7	42.7	3124.9	1
China Construction Bank	Financials	Regional Banks	Asia	China	174.4	121.3	34.2	2449.5	4
Agricultural Bank of China	Financials	Regional Banks	Asia	China	141.1	136.4	27	2405.4	8
JPMorgan Chase	Financials	Major Banks	North America	United States	229.7	105.7	17.3	2435.3	20
Berkshire Hathaway	Financials	Investment Services	North America	United States	309.1	178.8	19.5	493.4	17

Tables: forbes_global_2010_2014

Key Takeaways

Break complex questions into intermediate variables: compute, then use.
.max() / .min() + filter replaces scalar subqueries.
.unique() + .isin() replaces IN (subquery).
~df["col"].isin(values) replaces NOT IN — and it’s NaN-safe.

What’s Next

What about questions like “Find employees who earn above their department’s average”? You need a per-group calculation applied back to each row. That’s .transform(), and it’s next.

Next upContinue →

The Transform Pattern

25 min