Step-by-Step Analysis
Progress Tracking
Log in to save this lesson and continue from where you left off.
When One Line Isn’t Enough
Most interesting analytics questions can’t be answered in one step. "Find everyone earning above average" requires knowing the average first. "Which customers have never ordered?" requires knowing who has ordered. In SQL you’d nest subqueries or chain CTEs. In pandas, you simply use intermediate variables — compute a result, store it, and use it in the next step. It’s more natural, more readable, and easier to debug.
avg_salary = employee["salary"].mean()
employee[employee["salary"] > avg_salary]Finding the Maximum
max_salary = employee["salary"].max()
employee[employee["salary"] == max_salary]Using a Scalar Result
The starter finds the minimum earner. Change it to find the maximum.
Filtering with Lists of Values
A very common pattern: compute a list of values from one query, then use .isin() to filter another DataFrame. This is the pandas equivalent of SQL’s IN (subquery). The key insight is that .unique() gives you an array you can pass directly to .isin().
# Departments that have high earners
high_depts = employee[employee["salary"] > 100000]["department"].unique()
# All employees in those departments
employee[employee["department"].isin(high_depts)]In SQL, NOT IN fails silently when the list contains NULL. In pandas, ~df["col"].isin(values) works correctly regardless of NaN.
Filtering with a Computed List
Find all employees who work in a department that has at least one person earning over 100,000.
Finding What’s Missing
The pattern for "find what’s missing": get the IDs that DO appear, then filter for the ones that DON’T with ~df["col"].isin(values).
| id | first_name | last_name | city | address | phone_number |
|---|---|---|---|---|---|
| 8 | John | Joseph | San Francisco | 928-386-8164 | |
| 7 | Jill | Michael | Austin | 813-297-0692 | |
| 4 | William | Daniel | Denver | 813-368-1200 | |
| 5 | Henry | Jackson | Miami | 808-601-7513 | |
| 13 | Emma | Isaac | Miami | 808-690-5201 |
| id | cust_id | order_date | order_details | total_order_cost |
|---|---|---|---|---|
| 1 | 3 | 2019-03-04 | Coat | 100 |
| 2 | 3 | 2019-03-01 | Shoes | 80 |
| 3 | 3 | 2019-03-07 | Skirt | 30 |
| 4 | 7 | 2019-02-01 | Coat | 25 |
| 5 | 7 | 2019-03-10 | Shoes | 80 |
Find customers who have never placed an order.
Workers With The Highest Salaries
| worker_id | first_name | last_name | salary | joining_date | department |
|---|---|---|---|---|---|
| 1 | Monika | Arora | 100000 | 2014-02-20 | HR |
| 2 | Niharika | Verma | 80000 | 2014-06-11 | Admin |
| 3 | Vishal | Singhal | 300000 | 2014-02-20 | HR |
| 4 | Amitah | Singh | 500000 | 2014-02-20 | Admin |
| 5 | Vivek | Bhati | 500000 | 2014-06-11 | Admin |
| worker_ref_id | worker_title | affected_from |
|---|---|---|
| 1 | Manager | 2016-02-20 |
| 2 | Executive | 2016-06-11 |
| 8 | Executive | 2016-06-11 |
| 5 | Manager | 2016-06-11 |
| 4 | Asst. Manager | 2016-06-11 |
Management wants to analyze only employees with official job titles. Find the job titles of the employees with the highest salary. If multiple employees have the same highest salary, include all their job titles.
Most Profitable Financial Company
| company | sector | industry | continent | country | marketvalue | sales | profits | assets | rank |
|---|---|---|---|---|---|---|---|---|---|
| ICBC | Financials | Major Banks | Asia | China | 215.6 | 148.7 | 42.7 | 3124.9 | 1 |
| China Construction Bank | Financials | Regional Banks | Asia | China | 174.4 | 121.3 | 34.2 | 2449.5 | 4 |
| Agricultural Bank of China | Financials | Regional Banks | Asia | China | 141.1 | 136.4 | 27 | 2405.4 | 8 |
| JPMorgan Chase | Financials | Major Banks | North America | United States | 229.7 | 105.7 | 17.3 | 2435.3 | 20 |
| Berkshire Hathaway | Financials | Investment Services | North America | United States | 309.1 | 178.8 | 19.5 | 493.4 | 17 |
Find the most profitable company from the financial sector. Output the result along with the continent.
Key Takeaways
- Break complex questions into intermediate variables: compute, then use.
.max()/.min()+ filter replaces scalar subqueries..unique()+.isin()replacesIN (subquery).~df["col"].isin(values)replacesNOT IN— and it’s NaN-safe.
What’s Next
What about questions like “Find employees who earn above their department’s average”? You need a per-group calculation applied back to each row. That’s .transform(), and it’s next.