Module 4: Multi-Step Analysis • 25 min

The Transform Pattern

The Problem: Per-Group Comparisons

You want employees earning above their department’s average. In SQL, you’d use a CTE to compute department averages and join back. In pandas, .transform() does this in one line.

How .transform() Works

Table: employee

id	first_name	last_name	age	sex	employee_title	department	salary	target	bonus	email	city	address	manager_id
5	Max	George	26	M	Sales	Sales	1300	200	150	Max@company.com	California	2638 Richards Avenue	1
13	Katty	Bond	56	F	Manager	Management	150000	0	300	Katty@company.com	Arizona		1
11	Richerd	Gear	57	M	Manager	Management	250000	0	300	Richerd@company.com	Alabama		1
10	Jennifer	Dion	34	F	Sales	Sales	1000	200	150	Jennifer@company.com	Alabama		13
19	George	Joe	50	M	Manager	Management	100000	0	300	George@company.com	Florida	1003 Wyatt Street	1

Python

# Normal groupby: one row per department
employee.groupby("department")["salary"].mean()

# Transform: one value per ORIGINAL row
employee.groupby("department")["salary"].transform("mean")

.groupby().mean() collapses to one row per group. .groupby().transform("mean") keeps the original row count, repeating the group’s mean for each member.
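To make the shape difference concrete, here is a minimal sketch that rebuilds only the id, department, and salary columns of the five sample rows above. The variable names `per_group` and `per_row` are illustrative and not part of the lesson.

```python
import pandas as pd

# Only the columns needed here, copied from the five sample rows above
employee = pd.DataFrame({
    "id": [5, 13, 11, 10, 19],
    "department": ["Sales", "Management", "Management", "Sales", "Management"],
    "salary": [1300, 150000, 250000, 1000, 100000],
})

# Aggregation: the result is indexed by department, one entry per group
per_group = employee.groupby("department")["salary"].mean()

# Transform: the result keeps the original index, one entry per employee
per_row = employee.groupby("department")["salary"].transform("mean")

print(len(per_group))  # number of departments
print(len(per_row))    # number of employee rows
print(per_row)
```

Because `per_row` shares the index of `employee`, it can be assigned as a new column or compared against `employee["salary"]` directly.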

The .transform() Pattern

Python

orders["cust_avg"] = (
    orders.groupby("cust_id")["total_order_cost"]
    .transform("mean")
)
orders[["cust_id", "total_order_cost", "cust_avg"]]

Tables: employee

Filtering with Transform

Tables: employee
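A sketch of the filter described in the problem statement, employees earning above their department's average, assuming the five sample rows shown earlier. The names `dept_avg` and `above_avg` are illustrative.

```python
import pandas as pd

# Subset of the sample employee rows shown earlier
employee = pd.DataFrame({
    "first_name": ["Max", "Katty", "Richerd", "Jennifer", "George"],
    "department": ["Sales", "Management", "Management", "Sales", "Management"],
    "salary": [1300, 150000, 250000, 1000, 100000],
})

# Per-row department average, aligned to the original index
dept_avg = employee.groupby("department")["salary"].transform("mean")

# Boolean mask: compare each salary with its own department's average
above_avg = employee[employee["salary"] > dept_avg]
print(above_avg)
```

The comparison works row by row because the transformed Series has the same length and index as `employee`, so no join is needed.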

.transform() vs Merge-Back

.transform() is a shortcut for three steps: group, aggregate, and merge the result back onto the original rows. Use .transform() when you need a single aggregate per group. Use the merge approach when you need several aggregated columns at once.
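The equivalence can be checked directly. The sketch below computes the department average both ways on the sample rows shown earlier; the column name `dept_avg` and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Subset of the sample employee rows shown earlier
employee = pd.DataFrame({
    "id": [5, 13, 11, 10, 19],
    "department": ["Sales", "Management", "Management", "Sales", "Management"],
    "salary": [1300, 150000, 250000, 1000, 100000],
})

# Shortcut: a single transform call
via_transform = employee.groupby("department")["salary"].transform("mean")

# Long form: aggregate into a summary table, then merge it back
dept_avg = (
    employee.groupby("department")["salary"]
    .mean()
    .rename("dept_avg")
    .reset_index()
)
# A left merge keeps the row order of the left-hand DataFrame
merged = employee.merge(dept_avg, on="department", how="left")

# Show both versions side by side
print(merged.assign(via_transform=via_transform))
```

Both columns should hold the same values for every row; the transform version simply skips the intermediate summary table.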

The Merge-Back Alternative

.transform() works well when you need one aggregate per group, such as mean, max, or count. When you need several aggregates at once (mean AND count AND max), the merge-back approach is cleaner: aggregate into a summary DataFrame, then merge it back onto the original. Each row then carries every statistic for its group.

Python

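import pandas as pd

# Aggregate once per department: mean and max salary
dept_stats = (
    employee
    .groupby("department")["salary"]
    .agg(["mean", "max"])
    .reset_index()
)
# Attach the per-department stats to every employee row
pd.merge(employee, dept_stats, on="department")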

Average Salaries

Tables: employee
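One possible reading of this heading is to show each employee's salary next to the department average. The sketch below does that on the sample rows shown earlier and adds the difference between the two; the column names `dept_avg_salary` and `diff_from_avg` are assumptions, not part of the lesson.

```python
import pandas as pd

# Subset of the sample employee rows shown earlier
employee = pd.DataFrame({
    "first_name": ["Max", "Katty", "Richerd", "Jennifer", "George"],
    "department": ["Sales", "Management", "Management", "Sales", "Management"],
    "salary": [1300, 150000, 250000, 1000, 100000],
})

# Broadcast the department average onto every employee row
employee["dept_avg_salary"] = (
    employee.groupby("department")["salary"].transform("mean")
)

# How far each salary sits above or below the department average
employee["diff_from_avg"] = employee["salary"] - employee["dept_avg_salary"]

result = employee[
    ["department", "first_name", "salary", "dept_avg_salary", "diff_from_avg"]
]
print(result)
```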

Highest Salary In Department

Tables: employee
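A sketch of the "equals max" comparison, assuming the goal is to list the top earner in each department from the sample employee rows. The names `dept_max` and `top_earners` are illustrative.

```python
import pandas as pd

# Subset of the sample employee rows
employee = pd.DataFrame({
    "first_name": ["Max", "Katty", "Richerd", "Jennifer", "George"],
    "department": ["Sales", "Management", "Management", "Sales", "Management"],
    "salary": [1300, 150000, 250000, 1000, 100000],
})

# Per-row department maximum, aligned to the original index
dept_max = employee.groupby("department")["salary"].transform("max")

# Keep the rows whose salary equals their department's maximum
top_earners = employee[employee["salary"] == dept_max]
print(top_earners[["department", "first_name", "salary"]])
```

If two employees tie for the highest salary in a department, this filter keeps both of them.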

Key Takeaways

.transform("mean"), or any other aggregation name such as "max" or "count", broadcasts each group's result back to every row in that group.
Use it for per-group comparisons: above average, equals max, etc.
For multiple aggregated columns, merge back instead.
This replaces SQL’s correlated subqueries and CTE+JOIN patterns.

What’s Next

You can now do per-group comparisons. Next: chaining multiple analysis steps together — building complex analyses from named intermediate results.
