Module 3: Combining DataFrames•25 min

Self-Merges

Progress Tracking

Sometimes both sides of a merge are the same table. “Find employees who earn more than someone else in their department” requires comparing each employee against every other employee in their department. That’s a self-merge: merge the table with itself on the grouping column, then filter the pairs. It feels strange the first time, but the pattern is always the same: merge, pair, filter.

When a DataFrame Needs to Talk to Itself

The classic example: the employee-manager relationship. Managers are employees too. To answer “Who manages whom?” you merge the employees DataFrame with itself.

Python

result = pd.merge(
    employees, employees,
    left_on="manager_id", right_on="id",
    how="left",
    suffixes=("", "_manager")
)
result[["first_name", "first_name_manager"]]

We merge the same DataFrame twice, using suffixes to tell the columns apart. how="left" because the CEO has no manager.

Comparing Rows Within a DataFrame

Self-merges let you compare rows to other rows:

Table: employees

id	first_name	last_name	salary	joining_date	department
1001	Monika	Arora	100000	2023-01-12	HR
1002	Niharika	Verma	80000	2023-01-05	Admin
1003	Vishal	Singhal	300000	2023-01-02	HR
1004	Amitah	Singh	500000	2022-12-15	Admin
1005	Vivek	Bhati	500000	2022-11-16	Admin

Tables: employees

Self-Merges Are Expensive

Without a scope constraint, you’re comparing n×n pairs. On 10,000 employees, that’s 100 million comparisons. Always merge on a scoping column like department first.

Finding Unique Pairs

To avoid duplicate pairs (Alice-Bob and Bob-Alice), filter so the left ID is always less than the right:

Table: employee

id	first_name	last_name	age	sex	employee_title	department	salary	target	bonus	email	city	address	manager_id
5	Max	George	26	M	Sales	Sales	1300	200	150	Max@company.com	California	2638 Richards Avenue	1
13	Katty	Bond	56	F	Manager	Management	150000	0	300	Katty@company.com	Arizona		1
11	Richerd	Gear	57	M	Manager	Management	250000	0	300	Richerd@company.com	Alabama		1
10	Jennifer	Dion	34	F	Sales	Sales	1000	200	150	Jennifer@company.com	Alabama		13
19	George	Joe	50	M	Manager	Management	100000	0	300	George@company.com	Florida	1003 Wyatt Street	1

Tables: employee

Use < Not !=

Filter with id_a < id_b (not !=) to get each pair exactly once.

Employees With the Same Salary

Table: worker

worker_id	first_name	last_name	salary	joining_date	department
1	Monika	Arora	100000	2014-02-20	HR
2	Niharika	Verma	80000	2014-06-11	Admin
3	Vishal	Singhal	300000	2014-02-20	HR
4	Amitah	Singh	500000	2014-02-20	Admin
5	Vivek	Bhati	500000	2014-06-11	Admin

Tables: worker

Matching Users Pairs

Table: facebook_employees

id	location	age	gender	is_senior
0	USA	24	M	FALSE
1	USA	31	F	TRUE
2	USA	29	F	FALSE
3	USA	33	M	FALSE
4	USA	36	F	TRUE

Tables: facebook_employees

Key Takeaways

Self-merge: pd.merge(df, df, ...) with suffixes to distinguish columns.
Essential for hierarchies, row comparisons, and finding pairs.
Use id_a < id_b to avoid duplicate and self pairs.
Always scope with a shared column (department, category) to avoid n×n explosions.

What’s Next

You’ve mastered all merge types. Time to chain them: merging three, four, or more DataFrames in a single analysis.

Next upContinue →

Merging Multiple DataFrames

15 min