Module 3: Combining DataFrames25 min

Self-Merges

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Sometimes both sides of a merge are the same table. “Find employees who earn more than someone else in their department” requires comparing each employee against every other employee in their department. That’s a self-merge: merge the table with itself on the grouping column, then filter the pairs. It feels strange the first time, but the pattern is always the same: merge, pair, filter.

When a DataFrame Needs to Talk to Itself

The classic example: the employee-manager relationship. Managers are employees too. To answer “Who manages whom?” you merge the employees DataFrame with itself.

Python
result = pd.merge(
    employees, employees,
    left_on="manager_id", right_on="id",
    how="left",
    suffixes=("", "_manager")
)
result[["name", "name_manager"]]

We merge the same DataFrame twice, using suffixes to tell the columns apart. how="left" because the CEO has no manager.

Comparing Rows Within a DataFrame

Self-merges let you compare rows to other rows:

Table: employees
idfirst_namelast_namesalaryjoining_datedepartment
1001MonikaArora1000002023-01-12HR
1002NiharikaVerma800002023-01-05Admin
1003VishalSinghal3000002023-01-02HR
1004AmitahSingh5000002022-12-15Admin
1005VivekBhati5000002022-11-16Admin
1
Find Higher Earners by Department

Find employees who earn more than a colleague in the same department.

Tables: employees
Self-Merges Are Expensive

Without a scope constraint, you’re comparing n×n pairs. On 10,000 employees, that’s 100 million comparisons. Always merge on a scoping column like department first.

Finding Unique Pairs

To avoid duplicate pairs (Alice-Bob and Bob-Alice), filter so the left ID is always less than the right:

Table: employee
idfirst_namelast_nameagesexemployee_titledepartmentsalarytargetbonusemailcityaddressmanager_id
5MaxGeorge26MSalesSales1300200150Max@company.comCalifornia2638 Richards Avenue1
13KattyBond56FManagerManagement1500000300Katty@company.comArizona1
11RicherdGear57MManagerManagement2500000300Richerd@company.comAlabama1
10JenniferDion34FSalesSales1000200150Jennifer@company.comAlabama13
19GeorgeJoe50MManagerManagement1000000300George@company.comFlorida1003 Wyatt Street1
2
Find Department Pairs

Find all pairs of employees in the same department. Avoid duplicates and self-pairs.

Tables: employee
Use < Not !=

Filter with id_a < id_b (not !=) to get each pair exactly once.

Employees With the Same Salary

Table: worker
worker_idfirst_namelast_namesalaryjoining_datedepartment
1MonikaArora1000002014-02-20HR
2NiharikaVerma800002014-06-11Admin
3VishalSinghal3000002014-02-20HR
4AmitahSingh5000002014-02-20Admin
5VivekBhati5000002014-06-11Admin
3
Employees With the Same Salary
View solution

Find employees who earn the same salary. Output the worker id along with the first name and the salary in descending order.

Tables: worker

Matching Users Pairs

Table: facebook_employees
idlocationagegenderis_senior
0USA24MFALSE
1USA31FTRUE
2USA29FFALSE
3USA33MFALSE
4USA36FTRUE
4
Meta/Facebook Matching Users Pairs
View solution

Find matching pairs of Meta/Facebook employees such that they are both of the same nation, different age, same gender, and at different seniority levels. Output ids of paired employees.

Tables: facebook_employees

Key Takeaways

  • Self-merge: pd.merge(df, df, ...) with suffixes to distinguish columns.
  • Essential for hierarchies, row comparisons, and finding pairs.
  • Use id_a < id_b to avoid duplicate and self pairs.
  • Always scope with a shared column (department, category) to avoid n×n explosions.

What’s Next

You’ve mastered all merge types. Time to chain them: merging three, four, or more DataFrames in a single analysis.