Merging Multiple DataFrames
Progress Tracking
Log in to save this lesson and continue from where you left off.
Real analysis often involves three or more tables: employees, their departments, and their projects. Or: orders, customers, and products. You chain merges one at a time — merge A with B, then merge that result with C. The key is thinking about which columns connect each pair of tables, and merging in the right order to avoid losing rows you need.
Plan Before You Merge
When facing a complex request:
- What columns are needed? (from which DataFrames)
- How do the DataFrames connect? (shared key columns)
- What’s the logical path? (A → B → C)
Each merge must connect to something already in your result. You can’t jump from customers directly to categories if they’re only linked through orders and products.
The Row Multiplication Problem
When you merge through a many-to-many relationship, rows multiply. If an order has 3 tags, that order appears 3 times after the merge — and your .sum() is 3x too high.
Print len(result) after each merge step. If the count jumps unexpectedly, you have a many-to-many multiplication. Fix it by aggregating before the merge, or by deduplicating after.
Salary Less Than Twice The Average
| empl_id | manager_empl_id |
|---|---|
| E849 | |
| E850 | E849 |
| E851 | E849 |
| E852 | E850 |
| E853 | E850 |
| empl_id | empl_name | empl_city | empl_dob | empl_pin | salary |
|---|---|---|---|---|---|
| E849 | Steven M. Jones | Hicksville | 1988-03-29 | 1490 | 80000 |
| E850 | Marilynn M. Walters | New York | 1978-12-26 | 9624 | 30000 |
| E851 | Kyle M. Massey | Lake Katrine | 1977-09-22 | 1563 | 40000 |
| E852 | Cody A. Mosby | Anaheim | 1965-03-18 | 4883 | 22000 |
| E853 | David J. Mintz | Houston | 1977-01-04 | 8001 | 18000 |
Write a query to get the list of managers whose salary is less than twice the average salary of employees reporting to them. For these managers, output their ID, salary and the average salary of employees reporting to them.
Key Takeaways
- Chain merges to combine any number of DataFrames.
- Plan the merge path before writing — sketch the relationships.
- Watch for row multiplication from many-to-many relationships.
- Check
len()at each step when debugging.
What’s Next
Congratulations — you’ve completed Module 3. You can now combine data from any number of DataFrames using pd.concat() and all types of pd.merge(). In the next module, you’ll learn subqueries and more advanced patterns.