Module 3: Combining DataFrames15 min

Merging Multiple DataFrames

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Real analysis often involves three or more tables: employees, their departments, and their projects. Or: orders, customers, and products. You chain merges one at a time — merge A with B, then merge that result with C. The key is thinking about which columns connect each pair of tables, and merging in the right order to avoid losing rows you need.

Plan Before You Merge

When facing a complex request:

  • What columns are needed? (from which DataFrames)
  • How do the DataFrames connect? (shared key columns)
  • What’s the logical path? (A → B → C)
Follow the Chain

Each merge must connect to something already in your result. You can’t jump from customers directly to categories if they’re only linked through orders and products.

The Row Multiplication Problem

When you merge through a many-to-many relationship, rows multiply. If an order has 3 tags, that order appears 3 times after the merge — and your .sum() is 3x too high.

Check Row Counts After Each Merge

Print len(result) after each merge step. If the count jumps unexpectedly, you have a many-to-many multiplication. Fix it by aggregating before the merge, or by deduplicating after.

Salary Less Than Twice The Average

Table: map_employee_hierarchy
empl_idmanager_empl_id
E849
E850E849
E851E849
E852E850
E853E850
Table: dim_employee
empl_idempl_nameempl_cityempl_dobempl_pinsalary
E849Steven M. JonesHicksville1988-03-29149080000
E850Marilynn M. WaltersNew York1978-12-26962430000
E851Kyle M. MasseyLake Katrine1977-09-22156340000
E852Cody A. MosbyAnaheim1965-03-18488322000
E853David J. MintzHouston1977-01-04800118000
1
Salary Less Than Twice The Average
View solution

Write a query to get the list of managers whose salary is less than twice the average salary of employees reporting to them. For these managers, output their ID, salary and the average salary of employees reporting to them.

Tables: map_employee_hierarchy, dim_employee

Key Takeaways

  • Chain merges to combine any number of DataFrames.
  • Plan the merge path before writing — sketch the relationships.
  • Watch for row multiplication from many-to-many relationships.
  • Check len() at each step when debugging.

What’s Next

Congratulations — you’ve completed Module 3. You can now combine data from any number of DataFrames using pd.concat() and all types of pd.merge(). In the next module, you’ll learn subqueries and more advanced patterns.