Module 3: Combining DataFrames30 min

Inner Merge

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

An inner merge is the workhorse of data analysis. You have orders in one table and customer details in another — to see “who bought what,” you merge them. Inner merge keeps only the rows that match in both tables. If a customer has no orders, they disappear. If an order has no matching customer, it disappears too. This “both sides must match” behavior is exactly what you want most of the time.

Where the Real Work Begins

Almost every meaningful analysis requires combining data from multiple DataFrames:

  • “Show me orders with customer names.”
  • “Get products with their category labels.”
  • “Find employees with their department info.”

All of these need pd.merge().

The Basic Pattern

Python
result = pd.merge(left_df, right_df, left_on="fk", right_on="pk")

left_on and right_on specify which columns to match. When both DataFrames use the same column name, use on= instead:

Python
result = pd.merge(df_a, df_b, on="shared_column")

Your First Merge

Python
result = pd.merge(
    employees, departments,
    left_on="dept_id", right_on="id"
)
result[["first_name", "dept_name", "salary"]]
Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80
Table: customers
idfirst_namelast_namecityaddressphone_number
8JohnJosephSan Francisco928-386-8164
7JillMichaelAustin813-297-0692
4WilliamDanielDenver813-368-1200
5HenryJacksonMiami808-601-7513
13EmmaIsaacMiami808-690-5201
1
Merge Orders with Customers

Merge orders with customers to see who placed each order. Match on the customer ID columns.

Tables: orders, customers
Overlapping Column Names

When both DataFrames have a column with the same name (like id), pandas adds _x and _y suffixes. Control this with suffixes=("_order", "_customer").

Filtering After Merge

Merge first, then filter the combined result:

2
Filter Merged Results

Merge orders with customers, then filter to orders over 100.

Tables: orders, customers

The Aggregation Trap

When merging and then aggregating, group by the unique key (id), never just by name. Different people can share the same name.

Python
# Wrong: two customers named 'John' get merged
merged.groupby('first_name')['total_order_cost'].sum()

# Correct: group by the unique ID
merged.groupby(['id_y', 'first_name'])['total_order_cost'].sum()
What Are You Counting?

After merging, each order row gets customer info attached. If you .count() the customer ID column, you’re counting orders, not customers. Always think about what the rows represent after the merge.

Total Cost Of Orders

Table: customers
idfirst_namelast_namecityaddressphone_number
8JohnJosephSan Francisco928-386-8164
7JillMichaelAustin813-297-0692
4WilliamDanielDenver813-368-1200
5HenryJacksonMiami808-601-7513
13EmmaIsaacMiami808-690-5201
Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80
3
Total Cost Of Orders
View solution

Find the total cost of each customer's orders. Output customer's id, first name, and the total order cost. Order records by customer's first name alphabetically.

Tables: customers, orders

Order Details

Table: customers
idfirst_namelast_namecityaddressphone_number
8JohnJosephSan Francisco928-386-8164
7JillMichaelAustin813-297-0692
4WilliamDanielDenver813-368-1200
5HenryJacksonMiami808-601-7513
13EmmaIsaacMiami808-690-5201
Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80
4
Order Details
View solution

Find order details made by Jill and Eva. Consider the Jill and Eva as first names of customers. Output the order date, details and cost along with the first name. Order records based on the customer id in ascending order.

Tables: customers, orders

Key Takeaways

  • pd.merge(left, right, left_on=, right_on=) joins on different column names.
  • pd.merge(left, right, on=) when both use the same column name.
  • Inner merge (default) keeps only matching rows.
  • Use suffixes= to control overlapping column names.
  • When aggregating merged data, group by unique keys, not names.

What’s Next

Inner merge only keeps matches. But what about “Show me all customers, including those who haven’t ordered”? That’s a left merge, and it’s probably the most important merge type for analytics.