Module 3: Combining DataFrames•30 min

Inner Merge

Progress Tracking

An inner merge is the workhorse of data analysis. You have orders in one table and customer details in another — to see “who bought what,” you merge them. Inner merge keeps only the rows that match in both tables. If a customer has no orders, they disappear. If an order has no matching customer, it disappears too. This “both sides must match” behavior is exactly what you want most of the time.

Where the Real Work Begins

Almost every meaningful analysis requires combining data from multiple DataFrames:

“Show me orders with customer names.”
“Get products with their category labels.”
“Find employees with their department info.”

All of these need pd.merge().

The Basic Pattern

Python

result = pd.merge(left_df, right_df, left_on="fk", right_on="pk")

left_on and right_on specify which columns to match. When both DataFrames use the same column name, use on= instead:

Python

result = pd.merge(df_a, df_b, on="shared_column")

Your First Merge

Python

result = pd.merge(
    employees, departments,
    left_on="dept_id", right_on="id"
)
result[["first_name", "dept_name", "salary"]]

Table: orders

id	cust_id	order_date	order_details	total_order_cost
1	3	2019-03-04	Coat	100
2	3	2019-03-01	Shoes	80
3	3	2019-03-07	Skirt	30
4	7	2019-02-01	Coat	25
5	7	2019-03-10	Shoes	80

Table: customers

id	first_name	last_name	city	phone_number
8	John	Joseph	San Francisco	928-386-8164
7	Jill	Michael	Austin	813-297-0692
4	William	Daniel	Denver	813-368-1200
5	Henry	Jackson	Miami	808-601-7513
13	Emma	Isaac	Miami	808-690-5201

Tables: orders, customers

Overlapping Column Names

When both DataFrames have a column with the same name (like id), pandas adds _x and _y suffixes. Control this with suffixes=("_order", "_customer").

Filtering After Merge

Merge first, then filter the combined result:

Tables: orders, customers

The Aggregation Trap

When merging and then aggregating, group by the unique key (id), never just by name. Different people can share the same name.

Python

# Wrong: two customers named 'John' get merged
merged.groupby('first_name')['total_order_cost'].sum()

# Correct: group by the unique ID
merged.groupby(['id_y', 'first_name'])['total_order_cost'].sum()

What Are You Counting?

After merging, each order row gets customer info attached. If you .count() the customer ID column, you’re counting orders, not customers. Always think about what the rows represent after the merge.

Total Cost Of Orders

Table: customers

id	first_name	last_name	city	phone_number
8	John	Joseph	San Francisco	928-386-8164
7	Jill	Michael	Austin	813-297-0692
4	William	Daniel	Denver	813-368-1200
5	Henry	Jackson	Miami	808-601-7513
13	Emma	Isaac	Miami	808-690-5201

Table: orders

id	cust_id	order_date	order_details	total_order_cost
1	3	2019-03-04	Coat	100
2	3	2019-03-01	Shoes	80
3	3	2019-03-07	Skirt	30
4	7	2019-02-01	Coat	25
5	7	2019-03-10	Shoes	80

Tables: customers, orders

Order Details

Tables: customers, orders

Key Takeaways

pd.merge(left, right, left_on=, right_on=) joins on different column names.
pd.merge(left, right, on=) when both use the same column name.
Inner merge (default) keeps only matching rows.
Use suffixes= to control overlapping column names.
When aggregating merged data, group by unique keys, not names.

What’s Next

Inner merge only keeps matches. But what about “Show me all customers, including those who haven’t ordered”? That’s a left merge, and it’s probably the most important merge type for analytics.

Next upContinue →

Left and Right Merge

35 min