Module 3: Combining DataFrames10 min

Understanding DataFrame Relationships

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Why Data Lives in Separate DataFrames

In Module 1 and 2, every analysis used a single DataFrame. That works for simple questions. But real business questions usually need data from multiple places. “Show me each order with the customer’s name and email.” That information lives in two different DataFrames.

This separation isn’t a bug. It’s how databases organize data to avoid redundancy and maintain accuracy. Before you can combine DataFrames, you need to understand how they relate.

Primary Keys: Unique Identifiers

Every table needs a way to uniquely identify each row. That’s what a primary key does.

Table: customers
idfirst_namelast_namecityaddressphone_number
8JohnJosephSan Francisco928-386-8164
7JillMichaelAustin813-297-0692
4WilliamDanielDenver813-368-1200
5HenryJacksonMiami808-601-7513
13EmmaIsaacMiami808-690-5201

The id column is the primary key. Each value is unique. No two customers share the same id.

Primary Key Formats

Primary keys are almost always integers that auto-increment (1, 2, 3...). Some databases use UUIDs instead, but the concept is the same: one unique value per row.

Foreign Keys: The Connection

A foreign key is a column in one DataFrame that references the primary key of another. It’s how DataFrames link together.

Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80

The cust_id column is a foreign key. It references the id column in the customers DataFrame. Order 4 belongs to customer 7. Order 6 belongs to customer 15.

Foreign Keys Aren’t Unique

Foreign keys don’t have to be unique. Customer ID 3 has multiple orders. That’s perfectly normal — it’s a one-to-many relationship.

One-to-Many Relationships

The most common relationship. One record in DataFrame A relates to many records in DataFrame B:

  • One customer can have many orders
  • One department can have many employees
  • One author can write many books

The “one” side has the primary key. The “many” side has the foreign key.

Many-to-Many Relationships

When records on both sides can relate to multiple records on the other side:

  • Students can enroll in many courses; courses have many students
  • Products can be in many orders; orders contain many products

These require a junction table (also called a linking or bridge table) with two foreign keys.

Identifying Relationships

When exploring new data, look for:

  • Columns ending in _id (foreign keys)
  • Column names that match other DataFrame names
  • DataFrames with names combining two other names (order_items, student_courses)

Verifying Relationships

You already previewed the customers and orders tables above. The cust_id column in orders references the id column in customers. Let’s verify it’s a one-to-many relationship:

1
Check for Duplicate Foreign Keys

Verify that `cust_id` in `orders` is not unique — find customer IDs that appear more than once. This confirms the many side of the one-to-many relationship.

Tables: orders

Some customer IDs appear multiple times — that’s the “many” side in action. In this module, you’ll learn how to actually connect DataFrames like these using pd.merge().

Key Takeaways

  • Primary keys uniquely identify each row in a DataFrame.
  • Foreign keys reference primary keys in other DataFrames, creating relationships.
  • One-to-many is the most common relationship (one customer, many orders).
  • Many-to-many requires a junction table with two foreign keys.
  • Look for _id columns and shared column names to identify relationships.

What’s Next

Now that you understand how DataFrames relate, it’s time to connect them. In the next lesson, you’ll learn pd.concat() — a way to stack DataFrames together.