Understanding DataFrame Relationships
Progress Tracking
Log in to save this lesson and continue from where you left off.
Why Data Lives in Separate DataFrames
In Module 1 and 2, every analysis used a single DataFrame. That works for simple questions. But real business questions usually need data from multiple places. “Show me each order with the customer’s name and email.” That information lives in two different DataFrames.
This separation isn’t a bug. It’s how databases organize data to avoid redundancy and maintain accuracy. Before you can combine DataFrames, you need to understand how they relate.
Primary Keys: Unique Identifiers
Every table needs a way to uniquely identify each row. That’s what a primary key does.
| id | first_name | last_name | city | address | phone_number |
|---|---|---|---|---|---|
| 8 | John | Joseph | San Francisco | 928-386-8164 | |
| 7 | Jill | Michael | Austin | 813-297-0692 | |
| 4 | William | Daniel | Denver | 813-368-1200 | |
| 5 | Henry | Jackson | Miami | 808-601-7513 | |
| 13 | Emma | Isaac | Miami | 808-690-5201 |
The id column is the primary key. Each value is unique. No two customers share the same id.
Primary keys are almost always integers that auto-increment (1, 2, 3...). Some databases use UUIDs instead, but the concept is the same: one unique value per row.
Foreign Keys: The Connection
A foreign key is a column in one DataFrame that references the primary key of another. It’s how DataFrames link together.
| id | cust_id | order_date | order_details | total_order_cost |
|---|---|---|---|---|
| 1 | 3 | 2019-03-04 | Coat | 100 |
| 2 | 3 | 2019-03-01 | Shoes | 80 |
| 3 | 3 | 2019-03-07 | Skirt | 30 |
| 4 | 7 | 2019-02-01 | Coat | 25 |
| 5 | 7 | 2019-03-10 | Shoes | 80 |
The cust_id column is a foreign key. It references the id column in the customers DataFrame. Order 4 belongs to customer 7. Order 6 belongs to customer 15.
Foreign keys don’t have to be unique. Customer ID 3 has multiple orders. That’s perfectly normal — it’s a one-to-many relationship.
One-to-Many Relationships
The most common relationship. One record in DataFrame A relates to many records in DataFrame B:
- One customer can have many orders
- One department can have many employees
- One author can write many books
The “one” side has the primary key. The “many” side has the foreign key.
Many-to-Many Relationships
When records on both sides can relate to multiple records on the other side:
- Students can enroll in many courses; courses have many students
- Products can be in many orders; orders contain many products
These require a junction table (also called a linking or bridge table) with two foreign keys.
Identifying Relationships
When exploring new data, look for:
- Columns ending in
_id(foreign keys) - Column names that match other DataFrame names
- DataFrames with names combining two other names (
order_items,student_courses)
Verifying Relationships
You already previewed the customers and orders tables above. The cust_id column in orders references the id column in customers. Let’s verify it’s a one-to-many relationship:
Verify that `cust_id` in `orders` is not unique — find customer IDs that appear more than once. This confirms the many side of the one-to-many relationship.
Some customer IDs appear multiple times — that’s the “many” side in action. In this module, you’ll learn how to actually connect DataFrames like these using pd.merge().
Key Takeaways
- Primary keys uniquely identify each row in a DataFrame.
- Foreign keys reference primary keys in other DataFrames, creating relationships.
- One-to-many is the most common relationship (one customer, many orders).
- Many-to-many requires a junction table with two foreign keys.
- Look for
_idcolumns and shared column names to identify relationships.
What’s Next
Now that you understand how DataFrames relate, it’s time to connect them. In the next lesson, you’ll learn pd.concat() — a way to stack DataFrames together.