Module 1: DataFrame Fundamentals•35 min

Working with Columns

Progress Tracking

Being Specific About What You Select

In the last lesson, you learned to select columns with double brackets. That’s the foundation. But real analysis requires more than just pulling raw columns out of a DataFrame — you’ll need to rename them, compute new ones, and clean up duplicates.

Let’s build those skills one at a time.

Column Order Is Up to You

Python

techcorp_workforce[["department", "first_name", "salary"]]

This seems trivial, but it matters when you’re building reports or exporting data. Put the most important columns first.

Renaming Columns

DataFrame column names such as employee_id or joining_date make sense to developers. But when you’re building a report for stakeholders, you want something more readable.

The rename() method lets you change column names in the output:

Python

cols = techcorp_workforce[["first_name", "joining_date"]]
cols.rename(columns={
    "first_name": "First Name",
    "joining_date": "Hire Date"
})

You pass a dictionary to the columns parameter: keys are old names, values are new names. Only the columns you list get renamed — everything else stays the same.

Column Aliases

Table: techcorp_workforce

id	first_name	last_name	department	salary	phone_number	joining_date
1	Sarah	Mitchell	HR	95000	555-0101	2021-03-15
2	Michael	Chen	HR	88000	555-0102	2022-06-01
3	Emily	Rodriguez	HR	82500		2021-09-20
4	David	Park	HR	80000	555-0104	2023-01-10
5	Lisa	Thompson	HR	65000		2021-04-05

Tables: techcorp_workforce

rename() Returns a New DataFrame

rename() returns the renamed DataFrame as a new copy without changing the original. To keep that version for later steps, store it: df = df.rename(...).

Creating New Columns with Arithmetic

One of the most useful things you can do with DataFrames is create new columns from existing ones. You just assign to a new column name using standard math operators:

Python

techcorp_workforce["monthly_salary"] = techcorp_workforce["salary"] / 12

This creates a new column called monthly_salary by dividing the salary column by 12. The operation applies to every row automatically — no loops needed.

All the standard operators work:

+ addition — salary + 5000 adds a bonus
- subtraction — salary - tax calculates net pay
* multiplication — price * quantity gets a total
/ division — salary / 12 converts annual to monthly

Computed Columns

Tables: techcorp_workforce

Arithmetic on Columns

Tables: techcorp_workforce

Combining Text Columns

Sometimes you need to merge text from multiple columns — like creating a full name from first_name and last_name. You concatenate strings with the + operator:

Python

techcorp_workforce["full_name"] = (
    techcorp_workforce["first_name"] + " " + techcorp_workforce["last_name"]
)

The ' ' in the middle adds a space between the two names. Without it, you’d get JohnSmith instead of John Smith.

Building Full Names

Tables: techcorp_workforce

String Concatenation Only Works on Strings

If one of the columns is numeric, the + operator will try to do math instead of concatenation. Convert to string first with .astype(str): df["id"].astype(str) + " - " + df["name"].

Removing Duplicates

Sometimes you want to know what values exist in a column, not see every row. The drop_duplicates() method removes duplicate rows from your results.

Let’s look at the orders table:

Table: orders

id	cust_id	order_date	order_details	total_order_cost
1	3	2019-03-04	Coat	100
2	3	2019-03-01	Shoes	80
3	3	2019-03-07	Skirt	30
4	7	2019-02-01	Coat	25
5	7	2019-03-10	Shoes	80

Without removing duplicates, you’d see repeated order details:

Selecting Order Details

Tables: orders

Unique Order Products

Tables: orders

drop_duplicates() is exact-match

You may see what looks like Shoes listed twice in the output. The dataset actually has both 'Shoes' and 'Shoes ' (with a trailing space) — drop_duplicates() correctly treats them as different values. Real-world data is messy like this; you’ll learn how to strip whitespace with .str.strip() in Module 5.

Deduplicating on Multiple Columns

When you select multiple columns before calling drop_duplicates(), it removes rows where the entire combination is duplicated:

Tables: orders

This returns unique customer-product pairs. If customer 3 ordered Shoes twice, you’d only see that combination once.

drop_duplicates() vs. unique()

drop_duplicates() works on DataFrames and returns a DataFrame. For a single column, you can also use df['col'].unique(), which returns a NumPy array of unique values. Use drop_duplicates() when you want to keep the DataFrame structure; use unique() when you just need a quick list of distinct values.

A Quick Note on Data Types

Every column in a DataFrame has a data type. You saw dtypes in the last lesson. Here’s what the main types mean:

object — text (strings). Names, emails, categories.
int64 — whole numbers. IDs, counts, ages.
float64 — decimal numbers. Prices, percentages, measurements.
datetime64 — dates and timestamps. When things happened.
bool — True/False values. Flags, yes/no fields.

Why does this matter? Because you can’t do math on text, and you can’t search for string patterns in numbers. If something isn’t working, check dtypes first — a column you think is numeric might actually be stored as text.

Numbers Stored as Text

A common gotcha in real-world data: a column looks numeric, but dtypes shows object. This usually means there’s a stray non-numeric value (like "N/A" or "$1,200"). You’ll learn how to convert types in a later lesson.

Common Mistakes

Misspelling column names

Python

# KeyError: "firstname"
techcorp_workforce["firstname"]

# The column is actually called "first_name"
techcorp_workforce["first_name"]

You’ll get a KeyError if the name doesn’t match exactly. Check df.columns if you’re not sure.

Forgetting the double brackets

Python

# This returns a Series, not a DataFrame
techcorp_workforce["first_name"]

# This returns a DataFrame
techcorp_workforce[["first_name"]]

If a method complains about getting a Series when it expects a DataFrame, this is probably why.

Using the wrong quotes

Python

# Without quotes, Python thinks these are variables
techcorp_workforce[[first_name, last_name]]  # NameError!

# Column names are always strings — use quotes
techcorp_workforce[["first_name", "last_name"]]  # correct

Renaming with a New Dataset

Let’s practice with the restaurant inspections data:

Table: los_angeles_restaurant_health_inspections

serial_number	activity_date	facility_name	score	grade	service_code	service_description	employee_id	facility_address	facility_city	facility_id	facility_state	facility_zip	owner_id	owner_name	pe_description	program_element_pe	program_name	program_status	record_id
DAQHRSETQ	2017-06-08	MARGARITAS CAFE	93	A	1	ROUTINE INSPECTION	EE0000006	5026 S CRENSHAW BLVD	LOS ANGELES	FA0023656	CA	90043	OW0004133	BAZAN, ASCENCION	RESTAURANT (61-150) SEATS HIGH RISK	1638	MARGARITAS CAFE	ACTIVE	PR0011718
DA2GQRJOS	2017-03-07	LAS MOLENDERAS	97	A	1	ROUTINE INSPECTION	EE0000997	2635 WHITTIER BLVD	LOS ANGELES	FA0160416	CA	90023	OW0125379	MARISOL FEREGRINO	RESTAURANT (0-30) SEATS HIGH RISK	1632	LAS MOLENDERAS	INACTIVE	PR0148504
DAMQTA46T	2016-03-22	SANDRA'S TAMALES	93	A	1	ROUTINE INSPECTION	EE0001049	5390 WHITTIER BLVD	LOS ANGELES	FA0171769	CA	90022-4032	OW0178828	SANDRA'S TAMALES INC.	RESTAURANT (0-30) SEATS MODERATE RISK	1631	SANDRA'S TAMALES	ACTIVE	PR0164225
DAXMBTIRZ	2018-02-12	CAFE GRATITUDE	97	A	1	ROUTINE INSPECTION	EE0000828	639 N LARCHMONT BLVD STE #102	LOS ANGELES	FA0058921	CA	90004	OW0005704	CAFE GRATITUDE LARCHMONT LLC	RESTAURANT (61-150) SEATS HIGH RISK	1638	CAFE GRATITUDE	ACTIVE	PR0019854
DAK8TBMS0	2015-09-10	THE WAFFLE	90	A	1	ROUTINE INSPECTION	EE0000709	6255 W SUNSET BLVD STE #105	LOS ANGELES	FA0051830	CA	90028	OW0035796	THE WAFFLE, LLC	RESTAURANT (61-150) SEATS HIGH RISK	1638	THE WAFFLE	ACTIVE	PR0010922

Tables: los_angeles_restaurant_health_inspections

Unique Values in a Column

Tables: los_angeles_restaurant_health_inspections

Key Takeaways

Column order in your selection determines the order in your output. Put the most important columns first.
Use rename(columns={...}) to give columns readable names.
Create new columns with arithmetic: df['new'] = df['col'] * 2. The operation applies to every row automatically.
Concatenate text with +. Remember to convert non-string columns with .astype(str) first.
drop_duplicates() removes duplicate rows. It applies to the full row, not just the first column.
When something isn’t working, check df.dtypes — the column might not be the type you expect.

What’s Next

Now you know how to select, rename, compute, and deduplicate columns. But so far, you’re getting all rows. In the next lesson, you’ll learn to filter data with Boolean indexing — which is when pandas starts to feel actually powerful.

Next upContinue →

Filtering Rows

50 min