Working with Columns
Progress Tracking
Log in to save this lesson and continue from where you left off.
Being Specific About What You Select
In the last lesson, you learned to select columns with double brackets. That’s the foundation. But real analysis requires more than just pulling raw columns out of a DataFrame — you’ll need to rename them, compute new ones, and clean up duplicates.
Let’s build those skills one at a time.
Column Order Is Up to You
techcorp_workforce[["department", "first_name", "salary"]]This seems trivial, but it matters when you’re building reports or exporting data. Put the most important columns first.
Renaming Columns
DataFrame column names such as employee_id or joining_date make sense to developers. But when you’re building a report for stakeholders, you want something more readable.
The rename() method lets you change column names in the output:
cols = techcorp_workforce[["first_name", "joining_date"]]
cols.rename(columns={
"first_name": "First Name",
"joining_date": "Hire Date"
})You pass a dictionary to the columns parameter: keys are old names, values are new names. Only the columns you list get renamed — everything else stays the same.
Column Aliases
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | 2021-09-20 | |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | 2021-04-05 |
Rename department to Team and salary to Annual Pay.
By default, rename() doesn’t modify the original DataFrame — it returns a new one with the changed names. If you want to change the original in place, pass inplace=True, but most of the time working with the returned copy is cleaner.
Creating New Columns with Arithmetic
One of the most useful things you can do with DataFrames is create new columns from existing ones. You just assign to a new column name using standard math operators:
techcorp_workforce["monthly_salary"] = techcorp_workforce["salary"] / 12This creates a new column called monthly_salary by dividing the salary column by 12. The operation applies to every row automatically — no loops needed.
All the standard operators work:
+addition —salary + 5000adds a bonus-subtraction —salary - taxcalculates net pay*multiplication —price * quantitygets a total/division —salary / 12converts annual to monthly
Computed Columns
Create a daily salary column by dividing salary by 365. Show first name, salary, and daily salary.
Arithmetic on Columns
We already created the bonus column. Now add a second column called `salary_with_bonus` that adds the bonus to the original salary, and update the display to include it.
Combining Text Columns
Sometimes you need to merge text from multiple columns — like creating a full name from first_name and last_name. You concatenate strings with the + operator:
techcorp_workforce["full_name"] = (
techcorp_workforce["first_name"] + " " + techcorp_workforce["last_name"]
)The ' ' in the middle adds a space between the two names. Without it, you’d get JohnSmith instead of John Smith.
Building Full Names
Create a label column combining first name and department in the format Alice (HR). Show label and salary.
If one of the columns is numeric, the + operator will try to do math instead of concatenation. Convert to string first with .astype(str): df['id'].astype(str) + ' - ' + df['name'].
Removing Duplicates
Sometimes you want to know what values exist in a column, not see every row. The drop_duplicates() method removes duplicate rows from your results.
Let’s look at the orders table:
| id | cust_id | order_date | order_details | total_order_cost |
|---|---|---|---|---|
| 1 | 3 | 2019-03-04 | Coat | 100 |
| 2 | 3 | 2019-03-01 | Shoes | 80 |
| 3 | 3 | 2019-03-07 | Skirt | 30 |
| 4 | 7 | 2019-02-01 | Coat | 25 |
| 5 | 7 | 2019-03-10 | Shoes | 80 |
Without removing duplicates, you’d see repeated order details:
Select just the order details column from `orders`.
Unique Order Products
Update the selection to return only unique order details.
Deduplicating on Multiple Columns
When you select multiple columns before calling drop_duplicates(), it removes rows where the entire combination is duplicated:
Find the unique combinations of customer ID and order details from `orders`.
This returns unique customer-product pairs. If customer 3 ordered Shoes twice, you’d only see that combination once.
drop_duplicates() works on DataFrames and returns a DataFrame. For a single column, you can also use df['col'].unique(), which returns a NumPy array of unique values. Use drop_duplicates() when you want to keep the DataFrame structure; use unique() when you just need a quick list of distinct values.
A Quick Note on Data Types
Every column in a DataFrame has a data type. You saw dtypes in the last lesson. Here’s what the main types mean:
object— text (strings). Names, emails, categories.int64— whole numbers. IDs, counts, ages.float64— decimal numbers. Prices, percentages, measurements.datetime64— dates and timestamps. When things happened.bool— True/False values. Flags, yes/no fields.
Why does this matter? Because you can’t do math on text, and you can’t search for string patterns in numbers. If something isn’t working, check dtypes first — a column you think is numeric might actually be stored as text.
A common gotcha in real-world data: a column looks numeric, but dtypes shows object. This usually means there’s a stray non-numeric value (like 'N/A' or '$1,200'). You’ll learn how to convert types in a later lesson.
Common Mistakes
Misspelling column names
# KeyError: "firstname"
techcorp_workforce["firstname"]
# The column is actually called "first_name"
techcorp_workforce["first_name"]You’ll get a KeyError if the name doesn’t match exactly. Check df.columns if you’re not sure.
Forgetting the double brackets
# This returns a Series, not a DataFrame
techcorp_workforce["first_name"]
# This returns a DataFrame
techcorp_workforce[["first_name"]]If a method complains about getting a Series when it expects a DataFrame, this is probably why.
Using the wrong quotes
# Without quotes, Python thinks these are variables
techcorp_workforce[[first_name, last_name]] # NameError!
# Column names are always strings — use quotes
techcorp_workforce[["first_name", "last_name"]] # correctRenaming with a New Dataset
Let’s practice with the restaurant inspections data:
| serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | pe_description | program_element_pe | program_name | program_status | record_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAQHRSETQ | 2017-06-08 | MARGARITAS CAFE | 93 | A | 1 | ROUTINE INSPECTION | EE0000006 | 5026 S CRENSHAW BLVD | LOS ANGELES | FA0023656 | CA | 90043 | OW0004133 | BAZAN, ASCENCION | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | MARGARITAS CAFE | ACTIVE | PR0011718 |
| DA2GQRJOS | 2017-03-07 | LAS MOLENDERAS | 97 | A | 1 | ROUTINE INSPECTION | EE0000997 | 2635 WHITTIER BLVD | LOS ANGELES | FA0160416 | CA | 90023 | OW0125379 | MARISOL FEREGRINO | RESTAURANT (0-30) SEATS HIGH RISK | 1632 | LAS MOLENDERAS | INACTIVE | PR0148504 |
| DAMQTA46T | 2016-03-22 | SANDRA'S TAMALES | 93 | A | 1 | ROUTINE INSPECTION | EE0001049 | 5390 WHITTIER BLVD | LOS ANGELES | FA0171769 | CA | 90022-4032 | OW0178828 | SANDRA'S TAMALES INC. | RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | SANDRA'S TAMALES | ACTIVE | PR0164225 |
| DAXMBTIRZ | 2018-02-12 | CAFE GRATITUDE | 97 | A | 1 | ROUTINE INSPECTION | EE0000828 | 639 N LARCHMONT BLVD STE #102 | LOS ANGELES | FA0058921 | CA | 90004 | OW0005704 | CAFE GRATITUDE LARCHMONT LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | CAFE GRATITUDE | ACTIVE | PR0019854 |
| DAK8TBMS0 | 2015-09-10 | THE WAFFLE | 90 | A | 1 | ROUTINE INSPECTION | EE0000709 | 6255 W SUNSET BLVD STE #105 | LOS ANGELES | FA0051830 | CA | 90028 | OW0035796 | THE WAFFLE, LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | THE WAFFLE | ACTIVE | PR0010922 |
Select the facility name and score from `los_angeles_restaurant_health_inspections`, renaming score to inspection score.
Unique Values in a Column
Find all unique grades that appear in `los_angeles_restaurant_health_inspections`.
Key Takeaways
- Column order in your selection determines the order in your output. Put the most important columns first.
- Use
rename(columns={...})to give columns readable names. - Create new columns with arithmetic:
df['new'] = df['col'] * 2. The operation applies to every row automatically. - Concatenate text with
+. Remember to convert non-string columns with.astype(str)first. drop_duplicates()removes duplicate rows. It applies to the full row, not just the first column.- When something isn’t working, check
df.dtypes— the column might not be the type you expect.
What’s Next
Now you know how to select, rename, compute, and deduplicate columns. But so far, you’re getting all rows. In the next lesson, you’ll learn to filter data with Boolean indexing — which is when pandas starts to feel actually powerful.