Module 1: DataFrame Fundamentals35 min

Working with Columns

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Being Specific About What You Select

In the last lesson, you learned to select columns with double brackets. That’s the foundation. But real analysis requires more than just pulling raw columns out of a DataFrame — you’ll need to rename them, compute new ones, and clean up duplicates.

Let’s build those skills one at a time.

Column Order Is Up to You

Python
techcorp_workforce[["department", "first_name", "salary"]]

This seems trivial, but it matters when you’re building reports or exporting data. Put the most important columns first.

Renaming Columns

DataFrame column names such as employee_id or joining_date make sense to developers. But when you’re building a report for stakeholders, you want something more readable.

The rename() method lets you change column names in the output:

Python
cols = techcorp_workforce[["first_name", "joining_date"]]
cols.rename(columns={
    "first_name": "First Name",
    "joining_date": "Hire Date"
})

You pass a dictionary to the columns parameter: keys are old names, values are new names. Only the columns you list get renamed — everything else stays the same.

Column Aliases

Table: techcorp_workforce
idfirst_namelast_namedepartmentsalaryphone_numberjoining_date
1SarahMitchellHR95000555-01012021-03-15
2MichaelChenHR88000555-01022022-06-01
3EmilyRodriguezHR825002021-09-20
4DavidParkHR80000555-01042023-01-10
5LisaThompsonHR650002021-04-05
1
Rename Department and Salary

Rename department to Team and salary to Annual Pay.

Tables: techcorp_workforce
rename() Returns a New DataFrame

By default, rename() doesn’t modify the original DataFrame — it returns a new one with the changed names. If you want to change the original in place, pass inplace=True, but most of the time working with the returned copy is cleaner.

Creating New Columns with Arithmetic

One of the most useful things you can do with DataFrames is create new columns from existing ones. You just assign to a new column name using standard math operators:

Python
techcorp_workforce["monthly_salary"] = techcorp_workforce["salary"] / 12

This creates a new column called monthly_salary by dividing the salary column by 12. The operation applies to every row automatically — no loops needed.

All the standard operators work:

  • + addition — salary + 5000 adds a bonus
  • - subtraction — salary - tax calculates net pay
  • * multiplication — price * quantity gets a total
  • / division — salary / 12 converts annual to monthly

Computed Columns

2
Add a Daily Salary Column

Create a daily salary column by dividing salary by 365. Show first name, salary, and daily salary.

Tables: techcorp_workforce

Arithmetic on Columns

3
Calculate a Salary Bonus

We already created the bonus column. Now add a second column called `salary_with_bonus` that adds the bonus to the original salary, and update the display to include it.

Tables: techcorp_workforce

Combining Text Columns

Sometimes you need to merge text from multiple columns — like creating a full name from first_name and last_name. You concatenate strings with the + operator:

Python
techcorp_workforce["full_name"] = (
    techcorp_workforce["first_name"] + " " + techcorp_workforce["last_name"]
)

The ' ' in the middle adds a space between the two names. Without it, you’d get JohnSmith instead of John Smith.

Building Full Names

4
Build a Display Label

Create a label column combining first name and department in the format Alice (HR). Show label and salary.

Tables: techcorp_workforce
String Concatenation Only Works on Strings

If one of the columns is numeric, the + operator will try to do math instead of concatenation. Convert to string first with .astype(str): df['id'].astype(str) + ' - ' + df['name'].

Removing Duplicates

Sometimes you want to know what values exist in a column, not see every row. The drop_duplicates() method removes duplicate rows from your results.

Let’s look at the orders table:

Table: orders
idcust_idorder_dateorder_detailstotal_order_cost
132019-03-04Coat100
232019-03-01Shoes80
332019-03-07Skirt30
472019-02-01Coat25
572019-03-10Shoes80

Without removing duplicates, you’d see repeated order details:

5
Select Order Details

Select just the order details column from `orders`.

Tables: orders

Unique Order Products

6
Get Unique Order Details

Update the selection to return only unique order details.

Tables: orders

Deduplicating on Multiple Columns

When you select multiple columns before calling drop_duplicates(), it removes rows where the entire combination is duplicated:

7
Unique Customer-Product Pairs

Find the unique combinations of customer ID and order details from `orders`.

Tables: orders

This returns unique customer-product pairs. If customer 3 ordered Shoes twice, you’d only see that combination once.

drop_duplicates() vs. unique()

drop_duplicates() works on DataFrames and returns a DataFrame. For a single column, you can also use df['col'].unique(), which returns a NumPy array of unique values. Use drop_duplicates() when you want to keep the DataFrame structure; use unique() when you just need a quick list of distinct values.

A Quick Note on Data Types

Every column in a DataFrame has a data type. You saw dtypes in the last lesson. Here’s what the main types mean:

  • object — text (strings). Names, emails, categories.
  • int64 — whole numbers. IDs, counts, ages.
  • float64 — decimal numbers. Prices, percentages, measurements.
  • datetime64 — dates and timestamps. When things happened.
  • bool — True/False values. Flags, yes/no fields.

Why does this matter? Because you can’t do math on text, and you can’t search for string patterns in numbers. If something isn’t working, check dtypes first — a column you think is numeric might actually be stored as text.

Numbers Stored as Text

A common gotcha in real-world data: a column looks numeric, but dtypes shows object. This usually means there’s a stray non-numeric value (like 'N/A' or '$1,200'). You’ll learn how to convert types in a later lesson.

Common Mistakes

Misspelling column names

Python
# KeyError: "firstname"
techcorp_workforce["firstname"]

# The column is actually called "first_name"
techcorp_workforce["first_name"]

You’ll get a KeyError if the name doesn’t match exactly. Check df.columns if you’re not sure.

Forgetting the double brackets

Python
# This returns a Series, not a DataFrame
techcorp_workforce["first_name"]

# This returns a DataFrame
techcorp_workforce[["first_name"]]

If a method complains about getting a Series when it expects a DataFrame, this is probably why.

Using the wrong quotes

Python
# Without quotes, Python thinks these are variables
techcorp_workforce[[first_name, last_name]]  # NameError!

# Column names are always strings — use quotes
techcorp_workforce[["first_name", "last_name"]]  # correct

Renaming with a New Dataset

Let’s practice with the restaurant inspections data:

Table: los_angeles_restaurant_health_inspections
serial_numberactivity_datefacility_namescoregradeservice_codeservice_descriptionemployee_idfacility_addressfacility_cityfacility_idfacility_statefacility_zipowner_idowner_namepe_descriptionprogram_element_peprogram_nameprogram_statusrecord_id
DAQHRSETQ2017-06-08MARGARITAS CAFE93A1ROUTINE INSPECTIONEE00000065026 S CRENSHAW BLVDLOS ANGELESFA0023656CA90043OW0004133BAZAN, ASCENCIONRESTAURANT (61-150) SEATS HIGH RISK1638MARGARITAS CAFEACTIVEPR0011718
DA2GQRJOS2017-03-07LAS MOLENDERAS97A1ROUTINE INSPECTIONEE00009972635 WHITTIER BLVDLOS ANGELESFA0160416CA90023OW0125379MARISOL FEREGRINORESTAURANT (0-30) SEATS HIGH RISK1632LAS MOLENDERASINACTIVEPR0148504
DAMQTA46T2016-03-22SANDRA'S TAMALES93A1ROUTINE INSPECTIONEE00010495390 WHITTIER BLVDLOS ANGELESFA0171769CA90022-4032OW0178828SANDRA'S TAMALES INC.RESTAURANT (0-30) SEATS MODERATE RISK1631SANDRA'S TAMALESACTIVEPR0164225
DAXMBTIRZ2018-02-12CAFE GRATITUDE97A1ROUTINE INSPECTIONEE0000828639 N LARCHMONT BLVD STE #102LOS ANGELESFA0058921CA90004OW0005704CAFE GRATITUDE LARCHMONT LLCRESTAURANT (61-150) SEATS HIGH RISK1638CAFE GRATITUDEACTIVEPR0019854
DAK8TBMS02015-09-10THE WAFFLE90A1ROUTINE INSPECTIONEE00007096255 W SUNSET BLVD STE #105LOS ANGELESFA0051830CA90028OW0035796THE WAFFLE, LLCRESTAURANT (61-150) SEATS HIGH RISK1638THE WAFFLEACTIVEPR0010922
8
Rename Inspection Columns

Select the facility name and score from `los_angeles_restaurant_health_inspections`, renaming score to inspection score.

Tables: los_angeles_restaurant_health_inspections

Unique Values in a Column

9
Find Unique Grades

Find all unique grades that appear in `los_angeles_restaurant_health_inspections`.

Tables: los_angeles_restaurant_health_inspections

Key Takeaways

  • Column order in your selection determines the order in your output. Put the most important columns first.
  • Use rename(columns={...}) to give columns readable names.
  • Create new columns with arithmetic: df['new'] = df['col'] * 2. The operation applies to every row automatically.
  • Concatenate text with +. Remember to convert non-string columns with .astype(str) first.
  • drop_duplicates() removes duplicate rows. It applies to the full row, not just the first column.
  • When something isn’t working, check df.dtypes — the column might not be the type you expect.

What’s Next

Now you know how to select, rename, compute, and deduplicate columns. But so far, you’re getting all rows. In the next lesson, you’ll learn to filter data with Boolean indexing — which is when pandas starts to feel actually powerful.