Module 4: Multi-Step Analysis30 min

Reshaping Data

Progress Tracking

Log in to save this lesson and continue from where you left off.

Log in

Long vs Wide Format

Data comes in two shapes:

Python
# LONG format (each observation is a row)
  name     subject  score
  Alice    Math     90
  Alice    Science  85
  Bob      Math     78
  Bob      Science  92

# WIDE format (each subject is a column)
  name     Math  Science
  Alice    90    85
  Bob      78    92

Long format is better for analysis (filtering, grouping). Wide format is better for reports and display. You’ll constantly convert between them.

Creating Pivot Tables

Python
pd.pivot_table(
    orders,
    index="cust_id",
    values="total_order_cost",
    aggfunc=["sum", "count"]
)

Melting Wide to Long

pd.melt() is the reverse of pivoting. It takes columns and unpivots them into rows:

Python
# Wide format: one column per metric
# employee has salary, bonus as separate columns
pd.melt(
    employee,
    id_vars=["first_name", "department"],
    value_vars=["salary", "bonus"],
    var_name="metric",
    value_name="amount"
)

The key parameters:

  • id_vars — columns to keep as-is (identifier columns)
  • value_vars — columns to unpivot into rows
  • var_name — name for the new column holding the old column names
  • value_name — name for the new column holding the values

When to Pivot vs Melt

  • Data has one row per observation, you want a summary table → pivot_table()
  • Data has metrics spread across columns, you want one row per metric → melt()
  • Need to aggregate while reshaping → pivot_table() (has aggfunc)
  • Just rearranging, no aggregation → melt() or pivot() (without _table)
Table: employee
idfirst_namelast_nameagesexemployee_titledepartmentsalarytargetbonusemailcityaddressmanager_id
5MaxGeorge26MSalesSales1300200150Max@company.comCalifornia2638 Richards Avenue1
13KattyBond56FManagerManagement1500000300Katty@company.comArizona1
11RicherdGear57MManagerManagement2500000300Richerd@company.comAlabama1
10JenniferDion34FSalesSales1000200150Jennifer@company.comAlabama13
19GeorgeJoe50MManagerManagement1000000300George@company.comFlorida1003 Wyatt Street1
1
Pivot Department Salary Stats

Create a pivot table showing average and max salary per department. Use pivot table() with aggfunc.

Tables: employee

Titanic Survivors and Non-Survivors

Table: titanic
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
103Braund, Mr. Owen Harrismale2210A/5 211717.25S
211Cumings, Mrs. John Bradley (Florence Briggs Thayer)female3810PC 1759971.28C85C
313Heikkinen, Miss. Lainafemale2600STON/O2. 31012827.92S
411Futrelle, Mrs. Jacques Heath (Lily May Peel)female351011380353.1C123S
503Allen, Mr. William Henrymale35003734508.05S
2
Titanic Survivors and Non-Survivors
View solution

Make a report showing the number of survivors and non-survivors by passenger class. Classes are categorized based on the `pclass` value as: • First class: `pclass = 1` • Second class: `pclass = 2` • Third class: `pclass = 3` Output the number of survivors and non-survivors by each class.

Tables: titanic

Key Takeaways

  • pd.pivot_table() converts long → wide with aggregation.
  • pd.melt() converts wide → long (unpivoting).
  • pd.crosstab() is a shortcut for frequency pivot tables.
  • Long format for analysis; wide format for reports.
  • index, columns, values, aggfunc are the four pivot_table parameters to know.

What’s Next

You’ve completed Module 4. You can now break complex analyses into clean steps, apply custom logic, and reshape data for any format. Module 5 covers dates, strings, and conditional logic — the functions that handle messy real-world data.