Module 4: Multi-Step Analysis (30 min)

Custom Logic with .apply()

Built-in tools like the .str accessor handle common cases. But real data has edge cases that don’t fit them: custom tax brackets, business-specific categorization rules, multi-column logic that depends on three fields at once. .apply() is your escape hatch. It lets you run any Python function on every value in a column or on every row of a table. It’s slower than vectorized operations, so use it as a last resort, but when you need it, nothing else will do.

When Built-In Methods Aren’t Enough

On a Series, .apply() runs a function on every element. On a DataFrame, it runs the function on every column by default, or on every row when you pass axis=1.
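
A minimal sketch of the three ways .apply() can be called: per element, per column, and per row. The DataFrame, its column names, and its values are made up for illustration and are not the lesson's employee table.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"salary": [50000, 120000], "bonus": [5000, 10000]})

# Series.apply: the function receives one element at a time
doubled = df["salary"].apply(lambda s: s * 2)

# DataFrame.apply with the default axis=0: the function receives
# one column (a Series) at a time
col_max = df.apply(lambda col: col.max())

# DataFrame.apply with axis=1: the function receives one row
# (a Series indexed by column name) at a time
total = df.apply(lambda row: row["salary"] + row["bonus"], axis=1)
```

The first and third calls return one value per row; the second returns one value per column.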

Lambda Functions: Quick One-Liners

A lambda is an anonymous function — a function without a name, written in one line:

Python
# Regular function
def double(x):
    return x * 2

# Same thing as a lambda
double = lambda x: x * 2

# Both do: double(5) -> 10

Lambdas are usually passed straight to .apply(). Here, one computes a tax column: 30% of the salary when it is above 100000, 20% otherwise.

Python
employee["tax"] = employee["salary"].apply(
    lambda s: s * 0.3 if s > 100000 else s * 0.2
)
employee[["first_name", "salary", "tax"]]
.apply() Is Slower Than Vectorized Operations

.apply() loops under the hood. For simple cases, np.where() or .str methods are faster. Use .apply() when the logic is too complex for vectorized alternatives.
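
As a rough sketch of what a vectorized alternative looks like, the two-rate tax calculation can be written with np.where() instead of a lambda. The small employee table here is a hypothetical stand-in with made-up values; only the salary column matters for the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's employee table
employee = pd.DataFrame({
    "first_name": ["Alice", "Bob", "Cara"],
    "salary": [120000, 80000, 100000],
})

# Element-by-element version: the lambda is called once per salary
tax_apply = employee["salary"].apply(
    lambda s: s * 0.3 if s > 100000 else s * 0.2
)

# Vectorized version: the condition and both branches are computed
# on whole columns at once, then np.where picks per position
tax_vectorized = np.where(
    employee["salary"] > 100000,
    employee["salary"] * 0.3,
    employee["salary"] * 0.2,
)

employee["tax"] = tax_vectorized
```

Both versions should produce the same values. The np.where() form avoids calling a Python function once per row, which is where the speed difference comes from on large tables.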

Multi-Branch Logic

Exercise 1: Categorize with .apply()

The starter has the lambda structure. Fill in the salary tiers: over 100000 = Senior, over 70000 = Mid, else Junior.

Tables: employee

Named Functions for Complex Logic

When the logic is too complex for a lambda, write a named function:

Python
import pandas as pd

def classify_name(name):
    # Missing names (NaN/None) have no length, so handle them first
    if pd.isna(name):
        return "Unknown"
    elif len(name) <= 3:
        return "Short"
    elif len(name) <= 6:
        return "Medium"
    else:
        return "Long"

employee["name_class"] = employee["first_name"].apply(classify_name)
Lambda vs Named Function

Use lambda for one-line logic. Use a named function when you need multiple lines, error handling, or reusability. If your lambda has more than one if/else, switch to a named function.
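
To illustrate the rule of thumb, here is a sketch of the same three-bracket logic written both ways. The thresholds and rates are invented for the example, and tax_rate is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical brackets, chosen only for illustration
def tax_rate(salary):
    # A named function has room for guard clauses such as this one
    if pd.isna(salary):
        return float("nan")
    if salary > 150000:
        return 0.30
    if salary > 50000:
        return 0.25
    return 0.20

salaries = pd.Series([200000, 85000, 40000])
rates = salaries.apply(tax_rate)

# The same branches as a lambda: it works for clean data, but the
# nested conditional is harder to read and has no missing-value guard
rates_lambda = salaries.apply(
    lambda s: 0.30 if s > 150000 else (0.25 if s > 50000 else 0.20)
)
```

The named version can also be tested on its own and reused for other columns, which the inline lambda cannot.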

.apply() on Rows

Pass axis=1 to apply a function to each row. The function receives the entire row as a Series, so it can read any field by column name, for example row["salary"].
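
A minimal sketch of row-wise logic that depends on two columns at once. The rows, the bonus rule, and the bonus function name are all assumptions made up for this example.

```python
import pandas as pd

# Hypothetical rows standing in for the employee table
employee = pd.DataFrame({
    "first_name": ["Alice", "Bob", "Cara"],
    "department": ["HR", "Sales", "Sales"],
    "salary": [120000, 80000, 60000],
})

def bonus(row):
    # row is a Series whose index labels are the column names
    rate = 0.10 if row["department"] == "Sales" else 0.05
    if row["salary"] > 100000:
        rate += 0.02  # made-up top-up for high salaries
    return row["salary"] * rate

# axis=1 passes each row to bonus(); the result is one value per row
employee["bonus"] = employee.apply(bonus, axis=1)
```

Without axis=1, bonus() would receive whole columns instead of rows, and row["department"] would raise a KeyError.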

Exercise 2: Build a Label from Multiple Columns

Create a label column combining first name and department, for example 'Alice (HR)'. Use .apply() with axis=1.

Tables: employee

When NOT to Use .apply()

Before reaching for .apply(), check if a built-in method exists:

Python

import numpy as np

# df is assumed to be a DataFrame with "name" and "salary" columns.

# Slow: .apply() for simple case
df["upper"] = df["name"].apply(lambda x: x.upper())

# Fast: built-in .str method
df["upper"] = df["name"].str.upper()

# Slow: .apply() for conditional
df["flag"] = df["salary"].apply(
    lambda x: "High" if x > 100000 else "Low"
)

# Fast: np.where()
df["flag"] = np.where(df["salary"] > 100000, "High", "Low")

Key Takeaways

  • .apply(func) runs a function on every element in a Series.
  • .apply(func, axis=1) runs on every row of a DataFrame.
  • Lambda for one-liners; named functions for complex logic.
  • Prefer vectorized operations (.str, np.where, .dt) when they exist — .apply() is the fallback.

What’s Next

You can now transform data with any custom logic. Next: reshaping — pivoting wide tables to long and long tables to wide, the operations that power every cross-tabulation report.