PCA Analysis in Python for Beginners

Written by: Nathan Rosidi

A Practical Walkthrough of Principal Component Analysis with Real-World Examples in Python

One of the most common methods for reducing the number of features in machine learning is Principal Component Analysis (PCA). But why do you need to reduce them? Because it helps you find the most important ones and ignore the rest. This cuts down the computational cost and speeds up your model.

For example, in image recognition, you can compress hundreds of pixel features into a handful of principal components while still capturing most of the important variation. Similarly, in finance, you can reduce dozens of correlated market indicators into a few composite risk factors that explain most of the movement.

In this article, we’ll walk through how to use PCA in Python step by step, covering the logic, the math, and a real project.

PCA Analysis Python: What It Is and Why It Matters

Let’s say you are trying to predict house prices and you have 10 columns, like square footage, number of rooms, distance from the city, and so on.

Some of these columns may be giving you the same kind of information. PCA finds patterns in these columns and builds new axes to represent the data more efficiently.

These new axes are called principal components. Each one shows how much of the total variation in the data it explains. For example, in the graph below, you need about six components to explain 80% of the variance.

[Figure: cumulative explained variance, where about six components explain 80% of the variance]
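To make this concrete, here is a minimal sketch with random stand-in data (the housing example above is hypothetical) showing how scikit-learn reports each component's share of the variance:

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a housing dataset: 200 rows, 10 numeric columns
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)           # variance share per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative share, as in the graph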

When to Use PCA Analysis in Python

You don’t always need PCA. But in some situations, for instance when your dataset is large and your computational power is limited, it can really help.

It reduces the number of columns while keeping most of the critical signals. It’s also useful when some columns are highly correlated. PCA combines them into fewer, uncorrelated components.
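A quick way to spot that kind of overlap is to check the correlation matrix before deciding on PCA. Here is a small sketch with made-up columns (`sqft`, `rooms`, and `price` are hypothetical names):

import numpy as np
import pandas as pd

# Made-up data where price is deliberately tied to sqft
rng = np.random.default_rng(1)
df = pd.DataFrame({"sqft": rng.normal(1500, 300, 100),
                   "rooms": rng.normal(4, 1, 100)})
df["price"] = df["sqft"] * 200 + rng.normal(0, 5000, 100)

corr = df.corr().abs().stack()
print(corr[(corr > 0.8) & (corr < 1.0)])  # highly correlated column pairs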

Another reason to use PCA is for visualization. It helps you project high-dimensional data into 2D or 3D to spot clusters or outliers. But if your dataset is already small and clean, PCA might not help. Sometimes it can even remove signals that your model actually needs.
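For the visualization use case, projecting onto the first two components takes only a few lines. A minimal sketch, again with random stand-in data:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Stand-in high-dimensional data: 300 samples, 20 features
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 20))

coords = PCA(n_components=2).fit_transform(X)  # project down to 2D

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Data projected onto the first two principal components")
plt.show()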

If you’re familiar with PCA in other languages or want to compare how it’s done elsewhere, you can see how it's implemented in principal component analysis in R for another perspective.

Math Behind PCA in Python

PCA works in a few simple steps.

  • First, it centers your data by subtracting the mean of each column.
  • Then it measures how the columns vary together by computing the covariance matrix.
  • Based on that, it finds new directions (the eigenvectors) that capture the most variation in the data.

These directions are ranked by how much information they hold. You can then choose how many to keep, depending on how much variance you want to explain.
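Here is how those steps look if you run them by hand with NumPy on toy data. This is only a sketch of the idea; scikit-learn's `PCA` arrives at the same components internally via singular value decomposition:

import numpy as np

# Toy data: 100 samples, 3 features, with column 1 tied to column 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=100)

X_centered = X - X.mean(axis=0)          # step 1: center each column
cov = np.cov(X_centered, rowvar=False)   # step 2: how the columns vary together

# step 3: eigenvectors are the new directions, eigenvalues their variance
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]    # rank directions by variance
print(eigenvalues[order] / eigenvalues.sum())  # variance share per component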

You can see these new directions in the plot below; the red, blue, and green arrows show how PCA finds the first three principal components, each capturing a different amount of variation in the data.

[Figure: data cloud with red, blue, and green arrows marking the first three principal components]

Setting Up Your Python Environment for PCA Analysis

To run PCA in Python, you only need a few basic packages. Here is the code to install the packages:

pip install pandas numpy matplotlib scikit-learn


Once you have installed them, here is the code to load them.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


In the next section, we’ll load a real dataset and walk through each PCA step with code.

Real-World PCA Analysis in Python: DoorDash Delivery Duration Case Study


This is a real take-home assignment from the recruitment process for data science positions at DoorDash. The goal is to predict the total duration of the delivery process by examining order details, market activity, and estimated times. Here is the link to this data project.

Loading and Preparing Data for PCA Analysis in Python

Let’s load the data and take a quick look at it.

import pandas as pd

df = pd.read_csv("historical_data.csv")
df.head()


Here are the first few rows of the data.

[Output: df.head() showing the first five rows]


In the first few rows, we can see order timestamps, basic store details, item counts, and pricing information per order. Now let’s check all of the columns.

df.info()


Here is the output.

[Output: df.info() column summary]


The full dataset includes columns related to store and order details, market activity at the time of order, internal delivery estimates, and all the timestamps needed to compute total delivery time.

Data Preprocessing

Before applying PCA, we performed a few basic preprocessing steps to prepare the dataset. One of the first things we needed was a column for delivery duration, but this wasn’t directly provided. We had to calculate it using two timestamp columns: `created_at` and `actual_delivery_time`.

Here’s how they looked at first:

[Output: raw created_at and actual_delivery_time columns]


We converted them into `datetime` format and created a new column with the delivery duration in seconds. This will be our target variable for modeling.

Since PCA is an unsupervised feature reduction method, we will later drop this target column before running PCA.

df["created_at"] = pd.to_datetime(df["created_at"])
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])

df["actual_total_delivery_duration"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()

df["actual_total_delivery_duration"].head()


Here is the output.

[Output: first values of actual_total_delivery_duration in seconds]


Next, we engineered a new feature called `estimated_non_prep_duration`, which combines the estimated driving time and order placement time:

df["estimated_non_prep_duration"] = (
    df["estimated_store_to_consumer_driving_duration"] +
    df["estimated_order_place_duration"]
)
df["estimated_non_prep_duration"]


Here is the output.

[Output: estimated_non_prep_duration column]


We’ll also remove the columns that don’t provide meaningful patterns for PCA: the IDs, the raw timestamps, the target column, and the original estimate columns we just combined:

drop_cols = [
    "store_id", "market_id",
    "estimated_order_place_duration", "estimated_store_to_consumer_driving_duration",
    "actual_total_delivery_duration", "created_at", "actual_delivery_time"
]

df = df.drop(columns=drop_cols, errors="ignore")
df.head()


Here is the output.

[Output: DataFrame after dropping the listed columns]


Next, we’ll clean missing values and select only numerical columns:

df_clean = df.dropna()
numeric_cols = df_clean.select_dtypes(include=["float64", "int64"]).columns
X = df_clean[numeric_cols]
X


Here is the output.

[Output: numeric feature matrix X]

Applying PCA Analysis in Python Step by Step

Finally, we’ll scale the data to make it suitable for PCA and then check how much variance each component explains. Here is the code.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # give every feature mean 0 and variance 1

pca = PCA()  # keep all components so we can inspect the variance curve
pca.fit(X_scaled)

cumulative_variance = pca.explained_variance_ratio_.cumsum()  # running total of explained variance

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% Variance Threshold')
plt.title('Cumulative Explained Variance by Number of Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()


Here is the output.

[Figure: cumulative explained variance crossing the 95% threshold at six components]


The line illustrates how much of the total variance is explained as more components are added. As shown above, the first six components are sufficient to explain over 95% of the total variance.

This means instead of using all features, we can keep only those six components and still preserve almost all helpful information. That makes the data smaller, faster to process, and often better for modeling.
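To actually keep those components, pass `n_components` to `PCA`. A short follow-up sketch, continuing from the `X_scaled` matrix above; a float between 0 and 1 tells scikit-learn to keep just enough components to reach that share of the variance:

pca_95 = PCA(n_components=0.95)  # keep enough components for 95% variance
X_reduced = pca_95.fit_transform(X_scaled)

print(pca_95.n_components_)  # should report 6 here, matching the plot
print(X_reduced.shape)       # the smaller feature matrix to model on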

Common Pitfalls & Best Practices

PCA is powerful, but it's easy to misuse. Here are a few key points to consider when using it in real-world projects.

[Figure: key points to consider when using PCA]
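The figure above lists the key points. One that is worth spelling out in code (my own illustration, not taken from the figure) is to fit the scaler and PCA on training data only, so no test-set information leaks into the components. A sketch continuing from the feature matrix `X`:

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit scaling and PCA on the training split only
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("pca", PCA(n_components=0.95))])

X_train_red = pipeline.fit_transform(X_train)  # learns means, variances, components
X_test_red = pipeline.transform(X_test)        # reuses them; no leakage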

Final Thoughts

PCA analysis in Python is one of the simplest and most valuable tools in your data science toolbox. It cuts down on noise, accelerates models, and highlights the most essential parts of your data.

While PCA isn’t always necessary, it can be especially useful when your dataset has many features or when those features overlap in the information they provide. With the right approach, PCA can make your models simpler, faster, and often more effective.
