Introduction to Python & Pandas
What This Path Is About
This is a hands-on path for learning data analysis with Python and pandas. By the end, you’ll be able to load datasets, filter and transform rows, merge tables, aggregate with groupby, and tackle the same analytical problems that come up in real jobs and technical interviews.
We’re going to move fast. Each lesson introduces a concept, shows you how it works, and then asks you to write code. If you’ve done some programming before, great. If you’re coming from spreadsheets or SQL, also great — pandas will feel familiar once you see how it maps to what you already know.
Let’s start with the tool itself.
Pandas in 60 Seconds
Pandas is a Python library for working with structured data — the kind that fits in rows and columns. It was created in 2008 by Wes McKinney, who was tired of switching between Python and R while working in finance. His goal was simple: make Python as good as R for data manipulation.
It worked. Today, pandas is the default tool for data wrangling in Python. If you open a Jupyter notebook at any tech company, you’ll see import pandas as pd at the top. Data scientists, analysts, ML engineers — everyone uses it.
The core idea is one data structure: the DataFrame. A DataFrame is a table. It has rows, columns, and an index. If you’ve used a spreadsheet, you already know what a DataFrame looks like. The difference is that instead of clicking around in cells, you write code to manipulate the data. That makes your work reproducible, shareable, and fast.
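To make this concrete, here is a minimal sketch that builds a small DataFrame from a Python dictionary. The column names and values are invented for illustration:

```python
import pandas as pd

# Build a small DataFrame from a dictionary mapping
# column names to lists of values (data is made up)
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "department": ["Engineering", "Research", "Engineering"],
    "salary": [95000, 105000, 88000],
})

print(df.shape)  # (3, 3): three rows, three columns
```

Every operation you learn in this path — selecting, filtering, grouping — starts from a DataFrame like this one.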
Setting Up
In most Python projects, the first line of any data analysis script is:
import pandas as pd
The as pd alias is a universal convention. You’ll need this import whenever you call pandas functions directly — things like pd.read_csv(), pd.merge(), or pd.DataFrame(). The shorthand saves a lot of typing.
On StrataScratch, each dataset is preloaded as a DataFrame variable. In these first lessons, you won’t need any imports — you can work directly with the preloaded DataFrames. Later, when you need to call pandas functions like pd.merge() or pd.DataFrame(), you’ll add import pandas as pd at the top.
Loading Data in the Real World
In a real project, data doesn’t appear out of thin air. You load it from files. The most common format is CSV, and pandas makes it one line:
import pandas as pd
# Load a CSV file into a DataFrame
df = pd.read_csv("employees.csv")
That’s it — pd.read_csv() reads the file and returns a DataFrame. Pandas also supports Excel files, JSON, SQL databases, and more:
# Excel
df = pd.read_excel("data.xlsx")
# JSON
df = pd.read_json("data.json")
# From a SQL database
df = pd.read_sql("SELECT * FROM employees", connection)
In the exercises on StrataScratch, datasets are already loaded as DataFrames for you. But if you want to practice loading and exploring real data from scratch, check out StrataScratch’s Data Projects — they give you a full Python and SQL environment.
Meet the Data
Let’s start with a dataset called techcorp_workforce — it contains employee data from a tech company. Take a look:
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | | 2021-09-20 |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | | 2021-04-05 |
This is a DataFrame with one row per employee. Each column holds a specific type of information: id is an integer, first_name is text, joining_date is a date, and salary is numeric. Pandas automatically tracks these types and uses them to determine which operations are valid on each column.
Selecting Columns
The most fundamental operation in pandas is pulling out the columns you care about. You do this with square brackets and a list of column names:
techcorp_workforce[["first_name", "department"]]
The double brackets are important. The outer pair tells pandas “select from this DataFrame.” The inner pair creates a Python list of the column names you want. The result is a new DataFrame containing only those columns.
For a single column, you can use single brackets with just the column name as a string:
techcorp_workforce["salary"]
This returns a Series — pandas’ one-dimensional data structure. Think of a Series as a single column pulled out of the table. Most of the time, you’ll work with DataFrames (multiple columns), but it’s good to know the difference exists.
Using single brackets like df['col'] returns a Series. Using double brackets like df[['col']] returns a DataFrame — even for one column. This distinction matters when chaining operations because some methods expect a DataFrame rather than a Series.
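A quick way to see the difference for yourself, using a throwaway DataFrame (the column names and values here are invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]})

# Single brackets -> Series (one-dimensional)
s = df["col"]
print(type(s).__name__)    # Series

# Double brackets -> DataFrame (still a table, just with one column)
sub = df[["col"]]
print(type(sub).__name__)  # DataFrame
```

If a later step in a chain complains about a missing DataFrame method, check whether an earlier selection accidentally produced a Series.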
Selecting Multiple Columns
Select first name and last name from `techcorp_workforce`.
The output is a two-column DataFrame. Only the columns you asked for come back — everything else is excluded.
Viewing the Full DataFrame
Sometimes you want everything. In pandas, just reference the DataFrame variable by name:
Display all columns and rows from `techcorp_workforce`.
In practice, you rarely want to dump an entire DataFrame to the screen. Production datasets can have millions of rows. Use head() to peek at just the first few rows:
# First 5 rows (default)
techcorp_workforce.head()
# First 10 rows
techcorp_workforce.head(10)
# Last 5 rows
techcorp_workforce.tail()
head() and tail() are the pandas equivalent of quickly scrolling to the top or bottom of a spreadsheet. Use them constantly.
When you load a new dataset, run df.head() before anything else. It shows you the column names, what the data looks like, and whether anything is obviously wrong — all in one glance.
Inspecting a DataFrame
Before you start analyzing, you need to know what you’re working with. Every DataFrame comes with a handful of attributes and methods that answer the key questions:
# How many rows and columns?
techcorp_workforce.shape
# What are the column names?
techcorp_workforce.columns
# What data type is each column?
techcorp_workforce.dtypes
# Get a full summary: columns, types, non-null counts
techcorp_workforce.info()
shape returns a tuple like (50, 8) — 50 rows, 8 columns. columns gives you the column names as an Index object. dtypes tells you whether each column holds integers, floats, strings (object in pandas), or dates (datetime64). info() combines all of this into one summary: column names, types, and how many non-null values each column has.
When you see object as a column’s dtype, it almost always means text (strings). Pandas uses this label because Python strings are generic objects under the hood. Don’t let the name confuse you.
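Here is a small sketch, with invented data, showing how pandas assigns dtypes. Note that exactly how string columns are labeled can vary by pandas version, but in most releases they show up as object:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],                        # integers  -> int64
    "first_name": ["Sarah", "Mike", "Em"],  # strings   -> object
    "salary": [95000.0, 88000.0, 82500.0],  # floats    -> float64
})

# Print the dtype of each column
print(df.dtypes)
```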
For numeric columns, describe() gives you a quick statistical summary:
techcorp_workforce.describe()
This returns count, mean, standard deviation, min, max, and percentiles. It’s a fast way to spot outliers or sanity-check the data before doing anything else.
By default, describe() only summarizes numeric columns. To include text columns, use df.describe(include='all'). This adds count, unique, top (most frequent value), and frequency.
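As a sketch with invented data, comparing the two forms side by side:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [95000, 88000, 82500, 80000, 65000],
    "department": ["HR", "HR", "HR", "HR", "HR"],
})

# Numeric columns only: count, mean, std, min, percentiles, max
print(df.describe())

# Include text columns too: adds unique, top, and freq rows
print(df.describe(include="all"))
```

In the first call, department is dropped entirely; in the second, it gains its own column with unique (1 distinct value), top ("HR"), and freq (5).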
DataFrame Shape and Info
Before you start analyzing, you need to know what you're working with. .info() gives you the full picture in one call: column names, data types, and how many non-null values each column has.
techcorp_workforce.info()
Writing Comments
Python uses # for comments. Everything after the # on that line is ignored:
# Pull employee names for the quarterly report
techcorp_workforce[["first_name", "last_name"]]
You can also place comments at the end of a line of code:
techcorp_workforce.shape  # returns (rows, columns)
For temporarily disabling code, comment it out:
techcorp_workforce[["first_name", "last_name"]]
# techcorp_workforce[["first_name", "last_name", "salary"]]
Don’t comment every line — that’s noise. Comment on the why, not the what. If a line of code does something non-obvious or makes a business-logic decision, leave a note. If it’s self-explanatory, skip the comment.
Working with a New Dataset
Let’s switch to a completely different dataset — health inspection records from Los Angeles restaurants:
| serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | pe_description | program_element_pe | program_name | program_status | record_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAQHRSETQ | 2017-06-08 | MARGARITAS CAFE | 93 | A | 1 | ROUTINE INSPECTION | EE0000006 | 5026 S CRENSHAW BLVD | LOS ANGELES | FA0023656 | CA | 90043 | OW0004133 | BAZAN, ASCENCION | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | MARGARITAS CAFE | ACTIVE | PR0011718 |
| DA2GQRJOS | 2017-03-07 | LAS MOLENDERAS | 97 | A | 1 | ROUTINE INSPECTION | EE0000997 | 2635 WHITTIER BLVD | LOS ANGELES | FA0160416 | CA | 90023 | OW0125379 | MARISOL FEREGRINO | RESTAURANT (0-30) SEATS HIGH RISK | 1632 | LAS MOLENDERAS | INACTIVE | PR0148504 |
| DAMQTA46T | 2016-03-22 | SANDRA'S TAMALES | 93 | A | 1 | ROUTINE INSPECTION | EE0001049 | 5390 WHITTIER BLVD | LOS ANGELES | FA0171769 | CA | 90022-4032 | OW0178828 | SANDRA'S TAMALES INC. | RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | SANDRA'S TAMALES | ACTIVE | PR0164225 |
| DAXMBTIRZ | 2018-02-12 | CAFE GRATITUDE | 97 | A | 1 | ROUTINE INSPECTION | EE0000828 | 639 N LARCHMONT BLVD STE #102 | LOS ANGELES | FA0058921 | CA | 90004 | OW0005704 | CAFE GRATITUDE LARCHMONT LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | CAFE GRATITUDE | ACTIVE | PR0019854 |
| DAK8TBMS0 | 2015-09-10 | THE WAFFLE | 90 | A | 1 | ROUTINE INSPECTION | EE0000709 | 6255 W SUNSET BLVD STE #105 | LOS ANGELES | FA0051830 | CA | 90028 | OW0035796 | THE WAFFLE, LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | THE WAFFLE | ACTIVE | PR0010922 |
This table has more columns and messier real-world data. Try selecting specific columns from it.
Select the facility name, score, and grade from `los_angeles_restaurant_health_inspections`.
Checking Dimensions
Find out how many rows and columns `los_angeles_restaurant_health_inspections` has.
Key Takeaways
- Pandas is Python’s standard library for structured data analysis. The core data structure is the DataFrame — a table with typed columns.
- You’ll need import pandas as pd once you start calling pandas functions directly (like pd.merge() or pd.read_csv()). In these early lessons, everything works on preloaded DataFrames without an import.
- Select columns with df[['col1', 'col2']] for a DataFrame result, or df['col'] for a single Series.
- Use head(), shape, info(), dtypes, and describe() to understand a dataset before analyzing it.
- Comment on the why, not the what. Use # for single-line comments.
What’s Next
Now that you can inspect and select columns from a DataFrame, the next step is filtering rows. We’ll cover boolean indexing — the pandas way of saying “give me only the rows where this condition is true.” That’s when pandas starts getting genuinely powerful.