Module 1: DataFrame Fundamentals (35 min)

Introduction to Python & Pandas


What This Path Is About

This is a hands-on path for learning data analysis with Python and pandas. By the end, you’ll be able to load datasets, filter and transform rows, merge tables, aggregate with groupby, and tackle the same analytical problems that come up in real jobs and technical interviews.

We’re going to move fast. Each lesson introduces a concept, shows you how it works, and then asks you to write code. If you’ve done some programming before, great. If you’re coming from spreadsheets or SQL, also great — pandas will feel familiar once you see how it maps to what you already know.

Let’s start with the tool itself.

Pandas in 60 Seconds

Pandas is a Python library for working with structured data — the kind that fits in rows and columns. It was created in 2008 by Wes McKinney, who was tired of switching between Python and R while working in finance. His goal was simple: make Python as good as R for data manipulation.

It worked. Today, pandas is the default tool for data wrangling in Python. If you open a Jupyter notebook at any tech company, you’ll see import pandas as pd at the top. Data scientists, analysts, ML engineers — everyone uses it.

The core idea is one data structure: the DataFrame. A DataFrame is a table. It has rows, columns, and an index. If you’ve used a spreadsheet, you already know what a DataFrame looks like. The difference is that instead of clicking around in cells, you write code to manipulate the data. That makes your work reproducible, shareable, and fast.

Setting Up

In most Python projects, the first line of any data analysis script is:

Python
import pandas as pd

The as pd alias is a universal convention. You’ll need this import whenever you call pandas functions directly — things like pd.read_csv(), pd.merge(), or pd.DataFrame(). The shorthand saves a lot of typing.

StrataScratch Environment

On StrataScratch, each dataset is preloaded as a DataFrame variable. In these first lessons, you won’t need any imports — you can work directly with the preloaded DataFrames. Later, when you need to call pandas functions like pd.merge() or pd.DataFrame(), you’ll add import pandas as pd at the top.

Loading Data in the Real World

In a real project, data doesn’t appear out of thin air. You load it from files. The most common format is CSV, and pandas makes it one line:

Python
import pandas as pd

# Load a CSV file into a DataFrame
df = pd.read_csv("employees.csv")

That’s it — pd.read_csv() reads the file and returns a DataFrame. Pandas also supports Excel files, JSON, SQL databases, and more:

Python
# Excel
df = pd.read_excel("data.xlsx")

# JSON
df = pd.read_json("data.json")

# From a SQL database
df = pd.read_sql("SELECT * FROM employees", connection)
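If you want to see the CSV round trip end to end, here is a minimal sketch: it writes a tiny CSV (with made-up employee values) to a temporary file and reads it back with `pd.read_csv()`.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV to a temporary file (hypothetical employee data)
csv_text = "id,first_name,salary\n1,Sarah,95000\n2,Michael,88000\n"
path = os.path.join(tempfile.mkdtemp(), "employees.csv")
with open(path, "w") as f:
    f.write(csv_text)

# read_csv parses the header row and infers each column's type
df = pd.read_csv(path)
print(df.shape)            # (2, 3)
print(df["salary"].sum())  # 183000
```

Notice that you never told pandas which columns are numbers — it inferred `id` and `salary` as integers from the file contents.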

In the exercises on StrataScratch, datasets are already loaded as DataFrames for you. But if you want to practice loading and exploring real data from scratch, check out StrataScratch’s Data Projects — they give you a full Python and SQL environment.

Meet the Data

Let’s start with a dataset called techcorp_workforce — it contains employee data from a tech company. Take a look:

Table: techcorp_workforce
| id | first_name | last_name | department | salary | phone_number | joining_date |
|----|------------|-----------|------------|--------|--------------|--------------|
| 1  | Sarah      | Mitchell  | HR         | 95000  | 555-0101     | 2021-03-15   |
| 2  | Michael    | Chen      | HR         | 88000  | 555-0102     | 2022-06-01   |
| 3  | Emily      | Rodriguez | HR         | 82500  |              | 2021-09-20   |
| 4  | David      | Park      | HR         | 80000  | 555-0104     | 2023-01-10   |
| 5  | Lisa       | Thompson  | HR         | 65000  |              | 2021-04-05   |

This is a DataFrame with one row per employee. Each column holds a specific type of information: id is an integer, first_name is text, joining_date is a date, and salary is numeric. Pandas automatically tracks these types and uses them to determine which operations are valid on each column.

Selecting Columns

The most fundamental operation in pandas is pulling out the columns you care about. You do this with square brackets and a list of column names:

Python
techcorp_workforce[["first_name", "department"]]

The double brackets are important. The outer pair tells pandas “select from this DataFrame.” The inner pair creates a Python list of the column names you want. The result is a new DataFrame containing only those columns.

For a single column, you can use single brackets with just the column name as a string:

Python
techcorp_workforce["salary"]

This returns a Series — pandas’ one-dimensional data structure. Think of a Series as a single column pulled out of the table. Most of the time, you’ll work with DataFrames (multiple columns), but it’s good to know the difference exists.

Single vs. Double Brackets

Using single brackets like df['col'] returns a Series. Using double brackets like df[['col']] returns a DataFrame — even for one column. This distinction matters when chaining operations because some methods expect a DataFrame rather than a Series.
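You can confirm the distinction by checking the type of each result. A quick sketch on a toy DataFrame (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": ["a", "b", "c"]})

# Single brackets -> Series (one-dimensional)
print(type(df["col"]))    # <class 'pandas.core.series.Series'>

# Double brackets -> DataFrame, even for a single column
print(type(df[["col"]]))  # <class 'pandas.core.frame.DataFrame'>
```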

Selecting Multiple Columns

Exercise 1: Select Two Columns

Select first name and last name from `techcorp_workforce`.

Tables: techcorp_workforce

The output is a two-column DataFrame. Only the columns you asked for come back — everything else is excluded.
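If you want to try this outside the exercise environment, here is what the pattern looks like on a hand-built stand-in for `techcorp_workforce` (just a few rows with invented values):

```python
import pandas as pd

# A small stand-in for the techcorp_workforce table
techcorp_workforce = pd.DataFrame({
    "first_name": ["Sarah", "Michael", "Emily"],
    "last_name": ["Mitchell", "Chen", "Rodriguez"],
    "salary": [95000, 88000, 82500],
})

# Double brackets: a list of column names -> a two-column DataFrame
names = techcorp_workforce[["first_name", "last_name"]]
print(names.columns.tolist())  # ['first_name', 'last_name']
print(names.shape)             # (3, 2)
```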

Viewing the Full DataFrame

Sometimes you want everything. In pandas, just reference the DataFrame variable by name:

Exercise 2: View All Columns

Display all columns and rows from `techcorp_workforce`.

Tables: techcorp_workforce

In practice, you rarely want to dump an entire DataFrame to the screen. Production datasets can have millions of rows. Use head() to peek at just the first few rows:

Python
# First 5 rows (default)
techcorp_workforce.head()

# First 10 rows
techcorp_workforce.head(10)

# Last 5 rows
techcorp_workforce.tail()

head() and tail() are the pandas equivalent of quickly scrolling to the top or bottom of a spreadsheet. Use them constantly.

Start Every Exploration with head()

When you load a new dataset, run df.head() before anything else. It shows you the column names, what the data looks like, and whether anything is obviously wrong — all in one glance.
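A quick sketch of what `head()` and `tail()` return, using a toy 100-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

print(df.head())    # first 5 rows: n = 0..4
print(df.head(10))  # first 10 rows
print(df.tail())    # last 5 rows: n = 95..99
```

Both methods return a new DataFrame, so you can chain further operations onto the result.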

Inspecting a DataFrame

Before you start analyzing, you need to know what you’re working with. Every DataFrame comes with a handful of attributes and methods that answer the key questions:

Python
# How many rows and columns?
techcorp_workforce.shape

# What are the column names?
techcorp_workforce.columns

# What data type is each column?
techcorp_workforce.dtypes

# Get a full summary: columns, types, non-null counts
techcorp_workforce.info()

shape returns a tuple like (50, 8) — 50 rows, 8 columns. columns gives you the column names as an Index object. dtypes tells you whether each column holds integers, floats, strings (object in pandas), or dates (datetime64). info() combines all of this into one summary: column names, types, and how many non-null values each column has.
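On a small made-up table, these inspection tools look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Sarah", "Michael", "Emily"],
    "salary": [95000.0, 88000.0, 82500.0],
})

print(df.shape)             # (3, 3) -> 3 rows, 3 columns
print(df.columns.tolist())  # ['id', 'name', 'salary']
print(df.dtypes)            # id: integer, name: object, salary: float64
df.info()                   # combines the above with non-null counts
```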

The object Dtype

When you see object as a column’s dtype, it almost always means text (strings). Pandas uses this label because Python strings are generic objects under the hood. Don’t let the name confuse you.

For numeric columns, describe() gives you a quick statistical summary:

Python
techcorp_workforce.describe()

This returns count, mean, standard deviation, min, max, and percentiles. It’s a fast way to spot outliers or sanity-check the data before doing anything else.
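Here is a sketch of `describe()` on a single numeric column (salary values invented). The summary comes back as a DataFrame itself, so you can pull individual statistics out of it:

```python
import pandas as pd

df = pd.DataFrame({"salary": [65000, 80000, 82500, 88000, 95000]})

summary = df.describe()
print(summary.loc["count", "salary"])  # 5.0
print(summary.loc["mean", "salary"])   # 82100.0
print(summary.loc["min", "salary"])    # 65000.0
print(summary.loc["max", "salary"])    # 95000.0
```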

describe() for Text Columns

By default, describe() only summarizes numeric columns. To include text columns, use df.describe(include='all'). This adds count, unique, top (the most frequent value), and freq (how often it appears).
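For a mixed table, `include='all'` adds the text-column statistics alongside the numeric ones. A quick sketch (data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["HR", "HR", "Sales"],
    "salary": [95000, 88000, 70000],
})

summary = df.describe(include="all")
# Text columns get unique / top / freq rows (numeric columns show NaN there)
print(summary.loc["unique", "department"])  # 2
print(summary.loc["top", "department"])     # 'HR' (most frequent value)
print(summary.loc["freq", "department"])    # 2
```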


Writing Comments

Python uses # for comments. Everything after the # on that line is ignored:

Python
# Pull employee names for the quarterly report
techcorp_workforce[["first_name", "last_name"]]

You can also place comments at the end of a line of code:

Python
techcorp_workforce.shape  # returns (rows, columns)

For temporarily disabling code, comment it out:

Python
techcorp_workforce[["first_name", "last_name"]]

# techcorp_workforce[["first_name", "last_name", "salary"]]

When to Comment

Don’t comment every line — that’s noise. Comment on the why, not the what. If a line of code does something non-obvious or makes a business-logic decision, leave a note. If it’s self-explanatory, skip the comment.

Working with a New Dataset

Let’s switch to a completely different dataset — health inspection records from Los Angeles restaurants:

Table: los_angeles_restaurant_health_inspections
| serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | pe_description | program_element_pe | program_name | program_status | record_id |
|---------------|---------------|---------------|-------|-------|--------------|---------------------|-------------|------------------|---------------|-------------|----------------|--------------|----------|------------|----------------|--------------------|--------------|----------------|-----------|
| DAQHRSETQ | 2017-06-08 | MARGARITAS CAFE | 93 | A | 1 | ROUTINE INSPECTION | EE0000006 | 5026 S CRENSHAW BLVD | LOS ANGELES | FA0023656 | CA | 90043 | OW0004133 | BAZAN, ASCENCION | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | MARGARITAS CAFE | ACTIVE | PR0011718 |
| DA2GQRJOS | 2017-03-07 | LAS MOLENDERAS | 97 | A | 1 | ROUTINE INSPECTION | EE0000997 | 2635 WHITTIER BLVD | LOS ANGELES | FA0160416 | CA | 90023 | OW0125379 | MARISOL FEREGRINO | RESTAURANT (0-30) SEATS HIGH RISK | 1632 | LAS MOLENDERAS | INACTIVE | PR0148504 |
| DAMQTA46T | 2016-03-22 | SANDRA'S TAMALES | 93 | A | 1 | ROUTINE INSPECTION | EE0001049 | 5390 WHITTIER BLVD | LOS ANGELES | FA0171769 | CA | 90022-4032 | OW0178828 | SANDRA'S TAMALES INC. | RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | SANDRA'S TAMALES | ACTIVE | PR0164225 |
| DAXMBTIRZ | 2018-02-12 | CAFE GRATITUDE | 97 | A | 1 | ROUTINE INSPECTION | EE0000828 | 639 N LARCHMONT BLVD STE #102 | LOS ANGELES | FA0058921 | CA | 90004 | OW0005704 | CAFE GRATITUDE LARCHMONT LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | CAFE GRATITUDE | ACTIVE | PR0019854 |
| DAK8TBMS0 | 2015-09-10 | THE WAFFLE | 90 | A | 1 | ROUTINE INSPECTION | EE0000709 | 6255 W SUNSET BLVD STE #105 | LOS ANGELES | FA0051830 | CA | 90028 | OW0035796 | THE WAFFLE, LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | THE WAFFLE | ACTIVE | PR0010922 |

This table has more columns and messier real-world data. Try selecting specific columns from it.
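The selection pattern is exactly the same as before — only the column names change. Here is a sketch against a cut-down stand-in for the inspections table (two rows, values copied from the sample above, most of the 20 columns omitted):

```python
import pandas as pd

# Stand-in with a small subset of the real table's columns
inspections = pd.DataFrame({
    "facility_name": ["MARGARITAS CAFE", "LAS MOLENDERAS"],
    "score": [93, 97],
    "grade": ["A", "A"],
    "facility_city": ["LOS ANGELES", "LOS ANGELES"],
})

# Same double-bracket pattern, regardless of how wide the table is
result = inspections[["facility_name", "score", "grade"]]
print(result.shape)  # (2, 3)
```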

Exercise 3: Select Inspection Details

Select the facility name, score, and grade from `los_angeles_restaurant_health_inspections`.

Tables: los_angeles_restaurant_health_inspections

Checking Dimensions

Exercise 4: Inspect the Table Shape

Find out how many rows and columns `los_angeles_restaurant_health_inspections` has.

Tables: los_angeles_restaurant_health_inspections

Key Takeaways

  • Pandas is Python’s standard library for structured data analysis. The core data structure is the DataFrame — a table with typed columns.
  • You’ll need import pandas as pd once you start calling pandas functions directly (like pd.merge() or pd.read_csv()). In these early lessons, everything works on preloaded DataFrames without an import.
  • Select columns with df[['col1', 'col2']] for a DataFrame result, or df['col'] for a single Series.
  • Use head(), shape, info(), dtypes, and describe() to understand a dataset before analyzing it.
  • Comment on the why, not the what. Use # for single-line comments.

What’s Next

Now that you can inspect and select columns from a DataFrame, the next step is filtering rows. We’ll cover boolean indexing — the pandas way of saying “give me only the rows where this condition is true.” That’s when pandas starts getting genuinely powerful.