Introduction to Python & Pandas
What This Path Is About
This is a hands-on path for learning data analysis with Python and pandas. By the end, you’ll be able to load datasets, filter and transform rows, merge tables, aggregate with groupby, and tackle the same analytical problems that come up in real jobs and technical interviews.
We’re going to move fast. Each lesson introduces a concept, shows you how it works, and then asks you to write code. If you’ve done some programming before, great. If you’re coming from spreadsheets or SQL, also great — pandas will feel familiar once you see how it maps to what you already know.
Let’s start with the tool itself.
Pandas in 60 Seconds
Pandas is a Python library for working with structured data — the kind that fits in rows and columns. It was created in 2008 by Wes McKinney, who was tired of switching between Python and R while working in finance. His goal was simple: make Python as good as R for data manipulation.
It worked. Today, pandas is the default tool for data wrangling in Python. If you open a Jupyter notebook at any tech company, you’ll see import pandas as pd at the top. Data scientists, analysts, ML engineers — everyone uses it.
The core idea is one data structure: the DataFrame. A DataFrame is a table. It has rows, columns, and an index. If you’ve used a spreadsheet, you already know what a DataFrame looks like. The difference is that instead of clicking around in cells, you write code to manipulate the data. That makes your work reproducible, shareable, and fast.
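To make this concrete, here is a minimal sketch that builds a small DataFrame from a Python dictionary. The column names and values are invented for illustration:

```python
import pandas as pd

# Build a small DataFrame from a dictionary mapping
# column names to lists of values (data is made up)
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "department": ["Engineering", "Research", "Engineering"],
    "salary": [95000, 105000, 88000],
})

print(df.shape)  # (3, 3): three rows, three columns
```

Every operation you learn in this path — selecting, filtering, grouping — starts from a DataFrame like this one.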
Setting Up
In most Python projects, the first line of any data analysis script is:
import pandas as pd
The as pd alias is a universal convention. You’ll need this import whenever you call pandas functions directly — things like pd.read_csv(), pd.merge(), or pd.DataFrame(). The shorthand saves a lot of typing.
On StrataScratch, each dataset is preloaded as a DataFrame variable. In these first lessons, you won’t need any imports — you can work directly with the preloaded DataFrames. Later, when you need to call pandas functions like pd.merge() or pd.DataFrame(), you’ll add import pandas as pd at the top.
Loading Data in the Real World
In a real project, data doesn’t appear out of thin air. You load it from files. The most common format is CSV, and pandas makes it one line:
import pandas as pd
# Load a CSV file into a DataFrame
df = pd.read_csv("employees.csv")
That’s it — pd.read_csv() reads the file and returns a DataFrame. Pandas also supports Excel files, JSON, SQL databases, and more:
# Excel
df = pd.read_excel("data.xlsx")
# JSON
df = pd.read_json("data.json")
# From a SQL database
df = pd.read_sql("SELECT * FROM employees", connection)
In the exercises on StrataScratch, datasets are already loaded as DataFrames for you. But if you want to practice loading and exploring real data from scratch, check out StrataScratch’s Data Projects — they give you a full Python and SQL environment.
Meet the Data
Let’s start with a dataset called techcorp_workforce — it contains employee data from a tech company. Take a look:
| id | first_name | last_name | department | salary | phone_number | joining_date |
|---|---|---|---|---|---|---|
| 1 | Sarah | Mitchell | HR | 95000 | 555-0101 | 2021-03-15 |
| 2 | Michael | Chen | HR | 88000 | 555-0102 | 2022-06-01 |
| 3 | Emily | Rodriguez | HR | 82500 | | 2021-09-20 |
| 4 | David | Park | HR | 80000 | 555-0104 | 2023-01-10 |
| 5 | Lisa | Thompson | HR | 65000 | | 2021-04-05 |
This is a DataFrame with one row per employee. Each column holds a specific type of information: id is an integer, first_name is text, joining_date is a date, and salary is numeric. Pandas automatically tracks these types and uses them to determine which operations are valid on each column.
Selecting Columns
The most fundamental operation in pandas is pulling out the columns you care about. You do this with square brackets and a list of column names:
techcorp_workforce[["first_name", "department"]]
The double brackets are important. The outer pair tells pandas “select from this DataFrame.” The inner pair creates a Python list of the column names you want. The result is a new DataFrame containing only those columns.
For a single column, you can use single brackets with just the column name as a string:
techcorp_workforce["salary"]
This returns a Series — pandas’ one-dimensional data structure. Think of a Series as a single column pulled out of the table. Most of the time, you’ll work with DataFrames (multiple columns), but it’s good to know the difference exists.
Using single brackets like df['col'] returns a Series. Using double brackets like df[['col']] returns a DataFrame — even for one column. This distinction matters when chaining operations because some methods expect a DataFrame rather than a Series.
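A quick way to see the difference for yourself, using a throwaway DataFrame (the column names and values here are invented):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]})

# Single brackets -> Series (one-dimensional)
s = df["col"]
print(type(s).__name__)    # Series

# Double brackets -> DataFrame (still a table, just with one column)
sub = df[["col"]]
print(type(sub).__name__)  # DataFrame
```

If a later step in a chain complains about a missing DataFrame method, check whether an earlier selection accidentally produced a Series.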
Selecting Multiple Columns
Select first name and last name from `techcorp_workforce`.
The output is a two-column DataFrame. Only the columns you asked for come back — everything else is excluded.
Viewing the Full DataFrame
Sometimes you want everything. In pandas, just reference the DataFrame variable by name:
Display all columns and rows from `techcorp_workforce`.
In practice, you rarely want to dump an entire DataFrame to the screen. Production datasets can have millions of rows. Use head() to peek at just the first few rows:
# First 5 rows (default)
techcorp_workforce.head()
# First 10 rows
techcorp_workforce.head(10)
# Last 5 rows
techcorp_workforce.tail()
head() and tail() are the pandas equivalent of quickly scrolling to the top or bottom of a spreadsheet. Use them constantly.
When you load a new dataset, run df.head() before anything else. It shows you the column names, what the data looks like, and whether anything is obviously wrong — all in one glance.
Inspecting a DataFrame
Before you start analyzing, you need to know what you’re working with. Every DataFrame comes with a handful of attributes and methods that answer the key questions:
# How many rows and columns?
techcorp_workforce.shape
# What are the column names?
techcorp_workforce.columns
# What data type is each column?
techcorp_workforce.dtypes
# Get a full summary: columns, types, non-null counts
techcorp_workforce.info()
shape returns a tuple like (50, 8) — 50 rows, 8 columns. columns gives you the column names as an Index object. dtypes tells you whether each column holds integers, floats, strings (object in pandas), or dates (datetime64). info() combines all of this into one summary: column names, types, and how many non-null values each column has.
When you see object as a column’s dtype, it almost always means text (strings). Pandas uses this label because Python strings are generic objects under the hood. Don’t let the name confuse you.
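Here is a small sketch, with invented data, showing how pandas assigns dtypes. Note that exactly how string columns are labeled can vary by pandas version, but in most releases they show up as object:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],                        # integers  -> int64
    "first_name": ["Sarah", "Mike", "Em"],  # strings   -> object
    "salary": [95000.0, 88000.0, 82500.0],  # floats    -> float64
})

# Print the dtype of each column
print(df.dtypes)
```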
For numeric columns, describe() gives you a quick statistical summary:
techcorp_workforce.describe()
This returns count, mean, standard deviation, min, max, and percentiles. It’s a fast way to spot outliers or sanity-check the data before doing anything else.
By default, describe() only summarizes numeric columns. To include text columns, use df.describe(include='all'). This adds count, unique, top (most frequent value), and frequency.
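As a sketch with invented data, comparing the two forms side by side:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [95000, 88000, 82500, 80000, 65000],
    "department": ["HR", "HR", "HR", "HR", "HR"],
})

# Numeric columns only: count, mean, std, min, percentiles, max
print(df.describe())

# Include text columns too: adds unique, top, and freq rows
print(df.describe(include="all"))
```

In the first call, department is dropped entirely; in the second, it gains its own column with unique (1 distinct value), top ("HR"), and freq (5).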
DataFrame Shape and Info
Before you start analyzing, you need to know what you're working with. .info() gives you the full picture in one call: column names, data types, and how many non-null values each column has.
techcorp_workforce.info()
Writing Comments
Python uses # for comments. Everything after the # on that line is ignored:
# Pull employee names for the quarterly report
techcorp_workforce[["first_name", "last_name"]]
You can also place comments at the end of a line of code:
techcorp_workforce.shape  # returns (rows, columns)
For temporarily disabling code, comment it out:
techcorp_workforce[["first_name", "last_name"]]
# techcorp_workforce[["first_name", "last_name", "salary"]]
Don’t comment every line — that’s noise. Comment on the why, not the what. If a line of code does something non-obvious or makes a business-logic decision, leave a note. If it’s self-explanatory, skip the comment.
Working with a New Dataset
Let’s switch to a completely different dataset — health inspection records from Los Angeles restaurants:
| serial_number | activity_date | facility_name | score | grade | service_code | service_description | employee_id | facility_address | facility_city | facility_id | facility_state | facility_zip | owner_id | owner_name | pe_description | program_element_pe | program_name | program_status | record_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAQHRSETQ | 2017-06-08 | MARGARITAS CAFE | 93 | A | 1 | ROUTINE INSPECTION | EE0000006 | 5026 S CRENSHAW BLVD | LOS ANGELES | FA0023656 | CA | 90043 | OW0004133 | BAZAN, ASCENCION | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | MARGARITAS CAFE | ACTIVE | PR0011718 |
| DA2GQRJOS | 2017-03-07 | LAS MOLENDERAS | 97 | A | 1 | ROUTINE INSPECTION | EE0000997 | 2635 WHITTIER BLVD | LOS ANGELES | FA0160416 | CA | 90023 | OW0125379 | MARISOL FEREGRINO | RESTAURANT (0-30) SEATS HIGH RISK | 1632 | LAS MOLENDERAS | INACTIVE | PR0148504 |
| DAMQTA46T | 2016-03-22 | SANDRA'S TAMALES | 93 | A | 1 | ROUTINE INSPECTION | EE0001049 | 5390 WHITTIER BLVD | LOS ANGELES | FA0171769 | CA | 90022-4032 | OW0178828 | SANDRA'S TAMALES INC. | RESTAURANT (0-30) SEATS MODERATE RISK | 1631 | SANDRA'S TAMALES | ACTIVE | PR0164225 |
| DAXMBTIRZ | 2018-02-12 | CAFE GRATITUDE | 97 | A | 1 | ROUTINE INSPECTION | EE0000828 | 639 N LARCHMONT BLVD STE #102 | LOS ANGELES | FA0058921 | CA | 90004 | OW0005704 | CAFE GRATITUDE LARCHMONT LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | CAFE GRATITUDE | ACTIVE | PR0019854 |
| DAK8TBMS0 | 2015-09-10 | THE WAFFLE | 90 | A | 1 | ROUTINE INSPECTION | EE0000709 | 6255 W SUNSET BLVD STE #105 | LOS ANGELES | FA0051830 | CA | 90028 | OW0035796 | THE WAFFLE, LLC | RESTAURANT (61-150) SEATS HIGH RISK | 1638 | THE WAFFLE | ACTIVE | PR0010922 |
This table has more columns and messier real-world data. Try selecting specific columns from it.
Select the facility name, score, and grade from `los_angeles_restaurant_health_inspections`.
Checking Dimensions
Find out how many rows and columns `los_angeles_restaurant_health_inspections` has.
Key Takeaways
- Pandas is Python’s standard library for structured data analysis. The core data structure is the DataFrame — a table with typed columns.
- You’ll need import pandas as pd once you start calling pandas functions directly (like pd.merge() or pd.read_csv()). In these early lessons, everything works on preloaded DataFrames without an import.
- Select columns with df[['col1', 'col2']] for a DataFrame result, or df['col'] for a single Series.
- Use head(), shape, info(), dtypes, and describe() to understand a dataset before analyzing it.
- Comment on the why, not the what. Use # for single-line comments.
What’s Next
Now that you can inspect and select columns from a DataFrame, the next step is filtering rows. We’ll cover boolean indexing — the pandas way of saying “give me only the rows where this condition is true.” That’s when pandas starts getting genuinely powerful.