Learn PySpark Joins Easily with This Guide

Written by: Nathan Rosidi
Master Inner, Left, and Complex Joins in PySpark with Real Interview Questions
PySpark joins aren't all that different from what you're used to in other languages like Python, R, or Java, but there are a few critical quirks you should watch out for.
In this article, we will explore these important concepts using real-world interview questions that range from easy to medium in difficulty.
Along the way, we'll cover the traps, best practices, and details that are easy to miss. But first, let's break things down and see what these joins are, how they really work, and how they can be used in PySpark. Sounds good? Let's start!
What is a PySpark Join?
PySpark joins let you combine rows from two DataFrames based on a common key. Several join types exist (a minimal sketch follows the list below):
- Inner
- Left
- Right
- Outer
- Cross
- Semi
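Before diving into the interview questions, here is a minimal sketch of the basic syntax. The employees and salaries DataFrames below are made up purely to illustrate the join call; only the pattern matters.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrames used only to demonstrate the syntax
employees = spark.createDataFrame([(1, "Allen"), (2, "Joe")], ["id", "first_name"])
salaries = spark.createDataFrame([(1, 5000), (3, 4000)], ["employee_id", "salary"])

# The join type is controlled by the how argument:
# "inner", "left", "right", "outer", "cross", "semi", "anti", ...
inner_join = employees.join(salaries, employees["id"] == salaries["employee_id"], how="inner")
left_join = employees.join(salaries, employees["id"] == salaries["employee_id"], how="left")

left_join.show()
With how="left", employee 2 keeps a row with a null salary, while the inner join drops it entirely.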
PySpark Joins Example: Left Join Explained
Each of these joins serves a different purpose. Now let's see a PySpark join in action.
For this example, we will use the dataset from a hard-level question, which we'll solve later in the article.
Find the number of employees who received the bonus and who didn't. Bonus values in employee table are corrupted so you should use values from the bonus table. Be aware of the fact that employee can receive more than one bonus. Output value inside has_bonus column (1 if they had bonus, 0 if not) along with the corresponding number of employees for each.
Link to this question: https://platform.stratascratch.com/coding/10081-find-the-number-of-employees-who-received-the-bonus-and-who-didnt
Preview the datasets
In this question, Microsoft and Dell want us to find the number of employees who received the bonus and those who didn’t. There are two datasets. Here is the first one.
And here is the second dataset.
But what if we want to know who received the highest and lowest bonuses?
To figure that out, we need to combine the bonus and employee tables. We use a left join on:
- bonus.worker_ref_id = employee.id
Then we sort by bonus_amount.
from pyspark.sql import functions as F
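# Cast bonus_amount to int so the ordering below is numeric rather than lexicographic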
bonus = bonus.withColumn("bonus_amount", F.col("bonus_amount").cast("int"))
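# Attach employee details to each bonus row with a left join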
joined = bonus.join(employee, bonus["worker_ref_id"] == employee["id"], how="left")
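# Highest bonus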
max_bonus = joined.orderBy(F.col("bonus_amount").desc()).select("first_name", "bonus_amount").limit(1)
# Lowest bonus
min_bonus = joined.orderBy("bonus_amount").select("first_name", "bonus_amount").limit(1)
# Show results
max_bonus.show()
min_bonus.show()
Here's the output.

You can see that Allen got the highest bonus and Joe got the lowest one.
Easy-Level PySpark Joins Question From Meta
Find all posts which were reacted to with a heart. For such posts output all columns from facebook_posts table.
Link to this question: https://platform.stratascratch.com/coding/10087-find-all-posts-which-were-reacted-to-with-a-heart
In this question, Meta has asked us to find all posts that were reacted to with a heart.
Preview the datasets
As you can see, there are two different datasets.
- facebook_reactions
- facebook_posts
Now let’s see the first dataset, facebook_reactions.
Here is the second dataset, facebook_posts.
Solving with PySpark Inner Join
To solve this question, we will perform a filter + inner join. First, we filter facebook_reactions where the reaction is “heart”. Then we join this result with facebook_posts using post_id as the key. Finally, we select all columns from facebook_posts. Here is the solution:
import pyspark.sql.functions as F
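# Keep only the post_ids that received a heart reaction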
heart = facebook_reactions.filter(F.col('reaction') == 'heart').select('post_id')
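# Inner join (PySpark's default) back to facebook_posts; drop duplicates for posts with several hearts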
result = heart.join(facebook_posts, on='post_id').dropDuplicates(['post_id'])
result.toPandas()
Here is the result.

Great, we’ve solved the easy question. Now, let’s continue with the medium question.
Medium-Level PySpark Joins Question From Meta
Calculate each user's average session time, where a session is defined as the time difference between a page_load and a page_exit. Assume each user has only one session per day. If there are multiple page_load or page_exit events on the same day, use only the latest page_load and the earliest page_exit, ensuring the page_load occurs before the page_exit. Output the user_id and their average session time.
Link to the question: https://platform.stratascratch.com/coding/10352-users-by-avg-session-time
In this question, Meta asked us to calculate each user's average session time, and of course, there are a lot of different conditions we should meet.
Preview the dataset
Let’s see the dataset, facebook_web_log.
Solution - Using Aliases in PySpark Joins for Complex Queries
We use the same dataset twice: once for the page_load events and once for the page_exit events. Each part is filtered and aliased to avoid column conflicts during the join. Next, we join them on user_id and keep only the records where the load time comes before the exit time. Aliasing is important here: it lets us reference each version of the data during grouping and filtering. Here is the code:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Filter for 'page_load' and 'page_exit' separately and join on 'user_id' with condition
df = facebook_web_log.filter(F.col('action') == 'page_load').select('user_id', 'timestamp').alias('load') \
.join(facebook_web_log.filter(F.col('action') == 'page_exit').select('user_id', 'timestamp').alias('exit'),
on='user_id', how='left') \
.filter(F.col('load.timestamp') < F.col('exit.timestamp'))
# Instead of converting to date, truncate the timestamp to the start of the day for grouping
df = df.withColumn('date_load', F.date_trunc('day', F.col('load.timestamp')))
# Group by user_id and date_load, calculate max load and min exit timestamps per day
df = df.groupBy('user_id', 'date_load') \
.agg(F.max('load.timestamp').alias('timestamp_load'), F.min('exit.timestamp').alias('timestamp_exit'))
# Calculate duration in seconds
df = df.withColumn('duration', F.col('timestamp_exit').cast('long') - F.col('timestamp_load').cast('long'))
# Filter out non-positive durations
df = df.filter(F.col('duration') > 0)
# Calculate the mean duration in seconds per user
df = df.groupBy('user_id') \
.agg(F.mean('duration').alias('avg_session_duration'))
# Convert the DataFrame to Pandas for final output
result = df.toPandas()
result
Here is the expected output.

Hard-Level PySpark Joins Question From Dell
Find the number of employees who received the bonus and who didn't. Bonus values in employee table are corrupted so you should use values from the bonus table. Be aware of the fact that employee can receive more than one bonus. Output value inside has_bonus column (1 if they had bonus, 0 if not) along with the corresponding number of employees for each.
Link to this question: https://platform.stratascratch.com/coding/10081-find-the-number-of-employees-who-received-the-bonus-and-who-didnt
Dell asked us to find the number of employees who received the bonus and who did not. We explored the datasets for this question at the beginning of the article, so let's jump straight to the solution.
Solving with PySpark Left Join and Null Handling
Let’s solve this by joining the employee table with the bonus table using employee.id = bonus.worker_ref_id.
Since some employees never appear in the bonus table, a left join keeps everyone; and because an employee can receive more than one bonus, we will count distinct employee IDs at the end.
Next, we create a new column, has_bonus, using a simple null check: if bonus_date is null, then 0, otherwise 1. Finally, we group by has_bonus and count the distinct employee IDs in each group. Here is the solution:
import pyspark.sql.functions as F
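# Left join keeps every employee, whether or not they appear in the bonus table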
merged_df = employee.join(bonus, employee['id'] == bonus['worker_ref_id'], how='left')
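# Unmatched rows have a null bonus_date, so flag them as 0, everyone else as 1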
merged_df = merged_df.withColumn('has_bonus', F.when(F.col('bonus_date').isNull(), 0).otherwise(1))
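# Count distinct employee IDs per flag, since one employee can receive several bonuses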
result = merged_df.groupby('has_bonus').agg(F.countDistinct('id').alias('id_count'))
result.toPandas()
Here's the output.

Common PySpark Joins Mistakes and How to Avoid Them
Joins can look easy until small mistakes sneak in. Here are a few common pitfalls that can throw off your results; the sketch after this list shows how to guard against most of them.
1. Forgetting to Handle Nulls
- When using left joins, always check for nulls in the columns coming from the right-hand table to identify unmatched records.
2. Not Casting Data Types
- If join keys have mismatched types, PySpark won't warn you; the join just fails silently, quietly dropping or mismatching rows, so cast your keys to a common type first.
3. Duplicates
- Joining across a one-to-many relationship can cause row explosions. Use .dropDuplicates() when needed to avoid them.
4. Inner Join by default
- Always specify the how argument explicitly; PySpark defaults to an inner join, which silently drops unmatched rows.
5. Alias Confusion
- When self-joining or filtering on the same table twice, always alias each filtered set. Otherwise, column references may conflict.
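Here is a short sketch that ties several of these points together. The customers and orders DataFrames are made up for illustration; what matters is the pattern: cast the join keys to a common type, state the join type explicitly, check for nulls produced by the left join, and de-duplicate when the right side can match more than once.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: the join key is an int on one side and a string on the other
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
orders = spark.createDataFrame([("1", 100), ("1", 250), ("3", 80)], ["customer_id", "amount"])

# Cast the join keys to a common type before joining
orders = orders.withColumn("customer_id", F.col("customer_id").cast("int"))

# State the join type explicitly instead of relying on the default inner join
joined = customers.join(orders, on="customer_id", how="left")

# Null check: customers with no orders have a null amount after the left join
joined.filter(F.col("amount").isNull()).show()

# One-to-many joins multiply rows; de-duplicate if you only need one row per customer
joined.dropDuplicates(["customer_id"]).show()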
Quick Tips for Writing Better PySpark Joins
- Always preview the datasets before joining them.
- After a join, count your rows to verify your output.
- Use coalesce() to handle nulls when calculating or aggregating (see the short sketch below).
- When in doubt, start with a small sample to test your operation.
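As a quick illustration of the last two tips, here is a hedged sketch that reuses merged_df and employee from the Dell solution above; it assumes bonus_amount comes from the bonus table, as in the earlier left-join example.
import pyspark.sql.functions as F

# Row-count sanity check: a one-to-many join can return more rows than the left table
print(employee.count(), merged_df.count())

# Treat a missing bonus as 0 before aggregating
with_bonus = merged_df.withColumn('bonus_amount', F.coalesce(F.col('bonus_amount').cast('int'), F.lit(0)))
with_bonus.groupby('id').agg(F.sum('bonus_amount').alias('total_bonus')).show()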
Final Thoughts on Mastering PySpark Joins
You’ve seen how different PySpark joins work. Each example came straight from real interview questions, so you get both theory and practice. We also covered common mistakes like null handling, duplicate rows, and mismatched data types. Use the tips to stay sharp, avoid bugs, and write joins that actually work.