Learn PySpark Joins Easily with This Guide

Written by: Nathan Rosidi

Master Inner, Left, and Complex Joins in PySpark with Real Interview Questions

PySpark joins aren’t all that different from the joins you may already know from languages like Python, R, or Java, but there are a few critical quirks you should watch out for.

In this article, we will explore these important concepts using real-world interview questions that range from easy to medium in difficulty.

Along the way, we will cover the traps, best practices, and details that are easy to miss. But first, let’s break down what these joins are, how they really work, and how they are used in PySpark. Sounds good? Let’s start!

What is a PySpark Join?

PySpark joins let you combine rows from two DataFrames based on a common key. Several join types exist (a minimal syntax sketch follows the list):

  • Inner
  • Left
  • Right
  • Outer
  • Cross
  • Semi
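
Before we get to the interview questions, here is a minimal sketch of the basic join syntax. The DataFrames and column names below are made up purely for illustration; the point is that the how argument selects the join type.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two tiny, made-up DataFrames that share the key column "id"
employees = spark.createDataFrame([(1, "Allen"), (2, "Joe"), (3, "Nina")], ["id", "name"])
bonuses = spark.createDataFrame([(1, 5000), (3, 1200)], ["id", "bonus_amount"])

# The how argument selects the join type: "inner", "left", "right", "outer", "cross", "left_semi", ...
employees.join(bonuses, on="id", how="inner").show()
employees.join(bonuses, on="id", how="left").show()

Switching how from "inner" to "left" is all it takes to keep the unmatched rows from the left-hand DataFrame.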

PySpark Joins Example: Left Join Explained

Each of these joins serves a different purpose. Now let's see a PySpark join in action.

In this example, we will use the dataset from a hard-level question, which we’ll solve later in the article.


DataFrames: employee, bonus
Expected Output Type: pandas.Series


Link to this question: https://platform.stratascratch.com/coding/10081-find-the-number-of-employees-who-received-the-bonus-and-who-didnt

Preview the datasets

In this question, Microsoft and Dell want us to find the number of employees who received the bonus and those who didn’t. There are two datasets. Here is the first one.

Table: employee


And here is the second dataset.

Table: bonus


But what if we want to know who received the highest and lowest bonuses?

To figure that out, we need to combine the bonus and employee tables. We use a left join on:

  • bonus.worker_ref_id = employee.id

Then we sort by bonus_amount.

from pyspark.sql import functions as F

# Cast bonus_amount to an integer so the sort is numeric
bonus = bonus.withColumn("bonus_amount", F.col("bonus_amount").cast("int"))

# Left join: keep every bonus row and attach the matching employee
joined = bonus.join(employee, bonus["worker_ref_id"] == employee["id"], how="left")

# Highest bonus
max_bonus = joined.orderBy(F.col("bonus_amount").desc()).select("first_name", "bonus_amount").limit(1)

# Lowest bonus
min_bonus = joined.orderBy("bonus_amount").select("first_name", "bonus_amount").limit(1)

# Show results
max_bonus.show()
min_bonus.show()


Here's the output.

PySpark Joins Example

You can see that Allen got the highest bonus and Joe got the lowest one.

Easy-Level PySpark Joins Question From Meta


DataFrames: facebook_reactions, facebook_posts
Expected Output Type: pandas.DataFrame

Link to this question: https://platform.stratascratch.com/coding/10087-find-all-posts-which-were-reacted-to-with-a-heart

In this question, Meta has asked us to find all posts that were reacted to with a heart.

Preview the datasets

As you can see, there are two different datasets.

  • facebook_reactions
  • facebook_posts

Now let’s see the first dataset, facebook_reactions.

Table: facebook_reactions


Here is the second dataset, facebook_posts.

Table: facebook_posts

Solving with PySpark Inner Join

To solve this question, we perform a filter followed by an inner join. First, we filter facebook_reactions to keep only rows where the reaction is “heart”. Then we join this result with facebook_posts using post_id as the key and drop duplicate posts so each one appears once. Here is the solution:

import pyspark.sql.functions as F

# Keep only the post_ids that received a heart reaction
heart = facebook_reactions.filter(F.col('reaction') == 'heart').select('post_id')

# Inner join (the default) with facebook_posts on post_id, then drop duplicate posts
result = heart.join(facebook_posts, on='post_id').dropDuplicates(['post_id'])
result.toPandas()

Here is the result.

PySpark Joins Example


Great, we’ve solved the easy question. Now, let’s continue with the medium question.

Medium-Level PySpark Joins Question From Meta


DataFrame: facebook_web_log
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10352-users-by-avg-session-time

In this question, Meta asked us to calculate each user's average session time, and there are several conditions we need to meet along the way.

Preview the dataset

Let’s see the dataset, facebook_web_log.

Table: facebook_web_log

Solution - Using Aliases in PySpark Joins for Complex Queries

We use the same dataset twice, once filtered for “page_load” actions and once for “page_exit” actions. Each filtered DataFrame is aliased to avoid column conflicts during the join. Next, we join them on user_id and keep only the records where the load timestamp comes before the exit timestamp. Aliasing is important here: it lets us reference each version of the data during grouping and filtering. Here is the code:

from pyspark.sql import functions as F

# Filter for 'page_load' and 'page_exit' separately and join on 'user_id' with condition
df = facebook_web_log.filter(F.col('action') == 'page_load').select('user_id', 'timestamp').alias('load') \
    .join(facebook_web_log.filter(F.col('action') == 'page_exit').select('user_id', 'timestamp').alias('exit'),
          on='user_id', how='left') \
    .filter(F.col('load.timestamp') < F.col('exit.timestamp'))

# Instead of converting to date, truncate the timestamp to the start of the day for grouping
df = df.withColumn('date_load', F.date_trunc('day', F.col('load.timestamp')))

# Group by user_id and date_load, calculate max load and min exit timestamps per day
df = df.groupBy('user_id', 'date_load') \
    .agg(F.max('load.timestamp').alias('timestamp_load'), F.min('exit.timestamp').alias('timestamp_exit'))

# Calculate duration in seconds
df = df.withColumn('duration', F.col('timestamp_exit').cast('long') - F.col('timestamp_load').cast('long'))

# Filter out non-positive durations
df = df.filter(F.col('duration') > 0)

# Calculate the mean duration in seconds per user
df = df.groupBy('user_id') \
    .agg(F.mean('duration').alias('avg_session_duration'))

# Convert the DataFrame to Pandas for final output
result = df.toPandas()

result


Here is the expected output.

PySpark Joins Example

Hard-Level PySpark Joins Question From Dell


DataFrames: employee, bonus
Expected Output Type: pandas.Series

Link to this question: https://platform.stratascratch.com/coding/10081-find-the-number-of-employees-who-received-the-bonus-and-who-didnt

Dell asked us to find the number of employees who received the bonus and who did not. We already explored the datasets for this question at the beginning of the article, so let’s go straight to the solution.

Solving with PySpark Left Join and Null Handling

Let’s solve this by joining the employee table with the bonus table using employee.id = bonus.worker_ref_id.

Since some employees have no bonus at all, we use a left join to include everyone; and because an employee can appear multiple times in the bonus table, we will count distinct IDs.

Next, we create a new column, has_bonus, using a simple null check: if bonus_date is null, then 0, otherwise 1. Finally, we group by has_bonus and count the distinct employee IDs in each group. Here is the solution:

import pyspark.sql.functions as F

# Left join so employees without a bonus are still included
merged_df = employee.join(bonus, employee['id'] == bonus['worker_ref_id'], how='left')

# 0 if there is no matching bonus (bonus_date is null), 1 otherwise
merged_df = merged_df.withColumn('has_bonus', F.when(F.col('bonus_date').isNull(), 0).otherwise(1))

# Count distinct employee IDs in each group
result = merged_df.groupby('has_bonus').agg(F.countDistinct('id').alias('id_count'))
result.toPandas()


Here's the output.

PySpark Joins Example

Common PySpark Joins Mistakes and How to Avoid Them

Joins can look easy until small mistakes sneak in. Here are a few common pitfalls that can throw off your results.

1. Forgetting to Handle Nulls

  • When using left joins, always check for nulls in the right-hand table's columns to identify unmatched records (see the sketch below).
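
As a rough, self-contained sketch (the employees and bonuses DataFrames are made up for illustration), unmatched rows show up as nulls on the right-hand side after a left join:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up data: one employee has a bonus, the other doesn't
employees = spark.createDataFrame([(1, "Allen"), (2, "Joe")], ["id", "name"])
bonuses = spark.createDataFrame([(1, 5000)], ["id", "bonus_amount"])

joined = employees.join(bonuses, on="id", how="left")

# Unmatched employees show up with null right-hand columns
joined.filter(F.col("bonus_amount").isNull()).show()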

2. Not Casting Data Types

  • If the join keys have mismatched data types, PySpark won't warn you; it may silently cast the keys and return unexpected matches, or none at all. Cast the keys to a common type first (see the sketch below).
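
A minimal sketch of the fix, assuming a made-up case where one DataFrame stores the key as a string and the other as an integer; casting both keys to the same type keeps the comparison explicit:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up DataFrames: the key is a string in one and an integer in the other
left = spark.createDataFrame([("1", "Allen"), ("2", "Joe")], ["id", "name"])
right = spark.createDataFrame([(1, 5000)], ["id", "bonus_amount"])

# Cast the string key to int so both sides compare on the same type
left = left.withColumn("id", F.col("id").cast("int"))
left.join(right, on="id", how="left").show()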

3. Duplicates

  • Joining on a one-to-many relationship can multiply rows (a row explosion). Use .dropDuplicates() when needed to bring the result back to one row per key (see the sketch below).
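
Here is a small, hypothetical sketch of the row-explosion effect and one way to tame it with .dropDuplicates():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

posts = spark.createDataFrame([(1, "hello")], ["post_id", "text"])
# One post, three reactions: a one-to-many relationship
reactions = spark.createDataFrame([(1, "heart"), (1, "like"), (1, "heart")], ["post_id", "reaction"])

joined = posts.join(reactions, on="post_id")   # 3 rows for a single post
deduped = joined.dropDuplicates(["post_id"])   # back to 1 row per post
joined.show()
deduped.show()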

4. Inner Join by default

  • PySpark's join() defaults to an inner join, so always specify the how argument explicitly instead of relying on the default (see the sketch below).
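
A tiny sketch with made-up DataFrames; both calls are valid, but the second one makes the intent obvious to the next reader:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
df2 = spark.createDataFrame([(1, "c")], ["id", "y"])

df1.join(df2, on="id").show()                # implicit inner join
df1.join(df2, on="id", how="inner").show()   # same result, explicit intent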

5. Alias Confusion

  • When self-joining or filtering the same table twice, always alias each filtered set; otherwise, column references become ambiguous (see the sketch below).
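
A minimal self-join sketch with a made-up events DataFrame, mirroring the aliasing pattern from the medium-level solution above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up event log: one load and one exit for the same user
events = spark.createDataFrame(
    [(1, "page_load", 10), (1, "page_exit", 25)],
    ["user_id", "action", "ts"],
)

loads = events.filter(F.col("action") == "page_load").alias("load")
exits = events.filter(F.col("action") == "page_exit").alias("exit")

# The aliases make load.ts and exit.ts unambiguous after the self-join
sessions = loads.join(exits, on="user_id").filter(F.col("load.ts") < F.col("exit.ts"))
sessions.show()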

Quick Tips for Writing Better PySpark Joins

  • Always preview the datasets before joining them.
  • After a join, count your rows to verify your output.
  • Use coalesce() to handle nulls when calculating or aggregating (see the sketch after this list).
  • When in doubt, start with a small sample to test your operation.
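
As an illustration of the coalesce() tip, here is a small sketch with made-up data where missing bonuses are counted as 0 in the average instead of being dropped:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

employees = spark.createDataFrame([(1, "Allen"), (2, "Joe")], ["id", "name"])
bonuses = spark.createDataFrame([(1, 5000)], ["id", "bonus_amount"])

joined = employees.join(bonuses, on="id", how="left")

# Count employees without a bonus as 0 in the average instead of ignoring them
joined.agg(F.avg(F.coalesce(F.col("bonus_amount"), F.lit(0))).alias("avg_bonus")).show()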

Final Thoughts on Mastering PySpark Joins

You’ve seen how different PySpark joins work. Each example came straight from real interview questions, so you get both theory and practice. We also covered common mistakes like null handling, duplicate rows, and mismatched data types. Use the tips to stay sharp, avoid bugs, and write joins that actually work.
