PySpark Interview Questions for Data Science Excellence

Written by: Nathan Rosidi

Navigating Through Essential PySpark Interview Questions from Basics to Advanced for Aspiring Data Scientists.

What sets apart a good data scientist from a great one? It's not just about knowing different tools and techniques; it's about understanding how and when to use them.

PySpark, the Python API for Apache Spark, is a crucial tool in modern data science. Its importance in processing large datasets and enabling distributed computing is undeniable.

In this article, I will walk you through a range of PySpark interview questions, from basic to advanced, so you can test yourself and see how and when to use these techniques in real life. Buckle up and let’s get started!
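One practical note before we start: the snippets below assume that each question's table (for example, 'airbnb_search_details') is already available as a PySpark DataFrame, which is how the StrataScratch editor provides it. If you want to practice locally, a minimal setup might look like the following sketch, where the CSV path is only a placeholder:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName('pyspark-interview-practice').getOrCreate()

# Placeholder example: load one of the question tables from a local CSV file
airbnb_search_details = spark.read.csv(
    'data/airbnb_search_details.csv',  # hypothetical path
    header=True,
    inferSchema=True,
)

airbnb_search_details.printSchema()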

Basic PySpark Interview Questions

Starting with the basics is important for building a strong foundation. This section on Basic PySpark Interview Questions is designed for beginners or those new to PySpark.

PySpark Interview Questions #1: Find out search details for apartments designed for a sole-person stay

This question focuses on extracting details of searches made by people looking for apartments suitable for a single person.

Difficulty: Easy | ID: 9615

Find the search details for apartments where the property type is Apartment and the accommodation is suitable for one person.

Link to this question: https://platform.stratascratch.com/coding/9615-find-out-search-details-for-apartments-designed-for-a-sole-person-stay?code_type=6

First, we need to find out which searches were for single-person apartments. Let’s break down the solution.

  • Start with the dataset 'airbnb_search_details', which contains information about what people are searching for.
  • Filter this data to find searches that meet two criteria: the accommodation is for one person ('accommodates' equals 1) and the property type is 'Apartment'.
  • After applying these filters, convert the filtered data into a pandas dataframe. This is done for easier analysis and handling of the data.

In simple terms, we're identifying and listing the search details for apartments that are meant for only one person. Let’s see the code.

import pyspark.sql.functions as F

result = airbnb_search_details.filter((F.col('accommodates') == 1) & (F.col('property_type') == 'Apartment')).toPandas()


PySpark Interview Questions #2: Users Activity Per Month Day

This question is about finding out how active users are on Facebook on different days of the month, measured by the number of posts made each day.


Difficulty: Easy | ID: 2006

Count the number of posts made on each day of the month, aggregated across all months and years. Output the day of the month and the corresponding number of posts made on that day.

Link to this question: https://platform.stratascratch.com/coding/2006-users-activity-per-month-day?code_type=6

We are analyzing the number of Facebook posts made on each day of the month. Let’s break down the solution.

  • Start with the 'facebook_posts' dataset, which records when posts are made.
  • First, convert the 'post_date' to a standard date format.
  • Then, group the posts by the day of the month they were made on.
  • For each day, count the number of posts ('post_id'), which we label as 'user_activity'.
  • Order the results by the day of the month.
  • Finally, convert the PySpark dataframe into a pandas dataframe for easier analysis.

In short, we are figuring out how many posts are made on each day of the month on Facebook and presenting this data in a clear and organized way. Let’s see the code.

import pyspark.sql.functions as F

result = facebook_posts.withColumn('post_date', F.to_date(F.col('post_date'))) \
    .groupBy(F.dayofmonth('post_date').alias('day')) \
    .agg(F.count('post_id').alias('user_activity')) \
    .orderBy('day') \
    .toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

post_date | user_activity
1 | 3
2 | 3

PySpark Interview Questions #3: Customers Who Purchased the Same Product

This question asks us to identify customers who have bought the same furniture items and to provide details like the product ID, brand name, and the count of unique customers for each furniture item, arranged in order of popularity.


Difficulty: Medium | ID: 2150

In order to improve customer segmentation efforts for users interested in purchasing furniture, you have been asked to find customers who have purchased the same items of furniture.

Output the product_id, brand_name, unique customer ID's who purchased that product, and the count of unique customer ID's who purchased that product. Arrange the output in descending order with the highest count at the top.

Link to this question: https://platform.stratascratch.com/coding/2150-customers-who-purchased-the-same-product?code_type=6

We are looking to improve customer segmentation by finding out which furniture items are bought by the same customers. Let’s break down the solution.

  • Start by merging two datasets: 'online_orders' (which contains customer orders) and 'online_products' (which contains product details). We link them using 'product_id'.
  • Then, filter this merged data to include only furniture items.
  • Create a list showing product ID, brand name, and customer ID, making sure to remove any duplicates.
  • Next, group the data by product ID and count the distinct customer IDs for each product, labeling this count as 'unique_cust_no'.
  • Then order this grouped data so that the products with the highest number of unique customers are at the top.
  • Merge this sorted data back with our original furniture list.
  • Finally, convert this PySpark dataframe into a pandas dataframe for easier handling and presentation.

In short, we are finding and listing furniture items based on the number of unique customers who bought them, showing the most popular items first. Let’s see the code.

import pyspark.sql.functions as F

merged = online_orders.join(online_products, on="product_id", how="inner")
merged = merged.filter(merged["product_class"] == "FURNITURE")
merged = merged.select("product_id", "brand_name", "customer_id").dropDuplicates()
unique_cust = (
    merged.groupBy("product_id")
    .agg(F.countDistinct("customer_id").alias("unique_cust_no"))
    .orderBy(F.desc("unique_cust_no"))
)
result = merged.join(unique_cust, on="product_id", how="inner").orderBy(F.desc("unique_cust_no"))

result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

product_id | brand_name | customer_id | unique_cust_no
10 | American Home | 2 | 3
10 | American Home | 3 | 3
10 | American Home | 1 | 3
8 | Lucky Joe | 3 | 1
11 | American Home | 1 | 1

PySpark Interview Questions #4: Sorting Movies By Duration Time

This question requires organizing a list of movies based on their duration, with the longest movies shown first.


Difficulty: Easy | ID: 2163

You have been asked to sort movies according to their duration in descending order.

Your output should contain all columns sorted by the movie duration in the given dataset.

Link to this question: https://platform.stratascratch.com/coding/2163-sorting-movies-by-duration-time?code_type=6

We need to arrange movies by their length, starting with the longest. Let’s break down the solution.

  • Begin with the 'movie_catalogue', which includes details about various movies.
  • Extract the duration in minutes from the 'duration' column. This involves finding the number in the text and converting it to a float (a number that can have decimals).
  • Next, sort the entire movie catalogue based on these duration numbers, putting the longest movies at the top.
  • After sorting, remove the 'movie_minutes' column, as it's no longer needed.
  • Finally, convert the sorted data into a pandas dataframe.

In simple terms, we are putting the movies in order from longest to shortest based on their duration. Let’s see the code.

import pyspark.sql.functions as F

movie_catalogue = movie_catalogue.withColumn(
    "movie_minutes",
    F.regexp_extract(F.col("duration"), r"(\d+)", 1).cast("float")
)

result = movie_catalogue.orderBy(F.desc("movie_minutes")).drop("movie_minutes")

result.toPandas()


PySpark Interview Questions #5: Find the date with the highest opening stock price

This question is about identifying the date when Apple's stock had its highest opening price.

Difficulty: Easy | ID: 9613

Find the date when Apple's opening stock price reached its maximum

Link to this question: https://platform.stratascratch.com/coding/9613-find-the-date-with-the-highest-opening-stock-price?code_type=6

We are tasked with finding out the day when Apple's stock opened at its maximum value. Let’s break down the solution.

  • Start with the 'aapl_historical_stock_price' dataset, which has Apple's stock price history.
  • Modify the 'date' column to ensure it's in a string format showing only the year, month, and day.
  • Next, find the maximum value in the 'open' column, which represents the opening stock price.
  • Then filter the dataset to find the date(s) when this maximum opening price occurred.
  • Finally, select only the 'date' column and convert the data to a pandas dataframe for easy viewing.

In summary, we are pinpointing the date when Apple's stock had its highest opening price and presenting this information in a straightforward manner. Let’s see the code.

import pyspark.sql.functions as F

df = aapl_historical_stock_price
# Keep only the year, month, and day (the first 10 characters of the date string)
df = df.withColumn('date', df['date'].cast('string').substr(1, 10))

# Find the maximum opening price, then keep the date(s) on which it occurred
max_open = df.select(F.max('open')).collect()[0][0]
result = df.filter(df['open'] == max_open).select('date').toPandas()


Intermediate PySpark Interview Questions

Once the fundamentals are mastered, the next step is to tackle more complex scenarios. The Intermediate PySpark Interview Questions section is tailored for those who already have a basic understanding of PySpark.

PySpark Interview Questions #6: Find the first and last times the maximum score was awarded

This question asks us to find out the earliest and latest dates on which the highest score was given in Los Angeles restaurant health inspections.


Difficulty: Medium | ID: 9712

Find the first and last times the maximum score was awarded

Link to this question: https://platform.stratascratch.com/coding/9712-find-the-first-and-last-times-the-maximum-score-was-awarded?code_type=6

We are looking for the first and last instances when the maximum health inspection score was awarded to restaurants in Los Angeles. Let’s break down the solution.

  • Start by identifying the highest score given in the 'los_angeles_restaurant_health_inspections' dataset.
  • Next, ensure the 'activity_date' is in a date format for accurate comparison.
  • Then find the earliest date ('first_time') when this maximum score was awarded by filtering for this score and selecting the minimum date.
  • Similarly, find the latest date ('last_time') by selecting the maximum date for the same score.
  • Combine these two dates to get a result showing both the first and last times this score was given.
  • Finally, convert this combined data into a pandas dataframe for easy viewing.

In summary, we are identifying the first and last occurrences of the highest health inspection score awarded to restaurants in Los Angeles. Let’s see the code.

import pyspark.sql.functions as F

max_score = los_angeles_restaurant_health_inspections.select(F.max("score")).first()[0]
los_angeles_restaurant_health_inspections = los_angeles_restaurant_health_inspections.withColumn("activity_date", F.to_date("activity_date"))
first_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.min("activity_date").alias("first_time"))
last_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.max("activity_date").alias("last_time"))
result = first_time.crossJoin(last_time)

result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

first_time | last_time
2015-09-11 | 2018-03-16

PySpark Interview Questions #7: Account Registrations

This question requires us to calculate the number of account signups per month, showing the year and month along with the corresponding number of registrations.


Difficulty: Easy | ID: 2126

Find the number of account registrations according to the signup date. Output the year months (YYYY-MM) and their corresponding number of registrations.

Link to this question: https://platform.stratascratch.com/coding/2126-account-registrations?code_type=6

We are tasked with finding out how many accounts were registered each month. Let’s break down the solution.

  • Begin with the 'noom_signups' dataset, which has data on when accounts were registered.
  • Create a new column 'started_at_month' that formats the 'started_at' date to show only the year and month (in 'YYYY-MM' format).
  • Next, group the data by this new 'started_at_month' column.
  • Count the number of registrations for each month and rename this count as 'n_registrations'.
  • Then sort the data by the month and year.
  • Finally, convert this sorted data into a pandas dataframe for easy reading.

In simple terms, we are tallying the number of account signups for each month and displaying them in an organized and chronological manner. Let’s see the code.

import pyspark.sql.functions as F

noom_signups = noom_signups.withColumn('started_at_month', F.date_format('started_at', 'yyyy-MM'))
result = noom_signups.groupby('started_at_month').count().withColumnRenamed('count', 'n_registrations').orderBy('started_at_month')
result.toPandas()


PySpark Interview Questions #8: Process a Refund

This question asks for the calculation of the minimum, average, and maximum number of days it takes to process a refund for accounts opened since January 1, 2019, and to group these calculations by billing cycle in months.


Difficulty: Medium | ID: 2125

Calculate and display the minimum, average and the maximum number of days it takes to process a refund for accounts opened from January 1, 2019. Group by billing cycle in months.

Note: The time frame for a refund to be fully processed is from settled_at until refunded_at.

Link to this question: https://platform.stratascratch.com/coding/2125-process-a-refund?code_type=6

We need to analyze the duration of refund processing for accounts opened from 2019 onwards, grouped by their billing cycle duration. Let’s break down the solution.

  • Begin by joining three datasets: 'noom_transactions', 'noom_signups', and 'noom_plans', linking them via 'signup_id' and 'plan_id'.
  • Filter these combined datasets to include only transactions from accounts started on or after January 1, 2019.
  • Calculate 'time_to_settle', the number of days between 'settled_at' and 'refunded_at'.
  • Next, group the data by 'billing_cycle_in_months'.
  • For each billing cycle, calculate the minimum, average, and maximum refund processing time.
  • Sort these groups by the billing cycle length.
  • Finally, convert this grouped and calculated data into a pandas dataframe for easier interpretation.

In summary, we're measuring how long refunds take for different billing cycles, starting from 2019, and presenting this data in an organized manner. Let’s see the code.

import pyspark.sql.functions as F

transactions_signups = noom_transactions.join(noom_signups, on='signup_id')
transactions_signups_plans = transactions_signups.join(noom_plans, on='plan_id')
new_signups_transactions = transactions_signups_plans.filter(transactions_signups_plans['started_at'] >= '2019-01-01')
new_signups_transactions = new_signups_transactions.withColumn('time_to_settle', F.datediff(new_signups_transactions['refunded_at'], new_signups_transactions['settled_at']))
result = new_signups_transactions.groupby('billing_cycle_in_months').agg(
    F.min('time_to_settle').alias('_min'),
    F.mean('time_to_settle').alias('_mean'),
    F.max('time_to_settle').alias('_max')
).sort('billing_cycle_in_months')
result.toPandas()


PySpark Interview Questions #9: Highest Salary

This question requires us to identify the employee (or employees) who has the highest salary, and to display their first name and the amount of their salary.

Difficulty: Medium | ID: 9870

You have been asked to find the employee with the highest salary. Output the worker or worker's first name, as well as the salary.

Link to this question: https://platform.stratascratch.com/coding/9870-highest-salary?code_type=6

We need to find the employee with the highest salary in the dataset. Let’s break down the solution.

  • Start with the 'worker' dataset, which includes details about employees and their salaries.
  • First determine the highest salary in the dataset using the 'max' function.
  • Next, filter the dataset to find the employee(s) who have this highest salary.
  • Then select the 'first_name' and 'salary' columns to display.
  • Finally, convert this information into a pandas dataframe to make it more readable.

In summary, we are identifying the employee(s) with the top salary and presenting their first name along with the salary amount. Let’s see the code.

import pyspark.sql.functions as F

result = worker.filter(F.col('salary') == worker.select(F.max('salary')).collect()[0][0]) \
    .select('first_name', 'salary') \
    .toPandas()


Here is the output.

All required columns and the first 5 rows of the solution are shown

first_name | salary
Jura | 980000

PySpark Interview Questions #10: Find the average of inspections scores between 91 and 100

This question asks us to calculate the average (mean) of health inspection scores for Los Angeles restaurants that fall between 91 and 100, assuming that these scores are normally distributed.

Difficulty: Easy | ID: 9707

Find the mean of inspections scores between 91 and 100.

Link to this question: https://platform.stratascratch.com/coding/9707-find-the-average-of-inspections-scores-between-91-and-100?code_type=6

We are finding the average score of restaurant health inspections that are in the range of 91 to 100. Let’s break down the solution.

  • Start with the 'los_angeles_restaurant_health_inspections' dataset.
  • Filter this data to include only the scores that are between 91 and 100.
  • Next, calculate the average (mean) of these scores.
  • Finally, convert this calculated average into a pandas dataframe for easy viewing and interpretation.

In simple terms, we are determining the average score of health inspections for Los Angeles restaurants, focusing on scores between 91 and 100. Let’s see the code.

import pyspark.sql.functions as F

score_between = los_angeles_restaurant_health_inspections.filter(F.col('score').between(91, 100))
result = score_between.select(F.mean('score')).toPandas()
result


Advanced PySpark Interview Questions

For those who have confidently navigated through the basics and intermediate levels, the Advanced PySpark Interview Questions section awaits. This is where complex, real-world data problems are addressed. It’s designed for individuals who are comfortable with PySpark and are looking to deepen their expertise.

PySpark Interview Questions #11: Find how the survivors are distributed by the gender and passenger classes

This question asks us to determine the distribution of Titanic survivors based on their gender and the class they traveled in. The classes are categorized into first, second, and third class based on the 'pclass' value.

Difficulty: Medium | ID: 9882

Find how the survivors are distributed by the gender and passenger classes. Classes are categorized based on the pclass value as follows:
pclass = 1: first_class
pclass = 2: second_class
pclass = 3: third_class
Output the sex along with the corresponding number of survivors for each class.
HINT: each sex should be on a separate line, with one column holding the value of that sex and the other 3 columns holding the number of survivors for each of the 3 classes.

Link to this question: https://platform.stratascratch.com/coding/9882-find-how-the-survivors-are-distributed-by-the-gender-and-passenger-classes/official-solution?code_type=6

We need to find out how many survivors there were in each passenger class, broken down by gender. Let’s break down the solution.

  • We start with the 'titanic' dataset and filter it to include only the records of survivors ('survived' == 1).
  • We then group these survivor records by 'sex' and 'pclass' and count the number of survivors in each group.
  • Next, we reorganize (pivot) this data so that each row represents a gender, and each column represents a class, showing the count of survivors in each category.
  • We rename the class columns to 'first_class', 'second_class', and 'third_class' for clarity.
  • Finally, we convert this pivoted data into a pandas dataframe, which will display each gender with the corresponding number of survivors in each of the three classes.

In summary, we are showcasing the number of Titanic survivors based on their gender and the class in which they were traveling. Let’s see the code.

import pyspark.sql.functions as F

survived = titanic.filter(titanic['survived'] == 1)
count = survived.groupby(['sex','pclass']).agg(F.count('*').alias('count'))
pivot = count.groupBy('sex').pivot('pclass').agg(F.sum('count'))
pivot = pivot.withColumnRenamed('1', 'first_class').withColumnRenamed('2', 'second_class').withColumnRenamed('3', 'third_class')
result = pivot.toPandas()
result



PySpark Interview Questions #12: Consecutive Days

This question is about identifying users who were active for three or more consecutive days.


Difficulty: Hard | ID: 2054

Find all the users who were active for 3 consecutive days or more.

Link to this question: https://platform.stratascratch.com/coding/2054-consecutive-days?code_type=6

We need to find users who have been active for at least three days in a row. Let’s break down the solution.

  • Start with the 'sf_events' dataset and remove any duplicate records.
  • Then ensure the 'date' column is in a standard date format (YYYY-MM-DD).
  • Next, assign a rank to each user's activity date. This rank is calculated within each user's set of dates, ordered chronologically.
  • Create a new column 'consecutive_days' by subtracting (rank - 1) days from the date. Consecutive dates then collapse to the same anchor date: for example, activity on the 1st, 2nd, and 3rd of a month gets ranks 1, 2, 3, and all three rows map back to the 1st.
  • Then group the data by 'user_id' and 'consecutive_days', counting the number of records in each group.
  • Filter these groups to keep only those where the count is three or more, indicating three or more consecutive days of activity.
  • Finally, select the 'user_id' of these active users and convert the data into a pandas dataframe.

In summary, we are pinpointing users who were active for three consecutive days or more and listing their IDs. Let’s see the code.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df = sf_events.dropDuplicates()
df = df.withColumn('date', F.to_date(df['date'], 'yyyy-MM-dd'))
df = df.withColumn('rank', F.row_number().over(Window.partitionBy('user_id').orderBy('date')))
df = df.withColumn('consecutive_days', F.date_sub(df['date'], df['rank'] - 1))
result = df.groupBy('user_id', 'consecutive_days').agg(F.count('*').alias('counter')).filter(F.col('counter') >= 3).select('user_id')
result.toPandas()


PySpark Interview Questions #13: Find all records with words that start with the letter 'g'

This question asks us to identify records from a dataset where either of two fields, 'words1' or 'words2', contains words starting with the letter 'g'.

Difficulty: Easy | ID: 9806

Find all records with words that start with the letter 'g'.
Output words1 and words2 if any of them satisfies the condition.

Link to this question: https://platform.stratascratch.com/coding/9806-find-all-records-with-words-that-start-with-the-letter-g?code_type=6

We need to find records with words beginning with 'g' in either of two columns ('words1' or 'words2'). Let’s break down the solution.

  • Begin with the 'google_word_lists' dataset.
  • Apply filters to both 'words1' and 'words2' columns to check if any word starts with the letter 'g'. We use the regular expression ('rlike') function for this purpose. The expression checks for words starting with 'g' either at the beginning of the string or preceded by a comma, space, or other delimiters.
  • After applying these filters, select the records that meet our criteria.
  • Finally, convert these filtered records into a pandas dataframe for easy viewing and analysis.

In simple terms, we are finding and listing records where either the 'words1' or 'words2' field contains a word that starts with the letter 'g'. Let’s see the code.

import pyspark.sql.functions as F

# A word starts with 'g' if 'g' appears at the very beginning of the string
# or right after a delimiter such as a comma or a space
pattern = r"(^|[,\s])g"

result = google_word_lists.filter(
    F.col('words1').rlike(pattern) | F.col('words2').rlike(pattern)
).toPandas()


PySpark Interview Questions #14: Top Teams In The Rio De Janeiro 2016 Olympics

This question asks us to identify the top three medal-winning teams for each event at the Rio De Janeiro 2016 Olympics, and display them as 'gold team', 'silver team', and 'bronze team' along with the number of medals they won. In case of a tie, the teams should be ordered alphabetically. If there is no team for a position, it should be labeled as 'No Team'.

Difficulty: Hard | ID: 9960

Find the top 3 medal-winning teams by counting the total number of medals for each event in the Rio De Janeiro 2016 olympics. In case there is a tie, order the countries by name in ascending order. Output the event name along with the top 3 teams as the 'gold team', 'silver team', and 'bronze team', with the team name and the total medals under each column in format "{team} with {number of medals} medals". Replace NULLs with "No Team" string.

Link to this question: https://platform.stratascratch.com/coding/9960-top-teams-in-the-rio-de-janeiro-2016-olympics?code_type=6

We are analyzing the 2016 Rio De Janeiro Olympics data to determine the top three teams in terms of medal counts for each event. Let’s break down the solution.

  • Start with the 'olympics_athletes_events' dataset and convert it to a pandas dataframe.
  • Filter this data to include only the 2016 Olympics and events where medals were awarded.
  • Then group the data by 'event' and 'team', counting the number of medals for each team in each event.
  • For each event, rank the teams based on their medal counts and, in case of a tie, alphabetically by team name.
  • Identify the top three teams for each event and label them as 'gold team', 'silver team', and 'bronze team', including the count of medals they won in a formatted string.
  • Group this data by 'event' and aggregate the top teams in their respective positions.
  • Replace any missing team positions with 'No Team'.
  • The final output includes the event name and the top three teams with their medal counts.

In summary, we are showcasing the leading medal-winning teams in each event from the 2016 Olympics, labeled according to their rank and presented in an easy-to-read format. Let’s see the code.

import pandas as pd

olympics_athletes_events = olympics_athletes_events.toPandas()

y_2016 = olympics_athletes_events[(olympics_athletes_events['year'] == 2016) & (olympics_athletes_events['medal'].notnull())]
n_medal = y_2016.groupby(['event', 'team']).size().to_frame('medals_count').reset_index()
n_medal['team_position'] = n_medal.sort_values(['medals_count', 'team'], ascending=[False, True]).groupby(['event']).cumcount() + 1
less_3_medals = n_medal[n_medal['team_position'] <= 3].copy()  # .copy() avoids pandas' SettingWithCopyWarning
less_3_medals.loc[less_3_medals['team_position'] == 1, 'gold_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 2, 'silver_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 3, 'bronze_team'] = less_3_medals['team'] + " with " + less_3_medals['medals_count'].astype(str) + " medals"
result = less_3_medals.groupby('event').agg({'gold_team': 'first', 'silver_team': 'first', 'bronze_team': 'first'}).reset_index().fillna('No Team')
result

Here is the output.

All required columns and the first 5 rows of the solution are shown

event | gold_team | silver_team | bronze_team
Archery Men's Individual | France with 1 medals | No Team | No Team
Athletics Men's Marathon | Kenya with 1 medals | No Team | No Team
Athletics Women's 100 metres | United States with 1 medals | No Team | No Team
Basketball Women's Basketball | Spain with 1 medals | United States with 1 medals | No Team
Boxing Women's Lightweight | Russia with 1 medals | No Team | No Team
Canoeing Men's Canadian Singles Slalom | France with 1 medals | No Team | No Team
Canoeing Men's Kayak Doubles 1000 metres | Australia with 1 medals | Germany with 1 medals | No Team
Canoeing Men's Kayak Doubles 200 metres | Great Britain with 1 medals | No Team | No Team
Canoeing Men's Kayak Fours 1000 metres | Czech Republic with 1 medals | No Team | No Team
Cycling Men's Road Race Individual | Belgium with 1 medals | No Team | No Team
Cycling Men's Team Pursuit 4000 metres | Denmark with 1 medals | No Team | No Team
Cycling Men's Team Sprint | New Zealand with 1 medals | No Team | No Team
Diving Men's Platform | Mexico with 1 medals | No Team | No Team
Football Women's Football | Germany with 2 medals | No Team | No Team
Gymnastics Men's Team All-Around | China with 1 medals | No Team | No Team
Gymnastics Women's Team All-Around | Russia with 1 medals | No Team | No Team
Hockey Men's Hockey | Argentina with 1 medals | Germany with 1 medals | No Team
Rowing Men's Coxless Fours | Great Britain with 1 medals | No Team | No Team
Rowing Men's Lightweight Coxless Fours | Switzerland with 1 medals | No Team | No Team
Rowing Women's Double Sculls | Great Britain with 1 medals | No Team | No Team
Rowing Women's Lightweight Double Sculls | China with 1 medals | No Team | No Team
Rugby Sevens Men's Rugby Sevens | Fiji with 1 medals | Great Britain with 1 medals | No Team
Rugby Sevens Women's Rugby Sevens | Australia with 1 medals | No Team | No Team
Sailing Men's Skiff | New Zealand with 1 medals | No Team | No Team
Shooting Women's Skeet | United States with 1 medals | No Team | No Team
Swimming Men's 4 x 100 metres Medley Relay | Australia with 1 medals | No Team | No Team
Swimming Men's 4 x 200 metres Freestyle Relay | United States with 1 medals | No Team | No Team
Swimming Women's 4 x 100 metres Freestyle Relay | Australia with 1 medals | No Team | No Team
Synchronized Swimming Women's Team | Japan with 1 medals | No Team | No Team
Taekwondo Women's Flyweight | Serbia with 1 medals | No Team | No Team
Volleyball Men's Volleyball | Italy with 1 medals | No Team | No Team
Water Polo Women's Water Polo | Italy with 2 medals | United States with 1 medals | No Team
Weightlifting Women's Middleweight | China with 1 medals | No Team | No Team
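Note that this solution converts the data to pandas first. As an optional exercise, here is a rough sketch of how the same steps could be expressed with PySpark DataFrame operations and a window function, assuming the same column names ('year', 'medal', 'event', 'team'); treat it as an illustrative alternative rather than the official solution.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Keep only 2016 records where a medal was awarded
medals_2016 = olympics_athletes_events.filter(
    (F.col('year') == 2016) & F.col('medal').isNotNull()
)

# Count medals per event and team
counts = medals_2016.groupBy('event', 'team').agg(F.count('*').alias('medals_count'))

# Rank teams within each event: most medals first, ties broken alphabetically
w = Window.partitionBy('event').orderBy(F.desc('medals_count'), F.asc('team'))
top3 = counts.withColumn('team_position', F.row_number().over(w)).filter(F.col('team_position') <= 3)

# Build the "{team} with {n} medals" label
labeled = top3.withColumn('label', F.format_string('%s with %d medals', F.col('team'), F.col('medals_count')))

# One row per event, one column per podium position; missing positions become "No Team"
result_spark = (labeled.groupBy('event')
    .pivot('team_position', [1, 2, 3])
    .agg(F.first('label'))
    .withColumnRenamed('1', 'gold_team')
    .withColumnRenamed('2', 'silver_team')
    .withColumnRenamed('3', 'bronze_team')
    .fillna('No Team')
    .orderBy('event'))

result_spark.toPandas()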

PySpark Interview Questions #15: Exclusive Amazon Products

This question asks us to identify products that are only sold on Amazon and not available at Top Shop and Macy's. We need to list these exclusive products along with their name, brand, price, and rating.

Difficulty: Hard | ID: 9608

Find products which are exclusive to only Amazon and therefore not sold at Top Shop and Macy's. Your output should include the product name, brand name, price, and rating.

Two products are considered equal if they have the same product name and same maximum retail price (mrp column).

Link to this question: https://platform.stratascratch.com/coding/9608-exclusive-amazon-products?code_type=6

We are tasked with finding products that are exclusive to Amazon, meaning they aren't sold at Top Shop and Macy's. Let’s break down the solution.

  • Start with the 'innerwear_amazon_com' dataset, which includes details about products sold on Amazon.
  • Check these Amazon products against the 'innerwear_macys_com' and 'innerwear_topshop_com' datasets to ensure they are not available at Macy's and Top Shop. We do this using 'leftanti' joins, which find records in the first dataset that do not have matching records in the second dataset (see the small example after this list). We use 'product_name' and 'mrp' (maximum retail price) as key fields to compare.
  • After performing these joins, select the 'product_name', 'brand_name', 'price', and 'rating' columns for the final output.
  • Finally, convert this filtered data into a pandas dataframe for easier analysis.
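As a quick illustration of what a 'leftanti' join keeps, here is a tiny, self-contained example with made-up data (it is not part of the question's datasets):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val'])
right = spark.createDataFrame([(2,), (3,)], ['id'])

# 'leftanti' returns only the left-side rows that have no match on the right,
# so this keeps just the row with id=1, because ids 2 and 3 exist in 'right'
left.join(right, on='id', how='leftanti').show()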

In short, we are identifying and listing products that are unique to Amazon by ensuring they are not sold at Top Shop and Macy's, based on their name and price. Let’s see the code.

result = (innerwear_amazon_com
    .join(innerwear_macys_com, ['product_name', 'mrp'], 'leftanti')
    .join(innerwear_topshop_com, ['product_name', 'mrp'], 'leftanti')
    .select('product_name', 'brand_name', 'price', 'rating')
    .toPandas())

result

Here is the output.

All required columns and the first 5 rows of the solution are shown

product_name | brand_name | price | rating
Calvin Klein Women's Bottoms Up Hipster Panty | Calvin-Klein | $11.00 | 4.5
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4
Calvin Klein Women's Carousel 3 Pack Thong | Calvin-Klein | $19.99 | 4
b.tempt'd by Wacoal Women's Lace Kiss Bralette | b-temptd | $11.65 | 4
Wacoal Women's Front Close T-Back Bra | Wacoal | $46.00 | 4.2
Calvin Klein Women's Modern Cotton Bralette and Bikini Set | Calvin-Klein | $44.00 | 4.6
Calvin Klein Women's 3 Pack Invisibles Hipster Panty | Calvin-Klein | $29.75 | 3.9
Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $11.95 | 4.2
Wacoal Women's Underwire Sport Bra | Wacoal | $65.00 | 4.3
Hanky Panky Women's Vikini Panty | Hanky-Panky | $30.00 | 4
Wacoal Women's Halo Underwire Bra | Wacoal | $48.00 | 4.4
Wacoal Women's Basic Beauty Contour Bra | Wacoal | $55.00 | 4
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.5
Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $51.52 | 4.2
Wacoal Women's Body By Wacoal Underwire Bra | Wacoal | $43.00 | 4.4
Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $60.00 | 4.1
Wacoal Women's Retro Chic Contour Bra | Wacoal | $65.00 | 4.2
Wacoal Women's Halo Strapless Bra | Wacoal | $46.00 | 4.2
Calvin Klein Women's Modern Cotton Boyshort Panty | Calvin-Klein | $19.49 | 4.4
Wacoal Women's Underwire Sport Bra | Wacoal | $52.67 | 4.3
b.tempt'd by Wacoal Womens Ciao Bella Tanga Panty | b-temptd | $19.00 | 4.1
Wacoal Women's Embrace Lace Bra | Wacoal | $44.50 | 4.3
Calvin Klein Women's Naked Glamour Strapless Push Up Bra | Calvin-Klein | $28.27 | 3.9
Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $41.40 | 4.3
Wacoal Women's Amazing Assets Contour Bra | Wacoal | $65.00 | 3.8
Wacoal Women's Sport Zip Front Contour Bra | Wacoal | $69.84 | 3.8
Wacoal Women's Awareness Underwire Bra | Wacoal | $65.00 | 4.4
Wacoal Women's Bodysuede Underwire Bra | Wacoal | $60.00 | 4.4
Wacoal Women's Underwire Sport Bra | Wacoal | $45.50 | 4.3
Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $60.00 | 4.1
Wacoal Embrace Lace Bikini Panty | Wacoal | $27.00 | 4.3
Wacoal Women's Halo Underwire Bra | Wacoal | $34.95 | 4.3
Wacoal Women's Lace Affair Bikini Panty | Wacoal | $22.00 | 4.8
Wacoal Women's Awareness Underwire Bra | Wacoal | $65.00 | 4.4
Wacoal Women's Basic Beauty Contour Bra | Wacoal | $55.00 | 4
Calvin Klein Women's Naked Glamour Strapless Push Up Bra | Calvin-Klein | $29.99 | 3.9
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $48.00 | 4.4
Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $46.00 | 4.3
Calvin Klein Women's 3 Pack Carousel Thong Panty | Calvin-Klein | $20.99 | 4.7
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $48.00 | 4.4
Calvin Klein Women's Sheer Marquisette Demi Unlined Bra | Calvin-Klein | $36.00 | 4.6
Wacoal Women's Basic Beauty Front Close Contour Bra | Wacoal | $55.00 | 4.2
Wacoal Women's Retro Chic Contour Bra | Wacoal | $52.97 | 4.2
Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $40.48 | 4.3
Wacoal Women's Body By Wacoal Underwire Bra | Wacoal | $39.46 | 4.4
Calvin Klein Women's Standard Radiant Cotton Bikini Panty | Calvin-Klein | $9.88 | 4.4
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4
b.tempt'd by Wacoal Women's Ciao Bella Balconette Bra | b-temptd | $38.00 | 4.3
Wacoal Women's Bodysuede Underwire Bra | Wacoal | $60.00 | 4.4
Wacoal Women's Halo Underwire Bra | Wacoal | $48.00 | 4.4
Wacoal Women's Embrace Lace Bra | Wacoal | $35.00 | 4.3
Calvin Klein Women's Naked Glamour Strapless Push Up Bra | Calvin-Klein | $29.99 | 3.8
Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $54.77 | 4.2
Wacoal Women's Embrace Lace Bikini Panty | Wacoal | $27.00 | 4.3
Calvin Klein Women's Sheer Marquisette Demi Unlined Bra | Calvin-Klein | $26.19 | 4.5
Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2
Calvin Klein Women's Everyday Lightly Lined Demi Bra | Calvin-Klein | $38.00 | 3.8
Calvin Klein Women's Bottoms Up Bikini Panty | Calvin-Klein | $8.00 | 4.3
Calvin Klein Women's ID Wide Waistband Unlined Triangle Cotton Bralette | Calvin-Klein | $19.01 | 3.8
Wacoal Women's Body By Wacoal Underwire Bra | Wacoal | $46.00 | 4.4
Calvin Klein Women's Sheer Marquisette Demi Unlined Bra | Calvin-Klein | $27.00 | 4.6
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $69.99 | 4.4
b.tempt'd by Wacoal Women's Ciao Bella Balconette Bra | b-temptd | $38.00 | 4.3
Wacoal Women's Red Carpet Strapless Bra | Wacoal | $65.00 | 4.4
Calvin Klein Women's Seductive Comfort Lift Strapless Multiway Bra | Calvin-Klein | $39.60 | 4.1
Wacoal Women's Slimline Seamless Minimizer Bra | Wacoal | $65.00 | 4.3
Wacoal Women's Awareness Underwire Bra | Wacoal | $55.25 | 4.4
Wacoal Women's Embrace Lace Bra | Wacoal | $50.00 | 4.3
Hanky Panky Women's Bare Godiva Thong Panty | Hanky-Panky | $25.00 | 4.2
Calvin Klein Women's Seductive Comfort Lift Strapless Multiway Bra | Calvin-Klein | $39.60 | 3.9
Wacoal Women's Halo Underwire Bra | Wacoal | $48.00 | 4.4
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $48.00 | 4.4
Calvin Klein Women's 4 Pack Stretch Lace Bikini Panty | Calvin-Klein | $28.01 | 4.1
Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $34.50 | 4.3
Wacoal Women's Red Carpet Strapless Bra | Wacoal | $65.00 | 4.4
Calvin Klein Women's ID Tanga Wide Waistband Cotton Panty | Calvin-Klein | $16.97 | 4.3
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4
Calvin Klein Women's Ombre 5 Pack Thong | Calvin-Klein | $59.99 | 5
Wacoal Women's How Perfect Soft Cup Bra | Wacoal | $60.00 | 4.2
Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2
Wacoal Women's Underwire Sport Bra | Wacoal | $42.90 | 4.3
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $54.60 | 4.4
Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2
Wacoal Womens Basic Beauty Contour T-Shirt Bra | Wacoal | $55.00 | 4.2
Calvin Klein Women's Perfectly Fit Lightly Lined Memory Touch T-Shirt Bra | Calvin-Klein | $41.40 | 4.3
Calvin Klein Women's Modern Cotton Bikini | Calvin-Klein | $11.36 | 4.7
Wacoal Women's Sport Contour Bra | Wacoal | $47.40 | 4.2
b.tempt'd by Wacoal Women's Ciao Bella Balconette Bra | b-temptd | $21.89 | 4.3
Wacoal Women's Sport Contour Bra | Wacoal | $65.96 | 4.2
Wacoal Women's Bodysuede Underwire Bra | Wacoal | $60.00 | 4.3
Wacoal Women's Halo Strapless Bra | Wacoal | $46.00 | 4.2

Conclusion

In this article, we covered key aspects of PySpark, presenting questions from basic to advanced levels. These are important for understanding real-world data science challenges and using PySpark effectively in various scenarios.

Continuous practice and engagement with practical problems are crucial for budding data scientists. Regularly tackling data projects and data science interview questions sharpens your PySpark skills, leading to greater proficiency.

Visit StrataScratch to go deeper into data science and PySpark. This is your pathway to advancing in your data science career.
