PySpark Interview Questions for Data Science Excellence

Navigating Through Essential PySpark Interview Questions from Basics to Advanced for Aspiring Data Scientists

What sets apart a good data scientist from a great one? It's not just about knowing different tools and techniques; it's about understanding how and when to use them.

PySpark, the Python API for Apache Spark, is a crucial tool in modern data science. Its value in processing large datasets through distributed computing is undeniable.
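
The answers below use each question's DataFrames directly, so a Spark session and the data are assumed to be already available, as they are on the platform. If you want to follow along locally, a minimal setup sketch could look like this (the CSV file name is just a placeholder for whatever data you have on hand):

from pyspark.sql import SparkSession

# Minimal local setup; assumes PySpark is installed, and the CSV path is a placeholder
spark = SparkSession.builder.appName('pyspark-interview-prep').getOrCreate()
airbnb_search_details = spark.read.csv('airbnb_search_details.csv', header=True, inferSchema=True)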

In this article, I will walk you through a range of PySpark interview questions, from basic to advanced, so you can test yourself and see how and when to use PySpark in real-life scenarios. Buckle up and let’s get started!

Basic PySpark Interview Questions

Starting with the basics is important for building a strong foundation. This section on Basic PySpark Interview Questions is designed for beginners or those new to PySpark.

PySpark Interview Questions #1: Find out search details for apartments designed for a sole-person stay

This question focuses on extracting details of searches made by people looking for apartments suitable for a single person.


DataFrame: airbnb_search_details
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9615-find-out-search-details-for-apartments-designed-for-a-sole-person-stay?code_type=6

First, we need to find out which searches were for single-person apartments. Let’s break down the solution.

  • Start with the dataset 'airbnb_search_details', which contains information about what people are searching for.
  • Filter this data to find searches that meet two criteria: the accommodation is for one person ('accommodates' equals 1) and the property type is 'Apartment'.
  • After applying these filters, convert the filtered data into a pandas dataframe. This is done for easier analysis and handling of the data.

In simple terms, we're identifying and listing the search details for apartments that are meant for only one person. Let’s see the code.

import pyspark.sql.functions as F

result = airbnb_search_details.filter((F.col('accommodates') == 1) & (F.col('property_type') == 'Apartment')).toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

id | price | property_type | room_type | amenities | accommodates | bathrooms | bed_type | cancellation_policy | cleaning_fee | city | host_identity_verified | host_response_rate | host_since | neighbourhood | number_of_reviews | review_scores_rating | zipcode | bedrooms | beds
5059214431.75ApartmentPrivate room{TV,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Breakfast,Heating,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Laptop friendly workspace","Private living room"}13Real BedstrictFALSENYCf2014-03-14 00:00:00Harlem01003021
10923708340.12ApartmentPrivate room{TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Pets live on this property",Cat(s),"Buzzer/wireless intercom",Heating,"Family/kid friendly",Washer,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials}11Real BedstrictFALSENYCt100%2014-06-30 00:00:00Harlem166911003111
1077375400.73ApartmentPrivate room{"Wireless Internet",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"}11Real BedmoderateTRUENYCt2015-04-04 00:00:00Sunset Park11001122011
13121821501.06ApartmentPrivate room{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,Heating,"Smoke detector","First aid kit",Essentials,Hangers,"Hair dryer",Iron,"Laptop friendly workspace"}11Real BedflexibleFALSENYCf2014-09-20 00:00:00Park Slope01121511
19245819424.85ApartmentPrivate room{Internet,"Wireless Internet",Kitchen,"Pets live on this property",Dog(s),Washer,Dryer,"Smoke detector","Fire extinguisher"}11Real BedmoderateFALSESFt2010-03-16 00:00:00Mission District12909411011
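
As a side note, 'filter' also accepts a SQL expression string, so the same condition can be written in a more SQL-like way. This is just an equivalent sketch, not the official solution:

# Equivalent filter expressed as a SQL string instead of column expressions
result = airbnb_search_details.filter(
    "accommodates = 1 AND property_type = 'Apartment'"
).toPandas()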

PySpark Interview Questions #2: Users Activity Per Month Day

This question is about finding out how active users are on Facebook on different days of the month, measured by the number of posts made each day.


DataFrame: facebook_posts
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/2006-users-activity-per-month-day?code_type=6

We are analyzing the number of Facebook posts made on each day of the month. Let’s break down the solution.

  • Start with the 'facebook_posts' dataset, which records when posts are made.
  • First, convert the 'post_date' to a standard date format.
  • Then, group the posts by the day of the month they were made on.
  • For each day, count the number of posts ('post_id'), which we label as 'user_activity'.
  • Order the results by the day of the month.
  • Finally, convert the PySpark dataframe into a pandas dataframe for easier analysis.

In short, we are figuring out how many posts are made on each day of the month on Facebook and presenting this data in a clear and organized way. Let’s see the code.

import pyspark.sql.functions as F

result = facebook_posts.withColumn('post_date', F.to_date(F.col('post_date'))) \
    .groupBy(F.dayofmonth('post_date').alias('day')) \
    .agg(F.count('post_id').alias('user_activity')) \
    .orderBy('day') \
    .toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

post_date | user_activity
1 | 3
2 | 3
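
One detail worth mentioning in an interview: F.count('post_id') counts every non-null value, so a post that somehow appeared twice would be counted twice. If you want each post counted only once per day, a variant (under that assumption) swaps in countDistinct:

# Variant: count each post only once per day
result = facebook_posts.withColumn('post_date', F.to_date(F.col('post_date'))) \
    .groupBy(F.dayofmonth('post_date').alias('day')) \
    .agg(F.countDistinct('post_id').alias('user_activity')) \
    .orderBy('day') \
    .toPandas()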

PySpark Interview Questions #3: Customers Who Purchased the Same Product

This question asks us to identify customers who have bought the same furniture items and to provide details like the product ID, brand name, and the count of unique customers for each furniture item, arranged in order of popularity.


DataFrames: online_orders, online_products
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/2150-customers-who-purchased-the-same-product?code_type=6

We are looking to improve customer segmentation by finding out which furniture items are bought by the same customers. Let’s break down the solution.

  • Start by merging two datasets: 'online_orders' (which contains customer orders) and 'online_products' (which contains product details). We link them using 'product_id'.
  • Then, filter this merged data to include only furniture items.
  • Create a list showing product ID, brand name, and customer ID, making sure to remove any duplicates.
  • Next, group the data by product ID and count the distinct customer IDs for each product, labeling this count as 'unique_cust_no'.
  • Then order this grouped data so that the products with the highest number of unique customers are at the top.
  • Merge this sorted data back with our original furniture list.
  • Finally, convert this PySpark dataframe into a pandas dataframe for easier handling and presentation.

In short, we are finding and listing furniture items based on the number of unique customers who bought them, showing the most popular items first. Let’s see the code.

import pyspark.sql.functions as F

merged = online_orders.join(online_products, on="product_id", how="inner")
merged = merged.filter(merged["product_class"] == "FURNITURE")
merged = merged.select("product_id", "brand_name", "customer_id").dropDuplicates()
unique_cust = (
    merged.groupBy("product_id")
    .agg(F.countDistinct("customer_id").alias("unique_cust_no"))
    .orderBy(F.desc("unique_cust_no"))
)
result = merged.join(unique_cust, on="product_id", how="inner").orderBy(F.desc("unique_cust_no"))

result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

product_id | brand_name | customer_id | unique_cust_no
10 | American Home | 2 | 3
10 | American Home | 1 | 3
10 | American Home | 3 | 3
8 | Lucky Joe | 3 | 1
11 | American Home | 1 | 1
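
Because the rows are already deduplicated, and assuming each product_id maps to a single brand_name, the join back to the counts can also be replaced with a window function. A sketch of that alternative:

from pyspark.sql.window import Window

# Attach the per-product customer count without a second join
w = Window.partitionBy('product_id')
result = merged.withColumn('unique_cust_no', F.count('customer_id').over(w)) \
    .orderBy(F.desc('unique_cust_no'))
result.toPandas()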

PySpark Interview Questions #4: Sorting Movies By Duration Time

This question requires organizing a list of movies based on their duration, with the longest movies shown first.


DataFrame: movie_catalogue
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/2163-sorting-movies-by-duration-time?code_type=6

We need to arrange movies by their length, starting with the longest. Let’s break down the solution.

  • Begin with the 'movie_catalogue', which includes details about various movies.
  • Extract the duration in minutes from the 'duration' column. This involves finding the number in the text and converting it to a float (a number that can have decimals).
  • Next, sort the entire movie catalogue based on these duration numbers, putting the longest movies at the top.
  • After sorting, remove the 'movie_minutes' column, as it's no longer needed.
  • Finally, convert the sorted data into a pandas dataframe.

In simple terms, we are putting the movies in order from longest to shortest based on their duration. Let’s see the code.

import pyspark.sql.functions as F

movie_catalogue = movie_catalogue.withColumn(
    "movie_minutes",
    F.regexp_extract(F.col("duration"), r"(\d+)", 1).cast("float")
)

result = movie_catalogue.orderBy(F.desc("movie_minutes")).drop("movie_minutes")

result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

show_id | title | release_year | rating | duration
s8083 | Star Wars: Episode VIII: The Last Jedi | 2017 | PG-13 | 152 min
s6201 | Avengers: Infinity War | 2018 | PG-13 | 150 min
s6326 | Black Panther | 2018 | PG-13 | 135 min
s8052 | Solo: A Star Wars Story | 2018 | PG-13 | 135 min
s8053 | Solo: A Star Wars Story (Spanish Version) | 2018 | PG-13 | 135 min
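
If you are unsure what regexp_extract returns, a quick sanity check on literal values helps (this assumes an active 'spark' session, which the platform provides):

# regexp_extract pulls the first captured group; cast('float') turns it into a number
spark.createDataFrame([('152 min',), ('2 Seasons',)], ['duration']) \
    .select(F.regexp_extract('duration', r'(\d+)', 1).cast('float').alias('movie_minutes')) \
    .show()
# -> 152.0 and 2.0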

PySpark Interview Questions #5: Find the date with the highest opening stock price

This question is about identifying the date when Apple's stock had its highest opening price.


DataFrame: aapl_historical_stock_price
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9613-find-the-date-with-the-highest-opening-stock-price?code_type=6

We are tasked with finding out the day when Apple's stock opened at its maximum value. Let’s break down the solution.

  • Start with the 'aapl_historical_stock_price' dataset, which has Apple's stock price history.
  • Modify the 'date' column to ensure it's in a string format showing only the year, month, and day.
  • Next, find the maximum value in the 'open' column, which represents the opening stock price.
  • Then filter the dataset to find the date(s) when this maximum opening price occurred.
  • Finally, select only the 'date' column and convert the data to a pandas dataframe for easy viewing.

In summary, we are pinpointing the date when Apple's stock had its highest opening price and presenting this information in a straightforward manner. Let’s see the code.

import pyspark.sql.functions as F

df = aapl_historical_stock_price
# Keep only the year-month-day part of the date string
df = df.withColumn('date', df['date'].cast('string').substr(1, 10))

# Find the highest opening price, then keep the date(s) where it occurred
max_open = df.select(F.max('open')).collect()[0][0]
result = df.filter(df['open'] == max_open).select('date').toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

date
2012-09-21
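
An alternative is to sort by the opening price and keep the top row. Note the trade-off: unlike the filter above, this sketch returns a single date even if the maximum opening price occurred on more than one day:

# Keep only the single top row by opening price
result = df.orderBy(F.desc('open')).select('date').limit(1).toPandas()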

Intermediate PySpark Interview Questions

Once the fundamentals are mastered, the next step is to tackle more complex scenarios. This section on Intermediate PySpark Interview Questions is tailored for those who already have a basic understanding of PySpark.

PySpark Interview Questions #6: Find the first and last times the maximum score was awarded

This question asks us to find out the earliest and latest dates on which the highest score was given in Los Angeles restaurant health inspections.


DataFrame: los_angeles_restaurant_health_inspections
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9712-find-the-first-and-last-times-the-maximum-score-was-awarded?code_type=6

We are looking for the first and last instances when the maximum health inspection score was awarded to restaurants in Los Angeles. Let’s break down the solution.

  • Start by identifying the highest score given in the 'los_angeles_restaurant_health_inspections' dataset.
  • Next, ensure the 'activity_date' is in a date format for accurate comparison.
  • Then find the earliest date ('first_time') when this maximum score was awarded by filtering for this score and selecting the minimum date.
  • Similarly, find the latest date ('last_time') by selecting the maximum date for the same score.
  • Combine these two dates to get a result showing both the first and last times this score was given.
  • Finally, convert this combined data into a pandas dataframe for easy viewing.

In summary, we are identifying the first and last occurrences of the highest health inspection score awarded to restaurants in Los Angeles. Let’s see the code.

import pyspark.sql.functions as F

max_score = los_angeles_restaurant_health_inspections.select(F.max("score")).first()[0]
los_angeles_restaurant_health_inspections = los_angeles_restaurant_health_inspections.withColumn("activity_date", F.to_date("activity_date"))
first_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.min("activity_date").alias("first_time"))
last_time = los_angeles_restaurant_health_inspections.filter(los_angeles_restaurant_health_inspections["score"] == max_score).select(F.max("activity_date").alias("last_time"))
result = first_time.crossJoin(last_time)

result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

first_time | last_time
2015-09-11 | 2018-03-16
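
The crossJoin is harmless here because each side is a single row, but the same output can come from one aggregation over the filtered data. A sketch of that alternative:

# Compute both dates in a single aggregation instead of a crossJoin
result = los_angeles_restaurant_health_inspections \
    .filter(F.col('score') == max_score) \
    .agg(F.min('activity_date').alias('first_time'),
         F.max('activity_date').alias('last_time'))
result.toPandas()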

PySpark Interview Questions #7: Account Registrations

This question requires us to calculate the number of account signups per month, showing the year and month along with the corresponding number of registrations.


DataFrame: noom_signups
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/2126-account-registrations?code_type=6

We are tasked with finding out how many accounts were registered each month. Let’s break down the solution.

  • Begin with the 'noom_signups' dataset, which has data on when accounts were registered.
  • Create a new column 'started_at_month' that formats the 'started_at' date to show only the year and month (in 'YYYY-MM' format).
  • Next, group the data by this new 'started_at_month' column.
  • Count the number of registrations for each month and rename this count as 'n_registrations'.
  • Then sort the data by the month and year.
  • Finally, convert this sorted data into a pandas dataframe for easy reading.

In simple terms, we are tallying the number of account signups for each month and displaying them in an organized and chronological manner. Let’s see the code.

import pyspark.sql.functions as F

noom_signups = noom_signups.withColumn('started_at_month', F.date_format('started_at', 'yyyy-MM'))
result = noom_signups.groupby('started_at_month').count().withColumnRenamed('count', 'n_registrations').orderBy('started_at_month')
result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

started_at_month | n_registrations
2018-10 | 1
2018-11 | 4
2018-12 | 1
2019-01 | 3
2019-02 | 3
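
The count()/withColumnRenamed pair can also be collapsed into a single agg call, which reads a bit more explicitly. An equivalent sketch:

# Same monthly counts, using agg + alias instead of count() + withColumnRenamed
result = noom_signups.groupBy('started_at_month') \
    .agg(F.count('*').alias('n_registrations')) \
    .orderBy('started_at_month')
result.toPandas()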

PySpark Interview Questions #8: Process a Refund

This question asks for the calculation of the minimum, average, and maximum number of days it takes to process a refund for accounts opened since January 1, 2019, and to group these calculations by billing cycle in months.


DataFrames: noom_signups, noom_transactions, noom_plans
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/2125-process-a-refund?code_type=6

We need to analyze the duration of refund processing for accounts opened from 2019 onwards, grouped by their billing cycle duration. Let’s break down the solution.

  • Begin by joining three datasets: 'noom_transactions', 'noom_signups', and 'noom_plans', linking them via 'signup_id' and 'plan_id'.
  • Filter these combined datasets to include only transactions from accounts started on or after January 1, 2019.
  • Calculate 'time_to_settle', the number of days between 'settled_at' and 'refunded_at'.
  • Next, group the data by 'billing_cycle_in_months'.
  • For each billing cycle, calculate the minimum, average, and maximum refund processing time.
  • Sort these groups by the billing cycle length.
  • Finally, convert this grouped and calculated data into a pandas dataframe for easier interpretation.

In summary, we're measuring how long refunds take for different billing cycles, starting from 2019, and presenting this data in an organized manner. Let’s see the code.

import pyspark.sql.functions as F

transactions_signups = noom_transactions.join(noom_signups, on='signup_id')
transactions_signups_plans = transactions_signups.join(noom_plans, on='plan_id')
new_signups_transactions = transactions_signups_plans.filter(transactions_signups_plans['started_at'] >= '2019-01-01')
new_signups_transactions = new_signups_transactions.withColumn('time_to_settle', F.datediff(new_signups_transactions['refunded_at'], new_signups_transactions['settled_at']))
result = new_signups_transactions.groupby('billing_cycle_in_months').agg(
    F.min('time_to_settle').alias('_min'),
    F.mean('time_to_settle').alias('_mean'),
    F.max('time_to_settle').alias('_max')
).sort('billing_cycle_in_months')
result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

billing_cycle_in_months | time_to_settle_min | time_to_settle_mean | time_to_settle_max
1 | 4 | 10.444 | 21
6 | 6 | 9 | 14
12 | 6 | 9.5 | 13
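
One thing that is easy to get backwards in an interview: F.datediff takes the later date first, so datediff(refunded_at, settled_at) is the number of days from settlement to refund. If you want the average shown with fewer decimals, F.round can be added inside the same agg. A sketch of that variant:

# Same aggregation, with the mean rounded to two decimals for readability
result = new_signups_transactions.groupby('billing_cycle_in_months').agg(
    F.min('time_to_settle').alias('_min'),
    F.round(F.mean('time_to_settle'), 2).alias('_mean'),
    F.max('time_to_settle').alias('_max')
).sort('billing_cycle_in_months')
result.toPandas()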

PySpark Interview Questions #9: Highest Salary

This question requires us to identify the employee (or employees) who has the highest salary, and to display their first name and the amount of their salary.


DataFrame: worker
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9870-highest-salary?code_type=6

We need to find the employee with the highest salary in the dataset. Let’s break down the solution.

  • Start with the 'worker' dataset, which includes details about employees and their salaries.
  • First determine the highest salary in the dataset using the 'max' function.
  • Next, filter the dataset to find the employee(s) who have this highest salary.
  • Then select the 'first_name' and 'salary' columns to display.
  • Finally, convert this information into a pandas dataframe to make it more readable.

In summary, we are identifying the employee(s) with the top salary and presenting their first name along with the salary amount. Let’s see the code.

import pyspark.sql.functions as F

result = worker.filter(F.col('salary') == worker.select(F.max('salary')).collect()[0][0]) \
    .select('first_name', 'salary') \
    .toPandas()


Here is the output.

All required columns and the first 5 rows of the solution are shown

first_name | salary
Amitah | 500000
Vivek | 500000
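
A window-based variant avoids the explicit collect() by ranking salaries and keeping the top rank. A sketch of that alternative (note that an unpartitioned window pulls all rows into a single partition, which is fine for a small table like this):

from pyspark.sql.window import Window

# Rank everyone by salary (highest first) and keep the top-ranked row(s); ties share rank 1
w = Window.orderBy(F.desc('salary'))
result = worker.withColumn('rnk', F.dense_rank().over(w)) \
    .filter(F.col('rnk') == 1) \
    .select('first_name', 'salary') \
    .toPandas()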

PySpark Interview Questions #10: Find the average of inspections scores between 91 and 100

This question asks us to calculate the average (mean) of health inspection scores for Los Angeles restaurants that fall between 91 and 100, assuming that these scores are normally distributed.


DataFrame: los_angeles_restaurant_health_inspections

Link to this question : https://platform.stratascratch.com/coding/9707-find-the-average-of-inspections-scores-between-91-and-100?code_type=6

We are finding the average score of restaurant health inspections that are in the range of 91 to 100. Let’s break down the solution.

  • Start with the 'los_angeles_restaurant_health_inspections' dataset.
  • Filter this data to include only the scores that are between 91 and 100.
  • Next, calculate the average (mean) of these scores.
  • Finally, convert this calculated average into a pandas dataframe for easy viewing and interpretation.

In simple terms, we are determining the average score of health inspections for Los Angeles restaurants, focusing on scores between 91 and 100. Let’s see the code.

import pyspark.sql.functions as F

score_between = los_angeles_restaurant_health_inspections.filter(F.col('score').between(91, 100))
result = score_between.select(F.mean('score')).toPandas()
result

Here is the output.

All required columns and the first 5 rows of the solution are shown

95.056

Advanced PySpark Interview Questions

For those who have confidently navigated through the basics and intermediate levels, the Advanced PySpark Interview Questions section awaits. This is where complex, real-world data problems are addressed. It’s designed for individuals who are comfortable with PySpark and are looking to deepen their expertise.

PySpark Interview Questions #11: Find how the survivors are distributed by the gender and passenger classes

This question asks us to determine the distribution of Titanic survivors based on their gender and the class they traveled in. The classes are categorized into first, second, and third class based on the 'pclass' value.


DataFrame: titanic
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9882-find-how-the-survivors-are-distributed-by-the-gender-and-passenger-classes/official-solution?code_type=6

We need to find out how many survivors there were in each passenger class, broken down by gender. Let’s break down the solution.

  • We start with the 'titanic' dataset and filter it to include only the records of survivors ('survived' == 1).
  • We then group these survivor records by 'sex' and 'pclass' and count the number of survivors in each group.
  • Next, we reorganize (pivot) this data so that each row represents a gender, and each column represents a class, showing the count of survivors in each category.
  • We rename the class columns to 'first_class', 'second_class', and 'third_class' for clarity.
  • Finally, we convert this pivoted data into a pandas dataframe, which will display each gender with the corresponding number of survivors in each of the three classes.

In summary, we are showcasing the number of Titanic survivors based on their gender and the class in which they were traveling. Let’s see the code.

import pyspark.sql.functions as F

survived = titanic.filter(titanic['survived'] == 1)
count = survived.groupby(['sex','pclass']).agg(F.count('*').alias('count'))
pivot = count.groupBy('sex').pivot('pclass').agg(F.sum('count'))
pivot = pivot.withColumnRenamed('1', 'first_class').withColumnRenamed('2', 'second_class').withColumnRenamed('3', 'third_class')
result = pivot.toPandas()
result


Here is the output.

All required columns and the first 5 rows of the solution are shown

sex | first_class | second_class | third_class
female | 7 | 9 | 15
male | 3 | 3 | 4
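
When the pivot values are known in advance, passing them to pivot() explicitly keeps the column order stable and saves Spark a pass over the data to discover the distinct values. A sketch of that variant:

# Pivot with an explicit list of class values
pivot = count.groupBy('sex').pivot('pclass', [1, 2, 3]).agg(F.sum('count'))
pivot = pivot.withColumnRenamed('1', 'first_class') \
    .withColumnRenamed('2', 'second_class') \
    .withColumnRenamed('3', 'third_class')
pivot.toPandas()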

PySpark Interview Questions #12: Consecutive Days

This question is about identifying users who were active for three or more consecutive days.


DataFrame: sf_events
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/2054-consecutive-days?code_type=6

We need to find users who have been active for at least three days in a row. Let’s break down the solution.

  • Start with the 'sf_events' dataset and remove any duplicate records.
  • Then ensure the 'date' column is in a standard date format (YYYY-MM-DD).
  • Next, assign a rank to each user's activity date. This rank is calculated within each user's set of dates, ordered chronologically.
  • Create a new column 'consecutive_days' by subtracting the rank from the date. This helps to identify consecutive days.
  • Then group the data by 'user_id' and 'consecutive_days', counting the number of records in each group.
  • Filter these groups to keep only those where the count is three or more, indicating three or more consecutive days of activity.
  • Finally, select the 'user_id' of these active users and convert the data into a pandas dataframe.

In summary, we are pinpointing users who were active for three consecutive days or more and listing their IDs. Let’s see the code.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df = sf_events.dropDuplicates()
df = df.withColumn('date', F.to_date(df['date'], 'yyyy-MM-dd'))
df = df.withColumn('rank', F.row_number().over(Window.partitionBy('user_id').orderBy('date')))
df = df.withColumn('consecutive_days', F.date_sub(df['date'], df['rank'] - 1))
result = df.groupBy('user_id', 'consecutive_days').agg(F.count('*').alias('counter')).filter(F.col('counter') >= 3).select('user_id')
result.toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

user_id
U4
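
The key step above is the classic gaps-and-islands trick: within a streak of consecutive days, the date grows by one exactly as the row number does, so "date minus (rank - 1)" stays constant for the whole streak. A toy illustration with made-up data (assumes an active 'spark' session and, like the solution above, Spark 3.0+ so date_sub accepts a column):

# For user U4, three consecutive dates collapse to the same anchor date
toy = spark.createDataFrame(
    [('U4', '2020-01-01'), ('U4', '2020-01-02'), ('U4', '2020-01-03'), ('U4', '2020-01-10')],
    ['user_id', 'date']
).withColumn('date', F.to_date('date'))
toy = toy.withColumn('rank', F.row_number().over(Window.partitionBy('user_id').orderBy('date')))
toy.withColumn('consecutive_days', F.date_sub('date', F.col('rank') - 1)).show()
# The first three rows share consecutive_days = 2020-01-01; 2020-01-10 starts a new group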

PySpark Interview Questions #13: Find all records with words that start with the letter 'g'

This question asks us to identify records from a dataset where either of two fields, 'words1' or 'words2', contains words starting with the letter 'g'.


DataFrame: google_word_lists
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9806-find-all-records-with-words-that-start-with-the-letter-g?code_type=6

We need to find records with words beginning with 'g' in either of two columns ('words1' or 'words2'). Let’s break down the solution.

  • Begin with the 'google_word_lists' dataset.
  • Apply filters to both 'words1' and 'words2' columns to check if any word starts with the letter 'g'. We use the regular expression ('rlike') function for this purpose. The expression checks for words starting with 'g' either at the beginning of the string or preceded by a comma, space, or other delimiters.
  • After applying these filters, select the records that meet our criteria.
  • Finally, convert these filtered records into a pandas dataframe for easy viewing and analysis.

In simple terms, we are finding and listing records where either the 'words1' or 'words2' field contains a word that starts with the letter 'g'. Let’s see the code.

import pyspark.sql.functions as F

# A word starts with 'g' when 'g' appears at the start of the string
# or right after a delimiter such as a comma or a space
pattern = r'(^|[, ])g'

result = google_word_lists.filter(
    F.col('words1').rlike(pattern) | F.col('words2').rlike(pattern)
).toPandas()

Here is the output.

All required columns and the first 5 rows of the solution are shown

words1 | words2
google,facebook,microsoft | flower,nature,sun
sun,nature | google,apple
beach,photo | facebook,green,orange

PySpark Interview Questions #14: Top Teams In The Rio De Janeiro 2016 Olympics

This question asks us to identify the top three medal-winning teams for each event at the Rio De Janeiro 2016 Olympics, and display them as 'gold team', 'silver team', and 'bronze team' along with the number of medals they won. In case of a tie, the teams should be ordered alphabetically. If there is no team for a position, it should be labeled as 'No Team'.


DataFrame: olympics_athletes_events
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9960-top-teams-in-the-rio-de-janeiro-2016-olympics?code_type=6

We are analyzing the 2016 Rio De Janeiro Olympics data to determine the top three teams in terms of medal counts for each event. Let’s break down the solution.

  • Start with the 'olympics_athletes_events' dataset and convert it to a pandas dataframe.
  • Filter this data to include only the 2016 Olympics and events where medals were awarded.
  • Then group the data by 'event' and 'team', counting the number of medals for each team in each event.
  • For each event, rank the teams based on their medal counts and, in case of a tie, alphabetically by team name.
  • Identify the top three teams for each event and label them as 'gold team', 'silver team', and 'bronze team', including the count of medals they won in a formatted string.
  • Group this data by 'event' and aggregate the top teams in their respective positions.
  • Replace any missing team positions with 'No Team'.
  • The final output includes the event name and the top three teams with their medal counts.

In summary, we are showcasing the leading medal-winning teams in each event from the 2016 Olympics, labeled according to their rank and presented in an easy-to-read format. Let’s see the code.

import pandas as pd
import numpy as np

olympics_athletes_events = olympics_athletes_events.toPandas()

y_2016 = olympics_athletes_events[(olympics_athletes_events['year'] == 2016) & (olympics_athletes_events['medal'].notnull())]
n_medal = y_2016.groupby(['event','team']).size().to_frame('medals_count').reset_index()
n_medal['team_position'] = n_medal.sort_values(['medals_count', 'team'], ascending=[False, True]).groupby(['event']).cumcount() + 1
less_3_medals = n_medal[n_medal['team_position'] <= 3].copy()
less_3_medals.loc[less_3_medals['team_position'] == 1, 'gold_team'] = less_3_medals['team'] + " with "+ less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 2, 'silver_team'] = less_3_medals['team'] + " with "+ less_3_medals['medals_count'].astype(str) + " medals"
less_3_medals.loc[less_3_medals['team_position'] == 3, 'bronze_team'] = less_3_medals['team'] + " with "+ less_3_medals['medals_count'].astype(str) + " medals"
result = less_3_medals.groupby('event').agg({'gold_team':'first', 'silver_team':'first', 'bronze_team':'first'}).reset_index().fillna('No Team')
result

Here is the output.

All required columns and the first 5 rows of the solution are shown

event | gold_team | silver_team | bronze_team
Archery Men's Individual | France with 1 medals | No Team | No Team
Athletics Men's Marathon | Kenya with 1 medals | No Team | No Team
Athletics Women's 100 metres | United States with 1 medals | No Team | No Team
Basketball Women's Basketball | Spain with 1 medals | United States with 1 medals | No Team
Boxing Women's Lightweight | Russia with 1 medals | No Team | No Team
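
This solution switches to pandas right away, which is perfectly fine for a result this small, but the ranking step can also stay in PySpark with a window function. A sketch of just that part, assuming you start from the original Spark DataFrame (i.e., before the toPandas() call above):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Medal counts per event and team, ranked with ties broken alphabetically
medals_2016 = olympics_athletes_events \
    .filter((F.col('year') == 2016) & F.col('medal').isNotNull()) \
    .groupBy('event', 'team') \
    .agg(F.count('*').alias('medals_count'))
w = Window.partitionBy('event').orderBy(F.desc('medals_count'), F.asc('team'))
top_3 = medals_2016.withColumn('team_position', F.row_number().over(w)) \
    .filter(F.col('team_position') <= 3)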

PySpark Interview Questions #15: Exclusive Amazon Products

This question asks us to identify products that are only sold on Amazon and not available at Top Shop and Macy's. We need to list these exclusive products along with their name, brand, price, and rating.


DataFrames: innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com
Expected Output Type: pandas.DataFrame

Link to this question : https://platform.stratascratch.com/coding/9608-exclusive-amazon-products?code_type=6

We are tasked with finding products that are exclusive to Amazon, meaning they aren't sold at Top Shop and Macy's. Let’s break down the solution.

  • Start with the 'innerwear_amazon_com' dataset, which includes details about products sold on Amazon.
  • Check these Amazon products against the 'innerwear_macys_com' and 'innerwear_topshop_com' datasets to ensure they are not available at Macy's and Top Shop. We do this using 'leftanti' joins, which find records in the first dataset that do not have matching records in the second dataset. We use 'product_name' and 'mrp' (maximum retail price) as key fields to compare.
  • After performing these joins, select the 'product_name', 'brand_name', 'price', and 'rating' columns for the final output.
  • Finally, convert this filtered data into a pandas dataframe for easier analysis.

In short, we are identifying and listing products that are unique to Amazon by ensuring they are not sold at Top Shop and Macy's, based on their name and price. Let’s see the code.

# Import your libraries
import pyspark

# Start writing code
(innerwear_amazon_com
    .join(innerwear_macys_com, ['product_name','mrp'], 'leftanti')
    .join(innerwear_topshop_com, ['product_name','mrp'], 'leftanti')
    .select('product_name','brand_name','price','rating').toPandas())

Here is the output.

All required columns and the first 5 rows of the solution are shown

product_name | brand_name | price | rating
Calvin Klein Women's Bottoms Up Hipster Panty | Calvin-Klein | $11.00 | 4.5
Wacoal Women's Retro Chic Underwire Bra | Wacoal | $60.00 | 4.4
Calvin Klein Women's Carousel 3 Pack Thong | Calvin-Klein | $19.99 | 4
b.tempt'd by Wacoal Women's Lace Kiss Bralette | b-temptd | $11.65 | 4
Wacoal Women's Front Close T-Back Bra | Wacoal | $46.00 | 4.2
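
If 'leftanti' is unfamiliar, a small toy example makes the semantics clear: it keeps only the rows of the left DataFrame that have no match on the join keys (assumes an active 'spark' session):

# Toy illustration of a left anti join: only unmatched left-side rows survive
left = spark.createDataFrame([('a', 10), ('b', 20), ('c', 30)], ['product_name', 'mrp'])
right = spark.createDataFrame([('b', 20)], ['product_name', 'mrp'])
left.join(right, ['product_name', 'mrp'], 'leftanti').show()
# Only the 'a' and 'c' rows remain, because ('b', 20) exists on the right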

Conclusion

In this article, we covered key aspects of PySpark, presenting questions from basic to advanced levels. These are important for understanding real-world data science challenges and using PySpark effectively in various scenarios.

Continuous practice and engagement with practical problems are crucial for budding data scientists. Regularly tackling data projects and data science interview questions sharpens your PySpark skills, leading to greater proficiency.

Visit StrataScratch to go deeper into data science and PySpark. This is your pathway to advancing in your data science career.
