Python Pandas Interview Questions for Data Science (Part 2)

The concluding part of our series on using Pandas for Python Data Science interview questions

In the previous article, Python Pandas Interview Questions for Data Science Part 1, we looked at how to get data into Pandas and perform basic operations like

  • Sorting DataFrames
  • Handling Duplicates
  • Aggregations
  • Merging DataFrames
  • Calculated Fields

In the second part of the series, we will build on that knowledge and use it to solve more complex Python Pandas interview questions. We will look at areas like

  • Datetime Manipulation
  • Text Manipulation
  • Applying Functions
  • Advanced Aggregations
  • Offsetting Data
  • Statistics using Pandas

If you are not acquainted with the Pandas library, please go through that article. Let us begin.

Applying Functions

In the previous part of our series, we looked at how to create additional fields. We can also use functions from Python libraries, or user-defined functions, to manipulate values. To do this, we use the apply() method. While apply() is not truly vectorized under the hood, it is usually much faster and cleaner than manually iterating over each row of a Series. Here is an easy one from an Airbnb Data Science Interview, preceded by a quick sketch of apply() itself.
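A minimal sketch on made-up data (the prices and the 10% markup are purely illustrative):

# Import your libraries
import pandas as pd
 
# A toy Series of prices
prices = pd.Series([10.0, 25.5, 7.99])
 
# Apply a built-in function to every value
prices.apply(round)              # 10, 26, 8
# Apply a lambda function defined on the fly
prices.apply(lambda x: x * 1.1)  # add a 10% markup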

City With Most Amenities

Find the city with the most amenities across all their host's properties from the given dataset of searches for properties on Airbnb. Assume that each row represents a unique host. Output the name of the city as your solution.


You can try to solve this question here: https://platform.stratascratch.com/coding/9633-city-with-most-amenities

The problem uses the airbnb_search_details dataset, which contains these fields:

id: int64
price: float64
property_type: object
room_type: object
amenities: object
accommodates: int64
bathrooms: int64
bed_type: object
cancellation_policy: object
cleaning_fee: bool
city: object
host_identity_verified: object
host_response_rate: object
host_since: datetime64[ns]
neighbourhood: object
number_of_reviews: int64
review_scores_rating: float64
zipcode: int64
bedrooms: int64
beds: int64

Here is how the data is presented.

Table: airbnb_search_details (sample rows showing id, price, property_type, room_type, the comma-separated amenities string, city, and the remaining fields listed above)

Approach and Solution

Let us break this problem down. All the amenities for a given property are given in the amenities column. The amenities are listed in a string separated by commas. We find the number of amenities by simply splitting the string using the comma separator and taking the length of the resultant list.

# Import your libraries
import pandas as pd
 
# Get the number of amenities for each listing
airbnb_search_details['num_amenities'] = airbnb_search_details['amenities'].apply(lambda x: len(x.split(",")))
airbnb_search_details[['city', 'id', 'amenities', 'num_amenities']]


We can now simply sum up the number of amenities for each city and output the city with the maximum number of amenities. We can combine all these methods in a single line using method chaining.

# Summarize by city and output the city with the most amenities
airbnb_search_details.groupby(by = ['city'], as_index = False).agg(
    {'num_amenities': 'sum'}).sort_values(by = ['num_amenities'],
    ascending = False).reset_index()['city'][0]

We are not restricted to built-in functions; we can also write our own functions or define lambda functions on the fly. Let us try that in the following Python Pandas interview question. This one appeared in a City of San Francisco Data Science Interview.

Classify Business Type

Classify each business as a restaurant, cafe, school, or other. A business classified as a restaurant should have the word 'restaurant' in the business name. For cafes, either 'cafe', 'café', or 'coffee' should be in the business name. Schools will have the word 'school' in the name. If a business is not a restaurant, cafe, or school as per the above rules, it should be classified as 'other'.


You can solve this question here: https://platform.stratascratch.com/coding/9726-classify-business-type

This problem uses the sf_restaurant_health_violations dataset that has the following fields.

business_id: int64
business_name: object
business_address: object
business_city: object
business_state: object
business_postal_code: float64
business_latitude: float64
business_longitude: float64
business_location: object
business_phone_number: float64
inspection_id: object
inspection_date: datetime64[ns]
inspection_score: float64
inspection_type: object
violation_id: object
violation_description: object
risk_category: object


Approach and Solution

The only relevant field for this Python Pandas interview question is the business_name column. We write a lambda function and use the apply method to check which of the business rules a business satisfies. Once we have the category, we output the relevant fields as required in the problem.

# Import your libraries
import pandas as pd
 
# Classify as per the rules
sf_restaurant_health_violations['category'] = sf_restaurant_health_violations['business_name'].apply(
    lambda x: 'school' if x.lower().find('school') >= 0
    else 'restaurant' if x.lower().find('restaurant') >= 0
    else 'cafe' if (x.lower().find('cafe') >= 0
                    or x.lower().find('café') >= 0
                    or x.lower().find('coffee') >= 0)
    else 'other'
)
# Output relevant fields
sf_restaurant_health_violations[['business_name', 'category']].drop_duplicates()

As we can see, the apply method is a very powerful way of manipulating values and creating calculated fields using functions, whether user-defined or from a Python library.

Advanced Aggregations

We saw aggregations using the groupby method in the last article. Pandas supports other aggregations as well, and we can also create a spreadsheet-style pivot table. Let us see how to do that with Pandas, starting with a quick sketch of the syntax and then a problem from a City of San Francisco Data Science Interview.
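A minimal sketch of pivot_table() on toy data (the DataFrame and its column names are purely illustrative):

# Import your libraries
import pandas as pd
 
# Toy sales data
df = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'year':   [2020, 2021, 2020, 2021],
    'sales':  [100, 150, 80, 120]
})
# One row per region, one column per year, the max sales in each cell
pd.pivot_table(data = df, index = 'region', columns = 'year',
               values = 'sales', aggfunc = 'max', fill_value = 0)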

Make a pivot table to find the highest payment in each year for each employee

Find the highest payment for each employee across each of the years 2011 through to 2014. Output the result as a table with the year in columns and employee names along the rows sorted in alphabetical order.


You can try to solve this question here: https://platform.stratascratch.com/coding/10145-make-a-pivot-table-to-find-the-highest-payment-in-each-year-for-each-employee

The problem uses the sf_public_salaries dataset with the following fields.

id: int64
employeename: object
jobtitle: object
basepay: float64
overtimepay: float64
otherpay: float64
benefits: float64
totalpay: float64
totalpaybenefits: float64
year: int64
notes: datetime64[ns]
agency: object
status: object


Approach and Solution

While this problem might be a bit difficult in SQL, in Pandas it can be solved in a single line of code using the pivot_table() function. We simply pass the right arguments and get the desired output.

# Import your libraries
import pandas as pd
 
# Create the pivot table
pd.pivot_table(data = sf_public_salaries, columns = ['year'],
    index = 'employeename', values = 'totalpay',
    aggfunc = 'max', fill_value = 0).reset_index()

The pivot_table() method is quite powerful and can help solve complex aggregations very quickly. Here is another example. This one is from a Facebook Data Science Interview.

Time Between Two Events

Report the user with the least amount of time between a page load and their first scroll down. The output should include the user id, page load time, first scroll down time, and time between the two events in seconds.


You can solve this question here: https://platform.stratascratch.com/coding/9784-time-between-two-events

This problem uses the facebook_web_log dataset that has the following columns.

user_id: int64
timestamp: datetime64[ns]
action: object


Approach and Solution

Again, this interview question could be a little complex in SQL. However, in Pandas, this is relatively straightforward with the use of a pivot table. We start off by creating a pivot table with the earliest instance of each action for each user_id. Since we need only the page_load and scroll_down events, we keep only those columns in the output.

import pandas as pd
 
# Find the first instance of different actions
summ_df = pd.pivot_table(data = facebook_web_log, index = 'user_id',
    columns = 'action', aggfunc = 'min',
    values = 'timestamp').reset_index()[['user_id', 'page_load', 'scroll_down']]


Now the problem becomes very simple. We can directly calculate the duration by taking the difference between the scroll_down and page_load timestamps. We then output the user_id and other required fields for the user with the lowest duration.

# Calculate the duration between page load and first scroll down
summ_df['duration'] = summ_df['scroll_down'] - summ_df['page_load']
# Output the user details for the user with the lowest duration
summ_df.sort_values(by = ['duration'])[:1]

As you can see, the pivot_table() function allows us to do multiple aggregations without having to separate them out and merge them, as we might do in SQL.

Datetime Manipulation

Datetime manipulation is one of the most frequently tested areas in Data Science interviews. The ubiquity of datetime datasets, and the range of complexity that can be achieved with simple-looking data, makes it a popular testing area. Pandas has numerous datetime functions covering a wide range of datetime use cases. Once you have loaded data into Pandas in datetime format, you can access the relevant datetime functionality by invoking the .dt accessor, which exposes a wide variety of datetime methods over the entire Series.
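A minimal sketch of the .dt accessor on made-up timestamps:

# Import your libraries
import pandas as pd
 
# Toy datetime Series
ts = pd.Series(pd.to_datetime(['2021-01-15 08:30', '2021-03-02 17:45']))
 
ts.dt.year     # 2021, 2021
ts.dt.month    # 1, 3
ts.dt.hour     # 8, 17
ts.dt.weekday  # 4 (Friday), 1 (Tuesday)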

Let us use it in a real world Python Pandas interview question. This one is from a DoorDash Data Science Interview.

Avg Earnings per Weekday and Hour

Report the average earnings per hour for each day of the week. Use the customer_placed_order_datetime field for calculating the relevant date-time values. Earnings can be taken as the sum of the order_total field.
Consider Monday as the first day of the week.


You can solve this question here: https://platform.stratascratch.com/coding/2034-avg-earnings-per-weekday-and-hour

The problem uses the doordash_delivery dataset that has the following fields.

customer_placed_order_datetime: datetime64[ns]
placed_order_with_restaurant_datetime: datetime64[ns]
driver_at_restaurant_datetime: datetime64[ns]
delivered_to_consumer_datetime: datetime64[ns]
driver_id: int64
restaurant_id: int64
consumer_id: int64
is_new: bool
delivery_region: object
is_asap: bool
order_total: float64
discount_amount: float64
tip_amount: float64
refunded_amount: float64


Approach and Solution

To solve this Python Pandas interview question, we need to extract the day of the week and the hour from the relevant datetime field. As mentioned in the problem, this field is 'customer_placed_order_datetime'.

To get the day of the week, we use the .dt.weekday attribute. As per the documentation, this returns 0 for Monday and 6 for Sunday. Since we need to start with 1 for Monday, we add 1 to the result.

# Import your libraries
import pandas as pd
 
# Keep relevant fields
dd_df = doordash_delivery[['customer_placed_order_datetime', 'order_total']]
# Get the day of the week (add 1 to have Monday = 1)
dd_df['weekday'] = dd_df['customer_placed_order_datetime'].dt.weekday + 1

Similarly, we can also extract the hour of the day. To do this we use the .dt.hour attribute of the datetime object.

# Hour of the day
dd_df['hour'] = dd_df['customer_placed_order_datetime'].dt.hour

We can now simply aggregate by day of the week and hour of the day and output the relevant columns.

dd_df.groupby(by = ['weekday', 'hour'], as_index = False).agg(
    {'order_total': 'mean'})

Let us try a slightly more difficult Python Pandas interview question. This one is from a Salesforce Data Science Interview and uses both datetime manipulation as well as pivot tables.

User Growth Rate

Calculate the growth rate of active users for Jan 2021 over Dec 2020 for each account.


You can solve this question here: https://platform.stratascratch.com/coding/2052-user-growth-rate

The problem uses the sf_events dataset that has the following columns.

date: datetime64[ns]
account_id: object
user_id: object


Approach and Solution

We need to aggregate the number of users across two different time periods for each account. This is a perfect case for a pivot table. But before we do that, we start off by extracting the year and month from the date, since we have to aggregate across months. To do this we use the .dt.strftime() method, which extends the strftime() method available in the Python datetime library. It is very similar to the TO_CHAR() function in SQL that we discussed in the SQL Datetime article.
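A quick illustration of strftime() format strings on made-up dates:

# Import your libraries
import pandas as pd
 
# Toy datetime Series
d = pd.Series(pd.to_datetime(['2020-12-25', '2021-01-09']))
 
d.dt.strftime('%Y-%m')      # '2020-12', '2021-01'
d.dt.strftime('%b %d, %Y')  # 'Dec 25, 2020', 'Jan 09, 2021'

Returning to the question: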

# Import your libraries
import pandas as pd
 
# Create the Year - Month indicator
sf_events['month'] = sf_events['date'].dt.strftime('%Y-%m')

We can now aggregate using the pivot table for the months of Dec 2020 and Jan 2021, calculate the growth rate and output the relevant columns.

# Aggregate relevant months
summ_df = sf_events[sf_events['month'].isin(['2020-12', '2021-01'])].pivot_table(
    index = 'account_id', columns = 'month', values = 'user_id',
    aggfunc = 'nunique').reset_index()
# Calculate growth rate and output relevant columns
summ_df['growth_rate'] = summ_df['2021-01'] / summ_df['2020-12']
summ_df[['account_id', 'growth_rate']]

Text Manipulation

As with datetime functions, Pandas provides a range of string functions. Like the .dt accessor for datetime functions, the .str accessor lets us use the standard string functions across an entire Series. There are also some additional functions beyond the standard string library that can come in handy. Here is a quick sketch of the .str accessor; we then look at a few examples of Python Pandas interview questions, the first from a City of Los Angeles Data Science Interview.
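A minimal sketch on made-up business names:

# Import your libraries
import pandas as pd
 
# Toy Series of business names
names = pd.Series(['Blue Bottle Coffee', 'Sunset School', "Joe's Diner"])
 
names.str.lower()                     # lowercase every value
names.str.find('school')              # -1, -1, -1 (find is case-sensitive)
names.str.lower().str.find('school')  # -1,  7, -1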

'BAKERY' Owned Facilities

Find the facilities owned by a business whose name contains the word 'BAKERY' and whose description categorizes them as low risk.

You can solve this question here: https://platform.stratascratch.com/coding/9697-bakery-owned-facilities

The problem uses the los_angeles_restaurant_health_inspections dataset with the following fields.

serial_number: object
activity_date: datetime64[ns]
facility_name: object
score: int64
grade: object
service_code: int64
service_description: object
employee_id: object
facility_address: object
facility_city: object
facility_id: object
facility_state: object
facility_zip: object
owner_id: object
owner_name: object
pe_description: object
program_element_pe: int64
program_name: object
program_status: object
record_id: object


Approach and Solution

While there are a lot of columns in the dataset, the relevant ones are owner_name and pe_description. We start off by keeping only the relevant columns in the dataset and dropping duplicates (if any).

We then proceed to search for the text BAKERY in the owner_name field and LOW RISK in the pe_description field. To do this, we use the .str.lower() method to convert all the values to lowercase and the .str.find() method to find the instances of the relevant text. The .str.find() method is an extension of Python's built-in find() method for strings.

We then output the rows that satisfy the two criteria using a Boolean Mask.

# Import your libraries
import pandas as pd
 
# Keep relevant fields
rel_df = los_angeles_restaurant_health_inspections[
    ['owner_name', 'pe_description']].drop_duplicates()
# Find the relevant text in the two fields
rel_df[(rel_df['owner_name'].str.lower().str.find('bakery') >= 0)
       & (rel_df['pe_description'].str.lower().str.find('low risk') >= 0)]

Besides the usual string methods, Pandas also provides some methods that come in especially handy for text manipulation. One such method is explode().

As the name suggests, explode() expands a list-like column of a Series or DataFrame into separate rows; the other values (in the case of a DataFrame) and the index are repeated. Here is a minimal sketch, after which we use it on the Airbnb Data Science Interview question that we solved earlier in the article.
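A minimal sketch on toy data:

# Import your libraries
import pandas as pd
 
# Each row holds a list of amenities
df = pd.DataFrame({'city': ['NYC', 'LA'],
                   'amenities': [['TV', 'Wifi'], ['Kitchen']]})
df.explode('amenities')
# city amenities
# NYC  TV
# NYC  Wifi
# LA   Kitchen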

City With Most Amenities

Find the city with the most amenities across all their host's properties from the given dataset of searches for properties on Airbnb. Assume that each row represents a unique host. Output the name of the city as your solution.

Link to the question: https://platform.stratascratch.com/coding/9633-city-with-most-amenities


The relevant fields in the airbnb_search_details dataset are amenities and city.


Approach and Solution

We start off by keeping only the relevant fields in the dataset (this is not strictly needed for the actual solution; we do it to make the solution simpler to understand).

# Import your libraries
import pandas as pd
 
# Keep Relevant fields
rel_df = airbnb_search_details[['amenities', 'city']]

We split the amenities string into a list by calling the split method.

# Split the amenities string
rel_df['amenities'] = rel_df['amenities'].str.split(",")

Now we invoke the explode() method on the amenities column.

rel_df = rel_df.explode('amenities')

As one can see, the explode method creates a separate row for each element inside an iterable such as a list, set, or tuple. Now we can aggregate the amenities by city and output the result as before.

# Summarize by city
rel_df.groupby(by = ['city'], as_index = False).agg(
    {'amenities': 'count'}).sort_values(
    by = ['amenities'], ascending = False).reset_index()['city'][0]

explode() is a very powerful method and comes in very handy for text-manipulation questions.

Statistics using Pandas

Given its ability to work with tabular data, Pandas is a natural choice for statistical manipulation as well. While NumPy is considered the go-to library for numerical work, Pandas is built on NumPy and inherits quite a few statistical methods that make these measures easy to calculate. Let us look at a few examples. The first one is from a City of Los Angeles Data Science Interview.

Find the variance and the standard deviation of scores that have grade A


You can try to solve this question here: https://platform.stratascratch.com/coding/9708-find-the-variance-and-the-standard-deviation-of-scores-that-have-grade-a

The problem uses the los_angeles_restaurant_health_inspections dataset that we saw previously. The dataset has the following fields.

serial_number: object
activity_date: datetime64[ns]
facility_name: object
score: int64
grade: object
service_code: int64
service_description: object
employee_id: object
facility_address: object
facility_city: object
facility_id: object
facility_state: object
facility_zip: object
owner_id: object
owner_name: object
pe_description: object
program_element_pe: int64
program_name: object
program_status: object
record_id: object


Approach and Solution

Let us first solve this by doing the calculations ourselves; we will then solve it by invoking the built-in Pandas methods. We start off by subsetting the relevant fields and keeping only those scores that correspond to grade A.

# Import your libraries
import pandas as pd
 
# Subset relevant rows
la_df = los_angeles_restaurant_health_inspections[
    los_angeles_restaurant_health_inspections['grade'] == 'A'][['grade', 'score']]
la_df

Variance is the average squared deviation from the mean. Population variance is defined as

σ² = Σ(xᵢ − μ)² / N, where μ is the mean and N is the number of observations.

Standard deviation is the square root of the variance.

We can calculate the variance by taking the average of the squared differences from the mean of the scores. We can accomplish this in a single step using vectorized operations. From the variance, we can easily calculate the standard deviation, and finally output the results in the desired format.

variance = ((la_df['score'] - la_df['score'].mean())**2).mean()
stdev = variance ** 0.5
output_df = pd.DataFrame({'variance' : [variance], 'stdev' : [stdev]})

Instead of calculating the variance from the formula, we can also invoke the built-in Pandas statistical methods. Since we are calculating the population variance and standard deviation, we need to tell Pandas not to apply Bessel's correction. We do this by setting the ddof argument to 0 in both the variance and standard deviation calculations.
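A quick sanity check on made-up numbers:

# Import your libraries
import pandas as pd
 
s = pd.Series([1, 2, 3, 4])
s.var(ddof = 0)  # 1.25 (population variance: divides by N)
s.var(ddof = 1)  # 1.6667 (sample variance: divides by N - 1; the default)

Applying this to our problem: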

variance = la_df['score'].var(ddof = 0)
stdev = la_df['score'].std(ddof = 0)
output_df = pd.DataFrame({'variance' : [variance], 'stdev' : [stdev]})

Let's try a slightly trickier one. This is from a Google Data Science Interview.

Correlation Between Emails And Activity Time

Find the correlation between the number of emails received by a user and the total exercise per day. The total exercise per day is the number of user sessions per day.


You can try to solve this question here: https://platform.stratascratch.com/coding/10069-correlation-between-e-mails-and-activity-time

This problem uses two datasets.

  • google_gmail_emails, which has the following fields:

id: int64
from_user: object
to_user: object
day: int64

  • google_fit_location, which contains the following fields:

user_id: object
session_id: int64
step_id: int64
day: int64
latitude: float64
longitude: float64
altitude: float64

Approach and Solution

We start off by calculating the number of emails sent to each user per day. We do this by aggregating the number of emails by the receiving user (to_user) and day.

import pandas as pd
import numpy as np


mail_base = google_gmail_emails.groupby(
['to_user', 'day']).size().to_frame('n_emails').reset_index()


We do the same for the other dataset, counting the number of user sessions per day. Note that we need to count each user session only once.

location = google_fit_location.groupby(
['user_id', 'day'])['session_id'].nunique().to_frame('total_exercise').reset_index()


We now merge the two datasets on user and day.

merged = pd.merge(mail_base, location, left_on=['to_user', 'day'], right_on=['user_id', 'day'], how='outer').fillna(0)

We can now calculate the correlation using the corr() function. Calling corr() on a DataFrame returns a correlation matrix; for two variables x and y, it looks something like this:

       x     y
x   1.00     r
y      r  1.00

The diagonal is always 1, since every variable is perfectly correlated with itself; the value we need, r, sits off the diagonal. Rather than subsetting the matrix, we can call corr() on one Series, passing the other Series as the argument, which returns the single value directly.

result = merged['n_emails'].corr(merged['total_exercise'])
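A minimal sketch contrasting the two forms on made-up numbers:

# Import your libraries
import pandas as pd
 
# Toy data
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 5, 9]})
 
df.corr()              # 2x2 matrix: 1.0 on the diagonal, r off it
df['x'].corr(df['y'])  # the single value r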

Offsetting Data

Another common business case, especially with time-series data, is finding the previous or next value in a series. Pandas supports these SQL-style lag and lead operations through the shift() method; here is a minimal sketch, after which we put it to use on a problem from an Uber Data Science Interview.
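A minimal sketch on a toy Series:

# Import your libraries
import pandas as pd
 
s = pd.Series([10, 20, 30])
 
s.shift(1)   # NaN, 10, 20 (lag: values move down one row)
s.shift(-1)  # 20, 30, NaN (lead: values move up one row)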

Year Over Year Churn

Calculate the churn of drivers for each year and report whether the number has increased or decreased compared to the previous year.


You can solve the full problem here: https://platform.stratascratch.com/coding/10017-year-over-year-churn

The problem uses the lyft_drivers dataset that has the following fields.

index: int64
start_date: datetime64[ns]
end_date: datetime64[ns]
yearly_salary: int64


Approach and Solution

We start off by calculating the churn for each year. To do this, we first calculate the year of the churn from the end_date field and then count the churn for each year.

# Import your libraries
import pandas as pd
 
# Get the year from exit date
lyft_drivers['year'] = lyft_drivers['end_date'].dt.year
 
# Get the number of drivers churned each year
summ_df = lyft_drivers.groupby(by = ['year'], as_index = False).agg(
    {'index': 'count'}).sort_values(by = ['year']).rename(
    columns = {'index': 'churn'}).dropna()


To check whether the number has increased or decreased compared to the previous year, we need to offset, or shift, the churn numbers down by one row. We accomplish this using the shift() method. As the name suggests, the shift method offsets the data by n rows; one can also pass a negative number in order to shift the values up.

# Fetch the prev year's churn numbers
summ_df['prev_churn'] = summ_df['churn'].shift(1).fillna(0)
summ_df

We can compare the value with the previous item and determine if the number has increased or decreased.

# Compare the two churn numbers and output the change
summ_df['change'] = summ_df.apply(
    lambda x: 'increase' if x['churn'] > x['prev_churn']
    else ('no change' if x['churn'] == x['prev_churn'] else 'decrease'),
    axis = 1)
summ_df

Bonus Python Pandas Interview Questions


We finish off by solving a couple of problems that combine all the things that we have learnt. The first one uses window functions and is from an Amazon Data Science Interview.

Revenue Over Time

Find the three-month moving average of the revenue for each month.


You can try to solve this question here: https://platform.stratascratch.com/coding/10314-revenue-over-time

The problem uses the amazon_purchases dataset that has the following fields.

user_id: int64
created_at: datetime64[ns]
purchase_amt: int64


Approach and Solution

We start off by subsetting the data to the relevant transactions, removing refunds (negative purchase amounts). We create the month indicator by invoking the strftime method on the created_at field, and then aggregate the purchase amounts for each month.

# Import your libraries
import pandas as pd
 
# Remove refund transactions
pos_df = amazon_purchases[amazon_purchases['purchase_amt'] > 0]
# Create Month indicator
pos_df['month'] = pos_df['created_at'].dt.strftime("%Y-%m")
# Aggregate the purchases by month
summ_df = pos_df.groupby(by = ['month'], as_index = False).sum()[
    ['month', 'purchase_amt']].sort_values(by = ['month'])

To find the moving average, we use the rolling function, which creates a moving window of n rows; we can change its arguments to get our desired output. Since we do not want the first two observations to return null values, we set the min_periods argument to 1. Finally, we invoke the mean() method to calculate the average over the three months and return the relevant fields.

# Calculate the rolling average, ensuring that the value is
# calculated even for the first two months
summ_df['roll_avg'] = summ_df['purchase_amt'].rolling(3, min_periods = 1).mean()
# Output relevant fields
summ_df[['month', 'roll_avg']]

This next one is from an Amazon Data Science Interview and uses the apply method in an innovative way.

Player with the Longest Streak

A streak is a set of consecutive matches won by a particular player. The streak ends when a player loses their next match. Output the ID(s) of the player(s) with the longest streak and the length of the streak(s).


You can solve this Python Pandas interview question here: https://platform.stratascratch.com/coding/2059-player-with-longest-streak

This Pandas interview question uses the players_results dataset with the following fields.

player_id: int64
match_date: datetime64[ns]
match_result: object


Approach and Solution

In order to solve this problem, we need to get the results of a player. Suppose the sequence of results of a player are

WWLWLWWWLWWWWWWWLLLWLW

We can simply split the string using the letter 'L' as the separator. This splits the string into a list of winning streaks, something like this:

[WW] [W] [WWW] [WWWWWWW] [] [] [W] [W] (the consecutive losses produce empty strings).

Finally, we find the length of the longest element of the list, which gives us the player's longest streak. To do this in Pandas, we first need to concatenate each player's results into one string, which we do by simply aggregating with the sum() method.

The sum() method gives the total when numeric data is passed; but when string data is passed, the addition operator concatenates the strings, which is exactly what we need!
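A one-line sanity check on a made-up Series:

# Import your libraries
import pandas as pd
 
pd.Series(['W', 'W', 'L', 'W']).sum()  # 'WWLW'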

We start off by concatenating the results.

# Import your libraries
import pandas as pd
 
# Sort by date so that each player's results are concatenated in
# chronological order
players_results = players_results.sort_values(by = ['match_date'])
# Create the sequence of results for each player
streak_df = players_results.groupby(by = ['player_id'], as_index = False).agg(
    {'match_result': 'sum'})

We then proceed to split the match_result string using the letter L as the separator. We also explode the results so that we get each list as a separate row.

# Split the sequence using 'L' as the separator and explode
streak_df['streak'] = streak_df['match_result'].str.split('L')
streak_df = streak_df.explode(column = 'streak')

Now the rest of the problem is straightforward. We simply need to aggregate the maximum length of the streak string and output the relevant fields.

# Find the length of each streak
streak_df['streak_len'] = streak_df['streak'].apply(len)
# Aggregate to the longest streak per player
streaks_df = streak_df.groupby(by = ['player_id'], as_index = False).agg(
    {'streak_len': 'max'})
# Output relevant fields
streaks_df['rank'] = streaks_df['streak_len'].rank(method = 'dense',
    ascending = False)
streaks_df[streaks_df['rank'] == 1].drop(columns = ['rank'])

Conclusion

In this series of articles, we looked at how to use Pandas and how to solve Python Pandas interview questions. Proficiency in Pandas is expected if one is serious about a career in Data Science with Python as the major tool. Using Pandas is as easy as using spreadsheet software like MS Excel, Google Sheets, Numbers, or LibreOffice Calc; all one needs to gain proficiency is a bit of time and curiosity. We have over 700 coding and non-coding problems on the StrataScratch platform that are relevant to Data Science interviews, sourced from actual interviews at top companies like Uber, Netflix, Noom, Microsoft, and Facebook. Check out our recent post on Python Interview Questions and Answers. On StrataScratch, you will have the opportunity to join a community of 20,000+ like-minded data science aspirants and get a collaborative learning experience. Sign up on StrataScratch today and prepare for the most in-demand jobs at big tech firms and start-ups across the world.

