Get data science interview questions delivered to your inbox
Please enter a valid email address.

Top 30 Python Interview Questions and Answers

A comprehensive overview of the types of Python interview questions asked in Data Science Interviews at top companies like Amazon, Google, Microsoft, etc.

Top 30 Python Interview Questions and Answers

Python has been consistently rated as the top programming language. Besides being widely used in a range of applications from web and mobile development, API management, process automation, etc, Python rules the Data Science world. Python is free, open-source and is backed by a vast collection of well-documented and continuously updated libraries. Unlike other languages with finicky syntaxes, Python is very easy to pick up and even people without CS backgrounds can learn it easily.

Technical concepts used in Python Interview Questions

The most common test areas in Python Data Science Interviews include

  • Using the Python standard library.
    • Specifically built-in data types and data structures like strings, Boolean, list, dictionaries, etc.
    • Manipulating datetime objects.
  • Working with Tabular data
    • Specifically using Pandas and NumPy libraries
    • Knowledge of Pandas functions including
      • Indexing and Slicing
      • Aggregation and Rollups
      • Vectorized operations

Python Interview Questions

For this article, we will focus specifically on Python interview questions for the Data Scientist position. The problems involving Python for Data Scientist roles can broadly be grouped into the following categories.

Python Interview Questions Categories
  • Math, Stats and Probability questions.
  • Data Cleaning and Wrangling
  • Datetime Operations
  • Business Metrics and KPIs
  • Text Manipulation
  • Product Specific Problems

Let us look at each of these types of problems in detail.

Math and Probability Questions for Python Data Science Interviews

These python interview questions require the use of mathematical functions and (or) formulae. Some of the problems can be solved using high math, however, some might need a greater knowledge of probability and combinatorics. Let us look at a couple of them. The first one is from Credit Karma and requires the use of simple high school math.

1. Share of Loan Balance

Group each submission’s total balance based on rate_type and report the rate_type, loan_id and balance of each loan type. Also calculate the percentage of the submission’s total balance each loan constitutes.

Python Interview Questions for Share of Loan Balance

You can solve this problem here.

https://platform.stratascratch.com/coding/2001-share-of-loan-balance

This is the data that we will be working with.

submissions

idint64
balancefloat64
interest_ratefloat64
rate_typeobject
loan_idint64

And here is what the data looks like.

Python Interview Questions Data for Share of Loan Balance

Approach

This is a relatively straightforward set.

  • We first need to get the totals across all rate types (fixed or variable)
  • Then we proceed to merge this information to the full dataset by rate_type
  • Now the loan share can be calculated relatively easily.

Solution

# Import your libraries
import pandas as pd
 
# Get sum loan balances by loan type
loan_summ = submissions[['rate_type', 'balance']].groupby(['rate_type'], 
as_index = False).sum().rename(columns = ({'balance': 'total_balance'}))
 
# Merge with the full dataset
output_df = pd.merge(submissions, loan_summ, on = 'rate_type', how = 'left')
 
# Calculate the share
output_df['loan_share'] = output_df['balance'] / output_df['total_balance']
 
# Keep relevant columns
output_df = output_df[['loan_id', 'rate_type', 'balance', 'loan_share']]
 
# Submit
output_df

The next one is similar. This involves understanding of probability but is not too difficult. This one is from a Lyft Data Science Interview.

2. Find the probability of ordering a ride based on the weather and the hour

Find the probability of ordering a ride based on the weather and the hour and display the weather, hour and the corresponding probability sorted by the weather and the hour in ascending order.

Python Interview Questions to Find the probability

You can solve this python interview question here.

https://platform.stratascratch.com/coding/10019-find-the-probability-of-ordering-a-ride-based-on-the-weather-and-the-hour

This is the dataset that we will be working with.

lyft_rides

indexint64
weatherobject
hourint64
travel_distancefloat64
gasoline_costfloat64

This is what the data looks like.

Python Interview Questions Data to Find the probability

Approach

  • Here we are required to find out the probability of the rides for each combination of weather and hour.
  • Therefore, we start off by aggregating the data by weather and hour. Since we need to count the number of events, we take up the gasoline_cost variable and count the values. This gives the number of occurrences by each required combination.
  • We then find the probability by dividing these counts by the total number of occurrences (which is the number of rows in the dataset).
  • Finally, we sort the results in the required order and output only the relevant columns.

Solution

# Import your libraries
import pandas as pd
 
# Find the number of rides by weather and hour
out_df = lyft_rides[['weather', 'hour', 'gasoline_cost']].groupby(by = 
['weather', 'hour'], as_index = False).count().rename(columns = 
{'gasoline_cost': 'count'})
 
# Calculate probability
out_df['probability'] = out_df['count'] / len(lyft_rides)
 
# Output the relevant columns and sort
out_df = out_df[['weather', 'hour', 'probability']].sort_values(by = 
['weather', 'hour'])
 
# Submit
out_df

You can refer to this article for more such questions on using Python for solving Math, Probability and Stats questions for Data Science Interviews.

Statistics Questions for Python Data Science Interviews

Statistics based Data Science Interview questions may not always involve coding. You might be asked problems that involve use of statistics and machine learning algorithms.

In this article, we will concentrate on problems that involve coding in Python. Most of these python interview questions will test your understanding of statistical measures like mean, median, mode, percentiles, variance, standard deviation, et al. Let us look at a question that appeared in Python Data Science Interviews. This involves calculating percentiles and median.

3. Find students with a median writing score

Python Interview Questions to Find students

The question is available here

https://platform.stratascratch.com/coding/9610-find-students-with-a-median-writing-score

This is the dataset that we are working with

sat_scores

schoolobject
teacherobject
student_idfloat64
sat_writingfloat64
sat_verbalfloat64
sat_mathfloat64
hrs_studiedfloat64
idint64
average_satfloat64
lovedatetime64[ns]

And this is the data in the dataset.

Python Interview Questions Data to Find students

Approach

Median is the middle value when the data is sorted in ascending or descending order. Median is also the 50th percentile. To solve this python interview question, we can call the rank method on the writing score column and then get the 50th percentile score. Then subset the data to get only those values that have a percentile rank of 50.

Solution

# Import your libraries
import pandas as pd
import numpy as np
 
# calculate percentiles with built in function
sat_scores['percentile'] = (sat_scores['sat_writing'].rank(pct = True) * 
100).apply(np.floor)
 
# output median (50th percentile) values
sat_scores[sat_scores['percentile'] == 50]

Data Cleaning and Wrangling: Working with duplicates

Data Science is not only about Math and Statistics. One also needs to have a good handle on how to manipulate data to find the right output. A very common problem in Data Science Interviews is dealing with duplicates and finding out unique values. It is very important to know if the values in a particular column are unique, especially when you are merging additional datasets. Let us take a couple of examples that involve these concepts.

The first one is from a Virgin Group Data Science Interview

4. Count the unique origin airports

Find how many different origin airports exist?

Python Interview Questions to Count the airports

The python interview question can be solved on the Stratascratch platform here.

https://platform.stratascratch.com/coding/9672-count-the-unique-origin-airports

These are fields in the dataset.

us_flights

flight_datedatetime64[ns]
unique_carrierobject
flight_numint64
originobject
destobject
arr_delayfloat64
cancelledint64
distancefloat64
carier_delayfloat64
weather_delayfloat64
late_aircraft_delayfloat64
nas_delayfloat64
security_delayfloat64
actual_elapsed_timefloat64

And this is how the data set looks like.

Python Interview Questions Data to Count the airports

Approach

There are quite a few fields in this dataset, but the one that is of interest to us is the origin. We need to count the distinct number of airports. We can simply do this by passing the values to a set. A set is a standard Python data type that contains only distinct values. If you know about sets, we can solve this python interview question in a single line of code!!

Solution

# Import your libraries
import pandas as pd
 
# Convert the column values to a set and find its length
len(set(us_flights['origin']))

Let us try another one. This one is from a SalesForce Data Science Interview.

5. Duplicate Emails

Find all emails with duplicates.

Python Interview Questions for Duplicate Emails

You can solve this python interview question here.

https://platform.stratascratch.com/coding/9895-duplicate-emails

The fields in the data set are:

employee

idint64
first_nameobject
last_nameobject
ageint64
sexobject
employee_titleobject
departmentobject
salaryint64
targetint64
bonusint64
emailobject
cityobject
addressobject
manager_idint64

This is what the data looks like.

Python Interview Questions Data for Duplicate Emails

Approach

  • The dataset again has a lot of fields, but the one we are interested in is email.
  • To solve this problem, we can use the groupby method in Pandas and calculate the count.
  • We then subset the resulting dataframe selecting only those emails with a count greater than 1. In other words, we select only those email ids that appear more than once.

Solution

# Import your libraries
import pandas as pd
 
# Find the count of all email ids
count_df = employee[['email', 'id']].groupby(by = 'email', 
as_index = False).count()
# Subset only repeated values.
count_df[count_df['id'] > 1]['email']

Data Cleaning and Wrangling: Datetime manipulation

Another popular genre of python interview questions is one that involves manipulating date and time values. These questions are a perennial favorite of interviewers as you can create a range of questions with very simple looking datasets. We have looked at manipulating date time values with SQL in this SQL scenario based interview questions article. Let us see how to do something similar in Python.

The first one is from an Airbnb Data Science Interview

6. Find the average difference between booking and check-in dates

Find out the average number of days between the booking and check-in dates for AirBnB hosts. Report the results based on the average number of days ordered in descending order.

Python Interview Questions to Find the average difference

You can solve this problem here. https://platform.stratascratch.com/coding/9614-find-the-average-difference-between-booking-and-check-in-dates

The dataset contains the following fields

airbnb_contacts

id_guestobject
id_hostobject
id_listingobject
ts_contact_atdatetime64[ns]
ts_reply_atdatetime64[ns]
ts_accepted_atdatetime64[ns]
ts_booking_atdatetime64[ns]
ds_checkindatetime64[ns]
ds_checkoutdatetime64[ns]
n_guestsint64
n_messagesint64

The dataset looks like this.

Python Interview Questions Dataset to Find the average difference

Approach

  • The relevant fields here are id_host to identify the host, ts_booking_at : the booking timestamp and the ds_checkin the check-in time.
  • We start off by converting the values timestamps to dates. To do this, we use the dt.date method for pandas timestamp.
  • We then take the difference between the check-in date and the booking date. The resulting value will be a timedelta object. This represents the difference between the two timestamps. We convert this timedelta value to days. This can be achieved by using the dt.days method in Pandas. The rest of the solution is pretty simple now.
  • We take the average of the timestamps, grouped by the host_id.
  • Then we sort the averages in descending order and remove any null values.
    • Note: There might be null values because the guest might have initiated a conversation but might not have proceeded to book a stay.

Solution

# Import your libraries
import pandas as pd
 
# Extract only the days from the timestamp, calculate the difference
and convert the difference to days. 
airbnb_contacts['date_diff'] = (airbnb_contacts['ds_checkin'].dt.date - 
airbnb_contacts['ts_booking_at'].dt.date).dt.days
# Calculate the average of the date_diff, sorting the result in descending 
order and dropping and null values. 
avg_df = airbnb_contacts[['id_host', 'date_diff']].groupby(by = ['id_host'], 
as_index = False).mean().sort_values(by = ['date_diff'], 
ascending = False).dropna()

Let us crank up the difficulty level for the next one. This one is from a SalesForce Data Science Interview.

7. Consecutive Days

Find all the users who were active for 3 consecutive days or more.

Python Interview Questions for Consecutive Days

You can solve this python interview question here.

https://platform.stratascratch.com/coding/2054-consecutive-days

The dataset contains the following fields

sf_events

datedatetime64[ns]
account_idobject
user_idobject

And it looks like this.

Python Interview Questions Data for Consecutive Days

Approach

  • The dataset is straightforward. However, the solution is a little complex as one needs to identify three or more consecutive days of login.
  • We begin by extracting only the date parts from the timestamps. To do this we use the dt.date method.
  • There might be multiple logins on the same day. Hence, we remove duplicate dates for the same user and sort the dataframe by user id and login dates.
  • We then find the next login date and the next-to-next login date. To ensure that we do this for the same user only, we use the groupby.shift() method. This is akin to a window function in SQL.
  • We finally subset only those values where the difference between the first and the second login and between the second and third login is one. To ensure that we do not double count values when a streak of more than 3 days is present, we drop any duplicates that might be present.

Solution

# Import your libraries
import pandas as pd
 
# Extract date part from the time stamps.
sf_events['date_stamp'] = sf_events['date'].dt.date
 
# Drop duplicate dates for the same user and sort login dates for each user
sf_events = sf_events.drop_duplicates(subset = ['user_id', 'date_stamp']).
sort_values(by = 
['user_id', 'date_stamp'])
 
# Extract the next login date for the same user
sf_events['d+1'] = sf_events[['user_id', 'date_stamp']].groupby(by = 
['user_id']).shift(-1)
 
# Extract the third login date from the same user
sf_events['d+2'] = sf_events[['user_id', 'date_stamp']].groupby(by = 
['user_id']).shift(-2)
 
# Subset only those users where the date difference is 1 between the first 
and the second login and that between the second and the third login
 
sf_events[((sf_events['d+1'] - sf_events['date_stamp']).dt.days == 1) & 
((sf_events['d+2'] - sf_events['d+1']).dt.days == 1)]
['user_id'].drop_duplicates()

Business Metrics and KPIs

Reporting KPIs and monitoring various model results is a key aspect of any Data Scientist’s day-to-day work. A lot of Python Data Science Interviews, especially at entry levels test the ability of a candidate to calculate metrics and develop business reports. These python interview questions test a combination of a candidate’s business understanding, a sense of how the data should be organized and coding skills. Here, we illustrate some of the commonly used metrics that are widely used by most businesses and how to create them.

Business Metrics and KPIs: Ranking

Positions or Ranks are one of the most reported numbers in most KPIs. There can be multiple use cases for ranks – top selling products, most cost-effective channels, best performing ads, et al. Let us look at a couple of problems from Python Data Science interviews that required the candidate to use ranking concepts.

The first one is from the City of San Francisco.

8. Find the top 5 highest paid and top 5 least paid employees in 2012

Find the top 5 highest paid and top 5 least paid employees in 2012. Output the employee’s name along with the corresponding total pay with benefits. Sort records based on the total payment with benefits in ascending order.

Python Interview Questions to Find the top

You can solve this problem here

https://platform.stratascratch.com/coding/9979-find-the-top-5-highest-paid-and-top-5-least-paid-employees-in-2012

The dataset has the following fields.

sf_public_salaries

idint64
employeenameobject
jobtitleobject
basepayfloat64
overtimepayfloat64
otherpayfloat64
benefitsfloat64
totalpayfloat64
totalpaybenefitsfloat64
yearint64
notesdatetime64[ns]
agencyobject
statusobject

And this is how the data looks like.

Python Interview Questions Data to Find the top

Approach

  • This is a relatively straightforward problem.
  • We start off by subsetting the data for 2012.
  • We then sort it in ascending order by the column totalpaybenefits.
  • Finally, we take the first five and the last five rows and out the relevant fields.

Solution

# Import your libraries
import pandas as pd
 
# take the 2012 data and sort
yr_2012 = sf_public_salaries[sf_public_salaries['year'] == 2012].sort_values
(by = 
['totalpaybenefits'])
 
# take the first 5 and the bottom 5 rows. 
yr_2012[:5].append(yr_2012[-5:], ignore_index = True)[
    ['employeename', 'totalpaybenefits']]

Let us try a more difficult one. This is from an Airbnb Data Science Interview.

9. Ranking Hosts By Beds

Rank the AirBnB hosts based on the number of beds they have listed. The host with the most beds should be ranked highest and one with the least number of beds should be ranked last. Hosts with the same number of beds should have the same rank. A host may own multiple properties. Report the host ID, number of beds, and rank from highest rank to lowest.

Python Interview Questions for Ranking Hosts

You can solve the problem here:

https://platform.stratascratch.com/coding/10161-ranking-hosts-by-beds

The dataset has the following fields.

airbnb_apartments

host_idint64
apartment_idobject
apartment_typeobject
n_bedsint64
n_bedroomsint64
countryobject
cityobject

Here is a sample of the data present in the dataset.

Python Interview Questions Data for Ranking Hosts

Approach

  • As mentioned in the problem, hosts may have multiple properties listed on the platform. Therefore, before we start ranking, we need to sum up the total number of beds that a host has.
  • We then rank these hosts based on the total number beds listed.
  • Finally, we sort the hosts by their ranks and output the results.

Solution

# Import your libraries
import pandas as pd
 
# roll up the number of beds for a host
hosts_summ = airbnb_apartments.groupby(by = ['host_id'], 
as_index = False).sum()[['host_id','n_beds']]
 
# Rank the hosts by the number of beds, with the highest getting the best rank
hosts_summ['rank'] = hosts_summ['n_beds'].rank(method = 'dense', 
ascending = False)
 
# sort the output by rank in ascending order
hosts_summ.sort_values(by = ['rank'])

Business Metrics and KPIs: Aggregations

Aggregations are bread and butter for most Data Scientists. It is not possible to inspect each data point, therefore aggregations are used to examine trends, outliers, and possible relationships. For Data Analyst positions, these are even more important. You can look at the similarities and differences in the job responsibilities for various Machine Learning and Data Science Job Titles or Positions in this article. Let us look at some examples that use aggregation functions.

The first one is from Whole Foods Market Data Science Interview

10. Products Report Summary

Find the number of transactions and total sales for each of the product categories in 2017. Output the product categories, number of transactions, and total sales in descending order. The sales column represents the total cost the customer paid for the product so no additional calculations need to be done on the column. Only include product categories that have products sold.

Python Interview Questions for Products Report Summary

You can solve the full problem here.

https://platform.stratascratch.com/coding/2039-products-report-summary

The problem uses two datasets. The dataset wfm_transactions has the following fields

customer_idint64
store_idint64
transaction_datedatetime64[ns]
transaction_idint64
product_idint64
salesint64

This is a snapshot of the data that it contains.

Python Interview Questions Data for Products Report Summary

The second dataset is named wfm_products and contains the following fields.

product_idint64
product_descriptionobject
product_brandobject
product_categoryobject

This is how the data in wfm_products looks like.

Python Interview Questions Data 2 for Products Report Summary

Approach

  • We start off by merging the two datasets using the product_id key. Since we need only the products that have at least one transaction, we perform a left join.
  • Next, we need to subset only the transactions that were made in the year 2017. We invoke the year method on the transaction_dt timestamp object for this.
  • Finally, we aggregate the sales and the number of transactions by the product_category field, and report the results after sorting them in the descending order of the sales.

Solution

# Import your libraries
import pandas as pd
 
# Merge the datasets
merged_df = pd.merge(wfm_transactions, wfm_products, on = 'product_id', 
how = 'left')
 
# Find the year of transaction
merged_df['year'] = merged_df['transaction_date'].dt.year
 
# Keep only 2017 transactions
merged_df = merged_df[merged_df['year'] == 2017]
 
# Summarize, sort and submit.
merged_df.groupby(by = ['product_category'], as_index = False).agg(
    {'sales' : 'sum','transaction_id' :'count'}).sort_values(by = ['sales'], 
    ascending = False)

The second one is from a Noom Data Science Interview.

11. Signups By Billing Cycle

Write a query that returns a table containing the number of signups for each weekday and for each billing cycle frequency. Report the signups with the weekday number (e.g., 1, 2, 3) as rows in your table and the billing cycle frequency (e.g., annual, monthly, quarterly) as columns.

Python Interview Questions for Signups

You can solve the full problem here: https://platform.stratascratch.com/coding/2032-signups-by-billing-cycle

The python interview question uses two datasets. The signups dataset contains the following fields

signup_idint64
signup_start_datedatetime64[ns]
signup_stop_datedatetime64[ns]
plan_idint64
locationobject

The data in signups looks like this.

Python Interview Questions Data for Signups

The plans table contains these fields

idint64
billing_cycleobject
avg_revenuefloat64
currencyobject

And the data looks like this.

Python Interview Questions Plans Data for Signups

Approach

  • Since we need the billing cycle from the second table, let us merge the two tables. The join keys will be the signup_id field in the signups table and the id field in the plans table.
  • Then we need to create a field for identifying the day of the week. We can use the weekday method for a datetime object for this.
  • Finally we can generate the report on one line using the pivot table method. We use the weekday for rows, billing cycle for columns and count method for aggregation.

Solution

import pandas as pd
 
# merge the datasets
merged_df = pd.merge(signups, plan, left_on = 'plan_id', right_on = 'id', 
how = 'left')
 
# Get the weekday
merged_df['weekday'] = merged_df['signup_start_date'].dt.weekday
 
# Create the report
pd.pivot_table(data = merged_df, values = 'avg_revenue', 
columns = 'billing_cycle', 
aggfunc = 'count', index = 'weekday').reset_index()

Business Metrics and KPIs: Profit and Profitability

Another group of business metrics that are very widely used and requested are those involving profits and profitability. These metrics usually require you to drill down the overall numbers into subcategories like location, product, segment, et al. Familiarity with aggregation methods is critical to solving these problems in Python. These problems might also require you to calculate the metric in a very specific manner, so read the python interview questions very carefully when solving these problems. Here look at a problem that was asked in a Noom Data Science Interview.

12. Find The Most Profitable Location

Calculate the average signup duration and average revenue for each location. Compare these two measures by taking the ratio of the average revenue and average duration for each location. Report the location, average duration, average revenue, and ratio. Sort your results from highest ratio to lowest.

Python Interview Questions for Most Profitable Location

You can solve the problem here:

https://platform.stratascratch.com/coding/2033-find-the-most-profitable-location

The problem involves two datasets. signups and transactions.

signups

signup_idint64
signup_start_datedatetime64[ns]
signup_stop_datedatetime64[ns]
plan_idint64
locationobject

The data contained in it looks like this.

Signups Data of Python Interview Questions for Most Profitable Location

The transactions dataset contains the following fields

transaction_idint64
signup_idint64
transaction_start_datedatetime64[ns]
transaction_refunded_datedatetime64[ns]
amtfloat64

The data in this dataset looks like this.

Transactions Data of Python Interview Questions for Most Profitable Location

Approach

  • This problem looks straightforward but can lead to wrong results if one is not careful.
  • The problem lies in the way the data is arranged. There might be multiple transactions with the same signup id. If we merge the datasets together and then calculate the duration, it will lead to duplicated values of duration and therefore the wrong mean.
  • To overcome this problem, we calculate the two averages separately and then merge the dataset.
  • We first calculate the duration from the signups dataset and summarize it by taking the average value for a location.
  • To get the average transaction value for each location, we pull the location from the signups dataset into the transactions dataset. Then we calculate the average transaction amounts for each location.
  • Then, we merge the two summary datasets together and calculate the ratio as defined in the problem.
  • Finally, we report the numbers sorted in descending order of the calculated ratio.

Solution

# Import your libraries
import pandas as pd
 
# Calculate the duration of signups
signups['duration'] = (signups['signup_stop_date'] - signups
['signup_start_date']).dt.days
# Get the average duration
avg_dur_df = signups.groupby(by = ['location'], as_index = False).mean()
 
# Get the location_id from the signups dataset
merged_df = pd.merge(left = transactions, right = signups
[['signup_id', 'location']], 
on = 'signup_id', how = 'left')
# Summarize by location
trans_df = merged_df.groupby(by = ['location'], as_index = False).mean()
 
# Merge the dataframes keeping only the relevant columns
final_merged_df = pd.merge(avg_dur_df[['location', 'duration']], trans_df[
    ['location', 'amt']], on = 'location', how = 'inner')
 
# Calculate ratio, sort and submit.
final_merged_df['ratio'] = final_merged_df['amt'] / final_merged_df
['duration']
final_merged_df.sort_values(by = ['ratio'], ascending = False)

Business Metrics and KPIs: Churn and Retention

Account Retention and Churn (number of accounts leaving the platform) is a critical metric for all businesses. It is easier to transact with existing accounts versus trying to add more accounts to the platform. All businesses track churn and retention numbers closely and a lot of Data Science Interviews will involve these metrics in one form or the other. Here, we look at a couple of problems that involve churn, retention rates and associated measures.

The first one is from Lyft Data Science Interview

13. Year Over Year Churn

Determine whether the churn rate of drivers increased or decreased in each year compared to the previous one. Report the year the drivers left Lyft along with the corresponding number of churns in that year, the number of churns in the previous year, and an indication on whether the number has been increased (output the value 'increase') or decreased (output the value 'decrease'). Sort the records by the year in ascending order.

Python Interview Questions for Year Over Year Churn

You can solve the problem here:

https://platform.stratascratch.com/coding/10017-year-over-year-churn

The problem uses lyft_drivers dataset that contains the following fields

indexint64
start_datedatetime64[ns]
end_datedatetime64[ns]
yearly_salaryint64

The data looks like this.

Python Interview Questions Data for Year Over Year Churn

Approach

  • We start off by finding the year from the end_date date field. This is the date when the driver exited for the Lyft. To do this, we invoke the year method on the timestamp object.
  • Then, we summarize the number of drivers based on this “year” field. We also drop any values where the year is missing since these drivers are still on the platform. This leaves us with a summary data for the number of exits for each of the years.
  • To get the previous year’s values, we use the shift method and get a field with values shifted one row. This gives us the previous year’s churn value.
  • Finally, we compare the two churn numbers and output the required fields.

Solution

# Import your libraries
import pandas as pd
 
# Get the year from exit date
lyft_drivers['year'] = lyft_drivers['end_date'].dt.year
 
# Get the number of the drivers churned for each year
summ_df = lyft_drivers.groupby(by = ['year'], as_index = False).agg(
    {'index' : 'count'}).sort_values(by = ['year']).rename(columns = 
    {'index' : 'churn'}).dropna()
 
# Fetch the prev year's churn numbers
summ_df['prev_churn'] = summ_df['churn'].shift(1).fillna(0)
 
# Compare the two churn numbers and output the change
summ_df['change'] = (summ_df['churn'] > summ_df
['prev_churn']).apply(lambda x : 'increase' if x == True else 'decrease')
summ_df

Let us try a more difficult one. This one is from a Salesforce Data Science Interview.

14. Retention Rate

Compare the retention rates in Jan 2021 over that of Dec 2020. Retention rate is defined as the percentage of users a business retains over a given period of time. Assume that the user is retained if he stays with the app in any future months. Your output should have account_id and the Jan 2021 retention rate divided by Dec 2020 retention rate.

Python Interview Questions for Retention Rate

You can solve the full problem here:

https://platform.stratascratch.com/coding/2053-retention-rate

The problem uses the sf_events dataset that has the following fields.

datedatetime64[ns]
account_idobject
user_idobject

This is how the data looks like.

Python Interview Questions Data for Retention Rate

Approach

  • We can split the problem into two parts.
  • Finding retention rates for each month
    • Let us begin by finding the retention rate for December 2020.
    • We need to get the base numbers for December 2020. To get this, we need to get all those users who logged into the app in December 2020. We remove duplicates to ensure that a user is counted only once.
    • Then we get all those users that logged onto the app after December 2020.
    • We merge these two datasets keeping only the users that logged in December 2020.
    • We can now summarize by account_id and calculate the retention rate for December 2020.
    • We can repeat the process for January 2021 to get those numbers.
  • Comparing the retention rate for the two months
    • Once we have the retention rate datasets for December 2020 and January 2021, we can simply merge the two datasets and calculate the retention ratio. We will again use a left join (with the December dataset being the left table) as we need to compare December values with January ones.

Solution

# Import your libraries
import pandas as pd
 
# Get Accounts active in December
dec2020 = sf_events[(sf_events['date'].dt.year == 2020) & (sf_events
['date'].dt.month == 12)].drop_duplicates(subset = ['account_id', 'user_id'])
dec2020['in_dec'] = 1
 
# Get accounts active after December
aft_dec2020 = sf_events[sf_events['date'].dt.date > pd.to_datetime
('2020-12-31')].drop_duplicates(subset = ['account_id', 'user_id'])
aft_dec2020['aft_dec'] = 1
 
# Merge the datasets to get December retention numbers
dec_merged = pd.merge(dec2020, aft_dec2020, on = 
['account_id', 'user_id'], how = 'left')
dec_summ = dec_merged.groupby(by = ['account_id'], as_index = False).sum()
dec_summ['dec_retention'] = dec_summ['aft_dec'] / dec_summ['in_dec']
 
# Repeat the process for Jan 2021
jan2021 = sf_events[(sf_events['date'].dt.year == 2021) & (sf_events
['date'].dt.month == 1)].drop_duplicates(subset = ['account_id', 'user_id'])
jan2021['in_jan'] = 1
aft_jan2021 = sf_events[sf_events['date'].dt.date > pd.to_datetime
('2021-01-31')].drop_duplicates(subset = ['account_id', 'user_id'])
aft_jan2021['aft_jan'] = 1
 
jan_merged = pd.merge(jan2021, aft_jan2021, on = 
['account_id', 'user_id'], how = 'left')
jan_summ = jan_merged.groupby(by = ['account_id'], as_index = False).sum()
jan_summ['jan_retention'] = jan_summ['aft_jan'] / jan_summ['in_jan']
 
 
 
# Merge the Dec20 and Jan21 datasets, calculate the relative retention rate 
# and output.
final_merge = pd.merge(dec_summ, jan_summ, on = 'account_id', how = 'left')
final_merge['retention'] = final_merge['jan_retention'] / final_merge
['dec_retention']
final_merge[['account_id', 'retention']]

Business Metrics and KPIs: Loss Prevention

Loss prevention and fraud detection is a critical aspect of every business. Businesses must be at the top of their game to understand loss making portfolios and manage frauds. Here we look at some loss prevention problems that have appeared in Data Science interviews.

The first one is a relatively simple one from a Netflix Data Science Interview.

15. Top Percentile Fraud

ABC Corp is a mid-sized insurer in the US. They have seen a significant increase in fraudulent claims for their personal auto insurance portfolio. To mitigate these risks, they have developed an ML based predictive model to identify potential fraudulent claims. Each account is assigned a fraud score. Your objective is to identify the top 5 percentile of claims from each state. Your output should be policy number, state, claim cost, and fraud score.

Python Interview Questions for Top Percentile Fraud

You can solve the full problem here:

https://platform.stratascratch.com/coding/10303-top-percentile-fraud

The problem uses the fraud_score dataset with the following fields.

policy_numobject
stateobject
claim_costint64
fraud_scorefloat64

The data looks like this

Python Interview Questions Data for Top Percentile Fraud

Approach

  • This is a relatively straightforward problem that can be solved using the rank method. We use the rank method with the pct = True option to get percentile values.
  • To rank (and calculate percentiles) for each state, we use the groupby method.
  • Finally, we subset the top 5% of the frauds (above 95%th percentile) and report the values.

Solution

# Import your libraries
import pandas as pd
 
# Calculate Percentile Rank
fraud_score['pctile'] = fraud_score.groupby(by = 
['state'])['fraud_score'].rank(pct = True)
 
# Keep only the top 5%
fraud_score[fraud_score['pctile'] > 0.95].drop(columns = 'pctile')

Let us try a more difficult problem. This one is from an Uber Data Science Interview.

16. Percentage Of Revenue Loss

Calculate the percentage of incomplete orders along with the revenue loss percentage for each service category. Output the name of the service, percentage of incomplete orders, and revenue loss from the incomplete orders.

Python Interview Questions for Percentage Of Revenue

You can solve this problem here. https://platform.stratascratch.com/coding/2048-percentage-of-revenue-loss

The problem uses the uber_orders dataset that contains the following fields.

order_datedatetime64[ns]
number_of_ordersint64
status_of_orderobject
monetary_valuefloat64
service_nameobject

The data in the set looks like this.

Python Interview Questions Data for Percentage Of Revenue

Approach

  • This python interview question requires us to summarize the dataset twice. Once for the entire range of values and then for incomplete orders. We create two summary datasets aggregated by the service_name field.
  • We then merge the two datasets and calculate the percentage loss numbers for transactions and monetary value and output the relevant fields.

Solution

# Import your libraries
import pandas as pd
 
# Get the overall numbers
overall_df = uber_orders.groupby(by = ['service_name'], as_index = False).sum()
 
# Summarize incomplete orders
loss_df = uber_orders[uber_orders['status_of_order'] != 'Completed'].groupby
(by = ['service_name'], as_index = False).sum()
 
# Merge the datasets
summ_df = pd.merge(overall_df, loss_df, on = 'service_name', how = 'left', 
suffixes = ['_total', '_loss'])
 
# Calculate loss metrics and output the relevant columns
summ_df['loss_order_pct'] = summ_df['number_of_orders_loss'] / summ_df
['number_of_orders_total'] * 100
summ_df['loss_value_pct'] = summ_df
['monetary_value_loss'] / summ_df['monetary_value_total'] * 100
summ_df[['service_name', 'loss_order_pct', 'loss_value_pct']]

Business Metrics and KPIs: Forecasting

Forecasting is a critical part of any business to ensure that one does not suffer from stock outs or over supply of inventory. This is a slightly complex business problem that appeared in Uber Data Science Interviews.

17. Naive Forecasting

Develop a naive forecast for a new metric called "distance per dollar" defined as the (distance_to_travel/monetary_cost) in our dataset and measure its accuracy. A naive forecast is simply the value of the previous observation. Distance to travel is defined as the ratio of the “distance to travel” and “monetary cost” fields. The forecast number is the monthly aggregate across all trips for the previous month.

To check the accuracy of the forecast, we use the Root Mean Squared Error (RMSE) Metric. RMSE is defined as sqrt(mean(square(actual - forecast)).
Output the RMSE rounded to the 2nd decimal spot as the result.

You can solve the problem here. https://platform.stratascratch.com/coding/10313-naive-forecasting

The problem uses the following dataset

uber_request_logs

request_idint64
request_datedatetime64[ns]
request_statusobject
distance_to_travelfloat64
monetary_costfloat64
driver_to_client_distancefloat64

This is a sample of the data present in the uber_request_logs dataset.

Python Interview Questions Data for Naive Forecasting

Approach

  • There are two major challenges to solve this set.
    • We first need to extract the month from the timestamp. To do this, we construct a string using the date methods.
    • Then we summarize the dataset based on this month.
    • Remember, we need to calculate the monthly value of the distance_to_travel and monetary_cost.
    • For this, we sum the two columns across months and then take a ratio. If we take the average of individual distance_to_travel to monetary_cost ratios, it will give the wrong answer.
  • We then use the shift method to get the previous month’s value as the naïve forecast for the next month.
  • Finally, we calculate the RMSE as per the formula given in the problem, round off the result to two decimal places and out the result.

Solution

# Import your libraries
import pandas as pd
import numpy as np
 
# Create the month field from the date
uber_request_logs['month'] = pd.to_datetime(uber_request_logs
['request_date'].apply
(lambda x : str(x.year) + str(x.month).zfill(2) + '01')).dt.date
 
# Summarize the dataset by month
uber_summ = uber_request_logs.groupby(by = ['month'], as_index = False).
sum().sort_values(by = ['month'])
 
# Calculate the distance per dollar metric
uber_summ['dist_per_dollar'] = uber_summ['distance_to_travel'] / uber_summ
['monetary_cost'] 
 
# Pull the Naive Forecast
uber_summ['forecast'] = uber_summ['dist_per_dollar'].shift(1)
 
# Calculate RMSE and round to two decimal places
round(np.sqrt(((uber_summ['dist_per_dollar'] - uber_summ
['forecast']) **2).mean()),2)

Text Manipulation

While traditionally a lot of data storage systems contain tabular data, over the last decade, there has been an explosion of unstructured data usually stored in JSON and text format. Working with these data types presents a different challenge compared to working with traditional databases. Here we look at a few text wrangling problems that have appeared in Python Data Science Interviews.

We start off with an easy one. This appeared in a Yelp Data Science Interview.

18. Reviews of Categories

Find the total number of reviews for each category. Output the total number of reviews and the category ordered by the number of reviews with the category having the highest number of reviews coming first and the one with lowest last.

Python Interview Questions for Reviews of Categories

You can solve the problem here.

https://platform.stratascratch.com/coding/10049-reviews-of-categories

The problem uses the yelp_business dataset that has the following fields.

business_idobject
nameobject
neighborhoodobject
addressobject
cityobject
stateobject
postal_codeobject
latitudefloat64
longitudefloat64
starsfloat64
review_countint64
is_openint64
categoriesobject

The data contained in the dataset looks like this.

Python Interview Questions Data for Reviews of Categories

Approach

  • The fields that are of interest to us are review_count and categories.
  • As one can see, a restaurant can belong to multiple categories. Therefore, we need to split the categories string and count the reviews in each of the resultant categories.
  • To do this quickly, we first convert the string into a list of categories. We then invoke the string method split(), using a semicolon (;) as the delimiter.
  • Once we get the list, we can call the explode method on the column to get each individual category as a separate row with the rest of the data duplicated.
  • Finally, we can summarize by the individual categories, sort in descending order of the reviews and output the result.

Solution

# Import your libraries
import pandas as pd
 
# convert categories into a list
yelp_business['category_list'] = yelp_business['categories'].str.split(';')
 
# Explode the category list, summarize by the resultant categories, sort and 
submit
yelp_business.explode('category_list').groupby(by = ['category_list'], 
as_index = False).agg({'review_count' : 'sum'}).sort_values
(by = 'review_count', ascending = False)

Let us try a slightly more difficult problem. This one came up in a Google Data Science Interview.

19. Find the number of times each word appears in drafts

Output the number of times each word appears in drafts. Report the word and the corresponding number of occurrences.

Python Interview Questions Data to Find the number of times

You can solve this problem here.

https://platform.stratascratch.com/coding/9817-find-the-number-of-times-each-word-appears-in-drafts

The problem uses the following dataset

google_file_store

filenameobject
contentsobject

The data looks like this.

Data of Python Interview Questions to Find the number of times

Approach

  • We first start by selecting the drafts from all the documents.
  • Since we need a word count, we split the contents column and call the explode method again.
  • Finally, we use the value_counts() method to find the frequency of each word and report the result.

Solution

# Import your libraries
import pandas as pd
 
# Get only the files that are "drafts"
drafts = google_file_store[google_file_store['filename'].str.lower().
str.contains('draft')]
 
# split and explode the contents field, calculate frequencies and submit
drafts['contents'].str.split().explode().value_counts()

Let us finish text manipulation with a problem involving string matching. This came up in a City of San Francisco Data Science Interview.

20. Classify Business Type

Classify each business as per the following criteria.

  • If the word ‘restaurant’ appears in the business name, classify it as a restaurant.
  • If the words ‘café’, ‘coffee’, or 'café' appear in the business name, then it should be classified as a cafe.
  • If the word ‘school’ appears in the business name, then classify it as a school.
  • All other businesses not satisfying any of the above conditions should be classified under other.
Python Interview Questions for Classify Business Type

You can solve the problem here.

https://platform.stratascratch.com/coding/9726-classify-business-type

The problem uses the following dataset

sf_restaurant_health_violations

business_idint64
business_nameobject
business_addressobject
business_cityobject
business_stateobject
business_postal_codefloat64
business_latitudefloat64
business_longitudefloat64
business_locationobject
business_phone_numberfloat64
inspection_idobject
inspection_datedatetime64[ns]
inspection_scorefloat64
inspection_typeobject
violation_idobject
violation_descriptionobject
risk_categoryobject

The dataset looks like this.

Python Interview Questions Data for Classify Business Type

Approach

  • The only field of interest here is the business_name.
  • We use multiple if conditions inside a lambda function.
  • We check the business names as per each of the classification rules and assign a category
  • Finally, we output the business names with their categories

Solution

# Import your libraries
import pandas as pd
 
# Classify as per the rules
sf_restaurant_health_violations['category'] = sf_restaurant_health_violations
['business_name'].apply(lambda x: \
'school' if x.lower().find('school') >= 0 \
else 'restaurant' if x.lower().find('restaurant') >= 0 \
else 'cafe' if (x.lower().find('cafe') >= 0 or x.lower().find('café') >= 0 \
or  x.lower().find('coffee') >= 0) \
else 'other'
    )
 
 
# Output relevant fields    
sf_restaurant_health_violations[['business_name', 'category']].
drop_duplicates()

Product Specific Python Interview Questions

Besides these types, there are some problems that are very peculiar to the company and its products. Here we list ten such problems that will give you a very good understanding of the type of problems that you can expect. The idea behind these problems is to relate the concepts known to you with the products and services that a company offers and try to visualize the data setup, metrics that might be used and prepare for their Data Science Interview. You will also find a lot of concepts already covered thus far.

21. Airbnb Python Interview Question

City With Most Amenities

Given a dataset of searches for properties on Airbnb, find the city with the most amenities across all their host's properties. Assume that each search result (i.e., each row) represents a unique host. Output the name of the city as your solution.

Python Interview Questions for City With Most Amenities

You can try the full problem here. https://platform.stratascratch.com/coding/9633-city-with-most-amenities

The problem uses the following dataset

airbnb_search_details

idint64
pricefloat64
property_typeobject
room_typeobject
amenitiesobject
accommodatesint64
bathroomsint64
bed_typeobject
cancellation_policyobject
cleaning_feebool
cityobject
host_identity_verifiedobject
host_response_rateobject
host_sincedatetime64[ns]
neighbourhoodobject
number_of_reviewsint64
review_scores_ratingfloat64
zipcodeint64
bedroomsint64
bedsint64

This is how the dataset looks.

Python Interview Questions Data for City With Most Amenities

Approach

  • While there are a lot of fields in the dataset, there solution just needs two: amenities and city
  • We need to aggregate the number of amenities across all listings in a city. We start off by finding the number of amenities in a particular listing. To do this, we find the length of the amenities set.
  • We then aggregate the amenities for each city and finally output the city with the highest number of amenities.

Solution

# Import your libraries
import pandas as pd
 
# get the number of number of amenities
airbnb_search_details['num_amenities'] = airbnb_search_details
['amenities'].apply(lambda x : len(x))
 
# Summarize by city and output the city with the most amenities
airbnb_search_details.groupby(by = ['city'], as_index = False).agg
({'num_amenities' : 'sum'}).sort_values(by = ['num_amenities'], 
ascending = False)['city'].iloc[0]

Airbnb is an online marketplace for vacation stays, homestays and lodging. You can expect problems with cases presented in this setting. You can also go through a more detailed AirBnB Data Science Interview guide here.

22. Amazon Python Interview Question

Revenue Over Time

Given a table with users, their purchase amount, and date purchased, determine the three-month rolling average of the total revenues generated from purchases. Output the result sorted by the month from earliest to latest month.

Note: Exclude the returns which are represented by negative purchase values. For the first two months, calculate the average only up to that month.

Python Interview Questions for Revenue Over Time

You can solve the problem here.

https://platform.stratascratch.com/coding/10314-revenue-over-time

The problem uses the amazon_purchases dataset that has the following fields.

user_idint64
created_atdatetime64[ns]
purchase_amtint64

And the dataset looks like this.

Data of Python Interview Questions for Revenue Over Time

Approach

  • We start off by excluding the return transactions. To do this, we keep only those transactions where the purchase_amt is positive.
  • Then we extract the month from the date field.
  • We then aggregate the purchases for each month. This results in a dataset with total monthly purchases.
  • To calculate the rolling average, we call the rolling() method on the purchases column.
  • To ensure that the values are calculated even for the first two months even though we do not have three months of data, we set min_periods to 1.
  • Finally we subset the relevant fields and output the result.

Solution

# Import your libraries
import pandas as pd
 
# Remove refund transactions
pos_df = amazon_purchases[amazon_purchases['purchase_amt'] > 0]
# Create Month indicator
pos_df['month'] = pos_df['created_at'].apply(lambda x : str(x.year) + "-" + 
str(x.month).zfill(2))
 
# Aggregate the purchases by month
summ_df = pos_df.groupby(by = ['month'], as_index = False).sum()
[['month', 'purchase_amt']].sort_values(by = ['month'])
 
# Calculate the rolling average, ensure that the value is calculated even 
for the first two months
summ_df['roll_avg'] = summ_df['purchase_amt'].rolling
(3, min_periods = 1).mean()
 
# Output relevant fields
summ_df[['month', 'roll_avg']]

Amazon is one of the largest companies in the world. It operates across multiple product and service categories across the world. It is mostly known for its eponymous website. Find more data science interview questions from Amazon here.

23. DoorDash Python Interview Question

Avg Order Cost During Rush Hours

Find the average order cost during rush hours (3PM to 6PM) in San Jose. Order the results by hour.

Python Interview Questions for Avg Order Cost

You can solve the full problem here:

https://platform.stratascratch.com/coding/2035-avg-order-cost-during-rush-hours

The problem uses the following dataset

doordash_delivery

customer_placed_order_datetimedatetime64[ns]
placed_order_with_restaurant_datetimedatetime64[ns]
driver_at_restaurant_datetimedatetime64[ns]
delivered_to_consumer_datetimedatetime64[ns]
driver_idint64
restaurant_idint64
consumer_idint64
is_newbool
delivery_regionobject
is_asapbool
order_totalfloat64
discount_amountfloat64
tip_amountfloat64
refunded_amountfloat64

This is how the dataset looks

Python Interview Questions Data for Avg Order Cost

Approach

  • We start by extracting the hour of delivery by using the hour method on the customer_placed_order_datetime field as suggested in the problem.
  • We then subset data keeping only the orders placed in the San Jose area and between the 3PM to 6PM.
  • Finally, we summarize the mean order value for each hour and output.

Solution

# Import your libraries
import pandas as pd
 
# Find the hour or delivery
doordash_delivery['hour'] = doordash_delivery
['customer_placed_order_datetime'].dt.hour
# Keep only San Jose orders placed between 3 to 6 PM
rush_df = doordash_delivery[(doordash_delivery['hour'].isin([15,16,17, 18])) & 
(doordash_delivery['delivery_region'] == 'San Jose')]
# Aggregate and output
rush_df.groupby(by = ['hour'], as_index = False).agg({'order_total': 'mean'})

Doordash is an online food delivery and ordering platform and is the largest such platform in the US. It also provides convenience store delivery services. You can attempt more such problems from Doordash Data Science Interviews Questions here.

24. Google Python Interview Question

Activity Rank

Email activity rank is calculated on the total number of emails sent, with the user with the highest number of emails sent given the top rank. Calculate the email activity rank for each user. If two users have sent the same number of emails, then the users are ranked in the alphabetical order of their IDs. Report the user id, number of emails sent and the rank. Output should be sorted by rank.

Python Interview Questions for Activity Rank

You can try this problem here.

https://platform.stratascratch.com/coding/10351-activity-rank

The problem uses the google_gmail_emails dataset that contains the following fields.

idint64
from_userobject
to_userobject
dayint64

This is how the data looks.

Data of Python Interview Questions for Activity Rank

Approach

  • The sender’s user id is the from_user column. Therefore, we summarize the dataset by the from_user field aggregating the number of emails sent.
  • We sort the resultant dataset by descending order of number of emails sent. To break the ties, we use the alphabetical order of the user id as a secondary sort field.
  • Finally, we create a rank field and output the relevant fields

Solution

# Import your libraries
import pandas as pd
 
# Get the number of mails sent by each user, sort by the results
summ_df = google_gmail_emails.groupby(by = ['from_user'], as_index = False).
count().sort_values(by = ['id', 'from_user'], ascending = [False, True])
 
# Add a Rank field and output relevant fields
summ_df['rank'] = [i+1 for i in range(len(summ_df))]
summ_df[['from_user', 'id', 'rank']]

Google is synonymous with search. One of the most valuable companies in the world Google operates in multiple domains. Check out our ultimate guide to become a Data Scientist at Google.

25. Microsoft Python Interview Questions

Users Exclusive Per Client

Write a query that returns the number of users who are exclusive to only one client. Output the client_id and number of exclusive users.

Python Interview Questions for Users Exclusive Per Client

You can solve the problem here.

https://platform.stratascratch.com/coding/2025-users-exclusive-per-client

The problem uses the fact_events dataset which has the following fields.

idint64
time_iddatetime64[ns]
user_idobject
customer_idobject
client_idobject
event_typeobject
event_idint64

The data contained in it looks like this.

Data of Python Interview Questions for Users Exclusive Per Client

Approach

  • Since we need to only find the number of clients used by the a particular user, we remove all duplicate combinations of user_id and client_id
  • We then count the number of client_ids used by each user.
  • Finally, we subset only those users where the number of clients used is equal to one and output the relevant fields.

Solution

# Import your libraries
import pandas as pd
 
# Find the number of clients used by a user. 
user_df = fact_events.drop_duplicates(subset = ['user_id', 'client_id'])
[['user_id', 'client_id']].groupby(by = ['user_id'], as_index = False).count()
 
# Subset those users who use only one client and output relevant fields
out_df = fact_events[fact_events['user_id'].isin(user_df[user_df['client_id'] 
== 1]['user_id']
    )][['user_id', 'client_id']].drop_duplicates().groupby(by = ['client_id'], 
as_index =False).count()
out_df

Microsoft is one of the oldest tech companies and recently crossed a market value of $1 trillion. Microsoft operates across multiple business segments: Operating Systems, Productivity Software, Tablet PCs, Cloud Infrastructure among others. Find the recent Python interview questions from Microsoft for the Data Scientist position here.

26. Netflix Python Interview Question

Win-to-Nomination Ratio

For each nominee, calculate the win-to-nomination ratio. Report each nominee and their win-to-nomination ratio such that the nominee with the highest ratio is on top while the one with the lowest ratio is bottom.

Python Interview Questions for Win-to-Nomination

You can solve the problem here.

https://platform.stratascratch.com/coding/9752-win-to-nomination-ratio

The problem uses the following dataset

oscar_nominees

yearfloat64
categoryobject
nomineeobject
movieobject
winnerbool
idint64

This is how the dataset looks.

Python Interview Questions Data for Win-to-Nomination

Approach

  • Given the nature of the dataset, we can easily get the required metrics by using the pivot_table function.
  • We aggregate the number of nominations counting the number of wins separately for each nominee
  • Now we can calculate the win-to-nomination ratio easily.
  • Finally, we sort the summarized dataset in descending order of the ratio and output the relevant fields.

Solution

# Import your libraries
import pandas as pd
 
# Aggregate the number of nominations for each nominee, reporting wins separately
summ_df = pd.pivot_table(data = oscar_nominees, index = 'nominee', columns = 
'winner', values = 'category', aggfunc = 'count').reset_index().fillna(0)
 
# Calculate the win to nomination ratio
summ_df['ratio'] = summ_df[True] / (summ_df[True] + summ_df[False])
 
# Sort in descending order of ratio and output relevant fields
summ_df[['nominee', 'ratio']].sort_values(by = ['ratio'], ascending = False)

Netflix is the largest media organization in the world by market capitalization and is widely known for its OTT media platform and original programming. You can solve more Netflix python interview questions on the StrataScratch platform.

27. Noom Python Interview Question

Signup By Day

Create a query that returns an array of signup ids by day. Order the output by the number of signups with the day on with the date with the most number of signups appearing first.

Python Interview Questions for Signup By Day

You can solve the problem here.

https://platform.stratascratch.com/coding/2030-signup-by-day

The problem uses the signups dataset that contains the following fields.

signup_idint64
signup_start_datedatetime64[ns]
signup_stop_datedatetime64[ns]
plan_idint64
locationobject

This is how the data looks like.

Python Interview Questions Data for Signup By Day

Approach

  • We first take the date value from the signup_start_date field.
  • Since we need to create a list of signups, we group all signup_ids by date and create a list using the apply method.
  • We now get the number of signups for each day by taking the length of the created list.
  • Finally, we sort the summarized dataset by the number of signups and output the relevant fields.

Solution

# Import your libraries
import pandas as pd
 
# Extract date value
signups['date'] = pd.to_datetime(signups['signup_start_date'].dt.date)
 
# Groupby date and create a list of signup_id
summ_df = signups[['date', 'signup_id']].groupby(by = ['date'])
['signup_id'].apply(list).reset_index()
 
# Calculate the number of signups for each day
summ_df['num_signups'] = summ_df['signup_id'].apply(lambda x : len(x))
 
# Order by the number of signups and output the relevant fields
summ_df.sort_values(by = ['num_signups', 'date'], ascending = 
[False, True]).drop(columns = ['num_signups'])

Noom operates a subscription-based app that tracks a person’s food intake and exercise habits. You can see more Noom Data Science Interview Problems here.

28. Salesforce Python Interview Question

Total Sales In Different Currencies

You are provided the sales amount for a multinational company in different currencies. You are also provided the currency exchange rates against US dollars (USD) on different dates. Report the total sales, per quarter, for the first 2 quarters in 2020, and the sales in USD currency.

Python Interview Questions for Total Sales

You can try the problem here

https://platform.stratascratch.com/coding/2041-total-sales-in-different-currencies

The problem uses two datasets.

sf_exchange_rate

source_currencyobject
target_currencyobject
exchange_ratefloat64
datedatetime64[ns]

The data is arranged in the following manner.

sf exchange data of python interview questions for total sales

The second dataset is called sf_sales_amount containing the following fields.

sales_datedatetime64[ns]
sales_amountint64
currencyobject

The data present in this set looks like this.

sf sales data of python interview questions for total sales

Approach

  • The sales data is present in the sf_sales_amount dataset. However, it is reported in different currencies. To get the USD sales value, we need to get the exchange rate on the date of the transaction.
  • We start off by merging the two datasets matching the combination of sales_date and currency from the sf_sales_amount dataset with the combination of date and source_currency fields from the sf_exchange_rate dataset. Keeping all the transactions in the sf_sales_amount dataset.
  • We then extract the reporting quarter from the sales_date field. To do this we use the built-in quarter method for datetime objects.
  • Finally we calculate the USD amount by multiplying the sales amount with the exchange rate and summarize by quarter.

Solution

# Import your libraries
import pandas as pd
 
# Merge the two datasets, by currency and date
merged_df = pd.merge(sf_sales_amount, sf_exchange_rate, how = 'left', left_on = 
['sales_date', 'currency'], right_on = ['date', 'source_currency'])
 
# Calculate the USD value of  sales
merged_df['usd_value'] = merged_df['sales_amount'] * merged_df['exchange_rate']
 
# Create a quarter field from the date
merged_df['quarter'] = merged_df['sales_date'].dt.quarter
 
# Summarize the usd sales value by quarter
summ_df = merged_df.groupby(by = ['quarter'], as_index = False).sum()
 
# Keep only the first two quarter data and output relevant fields
summ_df[summ_df['quarter'].isin([1,2])][['quarter', 'usd_value']]

Salesforce is a cloud-based SaaS company that provides CRM and associated enterprise solutions focused on marketing automation, customer service, application development, et al. You can solve more Salesforce Data Scientist Coding Interview problems here.

29. Twitch Python Interview Question

Viewers Turned Streamers

Find out how many streaming sessions are conducted by users who started out as viewers (users whose first session on the platform was as a viewer). Output the user id and number of sessions. Sort in descending order of the number of streaming sessions conducted. If two users have the same number of sessions, then the order in the ascending order of their user ids.

Python Interview Questions for Viewers Turned Streamers

You can solve the full problem here.

https://platform.stratascratch.com/coding/2012-viewers-turned-streamers

The problem uses the twitch_session dataset that contains the following fields.

user_idint64
session_startdatetime64[ns]
session_enddatetime64[ns]
session_idint64
session_typeobject

This is the data contained in the dataset.

Data of Python Interview Questions for Viewers Turned Streamers

Approach

  • We start by taking the information about the very first session for a user.
  • To find the streaming sessions, we count the number of sessions for each user where the session_type is ‘streamer’
  • We merge the two datasets keeping only those users whose first session was as a ‘viewer’.
  • Finally, we sort the dataset in descending order of the number of streaming sessions. To break the ties, we sort in the ascending order of the user ids.

Solution

# Import your libraries
import pandas as pd
 
# Get the very first session details for each user
first_session = twitch_sessions.sort_values(by = ['user_id', 'session_start']).
drop_duplicates(subset = ['user_id'])
 
# Summarize streaming information
streamer_sessions = twitch_sessions[twitch_sessions['session_type'] == 'streamer'].
groupby(by = ['user_id']).agg({'session_id' : 'count'})
 
# merge the two datasets, subset relevant rows and output relevant fields
merged_df = pd.merge(first_session[first_session['session_type'] == 'viewer']
[['user_id']], streamer_sessions, on = 'user_id', how = 'left').sort_values
(by = ['session_id', 'user_id'], ascending = [False, True])

Twitch is a subsidiary of Amazon and is a live-streaming service focusing primarily on broadcasts of esports competitions and video game live streams. You can find more Twitch Data Science Interview questions here.

30. Uber Python Interview Question

Highest Total Miles

Given a table of Uber rides that contains the mileage and the purpose for the business expense, find out the business purposes that generate the most miles driven. Report the top 3 business purpose categories by total mileage.

Python Interview Questions for Highest Total Miles

You can solve the problem here

https://platform.stratascratch.com/coding/10169-highest-total-miles

The problem uses the following dataset

my_uber_drives

start_datedatetime64[ns]
end_datedatetime64[ns]
categoryobject
startobject
stopobject
milesfloat64
purposeobject

This is how the data is presented.

Dataset of Python Interview Questions for Highest Total Miles

Approach

  • To solve this problem, we aggregate the number of miles by purpose.
  • We sort the summarized dataset in descending order of the number of miles and report the top three results.
  • We can combine this entire set of operations into a single step with method-chaining.

Solution

# Import your libraries
import pandas as pd
 
# Sum the number of miles by purpose and report the top three records
my_uber_drives.groupby(by = ['purpose'], as_index = False).agg
({'miles': 'sum'}).sort_values(by = ['miles'], ascending = False)[:3]

Uber is the largest ride-sharing company in the US. Besides ride-sharing, Uber also has package delivery, freight transportation, food delivery and courier operations. You can find more Uber Data Science Interview questions here.

Also, check out our previous article where we show how to approach and solve the challenging Python interview questions.

Conclusion

Here you have a comprehensive overview of the types of Python interview questions asked in Data Science Interviews at top companies like Amazon, Google, Microsoft, etc. We also looked at various aspects of business that you need to be familiar with to complete these coding interviews. The keys to improving your skills are patience, persistence, and practice.

Also, if you're confused between the two statistical languages, Python and R, and want to know which language is better, check out our article on Python vs R for Data Science.

Join a community of over 20,000 likeminded data science aspirants to collaborate and compete in solving over 500 coding and non-coding Data Science Interview questions. Sign up for Stratascratch today and make your dream of working at top tech companies and trailblazing start-ups a reality.