# Top 30 Python Interview Questions and Answers

A comprehensive overview of the types of Python interview questions asked in Data Science Interviews at top companies like Amazon, Google, Microsoft, etc.

Python has been consistently rated as the top programming language. Besides being widely used in a range of applications from web and mobile development to API management and process automation, Python rules the Data Science world. Python is free, open-source, and backed by a vast collection of well-documented and continuously updated libraries. Unlike languages with finicky syntax, Python is very easy to pick up, and even people without a CS background can learn it quickly.

## Technical concepts used in Python Interview Questions

The most common test areas in Python Data Science Interviews include:

• Using the Python standard library, specifically built-in data types and data structures like strings, Booleans, lists, and dictionaries.
• Manipulating datetime objects.
• Working with tabular data, specifically using the Pandas and NumPy libraries.
• Knowledge of Pandas functions, including:
  • Indexing and slicing
  • Aggregation and rollups
  • Vectorized operations

## Python Interview Questions

For this article, we will focus specifically on Python interview questions for the Data Scientist position. The problems involving Python for Data Scientist roles can broadly be grouped into the following categories.

• Math, Stats and Probability questions.
• Data Cleaning and Wrangling
• Datetime Operations
• Text Manipulation
• Product Specific Problems

Let us look at each of these types of problems in detail.

## Math and Probability Questions for Python Data Science Interviews

These python interview questions require the use of mathematical functions and/or formulae. Some of the problems can be solved using high school math, while others need a greater knowledge of probability and combinatorics. Let us look at a couple of them. The first one is from Credit Karma and requires only simple high school math.

#### 1. Share of Loan Balance

Group each submission’s total balance based on rate_type and report the rate_type, loan_id and balance of each loan type. Also calculate the percentage of the submission’s total balance each loan constitutes.

You can solve this problem here.

https://platform.stratascratch.com/coding/2001-share-of-loan-balance

This is the data that we will be working with.

submissions

| Column | Type |
| --- | --- |
| id | int64 |
| balance | float64 |
| interest_rate | float64 |
| rate_type | object |
| loan_id | int64 |

And here is what the data looks like.

Approach

This is a relatively straightforward problem.

• We first need to get the totals across all rate types (fixed or variable)
• Then we proceed to merge this information to the full dataset by rate_type
• Now the loan share can be calculated relatively easily.

Solution

```python
# Import your libraries
import pandas as pd

# Get total loan balances by rate type
loan_summ = submissions[['rate_type', 'balance']].groupby(
    ['rate_type'], as_index=False).sum().rename(
    columns={'balance': 'total_balance'})

# Merge with the full dataset
output_df = pd.merge(submissions, loan_summ, on='rate_type', how='left')

# Calculate the share
output_df['loan_share'] = output_df['balance'] / output_df['total_balance']

# Keep relevant columns
output_df = output_df[['loan_id', 'rate_type', 'balance', 'loan_share']]

# Submit
output_df
```
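
On a small, made-up submissions table (hypothetical values, not the platform's data), the groupby → merge → divide pattern works like this:

```python
import pandas as pd

# Hypothetical mini version of the submissions table
submissions_toy = pd.DataFrame({
    'loan_id': [1, 2, 3],
    'rate_type': ['fixed', 'fixed', 'variable'],
    'balance': [100.0, 300.0, 50.0],
})

# Total balance per rate type
totals = (submissions_toy.groupby('rate_type', as_index=False)['balance']
          .sum().rename(columns={'balance': 'total_balance'}))

# Merge the totals back and compute each loan's share of its rate type
out = submissions_toy.merge(totals, on='rate_type', how='left')
out['loan_share'] = out['balance'] / out['total_balance']
```

Loan 1 gets a share of 0.25 (100 out of the 400 fixed-rate total), loan 2 gets 0.75, and the single variable-rate loan gets 1.0.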

The next one is similar. It involves an understanding of probability but is not too difficult. This one is from a Lyft Data Science Interview.

#### 2. Find the probability of ordering a ride based on the weather and the hour

Find the probability of ordering a ride based on the weather and the hour and display the weather, hour and the corresponding probability sorted by the weather and the hour in ascending order.

You can solve this python interview question here.

https://platform.stratascratch.com/coding/10019-find-the-probability-of-ordering-a-ride-based-on-the-weather-and-the-hour

This is the dataset that we will be working with.

lyft_rides

| Column | Type |
| --- | --- |
| index | int64 |
| weather | object |
| hour | int64 |
| travel_distance | float64 |
| gasoline_cost | float64 |

This is what the data looks like.

Approach

• Here we are required to find out the probability of the rides for each combination of weather and hour.
• Therefore, we start off by aggregating the data by weather and hour. Since we need to count the number of events, we take up the gasoline_cost variable and count the values. This gives the number of occurrences by each required combination.
• We then find the probability by dividing these counts by the total number of occurrences (which is the number of rows in the dataset).
• Finally, we sort the results in the required order and output only the relevant columns.

Solution

```python
# Import your libraries
import pandas as pd

# Find the number of rides by weather and hour
out_df = lyft_rides[['weather', 'hour', 'gasoline_cost']].groupby(
    by=['weather', 'hour'], as_index=False).count().rename(
    columns={'gasoline_cost': 'count'})

# Calculate probability
out_df['probability'] = out_df['count'] / len(lyft_rides)

# Output the relevant columns and sort
out_df = out_df[['weather', 'hour', 'probability']].sort_values(
    by=['weather', 'hour'])

# Submit
out_df
```
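
The count-then-normalize idea can be checked on a toy rides table (hypothetical data): each group's count divided by the total row count is its empirical probability, and the probabilities sum to 1.

```python
import pandas as pd

rides = pd.DataFrame({
    'weather': ['sunny', 'sunny', 'rain', 'sunny'],
    'hour': [9, 9, 9, 10],
    'gasoline_cost': [5.0, 6.0, 4.0, 7.0],
})

# Count rides per (weather, hour) combination
counts = (rides.groupby(['weather', 'hour'], as_index=False)['gasoline_cost']
          .count().rename(columns={'gasoline_cost': 'count'}))

# Empirical probability = group count / total number of rides
counts['probability'] = counts['count'] / len(rides)
result = counts[['weather', 'hour', 'probability']].sort_values(
    ['weather', 'hour'])
```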

You can refer to this article for more such questions on using Python for solving Math, Probability and Stats questions for Data Science Interviews.

## Statistics Questions for Python Data Science Interviews

Statistics-based Data Science Interview questions may not always involve coding. You might be asked problems that involve the use of statistics and machine learning algorithms.

In this article, we will concentrate on problems that involve coding in Python. Most of these python interview questions will test your understanding of statistical measures like mean, median, mode, percentiles, variance, standard deviation, et al. Let us look at a question that appeared in Python Data Science Interviews. This involves calculating percentiles and median.

#### 3. Find students with a median writing score

The question is available here

https://platform.stratascratch.com/coding/9610-find-students-with-a-median-writing-score

This is the dataset that we are working with

sat_scores

| Column | Type |
| --- | --- |
| school | object |
| teacher | object |
| student_id | float64 |
| sat_writing | float64 |
| sat_verbal | float64 |
| sat_math | float64 |
| hrs_studied | float64 |
| id | int64 |
| average_sat | float64 |
| love | datetime64[ns] |

And this is the data in the dataset.

Approach

Median is the middle value when the data is sorted in ascending or descending order. Median is also the 50th percentile. To solve this python interview question, we can call the rank method on the writing score column and then get the 50th percentile score. Then subset the data to get only those values that have a percentile rank of 50.

Solution

```python
# Import your libraries
import pandas as pd
import numpy as np

# Calculate percentile ranks with the built-in rank method
sat_scores['percentile'] = (sat_scores['sat_writing'].rank(pct=True) *
                            100).apply(np.floor)

# Output median (50th percentile) values
sat_scores[sat_scores['percentile'] == 50]
```
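
To see why rank(pct=True) flags the median row, here is a minimal sketch with four made-up writing scores: with no ties, the percentile ranks are 25, 50, 75, and 100, and the row whose rank floors to exactly 50 is the lower middle value.

```python
import pandas as pd
import numpy as np

writing = pd.Series([480, 510, 520, 600])

# Percentile rank of each score, floored to a whole number
percentile = (writing.rank(pct=True) * 100).apply(np.floor)

# Rows sitting exactly at the 50th percentile
median_rows = writing[percentile == 50]
```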

## Data Cleaning and Wrangling: Working with duplicates

Data Science is not only about Math and Statistics. One also needs to have a good handle on how to manipulate data to find the right output. A very common problem in Data Science Interviews is dealing with duplicates and finding out unique values. It is very important to know if the values in a particular column are unique, especially when you are merging additional datasets. Let us take a couple of examples that involve these concepts.

The first one is from a Virgin Group Data Science Interview

#### 4. Count the unique origin airports

Find how many different origin airports exist.

The python interview question can be solved on the Stratascratch platform here.

https://platform.stratascratch.com/coding/9672-count-the-unique-origin-airports

These are fields in the dataset.

us_flights

| Column | Type |
| --- | --- |
| flight_date | datetime64[ns] |
| unique_carrier | object |
| flight_num | int64 |
| origin | object |
| dest | object |
| arr_delay | float64 |
| cancelled | int64 |
| distance | float64 |
| carier_delay | float64 |
| weather_delay | float64 |
| late_aircraft_delay | float64 |
| nas_delay | float64 |
| security_delay | float64 |
| actual_elapsed_time | float64 |

And this is what the dataset looks like.

Approach

There are quite a few fields in this dataset, but the one of interest to us is origin. We need to count the number of distinct airports. We can do this simply by passing the values to a set. A set is a built-in Python data type that contains only distinct values. If you know about sets, this python interview question can be solved in a single line of code!

Solution

```python
# Import your libraries
import pandas as pd

# Convert the column values to a set and find its length
len(set(us_flights['origin']))
```
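
A quick sanity check on toy data (hypothetical airport codes); note that Pandas' own nunique method gives the same answer and is the more idiomatic choice:

```python
import pandas as pd

flights = pd.DataFrame({'origin': ['JFK', 'SFO', 'JFK', 'ORD', 'SFO']})

# Distinct origins via a set, and via the Pandas built-in
n_unique = len(set(flights['origin']))
n_unique_pandas = flights['origin'].nunique()
```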

Let us try another one. This one is from a SalesForce Data Science Interview.

#### 5. Duplicate Emails

Find all emails with duplicates.

You can solve this python interview question here.

https://platform.stratascratch.com/coding/9895-duplicate-emails

The fields in the data set are:

employee

| Column | Type |
| --- | --- |
| id | int64 |
| first_name | object |
| last_name | object |
| age | int64 |
| sex | object |
| employee_title | object |
| department | object |
| salary | int64 |
| target | int64 |
| bonus | int64 |
| email | object |
| city | object |
| address | object |
| manager_id | int64 |

This is what the data looks like.

Approach

• The dataset again has a lot of fields, but the one we are interested in is email.
• To solve this problem, we can use the groupby method in Pandas and calculate the count.
• We then subset the resulting dataframe selecting only those emails with a count greater than 1. In other words, we select only those email ids that appear more than once.

Solution

```python
# Import your libraries
import pandas as pd

# Find the count of records for each email id
count_df = employee[['email', 'id']].groupby(by='email',
                                             as_index=False).count()

# Subset only repeated values
count_df[count_df['id'] > 1]['email']
```
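
The groupby-count-filter pattern on a toy employee table (hypothetical emails):

```python
import pandas as pd

employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'email': ['a@x.com', 'b@x.com', 'a@x.com', 'c@x.com'],
})

# Count rows per email, then keep emails that appear more than once
counts = employees.groupby('email', as_index=False)['id'].count()
dupes = counts[counts['id'] > 1]['email']
```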

## Data Cleaning and Wrangling: Datetime manipulation

Another popular genre of python interview questions is one that involves manipulating date and time values. These questions are a perennial favorite of interviewers as you can create a range of questions with very simple looking datasets. We have looked at manipulating date time values with SQL in this SQL scenario based interview questions article. Let us see how to do something similar in Python.

The first one is from an Airbnb Data Science Interview

#### 6. Find the average difference between booking and check-in dates

Find out the average number of days between the booking and check-in dates for AirBnB hosts. Report the results based on the average number of days ordered in descending order.

You can solve this problem here. https://platform.stratascratch.com/coding/9614-find-the-average-difference-between-booking-and-check-in-dates

The dataset contains the following fields

airbnb_contacts

| Column | Type |
| --- | --- |
| id_guest | object |
| id_host | object |
| id_listing | object |
| ts_contact_at | datetime64[ns] |
| ts_reply_at | datetime64[ns] |
| ts_accepted_at | datetime64[ns] |
| ts_booking_at | datetime64[ns] |
| ds_checkin | datetime64[ns] |
| ds_checkout | datetime64[ns] |
| n_guests | int64 |
| n_messages | int64 |

The dataset looks like this.

Approach

• The relevant fields here are id_host (identifies the host), ts_booking_at (the booking timestamp) and ds_checkin (the check-in timestamp).
• We start off by converting the timestamp values to dates. To do this, we use the dt.date method on the Pandas timestamps.
• We then take the difference between the check-in date and the booking date. The resulting value will be a timedelta object. This represents the difference between the two timestamps. We convert this timedelta value to days. This can be achieved by using the dt.days method in Pandas. The rest of the solution is pretty simple now.
• We take the average of the timestamps, grouped by the host_id.
• Then we sort the averages in descending order and remove any null values.
• Note: There might be null values because the guest might have initiated a conversation but might not have proceeded to book a stay.

Solution

```python
# Import your libraries
import pandas as pd

# Extract only the dates from the timestamps, calculate the difference
# and convert the difference to days
airbnb_contacts['date_diff'] = (airbnb_contacts['ds_checkin'].dt.date -
                                airbnb_contacts['ts_booking_at'].dt.date).dt.days

# Calculate the average of date_diff per host, sort the result in
# descending order and drop any null values
avg_df = airbnb_contacts[['id_host', 'date_diff']].groupby(
    by=['id_host'], as_index=False).mean().sort_values(
    by=['date_diff'], ascending=False).dropna()
```
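
The timedelta-to-days conversion can be sketched on two hypothetical bookings. One way to truncate a timestamp to its date while keeping the datetime64 dtype (so the .dt.days accessor is guaranteed to work on the difference) is dt.normalize():

```python
import pandas as pd

bookings = pd.DataFrame({
    'ts_booking_at': pd.to_datetime(['2022-03-01 09:30', '2022-03-02 18:00']),
    'ds_checkin': pd.to_datetime(['2022-03-05', '2022-03-02']),
})

# Truncate both timestamps to midnight, subtract, and convert to whole days
bookings['date_diff'] = (bookings['ds_checkin'].dt.normalize() -
                         bookings['ts_booking_at'].dt.normalize()).dt.days
```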

Let us crank up the difficulty level for the next one. This one is from a SalesForce Data Science Interview.

#### 7. Consecutive Days

Find all the users who were active for 3 consecutive days or more.

You can solve this python interview question here.

https://platform.stratascratch.com/coding/2054-consecutive-days

The dataset contains the following fields

sf_events

| Column | Type |
| --- | --- |
| date | datetime64[ns] |
| account_id | object |
| user_id | object |

And it looks like this.

Approach

• The dataset is straightforward. However, the solution is a little complex as one needs to identify three or more consecutive days of login.
• We begin by extracting only the date parts from the timestamps. To do this we use the dt.date method.
• There might be multiple logins on the same day. Hence, we remove duplicate dates for the same user and sort the dataframe by user id and login dates.
• We then find the next login date and the next-to-next login date. To ensure that we do this for the same user only, we use the groupby.shift() method. This is akin to a window function in SQL.
• We finally subset only those values where the difference between the first and the second login and between the second and third login is one. To ensure that we do not double count values when a streak of more than 3 days is present, we drop any duplicates that might be present.

Solution

```python
# Import your libraries
import pandas as pd

# Extract the date part from the timestamps
sf_events['date_stamp'] = sf_events['date'].dt.date

# Drop duplicate dates for the same user and sort login dates for each user
sf_events = sf_events.drop_duplicates(
    subset=['user_id', 'date_stamp']).sort_values(
    by=['user_id', 'date_stamp'])

# Extract the next login date for the same user
sf_events['d+1'] = sf_events.groupby(by=['user_id'])['date_stamp'].shift(-1)

# Extract the next-to-next login date for the same user
sf_events['d+2'] = sf_events.groupby(by=['user_id'])['date_stamp'].shift(-2)

# Subset only those users where the date difference is 1 between the first
# and the second login and between the second and the third login
sf_events[((sf_events['d+1'] - sf_events['date_stamp']).dt.days == 1) &
          ((sf_events['d+2'] - sf_events['d+1']).dt.days == 1)
          ]['user_id'].drop_duplicates()
```
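
The shift-based streak detection can be traced on a toy login table (hypothetical users): u1 logs in on three consecutive days, u2 does not.

```python
import pandas as pd

logins = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'date_stamp': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                                  '2021-01-01', '2021-01-05']),
})

# Next and next-to-next login dates, computed within each user
logins['d+1'] = logins.groupby('user_id')['date_stamp'].shift(-1)
logins['d+2'] = logins.groupby('user_id')['date_stamp'].shift(-2)

# Keep users with two consecutive one-day gaps
streaks = logins[
    ((logins['d+1'] - logins['date_stamp']).dt.days == 1) &
    ((logins['d+2'] - logins['d+1']).dt.days == 1)
]['user_id'].drop_duplicates()
```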

Reporting KPIs and monitoring various model results is a key aspect of any Data Scientist’s day-to-day work. A lot of Python Data Science Interviews, especially at entry levels, test the ability of a candidate to calculate metrics and develop business reports. These python interview questions test a combination of a candidate’s business understanding, a sense of how the data should be organized, and coding skills. Here, we illustrate some of the metrics that are widely used by most businesses and how to create them.

## Business Metrics and KPIs: Ranking

Positions or Ranks are one of the most reported numbers in most KPIs. There can be multiple use cases for ranks – top selling products, most cost-effective channels, best performing ads, et al. Let us look at a couple of problems from Python Data Science interviews that required the candidate to use ranking concepts.

The first one is from the City of San Francisco.

#### 8. Find the top 5 highest paid and top 5 least paid employees in 2012

Find the top 5 highest paid and top 5 least paid employees in 2012. Output the employee’s name along with the corresponding total pay with benefits. Sort records based on the total payment with benefits in ascending order.

You can solve this problem here

https://platform.stratascratch.com/coding/9979-find-the-top-5-highest-paid-and-top-5-least-paid-employees-in-2012

The dataset has the following fields.

sf_public_salaries

| Column | Type |
| --- | --- |
| id | int64 |
| employeename | object |
| jobtitle | object |
| basepay | float64 |
| overtimepay | float64 |
| otherpay | float64 |
| benefits | float64 |
| totalpay | float64 |
| totalpaybenefits | float64 |
| year | int64 |
| notes | datetime64[ns] |
| agency | object |
| status | object |

And this is what the data looks like.

Approach

• This is a relatively straightforward problem.
• We start off by subsetting the data for 2012.
• We then sort it in ascending order by the column totalpaybenefits.
• Finally, we take the first five and the last five rows and output the relevant fields.

Solution

```python
# Import your libraries
import pandas as pd

# Take the 2012 data and sort by total pay with benefits
yr_2012 = sf_public_salaries[sf_public_salaries['year'] == 2012].sort_values(
    by=['totalpaybenefits'])

# Take the first 5 and the last 5 rows
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
pd.concat([yr_2012.head(5), yr_2012.tail(5)], ignore_index=True)[
    ['employeename', 'totalpaybenefits']]
```

Let us try a more difficult one. This is from an Airbnb Data Science Interview.

#### 9. Ranking Hosts By Beds

Rank the AirBnB hosts based on the number of beds they have listed. The host with the most beds should be ranked highest and one with the least number of beds should be ranked last. Hosts with the same number of beds should have the same rank. A host may own multiple properties. Report the host ID, number of beds, and rank from highest rank to lowest.

You can solve the problem here:

The dataset has the following fields.

airbnb_apartments

| Column | Type |
| --- | --- |
| host_id | int64 |
| apartment_id | object |
| apartment_type | object |
| n_beds | int64 |
| n_bedrooms | int64 |
| country | object |
| city | object |

Here is a sample of the data present in the dataset.

Approach

• As mentioned in the problem, hosts may have multiple properties listed on the platform. Therefore, before we start ranking, we need to sum up the total number of beds that a host has.
• We then rank these hosts based on the total number of beds listed.
• Finally, we sort the hosts by their ranks and output the results.

Solution

```python
# Import your libraries
import pandas as pd

# Roll up the number of beds for each host
hosts_summ = airbnb_apartments.groupby(by=['host_id'],
                                       as_index=False)['n_beds'].sum()

# Rank the hosts by the number of beds, with the most beds ranked highest
hosts_summ['rank'] = hosts_summ['n_beds'].rank(method='dense',
                                               ascending=False)

# Sort the output by rank in ascending order
hosts_summ.sort_values(by=['rank'])
```
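
The effect of method='dense' is easiest to see on a toy series of bed counts: ties share a rank and no rank numbers are skipped.

```python
import pandas as pd

beds = pd.Series([4, 4, 2, 7])

# Highest count gets rank 1; the two ties both get rank 2; no gap before rank 3
rank_dense = beds.rank(method='dense', ascending=False)
```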

## Business Metrics and KPIs: Aggregations

Aggregations are bread and butter for most Data Scientists. It is not possible to inspect each data point, therefore aggregations are used to examine trends, outliers, and possible relationships. For Data Analyst positions, these are even more important. You can look at the similarities and differences in the job responsibilities for various Machine Learning and Data Science Job Titles or Positions in this article. Let us look at some examples that use aggregation functions.

The first one is from Whole Foods Market Data Science Interview

#### 10. Products Report Summary

Find the number of transactions and total sales for each of the product categories in 2017. Output the product categories, number of transactions, and total sales in descending order. The sales column represents the total cost the customer paid for the product so no additional calculations need to be done on the column. Only include product categories that have products sold.

You can solve the full problem here.

The problem uses two datasets. The dataset wfm_transactions has the following fields

| Column | Type |
| --- | --- |
| customer_id | int64 |
| store_id | int64 |
| transaction_date | datetime64[ns] |
| transaction_id | int64 |
| product_id | int64 |
| sales | int64 |

This is a snapshot of the data that it contains.

The second dataset is named wfm_products and contains the following fields.

| Column | Type |
| --- | --- |
| product_id | int64 |
| product_description | object |
| product_brand | object |
| product_category | object |

This is what the data in wfm_products looks like.

Approach

• We start off by merging the two datasets using the product_id key. Since we need only the products that have at least one transaction, we perform a left join.
• Next, we need to subset only the transactions that were made in the year 2017. We invoke the dt.year method on the transaction_date timestamp column for this.
• Finally, we aggregate the sales and the number of transactions by the product_category field, and report the results after sorting them in the descending order of the sales.

Solution

```python
# Import your libraries
import pandas as pd

# Merge the datasets
merged_df = pd.merge(wfm_transactions, wfm_products, on='product_id',
                     how='left')

# Find the year of the transaction
merged_df['year'] = merged_df['transaction_date'].dt.year

# Keep only 2017 transactions
merged_df = merged_df[merged_df['year'] == 2017]

# Summarize, sort and submit
merged_df.groupby(by=['product_category'], as_index=False).agg(
    {'sales': 'sum', 'transaction_id': 'count'}).sort_values(
    by=['sales'], ascending=False)
```
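
The merge-then-aggregate pipeline on a hypothetical pair of mini tables:

```python
import pandas as pd

transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'product_id': [10, 10, 20],
    'sales': [5, 7, 3],
})
products = pd.DataFrame({
    'product_id': [10, 20],
    'product_category': ['dairy', 'bakery'],
})

# Attach the category, then sum sales and count transactions per category
merged = transactions.merge(products, on='product_id', how='left')
report = (merged.groupby('product_category', as_index=False)
          .agg({'sales': 'sum', 'transaction_id': 'count'})
          .sort_values('sales', ascending=False))
```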

The second one is from a Noom Data Science Interview.

#### 11. Signups By Billing Cycle

Write a query that returns a table containing the number of signups for each weekday and for each billing cycle frequency. Report the signups with the weekday number (e.g., 1, 2, 3) as rows in your table and the billing cycle frequency (e.g., annual, monthly, quarterly) as columns.

You can solve the full problem here: https://platform.stratascratch.com/coding/2032-signups-by-billing-cycle

The python interview question uses two datasets. The signups dataset contains the following fields

| Column | Type |
| --- | --- |
| signup_id | int64 |
| signup_start_date | datetime64[ns] |
| signup_stop_date | datetime64[ns] |
| plan_id | int64 |
| location | object |

The data in signups looks like this.

The plans table contains these fields

| Column | Type |
| --- | --- |
| id | int64 |
| billing_cycle | object |
| avg_revenue | float64 |
| currency | object |

And the data looks like this.

Approach

• Since we need the billing cycle from the second table, let us merge the two tables. The join keys will be the signup_id field in the signups table and the id field in the plans table.
• Then we need to create a field for identifying the day of the week. We can use the weekday method for a datetime object for this.
• Finally, we can generate the report in one line using the pivot_table method. We use the weekday for rows, the billing cycle for columns, and count as the aggregation function.

Solution

```python
# Import your libraries
import pandas as pd

# Merge the datasets
merged_df = pd.merge(signups, plans, left_on='plan_id', right_on='id',
                     how='left')

# Get the weekday
merged_df['weekday'] = merged_df['signup_start_date'].dt.weekday

# Create the report
pd.pivot_table(data=merged_df, values='avg_revenue',
               columns='billing_cycle', aggfunc='count',
               index='weekday').reset_index()
```
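
How pivot_table reshapes counts into a weekday-by-billing-cycle grid, on hypothetical signups:

```python
import pandas as pd

signup_df = pd.DataFrame({
    'weekday': [0, 0, 1, 1, 1],
    'billing_cycle': ['annual', 'monthly', 'annual', 'annual', 'monthly'],
    'avg_revenue': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Rows = weekday, columns = billing cycle, cells = number of signups
pivot = pd.pivot_table(signup_df, values='avg_revenue', index='weekday',
                       columns='billing_cycle', aggfunc='count').reset_index()
```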

## Business Metrics and KPIs: Profit and Profitability

Another group of business metrics that are very widely used and requested are those involving profits and profitability. These metrics usually require you to drill down the overall numbers into subcategories like location, product, segment, et al. Familiarity with aggregation methods is critical to solving these problems in Python. These problems might also require you to calculate the metric in a very specific manner, so read the python interview questions very carefully when solving them. Here we look at a problem that was asked in a Noom Data Science Interview.

#### 12. Find The Most Profitable Location

Calculate the average signup duration and average revenue for each location. Compare these two measures by taking the ratio of the average revenue and average duration for each location. Report the location, average duration, average revenue, and ratio. Sort your results from highest ratio to lowest.

You can solve the problem here:

https://platform.stratascratch.com/coding/2033-find-the-most-profitable-location

The problem involves two datasets. signups and transactions.

signups

| Column | Type |
| --- | --- |
| signup_id | int64 |
| signup_start_date | datetime64[ns] |
| signup_stop_date | datetime64[ns] |
| plan_id | int64 |
| location | object |

The data contained in it looks like this.

The transactions dataset contains the following fields

| Column | Type |
| --- | --- |
| transaction_id | int64 |
| signup_id | int64 |
| transaction_start_date | datetime64[ns] |
| transaction_refunded_date | datetime64[ns] |
| amt | float64 |

The data in this dataset looks like this.

Approach

• This problem looks straightforward but can lead to wrong results if one is not careful.
• The problem lies in the way the data is arranged. There might be multiple transactions with the same signup id. If we merge the datasets together and then calculate the duration, it will lead to duplicated values of duration and therefore the wrong mean.
• To overcome this problem, we calculate the two averages separately and then merge the dataset.
• We first calculate the duration from the signups dataset and summarize it by taking the average value for a location.
• To get the average transaction value for each location, we pull the location from the signups dataset into the transactions dataset. Then we calculate the average transaction amounts for each location.
• Then, we merge the two summary datasets together and calculate the ratio as defined in the problem.
• Finally, we report the numbers sorted in descending order of the calculated ratio.

Solution

```python
# Import your libraries
import pandas as pd

# Calculate the duration of each signup in days
signups['duration'] = (signups['signup_stop_date'] -
                       signups['signup_start_date']).dt.days

# Get the average duration per location
avg_dur_df = signups.groupby(by=['location'], as_index=False)['duration'].mean()

# Pull the location into the transactions dataset
merged_df = pd.merge(left=transactions,
                     right=signups[['signup_id', 'location']],
                     on='signup_id', how='left')

# Summarize transaction amounts by location
trans_df = merged_df.groupby(by=['location'], as_index=False)['amt'].mean()

# Merge the dataframes keeping only the relevant columns
final_merged_df = pd.merge(avg_dur_df, trans_df, on='location', how='inner')

# Calculate the ratio, sort and submit
final_merged_df['ratio'] = (final_merged_df['amt'] /
                            final_merged_df['duration'])
final_merged_df.sort_values(by=['ratio'], ascending=False)
```
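
The duplication pitfall called out above is worth seeing concretely. In this hypothetical example, one signup has two transactions; averaging duration after the merge counts that signup's duration twice and inflates the mean:

```python
import pandas as pd

signups_toy = pd.DataFrame({'signup_id': [1, 2], 'duration': [10, 4]})
transactions_toy = pd.DataFrame({'signup_id': [1, 1, 2],
                                 'amt': [100.0, 100.0, 100.0]})

# Wrong: merging first duplicates signup 1's duration
merged = transactions_toy.merge(signups_toy, on='signup_id')
naive_mean = merged['duration'].mean()

# Right: average the duration on the signups table itself
correct_mean = signups_toy['duration'].mean()
```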

## Business Metrics and KPIs: Churn and Retention

Account Retention and Churn (number of accounts leaving the platform) is a critical metric for all businesses. It is easier to transact with existing accounts versus trying to add more accounts to the platform. All businesses track churn and retention numbers closely and a lot of Data Science Interviews will involve these metrics in one form or the other. Here, we look at a couple of problems that involve churn, retention rates and associated measures.

The first one is from Lyft Data Science Interview

#### 13. Year Over Year Churn

Determine whether the churn rate of drivers increased or decreased in each year compared to the previous one. Report the year the drivers left Lyft along with the corresponding number of churns in that year, the number of churns in the previous year, and an indication of whether the number increased (output the value 'increase') or decreased (output the value 'decrease'). Sort the records by the year in ascending order.

You can solve the problem here:

https://platform.stratascratch.com/coding/10017-year-over-year-churn

The problem uses lyft_drivers dataset that contains the following fields

| Column | Type |
| --- | --- |
| index | int64 |
| start_date | datetime64[ns] |
| end_date | datetime64[ns] |
| yearly_salary | int64 |

The data looks like this.

Approach

• We start off by finding the year from the end_date field. This is the date when the driver exited Lyft. To do this, we invoke the dt.year method on the timestamp object.
• Then, we summarize the number of drivers based on this “year” field. We also drop any rows where the year is missing, since those drivers are still on the platform. This leaves us with summary data for the number of exits in each year.
• To get the previous year’s values, we use the shift method and get a field with values shifted one row. This gives us the previous year’s churn value.
• Finally, we compare the two churn numbers and output the required fields.

Solution

```python
# Import your libraries
import pandas as pd

# Get the year from the exit date
lyft_drivers['year'] = lyft_drivers['end_date'].dt.year

# Get the number of drivers churned in each year
summ_df = lyft_drivers.groupby(by=['year'], as_index=False).agg(
    {'index': 'count'}).sort_values(by=['year']).rename(
    columns={'index': 'churn'}).dropna()

# Fetch the previous year's churn numbers
summ_df['prev_churn'] = summ_df['churn'].shift(1).fillna(0)

# Compare the two churn numbers and output the change
summ_df['change'] = (summ_df['churn'] > summ_df['prev_churn']).apply(
    lambda x: 'increase' if x else 'decrease')
summ_df
```
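
The shift(1) year-over-year comparison on a toy churn summary (hypothetical counts):

```python
import pandas as pd

churn_by_year = pd.DataFrame({'year': [2018, 2019, 2020],
                              'churn': [5, 9, 7]})

# The previous year's churn sits one row above after sorting by year
churn_by_year['prev_churn'] = churn_by_year['churn'].shift(1).fillna(0)
churn_by_year['change'] = (churn_by_year['churn'] >
                           churn_by_year['prev_churn']).map(
    {True: 'increase', False: 'decrease'})
```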

Let us try a more difficult one. This one is from a Salesforce Data Science Interview.

#### 14. Retention Rate

Compare the retention rates in Jan 2021 over that of Dec 2020. Retention rate is defined as the percentage of users a business retains over a given period of time. Assume that a user is retained if they stay with the app in any future month. Your output should have the account_id and the Jan 2021 retention rate divided by the Dec 2020 retention rate.

You can solve the full problem here:

https://platform.stratascratch.com/coding/2053-retention-rate

The problem uses the sf_events dataset that has the following fields.

| Column | Type |
| --- | --- |
| date | datetime64[ns] |
| account_id | object |
| user_id | object |

This is what the data looks like.

Approach

• We can split the problem into two parts: finding the retention rate for each month, and comparing the two.
• Finding the retention rate for each month:
  • Let us begin with December 2020. To get the base numbers, we take all users who logged into the app in December 2020, removing duplicates so that each user is counted only once.
  • Then we get all users who logged into the app after December 2020.
  • We merge these two datasets, keeping only the users who logged in during December 2020.
  • We can now summarize by account_id and calculate the retention rate for December 2020.
  • We repeat the process for January 2021 to get those numbers.
• Comparing the retention rates for the two months:
  • Once we have the retention rate datasets for December 2020 and January 2021, we simply merge the two and calculate the retention ratio. We again use a left join (with the December dataset as the left table) since we need to compare December values with January ones.

Solution

```python
# Import your libraries
import pandas as pd

# Get accounts active in December 2020
dec2020 = sf_events[(sf_events['date'].dt.year == 2020) &
                    (sf_events['date'].dt.month == 12)].drop_duplicates(
    subset=['account_id', 'user_id'])
dec2020['in_dec'] = 1

# Get accounts active after December 2020
aft_dec2020 = sf_events[sf_events['date'] > pd.to_datetime(
    '2020-12-31')].drop_duplicates(subset=['account_id', 'user_id'])
aft_dec2020['aft_dec'] = 1

# Merge the datasets to get December retention numbers
dec_merged = pd.merge(dec2020, aft_dec2020,
                      on=['account_id', 'user_id'], how='left')
dec_summ = dec_merged.groupby(by=['account_id'], as_index=False)[
    ['in_dec', 'aft_dec']].sum()
dec_summ['dec_retention'] = dec_summ['aft_dec'] / dec_summ['in_dec']

# Repeat the process for Jan 2021
jan2021 = sf_events[(sf_events['date'].dt.year == 2021) &
                    (sf_events['date'].dt.month == 1)].drop_duplicates(
    subset=['account_id', 'user_id'])
jan2021['in_jan'] = 1
aft_jan2021 = sf_events[sf_events['date'] > pd.to_datetime(
    '2021-01-31')].drop_duplicates(subset=['account_id', 'user_id'])
aft_jan2021['aft_jan'] = 1

jan_merged = pd.merge(jan2021, aft_jan2021,
                      on=['account_id', 'user_id'], how='left')
jan_summ = jan_merged.groupby(by=['account_id'], as_index=False)[
    ['in_jan', 'aft_jan']].sum()
jan_summ['jan_retention'] = jan_summ['aft_jan'] / jan_summ['in_jan']

# Merge the Dec 2020 and Jan 2021 summaries, calculate the relative
# retention rate and output
final_merge = pd.merge(dec_summ, jan_summ, on='account_id', how='left')
final_merge['retention'] = (final_merge['jan_retention'] /
                            final_merge['dec_retention'])
final_merge[['account_id', 'retention']]
```

## Business Metrics and KPIs: Loss Prevention

Loss prevention and fraud detection are critical aspects of every business. Businesses must be at the top of their game to understand loss-making portfolios and manage fraud. Here we look at some loss prevention problems that have appeared in Data Science interviews.

The first one is a relatively simple one from a Netflix Data Science Interview.

#### 15. Top Percentile Fraud

ABC Corp is a mid-sized insurer in the US. They have seen a significant increase in fraudulent claims for their personal auto insurance portfolio. To mitigate these risks, they have developed an ML based predictive model to identify potential fraudulent claims. Each account is assigned a fraud score. Your objective is to identify the top 5 percentile of claims from each state. Your output should be policy number, state, claim cost, and fraud score.

You can solve the full problem here:

https://platform.stratascratch.com/coding/10303-top-percentile-fraud

The problem uses the fraud_score dataset with the following fields.

• policy_num: object
• state: object
• claim_cost: int64
• fraud_score: float64

The data looks like this

Approach

• This is a relatively straightforward problem that can be solved using the rank method. We use the rank method with the pct = True option to get percentile values.
• To rank (and calculate percentiles) for each state, we use the groupby method.
• Finally, we subset the top 5% of claims (those above the 95th percentile) and report the values.
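The groupby plus rank(pct = True) idiom is easiest to see on a small frame. The values below are hypothetical toy scores, not the actual fraud_score data:

```python
import pandas as pd

# Toy claims data (hypothetical values, for illustration only)
df = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'CA', 'NY', 'NY'],
    'fraud_score': [0.1, 0.5, 0.9, 0.7, 0.3, 0.8],
})

# Percentile rank of each score within its own state:
# rank within the group divided by the group size
df['pctile'] = df.groupby('state')['fraud_score'].rank(pct=True)
```

For the four CA rows the percentiles come out as 0.25, 0.5, 1.0 and 0.75, so a cutoff of 0.95 keeps only the single highest score per state in this toy case.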

Solution

``````# Import your libraries
import pandas as pd

# Calculate the percentile rank of each claim within its state
fraud_score['pctile'] = fraud_score.groupby(by=['state'])['fraud_score'] \
    .rank(pct=True)

# Keep only the top 5%
fraud_score[fraud_score['pctile'] > 0.95].drop(columns='pctile')``````

Let us try a more difficult problem. This one is from an Uber Data Science Interview.

#### 16. Percentage Of Revenue Loss

Calculate the percentage of incomplete orders along with the revenue loss percentage for each service category. Output the name of the service, percentage of incomplete orders, and revenue loss from the incomplete orders.

You can solve this problem here. https://platform.stratascratch.com/coding/2048-percentage-of-revenue-loss

The problem uses the uber_orders dataset that contains the following fields.

• order_date: datetime64[ns]
• number_of_orders: int64
• status_of_order: object
• monetary_value: float64
• service_name: object

The data in the set looks like this.

Approach

• This python interview question requires us to summarize the dataset twice: once over all orders, and once over incomplete orders only. We create two summary datasets aggregated by the service_name field.
• We then merge the two datasets and calculate the percentage loss numbers for transactions and monetary value and output the relevant fields.
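The double-aggregation-then-merge pattern can be sketched on toy data (hypothetical services and amounts, not the real uber_orders table):

```python
import pandas as pd

# Toy orders (hypothetical numbers, for illustration only)
orders = pd.DataFrame({
    'service_name': ['Eats', 'Eats', 'Rides'],
    'status_of_order': ['Completed', 'Cancelled', 'Completed'],
    'monetary_value': [100.0, 50.0, 80.0],
})

# Aggregate once over all orders, once over incomplete orders only
total = orders.groupby('service_name', as_index=False)['monetary_value'].sum()
lost = orders[orders['status_of_order'] != 'Completed'] \
    .groupby('service_name', as_index=False)['monetary_value'].sum()

# Merge the two summaries; suffixes keep the overlapping column names apart
summ = total.merge(lost, on='service_name', how='left',
                   suffixes=('_total', '_loss'))
summ['loss_value_pct'] = summ['monetary_value_loss'] \
    / summ['monetary_value_total'] * 100
```

Note that a service with no incomplete orders (Rides here) ends up with NaN in the loss column after the left join, which is usually what you want to surface rather than silently dropping the row.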

Solution

``````# Import your libraries
import pandas as pd

# Get the overall numbers
overall_df = uber_orders.groupby(by=['service_name'], as_index=False).sum()

# Summarize incomplete orders
loss_df = uber_orders[uber_orders['status_of_order'] != 'Completed'] \
    .groupby(by=['service_name'], as_index=False).sum()

# Merge the datasets
summ_df = pd.merge(overall_df, loss_df, on='service_name', how='left',
                   suffixes=['_total', '_loss'])

# Calculate loss metrics and output the relevant columns
summ_df['loss_order_pct'] = (summ_df['number_of_orders_loss']
                             / summ_df['number_of_orders_total'] * 100)
summ_df['loss_value_pct'] = (summ_df['monetary_value_loss']
                             / summ_df['monetary_value_total'] * 100)
summ_df[['service_name', 'loss_order_pct', 'loss_value_pct']]``````

## Business Metrics and KPIs: Forecasting

Forecasting is a critical part of any business to ensure that one does not suffer from stock outs or over supply of inventory. This is a slightly complex business problem that appeared in Uber Data Science Interviews.

#### 17. Naive Forecasting

Develop a naive forecast for a new metric called "distance per dollar", defined as distance_to_travel / monetary_cost, and measure its accuracy. A naive forecast simply uses the value of the previous observation: the forecast for a month is the metric aggregated across all trips in the previous month.

To check the accuracy of the forecast, we use the Root Mean Squared Error (RMSE) metric, defined as sqrt(mean(square(actual - forecast))).
Output the RMSE rounded to two decimal places as the result.

You can solve the problem here. https://platform.stratascratch.com/coding/10313-naive-forecasting

The problem uses the following dataset

uber_request_logs

• request_id: int64
• request_date: datetime64[ns]
• request_status: object
• distance_to_travel: float64
• monetary_cost: float64
• driver_to_client_distance: float64

This is a sample of the data present in the uber_request_logs dataset.

Approach

• There are two major challenges in this problem.
• We first need to extract the month from the timestamp. To do this, we construct a date string using the datetime accessor methods.
• Then we summarize the dataset by this month field.
• Remember, we need the monthly values of distance_to_travel and monetary_cost.
• For this, we sum the two columns across each month and then take the ratio. Averaging the individual distance_to_travel to monetary_cost ratios would give the wrong answer.
• We then use the shift method to take the previous month’s value as the naive forecast for the next month.
• Finally, we calculate the RMSE as per the formula given in the problem, round the result to two decimal places and output it.
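The shift-then-RMSE step can be demonstrated on a toy monthly series (hypothetical values, not the real Uber data):

```python
import pandas as pd
import numpy as np

# Toy monthly distance-per-dollar values (hypothetical)
monthly = pd.Series([10.0, 12.0, 9.0])

# Naive forecast: shift the series down by one position,
# so each month is forecast by the previous month's value
forecast = monthly.shift(1)

# RMSE; .mean() skips the NaN produced for the first month
rmse = round(np.sqrt(((monthly - forecast) ** 2).mean()), 2)
```

With errors of 2 and -3 for the two forecastable months, the mean squared error is 6.5 and the RMSE comes to 2.55.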

Solution

``````# Import your libraries
import pandas as pd
import numpy as np

# Create the month field from the date
uber_request_logs['month'] = pd.to_datetime(
    uber_request_logs['request_date'].apply(
        lambda x: str(x.year) + str(x.month).zfill(2) + '01')).dt.date

# Summarize the dataset by month
uber_summ = uber_request_logs.groupby(by=['month'], as_index=False) \
    .sum().sort_values(by=['month'])

# Calculate the distance per dollar metric
uber_summ['dist_per_dollar'] = (uber_summ['distance_to_travel']
                                / uber_summ['monetary_cost'])

# Pull the naive forecast
uber_summ['forecast'] = uber_summ['dist_per_dollar'].shift(1)

# Calculate RMSE and round to two decimal places
round(np.sqrt(((uber_summ['dist_per_dollar']
                - uber_summ['forecast']) ** 2).mean()), 2)``````

## Text Manipulation

While traditionally a lot of data storage systems contain tabular data, over the last decade, there has been an explosion of unstructured data usually stored in JSON and text format. Working with these data types presents a different challenge compared to working with traditional databases. Here we look at a few text wrangling problems that have appeared in Python Data Science Interviews.

We start off with an easy one. This appeared in a Yelp Data Science Interview.

#### 18. Reviews of Categories

Find the total number of reviews for each category. Output the total number of reviews and the category, ordered by the number of reviews with the category having the highest number of reviews first and the one with the lowest last.

You can solve the problem here.

https://platform.stratascratch.com/coding/10049-reviews-of-categories

The problem uses the yelp_business dataset that has the following fields.

• business_id: object
• name: object
• neighborhood: object
• address: object
• city: object
• state: object
• postal_code: object
• latitude: float64
• longitude: float64
• stars: float64
• review_count: int64
• is_open: int64
• categories: object

The data contained in the dataset looks like this.

Approach

• The fields that are of interest to us are review_count and categories.
• As one can see, a business can belong to multiple categories. Therefore, we need to split the categories string and count the reviews under each resulting category.
• To do this, we first convert the string into a list of categories by invoking the string method split(), using a semicolon (;) as the delimiter.
• Once we get the list, we can call the explode method on the column to get each individual category as a separate row with the rest of the data duplicated.
• Finally, we can summarize by the individual categories, sort in descending order of the reviews and output the result.
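The split, explode and aggregate chain can be seen end to end on toy data (hypothetical categories and counts):

```python
import pandas as pd

# Toy businesses with semicolon-delimited categories (hypothetical)
biz = pd.DataFrame({
    'categories': ['Pizza;Bars', 'Bars'],
    'review_count': [10, 5],
})

# Split the string into a list, then explode to one row per category;
# the review_count is duplicated for each category of a business
biz['categories'] = biz['categories'].str.split(';')
out = (biz.explode('categories')
          .groupby('categories', as_index=False)['review_count'].sum()
          .sort_values('review_count', ascending=False))
```

Here 'Bars' collects 15 reviews (10 + 5) and lands on top, while 'Pizza' keeps its 10.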

Solution

``````# Import your libraries
import pandas as pd

# Convert the categories string into a list
yelp_business['categories'] = yelp_business['categories'].str.split(';')

# Explode the category list, summarize by the resultant categories,
# sort and submit
yelp_business.explode('categories').groupby(by=['categories'],
    as_index=False).agg({'review_count': 'sum'}).sort_values(
    by='review_count', ascending=False)``````

Let us try a slightly more difficult problem. This one came up in a Google Data Science Interview.

#### 19. Find the number of times each word appears in drafts

Output the number of times each word appears in drafts. Report the word and the corresponding number of occurrences.

You can solve this problem here.

https://platform.stratascratch.com/coding/9817-find-the-number-of-times-each-word-appears-in-drafts

The problem uses the following dataset

• filename: object
• contents: object

The data looks like this.

Approach

• We first start by selecting the drafts from all the documents.
• Since we need a word count, we split the contents column and call the explode method again.
• Finally, we use the value_counts() method to find the frequency of each word and report the result.
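The tokenize-and-count idiom in the last bullet amounts to three chained calls, shown here on toy strings (hypothetical contents, not the real dataset):

```python
import pandas as pd

# Toy document contents (hypothetical)
contents = pd.Series(['the cat sat', 'the dog'])

# str.split() tokenizes on whitespace, explode() gives one row per word,
# value_counts() tallies the word frequencies in descending order
word_counts = contents.str.split().explode().value_counts()
```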

Solution

``````# Import your libraries
import pandas as pd

# Get only the files that are "drafts"
# (the dataset is named google_file_store on the platform)
drafts = google_file_store[
    google_file_store['filename'].str.contains('draft')]

# Split and explode the contents field, calculate frequencies and submit
drafts['contents'].str.split().explode().value_counts()``````

Let us finish text manipulation with a problem involving string matching. This came up in a City of San Francisco Data Science Interview.

#### 20. Classify Business Type

Classify each business according to the following criteria.

• If the word ‘restaurant’ appears in the business name, classify it as a restaurant.
• If the word ‘cafe’, ‘café’, or ‘coffee’ appears in the business name, classify it as a cafe.
• If the word ‘school’ appears in the business name, classify it as a school.
• All other businesses not satisfying any of the above conditions should be classified as other.

You can solve the problem here.

The problem uses the following dataset

sf_restaurant_health_violations

The dataset looks like this.

Approach

• The only field of interest here is the business_name.
• We use multiple if conditions inside a lambda function.
• We check the business name against each of the classification rules and assign a category.
• Finally, we output the business names with their categories.
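The nested-conditional classification can be sketched with a named function on hypothetical business names (a plain function is used instead of the solution's lambda; the two are equivalent):

```python
import pandas as pd

# Hypothetical business names, for illustration only
names = pd.Series(['Sunrise Coffee', 'Mission High School',
                   'Taco Restaurant', 'ACME Corp'])

def classify(name):
    # Apply the rules in priority order on the lower-cased name
    n = name.lower()
    if 'school' in n:
        return 'school'
    if 'restaurant' in n:
        return 'restaurant'
    if 'cafe' in n or 'café' in n or 'coffee' in n:
        return 'cafe'
    return 'other'

category = names.apply(classify)
```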

Solution

``````# Import your libraries
import pandas as pd

# Classify as per the rules
sf_restaurant_health_violations['category'] = \
    sf_restaurant_health_violations['business_name'].apply(lambda x:
        'school' if x.lower().find('school') >= 0
        else 'restaurant' if x.lower().find('restaurant') >= 0
        else 'cafe' if (x.lower().find('cafe') >= 0
                        or x.lower().find('café') >= 0
                        or x.lower().find('coffee') >= 0)
        else 'other'
    )

# Output relevant fields
sf_restaurant_health_violations[['business_name', 'category']].drop_duplicates()``````

## Product Specific Python Interview Questions

Besides these types, there are some problems that are very specific to a company and its products. Here we list ten such problems that will give you a good sense of what to expect. The idea is to relate the concepts you know to the products and services a company offers, visualize the data setup and the metrics that might be used, and prepare for their Data Science Interview accordingly. You will also find many of the concepts already covered so far.

#### 21. Airbnb Python Interview Question

City With Most Amenities

Given a dataset of searches for properties on Airbnb, find the city with the most amenities across all of its hosts' properties. Assume that each search result (i.e., each row) represents a unique host. Output the name of the city as your solution.

You can try the full problem here. https://platform.stratascratch.com/coding/9633-city-with-most-amenities

The problem uses the following dataset

airbnb_search_details

• id: int64
• price: float64
• property_type: object
• room_type: object
• amenities: object
• accommodates: int64
• bathrooms: int64
• bed_type: object
• cancellation_policy: object
• cleaning_fee: bool
• city: object
• host_identity_verified: object
• host_response_rate: object
• host_since: datetime64[ns]
• neighbourhood: object
• number_of_reviews: int64
• review_scores_rating: float64
• zipcode: int64
• bedrooms: int64
• beds: int64

This is how the dataset looks.

Approach

• While there are a lot of fields in the dataset, the solution needs just two: amenities and city.
• We need to aggregate the number of amenities across all listings in a city. We start by finding the number of amenities for a particular listing. To do this, we find the length of the amenities value.
• We then aggregate the amenities for each city and finally output the city with the highest number of amenities.
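A toy sketch of the count-then-aggregate logic (hypothetical cities; amenities is a Python list here, whereas on the platform it is a set-like string, so len() counts items rather than characters; idxmax is used as an equivalent shortcut for sort-then-take-first):

```python
import pandas as pd

# Toy listings (hypothetical, for illustration only)
listings = pd.DataFrame({
    'city': ['SF', 'SF', 'NYC'],
    'amenities': [['wifi', 'tv'], ['pool'], ['wifi']],
})

# Number of amenities per listing, then total per city
listings['num_amenities'] = listings['amenities'].apply(len)
top_city = listings.groupby('city')['num_amenities'].sum().idxmax()
```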

Solution

``````# Import your libraries
import pandas as pd

# Get the number of amenities for each listing
airbnb_search_details['num_amenities'] = \
    airbnb_search_details['amenities'].apply(lambda x: len(x))

# Summarize by city and output the city with the most amenities
airbnb_search_details.groupby(by=['city'], as_index=False).agg(
    {'num_amenities': 'sum'}).sort_values(by=['num_amenities'],
    ascending=False)['city'].iloc[0]``````

Airbnb is an online marketplace for vacation stays, homestays and lodging. You can expect problems with cases presented in this setting. You can also go through a more detailed AirBnB Data Science Interview guide here.

#### 22. Amazon Python Interview Question

Revenue Over Time

Given a table with users, their purchase amount, and date purchased, determine the three-month rolling average of the total revenues generated from purchases. Output the result sorted by the month from earliest to latest month.

Note: Exclude the returns which are represented by negative purchase values. For the first two months, calculate the average only up to that month.

You can solve the problem here.

https://platform.stratascratch.com/coding/10314-revenue-over-time

The problem uses the amazon_purchases dataset that has the following fields.

• user_id: int64
• created_at: datetime64[ns]
• purchase_amt: int64

And the dataset looks like this.

Approach

• We start off by excluding the return transactions. To do this, we keep only those transactions where the purchase_amt is positive.
• Then we extract the month from the date field.
• We then aggregate the purchases for each month. This results in a dataset with total monthly purchases.
• To calculate the rolling average, we call the rolling() method on the purchases column.
• To ensure that values are calculated for the first two months, even though we do not yet have three months of data, we set min_periods to 1.
• Finally, we subset the relevant fields and output the result.
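The effect of min_periods on the rolling window is easy to see on a toy series (hypothetical monthly totals):

```python
import pandas as pd

# Toy monthly revenue figures (hypothetical)
revenue = pd.Series([100.0, 200.0, 300.0, 400.0])

# Three-month rolling mean; min_periods=1 yields partial averages
# for the first two months instead of NaN
roll = revenue.rolling(3, min_periods=1).mean()
```

The first value is just 100.0, the second averages two months (150.0), and from the third month onward a full three-month window is used.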

Solution

``````# Import your libraries
import pandas as pd

# Remove refund transactions
pos_df = amazon_purchases[amazon_purchases['purchase_amt'] > 0].copy()

# Create a month indicator
pos_df['month'] = pos_df['created_at'].apply(
    lambda x: str(x.year) + "-" + str(x.month).zfill(2))

# Aggregate the purchases by month
summ_df = pos_df.groupby(by=['month'], as_index=False).sum()[
    ['month', 'purchase_amt']].sort_values(by=['month'])

# Calculate the rolling average, ensuring that a value is calculated
# even for the first two months
summ_df['roll_avg'] = summ_df['purchase_amt'].rolling(3, min_periods=1).mean()

# Output relevant fields
summ_df[['month', 'roll_avg']]``````

Amazon is one of the largest companies in the world. It operates across multiple product and service categories across the world. It is mostly known for its eponymous website. Find more data science interview questions from Amazon here.

#### 23. DoorDash Python Interview Question

Avg Order Cost During Rush Hours

Find the average order cost during rush hours (3PM to 6PM) in San Jose. Order the results by hour.

You can solve the full problem here:

https://platform.stratascratch.com/coding/2035-avg-order-cost-during-rush-hours

The problem uses the following dataset

doordash_delivery

• customer_placed_order_datetime: datetime64[ns]
• placed_order_with_restaurant_datetime: datetime64[ns]
• driver_at_restaurant_datetime: datetime64[ns]
• delivered_to_consumer_datetime: datetime64[ns]
• driver_id: int64
• restaurant_id: int64
• consumer_id: int64
• is_new: bool
• delivery_region: object
• is_asap: bool
• order_total: float64
• discount_amount: float64
• tip_amount: float64
• refunded_amount: float64

This is how the dataset looks

Approach

• We start by extracting the hour of delivery by using the hour method on the customer_placed_order_datetime field as suggested in the problem.
• We then subset the data, keeping only the orders placed in the San Jose area between 3 PM and 6 PM.
• Finally, we compute the mean order value for each hour and output the result.

Solution

``````# Import your libraries
import pandas as pd

# Find the hour of delivery
doordash_delivery['hour'] = \
    doordash_delivery['customer_placed_order_datetime'].dt.hour

# Keep only San Jose orders placed between 3 PM and 6 PM
rush_df = doordash_delivery[(doordash_delivery['hour'].isin([15, 16, 17, 18])) &
                            (doordash_delivery['delivery_region'] == 'San Jose')]

# Aggregate and output
rush_df.groupby(by=['hour'], as_index=False).agg({'order_total': 'mean'})``````

Doordash is an online food delivery and ordering platform and is the largest such platform in the US. It also provides convenience store delivery services. You can attempt more such problems from Doordash Data Science Interviews Questions here.

#### 24. Google Python Interview Question

Activity Rank

Email activity rank is calculated on the total number of emails sent, with the user with the highest number of emails sent given the top rank. Calculate the email activity rank for each user. If two users have sent the same number of emails, then the users are ranked in the alphabetical order of their IDs. Report the user id, number of emails sent and the rank. Output should be sorted by rank.

You can try this problem here.

https://platform.stratascratch.com/coding/10351-activity-rank

The problem uses the google_gmail_emails dataset that contains the following fields.

• id: int64
• from_user: object
• to_user: object
• day: int64

This is how the data looks.

Approach

• The sender’s user id is the from_user column. Therefore, we summarize the dataset by the from_user field aggregating the number of emails sent.
• We sort the resultant dataset by descending order of number of emails sent. To break the ties, we use the alphabetical order of the user id as a secondary sort field.
• Finally, we create a rank field and output the relevant fields
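The count, two-key sort, and rank assignment can be sketched on a toy email log (hypothetical user ids):

```python
import pandas as pd

# Toy email log (hypothetical sender ids)
emails = pd.DataFrame({'from_user': ['b', 'a', 'a', 'c'],
                       'id': [1, 2, 3, 4]})

# Count mails per sender, sort by count (desc) then user id (asc)
summ = (emails.groupby('from_user', as_index=False)['id'].count()
              .sort_values(['id', 'from_user'], ascending=[False, True])
              .reset_index(drop=True))

# After the sort, the row position plus one is the rank
summ['rank'] = summ.index + 1
```

User 'a' (two mails) ranks first; 'b' and 'c' tie on one mail each, so the alphabetical secondary sort puts 'b' ahead of 'c'.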

Solution

``````# Import your libraries
import pandas as pd

# Get the number of mails sent by each user, then sort the results
summ_df = google_gmail_emails.groupby(by=['from_user'], as_index=False) \
    .count().sort_values(by=['id', 'from_user'], ascending=[False, True])

# Add a rank field and output relevant fields
summ_df['rank'] = [i + 1 for i in range(len(summ_df))]
summ_df[['from_user', 'id', 'rank']]``````

Google is synonymous with search. One of the most valuable companies in the world Google operates in multiple domains. Check out our ultimate guide to become a Data Scientist at Google.

#### 25. Microsoft Python Interview Questions

Users Exclusive Per Client

Write a query that returns the number of users who are exclusive to only one client. Output the client_id and number of exclusive users.

You can solve the problem here.

https://platform.stratascratch.com/coding/2025-users-exclusive-per-client

The problem uses the fact_events dataset which has the following fields.

• id: int64
• time_id: datetime64[ns]
• user_id: object
• customer_id: object
• client_id: object
• event_type: object
• event_id: int64

The data contained in it looks like this.

Approach

• Since we only need to find the number of clients used by a particular user, we remove all duplicate combinations of user_id and client_id.
• We then count the number of client_ids used by each user.
• Finally, we subset only those users where the number of clients used is equal to one and output the relevant fields.
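The dedupe-count-filter sequence can be traced on toy events (hypothetical ids and clients):

```python
import pandas as pd

# Toy event log (hypothetical, for illustration only)
events = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'client_id': ['desktop', 'desktop', 'desktop', 'mobile', 'mobile'],
})

# Unique (user, client) pairs, then distinct clients per user
pairs = events.drop_duplicates(subset=['user_id', 'client_id'])
per_user = pairs.groupby('user_id', as_index=False)['client_id'].count()

# Users seen on exactly one client
exclusive = per_user[per_user['client_id'] == 1]['user_id']
```

Here u2 used two clients and is filtered out, while u1 and u3 are exclusive users.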

Solution

``````# Import your libraries
import pandas as pd

# Find the number of clients used by each user
user_df = fact_events.drop_duplicates(subset=['user_id', 'client_id'])[
    ['user_id', 'client_id']].groupby(by=['user_id'], as_index=False).count()

# Subset the users who use only one client and output relevant fields
out_df = fact_events[fact_events['user_id'].isin(
    user_df[user_df['client_id'] == 1]['user_id'])][
    ['user_id', 'client_id']].drop_duplicates().groupby(
    by=['client_id'], as_index=False).count()
out_df``````

Microsoft is one of the oldest tech companies and recently crossed a market value of $1 trillion. Microsoft operates across multiple business segments: Operating Systems, Productivity Software, Tablet PCs, Cloud Infrastructure among others. Find the recent Python interview questions from Microsoft for the Data Scientist position here.

#### 26. Netflix Python Interview Question

Win-to-Nomination Ratio

For each nominee, calculate the win-to-nomination ratio. Report each nominee and their win-to-nomination ratio, sorted so that the nominee with the highest ratio is at the top and the one with the lowest ratio is at the bottom.

You can solve the problem here.

The problem uses the following dataset

oscar_nominees

• year: float64
• category: object
• nominee: object
• movie: object
• winner: bool
• id: int64

This is how the dataset looks.

Approach

• Given the nature of the dataset, we can easily get the required metrics by using the pivot_table function.
• We aggregate the number of nominations counting the number of wins separately for each nominee
• Now we can calculate the win-to-nomination ratio easily.
• Finally, we sort the summarized dataset in descending order of the ratio and output the relevant fields.
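The pivot_table step is the heart of this solution; here it is on a toy frame (hypothetical nominees and categories). Note that after the pivot, the column labels are the boolean values True and False themselves:

```python
import pandas as pd

# Toy nominations (hypothetical, for illustration only)
noms = pd.DataFrame({
    'nominee': ['A', 'A', 'B'],
    'winner': [True, False, False],
    'category': ['x', 'y', 'z'],
})

# One column per winner value, counting nominations;
# fillna(0) handles nominees with no wins at all
pt = pd.pivot_table(data=noms, index='nominee', columns='winner',
                    values='category', aggfunc='count') \
       .reset_index().fillna(0)
pt['ratio'] = pt[True] / (pt[True] + pt[False])
```

Nominee A has one win in two nominations (ratio 0.5); B has no wins, and the fillna(0) keeps the ratio at 0.0 rather than NaN.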

Solution

``````# Import your libraries
import pandas as pd

# Aggregate the number of nominations for each nominee,
# counting wins separately
summ_df = pd.pivot_table(data=oscar_nominees, index='nominee',
    columns='winner', values='category',
    aggfunc='count').reset_index().fillna(0)

# Calculate the win-to-nomination ratio
summ_df['ratio'] = summ_df[True] / (summ_df[True] + summ_df[False])

# Sort in descending order of ratio and output relevant fields
summ_df[['nominee', 'ratio']].sort_values(by=['ratio'], ascending=False)``````

Netflix is the largest media organization in the world by market capitalization and is widely known for its OTT media platform and original programming. You can solve more Netflix python interview questions on the StrataScratch platform.

#### 27. Noom Python Interview Question

Signup By Day

Create a query that returns an array of signup ids by day. Order the output by the number of signups, with the date having the most signups appearing first.

You can solve the problem here.

https://platform.stratascratch.com/coding/2030-signup-by-day

The problem uses the signups dataset that contains the following fields.

• signup_id: int64
• signup_start_date: datetime64[ns]
• signup_stop_date: datetime64[ns]
• plan_id: int64
• location: object

This is what the data looks like.

Approach

• We first take the date value from the signup_start_date field.
• Since we need to create a list of signups, we group all signup_ids by date and create a list using the apply method.
• We now get the number of signups for each day by taking the length of the created list.
• Finally, we sort the summarized dataset by the number of signups and output the relevant fields.
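The groupby plus apply(list) step is worth seeing on toy data (hypothetical dates and ids): it collapses each day's ids into a single list-valued cell.

```python
import pandas as pd

# Toy signups (hypothetical, for illustration only)
signup_log = pd.DataFrame({'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
                           'signup_id': [11, 12, 13]})

# apply(list) collects each day's signup ids into one cell
daily = signup_log.groupby('date')['signup_id'].apply(list).reset_index()

# The length of each list is that day's signup count
daily['num_signups'] = daily['signup_id'].apply(len)
```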

Solution

``````# Import your libraries
import pandas as pd

# Extract the date value
signups['date'] = pd.to_datetime(signups['signup_start_date'].dt.date)

# Group by date and create a list of signup_ids
summ_df = signups[['date', 'signup_id']].groupby(by=['date'])[
    'signup_id'].apply(list).reset_index()

# Calculate the number of signups for each day
summ_df['num_signups'] = summ_df['signup_id'].apply(lambda x: len(x))

# Order by the number of signups and output the relevant fields
summ_df.sort_values(by=['num_signups', 'date'],
    ascending=[False, True]).drop(columns=['num_signups'])``````

Noom operates a subscription-based app that tracks a person’s food intake and exercise habits. You can see more Noom Data Science Interview Problems here.

#### 28. Salesforce Python Interview Question

Total Sales In Different Currencies

You are provided the sales amounts for a multinational company in different currencies. You are also provided the currency exchange rates against US dollars (USD) on different dates. Report the total sales per quarter for the first two quarters in 2020, with the sales converted to USD.

You can try the problem here

https://platform.stratascratch.com/coding/2041-total-sales-in-different-currencies

The problem uses two datasets.

sf_exchange_rate

• source_currency: object
• target_currency: object
• exchange_rate: float64
• date: datetime64[ns]

The data is arranged in the following manner.

The second dataset is called sf_sales_amount containing the following fields.

• sales_date: datetime64[ns]
• sales_amount: int64
• currency: object

The data present in this set looks like this.

Approach

• The sales data is present in the sf_sales_amount dataset. However, it is reported in different currencies. To get the USD sales value, we need to get the exchange rate on the date of the transaction.
• We start off by merging the two datasets, matching the combination of sales_date and currency from the sf_sales_amount dataset with the combination of date and source_currency from the sf_exchange_rate dataset, keeping all the transactions in the sf_sales_amount dataset.
• We then extract the reporting quarter from the sales_date field. To do this we use the built-in quarter method for datetime objects.
• Finally we calculate the USD amount by multiplying the sales amount with the exchange rate and summarize by quarter.
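The multi-key merge and quarter extraction can be traced on toy tables (hypothetical dates, amounts and rates):

```python
import pandas as pd

# Toy sales and exchange-rate tables (hypothetical values)
sales = pd.DataFrame({
    'sales_date': pd.to_datetime(['2020-02-15', '2020-05-01']),
    'currency': ['EUR', 'EUR'],
    'sales_amount': [100, 200],
})
rates = pd.DataFrame({
    'date': pd.to_datetime(['2020-02-15', '2020-05-01']),
    'source_currency': ['EUR', 'EUR'],
    'exchange_rate': [1.1, 1.2],
})

# Match each sale with the rate quoted on its transaction date
m = pd.merge(sales, rates, how='left',
             left_on=['sales_date', 'currency'],
             right_on=['date', 'source_currency'])
m['usd_value'] = m['sales_amount'] * m['exchange_rate']
m['quarter'] = m['sales_date'].dt.quarter
```

The February sale lands in quarter 1 at 110 USD, the May sale in quarter 2 at 240 USD.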

Solution

``````# Import your libraries
import pandas as pd

# Merge the two datasets by currency and date
merged_df = pd.merge(sf_sales_amount, sf_exchange_rate, how='left',
                     left_on=['sales_date', 'currency'],
                     right_on=['date', 'source_currency'])

# Calculate the USD value of sales
merged_df['usd_value'] = merged_df['sales_amount'] * merged_df['exchange_rate']

# Create a quarter field from the date
merged_df['quarter'] = merged_df['sales_date'].dt.quarter

# Summarize the USD sales value by quarter
summ_df = merged_df.groupby(by=['quarter'], as_index=False).sum()

# Keep only the first two quarters and output relevant fields
summ_df[summ_df['quarter'].isin([1, 2])][['quarter', 'usd_value']]``````

Salesforce is a cloud-based SaaS company that provides CRM and associated enterprise solutions focused on marketing automation, customer service, application development, and more. You can solve more Salesforce Data Scientist Coding Interview problems here.

#### 29. Twitch Python Interview Question

Viewers Turned Streamers

Find out how many streaming sessions are conducted by users who started out as viewers (users whose first session on the platform was as a viewer). Output the user id and number of sessions. Sort in descending order of the number of streaming sessions conducted. If two users have the same number of sessions, order them in ascending order of their user ids.

You can solve the full problem here.

https://platform.stratascratch.com/coding/2012-viewers-turned-streamers

The problem uses the twitch_sessions dataset that contains the following fields.

• user_id: int64
• session_start: datetime64[ns]
• session_end: datetime64[ns]
• session_id: int64
• session_type: object

This is the data contained in the dataset.

Approach

• We start by taking the information about the very first session for a user.
• To find the streaming sessions, we count the number of sessions for each user where the session_type is ‘streamer’
• We merge the two datasets keeping only those users whose first session was as a ‘viewer’.
• Finally, we sort the dataset in descending order of the number of streaming sessions. To break the ties, we sort in the ascending order of the user ids.
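The sort-then-drop_duplicates trick for grabbing each user's first session can be sketched on toy data (integer timestamps stand in for real datetimes):

```python
import pandas as pd

# Toy sessions (hypothetical; integers stand in for datetimes)
sessions = pd.DataFrame({
    'user_id': [1, 1, 2],
    'session_start': [5, 3, 7],
    'session_type': ['streamer', 'viewer', 'viewer'],
})

# Sorting by user and start time, then dropping duplicate user ids,
# keeps only each user's earliest session
first = sessions.sort_values(['user_id', 'session_start']) \
                .drop_duplicates(subset=['user_id'])
```

User 1's earliest session (start 3) was as a viewer, so that user qualifies even though a later session was as a streamer.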

Solution

``````# Import your libraries
import pandas as pd

# Get the very first session details for each user
first_session = twitch_sessions.sort_values(
    by=['user_id', 'session_start']).drop_duplicates(subset=['user_id'])

# Summarize streaming information
streamer_sessions = twitch_sessions[
    twitch_sessions['session_type'] == 'streamer'].groupby(
    by=['user_id'], as_index=False).agg({'session_id': 'count'})

# Merge the two datasets, subset relevant rows and output relevant fields
merged_df = pd.merge(first_session[first_session['session_type'] == 'viewer']
    [['user_id']], streamer_sessions, on='user_id', how='left').sort_values(
    by=['session_id', 'user_id'], ascending=[False, True])
merged_df``````

Twitch is a subsidiary of Amazon and is a live-streaming service focusing primarily on broadcasts of esports competitions and video game live streams. You can find more Twitch Data Science Interview questions here.

#### 30. Uber Python Interview Question

Highest Total Miles

Given a table of Uber rides that contains the mileage and the purpose for the business expense, find out the business purposes that generate the most miles driven. Report the top 3 business purpose categories by total mileage.

You can solve the problem here

https://platform.stratascratch.com/coding/10169-highest-total-miles

The problem uses the following dataset

my_uber_drives

• start_date: datetime64[ns]
• end_date: datetime64[ns]
• category: object
• start: object
• stop: object
• miles: float64
• purpose: object

This is how the data is presented.

Approach

• To solve this problem, we aggregate the number of miles by purpose.
• We sort the summarized dataset in descending order of the number of miles and report the top three results.
• We can combine this entire set of operations into a single step with method-chaining.

Solution

``````# Import your libraries
import pandas as pd

# Sum the number of miles by purpose and report the top three records
my_uber_drives.groupby(by=['purpose'], as_index=False).agg(
    {'miles': 'sum'}).sort_values(by=['miles'], ascending=False)[:3]``````

Uber is the largest ride-sharing company in the US. Besides ride-sharing, Uber also has package delivery, freight transportation, food delivery and courier operations. You can find more Uber Data Science Interview questions here.

Also, check out our previous article where we show how to approach and solve the challenging Python interview questions.

### Conclusion

Here you have a comprehensive overview of the types of Python interview questions asked in Data Science Interviews at top companies like Amazon, Google, Microsoft, etc. We also looked at various aspects of business that you need to be familiar with to complete these coding interviews. The keys to improving your skills are patience, persistence, and practice.

Also, if you're confused between the two statistical languages, Python and R, and want to know which language is better, check out our article on Python vs R for Data Science.

Join a community of over 20,000 like-minded data science aspirants to collaborate and compete in solving over 500 coding and non-coding Data Science Interview questions. Sign up for StrataScratch today and make your dream of working at top tech companies and trailblazing start-ups a reality.