Python Cheat Sheet – Technical Concepts for the Job Interview

Datasets

We will be using the EU referendum results data. There are three files. You can make use of the starter notebook to try these problems.

The first file, named uk_demo.json, contains information about the area, the region the area belongs to, and the electorate in the area. The data is presented in the following manner.

The second file, named uk_results.json, contains the results of the voting for each area. The data therein is arranged thus.

We also have additional info about the rejected votes in each region in the third dataset - uk_rejected_ballots.json

The data is presented in the following manner.

Reading Data Into Python

Before we start solving these problems in this python cheat sheet, we need to read data into Python. One way to do this is to download the file locally and then read it. A simpler way is to use the requests module. The requests module is the most widely used Python module for HTTP. We can transmit data securely using security protocols and user authentication easily using the requests module. For the purpose of this exercise, we will simply use the requests module to read the JSON file and load it into a Python list.

import requests
uk_demo_src = r"https://raw.githubusercontent.com/viveknest/statascratch-solutions/main/UK%20Referendum%20Data/uk_demo.json"
f = requests.get(uk_demo_src)
uk_demo = f.json()
uk_demo[:5]

As you can see, the output is a list of dictionaries. We can do the same for the other two files as well.

uk_results_src = r"https://raw.githubusercontent.com/viveknest/statascratch-solutions/main/UK%20Referendum%20Data/uk_results.json"
f = requests.get(uk_results_src)
uk_results = f.json()
uk_results[:5]

uk_rejected_src = r"https://github.com/viveknest/statascratch-solutions/raw/main/UK%20Referendum%20Data/uk_rejected_ballots.json"
f = requests.get(uk_rejected_src)
uk_rejected = f.json()
uk_rejected[:5]

Join the Datasets

First, in this Python cheat sheet, we need to “merge” the three lists using the Area_Code as the key. To accomplish this merging, we created a nested dictionary keeping the merge field as the key for the top-level dictionary. Since we will do this for multiple datasets, we create a simple function.

# Convert lists into a dictionary of dictionaries
def dict_convert(data, key = 'Area_Code'):
    out_dict = {}
    for row in data:
        out_dict.update({
            row[key]: row
        })
    return out_dict

We can simply pass each of the lists to this function and get a nested dictionary.

uk_demo_dict = dict_convert(uk_demo)
uk_demo_dict

We repeat this process for the other lists as well.

uk_results_dict = dict_convert(uk_results)
uk_rejected_dict = dict_convert(uk_rejected)

We now have the key of each of these dictionaries as the merge key. We can simply create a merged dictionary by accessing the nested dictionary using Area_Code, which is the key of the outer dictionary in each of the three dictionaries.

merged_dict = {}

for key in uk_demo_dict.keys():
    merged_dict.update({
        key: {**uk_demo_dict[key], **uk_results_dict[key], **uk_rejected_dict[key]}
    })
merged_dict

And there we have the three datasets merged into a single dictionary. If we want to get a list again, we can use “unnest” the dictionary into a list.

merged_list = [v for k,v in merged_dict.items()]
merged_list

Merged datasets into a single dictionary in python

Summary Statistics

The next in this Python cheat sheet is to find the maximum, minimum, median, 25th, and 75th percentile values for

Electorate
Number of Votes Cast
Number of Valid Votes
Number of Remain Votes
Number of Leave Votes
Number of Rejected Votes

Since we have to repeat the same process multiple times, we will create a function that generates the summary statistics. We use a variable parameter to specify the variable for which we need to find the summary statistic. Let us start off by defining the function. We would be passing the merged_dict that we had generated earlier as the source data.

def summ_stats(source_data, variable):
    return None

We need to extract the data points for the relevant variable. Since we would like to perform mathematical operations on these data points, we will use a list. We also sort the list so as to calculate the median, 25th, and 75th percentiles easily.

def summ_stats(source_data, variable):
    # Get the variable data
    out_list = []
    for k,v in source_data.items():
        out_list.append(v[variable])
    out_list.sort()

    return out_list

summ_stats(merged_dict, 'Electorate')

Summary Statistics in Python Cheat Sheet

We start calculating the summary statistics now. We will output a dictionary with all these statistics. Finding minimum and maximum values is pretty easy. We simply use the built-in min() and max() functions.

def summ_stats(source_data, variable):
    # Get the variable data
    out_list = []
    for k,v in source_data.items():
        out_list.append(v[variable])
    out_list.sort()
    max_value = max(out_list)
    min_value = min(out_list)
    
    
    out_dict = {
        'Max': max_value
        , 'Min': min_value
    }
    
    return out_dict

summ_stats(merged_dict, 'Electorate')

Calculating the summary statistics in python cheat sheet

Median

Median is the middle value of a dataset. If the number of elements (n) is odd, then the median is the following element.

(\frac{n + 1}{2})^{th}

If the n is even, then the median is the average of the following elements

(\frac{n}{2})^{th}

and

(\frac{n}{2} + 1)^{th}

To implement this in Python, we first need to find the number of elements in the list. We can do this by finding the length of the list using the len() function. Then we check if the number is odd or even. Here is a simple implementation of the odd / even checker.

def odd_even(number):
    # This function will return True if the number is even, otherwise it will return false
    if number // 2 == number / 2:
        return True
    return False

odd_even(4)

odd_even(11)

We use the same logic to check if the number of elements is odd or even and then calculate the median as described above.

def summ_stats(source_data, variable):
    # Get the variable data
    out_list = []
    for k,v in source_data.items():
        out_list.append(v[variable])
    out_list.sort()
    max_value = max(out_list)
    min_value = min(out_list)
    num_items = len(out_list)
    # Median Value
    if num_items //2 == num_items / 2:
        mid_value1 = num_items //2 - 1
        mid_value2 = num_items //2 
        median_value = (out_list[mid_value1] + out_list[mid_value2]) / 2
    else:
        mid_value = (num_items + 1) // 2 - 1
        median_value = out_list[mid_value]
    
    
    out_dict = {
        'Max': max_value
        , 'Min': min_value
        , 'Median' : median_value
    }
    
    return out_dict

summ_stats(merged_dict, 'Electorate')

Note: We subtract 1 from each of the positions since Python is zero-indexed

Quartiles

Calculating the percentiles is a bit tricky. There are multiple options to choose from. We will be using the nearest rank method to calculate the percentiles. The location of the pth percentile value in an ordered dataset is given by

\lceil \frac{P}{100} x N \rceil

Where the below represents the ceiling function.

\lceil x \rceil

In simple terms, it rounds up a decimal to the least integer greater than or equal to x. To find the ceiling in Python, we can use the built-in function associated with all floating point values __ceil__()

The function can now be modified to incorporate the two quartiles.

def summ_stats(source_data, variable):
    # Get the variable data
    out_list = []
    for k,v in source_data.items():
        out_list.append(v[variable])
    out_list.sort()
    max_value = max(out_list)
    min_value = min(out_list)
    num_items = len(out_list)
    # Median Value
    if num_items //2 == num_items / 2:
        mid_value1 = num_items //2 - 1
        mid_value2 = num_items //2 
        median_value = (out_list[mid_value1] + out_list[mid_value2]) / 2
    else:
        mid_value = (num_items + 1) // 2 - 1
        median_value = out_list[mid_value]
    
    # Quartiles, using the nearest rank method
    q1_rank = (num_items * 0.25).__ceil__() - 1
    q3_rank = (num_items * 0.75).__ceil__() - 1
    
    q1 = out_list[q1_rank]
    q3 = out_list[q3_rank]
    
    out_dict = {
        'N': num_items
        , 'Max': max_value
        , 'Min': min_value
        , 'Median' : median_value
        , 'q1' : q1
        , 'q3' : q3
    }
        
    return out_dict

summ_stats(merged_dict, 'Electorate')

We can find the summary statistics for the other variables as well.

summ_stats(merged_dict, 'Votes_Cast')

summ_stats(merged_dict, 'Valid_Votes')

summ_stats(merged_dict, 'Remain')

summ_stats(merged_dict, 'Leave')

summ_stats(merged_dict, 'Rejected_Ballots')

Find the Area with the highest and lowest electorates

The next is this Python cheat sheet is to find the area with the highest and lowest electorates, we will use the object properties for sorting. Let us use a simple example to illustrate the process. Suppose we have the following list.

sample_list = ['Alpha', 'Gamma', 'Beta', 'Zeta', 'Delta', 'Epsilon', 'Nu', 'Mu']
sample_list

Find the Area with the highest and lowest electorates

We can sort the list in alphabetical order

sorted(sample_list)

Sort the list of electorates in alphabetical order

Or reverse alphabetical order

sorted(sample_list, reverse = True)

Suppose we want to sort by the properties of the list elements, let’s say we want to sort by the length of the string. We can use the key parameter and define a lambda function.

sorted(sample_list, key = lambda x: len(x))

We can also sort by multiple levels. The following will sort based on the length of the string and then by alphabetical order.

sorted(sample_list, key = lambda x: (len(x), x))

We will now implement this on the merged list we had created earlier.

sorted(merged_list, key = lambda x:x['Electorate'])

Since we need the one with the lowest electorate, we take the first element

sorted(merged_list, key = lambda x:x['Electorate'])[0]

From this we output the Area.

sorted(merged_list, key = lambda x:x['Electorate'])[0]['Area']

To get the Area with the highest electorate, we simply take the last element in the sorted list.

sorted(merged_list, key = lambda x:x['Electorate'])[-1]['Area']

Highest electorate in python cheat sheet

Aggregation

Find the Region wise totals

We start off by getting all the regions. We do this by using a Python set.

regions = set([area['Region'] for area in merged_list])
regions

Find the Region wise totals in python cheat sheet

Now we iterate over the entire list and add totals region wise.

region_dict = {}
for region in regions:
    reg_sum = sum([area['Electorate'] for area in merged_list if area['Region'] == region])
    region_dict.update({region: reg_sum})
    
region_dict

Contribution of each region to the total

To calculate the contribution of each region, we first find the total. We can do this by adding up the region_dict values.

total = sum(region_dict.values())
total

Contribution of each region to the total

We now calculate the contribution of the region’s electorate as a percentage of the total.

cont_dict = {}
for k,v in region_dict.items():
    cont_dict.update({k: v/total * 100})
cont_dict

Contribution of the region electorate as a percentage of the total

There we have it. We have performed the most common dataframe operations in Pandas without using the Pandas library. You can find the entire solved notebook here. Let us perform these operations using the Pandas library.

Pandas

The next is Pandas in this Python cheat sheet. The Pandas library provides us with numerous functions to manipulate structured data. It is the go-to library for data science operations and is also very useful for transitioning to Python if one comes from the spreadsheet or SQL world. We start off by importing pandas as pd in Python.

import pandas as pd

Load the datasets

We can use the built-in read_json function to read the input datasets directly into a pandas dataframe.

demo_df = pd.read_json(r"https://raw.githubusercontent.com/viveknest/statascratch-solutions/main/UK%20Referendum%20Data/uk_demo.json")
demo_df

We can do the same for the other datasets as well.

results_df = pd.read_json(r"https://raw.githubusercontent.com/viveknest/statascratch-solutions/main/UK%20Referendum%20Data/uk_results.json")
rejected_df = pd.read_json(r"https://github.com/viveknest/statascratch-solutions/raw/main/UK%20Referendum%20Data/uk_rejected_ballots.json")

Join the datasets

Pandas provides multiple methods to work with multiple datasets. We can use the merge method to perform an SQL-style join.

merged_df = pd.merge(left = demo_df, right = results_df, on = 'Area_Code').merge(rejected_df, on = 'Area_Code')
merged_df

Learn Joining the Datasets in Python Cheat Sheet

Summary Statistics

Pandas comes with a built-in function to generate summary statistics quickly. We can actually generate all the summary statistics in one go using the describe method.

merged_df.describe()

Since we need only limited statistics and only for a subset of the columns, we use the .loc method to retrieve the relevant fields.

merged_df.describe().loc[['max', 'min', '50%','25%', '75%']][['Electorate', 'Votes_Cast', 'Valid_Votes', 'Remain', 'Leave', 'Rejected_Ballots' ]]

Find the Area with the highest and lowest electorates

There are multiple ways of doing this. In out Python cheat sheet, let us try a couple of methods.

Sorting

As we did in the non-Pandas method above in this Python cheat sheet, we can sort the values in ascending or descending order and output the first or the last row to get the lowest and highest values.

merged_df.sort_values(by = 'Electorate').iloc[0]['Area']

Using the idxmax() and idxmin() functions.

The idxmax() function returns the dataframe index of the highest value. We can pass this index back to the main dataframe to find the row containing the highest value and subset the relevant column.

merged_df['Electorate'].idxmax()

merged_df.loc[merged_df['Electorate'].idxmax()]['Area']

The idxmin() function will give the index of the lowest value.

merged_df.loc[merged_df['Electorate'].idxmin()]['Area']

Find the Region wise totals

Again there are numerous ways in Pandas to aggregate data. Let us use the two most common ways.

Using groupby

Those from the SQL world can use the groupby method to achieve aggregation.

merged_df.groupby(by = 'Region').agg({'Electorate': sum})

Pivot Table

Those from the spreadsheet world can use the pivot table function which works exactly as the name suggests.

merged_df.pivot_table(index = ['Region'], aggfunc = {'Electorate': sum})

This gives an identical solution as the previous step.

Contribution of each region to the total

To find the contribution of each region, we simply divide the aggregate values from above with the sum of the column values.

merged_df.groupby(by = 'Region').agg({'Electorate': sum}) / merged_df['Electorate'].sum() * 100

You can find the entire solved solutions using Pandas here. Also, check out this Pandas cheat sheet that explains the functions any data scientist should know and includes interview questions from Forbes, Meta, Google, and Amazon.

Conclusion

In this Python Cheat Sheet, we solved data wrangling problems in Python with and without the Pandas library. If you want to succeed in Data Science, you need to be versatile with your skills. Depending on the resources available, one must be able to get a solution. Luckily, Python is one of the easiest languages to learn. If you have a good understanding of SQL or spreadsheets, Pandas - one of the most popular Python libraries for Data Science, is very easy to learn. Also, have a look at our Python vs R article. On the StrataScratch platform, you will find more than 500 questions from Data Science interviews from top companies like Netflix, Amazon, Microsoft, Google, et al. Sign up on StrataScratch and join a community of over 20,000 aspirants and mentors helping make your dream of getting top data science jobs a reality.

Python Cheat Sheet – Technical Concepts for the Job Interview

Categories

Datasets

Reading Data Into Python

Join the Datasets

Summary Statistics

Median

Quartiles

Find the Area with the highest and lowest electorates

Aggregation

Find the Region wise totals

Contribution of each region to the total

Pandas

Load the datasets

Join the datasets

Summary Statistics

Find the Area with the highest and lowest electorates

Sorting

Using the idxmax() and idxmin() functions.

Find the Region wise totals

Using groupby

Pivot Table

Contribution of each region to the total

Conclusion

Latest Posts:

From Arrays to Lists: Converting Numpy Arrays with Python

Tips for Crafting an Impressive Data Analyst Portfolio

PySpark GroupBy Guide: Super Simple Way to Group Data

Share

Follow

Categories