How to Import Pandas as pd in Python


pandas is an essential Python library for data scientists, and importing it as pd is the conventional first step. Once you import it, you can take your data analysis to a whole new level.

As a general purpose programming language, Python has all the features necessary to analyze and gain insights from data. For this reason, interviewers often expect prospective data scientists to be proficient in Python.

While it has its own advantages, like readability and flexibility, the standard Python language is insufficient for doing serious data analysis. Thankfully, Python supports external libraries that extend its basic functionality. pandas is one of the libraries that extends the language beyond its basic features and provides a toolkit for common data science tasks.

Pandas in Data Science

Pandas is an essential tool for doing data science in Python. It allows you to perform many tasks that would be next to impossible with the standard language: working with data in various formats (tabular or labeled), sorting and formatting it, combining data from multiple sources, finding and cleaning up messy data, and even visualizing it.

The pandas API is easy to follow, especially if you already know SQL. Functions are well documented and named descriptively, and many features have the same name as their counterparts in SQL.

If the job you’re applying for requires some knowledge of Python, you will have to use functions from the pandas library a lot. Interviewers might ask you questions to measure your ability to make use of all the functions from the pandas library.

What Can Pandas Do?

The short answer is that you need pandas to gain access to many functions for performing data analysis in Python. It is especially useful for data modeling, data analysis, and data manipulation.

Pandas provides two core data structures, Series and DataFrame, which give you the flexibility to work with data sets in different formats, like tabular and time-series data. A Series is a one-dimensional labeled array, while a DataFrame holds two-dimensional, tabular data and provides the foundation for essential operations like data alignment, calculating statistics, slicing, grouping, merging, and adding or splitting sets of data.
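As a quick sketch with made-up values, the two structures can be created like this:

```python
import pandas as pd

# A Series is a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is a two-dimensional, tabular structure
df = pd.DataFrame({"item": ["milk", "biscuit"], "revenue": [123, 421]})

print(s["b"])    # 20
print(df.shape)  # (2, 2)
```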

For example, the sort_values() and sort_index() functions are commonly used for sorting and organizing data. merge() allows you to combine data from two tables, similar to JOINs in SQL; you can even specify the shared column and the join method. The concat() function stacks Series or DataFrames along a shared axis.
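Here is a minimal sketch of these functions on two hypothetical tables (the column names are invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2], "name": ["Ana", "Ben"]})
right = pd.DataFrame({"user_id": [1, 2], "total": [300, 150]})

# merge() combines two tables on a shared column, like a SQL JOIN
merged = pd.merge(left, right, on="user_id", how="inner")

# sort_values() orders rows by one or more columns
ordered = merged.sort_values(by="total", ascending=False)

# concat() stacks Series or DataFrames on top of each other
stacked = pd.concat([left, left], ignore_index=True)

print(list(ordered["name"]))  # ['Ana', 'Ben']
print(len(stacked))           # 4
```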

The read_csv() and read_excel() functions let you load data from CSV files and Excel spreadsheets. Calling the df.head() function returns the first five rows by default; you can provide an argument to specify the number of rows you want it to return.
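A short example, using io.StringIO as a stand-in for a real CSV file path:

```python
import io
import pandas as pd

# io.StringIO stands in for a real file path such as "sales.csv"
csv_text = io.StringIO("id,item,revenue\n1,milk,123\n2,biscuit,421\n3,banana,862")
df = pd.read_csv(csv_text)

print(df.head(2))  # the first two rows instead of the default five
```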

Pandas also helps you get information about the available data. The df.shape attribute returns the number of rows and columns as a tuple. df.size returns the number of rows multiplied by the number of columns. df.info() gives information about the value types of each column.
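For instance, on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"item": ["milk", "biscuit", "banana"], "revenue": [123, 421, 862]})

print(df.shape)  # (3, 2): 3 rows, 2 columns
print(df.size)   # 6: rows multiplied by columns
df.info()        # column names, value types, and non-null counts
```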

Functions like pd.to_datetime() and dt.to_period(), along with accessors like dt.year and dt.month, are very useful for working with datetime values.
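A brief illustration on hypothetical date strings:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2020-03-03", "2020-03-18"]))

print(list(dates.dt.year))   # [2020, 2020]
print(list(dates.dt.month))  # [3, 3]
print(dates.dt.strftime("%m-%d-%Y").tolist())  # ['03-03-2020', '03-18-2020']
```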

The library also contains features for cleaning up messy data. You can set up conditions, limits, and formatting rules for detecting outlier cases. You can also define how to fix bad data. Doing this will help you remove unrealistic values, which can improve the accuracy of your data analysis.
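As one possible sketch, assuming a hypothetical age column with a missing value and an unrealistic outlier:

```python
import numpy as np
import pandas as pd

# hypothetical column with a missing value and an unrealistic outlier
df = pd.DataFrame({"age": [26, np.nan, 34, 999]})

df["age"] = df["age"].fillna(df["age"].median())  # fill missing data
df = df[df["age"].between(0, 120)]                # drop unrealistic values

print(list(df["age"]))  # [26.0, 34.0, 34.0]
```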

How to Import Pandas as pd in Python


Installing Python alone is not enough to import the pandas library; you need to install the library itself. Note that recent versions of pandas require Python 3.

Newer versions of Python (3.4 and higher) include pip, a tool for installing packages like pandas. If you're running an older version of Python, you can install pip manually or download the latest version of Python from https://www.python.org.

Once you have pip set up, open the command prompt and type the following command:

pip install pandas

Once the installation is complete, the command prompt will display a success message, such as: Successfully installed pandas-0.24.2

You can also install pandas without using the command prompt: download and install the Anaconda distribution, which installs pandas automatically.

Note: During the installation process, it’s essential to tick the checkbox for ‘Add Anaconda to my PATH environment variable’ option.

Once pandas is installed, you can import pandas as pd along with NumPy, the numerical library that pandas is built on:

import pandas as pd
import numpy as np

Writing ‘import pandas’ alone is enough to make the module available. However, it is good practice to use the as keyword to give it the shorthand name ‘pd’.

In this example, we also import numpy, another data science library, which is often used with pandas.

Example Problems Where You Need to Import Pandas as pd

Pandas library provides essential features for doing data analysis in Python. For this reason, solutions to Python interview questions often rely on features from the Pandas library.

Question 1 - Finding User Purchases

Solving this question comes down to working with purchase dates to find instances when a single user ordered two times within a 7-day period. Functions from the pandas library will be essential for finding an answer.


DataFrame: amazon_transactions
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10322-finding-user-purchases

Available Data

The first step towards finding an answer is to look at the available data. In this case, we have a single amazon_transactions table with five columns.

Let’s preview actual data in the table:

Table: amazon_transactions
id  user_id  item     created_at  revenue
1   109      milk     2020-03-03  123
2   139      biscuit  2020-03-18  421
3   120      milk     2020-03-18  176
4   108      banana   2020-03-18  862
5   130      milk     2020-03-28  333

Step-by-Step Code Solution

Step 1: Import Pandas as pd in Python

The pandas and NumPy libraries provide an essential toolkit for performing complex data science tasks.

Almost all functions we use to answer this question come from the pandas library.

import pandas as pd
import numpy as np
from datetime import datetime

The datetime library is useful for working with date values.

Step 2: Convert date values

First, we need to convert values in the created_at column to datetime type, and format them.

For that, we use the to_datetime() and dt.strftime() functions from pandas library.

pd.to_datetime() takes one argument, the column (or Series) to be converted to datetime objects. dt.strftime() is then used to specify the date format.

amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')

Step 3: Arrange dates in an ascending order

In the next step, we need to put date values (from the created_at column) in an ascending order. When it comes to dates, ascending order means from earlier to later.

We use another pandas function, sort_values(), to arrange rows in ascending order based on the values in the user_id and created_at columns.

df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])

Step 4: Find the date of previous purchase

In this step, we create a new column to store the closest previous date when the same user placed an order.

We generate values for the new column by moving the values in the created_at column one row down within each user's group. This stores the previous time the user placed an order. To accomplish this, we use the shift() function from the pandas library, which shifts the values in a column by a given number of positions.

We give the new column a descriptive name prev_value. For every row, the value in the prev_value column will be equal to created_at value from the previous row. For the second row, the prev_value column will have a value of created_at column from the first row.

df['prev_value'] = df.groupby('user_id')['created_at'].shift()

Step 5: Find repeat orders within 7 days

Finally, we find the difference between consecutive orders placed by the same user by subtracting the date values in the created_at and prev_value columns. For that, we once again use the pd.to_datetime() function, along with the dt.days accessor, to express the difference between two dates as a number of days.


The dt.days accessor takes a timedelta value and returns its number of whole days.

df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()

We find every instance where the same user placed orders within 7 days and store the result in the result variable. We use the unique() function so that each user appears only once, no matter how many repeat orders they placed; applied to a Series, it keeps only the distinct values, in order of appearance.
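As a quick illustration of unique() on a made-up Series:

```python
import pandas as pd

# user 109 qualified twice; unique() keeps a single entry per user
repeat_users = pd.Series([109, 109, 130])
print(repeat_users.unique())  # [109 130]
```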

Here is the final solution:

import pandas as pd
import numpy as np
from datetime import datetime

amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
result = df[df['days'] <= 7]['user_id'].unique()

Output

Running this Python code will return the unique id values of users who placed two orders within a 7-day period.

All required columns and the first 5 rows of the solution are shown

0
100
103
105
109
110

As you can see, in the process of finding an answer, we make extensive use of functions from the pandas library. The Python code looks a little different from an SQL query, but both follow the same logical pattern.

Question 2 - Second Highest Salary

In order to find the second highest salary, we will need to rank the values in the salary column. The pandas library provides all the necessary functions to solve this question in Python.


DataFrame: employee

Link to the question: https://platform.stratascratch.com/coding/9892-second-highest-salary

Available Data

It’s a good practice to study the available data before answering the question. For this question, we are dealing with one employee table with fourteen columns.

The salary column contains integer (numeric) values to describe the salary of each employee.

Now, let’s look at actual data in the table:

Table: employee
id  first_name  last_name  age  sex  employee_title  department  salary  target  bonus  email                 city        address               manager_id
5   Max         George     26   M    Sales           Sales       1300    200     150    Max@company.com       California  2638 Richards Avenue  1
13  Katty       Bond       56   F    Manager         Management  150000  0       300    Katty@company.com     Arizona                           1
11  Richerd     Gear       57   M    Manager         Management  250000  0       300    Richerd@company.com   Alabama                           1
10  Jennifer    Dion       34   F    Sales           Sales       1000    200     150    Jennifer@company.com  Alabama                           13
19  George      Joe        50   M    Manager         Management  100000  0       300    George@company.com    Florida     1003 Wyatt Street     1

Step-by-Step Code Solution

Step 1: Import Pandas as pd in Python

We need to use functions from the pandas library to work with tabular data. In the following steps, it will become clear why it’s so important to import pandas and numpy.

import pandas as pd
import numpy as np

Step 2: Get a unique list of salaries

To start off, we use the drop_duplicates() function from the pandas library, which removes rows with duplicate values in a specific column - in this case, the salary column. This leaves us with a list of unique salary values.

distinct_salary = employee.drop_duplicates(subset = 'salary')


Step 3: Rank salaries

The question asks us to find the second highest salary. Therefore we will have to use the rank() function, which is a Pandas alternative to ranking window functions in SQL. It assigns a numeric value to each row, corresponding to its rank.


We create a ‘rnk’ column to rank values in the distinct_salary data set. When calling the rank() function, we provide two arguments, one to specify the type of ranking it should perform (dense), and another for order of values.

distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)


In SQL, we have several window ranking functions. The pandas library provides a single rank() function, but we can specify the type of ranking via the method argument (method='dense').

Dense ranking deals with ties as follows: if two salary values are the same, both receive the same rank, and the next distinct value receives the next consecutive rank, with no gaps.
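The difference is easy to see on a small made-up salary Series:

```python
import pandas as pd

salaries = pd.Series([250000, 200000, 200000, 150000])

# dense ranking: tied values share a rank, and no rank is skipped
print(list(salaries.rank(method="dense", ascending=False)))  # [1.0, 2.0, 2.0, 3.0]

# the default 'average' method leaves a gap after the tie
print(list(salaries.rank(ascending=False)))  # [1.0, 2.5, 2.5, 4.0]
```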


Step 4: Find the second highest salary

Once we rank all salaries, all there is left to do is return a salary that has a rank value of 2.

result = distinct_salary[distinct_salary.rnk == 2][['salary']]

The final solution will look like this:

import pandas as pd
import numpy as np

distinct_salary = employee.drop_duplicates(subset = 'salary')
distinct_salary['rnk'] = distinct_salary['salary'].rank(method='dense', ascending=False)
result = distinct_salary[distinct_salary.rnk == 2][['salary']]


Output

Running the code above will return just one column with the second highest salary:

All required columns and the first 5 rows of the solution are shown

salary
200000

Question 3 - Revenue Over Time

This is a difficult question with a lot of moving parts. We will need to import pandas as pd to use functions for working with date values. We also need the library for other essential utilities, like the groupby() function, aggregate functions, and the rolling() function, which is very useful for this particular question.


DataFrame: amazon_purchases
Expected Output Type: pandas.DataFrame

Link to the question: https://platform.stratascratch.com/coding/10314-revenue-over-time

Available Data

Looking at data can help you better understand the question. In this case, we are dealing with data contained in the amazon_purchases table with three columns.

Previewing the table can help you better understand the data:

Table: amazon_purchases
user_id  created_at  purchase_amt
10       2020-01-01  3742
11       2020-01-04  1290
12       2020-01-07  4249
13       2020-01-10  4899
14       2020-01-13  -4656

Step-by-Step Code Solution

Step 1: Import Pandas as pd in Python

As a data scientist proficient in Python, you are expected to know what tools you’ll need to answer the interview question using the fewest lines of code.

pandas is one of the essential libraries, providing functions like to_datetime(), to_period(), groupby(), and sum(), which you'll need to find the answer.

import pandas as pd
import numpy as np
from datetime import datetime


Step 2: Set the format for float values

First, we modify pd.options to specify how numbers are displayed when outputting a DataFrame.

pd.options.display.float_format = "{:,.2f}".format


In this case, we tell pandas to output float values with 2 decimal places and use a comma as the thousands separator for readability.


Step 3: Filter out negative purchase amount values

The question tells us not to account for negative purchase_amt values, which represent returns. We keep only the rows where the value of the purchase_amt column is greater than 0 and store them in the df variable.

df=amazon_purchases[amazon_purchases['purchase_amt']>0]


Step 4: Convert datetime values to months

In the next step, we take the created_at values in the available data and use the pd.to_datetime() and dt.to_period() functions to transform them into month values.

df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')


dt.to_period() is another pandas function; it converts datetime values to Period values at a given frequency, such as year ('Y'), month ('M'), or day ('D').

Step 5: Get purchase totals for each month

We use the groupby() function to aggregate purchases by the month in which they happened, and use the sum() function to calculate total purchase_amt values for each group. Both of these functions are from the pandas library.

groupby() works similarly to the GROUP BY statement in SQL. It creates several groups based on multiple rows that have the same value in a specific column, in this case the ‘month_year’ column.

The sum() function works similarly to the aggregate function of the same name in SQL. It adds up values in the ‘purchase_amt’ column of each group.

df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)

We use the set_index() method to set month_year as the index of the df1 data frame, so that each row is identified by its month. Setting the index this way lets the subsequent rolling calculation operate over the time-ordered months.

Step 6: Calculate rolling 3-month average

Finally, we use the rolling() function from the pandas library to calculate a three-month rolling revenue. This function performs rolling window calculations. Here it takes two arguments: the window size (the number of periods to include) and the minimum number of periods required to produce a value.

We create a new variable, rolling_windows, to store the rolling window object, then use the mean() function to calculate the three-month rolling averages. Finally, we use the to_records() function to convert the DataFrame to a NumPy record array.

rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result=rolling_mean.to_records()


Here is the final answer to this question:

import pandas as pd
import numpy as np
from datetime import datetime

pd.options.display.float_format = "{:,.2f}".format

df=amazon_purchases[amazon_purchases['purchase_amt']>0]
df['month_year'] = pd.to_datetime(df['created_at']).dt.to_period('M')
df1 = df.groupby('month_year')['purchase_amt'].sum().reset_index(name='monthly_revenue').sort_values('month_year')
df1.set_index('month_year', inplace=True)
rolling_windows = df1.rolling(3, min_periods=1)
rolling_mean = rolling_windows.mean()
result=rolling_mean.to_records()


Output

Running this code will output the 3-month rolling average for each month:

All required columns and the first 5 rows of the solution are shown

month_year  monthly_revenue
2020-01     26292
2020-02     23493.5
2020-03     25535.667
2020-04     24082.667
2020-05     25417.667

Question 4 - Top Percentile Fraud

In this question, we have to find the top 5 percentile claims for each state. For that, we will need to use the groupby() and rank() functions, similar to window ranking functions in SQL. The pandas library provides these and all other functions necessary to get the answer.


DataFrame: fraud_score

Link to the question: https://platform.stratascratch.com/coding/10303-top-percentile-fraud

Solution

This is another challenge that can be answered using the pandas library in Python. First, we need to use certain pandas functions to generate percentile values within each state and store them in a separate column.

Once we have percentile values in a separate column, we will need to filter them. The question tells us to find records with percentile values of 0.95 or higher. Perhaps it would be a good idea to store the filtered records in a variable.

As a last step, we will need to output four columns specified in the question description.
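Without giving the full solution away, the general groupby() and rank() pattern might look like this on made-up data:

```python
import pandas as pd

# made-up rows; the real fraud_score table has more columns
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "TX", "TX", "TX", "TX"],
    "fraud_score": [0.2, 0.5, 0.7, 0.9, 0.1, 0.4, 0.6, 0.8],
})

# rank(pct=True) turns each score into its percentile within the state
df["percentile"] = df.groupby("state")["fraud_score"].rank(pct=True)
top = df[df["percentile"] >= 0.95]

print(sorted(top["fraud_score"]))  # [0.8, 0.9]
```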

Try to answer the question yourself below:

Question 5 - Total Cost Of Orders

The premise of the question is fairly simple. We have to aggregate the total cost of orders placed by customers. We will need functions from the pandas library to do the aggregation and other tasks to get the answer.


DataFrames: customers, orders
Expected Output Type: pandas.DataFrame


Link to the question: https://platform.stratascratch.com/coding/10183-total-cost-of-orders

Solution

For this question, we have two DataFrames (tables) - customers and orders. Neither of these tables alone contains enough information to find the answer.

The pandas library comes with a very useful merge() function, which has a functionality similar to JOINs in SQL. For arguments, we provide names of each table, and the columns that contain overlapping values.

The question asks us to calculate the total cost of orders for each user. Most likely, some users placed multiple orders, so you will need to use groupby() and sum() to aggregate the total cost of orders for each user.

Finally, you will need a certain function from the pandas library to organize rows in an ascending alphabetical order of first names. And we must not forget to output the columns specified in the description.
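Without spoiling the full answer, the merge-then-aggregate pattern might be sketched like this on made-up stand-in tables (the real column names may differ):

```python
import pandas as pd

# made-up stand-ins; real column names in the question may differ
customers = pd.DataFrame({"id": [1, 2], "first_name": ["Ben", "Ana"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "total_order_cost": [50, 30, 20]})

# merge() joins the two tables on the shared key, like a SQL JOIN
merged = pd.merge(customers, orders, left_on="id", right_on="cust_id")

# aggregate the cost per customer, then sort by first name
result = (merged.groupby("first_name")["total_order_cost"]
                .sum()
                .reset_index()
                .sort_values("first_name"))

print(list(result["first_name"]))        # ['Ana', 'Ben']
print(list(result["total_order_cost"]))  # [20, 80]
```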

Try to get the answer yourself:

Check out our post “Python Pandas Interview Questions” to practice more such questions.

Conclusion

In this article, we explained how to import pandas as pd in Python and its importance for doing data science. Mastering its most essential features can bring you one step closer to landing a job and fast-track your career once you get it.

If you haven’t yet mastered pandas and Python, StrataScratch platform is a great place to start.

It has dozens of practical interview questions that require you to import and use functions from the pandas library. The platform allows you to look at available data, see approach hints if you are stuck, check the output of your code as you write it, and other features for interactive learning.

In short, StrataScratch has everything you need to hone your skills in pandas as well as Python. And as you might’ve heard, practice makes perfect, especially when it comes to technical skills like doing data science in Python.


