A Guide to Master Machine Learning Modeling from Scratch

Written by: Nathan Rosidi
Machine learning modeling is intimidating for beginners. With this article, we’ll make it less complicated and guide you through essential modeling techniques.
Ever wonder how Netflix knows precisely what you want, even if you don’t? It’s magic!
OK, it’s not, but it also starts with M: machine learning modeling. This is a process that provides predictive power to today’s decision-making. Recommendation systems. Fraud detection. Autonomous vehicles. Credit scoring systems. Even your phone’s camera adjusting focus. Yup, that’s all machine learning modeling.
Get ready for some explanations of a machine learning modeling workflow, practical code examples, vital ML modeling techniques, and tips to help you bring the model to life.
What is Machine Learning Modeling?
Machine learning modeling is the process of training algorithms to recognize patterns in historical data so they can make predictions on new, similar data.
In essence, you supply structured data consisting of features and known outcomes. The learning algorithm then optimizes internal parameters to minimize prediction error, thus creating a machine learning model that can generalize to new, unseen data.
It sounds similar to programming, but there’s one significant difference. In classical programming, you write the rules a program will follow to execute an operation. In machine learning modeling, you only provide the system with examples for it to infer rules on its own. Over time - meaning, during the training process - the model will adapt to the structure and statistical relationships it detects, making its decisions more refined.
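To make the contrast concrete, here's a minimal sketch (using scikit-learn and a tiny made-up dataset, not the project data) of a model inferring a rule from examples instead of having the rule written for it:
# A minimal sketch with made-up data: the rule is learned, not hand-written
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [property_size_sqft, monthly_rent]; label: 1 = got clicks, 0 = didn't
examples = [[400, 8000], [650, 12000], [900, 25000], [1200, 40000]]
labels = [1, 1, 0, 0]

model = DecisionTreeClassifier().fit(examples, labels)  # training adjusts internal parameters
print(model.predict([[700, 15000]]))                    # the model applies the rule it inferred
In classical programming, we would have had to hard-code the size and rent thresholds ourselves; here the algorithm derives them from the labeled examples.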
Different problems require different approaches. That’s why there are so many machine learning algorithms. However, choosing a suitable algorithm is not the start nor the end of machine learning modeling. As I already mentioned, machine learning modeling is a process, and processes typically follow a certain workflow.
The Machine Learning Modeling Workflow
Machine learning modeling typically involves six steps.

However, when we’re talking about workflow, we don’t mean there are (always) very clear-cut borders between the stages. Machine learning modeling is an iterative process. There’s typically a lot of back-and-forth between the stages. Some techniques can be used in different stages, so the lines between them can easily get blurred.
So, this division into strict stages is for educational purposes. In practice, it’s just one continuous process.
1. Data Collection and Preparation
In the workflow, it’s difficult to say that one step is more important than another; they all build on the one before. However, all models need high-quality data. If you mess up here, no amount of machine learning knowledge or clever algorithms will save your model. The old adage of ‘Garbage in, garbage out’ (GIGO) applies here.
After you collect raw data from sources such as web scraping, API endpoints, relational databases, cloud storage, or local files, you must transform it into a structured format that machines can read.
Next, you should perform an exploratory data analysis (EDA) to understand data structure, catch errors or anomalies, and assess distributions and relationships.
Then you preprocess it, meaning you handle missing values, duplicate records, data type inconsistencies, outliers, and encoding mismatches.
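As a quick, generic illustration (a sketch on a hypothetical DataFrame, not the project code), typical preprocessing steps look like this:
import pandas as pd

# Hypothetical raw data -- the column names are placeholders, not the project's
raw = pd.DataFrame({
    'rent': ['10000', '15000', None, '15000'],
    'city': ['Bangalore', 'bangalore', 'Mumbai', 'bangalore'],
})

clean = raw.drop_duplicates().copy()                          # drop duplicate records
clean['rent'] = pd.to_numeric(clean['rent'])                  # fix data type inconsistencies
clean['rent'] = clean['rent'].fillna(clean['rent'].median())  # handle missing values
clean['city'] = clean['city'].str.lower().str.strip()         # normalize inconsistent text values
print(clean.describe(include='all'))                          # quick EDA-style summary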
Example
I’ll use the property click prediction data project by No Broker from our platform as an example for demonstrating different stages of an ML modeling workflow.
The requirement is pretty straightforward: build a model that predicts the number of interactions a property would receive in a period of time. This can be any period. We will do a 3-day and 7-day interaction prediction for simplicity.
We will divide initial steps into:
1.1. Data Collection
1.2. Data Preparation
1.3. EDA and Data Processing
1.1. Data Collection
In the project, the data is provided as flat files, so we’ll use pandas to read it. There are three files in the dataset:
- `property_data_set.csv` – data about the properties
- `property_interactions.csv` – data with timestamps of interactions with properties
- `property_photos.tsv` – data containing the count of property photos
We’ll import all the required libraries and all the files. To be precise, data collection for our project will involve these steps:
1.1.1. Import Required Libraries
1.1.2. Set `pandas` Display Options
1.1.3. Load Dataset – Properties
1.1.4. Load Dataset – Property Interactions
1.1.5. Load Dataset – Photo Metadata
1.1.6. Print Dataset Shapes
1.1.7. Sampling Data With `sample()`
1.1.1. Import Required Libraries
Project Code:
##### Import required libraries
# Import the pandas library as pd
import pandas as pd
# Import the numpy library as np
import numpy as np
# Import the seaborn library as sns
import seaborn as sns
# Import the matplotlib.pyplot library as plt
import matplotlib.pyplot as plt
# Import the json library
import json
What It Does: Imports essential Python libraries for data manipulation (`pandas`, `NumPy`), visualization (`seaborn`, `Matplotlib`), and working with JSON-encoded strings.
1.1.2. Set `pandas` Display Options
Project Code:
# View options for pandas
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 10)
What It Does: Changes how many columns and rows `pandas` shows when printing DataFrames.
1.1.3. Load Dataset – Properties
Project Code:
# Read all data
# Properties data
data = pd.read_csv('property-click-prediction/resources/property_data_set.csv',
                   parse_dates=['activation_date'], infer_datetime_format=True, dayfirst=True)
What It Does:
- Loads the main property dataset.
- Parses the `activation_date` column as a datetime.
- Assumes day-first date format (e.g., `31/12/2023`)
1.1.4. Load Dataset – Property Interactions
Project Code:
# Data containing the timestamps of interaction on the properties
interaction = pd.read_csv('property-click-prediction/resources/property_interactions.csv',
                          parse_dates=['request_date'], infer_datetime_format=True, dayfirst=True)
What It Does:
- Loads interaction logs where users have clicked or shown interest in a property.
- Parses `request_date` as datetime.
1.1.5. Load Dataset – Photo Metadata
Project Code:
# Data containing photo counts of properties
pics = pd.read_table('property-click-prediction/resources/property_photos.tsv')
What It Does: Loads photo data from a tab-separated values (TSV) file.
1.1.6. Print Dataset Shapes
Project Code:
# Print shape (num. of rows, num. of columns) of all data
print('Property data Shape', data.shape)
print('Pics data Shape',pics.shape)
print('Interaction data Shape',interaction.shape)
Here’s the output.

What It Does: Prints the number of rows and columns in each dataset.
1.1.7. Sampling Data With `sample()`
Project Code: In the project, we will show two rows from the property data.
# Sample of property data
data.sample(2)
Here’s the output.

Now, the same for the pics data.
# Sample of pics data
pics.sample(2)
Here’s the output.

Finally, the sample of interaction data, too.
# Sample of interaction data
interaction.sample(2)
Here’s the output.

What It Does: Displays two random rows from the DataFrames we use in the project.
1.2. Data Preparation
We can now start preparing the data. It will involve these steps.
1.2.1. Preview the Dataset with `head()`
1.2.2. Output Column Types With `.dtypes`
1.2.3. Count `NaN`s With `isna()` And `sum()`
1.2.4. Access a String With Label-Based Indexing With Bracket Notation
1.2.5. Replace a Corrupted String Using `replace()`
1.2.6. Define a Cleaning and Counting Function
1.2.7. Apply the Function to All Rows With `apply()`
1.2.8. Remove the Original Column
1.2.9. Preview the Cleaned Data
1.2.10. Merge DataFrames With `merge()`
1.2.11. Calculate Days Between Activation and Request
1.2.12. Count With `groupby` and `agg()`
1.2.13. Rename Columns With `.rename()`
1.2.14. Categorize Data and Count Properties in Each Column
1.2.15. Define Category Mapping Function
1.2.16. Apply Category Mapping
1.2.17. Preview Categorized Data
1.2.18. Count the Number of Properties
1.2.19. Check Data Before Merging
1.2.20. Merge 3-Day and 7-Day Interaction Features
1.2.21. Replace `NaNs` Using `fillna()`
1.2.22. Check for Missing Values
1.2.23. Merge Property Data With Photo Counts
1.2.24. Create Final Dataset for Modeling
1.2.25. Final Null Check
Here we go!
1.2.1. Preview the Dataset With `head()`
Project Code: First, let’s show the first five rows of the pics data.
# Show the first five rows
pics.head()
Here’s the output.

What It Does: Displays the first five rows of the pics DataFrame, showing:
- `property_id`: unique ID for each property
- `photo_urls`: JSON-like strings listing the images and their metadata (e.g., title, filename)
1.2.2. Output Column Types With `.dtypes`
Project Code:
# Types of columns
pics.dtypes
Here’s the output.

What It Does: Displays the data type of each column in the `pics` DataFrame.
1.2.3. Count `NaN`s With `isna()` and `sum()`
Project Code: We will now use these two functions to count the number of `NaN`s in the pics data.
# Number of nan values
pics.isna().sum()
Here’s the output.

What It Does: Calculates the number of missing (`NaN`) values in each column of the pics DataFrame.
1.2.4. Access a String With Label-Based Indexing With Bracket Notation
Project Code: We preview the `photo_urls` value of the first row in the pics dataset to see which characters we need to change to repair the corrupted JSON-like string.
# Try to correct the first Json
text_before = pics['photo_urls'][0]
print('Before Correction: \n\n', text_before)
Here’s the output.

What It Does:
- Retrieves the first value from the `photo_urls` column in the `pics` DataFrame.
- Prints its raw, unprocessed content — which appears to be a malformed JSON-like string.
1.2.5. Replace a Corrupted String Using `replace()`
Project Code:
# Try to replace corrupted values then convert to json
text_after = text_before.replace('\\' , '').replace('{title','{"title').replace(']"' , ']').replace('],"', ']","')
print("\n\nAfter correction and converted to json: \n\n", json.loads(text_after))
Here’s the output.

What It Does:
- Cleans up the corrupted `photo_urls` string by:
- Removing all backslashes (`\`).
- Adding a quote before `title`, converting `{title:` to `{"title":`
- Fixing `]"` into just `]`
- Adding a missing closing quote after arrays that are followed by a comma (turning `],"` into `]","`)
- Converts the cleaned string into a valid Python list of dictionaries using `json.loads()`.
1.2.6. Define a Cleaning and Counting Function
Project Code:
# Function to correct corrupted json and get count of photos
def correction(x):
    # if the value is null, put the count at 0 photos
    if x is np.nan or x == 'NaN':
        return 0
    else:
        # Replace corrupted values, then convert to json and get the count of photos
        return len(json.loads(x.replace('\\', '').replace('{title', '{"title').replace(']"', ']').replace('],"', ']","')))
What It Does:
- Checks if the `photo_urls` field is missing (`NaN`) and returns 0.
- Otherwise, it cleans up the corrupted JSON string (the same characters as in the previous step, only we do that for all rows) and counts how many photo entries exist.
1.2.7. Apply the Function to All Rows With `apply()`
Project Code:
# Apply Correction Function
pics['photo_count'] = pics['photo_urls'].apply(correction)
What It Does: Applies the `correction()` function row by row to the `photo_urls` column of the `pics` DataFrame and stores the result in a new column called `photo_count`.
1.2.8. Remove the Original Column
Project Code:
# Delete photo_urls column
del pics['photo_urls']
What It Does: Removes the now-unnecessary `photo_urls` column.
1.2.9. Preview the Cleaned Data
Project Code:
# Sample of Pics data
pics.sample(5)
Here’s the output.

What It Does: Displays five random rows from the updated `pics` DataFrame.
1.2.10. Merge DataFrames With `merge()`
Project Code:
# Merge data with interactions data on property_id
num_req = pd.merge(data, interaction, on ='property_id')[['property_id', 'request_date', 'activation_date']]
num_req.head(5)
Here’s the output.

What It Does:
- Merges the `data` (property listings) and `interaction` (user requests) datasets using the common key `property_id`.
- Selects only three relevant columns from the merged result:
- `property_id`: unique property identifier
- `request_date`: when a user interacted with the property
- `activation_date`: when the property was listed
1.2.11. Calculate Days Between Activation and Request
Project Code:
Now, we calculate the difference between the request and activation dates.
# Get a Time between Request and Activation Date to be able to select request within the number of days
num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')
# Show the first row of data
num_req.head(1)
Here’s the output.

What It Does:
- Computes the number of days between a property’s activation date and each user request.
- Stores the result in a new column `request_day`, which contains float values representing days.
- Displays the first row to verify the new column
1.2.12. Count With `groupby` and `agg()`
Project Code: Let’s now get the number of interactions within 3 days.
# Get a count of requests in the first 3 days
num_req_within_3d = num_req[num_req['request_day'] < 3].groupby('property_id').agg({ 'request_day':'count'}).reset_index()
We then do the same for the 7-day interactions.
# Get a count of requests in the first 7 days
num_req_within_7d = num_req[num_req['request_day'] < 7].groupby('property_id').agg({ 'request_day':'count'}).reset_index()
What It Does:
- Filters requests that happened within the first 3 and 7 days after a property was activated.
- Groups them by `property_id`.
- Counts how many requests each property received in that 3-day and 7-day window.
- Stores the result as new DataFrames `num_req_within_3d` and `num_req_within_7d`, each with two columns:
- `property_id`
- `request_day` (now representing the number of early requests)
1.2.13. Rename Columns With `.rename()`
Project Code: To customize the output, we rename the `'request_day'` column to `'request_day_within_3d'`.
# Show every property id with the number of requests in the first 3 days
num_req_within_3d = num_req_within_3d.rename({'request_day':'request_day_within_3d'},axis=1)
# Dataset with the number of requests within 3 days
num_req_within_3d
Here’s the output.

Now, the same thing, only for the 7-day interactions.
# Show every property id with the number of requests in the first 7 days
num_req_within_7d = num_req_within_7d.rename({'request_day':'request_day_within_7d'},axis=1)
# Dataset with the number of requests within 7 days
num_req_within_7d
Here’s the output.

What It Does:
- Renames the `request_day` column to `request_day_within_3d` and `request_day_within_7d`, respectively, to clearly reflect that the values represent the number of requests within the first 3 and 7 days.
- Shows each `property_id` with its corresponding request count in 3-day and 7-day windows.
1.2.14. Categorize Data and Count Properties in Each Column
Project Code: In the project, we use `value_counts()` to list the ten most common interaction counts within three days.
num_req_within_3d['request_day_within_3d'].value_counts()[:10]
Here’s the output.

Here’s the same thing for 7-day interactions.
num_req_within_7d['request_day_within_7d'].value_counts()[:10]
Here’s the output.

What It Does:
- Counts how many properties had the same number of requests within the first 3 and 7 days.
- Returns the top 10 most common request counts (e.g., how many properties had 1 request, 2 requests, etc.).
1.2.15. Define Category Mapping Function
Project Code:
def divide(x):
    if x in [1, 2]:
        return 'cat_1_to_2'
    elif x in [3, 4, 5]:
        return 'cat_3_to_5'
    else:
        return 'cat_above_5'
What It Does: Defines a function that maps numerical request counts into 3 categories.
1.2.16. Apply Category Mapping
Project Code: Here’s the function application for 3-day interactions.
num_req_within_3d['categories_3day'] = num_req_within_3d['request_day_within_3d'].apply(divide)
We apply the same function on 7-day interactions.
num_req_within_7d['categories_7day'] = num_req_within_7d['request_day_within_7d'].apply(divide)
What It Does: Applies the `divide()` function to `'request_day_within_3d'` and `'request_day_within_7d'` columns and creates new columns with category labels.
1.2.17. Preview Categorized Data
Project Code: Here we preview the 3-day data.
num_req_within_3d.head(3)
Here’s the output.

Here’s the code for the 7-day data preview.
num_req_within_7d.head(3)
Here’s the output.

What It Does: Displays the first three rows of the updated datasets to confirm the new columns have been added correctly.
1.2.18. Count the Number of Properties
Project Code: Here’s the 3-day properties count.
num_req_within_3d['categories_3day'].value_counts()
Here’s the output.

This is the 7-day properties count.
num_req_within_7d['categories_7day'].value_counts()
Here’s the output.

What It Does: Counts the number of properties in each category of the `'categories_3day'` and `'categories_7day'` columns.
1.2.19. Check Data Before Merging
Project Code:
data.sample()
Here’s the output.

pics.sample()
Here’s the output.

num_req_within_3d.sample()
Here’s the output.

num_req_within_7d.sample()
Here’s the output.

print(num_req_within_3d.shape)
print(num_req_within_7d.shape)
Here’s the output.

What It Does:
- Displays random samples from key datasets (`data`, `pics`, `num_req_within_3d`, `num_req_within_7d`)
- Prints the number of rows and columns for the 3-day and 7-day request datasets
1.2.20. Merge 3-Day and 7-Day Interaction Features
Project Code:
label_data = pd.merge(num_req_within_7d, num_req_within_3d, on ='property_id' , how='left')
What It Does: This code line merges two datasets:
- `num_req_within_7d` (requests in the first 7 days)
- `num_req_within_3d` (requests in the first 3 days)
1.2.21. Replace `NaNs` Using `fillna()`
Project Code:
label_data['request_day_within_3d'] = label_data['request_day_within_3d'].fillna(0)
label_data.head(3)
Here’s the output.

What It Does:
- `fillna(0)` replaces any `NaN` values in the `request_day_within_3d` column with 0. These `NaNs` appear when a property had no requests in the first three days
- Shows the first three rows of the merged `label_data` DataFrame.
1.2.22. Check for Missing Values
Project Code:
label_data.isna().sum()
Here’s the output.

What It Does: Counts the number of missing values (`NaN`) in each column of the `label_data` DataFrame.
1.2.23. Merge Property Data With Photo Counts
Project Code:
data_with_pics = pd.merge(data, pics, on ='property_id', how = 'left')
data_with_pics.head(3)
Here’s the output.

What It Does:
- Merges the main `data` DataFrame (property listings) with the `pics` DataFrame (which contains `photo_count`).
- Uses a left join to keep all property listings and bring in photo information where available.
- Displays the first 3 rows of the merged dataset to verify the result.
1.2.24. Create Final Dataset for Modeling
Project Code:
dataset = pd.merge(data_with_pics, label_data, on ='property_id')
dataset.head(3)
Here’s the output.

What It Does:
- Merges the enriched property data (with photo counts) from `data_with_pics` with the labeled request data (3-day and 7-day interactions) from `label_data`.
- Joins on property_id to create a complete dataset that includes:
- Property details
- Photo features
- Request counts and categories (labels)
- Shows the first three rows to confirm successful merging
1.2.25. Final Null Check
Project code:
dataset.isna().sum()
Here’s the output.

1.3. EDA and Data Processing
We’ve now entered the third substage of the Data Collection and Preparation stage. We’ll go through these steps:
1.3.1. Exploring Locality Distribution
1.3.2. Removing Columns With `drop()`
1.3.3. Dataset Summary – Nulls and Data Types
1.3.4. Visualize Distribution of a Numeric Feature on Histogram With Seaborn
1.3.5. Visualize Distribution of a Categorical Feature on a Count Plot With Seaborn
1.3.6. Split the Dataset Into Categorical and Numeric Columns With `select_dtypes()`
1.3.7. Sample Categorical & Numeric Feature Values
1.3.8. Categorical Value Counts Summary
1.3.9. Categorical Feature Distribution Plots
1.3.10. Preview Numerical Features
1.3.11. Box Plot for Outlier Detection and Range Overview Using `plot()`
1.3.12. Numerical Data Statistics With `describe()`
1.3.13. Creating a Scatterplot in Seaborn for Relationship Exploration
Let’s begin with this.
1.3.1. Exploring Locality Distribution
Project Code:
dataset['locality'].value_counts()
Here’s the output.

What It Does: Counts the frequency of each unique value in the `locality` column, showing how many properties exist in each location.
1.3.2. Removing Columns With `drop()`
Project Code: We now drop the columns that won't help with prediction.
# Dropped those columns that won't have an effect on the number of requests
dataset = dataset.drop(['property_id', 'activation_date' ,'latitude', 'longitude', 'pin_code','locality' ] , axis=1)
What It Does: Removes columns from the dataset that are not useful for prediction, including:
- Identifiers (`property_id`) as they have no predictive power
- Raw geographic coordinates that are too detailed without transformation (`latitude`, `longitude`)
- Sparse or overly specific location info (`pin_code`, `locality`)
- A raw date field (`activation_date`) that hasn’t been transformed into a usable feature
1.3.3. Dataset Summary – Nulls and Data Types
Project Code:
# Some info about all columns
print('Column : Num. of null values')
print(dict(dataset.isna().sum()))
print('\n\n')
print('Column : data type')
print(dict(dataset.dtypes))
Here’s the output.

What It Does:
- Prints the number of missing (`NaN`) values for each column.
- Prints the data type of each column.
1.3.4. Visualize Distribution of a Numeric Feature on Histogram With Seaborn
Project Code: We first plot the 3-day interactions on a histogram.
# Show histogram of the number of requests in first 3 days
plt.figure(figsize=(10,5))
sns.histplot(dataset, x="request_day_within_3d")
plt.title('histogram of num. of requests in first 3 days')
plt.show()
Here’s the output.

It shows that the data distribution is right-skewed. The data is heavily unbalanced, as most listings are low-engagement, and a few are high-performing outliers.
Let’s now make the same plots for the 7-day interactions.
# Show histogram of the number of requests in first 7 days
plt.figure(figsize=(10,5))
sns.histplot(dataset, x="request_day_within_7d")
plt.title('histogram of num. of requests in first 7 days')
plt.show()
Here’s the histogram.

What It Does: Draws histograms showing how many properties received different numbers of requests within the first 3 and 7 days after activation.
1.3.5. Visualize Distribution of a Categorical Feature on a Count Plot With Seaborn
Project Code: We first visualize the 3-day interactions.
sns.countplot(y=dataset.categories_3day)
plt.title('Value count for each category within 3 days')
plt.show()
Here’s the output.

Now, the same for the 7-day interactions.
sns.countplot(y=dataset.categories_7day)
plt.title('Value count for each category within 7 days')
plt.show()

What It Does:
- Draws horizontal bar charts showing how many properties fall into each category in the `categories_3day` and `categories_7day` columns.
- The y-axis shows the category names (`cat_1_to_2`, `cat_3_to_5`, `cat_above_5`).
- The x-axis shows the count of properties in each category.
1.3.6. Split the Dataset Into Categorical and Numeric Columns With `select_dtypes()`
Project Code: Now we split `dataset` into two DataFrames: one for categorical and the other for numeric columns.
# Get categorical columns
df_cat = dataset.select_dtypes(include=['object'])
# Get numeric columns
df_num = dataset.select_dtypes(exclude=['object'])
print("Categorical Columns : \n",list(df_cat.columns) )
print("Numeric Columns : \n",list(df_num.columns) )
Here’s the output.

What It Does:
- Extracts all categorical columns (usually strings or objects) into `df_cat`.
- Extracts all numeric columns (integers, floats) into `df_num`.
- Displays the names of all categorical and numeric columns.
1.3.7. Sample Categorical & Numeric Feature Values
Project Code: Here’s the categorical features preview.
df_cat.sample(2)
Here’s the output.

Now, numeric columns.
df_num.sample(2)

What It Does: Randomly selects and displays two rows from the categorical features dataframe `df_cat` and from the numeric features dataframe `df_num`.
1.3.8. Categorical Value Counts Summary
Project Code:
# Show all values and get count of them in every categorical column
for col in df_cat.columns[:-2]:
    print('Column Name : ', col)
    print(df_cat[col].value_counts())
    print('\n-------------------------------------------------------------\n')
Here are the outputs.





What It Does:
- Iterates over all categorical columns except the last two.
- For each column, it:
- Prints the column name.
- Prints the frequency count of each unique value using `value_counts()`.
- Separates outputs with a visual delimiter line for readability.
1.3.9. Categorical Feature Distribution Plots
Project Code:
# Plot count of values in every columns
for col in df_cat.columns[:-2]:
    sns.countplot(x=col, data=dataset)
    plt.title(f'Show value counts for column {col}')
    # Show the plot
    plt.show()
Here are the outputs.





What It Does:
- Iterates through all categorical columns except the last two.
- For each column, it:
- Plots a bar chart using `sns.countplot()` to visualise how often each category appears.
- Sets a title for context.
- Displays the chart.
1.3.10. Preview Numerical Features
Project code:
df_num.head()
Here’s the output.

What It Does:
- Displays the first 5 rows of the numerical portion of the dataset.
- `df_num` contains only columns with numeric data types, extracted earlier via `select_dtypes()`.
1.3.11. Box Plot for Outlier Detection and Range Overview Using `plot()`
Project Code: We can now plot the numerical data from our project.
# Box Plot to show ranges of values and outliers
df_num.plot(kind='box', subplots=True, sharex=False, sharey=False,figsize=(22,10))
plt.show()
The additional arguments are:
- `subplots=True` – draws each box plot in a separate subplot
- `sharex=False` – each subplot gets its own x-axis scale (no shared x-axis)
- `sharey=False` – each subplot gets its own y-axis scale (no shared y-axis)
Here’s the code output.

What It Does:
- Generates box plots for each numerical column in `df_num`.
- `subplots=True` draws one box per variable in separate subplots.
- The visualisation shows:
- Median (central line)
- Interquartile range (box)
- Potential outliers (points outside whiskers)
1.3.12. Numerical Data Statistics With `describe()`
Project Code:
# Get some statistics about numeric columns
df_num.describe()
Here’s the output.

What does this output tell us about numerical data?
That the maximum number of bathrooms is 21, but 75% of properties have two or fewer bathrooms. So that’s probably an outlier. The same conclusion applies to the following features: `floor`, `total_floor`, `property_size`, and `property_age`. We’ll probably need outlier removal or capping here.
The monetary features `rent` and `deposit` are right-skewed, making them candidates for scaling or transformation.
Target variables – `request_day_within_3d` and `request_day_within_7d` – could also benefit from capping or binning.
What It Does:
- Generates summary statistics for each numerical column in the `df_num` DataFrame.
- Outputs include:
- Count (non-null values)
- Mean
- Standard deviation
- Min, 25%, 50% (median), 75%, and max values
We’ll now create a pairwise scatterplot grid using Seaborn’s pairplot to understand the data better. We will plot five variables on the X-axis (`property_age`, `property_size`, `rent`, `deposit`, `photo_count`) against the `request_day_within_3d` variable on the Y-axis. Why did we choose those five columns? Because they are numeric, interpretable, and we think they have a direct impact on how users interact with property listings.
1.3.13. Creating a Scatterplot in Seaborn for Relationship Exploration
Project Code: In the project, we first plot the scatter plot grid for the 3-day interaction label, then for the 7-day interaction label.
sns.pairplot(data=dataset,
             x_vars=['property_age', 'property_size', 'rent', 'deposit', 'photo_count'],
             y_vars=['request_day_within_3d'])
plt.show()
Here’s the output.

From the plots, we see that most interactions happen for newer, mid-sized properties with rent around ₹10,000–₹20,000, a lower required deposit, and 5-15 photos.
Let’s now do the same for the 7-day interactions.
sns.pairplot(data=dataset,
             x_vars=['property_age', 'property_size', 'rent', 'deposit', 'photo_count'],
             y_vars=['request_day_within_7d'])
plt.show()

The interpretation is more or less the same as earlier.
What It Does:
- Uses Seaborn's `pairplot()` to generate scatter plots for each selected `x_var` against the target variables `request_day_within_3d` and `request_day_within_7d`.
- Helps visualise bivariate relationships between numerical features and the target.
With this, we've come to the end of the Data Collection & Preparation stage of the workflow. If it seems never-ending, well, this is what it looks like in reality. In most data science projects, you’ll spend most of your time gathering and preparing data, which shows how delicate and crucial this stage is.
2. Feature Engineering
In this step, you create new input variables or transform existing ones to enhance the model’s ability to learn relevant patterns.
For example, you might combine multiple features into interaction terms, extract time-based components, apply log or polynomial transformations, bin continuous values, or encode categorical variables more meaningfully.
Depending on the data, you might go through some of those or all of the feature engineering stages.

The whole idea behind this is to represent the data structure more effectively and enhance the model’s ability to learn relevant patterns.
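Here's a small, generic sketch of the techniques above on a hypothetical DataFrame (the column names are illustrative, not taken from the project):
import numpy as np
import pandas as pd

# Hypothetical listings data -- for illustration only
listings = pd.DataFrame({
    'rent': [8000, 12000, 30000],
    'deposit': [40000, 60000, 200000],
    'activation_date': pd.to_datetime(['2023-01-05', '2023-02-10', '2023-03-20']),
    'furnishing': ['full', 'semi', 'none'],
})

listings['rent_to_deposit'] = listings['rent'] / listings['deposit']  # interaction (ratio) term
listings['activation_month'] = listings['activation_date'].dt.month   # time-based component
listings['log_rent'] = np.log1p(listings['rent'])                     # log transform to reduce skew
listings['rent_band'] = pd.cut(listings['rent'],                      # binning continuous values
                               bins=[0, 10000, 20000, np.inf],
                               labels=['low', 'mid', 'high'])
listings = pd.get_dummies(listings, columns=['furnishing'])           # encoding a categorical variable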
Example
We’re continuing with our data project with these steps.
2.1. Removing Outliers
2.2. One-Hot Encoding
2.3. MinMaxScaler
2.1. Removing Outliers
As a first step of feature engineering, we’ll remove outliers using the interquartile range (IQR) method. Outliers can skew our model training, distort metrics like mean and standard deviation, and lead to poor generalization.
We’ll remove outliers in the following steps.
2.1.1. Define Outlier Removal Function With `quantile()`
2.1.2. List Numeric Columns
2.1.3. Copy Dataset for Cleaning
2.1.4. Apply Outlier Removal to Selected Columns
2.1.5. Capping Values
2.1.6. Apply Capping to Target Variables
2.1.7. Inspect Capped Target Distributions
2.1.8. Box Plot After Outlier Removal and Capping
2.1.9. Pairplot After Capping – Explore Final Feature Relationships
2.1.10. Drawing a Heat Map in Seaborn Using `heatmap()`
2.1.1. Define Outlier Removal Function With `quantile()`
Project Code:
# Function to remove outliers using quantiles
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 2 * iqr
    fence_high = q3 + 2 * iqr
    df_out = df_in.loc[(df_in[col_name] <= fence_high) & (df_in[col_name] >= fence_low)]
    return df_out
What It Does:
- Defines a reusable function that removes outliers from a DataFrame column based on the interquartile range (IQR).
- It excludes values that fall outside Q1 - 2*IQR and Q3 + 2*IQR.
2.1.2. List Numeric Columns
Project Code:
df_num.columns
Here’s the output.

What It Does: Displays all numeric columns in the dataset.
2.1.3. Copy Dataset for Cleaning
Project Code:
df = dataset.copy()
What It Does: Creates a copy of the original dataset to perform outlier removal without altering the original data.
2.1.4. Apply Outlier Removal to Selected Columns
Project Code:
for col in df_num.columns:
    if col in ['gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d']:
        continue
    df = remove_outlier(df, col)
What It Does: Loops through numeric columns and removes outliers from each, excluding binary indicator columns and label columns.
2.1.5. Capping Values
Project Code: As a next step, we cap interaction counts to avoid extremely large values, which reduces skewness in the target variables. This will help regression models focus on the common range and not be dominated by outliers.
We cap the 3-day interactions at 10 and the 7-day interactions at 20.
def capping_for_3days(x):
    num = 10
    if x > num:
        return num
    else:
        return x

def capping_for_7days(x):
    num = 20
    if x > num:
        return num
    else:
        return x
What It Does:
- `capping_for_3days(x)` limits (caps) the value of `x` to a maximum of 10.
- `capping_for_7days(x)` limits the value of `x` to a maximum of 20.
- If the value is already below or equal to the cap, it stays unchanged.
2.1.6. Apply Capping to Target Variables
Project Code:
df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)
df['request_day_within_7d_capping'] = df['request_day_within_7d'].apply(capping_for_7days)
What It Does:
- It creates two new columns (`_capping`) that contain the capped versions of the original request counts.
- The capping sets a maximum of 10 requests within 3 days and 20 within 7 days.
2.1.7. Inspect Capped Target Distributions
Project Code: We perform a frequency count for 3-day capped interactions.
df['request_day_within_3d_capping'].value_counts()
Here’s the output.

We do the same for 7-day capped interactions, but we limit the output to the ten most frequent values, since these interactions are capped at 20 and have more distinct counts.
df['request_day_within_7d_capping'].value_counts()[:10]
Here’s the output.

What It Does:
- Counts how many times each value appears in the `request_day_within_3d_capping` and `request_day_within_7d_capping` columns.
- Provides a frequency distribution of the capped 3-day and 7-day interaction values.
2.1.8. Box Plot After Outlier Removal and Capping
Project Code:
df.plot(kind='box', subplots=True, sharex=False, sharey=False, figsize=(22,10))
plt.show()
Compared to the earlier box plot, it seems we managed to decrease the number of outliers.

What It Does:
- Draws box plots for all numeric columns in the updated `df` DataFrame.
- Visualises the distribution, central tendency, and remaining outliers for each feature after cleaning (i.e. outlier removal and capping).
2.1.9. Pairplot After Capping – Explore Final Feature Relationships
Project Code: We draw scatter plots for the capped 3-day interactions
sns.pairplot(data=df,
             x_vars=['property_age', 'property_size', 'rent', 'deposit', 'photo_count'],
             y_vars=['request_day_within_3d_capping'])
plt.show()
Here’s the output.

The insight we can draw is that `photo_count` seems the most promising feature of the five. The `rent` and `deposit` features may have a weak negative influence. The patterns for `property_age` and `property_size` are less obvious, but these features could still be valuable in combination with others or with proper transformations.
We create the same plots for the 7-day interactions.
sns.pairplot(data=df,
             x_vars=['property_age', 'property_size', 'rent', 'deposit', 'photo_count'],
             y_vars=['request_day_within_7d_capping'])
plt.show()
Here’s the output.

There’s no strong linear relationship in any feature, although there are some weak patterns in `photo_count`, `rent`, and `deposit`.
What It Does:
- Creates scatter plots to visualise the relationships between key numeric features (x-axis) and the capped 3-day and 7-day request counts (y-axes).
- Uses the cleaned and capped DataFrame (`df`), reflecting the current state of the dataset post-preprocessing.
2.1.10. Drawing a Heat Map in Seaborn Using `heatmap()`
Project Code: We can now apply what we learned on the project data.
# Show a correlation on a heat map.
plt.subplots(figsize=(10,10))
# numeric_only=True restricts the correlation matrix to numeric columns (needed in newer pandas versions)
dataplot = sns.heatmap(df.corr(numeric_only=True), cmap="YlGnBu", annot=True)
# displaying heatmap
plt.show()
Here’s the output.

What It Does:
- Computes the correlation matrix for all numerical columns in `df`
- Plots the matrix as a heatmap, where:
- Colour intensity shows the strength of correlation.
- `annot=True` displays the actual correlation values inside each cell.
2.2. One-Hot Encoding
One-hot encoding is a technique to convert categorical data (words/labels) into numerical data – because most machine learning models can’t work with strings.
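Conceptually, it works like this (a tiny sketch with a made-up column; the project itself uses scikit-learn's `OneHotEncoder`, as shown in the steps below):
import pandas as pd

# Made-up categorical column, for illustration only
df_demo = pd.DataFrame({'lease_type': ['family', 'bachelor', 'company', 'family']})

# Each category becomes its own 0/1 column
print(pd.get_dummies(df_demo, columns=['lease_type']))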
In our project, one-hot encoding will involve these steps.
2.2.1. Data Snapshot Before One-Hot Encoding
2.2.2. Check Column Names Before Feature Selection
2.2.3. Dropping Label Columns to Isolate Features
2.2.4. Separating Categorical Column (Including Possible Nulls)
2.2.5. Separating Remaining (Non-Categorical) Features
2.2.6. Storing Label Columns Separately
2.2.7. Initialize Clean DataFrames for Numeric and Categorical Data
2.2.8. Filling Null Values in Numeric Columns With the Mean
2.2.9. Filling Null Values in Categorical Columns With the Mode
2.2.10. Checking for Null Values
2.2.11. Import and Initialize OneHotEncoder
2.2.12. Fit and Transform the Categorical Data
2.2.13. Generate Column Names for New Features
2.2.14. Flatten the New Feature Labels
2.2.15. Extend Final Column List
2.2.16. Create DataFrame With Encoded Values and Named Columns
2.2.17. Check Output
Here we go.
2.2.1. Data Snapshot Before One-Hot Encoding
Project Code:
df.sample(5)
Here’s the output.

What It Does: Randomly displays five rows from the cleaned and processed dataset `df`.
2.2.2. Check Column Names Before Feature Selection
Project Code:
df.columns
Here’s the output.

What It Does: Lists all column names in the DataFrame `df`.
2.2.3. Dropping Label Columns to Isolate Features
Project Code:
# One-Hot Encoder for categorical values
# dividing a data to categorical, numeric and label
X = df.drop(['request_day_within_7d', 'categories_7day', 'request_day_within_3d',
             'categories_3day', 'request_day_within_3d_capping',
             'request_day_within_7d_capping'], axis=1)
What It Does: Removes target/label columns from the dataset `df` so you're left with only feature columns in `X`.
2.2.4. Separating Categorical Column (Including Possible Nulls)
Project Code:
x_cat_withNull= df[X.select_dtypes(include=['O']).columns]
What It Does:
- It selects all categorical (object-type) columns from the DataFrame `df` that are part of the feature set `X`.
- The result, `x_cat_withNull`, holds only the categorical input features that may still contain missing values.
2.2.5. Separating Remaining (Non-Categorical) Features
Project Code:
x_remain_withNull = df[X.select_dtypes(exclude=['O']).columns]
What It Does:
- It selects all non-categorical (non-'object' dtype) columns from `df` that are part of the feature set `X`.
- These columns typically include numerical features like integers and floats.
- The result is stored in `x_remain_withNull`, which contains numeric input features that may still contain null values.
2.2.6. Storing Label Columns Separately
Project Code:
y = df[['request_day_within_7d', 'categories_7day', 'request_day_within_3d',
        'categories_3day', 'request_day_within_3d_capping',
        'request_day_within_7d_capping']]
What it does:
- Selects specific columns from the full dataset `df` and stores them in a new DataFrame called `y`.
- These columns represent target variables that describe the number of user requests over time and their categorical groupings.
2.2.7. Initialize Clean DataFrames for Numeric and Categorical Data
Project Code:
x_remain = pd.DataFrame()
x_cat = pd.DataFrame()
What it does: Creates empty DataFrames to store cleaned numeric (`x_remain`) and categorical (`x_cat`) features.
2.2.8. Filling Null Values in Numeric Columns With the Mean
Project Code:
# Handling Null values
# If a numeric column has null values, fill them with the column mean (average)
for col in x_remain_withNull.columns:
    x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())
What it does:
- The loop goes through each numeric column in `x_remain_withNull` and fills any `NaN` (missing) values with the mean of that column.
- The result is stored in `x_remain`.
2.2.9. Filling Null Values in Categorical Columns With the Mode
Project Code:
# If a categorical column has null values, fill them with the mode (most frequent value)
for col in x_cat_withNull.columns:
    x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])
What It Does:
- This loop checks each categorical column in `x_cat_withNull` and fills any missing (`NaN`) values with the most frequent value (i.e., the mode) of that column.
- The cleaned data is saved into `x_cat`.
2.2.10. Checking for Null Values
Project Code:
x_remain.isna().sum()
Here’s the output. There are no `NULLs`.

What It Does: Verifies that all missing values have been successfully filled (imputed).
2.2.11. Import and Initialize OneHotEncoder
Project Code:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categories='auto' , handle_unknown='ignore')
What It Does:
- Imports the `OneHotEncoder` from scikit-learn.
- Creates an instance of the encoder (`ohe`) with these settings:
- `categories='auto'`: Detects the unique values in each feature automatically.
- `handle_unknown='ignore'`: Ensures that if new (unseen) categories appear in future data, they won’t cause errors during encoding.
2.2.12. Fit and Transform the Categorical Data
Project Code:
feature_train = ohe.fit_transform(x_cat).toarray()
feature_labels = ohe.categories_
What It Does:
- Learns the unique categories in each column (`fit`)
- Applies one-hot encoding to each category (`transform`)
- Converts the sparse output to a full array (`toarray`)
- Stores the learned categories per column
2.2.13. Generate Column Names for New Features
Project Code:
new_features = []
for i, j in zip(x_cat.columns, feature_labels):
    new_features.append(f"{i}_" + j)
What It Does: Combines original column names with category values.
2.2.14. Flatten the New Feature Labels
Project Code:
feature_labels = np.array(new_features, dtype=object).ravel()
What It Does: Flattens the list of new column names.
2.2.15. Extend Final Column List
Project Code:
f = []
for i in range(feature_labels.shape[0]):
    f.extend(feature_labels[i])
What It Does: Builds the final list of flattened column names.
2.2.16. Create DataFrame With Encoded Values and Named Columns
Project Code:
df_features = pd.DataFrame(feature_train, columns=f)
What It Does: Converts the NumPy array into a pandas DataFrame with readable column names.
2.2.17. Check Output
Project Code:
print(df_features.shape)
df_features.sample(3)
Here are the outputs.


What It Does:
- Shows how many new features were created.
- Randomly samples a few rows to inspect encoding.
2.3. MinMaxScaler
We’ll scale features using the MinMaxScaler. It scales numeric features into a fixed range, typically [0, 1]. Doing this is important for models sensitive to input data scale, such as neural networks, k-nearest neighbors (KNN), and support vector machines (SVM).
Here’s a step-by-step guide with additional explanations.
2.3.1. Import Scaler
2.3.2. Apply MinMax Scaling to Numeric Features
2.3.3. Preview Target Columns
2.3.4. Concatenate All Feature Data
2.3.5. Drop Any Remaining Nulls
2.3.6. Check Final Dataset Shape
Let’s start with scaling.
2.3.1. Import Scaler
Project Code:
from sklearn.preprocessing import MinMaxScaler
What It Does: Imports `MinMaxScaler`, a common feature scaling method.
2.3.2. Apply MinMax Scaling to Numeric Features
Project Code:
sc = MinMaxScaler()
x_remain_scaled = sc.fit_transform(x_remain)
x_remain_scaled = pd.DataFrame(x_remain_scaled, columns=x_remain.columns)
What It Does:
- Initializes the scaler
- Fits and transforms the numeric data (`x_remain`) to scale it between 0 and 1, applying the formula `X_scaled = (X - X_min) / (X_max - X_min)` to each column.
- Converts the resulting NumPy array back to a DataFrame with original column names.
2.3.3. Preview Target Columns
Project Code:
y.head(1)
Here’s the output.

What It Does: Displays the first row of the `y` DataFrame containing target columns related to request activity.
2.3.4. Concatenate All Feature Data
Project Code: We first concatenate 3-day interactions features.
data_with_3days = pd.concat([
    df_features.reset_index(drop=True),
    x_remain_scaled.reset_index(drop=True),
    y[['request_day_within_3d', 'request_day_within_3d_capping', 'categories_3day']].reset_index(drop=True)
], axis=1)
Let’s now do the same for 7-day interactions.
# Concatenate data after applying One-Hot Encoding
data_with_7days = pd.concat([
    df_features.reset_index(drop=True),
    x_remain_scaled.reset_index(drop=True),
    y[['request_day_within_7d', 'request_day_within_7d_capping', 'categories_7day']].reset_index(drop=True)
], axis=1)
What It Does:
- Combines:
- One-hot encoded categorical features (`df_features`)
- Scaled numeric features (`x_remain_scaled`)
- Target columns from `y`
- Uses `reset_index(drop=True)` to align row indices before concatenation
2.3.5. Drop Any Remaining Nulls
Project Code: We drop nulls for 3-day interactions…
data_with_3days.dropna(inplace=True)
…and 7-day interactions.
data_with_7days.dropna(inplace=True)
What It Does: Removes any rows with missing values.
2.3.6. Check Final Dataset Shape
Project Code:
data_with_3days.shape
Here’s the output for the 3-day interactions.

Now, the same code, but for the 7-day interactions…
data_with_7days.shape
…and the output.

3. Model Selection
The goal of model selection is to identify the most suitable algorithm(s) for a specific problem. This process is based on the data structure and the problem type, e.g., classification, regression, or ranking.
Many models are based on certain mathematical assumptions about data, so make sure that these assumptions align with the actual data you’re using. To choose the right model, you should evaluate multiple algorithms that might suit your purpose using cross-validation and performance metrics.
What performance metrics, you might ask? Again, they depend on the specific goal of the problem. In the table below, you can find suitable metrics for each problem type.

These metrics are used to compare multiple models to answer this question: “Which model performs best on this problem?” In essence, you’re benchmarking several options.
It’s important to make this distinction, as performance metrics are also used in the following workflow step, but for a different purpose.
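To illustrate the benchmarking idea (a sketch on synthetic data, not the project code), you could compare several regression models with cross-validation like this:
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a prepared feature matrix and regression target
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

candidates = {
    'linear': LinearRegression(),
    'decision_tree': DecisionTreeRegressor(max_depth=5, random_state=42),
    'knn': KNeighborsRegressor(n_neighbors=10),
}

for name, model in candidates.items():
    # 5-fold cross-validation, scored with RMSE (lower is better)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error')
    print(f'{name}: mean RMSE = {-scores.mean():.2f}')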
Example
The model selection for this project will look like this.
3.1. Data Splitting
3.2. Evaluation Metrics
3.3. Regression Models
3.4. Classification Models
3.5. Gradient Boosting
Let’s start with splitting the dataset.
3.1. Data Splitting
Project Code:
from sklearn.model_selection import train_test_split
What It Does: Imports the function to split the dataset into training and testing subsets.
3.2. Evaluation Metrics
Project Code:
from sklearn.metrics import classification_report, mean_squared_error
What It Does:
- `classification_report` provides precision, recall, F1-score for classification tasks.
- `mean_squared_error` measures average squared error for regression tasks.
3.3. Regression Models
Project Code:
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
What It Does: Imports popular regression models.
- `LinearRegression`: For continuous output.
- `Lasso`: Like linear regression but performs feature selection via regularization.
- `KNeighborsRegressor`: Makes predictions based on nearby data points.
- `DecisionTreeRegressor`: Predicts with if-else logic using decision rules.
3.4. Classification Models
Project Code:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
What It Does: Imports classification models.
- `LogisticRegression`: A baseline classification algorithm.
- `RandomForestClassifier`: An ensemble method using multiple decision trees.
3.5. Gradient Boosting
Project Code:
import xgboost as xgb
What It Does: Imports the XGBoost library, a high-performance gradient boosting framework.
4. Model Training and Evaluation
Training the model means adjusting the learning algorithm’s internal parameters so it can discover patterns or make decisions. How you do that depends on the type of machine learning: for example, through labels in supervised learning or structural signals in unsupervised learning.
By model evaluation, we mean the assessment of the trained model’s generalization performance on the unseen dataset. This helps estimate the model’s performance on real-world data and suggests possible further improvements in the model.
To evaluate the model, we use the same metrics we mentioned in the previous stage. However, in this stage, the purpose is to evaluate a single model’s behavior, typically on a train/test data split or during cross-validation. That way, you can diagnose model overfitting or underfitting, check if it meets your performance thresholds, and understand class-level behavior (via confusion matrix, precision/recall, etc.).
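For example (again a sketch on synthetic data, not the project code), comparing a single model's training and test error is a quick way to spot overfitting:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data; in the project this would be the prepared features and labels
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# An unconstrained decision tree will essentially memorize the training set
model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

# A near-zero training error paired with a much larger test error signals overfitting
print(f'train MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}')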
Example
We will predict interactions within both 3 and 7 days in these steps.
4.1. Data Inspection
4.2. Prepare Features and Labels
4.3. Regression Models Building
4.4. Classification Models Building
4.5. Deep Learning
4.1. Data Inspection
Project Code: Here’s the code for inspecting the 3-day interaction data.
data_with_3days.sample()
Here’s the output. There are too many columns to show them all in the article, but you get the idea.

Now the same for the 7-day interactions. Code…
data_with_7days.sample()
…and the output.

What It Does: Displays a random row from the `data_with_3days` and `data_with_7days` DataFrames.
4.2. Prepare Features and Labels
Project Code: Here’s the code for 3-day interactions…
X = data_with_3days.drop(['request_day_within_3d',
                          'request_day_within_3d_capping',
                          'categories_3day'], axis=1)
y = data_with_3days[['request_day_within_3d', 'request_day_within_3d_capping', 'categories_3day']]