Data Preparation for Machine Learning Algorithm

  • By
  • November 1, 2021
  • Machine Learning

Data can be used for business analytics, for data visualization, or as input for developing a machine learning model, but first it has to be prepared. Data preparation is defined as collecting, combining, structuring and organizing the data for the required application. It is necessary so that data processing and analysis give more accurate and consistent results. Data preparation comes in two types: the first extracts information from raw data through KPI calculations, and the second prepares data for data science algorithms.

In this blog, we will concentrate on the second type: preparing data for a machine learning algorithm. This can be done using four methods: normalization, conversion, missing value imputation and resampling. In this blog we will discuss the first three.

To understand the data preparation techniques, we will take the example of churn prediction. The data comes from a company's CRM and includes features such as customer behaviour, demographics and revenue. The goal of this example is to distinguish the customers who are at risk of churning from the other customers. This is a binary classification problem because the data set has only two possible outcomes, yes or no. We therefore need to train a model on this data set to separate customers into two classes: churn customers, for whom the output is yes, and non-churn customers, for whom it is no. Any classification algorithm can be used for this, such as a decision tree, random forest or logistic regression.

To use any classification algorithm, we need to divide the data set into two subsets: a training data set and a testing data set. Both are required because training alone cannot show that the model is good or gives high accuracy. First we train the model on the training data set. Once the training results are satisfactory, we test the model on the testing data set, which it has not seen before, to check the results. Here we will use logistic regression for this classification.
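The split-train-test workflow described above could look roughly like the sketch below. The CRM data is not available here, so the features and the churn label are synthetic and purely hypothetical, and the 75/25 split is an assumption rather than a value from this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CRM data (hypothetical): 200 customers, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Hypothetical churn label: 1 = churn ("yes"), 0 = no churn ("no")
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out 25% of the rows for testing; stratify keeps the yes/no ratio similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train on the training subset only
model = LogisticRegression()
model.fit(X_train, y_train)

# Compare accuracy on data the model has seen with data it has not seen
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

A test accuracy far below the training accuracy would suggest that the model has memorized the training data instead of learning a general pattern.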

Before applying the logistic regression algorithm, we need to clean the data set; in other words, data preparation is required to get satisfactory results. It can be done using the following methods.

1. Normalization –

Normalization is one of the most commonly used techniques when preparing data for a machine learning algorithm. It changes the values of numeric columns to a common scale without distorting the differences in their ranges. For churn prediction we are using logistic regression, so we need the input data normalized into the interval [0,1]. Not every machine learning algorithm requires normalization; it is needed when the algorithm works with distances or variances. Normalization can be applied to a single column or to multiple columns of a data set.

For example, suppose one column of the input data set has values ranging from 0 to 1 and another has values ranging from 1000 to 10000. This large difference in scale causes problems when the values of the two columns are combined as features in the model. To avoid this, we normalize the data set to bring all values onto a common scale.

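To do this, first import StandardScaler from sklearn's preprocessing module, then fit it on the training data and apply transform. Note that StandardScaler rescales each column to zero mean and unit variance; to map values into the interval [0,1], MinMaxScaler from the same module can be used in the same way.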

from sklearn import preprocessing
# train_norm holds the numeric training columns to be scaled
std_scale = preprocessing.StandardScaler().fit(train_norm)
x_train_norm = std_scale.transform(train_norm)
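To make the difference between the two scalers concrete, here is a small sketch on the two-column situation described earlier (one column in 0 to 1, the other in 1000 to 10000). The array values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training data: column 0 lies in [0, 1], column 1 in [1000, 10000]
train = np.array([
    [0.0, 1000.0],
    [0.5, 5500.0],
    [1.0, 10000.0],
])

# Min-max scaling maps each column to [0, 1]: (x - min) / (max - min)
minmax_scaled = MinMaxScaler().fit_transform(train)

# Standardization gives each column zero mean and unit variance: (x - mean) / std
standard_scaled = StandardScaler().fit_transform(train)

print(minmax_scaled)
print(standard_scaled)
```

After either transformation the two columns are on the same scale, so neither dominates the other just because of its units. In practice the scaler would be fitted on the training subset only and then applied to the testing subset with transform.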

2. Conversion –

Sometimes the data set has categorical variables as input features. To process these categorical features, we need to convert them into numerical values. One categorical (nominal) column can be converted into one or more numerical columns. There are two methods for this conversion.

The first method is label (or index) encoding, in which each nominal value is mapped to one integer value.

For example, suppose one column of the input holds city names such as Pune, Mumbai, Delhi and Kolkata. When we apply the label encoder to this column, each city is assigned one integer value, for example 0 – Pune, 1 – Mumbai, 2 – Delhi, 3 – Kolkata. (scikit-learn's LabelEncoder assigns the integers in sorted order of the labels, so its actual mapping differs from this illustration.) To use this technique, first import the label encoder from preprocessing, then apply fit and transform.

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
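The fit and transform step is not shown above, so here is a minimal sketch of it on the city example. The list of cities is illustrative. Note also that scikit-learn documents LabelEncoder as intended for target labels; for input feature columns, OrdinalEncoder from the same module plays the equivalent role.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative city column; "Pune" appears twice to show that repeats share a code
cities = ["Pune", "Mumbai", "Delhi", "Kolkata", "Pune"]

label_encoder = LabelEncoder()
# fit_transform learns the distinct labels and returns one integer per row
city_codes = label_encoder.fit_transform(cities)

# classes_ holds the labels in sorted order; the index of each label is its code
mapping = {str(city): code for code, city in enumerate(label_encoder.classes_)}
print(mapping)
print(city_codes)
```

Because the integers imply an order (Delhi < Kolkata < Mumbai < Pune) that the cities do not really have, label encoding suits ordered categories or tree-based models better than it suits models such as logistic regression.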

Another technique for converting categorical values to numerical values is one hot encoding. In one hot encoding, each categorical value creates a new variable, that is, a new column. These new columns are filled with 0 or 1, where 0 indicates the absence and 1 the presence of that value.

from sklearn import preprocessing

# The class name is OneHotEncoder (capital H)
onehot_encoder = preprocessing.OneHotEncoder()
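As a small sketch of what the encoder produces, the example below one hot encodes an illustrative city column. The data is made up; the encoder is used with its default settings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2D array: one row per sample, one column per feature
cities = np.array([["Pune"], ["Mumbai"], ["Delhi"], ["Pune"]])

onehot_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
city_onehot = onehot_encoder.fit_transform(cities).toarray()

print(onehot_encoder.categories_[0])  # column order of the new 0/1 columns
print(city_onehot)
```

Each row contains exactly one 1, in the column that belongs to that row's city, so no artificial order between the cities is introduced.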

3. Missing Value Imputation – 

Values can be missing from the data for various reasons: the data was not recorded, a field was not filled in by the operator, an observation was not made, and so on. Missing values must be handled during data preparation because many machine learning algorithms do not support data with missing values.

Missing values can be handled by deleting the affected data or by using another technique. Deleting data means losing information, so this method is not adopted in many cases. Instead, we can fill the missing values using the mean, mode or median method. With the mean, we take the mean of all available values in the column and put it in place of the missing values. With the mode, we count how often each value occurs in the column and put the most frequent value in place of the missing values. With the median, we take the middle value of all available values in the column, once sorted, and put it in place of the missing values.
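The three strategies can be tried with scikit-learn's SimpleImputer, as in the sketch below. The column values are hypothetical and chosen so that mean, median and mode give three different fills.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with two missing entries (np.nan)
# Available values: 1, 1, 3, 5, 10 -> mean 4, median 3, mode 1
values = np.array([[1.0], [1.0], [3.0], [np.nan], [5.0], [10.0], [np.nan]])

# Each imputer computes its statistic from the available values of the column
mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(values)

print(mean_filled.ravel())
print(median_filled.ravel())
print(mode_filled.ravel())
```

The median is less affected by the outlying value 10 than the mean is, which is one reason to prefer it for skewed columns such as revenue.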

Author:-

Madhuri Diwan

| SevenMentor Pvt Ltd.

© Copyright 2021 | Sevenmentor Pvt Ltd.
