Data Preprocessing 

  • By
  • August 3, 2021
  • Python


This first and most essential step has to be done before you feed your data to a learning algorithm. The data has to be cleaned and transformed so that the algorithm can produce better results or insights.

Most of the time your data is messy, and because data sets are often large, preprocessing is necessary. Data scientists say that nearly 60 to 70 percent of their time goes into preprocessing the data. In this blog we shall look at the different steps involved in this process. Let’s take a classic data set and go through each step in detail, one after the other.

 

For Free, Demo classes Call: 7507414653
Registration Link: Click Here!

For this purpose we shall be using Python as the programming language. Python is a general-purpose programming language that is (1) easy to read, (2) easy to write, and (3) easy to understand. It also comes with a huge collection of libraries (code written by someone else) stored in a common repository for anyone to use. Let’s dive into data preprocessing and look at the various steps involved in the process.

Let us first load the data set and store it in a variable called data. To help us in this process, we shall make use of the widely known pandas library (used for data manipulation). The code to do that is given below.

[1]: # Importing the pandas library 

import pandas as pd 

The statement import pandas as pd imports the pandas library. The as pd part gives pandas the alias pd.

[2]: # Loading the data set 

data = pd.read_csv('Data.csv')

The data stored in Data.csv is in comma-separated values (CSV) format. It is read with the read_csv() function of the pandas library, which accepts the name of the file as a quoted string argument. There are several other arguments that you can pass to read_csv(), but they all have default values, so we leave them as they are. Passing only the name of the file serves our purpose. Let’s look at what the data holds. The code is written below.

[3]: print(data) 

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0  61000.0       Yes
5   France  35.0      NaN       Yes
6    Spain  38.0  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany   NaN  83000.0        No
9   France  37.0  67000.0       Yes
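
Data.csv itself is not included with this post, so here is a minimal sketch that rebuilds the same ten rows from the printed output above, in case you want to follow along. Using io.StringIO as a stand-in for the file is an assumption made purely for illustration; read_csv accepts a file path in the same position.

```python
import io
import pandas as pd

# Stand-in for Data.csv, reconstructed from the printed output above (assumption).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,61000,Yes
France,35,,Yes
Spain,38,52000,No
France,48,79000,Yes
Germany,,83000,No
France,37,67000,Yes
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)         # (rows, columns)
print(data.dtypes)        # Age and Salary are parsed as floats because of the missing values
print(data.isna().sum())  # number of missing values per column
```

The empty fields in the Age and Salary columns are read as NaN, which is what the later imputation step fills in.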

The data has 4 variables, i.e. columns (Country, Age, Salary, Purchased), and 10 rows. Each row is called an observation. We can see that the Country and Purchased columns are categorical variables in the form of text. For the algorithm’s purposes, we have to convert these values to numbers. One option is to map each unique entry in the column to a number, for example France to 0, Germany to 1 and Spain to 2. However, these categories are nominal data with no natural order, so instead we create dummy variables: one 0/1 column per category. To do this we can make use of the pandas library’s get_dummies function. To avoid the dummy variable trap we drop the first dummy column of each variable. Let’s see this in action for better clarity.

[4]: data = pd.get_dummies(data, drop_first=True)

Let’s have a look at the data. 

[5]: data


[5]:     Age   Salary  Country_Germany  Country_Spain  Purchased_Yes
     0  44.0  72000.0                0              0              0
     1  27.0  48000.0                0              1              1
     2  30.0  54000.0                1              0              0
     3  38.0  61000.0                0              1              0
     4  40.0  61000.0                1              0              1
     5  35.0      NaN                0              0              1
     6  38.0  52000.0                0              1              0
     7  48.0  79000.0                0              0              1
     8   NaN  83000.0                1              0              0
     9  37.0  67000.0                0              0              1

As we can see, the Country and Purchased columns, which held categorical values, have been converted to numeric columns named Country_Germany, Country_Spain and Purchased_Yes (the original column name is used as a prefix). The first category of each categorical variable (France for Country and No for Purchased) got dropped.

Now that we have taken care of encoding the categorical variables, let us proceed to the next task, i.e. handling the missing values in the Age and Salary columns. We can use the SimpleImputer class of the scikit-learn library. Let’s see how to do this through code.
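
To make the effect of drop_first easier to see, here is a small sketch on the Country column alone. The four-row frame is made up for illustration. The dtype=int argument is passed because recent pandas releases may return True/False columns by default instead of the 0/1 shown in the output above.

```python
import pandas as pd

# A tiny illustrative frame (not the full data set).
country = pd.DataFrame({"Country": ["France", "Spain", "Germany", "France"]})

# One 0/1 column per category.
full = pd.get_dummies(country, dtype=int)

# drop_first=True removes the first category in sorted order (France here).
reduced = pd.get_dummies(country, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

A row with 0 in both Country_Germany and Country_Spain can only be France, so no information is lost by dropping that column.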

[6]: # Importing the library 

from sklearn.impute import SimpleImputer 

# creating an object for the SimpleImputer class 

si = SimpleImputer() 

data = si.fit_transform(data) 

Now let’s have a look at the data. Missing values are handled with a measure of central tendency, i.e. the mean, median or mode. Any one of these strategies can be used to replace the missing values; SimpleImputer uses the mean by default. The data is now in array format. It is important to keep track of the returned data type. Let’s convert it back to a data frame. The code is given below.

[7]: # Converting the array back to a data frame (Purchased_Yes is renamed to Purchased)

data = pd.DataFrame(data, columns=['Age', 'Salary', 'Country_Germany', 'Country_Spain', 'Purchased'])

data

[7]:          Age        Salary  Country_Germany  Country_Spain  Purchased
     0  44.000000  72000.000000              0.0            0.0        0.0
     1  27.000000  48000.000000              0.0            1.0        1.0
     2  30.000000  54000.000000              1.0            0.0        0.0
     3  38.000000  61000.000000              0.0            1.0        0.0
     4  40.000000  61000.000000              1.0            0.0        1.0
     5  35.000000  64111.111111              0.0            0.0        1.0
     6  38.000000  52000.000000              0.0            1.0        0.0
     7  48.000000  79000.000000              0.0            0.0        1.0
     8  37.444444  83000.000000              1.0            0.0        0.0
     9  37.000000  67000.000000              0.0            0.0        1.0
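
The choice of strategy can be compared with a short sketch. It uses only the Age and Salary values from the original table and tries the mean and median strategies side by side; the dictionary named results is just an illustrative helper, not part of the walkthrough above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Age and Salary values as shown in the original data set.
df = pd.DataFrame({
    "Age":    [44, 27, 30, 38, 40, 35, 38, 48, np.nan, 37],
    "Salary": [72000, 48000, 54000, 61000, 61000, np.nan, 52000, 79000, 83000, 67000],
})

results = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform returns a NumPy array, so the column names are reattached here.
    results[strategy] = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

for strategy, filled in results.items():
    # Row 8 had the missing Age and row 5 the missing Salary.
    print(strategy, filled.loc[8, "Age"], filled.loc[5, "Salary"])
```

Passing columns=df.columns instead of typing the names by hand keeps the column labels in sync with the input frame.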

Now that we have our data set in the format required by the learning algorithm, we can pass it to the algorithm. Before that we need to split the data into a training set and a test set, where typically more than 60 percent of the data goes to the training set and less than 40 percent to the test set. We pass the training set to the learning algorithm, use the trained model on the test set, and compare the predicted values with the actual values. This is an iterative process: we tune the hyperparameters and retrain. Let us look at the code to split the data into train and test sets. Before that, let’s separate the independent variables and the dependent variable as X and y respectively.

[11]: # Independent variables

X = data.loc[:, ['Age', 'Salary', 'Country_Germany', 'Country_Spain']]

# Dependent variable

y = data.loc[:, ['Purchased']]

[12]: X

[12]:          Age        Salary  Country_Germany  Country_Spain
      0  44.000000  72000.000000              0.0            0.0
      1  27.000000  48000.000000              0.0            1.0
      2  30.000000  54000.000000              1.0            0.0
      3  38.000000  61000.000000              0.0            1.0
      4  40.000000  61000.000000              1.0            0.0
      5  35.000000  64111.111111              0.0            0.0
      6  38.000000  52000.000000              0.0            1.0
      7  48.000000  79000.000000              0.0            0.0
      8  37.444444  83000.000000              1.0            0.0
      9  37.000000  67000.000000              0.0            0.0

[13]: y

[13]:    Purchased
      0        0.0
      1        1.0
      2        0.0
      3        0.0
      4        1.0
      5        1.0
      6        0.0
      7        1.0
      8        0.0
      9        1.0

Now X holds our independent variables and y is the dependent variable that we are interested in predicting. Let us split the data into train and test sets. The code is written below.

[15]: from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[16]: X_train

[16]:     Age        Salary  Country_Germany  Country_Spain
      5  35.0  64111.111111              0.0            0.0
      0  44.0  72000.000000              0.0            0.0
      7  48.0  79000.000000              0.0            0.0
      2  30.0  54000.000000              1.0            0.0
      9  37.0  67000.000000              0.0            0.0
      4  40.0  61000.000000              1.0            0.0
      3  38.0  61000.000000              0.0            1.0
      6  38.0  52000.000000              0.0            1.0

[17]: X_test

[17]:          Age   Salary  Country_Germany  Country_Spain
      8  37.444444  83000.0              1.0            0.0
      1  27.000000  48000.0              0.0            1.0

[18]: y_train

[18]:    Purchased
      5        1.0
      0        0.0
      7        1.0
      2        0.0
      9        1.0
      4        1.0
      3        0.0
      6        0.0


[19]: y_test

[19]:    Purchased
      8        0.0
      1        1.0

Now we can see that the data has been split into a training set and a test set. We can pass X_train and y_train to the learning algorithm, evaluate it by passing X_test, and store the algorithm’s predictions in a variable so that they can be compared with the actual values stored in y_test.

These are the various steps involved in data preprocessing.
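
As a rough sketch of that last step, the snippet below trains a classifier on the training set and compares its predictions with y_test. The post does not name a learning algorithm, so DecisionTreeClassifier is a hypothetical choice made only for illustration. The data frame is rebuilt inline from the preprocessed table so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Preprocessed data as shown in the tables above (missing values replaced by column means).
data = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, 38, 48, 337 / 9, 37],
    "Salary": [72000, 48000, 54000, 61000, 61000, 577000 / 9, 52000, 79000, 83000, 67000],
    "Country_Germany": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "Country_Spain": [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    "Purchased": [0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})

X = data.drop(columns="Purchased")
y = data["Purchased"]  # a 1-D Series, which scikit-learn estimators expect for the target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical choice of algorithm; other scikit-learn classifiers follow the same fit/predict pattern.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Store the predictions so they can be compared with the actual values.
y_pred = model.predict(X_test)
print("Predicted:", y_pred.tolist())
print("Actual:   ", y_test.tolist())
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

With only two rows in the test set, the accuracy figure says very little; the point here is the workflow of fitting on the training set and comparing predictions against held-out values.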

 

Author:

Newton Titus

SevenMentor Pvt Ltd.
