Handling Different Situations in Machine Learning

  • By Nishesh Gogia
  • March 5, 2022
  • Machine Learning


While solving machine learning problems we encounter different problems, and as we know, "the better you are at problem solving, the better Data Scientist you are".

My aim is to make readers love Machine Learning before diving deep into it. It's very easy to use difficult words and unnecessary code and make you feel overwhelmed, but that won't help you.

Machine Learning is a beautiful subject; it's a world in itself. Without loving it, without understanding what this subject asks of us and the intuition behind each topic, you won't be able to solve real-world problems.

This subject is directly connected to everything you use these days: your phone, your car, your favourite apps like Instagram, Facebook, Twitter and so on. There are problems all around, and Machine Learning is solving them.

While solving problems there are particular situations where we can get stuck; this article will give you some intuition about those situations and guide you on how to deal with them.


1. IMBALANCED DATASET VS BALANCED DATASET  –

 

I think all of us are aware of balanced and imbalanced datasets. It simply means that when we have an almost equal number of data points in each class, the dataset is called BALANCED, and when the classes are unequal in size, it is called IMBALANCED.

Let's say you are solving a binary classification problem in which you have two classes, positive and negative. To be more precise, let's take the example of an Amazon food review model: say we are building a model which predicts the nature of a review, whether it is positive or negative. Positive is, say, "1" and negative is "0".

Now let's say that in the training data we found the number of positive reviews (n1) to be 500 and the number of negative reviews (n2) to be 460. This is a case of balanced data, because n1 and n2 are almost equal.

Let's take another example: if n1 is 500 and n2 is 80, this is a case of imbalanced data, and the model will run into a problem, because logically it will be biased towards positive reviews and won't predict accurate results.
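As a quick illustration, here is a minimal sketch of checking class balance by counting labels; the list below mimics the numbers above (500 positives, 80 negatives), and the variable names are my own, not from the article:

```python
# A minimal sketch of checking class balance by counting labels.
from collections import Counter

labels = [1] * 500 + [0] * 80  # 500 positive reviews, 80 negative

counts = Counter(labels)
n1, n2 = counts[1], counts[0]
ratio = min(n1, n2) / max(n1, n2)

print(counts)                  # Counter({1: 500, 0: 80})
print(f"ratio = {ratio:.2f}")  # 0.16 -> far from 1, so imbalanced
```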


WHAT TO DO THEN? 

There are two ways to handle it; which one to choose depends entirely on the particular problem.

 

 

1. UNDERSAMPLING –

It is a very simple approach. Let's understand it with the example above: n1 = 500 and n2 = 80. If we keep n2 as it is and draw 80 random samples from n1, the data will be balanced; now both n1 and n2 are 80.

DISADVANTAGE –

There is a huge loss of information: as we can see in the above example, 420 samples were discarded.
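To make this concrete, here is a minimal sketch of random undersampling with NumPy, using the numbers above; the arrays are stand-ins for real review data, an assumption of mine rather than anything from the article:

```python
# A minimal sketch of random undersampling with NumPy
# (n1 = 500 positives, n2 = 80 negatives).
import numpy as np

rng = np.random.default_rng(seed=42)

positives = np.arange(500)  # indices of the 500 positive reviews
negatives = np.arange(80)   # indices of the 80 negative reviews

# Draw 80 random positives WITHOUT replacement so both classes match.
positives_down = rng.choice(positives, size=len(negatives), replace=False)

print(len(positives_down), len(negatives))  # 80 80
# The other 420 positive samples are discarded -- this is the
# information loss mentioned above.
```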

 

2. OVERSAMPLING  –

Let's say n1 is 500 and n2 is 100. If we repeat every point of n2 five times, there will be 500 points. It is again a simple technique: we just place more copies of the minority-class points in the dataset.

So simply by repetition we can handle the problem of imbalanced data, as the sketch below shows.
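Here is an equally minimal sketch of oversampling by repetition, using the second example above (again, the arrays are illustrative stand-ins):

```python
# A minimal sketch of oversampling by repetition (n1 = 500, n2 = 100).
import numpy as np

positives = np.arange(500)  # indices of the 500 majority-class points
negatives = np.arange(100)  # indices of the 100 minority-class points

# Repeat every minority point 5 times so both classes have 500 points.
negatives_up = np.repeat(negatives, 5)

print(len(positives), len(negatives_up))  # 500 500
```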

These are very simple and logical ways to handle imbalanced data.

Now the second situation in which a data scientist can get stuck is  

“MULTI-CLASS CLASSIFICATION”  


2. MULTI-CLASS CLASSIFICATION –

 

 

A binary classifier is one in which we have only two classes, denoted by "1" or "0". For example, if an Amazon food review is positive, the class will be 1, and if the review is negative, the class will be 0.

So let's say in the dataset D we have data points (xi, yi), where i varies from 1 to n. If yi belongs to {0, 1}, it is a binary classification problem.

If yi belongs to {0, 1, 2, 3, 4, 5, …}, it is a multi-class classification problem.

To understand the binary classification problem, think of it like this: suppose we have a black box which has been trained on our training data. Inside that black box there is a function F(x) which takes a query point Xq (for example, a review by a new customer on the Amazon website) and predicts whether it is "1" (positive) or "0" (negative).

Now in multi-class classification, for a query point Xq, F(x) can predict "1", "2", "3" and so on.

Let's assume that the data is linearly separable. We know that for binary classification we have to find a hyperplane to separate the two classes, but in multi-class classification there are more than two classes, so one hyperplane won't solve the problem; we will need multiple hyperplanes.

One very common and widely used technique in industry is "ONE VS ALL". Let me give you the intuition behind One vs All.

 

Let's say we have 3 classes: class-1, class-2 and class-3. One hyperplane cannot separate these 3 classes together, so what we can do is build 3 binary classifiers.

First classifier: Class 1 vs {Class 2 and Class 3}

Second classifier: Class 2 vs {Class 1 and Class 3}

Third classifier: Class 3 vs {Class 1 and Class 2}

When all of our binary classifiers are ready, we just take a majority vote among their decisions.
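As a concrete sketch, here is One vs All with scikit-learn's OneVsRestClassifier (assuming scikit-learn is available; the toy dataset below stands in for the class-1/2/3 example). In this implementation the final decision goes to the binary classifier with the highest confidence score, which plays the role of the vote described above:

```python
# A minimal sketch of One-vs-All using scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# 300 points in 3 roughly linearly separable classes.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# One binary logistic-regression hyperplane per class vs. the rest.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(len(ovr.estimators_))  # 3 -> one binary classifier per class
print(ovr.predict(X[:5]))    # predicted classes for 5 query points
```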


3. OVERFITTING AND UNDERFITTING –

 

 

 

Overfitting and underfitting are very common problems in Machine Learning. Let's understand each of them.

Now let's say we have a model which gives us a train accuracy of 99% and a test accuracy of 70%. This means your model is overfitting your data; in other words, the model is trying to memorize the data rather than understand it, and that is why it gives good accuracy on training but bad results the moment we move to testing.

Now let's consider a case where the train accuracy of the model is 50% and the test accuracy is 48%. This means the model is underfitting; in other words, the model is not able to capture the relationships in the data at all.
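A minimal sketch of this diagnosis, using the two accuracy pairs above (the thresholds here are common rules of thumb, my assumption rather than anything from the article):

```python
# A minimal sketch of telling the two cases apart from the
# train/test accuracy gap.
def diagnose(train_acc: float, test_acc: float) -> str:
    gap = train_acc - test_acc
    if train_acc > 0.95 and gap > 0.15:
        return "overfitting: the model is memorizing the training data"
    if train_acc < 0.60 and test_acc < 0.60:
        return "underfitting: the model misses the relationships"
    return "looks reasonable"

print(diagnose(0.99, 0.70))  # the overfitting example above
print(diagnose(0.50, 0.48))  # the underfitting example above
```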

Now, overfitting is usually reduced with regularization, cross-validation or a simpler model; removing noisy outliers from the training data also helps, using techniques like the ones below (see the sketch after this list):

  1. LOF (Local Outlier Factor)
  2. IQR (Interquartile Range)
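For instance, here is a minimal sketch of the IQR rule; the 1.5 × IQR cutoff is the usual convention, assumed here rather than stated in the article. (For LOF, scikit-learn provides sklearn.neighbors.LocalOutlierFactor.)

```python
# A minimal sketch of the IQR rule for dropping outliers: points
# outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 14, 98])  # 98 is an outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = values[(values >= lower) & (values <= upper)]
print(cleaned)  # the extreme value 98 is dropped
```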

To handle the problem of underfitting, we need a more expressive model or better features; simply feeding the model more data will not fix it.

I hope you have understood these basic situations. Thanks for reading!

Author: Nishesh Gogia
