Role of NLP in Data Science

  • By
  • August 27, 2021
  • Data Science

Modern organizations work with huge amounts of data. That data can be available in various forms, including documents, spreadsheets, audio recordings, emails, JSON, and many more. One of the most common ways in which such data is recorded is as text. That text is typically quite similar to the natural language that we use from day to day.

Natural Language Processing (NLP) is the study of programming computers to process and analyse large amounts of natural textual data. Knowledge of NLP is important for Data Scientists since text is such an easy-to-use and common container for storing data.

Faced with the task of performing analysis and building models from textual data, one must know how to perform the basic Data Science tasks. That includes cleaning, formatting, parsing, analysing, visualizing, and modelling the text data. All of these require a few extra steps in addition to the usual way these tasks are done when the data is made up of raw numbers.

This article shows the importance of NLP in Data Science. We will go through some of the most common techniques that you can use to handle your text data, with code examples using NLTK (the Natural Language Toolkit).

 

For Free, Demo classes Call: 8983120543
Registration Link: Click Here!

 

Table of contents 

Tokenization 

Stop Word Removal 

Stemming 

Lemmatization 

Sentiment Analysis

Tokenization: 

Tokenization is the process of splitting sentences into words. This is not as simple as it looks. For instance, a tokenizer that simply splits on spaces would separate the name “New York” into two tokens.

However, New York is a proper noun and could be quite important in our analysis. We would be better off keeping it as a single token. As such, care must be taken during this step.
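One way to keep a multi-word name such as “New York” together is NLTK's MWETokenizer, which re-joins listed word sequences after an initial split. The sketch below is only an illustration: the sample sentence and the one-entry phrase list are made up, and it assumes a plain whitespace split is good enough for that sentence.

```python
from nltk.tokenize import MWETokenizer

# Multi-word expressions to keep as single tokens (illustrative list).
mwe_tokenizer = MWETokenizer([("New", "York")], separator=" ")

sentence = "I moved to New York last year"  # made-up sample sentence

# A plain whitespace split breaks the city name into two tokens.
naive_tokens = sentence.split()

# MWETokenizer merges any listed sequence it finds in the token list.
merged_tokens = mwe_tokenizer.tokenize(naive_tokens)

print(naive_tokens)
print(merged_tokens)
```

In a real project the phrase list would have to come from somewhere, for example a gazetteer of place names, which is part of why this step needs care.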

The main advantage of tokenization is that it converts the text into a format that is easier to convert to raw numbers, which can actually be used for processing. It is a natural first step when analysing text data.

Let us consider a simple example.

 

Code: 

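import nltk

# word_tokenize needs the Punkt tokenizer models; download them once with
# nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab').
sentence = "My name is Aniket and I love NLP"

tokens = nltk.word_tokenize(sentence)

print(tokens)

Output: ['My', 'name', 'is', 'Aniket', 'and', 'I', 'love', 'NLP']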
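To make the “raw numbers” point concrete, here is a minimal sketch of turning a token list into integer ids and word counts using only the standard library. The token list is written out by hand (a made-up sentence with repeated words), and the id scheme, order of first appearance, is just one simple assumption among many possible encodings.

```python
from collections import Counter

# Tokens as a tokenizer might return them (hand-written sample).
tokens = ["I", "love", "NLP", "and", "I", "love", "data"]

# Give each distinct token an integer id in order of first appearance.
vocabulary = {}
for token in tokens:
    vocabulary.setdefault(token, len(vocabulary))

# The sentence as a sequence of ids.
token_ids = [vocabulary[token] for token in tokens]

# The sentence as a count vector, one position per vocabulary entry.
counts = Counter(tokens)
count_vector = [counts[token] for token in vocabulary]

print(vocabulary)
print(token_ids)
print(count_vector)
```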

Stop Word Removal:

After tokenization, we remove stop words. Stop word removal has a similar goal to tokenization: to convert the text data into a format that is more suitable for processing. In this case, stop word removal removes common function words such as “and”, “the”, “a”, and so on in English. This way, when we analyse our data, we will be able to cut through the noise and focus on the words that carry actual real-world meaning.

Stop word removal can be done easily by removing words that are in a pre-defined list. An important thing to note is that there is no universal list of stop words. As such, the list is often created from scratch and tailored to the application being worked on.

Code: 

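import nltk
from nltk.corpus import stopwords

# Requires the 'stopwords' corpus and the tokenizer models; download them
# once with nltk.download('stopwords') and nltk.download('punkt').
sentence = "This is a sentence for removing stop words"

tokens = nltk.word_tokenize(sentence)

stop_words = stopwords.words('english')

# The NLTK list is lowercase and this comparison is case-sensitive,
# so the capitalized token 'This' is kept.
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)

Output:

['This', 'sentence', 'removing', 'stop', 'words']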
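Because there is no universal stop word list, a project-specific list can be as simple as a Python set. The sketch below uses a small made-up list (CUSTOM_STOP_WORDS and remove_stop_words are illustrative names, not part of NLTK) and lowercases each token before the lookup, so that capitalized words such as “This” are filtered too.

```python
# Hypothetical, application-specific stop word list.
CUSTOM_STOP_WORDS = {"this", "is", "a", "for", "and", "the"}


def remove_stop_words(tokens, stop_words=CUSTOM_STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Whitespace split is enough for this punctuation-free sample sentence.
tokens = "This is a sentence for removing stop words".split()
filtered_tokens = remove_stop_words(tokens)

print(filtered_tokens)
```

Using a set rather than a list keeps each membership check fast, which matters once the corpus grows.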

 


Stemming: 

Stemming is another technique for cleaning up text data for processing. Stemming is the process of reducing words to their root form. The purpose of this is to reduce words that are spelled slightly differently due to context, but have the same meaning, to the same token for processing. For instance, consider the word “cook” in a sentence. There are quite a lot of ways we can write the word “cook”, depending on the context:

cook → cook

cooks → cook

cooked → cook

cooking → cook

In the above example, the common root word is “cook”.

All of these different forms of the word “cook” have essentially the same definition. So, ideally, when we are doing our analysis, we would want them all to be mapped to the same token. In this case, we mapped all of them to the token for the word “cook”.

Code: 

import nltk

snowball_stemmer = nltk.stem.SnowballStemmer('english')

# Stem four different forms of the same word.
s_1 = snowball_stemmer.stem("cook")
s_2 = snowball_stemmer.stem("cooks")
s_3 = snowball_stemmer.stem("cooked")
s_4 = snowball_stemmer.stem("cooking")

Output:

s_1, s_2, s_3, s_4 all have the same result, i.e. "cook".
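In an analysis, the practical effect of stemming is that different surface forms end up under one key. The following sketch groups a hand-picked word list by Snowball stem; the word list and the `groups` dictionary are illustrative choices, not a fixed recipe.

```python
from collections import defaultdict

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hand-picked sample words (illustrative only).
words = ["cook", "cooks", "cooked", "cooking", "plays", "played", "playing"]

# Map each stem to the list of original word forms that reduce to it.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, forms in groups.items():
    print(stem, "<-", forms)
```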

Lemmatization: 

Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item.

Lemmatization is similar to stemming, but it brings context to the words, so it links words with similar meaning to one word. One major difference from stemming is that lemmatize takes a part-of-speech parameter, “pos”. If it is not supplied, the default is “noun”.

 

 


Code: 

from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus; download it once with nltk.download('wordnet').
lemmatizer = WordNetLemmatizer()

# Verb
print(lemmatizer.lemmatize('playing', pos="v"))

Output: play

# Noun
print(lemmatizer.lemmatize('playing', pos="n"))

Output: playing

# Adjective
print(lemmatizer.lemmatize('playing', pos="a"))

Output: playing

# Adverb
print(lemmatizer.lemmatize('playing', pos="r"))

Output: playing

Which one is better: Stemming or Lemmatization?

Stemming works on words without knowing their context, which is why stemming has lower accuracy but is faster than lemmatization. Lemmatizing is generally considered better than stemming. Lemmatizing returns a real word even if it is not the same word; it might be a synonym, but at least it is a real word. Sometimes you do not care about this level of accuracy and all you need is speed; in that case, stemming is the better choice.
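The “real word” point can be seen from the stemming side with a one-word experiment. The sketch below only runs the Snowball stemmer, since it needs no extra data downloads; the word “studies” is an arbitrary choice. A WordNet lemmatizer would be expected to return the dictionary form “study” for the same input, though that is not executed here.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# The stemmer applies suffix rules only, with no dictionary lookup,
# so the result does not have to be a real English word.
stem = stemmer.stem("studies")

print(stem)  # prints: studi
```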

Sentiment Analysis: 

Sentiment analysis is a broad range of subjective analysis that uses natural language processing techniques to perform tasks such as identifying the sentiment of a customer review, detecting positive or negative feeling in a sentence, or judging mood via voice analysis or transcription analysis.

Example: 

“I did not like the chocolate milk-shake” describes a negative experience of the milk-shake.

“I did not hate the chocolate milk-shake” may be considered a neutral experience.
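The two example sentences show why negation matters. Below is a toy, lexicon-based sketch that reproduces those two labels. Everything in it is an assumption made for illustration: the tiny LEXICON, the NEGATIONS set, and the rule that a negated positive word counts as negative while a negated negative word counts as neutral. Real sentiment tools use far larger lexicons or trained models.

```python
# Toy sentiment lexicon (hypothetical values, for this example only).
LEXICON = {"like": 1, "love": 1, "hate": -1, "dislike": -1}
NEGATIONS = {"not", "never", "no"}


def classify(sentence):
    """Label a sentence as positive, negative, or neutral."""
    score = 0
    negated = False
    for word in sentence.lower().split():
        if word in NEGATIONS:
            negated = True  # affects the next sentiment word only
            continue
        if word in LEXICON:
            value = LEXICON[word]
            if negated:
                # Assumed rule: "not like" reads as negative,
                # "not hate" reads as neutral rather than positive.
                value = -1 if value > 0 else 0
                negated = False
            score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I did not like the chocolate milk-shake"))
print(classify("I did not hate the chocolate milk-shake"))
```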


Some approaches used in Sentiment Analysis:

  1. Named entity recognition (NER): It involves determining the parts of a text that can be identified and categorized into pre-set groups. Examples of such groups include names of people and names of places.
  2. Word sense disambiguation: It involves giving meaning to a word based on the context.
  3. Natural language generation: It involves using databases to derive semantic intentions and convert them into human-readable language.

Why is NLP difficult? 

Natural Language Processing is considered a difficult problem in computer science. It is the nature of human language that makes NLP difficult.

The rules that govern the passing of information using natural languages are not easy for computers to understand.

Some of these rules can be high-level and abstract; for instance, when someone uses a sarcastic remark to pass information. On the other hand, some of these rules can be low-level; for instance, using the character “s” to signify the plurality of items.
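Even the low-level “s” rule is harder than it looks. As a small illustration, the sketch below applies a deliberately naive rule, strip a trailing “s”, to a few hand-picked words; naive_singular is a hypothetical helper written for this example, not a library function.

```python
def naive_singular(word):
    """Strip one trailing "s"; a deliberately simplistic plural rule."""
    return word[:-1] if word.endswith("s") else word


# The rule works for regular plurals but mangles words that merely end in "s".
for word in ["cooks", "tokens", "bus", "analysis"]:
    print(word, "->", naive_singular(word))
```

The first two words come out right, while “bus” and “analysis” are cut down to non-words, which is the kind of exception a computer has to be taught explicitly.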

 

Author:

Thorave, Aniket

Company:  Seven Mentor Pvt. Ltd.

