The main motivation behind doing this project is to understand the different word embedding techniques.
In this project, a predictive model for sentiment analysis is developed. Sentiment is classified into three categories: 1) positive, 2) neutral and 3) negative. The dataset for the project is taken from Yelp.
In this notebook, the whole process from feature engineering to the final model is shown step by step. Four feature extraction techniques, 1) CountVectorizer, 2) TF-IDF Vectorizer, 3) HashingVectorizer and 4) Word2Vec, are applied to build the sentiment analysis model. The final model (the best one) is deployed on Heroku. You can play around with the model.
In addition, positive words are removed from the negative class and negative words are removed from the positive class to increase the polarity (discriminative power) between the classes. This gives higher prediction accuracy.
1. Data
1.1 Load Data
1.2 Data Overview
2. Feature Engineering
2.1 Text Pre-processing
2.2 Label Dataset
2.3 Exploratory Data Analysis(EDA)
2.3.1 Class distribution
2.3.2 Generate Word Cloud
2.3.3 Word Cloud in Each Class
2.3.4 Most Frequent Word
2.4 Handle Imbalanced Data
2.5 Handle Duplicate and Missing Data
2.6 Remove Negative and Positive Words
3. Modeling
3.1 Hyperparameter Tuning
3.2 Feature Extraction
3.2.1. Count Vectorizer
3.2.2. TF-IDF
3.2.3. HashingVectorizer
3.2.4 Word2Vec
3.2.5 Doc2Vec
# basic libraries
import re
import json
import pandas as pd
import numpy as np
import joblib
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import pickle as pkl
from collections import Counter
from pprint import pprint
#Nltk for text processing
from langdetect import detect
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, words, wordnet
from wordcloud import WordCloud, STOPWORDS
from nltk.stem import WordNetLemmatizer, PorterStemmer
#sklearn for modeling
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from yellowbrick.classifier import ConfusionMatrix
from sklearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline, Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import PCA
#gensim
from gensim.models import Word2Vec
from keras.preprocessing.text import Tokenizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
#load json data
data = []
with open('review.json', 'r') as file:
    for line in file: #one JSON object per line
        data.append(json.loads(line))
#convert into pandas dataframe
df = pd.DataFrame(data)
df.head()
We can see that there are over 6 million documents and 9 columns. We will use only two columns, 'text' and 'stars', for our purpose.
#check shape
print ("Rows : " ,df.shape[0])
print ("Columns : " ,df.shape[1])
#takes only review_text and stars(rating) columns
text_data = df[['text','stars']].copy() #copy so that new columns can be added safely
text_data.head(3)
The following steps are applied for text processing:
1. Convert to lower case
2. Expand contractions, for example: 'can't' --> 'can not'
3. Remove punctuation and any other non-alphabetic characters
4. Build customized stopwords, and remove stop words and words with fewer than 3 letters
5. Keep only nouns, adverbs and adjectives
6. Lemmatize based on POS tag
## list of contractions words
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "can not", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": 
"we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}
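#regex pattern for contraction words; longer keys come first so that "i'd've" is tried before "i'd"
contractions_re = re.compile('(%s)' % '|'.join(re.escape(key) for key in sorted(contraction_dict, key=len, reverse=True)))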
len(contraction_dict)
def replace_contractions(text, contractions):
    """
    Function to expand contraction words into two words (e.g. 'can't' -->> 'can not')
    Parameters:
        text(str): text to expand contraction words
        contractions(dict): contraction words in dict type
    Returns:
        Text with expanded contraction words
    """
    def replace(match_object):
        """
        Parameters:
            match_object: matched words in regex pattern 'contractions_re'
        Return:
            values for matched words
        """
        return contractions[match_object.group(0)] # get dict values
    return contractions_re.sub(replace, text) # sub with 'replace'
#find wordnet POS-tagging
def get_wordnet_pos(pos_tag):
    """
    Parameters:
        pos_tag: Word POS tag
    Returns:
        wordnet pos tag (for example, 'n', 'r', 'a')
    """
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
#Basic text preprocessing
def text_preprocess(text):
    """
    Function to clean documents
    Parameter:
        text(str): text for preprocessing
    Returns:
        The clean processed text
    """
    #1. convert words to lower case
    text = text.lower()
    #2. replace contraction words using function - replace_contractions()
    text = replace_contractions(text, contraction_dict)
    #3. replace non-alphabetic characters with spaces
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.split()
    #4. remove stop words and words with fewer than 3 letters
    stops_word = set(stopwords.words("english"))
    text = [w for w in text if w in ['no', 'not', 'too', 'so'] or (w not in stops_word and len(w) >= 3)] #keep ['no', 'not', 'too', 'so']
    #5. take only nouns, adjectives and adverbs using POS tags
    noun_adj_adv = ['NN','NNS','NNP','NNPS', 'JJ','JJR','JJS','RB','RBR', 'RBS']
    #each word is tagged on its own, without sentence context
    tagged = [(word, pos_tag([word])[0][1]) for word in text]
    #6. lemmatize words based on their POS tags
    lema = WordNetLemmatizer()
    lema_words = [lema.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in tagged if tag in noun_adj_adv]
    text = " ".join(lema_words) # return string
    return text
%%time
# apply text_preprocess() function to column 'text'
text_data['clean_text'] = text_data.loc[:,'text'].apply(text_preprocess)
text_data.head()
I have excluded ratings 4 and 2 from the data. This is done so that each label has clearly discriminative (polar) words. Sentiment is labeled in the following way:
1. Positive sentiment = 5-star rating
2. Neutral sentiment = 3-star rating
3. Negative sentiment = 1-star rating
#Exclude the rows having stars rating of 4.0 and 2.0
text_data = text_data.loc[text_data['stars'].isin([1.0, 3.0, 5.0 ])]
# Map stars with sentiments (1.0:'negative', 5.0:'positive', 3.0 :'neutral')
pd.options.mode.chained_assignment = None # suppress SettingWithCopyWarning
text_data['sentiment'] = text_data['stars'].map({1.0:'negative', 5.0:'positive', 3.0 :'neutral' })
text_data.head()
# check values counts based on labels
text_data.sentiment.value_counts()
#plot barchart of class
plt.bar(text_data.sentiment.value_counts().index, text_data.sentiment.value_counts().values )
plt.xlabel('Sentiment review')
plt.ylabel('Review Count')
plt.title('Barchart')
plt.xticks(rotation=45)
plt.show()
# By using panda directly
#text_data.groupby('sentiment').count()['stars'].plot.bar()
We can see that the classes are highly imbalanced. Positive reviews have the highest count, almost double that of the other two classes. We have to balance the classes before fitting the model.
# define function to generate word cloud
def generate_wordcloud(text, title=None):
    """
    Function to generate Word-Cloud image
    Parameters:
        text(str): collection of words to plot
        title(str): title of figure
    Returns:
        None (the word cloud image is drawn on a new figure)
    """
    cloud = WordCloud( stopwords= [x for x in list(STOPWORDS) if x not in ['not', 'so']],
                       background_color="white", scale=2,
                       collocations=False).generate(str(text))
    plt.figure( figsize=(12,10)) # set size of figure
    # if title is given use it
    if title:
        plt.title(title, fontdict={'size': 20,
                                   'verticalalignment': 'bottom'})
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout()
#call generate_wordcloud function
corpus = ' '.join(text_data['clean_text'])
generate_wordcloud(corpus, 'Most common words in whole Reviews')
# separate sentiment(class) and show in word cloud
negative_sentiment = ' '.join( review for review in text_data.loc[text_data['sentiment'] == 'negative']['text'])
postive_sentiment = ' '.join( review for review in text_data.loc[text_data['sentiment'] == 'positive']['text'])
neutral_sentiment = ' '.join( review for review in text_data.loc[text_data['sentiment'] == 'neutral']['text'])
%%time
#show the reviews of each class in a word cloud
titles = ['Most common words in Negative Reviews', 'Most common words in Positive Reviews', 'Most common words in Neutral Reviews']
words_sentiments = [negative_sentiment, postive_sentiment, neutral_sentiment]
for title, words_sent in zip(titles, words_sentiments):
    #ng_token = [ word for word in words_sent.split()]
    generate_wordcloud(words_sent, title)
    print()
    print()
# store words as dictionary keys, and their counts as values
vocab = Counter()
for text in text_data.loc[:, 'clean_text']:
    vocab.update(text.split())
#show in bar chart
most_freq_25 = vocab.most_common(25) #25 most frequent words
df_most_feq = pd.DataFrame(most_freq_25, columns = ['Word', 'Count'])
df_most_feq.plot.bar(x='Word', y ='Count', figsize=(10,5), title='Most 25 Frequent Words in Corpus')
plt.show()
There are many ways to handle an imbalanced dataset, for example undersampling the majority class, oversampling the minority class, or designing a cost function that penalizes misclassification of the minority class more than misclassification of the majority class. For this work, we use RandomUnderSampler() from imbalanced-learn (imblearn), which randomly removes samples from the majority classes. The drawback of undersampling is loss of information, since some samples are removed, but we do not have to worry much here because enough samples remain for the algorithms to learn from.
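To make the idea concrete, here is a minimal plain-Python sketch of random undersampling. It is not how imbalanced-learn implements it; the helper name `random_undersample` and the toy reviews are made up for illustration. Every class is cut down to the size of the smallest class by sampling without replacement.

```python
import random
from collections import Counter, defaultdict

def random_undersample(samples, labels, seed=2):
    """Randomly drop samples so every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    # size of the minority class
    target = min(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # sample without replacement within each class
        for sample in rng.sample(items, target):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# toy data, made up for illustration
reviews = ["great", "lovely", "superb", "tasty", "okay", "fine", "awful"]
sentiments = ["positive"] * 4 + ["neutral"] * 2 + ["negative"]
balanced_reviews, balanced_sentiments = random_undersample(reviews, sentiments)
print(Counter(balanced_sentiments))
```

After resampling, each class contributes the same number of documents, at the cost of discarding some majority-class reviews.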
#resample all classes except the minority class, without replacement
under_sample = RandomUnderSampler(sampling_strategy='not minority',random_state=2, replacement=False)
X_res,y_res = under_sample.fit_resample(text_data[['clean_text', 'stars', 'sentiment']], text_data['sentiment'])
Counter(y_res)
np.asarray(X_res)[1]
#show in pandas dataframe (np.asarray handles both array and DataFrame output of fit_resample)
final_data = pd.DataFrame(np.asarray(X_res), columns=['text','star','sentiment'])
final_data.head()
#check data types
final_data.dtypes
#change column 'star' in to 'int' type
final_data['star'] = final_data['star'].astype(int)
final_data.dtypes
#drop duplicates rows
final_data= final_data.drop_duplicates(['text'], keep= 'first')
final_data.shape
#count remaining duplicates based on 'text' column (should be 0 after dropping)
final_data.duplicated(['text'], keep='first').sum()
Note: Since the dataset has a huge number of documents, parameter tuning and modeling take a lot of time, so I decided to take just 30K documents from each class to keep the computation affordable.
#take only 30k from each class (the hard-coded offsets assume that rows are grouped by class)
negative = final_data.iloc[:30000] #negative class
neutral = final_data.iloc[739280:739280+30000] #neutral class
positive = final_data.iloc[-30000:] #positive class
final_data_le= pd.concat([negative,neutral,positive], ignore_index=True) #concatenate
final_data_shuffle= final_data_le.sample(frac=1) #shuffle
The positive and negative words are removed from the opposite classes to increase the polarity between the classes.
Note: The negative and positive words are collected from here.
#remove negative and positive words
def remove_pos_neg_word(text, star, negative_word, positive_word):
    """
    Function to remove
    1. negative words from positive class,
    2. positive words from negative class,
    3. positive and negative words from neutral class
    Parameter:
        text(str): document (row in pandas dataframe)
        star(int): 1 or 3 or 5
        negative_word(list): list of negative words
        positive_word(list): list of positive words
    Returns:
        The text with the opposite-polarity words removed
    """
    #split text into words
    split_text = text.split()
    #negative class: remove positive words
    if star ==1:
        text= ' '.join([word for word in split_text if word not in positive_word])
        return text
    #neutral class: remove positive and negative words
    if star ==3:
        text= ' '.join([word for word in split_text if word not in negative_word and word not in positive_word])
        return text
    #positive class: remove negative words
    if star ==5:
        text= ' '.join([word for word in split_text if word not in negative_word])
        return text
#positive and negative words
word_neag = pd.read_csv('negative-words.txt', names=['Neg_word'],header=None)
word_neag_list = list(word_neag.Neg_word)
word_posi = pd.read_csv('positive-words.txt', names=['Pos_word'],header=None)
word_posi_list = list(word_posi.Pos_word)
final_data_shuffle['new_text'] = final_data_shuffle.apply(lambda row: remove_pos_neg_word(row['text'], row['star'],word_neag_list,word_posi_list), axis=1)#remove words
final_data_shuffle.head()
# split train and test dataset with stratify fashion
X_train, X_test, y_train, y_test = train_test_split(final_data_shuffle['new_text'],
final_data_shuffle['star'],
test_size=0.2,
shuffle= True,
stratify= final_data_shuffle['star'],
random_state=37)
Two classifiers are used: 1. MultinomialNB and 2. Logistic Regression
# define function to save model to disk
def save_file(model, file_name):
    '''
    model = file/model to save
    file_name = name given to model/file
    '''
    with open(file_name, 'wb') as file: #file is closed automatically
        pkl.dump(model, file)
GridSearchCV with cross-validation (cv=2 in the code below) is used to find the best parameters of the algorithms. The helper function grid_param_tune() is defined for this purpose. The scoring metric is the default one (accuracy). In addition, model evaluation metrics such as precision, recall and F1 score are shown, and the confusion matrix is displayed.
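As a reminder of how these metrics relate to the confusion matrix, here is a small worked sketch in plain Python. The counts in `confusion` are made up for illustration and are not results from this notebook; the helper `per_class_metrics` is hypothetical.

```python
# rows = true labels, columns = predicted labels (made-up counts)
labels = ["negative", "neutral", "positive"]
confusion = [
    [50, 8, 2],
    [10, 40, 10],
    [5, 5, 50],
]

def per_class_metrics(matrix, class_index):
    """Precision, recall and F1 for one class of a confusion matrix."""
    true_positive = matrix[class_index][class_index]
    predicted_as_class = sum(row[class_index] for row in matrix)  # column sum
    actually_class = sum(matrix[class_index])                     # row sum
    precision = true_positive / predicted_as_class
    recall = true_positive / actually_class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# accuracy: correctly classified documents (diagonal) over all documents
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

for i, label in enumerate(labels):
    p, r, f = per_class_metrics(confusion, i)
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Accuracy summarizes all classes in one number, while precision and recall show which class is being confused with which.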
#parameter tuning with grid search
def grid_param_tune( clf, clf_parameter, X_train, X_test, y_train, y_test, vect=None, vect_parameter=None):
    '''
    Function to find the best parameters
    Parameters:
        clf: algorithm (classifier)
        clf_parameter: parameter grid for the classifier
        X_train: train set
        X_test: test set
        y_train: train label
        y_test: test label
        vect: vectorizer technique (such as TF-IDF, CountVectorizer)
        vect_parameter: parameter grid for the vectorizer
    Returns:
        The fitted grid search object. The best parameters, the test score from best_estimator_,
        the classification report and the confusion matrix on test data are printed.
    '''
    #join parameters together
    parameter = {}
    #check vect and vect_parameter
    if vect is None and vect_parameter is None:
        #only classifier parameters
        parameter.update(clf_parameter)
        pipeline = Pipeline([('clf', clf)])
    else:
        #vectorizer and classifier parameters together
        parameter.update(vect_parameter)
        parameter.update(clf_parameter)
        # set of steps
        pipeline = Pipeline([('vect', vect),
                             ('clf', clf)])
    #set grid search
    grid_para = GridSearchCV(estimator= pipeline,
                             param_grid= parameter,
                             cv = 2)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps]) #show steps
    print("parameters:")
    pprint(parameter) # pretty print dictionary
    print()
    grid_para.fit(X_train, y_train) # fit dataset
    #best cv score
    print(f'Best cv score: {grid_para.best_score_:.3f}')
    print()
    #show best parameters
    print('Best Parameter set:')
    best_parameters= grid_para.best_estimator_.get_params()
    for param_name in sorted(parameter.keys()): #iterates over parameters
        print(f'\t{param_name}: {best_parameters[param_name]}') # takes best parameters
    #test score from best_estimator_
    print(f'Score on test data with best parameter: {grid_para.best_estimator_.score(X_test,y_test):.3f}')
    print()
    #classification report
    print('Classification Reports on Test Data:')
    predict= grid_para.best_estimator_.predict(X_test)
    print(classification_report(y_test,predict))
    #confusion matrix
    con_matrix = confusion_matrix(y_test, predict, labels=[1,3,5]) # order: negative, neutral, positive
    column = index =['Negative', 'Neutral', 'Positive'] #for columns and index
    cm_df = pd.DataFrame(con_matrix, index=index, columns=column) # pandas dataframe
    # create figure for confusion matrix
    fig, ax = plt.subplots(figsize=(7,5))
    sns.heatmap(cm_df, annot=True, fmt="d") # show count in each cell
    plt.ylabel('True Label') #ylabel
    plt.xlabel('Predicted Label') #xlabel
    plt.title('Confusion Matrix') #title
    plt.show()
    return grid_para
Since machine learning algorithms cannot read text directly, we have to convert it into numbers. This process is known as word embedding (or, more generally, feature extraction). For this project, we explore the following techniques: 1) Count Vectorizer, 2) TF-IDF, 3) HashingVectorizer, 4) Word2Vec and 5) Doc2Vec.
In the count vector method, the N unique tokens (uni/bi/tri-grams) extracted from the corpus form the vocabulary. Then, for each document, the frequency of every vocabulary token is counted. Words in a document that do not occur in the vocabulary are ignored, while vocabulary tokens that are not found in the document are assigned a count of zero. This makes the feature vectors sparse. The length of each feature vector equals the size of the vocabulary. Below is an example:
Let's consider the corpus: ['This is the first document.', 'This document is the second document.', 'And this is the third one.']
Unique tokens/Vocabulary: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Document_1: [0 1 1 1 0 0 1 0 1]
Document_2: [0 2 0 1 0 1 1 0 1]
Document_3: [1 0 0 1 1 0 1 1 1]
Here, you can see that the frequency count of each word in a document is placed at the position of that word in the vocabulary.
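The example above can be reproduced with a few lines of plain Python. This is only a sketch of the counting idea, not scikit-learn's CountVectorizer implementation; the helpers `tokenize` and `count_vector` are made up for illustration.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

def tokenize(document):
    # lower-case and keep alphabetic tokens only
    return re.findall(r"[a-z]+", document.lower())

# vocabulary: sorted unique tokens across the whole corpus
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def count_vector(document):
    # a Counter returns 0 for tokens that are missing from the document
    counts = Counter(tokenize(document))
    return [counts[token] for token in vocabulary]

vectors = [count_vector(doc) for doc in corpus]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary token, so its length always equals the vocabulary size regardless of how long the document is.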
#set parameter for count vectorizer
cvect_para = {
    'vect__max_df': (0.95, 0.98, 0.99),
    'vect__ngram_range':((1, 2),(2,2)),
    'vect__min_df': (0.005, 0.0025, 0.01)
}
#CountVectorizer
vectorizer = CountVectorizer( analyzer='word')
%%time
#smoothing parameter
para_NB= {
    'clf__alpha': (0.75, 0.90)
}
# MultinomialNB
clf_NB= MultinomialNB()
#call grid_param_tune() function
best_NB_pam = grid_param_tune(clf_NB, para_NB, X_train, X_test, y_train, y_test, vectorizer, cvect_para)
We have the best parameters, so let's use them to build the final model of this algorithm for predicting new documents.
%%time
#use best parameter
best_vectorizer = CountVectorizer( max_df= 0.95, min_df= 0.0025, ngram_range= (1, 2),analyzer='word')
countVec_matrix = best_vectorizer.fit_transform(final_data_shuffle['new_text'])
#final MultinomialNB model with CountVectorizer
CountVec_mul_final = MultinomialNB( alpha= 0.75)
CountVec_mul_final.fit(countVec_matrix , final_data_shuffle['star'])
#new document to predict
final_data.iloc[30005].text
#convert new document into matrix
test_countVec_matrix = best_vectorizer.transform([final_data.iloc[30005].text])
#predict new document
CountVec_mul_final.predict(test_countVec_matrix)
%%time
#set parameter (inverse regularization strength)
para_log={
    'clf__C': (0.25, 0.5, 1.0)
}
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg')
best_logic_param= grid_param_tune( clf_log, para_log, X_train, X_test, y_train, y_test,vectorizer, cvect_para)
#final logistic model with CountVectorizer
CountVec_log_best= LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg', C=0.5)
CountVec_log_best.fit(countVec_matrix , final_data_shuffle['star']) #fit with data
final_data.iloc[30000].text
#convert new document into matrix
test_countVec_matrix = best_vectorizer.transform([final_data.iloc[30000].text])
#predict new document
CountVec_log_best.predict(test_countVec_matrix)
It looks like the model does not predict this new document correctly, since it should be negative (1).
#save fitted count vectorizer
save_file(best_vectorizer, 'Models/Count_Vectorize_final')
The downside of the count vectorizer is that it considers word frequency only within a single document. Thus, common words that occur frequently in every document, for example 'is', 'the', 'a', etc., receive high weights even though they carry no discriminatory information. To address this issue, TF-IDF takes into account the occurrence of a word in a single document as well as in the entire corpus.
TF-IDF down-weights the common words occurring in almost all documents and gives more importance to words that appear in only a subset of documents. TF-IDF is calculated by multiplying two terms, TF and IDF.
TF(Term Frequency): (Number of times term t appears in a document)/(Number of terms in the document)
IDF(Inverse Document Frequency): log(N/n), where, N is the total number of documents and n is the number of documents a term t has appeared in.
An example is demonstrated below (using base-10 logarithms):
Let's consider the corpus: ['This is the first document', 'This document is the second document', 'And this is the third one']
TF-IDF(this, Document_1) = TF × IDF = (1/5) × log(3/3) = (1/5) × 0 = 0
TF-IDF(first, Document_1) = TF × IDF = (1/5) × log(3/1) = (1/5) × 0.477 = 0.095
From the example, we can see that the word 'this' is heavily penalized since it appears in every document, while the word 'first' receives some weight since it appears only in the first document.
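The arithmetic above can be checked with a short plain-Python sketch of the textbook formula. The function names are made up for illustration. Note that scikit-learn's TfidfVectorizer by default uses a smoothed IDF with the natural logarithm and L2-normalizes each document vector, so its numbers will differ from this hand calculation.

```python
import math

corpus = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
]
documents = [doc.lower().split() for doc in corpus]

def term_frequency(term, document):
    # occurrences of the term divided by the number of terms in the document
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    # log10(N / n); assumes the term occurs in at least one document
    containing = sum(1 for document in documents if term in document)
    return math.log10(len(documents) / containing)

def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

print(round(tf_idf("this", documents[0], documents), 3))
print(round(tf_idf("first", documents[0], documents), 3))
```

The word that appears in every document gets a weight of zero, while the word unique to the first document keeps a positive weight.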
#set parameter for TF-IDF vectorizer (same grid as for the count vectorizer)
cvect_para = {
    'vect__max_df': (0.95, 0.98, 0.99),
    'vect__ngram_range':((1, 2),(2,2)),
    'vect__min_df': (0.005, 0.0025, 0.01)
}
#tf-idf
tf_idf_vect = TfidfVectorizer(analyzer='word')
%%time
#MultinomialNB
best_NB_pam_tfidf = grid_param_tune( clf_NB, para_NB, X_train, X_test, y_train, y_test, tf_idf_vect, cvect_para)
%%time
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg')
best_logic_param_tfidf= grid_param_tune( clf_log, para_log, X_train, X_test, y_train, y_test, tf_idf_vect, cvect_para)
%%time
tfidf_vectorizer = TfidfVectorizer( max_df= 0.95, min_df= 0.0025, ngram_range= (1, 2),analyzer='word')
tfidf_matrix = tfidf_vectorizer.fit_transform(final_data_shuffle['new_text'])
#final logistic model with TF-IDF
clf_log_final = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg', C= 1.0)
clf_log_final.fit(tfidf_matrix, final_data_shuffle['star'])
#new document
final_data_shuffle.iloc[33363].text
#predict new document
clf_log_final.predict(tfidf_vectorizer.transform([final_data_shuffle.iloc[33363].text]))
The issue with CountVectorizer and TF-IDF is that both build a large vocabulary, which results in high-dimensional features, large memory requirements and slower algorithms. In the HashingVectorizer method, the hashing trick is used to map words directly to integer feature indices. No vocabulary needs to be stored, unlike with CountVectorizer and TF-IDF; instead, a fixed-length vector of arbitrary size is used. Thus, it is more memory efficient, but the downside is that once the text has been vectorized, the original words can no longer be retrieved from the features.
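The hashing trick itself can be sketched in a few lines of plain Python. This is an illustration only: scikit-learn's HashingVectorizer uses a different hash function (MurmurHash3) and signed values, whereas this sketch uses MD5 from the standard library, and `hashed_count_vector` is a made-up helper.

```python
import hashlib

def hashed_count_vector(text, n_features=16):
    """Map each token to a bucket index with a hash function and count per bucket."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # bucket index for this token
        vector[index] += 1  # different tokens may collide in the same bucket
    return vector

first = hashed_count_vector("the food was good")
second = hashed_count_vector("never seen words still fit the same vector size")
print(len(first), len(second))
print(sum(first), sum(second))
```

No vocabulary is stored, so any new text maps to a vector of the same length. The price is that several words can share one bucket and the mapping cannot be inverted.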
#set parameter for HashingVectorizer
hasVec_para = {
    'vect__ngram_range': [(1,1),(1,2)],
    'vect__n_features': [2 ** x for x in (10, 13, 15)],
}
#hashingVectorizer
has_vec = HashingVectorizer(analyzer = 'word')
%%time
#logisticRegression
best_hasing_vec = grid_param_tune( clf_log, para_log, X_train, X_test, y_train, y_test, has_vec, hasVec_para)
#hashing model with best parameter
has_vec_best = HashingVectorizer(n_features=32768,ngram_range=(1, 1), analyzer = 'word')
hasing_matrix = has_vec_best.fit_transform(final_data_shuffle['new_text']) #fit all data
#final logistic model with hashing model
clf_log_final_hasing = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg', C= 1.0)
clf_log_final_hasing.fit(hasing_matrix, final_data_shuffle['star'])
#predict new document
clf_log_final_hasing.predict(has_vec_best.transform([final_data_shuffle.iloc[30001].text]))
The hashing model correctly predicts this new document.
#save model
save_file(has_vec_best, 'Models/hasing_vec_best') #hashing
save_file(clf_log_final_hasing, 'Models/logistic_final_hasing') #logistic with hashing
This technique uses neural networks to convert words into vectors in such a way that semantically similar words are close to each other in N-dimensional space, where N is the dimension of the vectors.
The Word2Vec model comes in two variants: 1) the Skip-Gram model and 2) the Continuous Bag of Words (CBOW) model. In the Skip-Gram model, the context words are predicted from the base word. For example, given the sentence "I love to dance in the rain", the Skip-Gram model will predict "love" and "dance" given the word "to" as input.
In contrast, the CBOW model will predict "to" if the context words "love" and "dance" are fed as input to the model. The model learns these relationships using neural networks.
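To illustrate the difference in training examples, the sketch below builds the (input, target) pairs for both variants from the example sentence. It only shows how the pairs are formed, not the neural network training; the window size of 1 and the helper `context_of` are assumptions made for this illustration.

```python
sentence = "I love to dance in the rain".split()
window = 1  # assumed window size: one word on each side

def context_of(position, tokens, window):
    """Words within `window` positions to the left and right of the given position."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

# Skip-Gram: (input word, context word to predict)
skip_gram_pairs = [(center, context)
                   for i, center in enumerate(sentence)
                   for context in context_of(i, sentence, window)]

# CBOW: (input context words, word to predict)
cbow_pairs = [(context_of(i, sentence, window), center)
              for i, center in enumerate(sentence)]

print([pair for pair in skip_gram_pairs if pair[0] == "to"])
print([pair for pair in cbow_pairs if pair[1] == "to"])
```

With a larger window (the models in this notebook use window=5), each word is paired with more neighbors on both sides.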
Note: Word2Vec requires a list of words (tokenized text) for each document.
#tokenized words in each document
final_data_shuffle['new_text_wordlist'] = final_data_shuffle.new_text.apply(lambda x : word_tokenize(x))
#final_data_shuffle['text_wordlist'] = final_data_shuffle.text.apply(lambda x : word_tokenize(x))
final_data_shuffle.head(3)
#train Word2Vec model
model = Word2Vec(final_data_shuffle['new_text_wordlist'], size=100, window=5, min_count=5, workers=4)# document with removed positve and negative words
#model_1 = Word2Vec(final_data_shuffle['text_wordlist'], size=100, window=5, min_count=5, workers=4)
#save model
#save_file(model_1, 'word2v_model_1')
save_file(model, 'word2v_model')
print(model)
#print(model_1)
#list of vocab
list(model.wv.vocab)
#project word vectors onto two dimensions with PCA
X=model.wv[list(model.wv.vocab)] # vector of words
pca = PCA(n_components=2) #PCA for two dimensions
result = pca.fit_transform(X)
#display the first 50 words in two dimensions
plt.figure(figsize=(9,6))
plt.scatter(result[:50, 0], result[:50, 1]) #scatter plot
words = list(model.wv.vocab)[:50] # 50 words
for i, word in enumerate(words):
    #print(word ,':', (result[i, 0], result[i, 1]))
    plt.annotate(word, xy=(result[i, 0], result[i, 1])) #words with position(x,y)
plt.show()
#show top 10 similar words
model.wv.most_similar('food', topn=10)
model.wv.most_similar('bad', topn=10)
# split train and test dataset with stratify fashion
X_train_w2v, X_test_w2v, y_train, y_test = train_test_split(final_data_shuffle['new_text_wordlist'],
final_data_shuffle['star'],
test_size=0.2,
shuffle= True,
stratify= final_data_shuffle['star'],
random_state=37)
X_train_w2v.head(3)
Note: We have to build a final vector (document vector) to feed to the classifiers. Thus, we average the vectors of the words in each document to obtain the document vector.
#average word vectors in each document
def compute_w2v_vector(word2vec, document):
    """
    Average vectors for each word in the document
    parameters:
        word2vec : Word2VecKeyedVectors
        document : list of tokens
    return:
        document vector
    """
    list_of_word_vectors = [word2vec[w] for w in document if w in word2vec.vocab] #word vectors
    if len(list_of_word_vectors) == 0: #empty document: use a zero vector
        document_vec = [0.0]*word2vec.vector_size
    else:
        document_vec = np.sum(list_of_word_vectors, axis=0) / len(list_of_word_vectors) #average the vectors
    return document_vec
#average the vectors for each word in the document
X_train_w2v1 = X_train_w2v.apply(lambda x: compute_w2v_vector(model.wv, x))
X_test_w2v1 = X_test_w2v.apply(lambda x: compute_w2v_vector(model.wv, x))
#show in dataframe
X_train_w2v = pd.DataFrame(X_train_w2v1.values.tolist(), index= X_train_w2v1.index)
X_test_w2v = pd.DataFrame(X_test_w2v1.values.tolist(), index= X_test_w2v1.index)
X_train_w2v.head(3)
%%time
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg')
best_logic_param_w2v= grid_param_tune( clf_log, para_log, X_train_w2v, X_test_w2v, y_train, y_test)
Unlike Word2Vec, Doc2Vec converts each document directly into a document vector. It comes with two implementations:
1) Paragraph Vector - Distributed Memory (PV-DM) and
2) Paragraph Vector - Distributed Bag of Words (PV-DBOW)
More details can be found here: Doc2Vec
Note: Doc2Vec requires each document to be tagged with a number/tag, so we have to tag each document before feeding it to the model.
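#associate a tag/number with each document (documents are split into word lists)
tag_documents_train = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(X_train)]
tag_documents_test = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(X_test)]
tag_documents_train[0]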
%%time
#train Doc2Vec model
model_doc2vec_train =Doc2Vec(tag_documents_train, vector_size=100, window=5, min_count=5, workers=4, epochs=40)
#get doc vector for train and test before applying into classifiers
X_train_d2v = np.array([model_doc2vec_train.docvecs[x] for x in range(len(tag_documents_train))]) #get doc2vec and convert numpy
X_test_d2v = np.array([model_doc2vec_train.infer_vector(x[0]) for x in tag_documents_test]) #convert test data into doc2vec
%%time
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'lbfgs')
best_logic_param_d2v= grid_param_tune( clf_log, para_log, X_train_d2v, X_test_d2v, y_train, y_test)
#new document
final_data.iloc[30001]
#tag all documents (train data + test data)
tag_documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(final_data_shuffle['new_text_wordlist'])]
#train Doc2Vec model
model_doc2vec_final =Doc2Vec(tag_documents, vector_size=100, window=5, min_count=5, workers=4, epochs=40)
#get doc2vec and convert numpy
X_train_d2v_final = np.array([model_doc2vec_final.docvecs[x] for x in range(len(tag_documents))])
#logisticRegression final model with doc2vec
log_final_doc2vec = LogisticRegression(multi_class= 'multinomial', solver= 'lbfgs', C= 0.25)
log_final_doc2vec.fit(X_train_d2v_final, final_data_shuffle.star) #fit with all data
#predict new document
doc2vec_newdocument= model_doc2vec_final.infer_vector(final_data.iloc[30001].text.split()) #convert word list into doc2vec
log_final_doc2vec.predict([doc2vec_newdocument]) #predict
In this project, different word embedding (converting words into numbers) techniques are applied for sentiment classification. Hashing and Word2Vec/Doc2Vec seem to work well, but CountVectorizer and TF-IDF also did very well after removing the negative and positive words from the opposite classes.
The final model is built with TF-IDF and LogisticRegression. It is used for the web demo, which you can find here: Web Demo