Sentiment Analysis:

The main motivation behind this project is to understand different word embedding techniques.

In this project, a predictive model for sentiment analysis is developed. Sentiment is grouped into three categories: 1) positive, 2) neutral and 3) negative. The dataset for the project is taken from Yelp.

In this notebook, the whole process, from feature engineering to the final model, is shown step by step. Four feature extraction techniques, 1) CountVectorizer, 2) TF-IDF Vectorizer, 3) HashingVectorizer and 4) Word2Vec, are applied to build the sentiment analysis model. The final model (the best one) is deployed on Heroku, where you can play around with it.

In addition, positive words are removed from the negative class and negative words are removed from the positive class to increase the polarity (discrimination) between the classes, which gives higher prediction accuracy.

In [ ]:
 

Import Necessary Libraries

In [3]:
# basic libraries 
import re
import json
import pandas as pd
import numpy as np
import joblib
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import pickle as pkl
from collections import Counter
from pprint import pprint


#Nltk for text processing
from langdetect import detect
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, words,  wordnet
from wordcloud import WordCloud, STOPWORDS
from nltk.stem import WordNetLemmatizer, PorterStemmer

#sklearn for modeling
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from yellowbrick.classifier import ConfusionMatrix
from sklearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline, Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.decomposition import PCA

#gensim
from gensim.models import Word2Vec, Doc2Vec
from keras.preprocessing.text import Tokenizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
/Users/gangalingden/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.metrics.classification module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
  warnings.warn(message, FutureWarning)
In [ ]:
 
In [26]:
#load json data 
data = []
with  open('review.json', 'r') as file:
    for f in file:
        data.append(json.loads(f))

#convert into pandas dataframe        
df = pd.DataFrame(data)
df.head()
Out[26]:
review_id user_id business_id stars useful funny cool text date
0 Q1sbwvVQXV2734tPgoKj4Q hG7b0MtEbXx5QzbzE6C_VA ujmEBvifdJM6h6RLv4wQIg 1.0 6 1 0 Total bill for this horrible service? Over $8G... 2013-05-07 04:34:36
1 GJXCdrto3ASJOqKeVWPi6Q yXQM5uF2jS6es16SJzNHfg NZnhc2sEQy3RmzKTZnqtwQ 5.0 0 0 0 I *adore* Travis at the Hard Rock's new Kelly ... 2017-01-14 21:30:33
2 2TzJjDVDEuAW6MR5Vuc1ug n6-Gk65cPZL6Uz8qRm3NYw WTqjgwHlXbSFevF32_DJVw 5.0 3 0 0 I have to say that this office really has it t... 2016-11-09 20:09:03
3 yi0R0Ugj_xUx_Nek0-_Qig dacAIZ6fTM6mqwW5uxkskg ikCg8xy5JIg_NGPx-MSIDA 5.0 0 0 0 Went in for a lunch. Steak sandwich was delici... 2018-01-09 20:56:38
4 11a8sVPMUFtaC7_ABRkmtw ssoyf2_x0EQMed6fgHeMyQ b1b1eb3uo-w561D0ZfCEiQ 1.0 7 0 0 Today was my second out of three sessions I ha... 2018-01-30 23:07:38
In [ ]:
 

1.2 Data Overview

We notice that there are over 6 million documents and 9 columns. We will use only two columns, 'text' and 'stars', for our purpose.

In [27]:
#check shape
print ("Rows     : " ,df.shape[0])
print ("Columns  : " ,df.shape[1])
Rows     :  6685900
Columns  :  9
In [190]:
#takes only  review_text and stars(rating) columns
text_data = df[['text','stars']]
text_data.head(3)
Out[190]:
text stars
0 Total bill for this horrible service? Over $8G... 1.0
1 I *adore* Travis at the Hard Rock's new Kelly ... 5.0
2 I have to say that this office really has it t... 5.0
In [ ]:
 

2.1 Text Pre-processing

The following steps are applied for text preprocessing:
 1. Change to lower case
 2. Expand contractions, for example: 'can't' --> 'can not'
 3. Remove punctuation and all other non-alphabetic characters
 4. Build customized stopwords, and remove stop words and words shorter than 3 letters
 5. Keep only nouns, adverbs and adjectives
 6. Lemmatize based on POS tag

In [ ]:
 
In [42]:
## list of contractions  words
contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "can not", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}

#regex patterns for contraction words
contractions_re = re.compile('(%s)' % '|'.join(contraction_dict.keys())) # use keys
len(contraction_dict)
Out[42]:
120
In [ ]:
 
In [43]:
def replace_contractions(text, contractions):
    """ 
    Function to expand contraction words into two-words(e.g. 'can't' -->> 'can not')
    
    Parameters:
       text(str): text to expand contraction words
       contractions(dict): contractions  words in dict type
    
    Returns:
       Text with expanded contraction words
    
    """
    
    def replace(match_object):
        """
        Parameters:
            match_object: matched words in  regex pattern 'contractions_re'
        
        Return:
            values for matched words
        """
        return contractions[match_object.group(0)] # get dict values
    
    return contractions_re.sub(replace, text) # sub with 'replace'
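
A quick check of how the regex-based expansion behaves on a short string. This is a minimal, self-contained sketch: `small_dict` is a hypothetical three-entry stand-in for `contraction_dict`, and `expand` mirrors `replace_contractions` rather than calling it. Escaping the keys with `re.escape` is an added precaution, not something the cell above does.

```python
import re

# Hypothetical mini dictionary standing in for contraction_dict
small_dict = {"can't": "can not", "it's": "it is", "won't": "will not"}

# One alternation pattern over all keys; re.escape guards against regex metacharacters
pattern = re.compile("(%s)" % "|".join(re.escape(k) for k in small_dict))

def expand(text, contractions=small_dict):
    """Replace every matched contraction with its expanded form."""
    return pattern.sub(lambda m: contractions[m.group(0)], text)

result = expand("i can't stay, it's late and i won't wait")
print(result)
```

Because `text_preprocess` lowercases the text before calling `replace_contractions`, only the lowercase keys of the dictionary can ever match.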
In [ ]:
 
In [44]:
#find wordnet POS-tagging
def get_wordnet_pos(pos_tag):
    """
    Parameters:
        pos_tag: Word POS tag
         
    Returns:
        wordnet POS tag (for example, 'n', 'r', 'a')
    """
    
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    
    else:
        return wordnet.NOUN
In [ ]:
 
In [45]:
#Basic text preprocessing
def text_preprocess(text):
    """
    Function to clean documents
    
    Parameter:
       text(str): text for preprocessing
       
    Returns:
        The Clean Processed text
    
    """
    
    #1. convert words to lower case 
    text = text.lower()
    
    #2.replace contraction words using function - replace_contractions()
    text = replace_contractions(text, contraction_dict)
    
    #3.remove all non-alphabetic characters (digits, punctuation)
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.split()
    
    #4.remove stop words and words with less than 3 letters
    stops_word = set(stopwords.words("english"))
    text = [w for w in text if w in ['no', 'not', 'too', 'so'] or (w not in stops_word and len(w) >= 3)] #include ['no', 'not', 'too', 'so'] 
    
    
    #5.take only noun, adjective and adverb using POS-tag
    noun_adj_adv = ['NN','NNS','NNP','NNPS', 'JJ','JJR','JJS','RB','RBR', 'RBS'] #include only noun, adjective and adverb
    #text = [word for word in text  if pos_tag([word])[0][1] in noun_adj_adv]
   
    #6.lemmatize words based on specific POS-tags
    lema = WordNetLemmatizer()
    lema_words = [lema.lemmatize(word, pos=get_wordnet_pos(pos_tag([word])[0][1]))   for word in text   if pos_tag([word])[0][1] in noun_adj_adv ]
    text = " ".join(lema_words) # return string
   
    return text
In [ ]:
 
In [206]:
%%time
# apply text_preprocess() function to the 'text' column
text_data['clean_text'] = text_data.loc[:,'text'].apply(text_preprocess)
text_data.head()
/Users/gangalingden/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
CPU times: user 1d 6h 3min 33s, sys: 1h 56min 52s, total: 1d 8h 25s
Wall time: 1d 8h 49min 18s
Out[206]:
text stars clean_text
0 Total bill for this horrible service? Over $8G... 1.0 total bill horrible service crook actually ner...
1 I *adore* Travis at the Hard Rock's new Kelly ... 5.0 adore travis hard rock new kelly cardenas salo...
2 I have to say that this office really has it t... 5.0 office really together so friendly phillipp gr...
3 Went in for a lunch. Steak sandwich was delici... 5.0 lunch steak sandwich delicious caesar salad ab...
4 Today was my second out of three sessions I ha... 1.0 today second session paid first session well t...
In [ ]:
 

2.2 Label Dataset

Ratings of 4 and 2 stars are excluded from the data, so that each label has clearly discriminative (polar) words. Sentiment is labeled in the following way:
  1. Positive sentiment = 5-star rating
  2. Neutral sentiment = 3-star rating
  3. Negative sentiment = 1-star rating

In [220]:
#Exclude the rows having stars rating of  4.0 and 2.0
text_data = text_data.loc[text_data['stars'].isin([1.0, 3.0, 5.0 ])]

# Map stars with sentiments (1.0:'negative', 5.0:'positive', 3.0 :'neutral')
pd.options.mode.chained_assignment = None # suppress SettingWithCopyWarning
text_data['sentiment'] = text_data['stars'].map({1.0:'negative', 5.0:'positive', 3.0 :'neutral' })
text_data.head()
Out[220]:
text stars clean_text sentiment
0 Total bill for this horrible service? Over $8G... 1.0 total bill horrible service crook actually ner... negative
1 I *adore* Travis at the Hard Rock's new Kelly ... 5.0 adore travis hard rock new kelly cardenas salo... positive
2 I have to say that this office really has it t... 5.0 office really together so friendly phillipp gr... positive
3 Went in for a lunch. Steak sandwich was delici... 5.0 lunch steak sandwich delicious caesar salad ab... positive
4 Today was my second out of three sessions I ha... 1.0 today second session paid first session well t... negative
In [ ]:
 
In [ ]:
 
In [222]:
# check values counts based on labels
text_data.sentiment.value_counts()
Out[222]:
positive    2933082
negative    1002159
neutral      739280
Name: sentiment, dtype: int64
In [223]:
#plot barchart of class
plt.bar(text_data.sentiment.value_counts().index, text_data.sentiment.value_counts().values )
plt.xlabel('Sentiment review')
plt.ylabel('Review Count')
plt.title('Barchart')
plt.xticks(rotation=45)
plt.show()

# By using panda directly
#text_data.groupby('sentiment').count()['stars'].plot.bar()

Observation:

The classes are highly imbalanced. Positive reviews have the highest count (about 2.9 million), roughly three times the negative reviews (about 1.0 million) and four times the neutral reviews (about 0.74 million). We have to balance the classes before fitting a model.

In [ ]:
 
In [ ]:
# define function to generate word cloud
def generate_wordcloud(text, title=None):
    
    """
    Function to generate Word-Cloud image
    
    Parameters:
       text(str): collection of words to plot
       title(str): title of figure
    
    Returns:
       The word cloud  image
    """
    
    cloud = WordCloud( stopwords= [x for x in list(STOPWORDS) if x not in ['not', 'so']],
                      background_color="white", scale=2,
                     collocations=False).generate(str(text))
    
    plt.figure( figsize=(12,10)) # set size of figure
    
    # if title is given use it
    if title: 
        plt.title(title, fontdict={'size': 20,  
                                  'verticalalignment': 'bottom'})
     
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout()
    
In [ ]:
 
In [525]:
#call generate_wordcloud function
corpus = ' '.join(text_data['clean_text'])
generate_wordcloud(corpus, 'Most common words in whole Reviews')
In [ ]:
 
In [523]:
# separate sentiment(class) and  show in word cloud
negative_sentiment = ' '.join( review for review in text_data.loc[text_data['sentiment'] == 'negative']['text'])
postive_sentiment = ' '.join( review for review in text_data.loc[text_data['sentiment'] == 'positive']['text'])
neutral_sentiment = ' '.join( review for review in text_data.loc[text_data['sentiment'] == 'neutral']['text'])
In [527]:
%%time
#show negative, positive and neutral reviews in word clouds
titles = ['Most common words in Negative Reviews', 'Most common words in Positive Reviews', 'Most common words in Neutral Reviews']
words_sentiments = [negative_sentiment, postive_sentiment, neutral_sentiment]

for  title, words_sent in zip(titles, words_sentiments):
    #ng_token = [ word for word in words_sent.split()]
    generate_wordcloud(words_sent, title)
    print()
    print()





CPU times: user 12min 36s, sys: 11min 39s, total: 24min 16s
Wall time: 41min 50s
Compiler : 112 ms
In [ ]:
 
In [20]:
# store words as dictionary keys, and their counts as values
vocab = Counter()
for text in text_data.loc[:, 'clean_text']:
    vocab.update(text.split())
        
In [31]:
#show in bar chart
most_freq_25 = vocab.most_common(25) # 25 most frequent words
df_most_feq = pd.DataFrame(most_freq_25, columns = ['Word', 'Count'])
df_most_feq.plot.bar(x='Word', y ='Count', figsize=(10,5), title='Most 25 Frequent Words in Corpus')
plt.show()
In [ ]:
 

2.4 Handle Imbalanced Data

There are many ways to handle an imbalanced dataset, for example undersampling the majority class, oversampling the minority class, or designing a cost function that penalizes misclassification of the minority class more than misclassification of the majority class. For this work we use RandomUnderSampler from imbalanced-learn (imblearn), which randomly removes samples from the majority classes. The drawback of undersampling is loss of information, since some samples are removed, but this is not a major concern here because enough samples remain for the algorithms to learn from.
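
The idea behind random undersampling can be sketched in a few lines of plain Python. This is only an illustration of the principle, not imbalanced-learn's implementation; the function name `random_undersample`, the toy reviews and the seed are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def random_undersample(rows, labels, seed=2):
    """Randomly drop samples (without replacement) so every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    n_min = min(len(group) for group in by_class.values())  # size of the minority class

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, n_min):  # sample without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data (hypothetical): 6 positive, 3 negative, 2 neutral reviews
labels = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 2
rows = [f"review {i}" for i in range(len(labels))]

X_small, y_small = random_undersample(rows, labels)
print(Counter(y_small))
```

After resampling, every class has as many samples as the original minority class, which is the same outcome the `Counter(y_res)` cell shows for the real data.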

In [32]:
#resample all classes but the minority class, without replacement
under_sample = RandomUnderSampler(sampling_strategy='not minority',random_state=2, replacement=False) 
X_res,y_res = under_sample.fit_resample(text_data[['clean_text', 'stars', 'sentiment']], text_data['sentiment'])
In [33]:
Counter(y_res)
Out[33]:
Counter({'negative': 739280, 'neutral': 739280, 'positive': 739280})
In [34]:
X_res[1]
Out[34]:
array(['super vega hotel reroute entire strip so luxor strip hotel bad check june less pleasant front desk agent check venetian conference checked fee room pretty much bad stayed monte carlo so lot room damp musty suspect pyramid window actually day front desk line consistently people not long something told move too furniture worn picture closet not sure rust blood light fixture bathroom never prison suspect shower head superior luxor shower door loud difficult move not year pyramid large amount extra space room random table chair comfy couch large desk area even fridge water bottle leftover truly waste space pool open odd sunny warm least also pool open close half pool open honest best thing luxor gym place money spent dubya left office pharaoh tomb disservice luxor reputation',
       1.0, 'negative'], dtype=object)
In [61]:
#show in panda dataframe
final_data = pd.DataFrame(X_res, columns=['text','star','sentiment'])
final_data.head()
Out[61]:
text star sentiment
0 bought new tire rim year back june nug lug obv... 1 negative
1 super vega hotel reroute entire strip so luxor... 1 negative
2 short numerous people great detail food low qu... 1 negative
3 large group mother day brunch think good thing... 1 negative
4 dollar beer cod fry coleslaw fish kind frozen ... 1 negative
In [36]:
#check data types
final_data.dtypes
Out[36]:
text         object
star         object
sentiment    object
dtype: object
In [62]:
#change column 'star' in to 'int' type
final_data['star'] = final_data['star'].astype(int)
final_data.dtypes
Out[62]:
text         object
star          int64
sentiment    object
dtype: object
In [ ]:
 
In [9]:
#drop duplicates rows
final_data= final_data.drop_duplicates(['text'], keep= 'first')
final_data.shape
Out[9]:
(2213146, 3)
In [27]:
#count duplicates based on 'text' column
final_data_shuffle.duplicated(['text'], keep='first').sum()
Out[27]:
0

Take only 30K from each class

Note: Since the dataset has a huge number of documents, parameter tuning and modeling take a lot of time, so I decided to take just 30K documents from each class to keep the computation affordable.

In [48]:
#take only 30k from each class
neagtive = final_data.iloc[:30000] # negative class
neutral = final_data.iloc[739280:739280+30000] # neutral class
postive = final_data.iloc[-30000:] # positive class

final_data_le= pd.concat([neagtive,neutral,postive], ignore_index=True) #concatenate
final_data_shuffle= final_data_le.sample(frac=1) #shuffle 
In [ ]:
 

2.5 Remove Negative and Positive words

Positive and negative words are removed from the opposite classes to increase the polarity between the classes.
Note: The negative and positive words are collected from here.

In [7]:
#remove negative and positive words
def remove_pos_neg_word(text, star, negative_word, positive_word):
    """
    Function to remove 
         1. negative words from the positive class,
         2. positive words from the negative class,
         3. positive and negative words from the neutral class.
         
    Parameter:
       text(str): document (a row in the pandas dataframe)
       star(int): 1 or 3 or 5
       negative_word(list): list of negative words
       positive_word(list): list of positive words
       
    Returns:
       The text with the opposite-polarity words removed
       
    """
    
    #split text into words
    split_text = text.split()
    
    #negative class: remove positive words
    if star ==1:
        text= ' '.join([word for word in split_text if word not in positive_word])
        return text
       
    #neutral class: remove positive and negative words
    if star ==3:
        text= ' '.join([word for word in split_text if word not in negative_word and word not in positive_word])
        return text
    
    #positive class: remove negative words
    if star ==5:
        text= ' '.join([word for word in split_text if word not in negative_word])
        return text
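
To see what the rule does to a single review, here is a small self-contained sketch. `strip_opposite_polarity` is a hypothetical compact variant of `remove_pos_neg_word` that uses set lookups, and the two word lists are toy lexicons, not the real files. Returning the text unchanged for any other rating is an assumption of this sketch.

```python
def strip_opposite_polarity(text, star, negative_words, positive_words):
    """Hypothetical compact variant of remove_pos_neg_word using set lookups."""
    if star == 1:        # negative review: drop positive words
        banned = set(positive_words)
    elif star == 3:      # neutral review: drop both polarities
        banned = set(positive_words) | set(negative_words)
    elif star == 5:      # positive review: drop negative words
        banned = set(negative_words)
    else:
        return text      # assumption: other ratings are returned unchanged
    return " ".join(w for w in text.split() if w not in banned)

neg = ["bad", "horrible"]     # toy lexicon, not the real negative-words.txt
pos = ["good", "delicious"]   # toy lexicon, not the real positive-words.txt
review = "food good service horrible"

for star in (1, 3, 5):
    print(star, "->", strip_opposite_polarity(review, star, neg, pos))
```

Set membership tests are constant time on average, which matters when the lexicons hold thousands of words. Note that the filter needs the star rating as input, so it applies to labeled data only.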
    
           
In [ ]:
 
In [8]:
#positive and negative words
word_neag = pd.read_csv('negative-words.txt', names=['Neg_word'],header=None)
word_neag_list = list(word_neag.Neg_word)
word_posi = pd.read_csv('positive-words.txt', names=['Pos_word'],header=None)
word_posi_list = list(word_posi.Pos_word)
final_data_shuffle['new_text'] = final_data_shuffle.apply(lambda row: remove_pos_neg_word(row['text'], row['star'],word_neag_list,word_posi_list), axis=1)#remove words
final_data_shuffle.head()
Out[8]:
text star sentiment new_text
78757 delight cleveland nice indeed able sample deli... 5 positive delight cleveland nice indeed able sample deli...
41233 star burger place burger solid nice flavor fre... 3 neutral star burger place burger flavor lettuce onion ...
73907 best strip hotel vega hand stayed several time... 5 positive best strip hotel vega hand stayed several time...
33363 not grand opening still hour line minute drink... 3 neutral opening still hour line minute drink hey place...
59445 place little expensive service little seat peo... 3 neutral place little service little seat people sczech...
In [ ]:
 
In [ ]:
 

Train-Test split

In [486]:
# split into train and test sets in a stratified fashion
X_train, X_test, y_train, y_test = train_test_split(final_data_shuffle['new_text'],
                                                    final_data_shuffle['star'], 
                                                    test_size=0.2,
                                                    shuffle= True,
                                                    stratify= final_data_shuffle['star'],
                                                    random_state=37)
In [ ]:
 
In [ ]:
 

3 Modeling

Two classifiers are used: 1. MultinomialNB and 2. Logistic Regression

In [471]:
# define function to save model to disk
def save_file(model, file_name):
    '''
    model = file/model to save
    file_name = name given to model/file
    '''
    with open(file_name, 'wb') as f:  # context manager closes the file after writing
        pkl.dump(model, f)
 
In [ ]:
 

3.1. Hyperparameter Tuning

GridSearchCV with k-fold cross-validation (cv=2 in the code below) is used to find the best parameters of the algorithms. The helper function grid_param_tune() is defined for this purpose. The evaluation metric is the default one (accuracy). In addition, model evaluation metrics such as precision, recall and F1 score are shown, and the confusion matrix is displayed.
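
For reference, the per-class metrics in the classification report can be derived directly from a confusion matrix. The sketch below uses a made-up 3x3 matrix (rows are true classes, columns are predicted classes); the numbers are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
labels = ["negative", "neutral", "positive"]
cm = [
    [90, 8, 2],
    [10, 80, 10],
    [1, 4, 95],
]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, but true class differs
    fn = sum(cm[k]) - tp                              # true k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) over all samples
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.3f}")
for k, name in enumerate(labels):
    p, r, f = per_class_metrics(cm, k)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Precision reads down a column of the matrix and recall reads across a row, which is why a class can score high on one and low on the other.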

In [487]:
#parameter tuning with grid search
def grid_param_tune( clf, clf_parameter, X_train, X_test, y_train, y_test, vect=None,  vect_parameter=None):
    '''
    Function to find the best parameters
    
    Parameters:
        clf: classifier (algorithm)
        clf_parameter: parameter grid for the classifier
        X_train: train set
        X_test: test set
        y_train: train label
        y_test: test label
        vect: vectorizer technique (such as TF-IDF, CountVectorizer)
        vect_parameter: parameter grid for the vectorizer
        
    Returns:
        The fitted GridSearchCV object. The best parameters, the test score of the best
        estimator, the classification report and the confusion matrix on test data are printed.
    
    '''
    
    #join parameters together
    parameter = {}
    pipeline=''
    
    
    #check vect and vect_parameter
    if vect is None and vect_parameter is None:
        #update parameters together
        parameter.update(clf_parameter)
        pipeline = Pipeline([('clf', clf)])
        
    else:
        #update parameters together
        parameter.update(vect_parameter)
        parameter.update(clf_parameter)
        
        # set of steps
        pipeline = Pipeline([('vect', vect),
                         ('clf', clf)])
   
                       
    #set grid search
    grid_para = GridSearchCV(estimator= pipeline, 
                             param_grid= parameter, 
                             cv = 2)
    grid_para.fit(X_train, y_train) # fit dataset
    
    
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps]) #show steps
    print("parameters:")
    pprint(parameter) # pretty-print dictionary
    print()
    
    
    #best cv score
    print(f'Best cv score: {grid_para.best_score_:.3f}') 
    print()
    
    #show best parameters
    print(f'Best Parameter set:')
    best_parameters= grid_para.best_estimator_.get_params()
    for param_name in sorted(parameter.keys()): #iterates over parameters
        print(f'\t{param_name}: {best_parameters[param_name]}') # takes best parameters
    
    
    #test score from best_estimator_
    print(f'Score on test data with best parameter: {grid_para.best_estimator_.score(X_test,y_test):.3f}') 
    print()
    
    #classification report
    print(f'Classification Reports on Test Data:')
    predict= grid_para.best_estimator_.predict(X_test)
    print(classification_report(y_test,predict)) 
    
    
    #confusion matrix
    con_matrix = confusion_matrix(y_test, predict, labels=[1,3,5]) # order: negative, neutral, positive
    column = index =['Negative', 'Neutral', 'Positive'] #for columns and index
    cm_df = pd.DataFrame(con_matrix, column, index) # pandas dataframe
    
    # create figure for confusion matrix
    fig, ax = plt.subplots(figsize=(7,5)) 
    sns.heatmap(cm_df, annot=True, fmt=".1f") # show data value in each cell
    plt.ylabel('True Label')# ylabel
    plt.xlabel('Predicted Label') #xlabel
    plt.title('Confusion Matrix') #title
    plt.show()
    
    return grid_para
 
    
 
In [ ]:
 
In [ ]:
 

3.2. Feature Extraction

Machine learning algorithms cannot read raw text directly, so the text has to be converted into numbers. This process is known as feature extraction (loosely, word embedding). For this project, we will explore and use three types of techniques: 1) Count Vectorizer, 2) TF-IDF Vectorizer and 3) Word2Vec.

3.2.1 Count Vectorizer

In the count vector method, the N unique tokens (uni/bi/tri-grams) extracted from the corpus form the vocabulary. Then, for each document, the frequency of every vocabulary token is counted. Words in a document that do not occur in the vocabulary are ignored, and vocabulary tokens that do not occur in the document get a count of zero, which makes the feature vectors sparse. The length of each feature vector equals the size of the vocabulary. An example is shown below:

  Let's consider the corpus: ['This is the first document.', 'This document is the second document.', 'And this is the third one.']
  Unique tokens/Vocabulary: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  Document_1: [0 1 1 1 0 0 1 0 1]
  Document_2: [0 2 0 1 0 1 1 0 1]
  Document_3: [1 0 0 1 1 0 1 1 1]

Here, the frequency count of each word in a document is placed at the position of that word in the vocabulary.
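
The vectors in this example can be reproduced by hand with a few lines of standard-library Python. This is a minimal sketch of the counting step only; the `tokenize` helper is a simplification introduced for the example and is not scikit-learn's tokenizer.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

def tokenize(doc):
    """Lowercase and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", doc.lower())

# Vocabulary: sorted unique tokens across the whole corpus
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def count_vector(doc):
    """Count each vocabulary term in the document; absent terms get 0."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

vectors = [count_vector(doc) for doc in corpus]
print(vocabulary)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one vocabulary entry, so 'document' appearing twice in the second sentence shows up as the 2 in its second slot.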

In [564]:
#set parameters for CountVectorizer
cvect_para = {
    
    'vect__max_df': (0.95, 0.98, 0.99),
    'vect__ngram_range':((1, 2),(2,2)),
    'vect__min_df': (0.005, 0.0025, 0.01)
    }

#CountVectorizer
vectorizer = CountVectorizer( analyzer='word')
In [ ]:
 
In [ ]:
 

MultinomialNB Classifier

In [26]:
%%time
#smoothing parameter
para_NB= {   
    'clf__alpha': (0.75, 0.90) 
    }


# MultinomialNB
clf_NB= MultinomialNB()

#call grid_param_tune() function
best_NB_pam = grid_param_tune(clf_NB, para_NB, X_train, X_test, y_train, y_test, vectorizer, cvect_para)
Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__alpha': (0.75, 0.9),
 'vect__max_df': (0.95, 0.98, 0.99),
 'vect__min_df': (0.005, 0.0025, 0.01),
 'vect__ngram_range': ((1, 2), (2, 2))}

Best cv score: 0.949

Best Parameter set:
	clf__alpha: 0.75
	vect__max_df: 0.95
	vect__min_df: 0.0025
	vect__ngram_range: (1, 2)
Score on test data with best parameter: 0.951

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.92      0.97      0.94      6000
           3       0.96      0.90      0.93      6000
           5       0.98      0.98      0.98      6000

    accuracy                           0.95     18000
   macro avg       0.95      0.95      0.95     18000
weighted avg       0.95      0.95      0.95     18000

CPU times: user 33min 45s, sys: 1min 1s, total: 34min 46s
Wall time: 37min 2s
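
A rough way to see why this search takes over half an hour is to count the model fits it performs. The sketch below enumerates the grid with `itertools.product`; the grid values are copied from the cells above, and the extra fit assumes GridSearchCV's default behavior of refitting the best candidate on the full training set.

```python
from itertools import product

# Grids copied from the cells above
param_grid = {
    "clf__alpha": (0.75, 0.90),
    "vect__max_df": (0.95, 0.98, 0.99),
    "vect__min_df": (0.005, 0.0025, 0.01),
    "vect__ngram_range": ((1, 2), (2, 2)),
}
cv_folds = 2  # the cv value set inside grid_param_tune

# Every combination of one value per parameter is one candidate
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each candidate is fitted once per fold, plus one final refit (assumed default)
n_fits = len(candidates) * cv_folds + 1
print(len(candidates), "candidates,", n_fits, "fits")
```

The number of candidates is the product of the grid sizes, so adding one more value to any parameter multiplies the cost rather than adding to it.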

We now have the best parameters, so let's use them to build the final model of this algorithm for predicting new documents.

In [ ]:
 

Predict new document (review)

In [565]:
%%time
#use best parameter
best_vectorizer = CountVectorizer( max_df= 0.95, min_df= 0.0025, ngram_range= (1, 2),analyzer='word')
countVec_matrix = best_vectorizer.fit_transform(final_data_shuffle['new_text'])
CPU times: user 19 s, sys: 949 ms, total: 20 s
Wall time: 21.9 s
In [567]:
#final MultinomialNB model with CountVectorizer
CountVec_mul_final = MultinomialNB( alpha= 0.75)
CountVec_mul_final.fit(countVec_matrix , final_data_shuffle['star'])
Out[567]:
MultinomialNB(alpha=0.75, class_prior=None, fit_prior=True)
In [578]:
#new document to predict
final_data.iloc[30005].text
Out[578]:
'food bland drink good glass wall men woman bathroom figure dude crapper pee glass wall horrible design put real wall'
In [579]:
#convert new document into matrix
test_countVec_matrix = best_vectorizer.transform([final_data.iloc[30005].text])

#predict new document
CountVec_mul_final.predict(test_countVec_matrix)
Out[579]:
array([1])
In [ ]:
 
In [ ]:
 

LogisticRegression

In [212]:
%%time

#set parameter
para_log={
    'clf__C': (0.25, 0.5, 1.0)
    }


#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg')
best_logic_param= grid_param_tune( clf_log, para_log, X_train, X_test, y_train, y_test,vectorizer, cvect_para)
Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__C': (0.25, 0.5, 1.0),
 'vect__max_df': (0.95, 0.98, 0.99),
 'vect__min_df': (0.005, 0.0025, 0.01),
 'vect__ngram_range': ((1, 2), (2, 2))}

Best cv score: 0.971

Best Parameter set:
	clf__C: 0.5
	vect__max_df: 0.95
	vect__min_df: 0.0025
	vect__ngram_range: (1, 2)
Score on test data with best parameter: 0.972

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.99      0.96      0.97      6000
           3       0.94      0.99      0.96      6000
           5       1.00      0.97      0.99      6000

    accuracy                           0.97     18000
   macro avg       0.97      0.97      0.97     18000
weighted avg       0.97      0.97      0.97     18000

CPU times: user 1h 10min 47s, sys: 1min 10s, total: 1h 11min 57s
Wall time: 1h 12min 25s
In [ ]:
 

Predict new document

In [587]:
#final logistic model with CountVectorizer
CountVec_log_best= LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg', C=0.5)
CountVec_log_best.fit(countVec_matrix , final_data_shuffle['star']) #fit with data
Out[587]:
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
In [600]:
final_data.iloc[30000].text
Out[600]:
'airline truly criminal airline industry try flight compare airline price flight destination airline cost less call well price even bunch criminal'
In [598]:
#convert new document into matrix
test_countVec_matrix = best_vectorizer.transform([final_data.iloc[30000].text])

#predict new document
CountVec_log_best.predict(test_countVec_matrix)
Out[598]:
array([5])

It looks like the model does not predict this new document correctly, since it should be negative (1).

In [ ]:
 
In [601]:
#save fitted vectorizer
save_file(best_vectorizer, 'Models/Count_Vectorize_final')

3.2.2 TF-IDF Vectorizer

The downside of the count vectorizer is that it considers a word's frequency only within a single document. Thus, common words that occur frequently in every document, for example 'is', 'the', 'a', etc., receive high weights even though they carry no discriminatory information. To address this issue, TF-IDF takes into account the occurrence of a word in a single document as well as in the entire corpus.

TF-IDF down-weights the common words occurring in almost all documents and gives more importance to words that appear in only a subset of documents. TF-IDF is calculated by multiplying two terms, TF and IDF.
TF (Term Frequency): (Number of times term t appears in a document)/(Number of terms in the document)
IDF (Inverse Document Frequency): log(N/n), where N is the total number of documents and n is the number of documents in which term t appears.

An example is demonstrated below (using log base 10):
  Let's consider the corpus: ['This is the first document', 'This document is the second document', 'And this is the third one']

  TF-IDF(This, Document_1) = TF × IDF = (1/5) × log(3/3) = (1/5) × 0 = 0
  TF-IDF(first, Document_1) = TF × IDF = (1/5) × log(3/1) = (1/5) × 0.477 = 0.095

From the example, we can see that the word 'this' is heavily penalized since it appears in every document, while the word 'first' is given some weight since it appears only in the first document.
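
The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch of the textbook formula given in the text (base-10 logarithm, lower-cased whitespace tokens); the helper names `tf`, `idf` and `tf_idf` are made up for illustration. It is not what scikit-learn's `TfidfVectorizer` computes by default, since that class uses a smoothed natural-log IDF and L2-normalizes each row, so its numbers will differ.

```python
import math

# Corpus from the example above, lower-cased and split on whitespace.
corpus = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
]
docs = [text.lower().split() for text in corpus]


def tf(term, doc):
    """Term frequency: count of `term` in `doc` divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log10(N / n), as defined in the text.

    Assumes the term occurs in at least one document (otherwise n would be 0).
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    """Product of the two terms above."""
    return tf(term, doc) * idf(term, docs)


print(round(tf_idf("this", docs[0], docs), 3))   # 0.0
print(round(tf_idf("first", docs[0], docs), 3))  # 0.095
```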

In [29]:
#set parameters for the vectorizer (same grid as used for CountVectorizer)
cvect_para = {
    
    'vect__max_df': (0.95, 0.98, 0.99),
    'vect__ngram_range':((1, 2),(2,2)),
    'vect__min_df': (0.005, 0.0025, 0.01)
    }

#tf-idf
tf_idf_vect = TfidfVectorizer(analyzer='word')
In [ ]:
 

MultinomialNB

In [216]:
%%time
#MultinomialNB
best_NB_pam_tfidf = grid_param_tune( clf_NB, para_NB, X_train, X_test, y_train, y_test, tf_idf_vect, cvect_para)
Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__alpha': (0.75, 0.9),
 'vect__max_df': (0.95, 0.98, 0.99),
 'vect__min_df': (0.005, 0.0025, 0.01),
 'vect__ngram_range': ((1, 2), (2, 2))}

Best cv score: 0.941

Best Parameter set:
	clf__alpha: 0.75
	vect__max_df: 0.95
	vect__min_df: 0.0025
	vect__ngram_range: (1, 2)
Score on test data with best parameter: 0.940

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.90      0.95      0.93      6000
           3       0.93      0.90      0.91      6000
           5       0.99      0.97      0.98      6000

    accuracy                           0.94     18000
   macro avg       0.94      0.94      0.94     18000
weighted avg       0.94      0.94      0.94     18000

CPU times: user 36min 37s, sys: 44.5 s, total: 37min 21s
Wall time: 38min 5s

LogisticRegression

In [31]:
%%time
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg')
best_logic_param_tfidf= grid_param_tune( clf_log, para_log, X_train, X_test, y_train, y_test, tf_idf_vect, cvect_para)
Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__C': (0.25, 0.5, 1.0),
 'vect__max_df': (0.95, 0.98, 0.99),
 'vect__min_df': (0.005, 0.0025, 0.01),
 'vect__ngram_range': ((1, 2), (2, 2))}

Best cv score: 0.969

Best Parameter set:
	clf__C: 1.0
	vect__max_df: 0.95
	vect__min_df: 0.0025
	vect__ngram_range: (1, 2)
Score on test data with best parameter: 0.969

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.99      0.95      0.97      6000
           3       0.93      0.99      0.96      6000
           5       1.00      0.97      0.98      6000

    accuracy                           0.97     18000
   macro avg       0.97      0.97      0.97     18000
weighted avg       0.97      0.97      0.97     18000

CPU times: user 59min 23s, sys: 1min 24s, total: 1h 47s
Wall time: 1h 56s
In [113]:
 

Use Best Parameters

In [48]:
%%time
tfidf_vectorizer = TfidfVectorizer( max_df= 0.95, min_df= 0.0025, ngram_range= (1, 2),analyzer='word')
tfidf_matrix = tfidf_vectorizer.fit_transform(final_data_shuffle['new_text'])
CPU times: user 17.7 s, sys: 1.74 s, total: 19.5 s
Wall time: 21 s
In [49]:
#final logistic model with TF-IDF
clf_log_final = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg', C= 1.0)
clf_log_final.fit(tfidf_matrix, final_data_shuffle['star'])
Out[49]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
In [618]:
#new document
final_data_shuffle.iloc[33363].text
Out[618]:
'nice location nice price horrible service nail technician extremely rude friend customer shellac blotchy not evenly friend acrylic cut thicker nail others never back'
In [619]:
#predict new document
clf_log_final.predict(tfidf_vectorizer.transform([final_data_shuffle.iloc[33363].text]))
Out[619]:
array([1])

3.2.3 HashingVectorizer

The issue with 'CountVectorizer' and 'TF-IDF' is that both build a large vocabulary, which produces a high-dimensional feature space, requires a lot of memory and slows down the algorithms. In the HashingVectorizer method, the hashing trick is instead used to map each word to an integer feature index. No vocabulary needs to be stored, unlike with 'CountVectorizer' and 'TF-IDF'; instead we can use a fixed-length vector of arbitrary size. Thus, it is more memory efficient, but the downside is that once the text is vectorized, the original words can no longer be retrieved.
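
To make the hashing trick concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not scikit-learn's implementation: the function name `hash_vectorize` is hypothetical, and it uses `hashlib.md5` for a stable hash, whereas `HashingVectorizer` uses a signed MurmurHash3 and normalizes the rows by default.

```python
import hashlib


def hash_vectorize(text, n_features=16):
    """Map a document to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in text.lower().split():
        # A stable hash of the token, reduced modulo n_features, picks the column.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector


vec_a = hash_vectorize("horrible service never back")
vec_b = hash_vectorize("nice location nice price")

print(len(vec_a), len(vec_b))  # both vectors have n_features entries
print(sum(vec_b))              # one count per token in the document
```

Because different words can land in the same column (a collision) and only the column index is kept, the vector cannot be mapped back to the words, which is the trade-off described above. A larger `n_features` makes collisions less likely.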

In [33]:
#set parameter for HashingVectorizer
hasVec_para = {
    'vect__ngram_range':[(1,1),(1,2)],
    'vect__n_features': [2 ** x for x in(10, 13,15)],
    }

#hashingVectorizer
has_vec = HashingVectorizer(analyzer = 'word')

LogisticRegression

In [34]:
%%time
#logisticRegression
best_hasing_vec = grid_param_tune( clf_log, para_log, X_train, X_test, y_train, y_test, has_vec, hasVec_para)
Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__C': (0.25, 0.5, 1.0),
 'vect__n_features': [1024, 8192, 32768],
 'vect__ngram_range': [(1, 1), (1, 2)]}

Best cv score: 0.966

Best Parameter set:
	clf__C: 1.0
	vect__n_features: 32768
	vect__ngram_range: (1, 1)
Score on test data with best parameter: 0.968

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.99      0.95      0.97      6000
           3       0.92      0.99      0.95      6000
           5       1.00      0.97      0.98      6000

    accuracy                           0.97     18000
   macro avg       0.97      0.97      0.97     18000
weighted avg       0.97      0.97      0.97     18000

CPU times: user 22min 38s, sys: 27.2 s, total: 23min 5s
Wall time: 15min 39s
In [22]:
 

Predict new document

In [605]:
#hashing model with best parameter
has_vec_best = HashingVectorizer(n_features=32768,ngram_range=(1, 1),  analyzer = 'word')
hasing_matrix = has_vec_best.fit_transform(final_data_shuffle['new_text']) #fit all data
In [621]:
#final logistic model with hashing model
clf_log_final_hasing = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg', C= 1.0)
clf_log_final_hasing.fit(hasing_matrix, final_data_shuffle['star'])
Out[621]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
In [622]:
#predict new document
clf_log_final_hasing.predict(has_vec_best.transform([final_data_shuffle.iloc[30001].text]))
Out[622]:
array([1])

The hashing model correctly predicts this new document.

In [ ]:
 
In [625]:
#save model
save_file(has_vec_best, 'Models/hasing_vec_best') #hashing
save_file(clf_log_final_hasing, 'Models/logistic_final_hasing') #logistic with hashing

3.2.4 Word2Vec

This technique uses neural-network-based methods to convert words into vectors in such a way that semantically similar words are close to each other in N-dimensional space, where N is the dimension of the vectors.

The Word2Vec model comes with two techniques: 1) the Skip-Gram model and 2) the Continuous Bag of Words (CBOW) model. In the Skip-Gram model, the context words are predicted from the base word. For example, given the sentence "I love to dance in the rain", the Skip-Gram model will predict "love" and "dance" given the word "to" as input.

In contrast, the CBOW model will predict "to" if the context words "love" and "dance" are fed as input to the model. The model learns these relationships using neural networks.

Note: Word2Vec requires each document as a list of words (tokenized words).
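
The difference between the two techniques is easiest to see in the training pairs they are built from. The sketch below only generates those pairs for the example sentence with a context window of one word on each side; it does not train anything, and the helper names `skip_gram_pairs` and `cbow_pairs` are hypothetical, not part of gensim.

```python
def skip_gram_pairs(tokens, window=1):
    """(center word, context word) pairs: the center word is the model input."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context words, target word) pairs: the context words are the model input."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


tokens = "I love to dance in the rain".split()

print([p for p in skip_gram_pairs(tokens) if p[0] == "to"])
# [('to', 'love'), ('to', 'dance')]
print([p for p in cbow_pairs(tokens) if p[1] == "to"])
# [(['love', 'dance'], 'to')]
```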

In [ ]:
 
In [491]:
#tokenized words in each document
final_data_shuffle['new_text_wordlist'] = final_data_shuffle.new_text.apply(lambda x : word_tokenize(x))
#final_data_shuffle['text_wordlist'] = final_data_shuffle.text.apply(lambda x : word_tokenize(x))
In [539]:
final_data_shuffle.head(3)
Out[539]:
text star sentiment new_text new_text_wordlist text_wordlist
78757 delight cleveland nice indeed able sample deli... 5 positive delight cleveland nice indeed able sample deli... [delight, cleveland, nice, indeed, able, sampl... [delight, cleveland, nice, indeed, able, sampl...
41233 star burger place burger solid nice flavor fre... 3 neutral star burger place burger flavor lettuce onion ... [star, burger, place, burger, flavor, lettuce,... [star, burger, place, burger, solid, nice, fla...
73907 best strip hotel vega hand stayed several time... 5 positive best strip hotel vega hand stayed several time... [best, strip, hotel, vega, hand, stayed, sever... [best, strip, hotel, vega, hand, stayed, sever...
In [ ]:
 
In [501]:
#train Word2Vec model
from gensim.models import Word2Vec

model = Word2Vec(final_data_shuffle['new_text_wordlist'], size=100, window=5, min_count=5, workers=4) #documents with opposite-class positive/negative words removed
#model_1 = Word2Vec(final_data_shuffle['text_wordlist'], size=100, window=5, min_count=5, workers=4)
In [224]:
#save model
save_file(model_1, 'word2v_model_1')
save_file(model, 'word2v_model')
In [ ]:
 
In [503]:
print(model)
Word2Vec(vocab=17672, size=100, alpha=0.025)
In [504]:
print(model_1)
Word2Vec(vocab=18241, size=100, alpha=0.025)
In [505]:
#list of vocab
list(model.wv.vocab)
Out[505]:
['delight',
 'cleveland',
 'nice',
 'indeed',
 'able',
 'sample',
 'good',
 'curry',
 'yes',
 'decor',
 'appeal',
 'people',
 'clean',
 'friendly',
 'young',
 'lady',
 'taster',
 'help',
 'mind',
 'service',
 'pleasant',
 'quality',
 'food',
 'excellent',
 'brit',
 'need',
 'taste',
 'home',
 'place',
 'star',
 'burger',
 'flavor',
 'lettuce',
 'onion',
 'tomato',
 'mayo',
 'mustard',
 'ketchup',
 'ive',
 'read',
 'lot',
 'discussion',
 'guy',
 'west',
 'coast',
 'there',
 'really',
 'much',
 'fight',
 'many',
 'others',
 'crucial',
 'thing',
 'always',
 'look',
 'kind',
 'especially',
 'picture',
 'price',
 'paid',
 'fry',
 'soda',
 'double',
 'set',
 'back',
 'conclusion',
 'craving',
 'dont',
 'joint',
 'close',
 'best',
 'strip',
 'hotel',
 'vega',
 'hand',
 'stayed',
 'several',
 'time',
 'mile',
 'half',
 'away',
 'grandparent',
 'room',
 'fabulous',
 'acceptable',
 'bed',
 'so',
 'comfortable',
 'bathroom',
 'luxurious',
 'marble',
 'counter',
 'top',
 'shower',
 'little',
 'draw',
 'multiple',
 'option',
 'bedside',
 'spotlight',
 'reading',
 'overhead',
 'office',
 'lamp',
 'desk',
 'ambient',
 'light',
 'makeup',
 'mirror',
 'comforter',
 'great',
 'wish',
 'identical',
 'house',
 'enough',
 'heavy',
 'keep',
 'warm',
 'air',
 'ton',
 'resort',
 'family',
 'movie',
 'theater',
 'bowling',
 'alley',
 'foodie',
 'restaurant',
 'course',
 'spa',
 'huge',
 'list',
 'opening',
 'still',
 'hour',
 'line',
 'minute',
 'drink',
 'hey',
 'new',
 'cookie',
 'water',
 'oreo',
 'mcflurry',
 'totally',
 'choose',
 'tho',
 'havent',
 'stop',
 'slows',
 'seat',
 'beef',
 'noodle',
 'soup',
 'spicy',
 'seafood',
 'thought',
 'dish',
 'friend',
 'order',
 'wonton',
 'try',
 'okay',
 'also',
 'potstickers',
 'skin',
 'potsticker',
 'too',
 'thick',
 'review',
 'bank',
 'usaa',
 'found',
 'deal',
 'truck',
 'dealership',
 'husband',
 'sent',
 'email',
 'information',
 'kelly',
 'quick',
 'respond',
 'kept',
 'not',
 'sure',
 'saw',
 'even',
 'open',
 'giant',
 'sign',
 'tent',
 'employee',
 'heater',
 'finally',
 'meet',
 'less',
 'personable',
 'test',
 'drive',
 'entire',
 'paper',
 'change',
 'number',
 'tell',
 'whole',
 'reason',
 'talk',
 'nothing',
 'glass',
 'already',
 'optional',
 'paperwork',
 'financing',
 'department',
 'manager',
 'something',
 'cash',
 'amount',
 'treat',
 'stupid',
 'bought',
 'car',
 'idiot',
 'oil',
 'carwashes',
 'feel',
 'king',
 'sound',
 'ultimate',
 'laugh',
 'ridiculous',
 'sat',
 'indifferent',
 'left',
 'barely',
 'word',
 'obnoxious',
 'think',
 'sooner',
 'social',
 'security',
 'credit',
 'rather',
 'due',
 'terrorist',
 'act',
 'long',
 'want',
 'hurt',
 'constantly',
 'bad',
 'yet',
 'brake',
 'issue',
 'live',
 'tucson',
 'told',
 'next',
 'day',
 'hear',
 'year',
 'eve',
 'call',
 'morning',
 'wound',
 'sort',
 'felt',
 'sale',
 'excuse',
 'tomorrow',
 'trouble',
 'someone',
 'holiday',
 'uhm',
 'yeah',
 'pick',
 'deliver',
 'never',
 'vehicle',
 'question',
 'anyone',
 'actually',
 'gotten',
 'appointment',
 'book',
 'hold',
 'heard',
 'laser',
 'week',
 'can',
 'maintain',
 'sequence',
 'suggest',
 'ask',
 'groupon',
 'experience',
 'date',
 'stay',
 'groupons',
 'problem',
 'everyone',
 'else',
 'check',
 'unfiltered',
 'suddenly',
 'end',
 'anyway',
 'message',
 'january',
 'voice',
 'mail',
 'full',
 'today',
 'jan',
 'bare',
 'couple',
 'response',
 'ago',
 'reach',
 'business',
 'greedy',
 'put',
 'cap',
 'offer',
 'handle',
 'demand',
 'google',
 'coupon',
 'various',
 'site',
 'money',
 'customer',
 'scramble',
 'machine',
 'inform',
 'operation',
 'guess',
 'high',
 'volume',
 'minimum',
 'month',
 'please',
 'update',
 'post',
 'pan',
 'nov',
 'cancellation',
 'transfer',
 'request',
 'rogers',
 'cable',
 'infrastructure',
 'cik',
 'note',
 'least',
 'calendar',
 'phone',
 'almost',
 'internet',
 'cancel',
 'turn',
 'reschedule',
 'dec',
 'send',
 'idea',
 'explicitly',
 'future',
 'mere',
 'mean',
 'instruct',
 'way',
 'advance',
 'hope',
 'min',
 'mbps',
 'write',
 'upload',
 'download',
 'sunday',
 'speed',
 'first',
 'youtube',
 'video',
 'provider',
 'move',
 'hopefully',
 'rating',
 'fellow',
 'yelpers',
 'current',
 'matter',
 'hole',
 'wall',
 'type',
 'general',
 'state',
 'pay',
 'bill',
 'delivery',
 'driver',
 'rude',
 'situation',
 'spoke',
 'large',
 'store',
 'policy',
 'understand',
 'mention',
 'wrong',
 'wife',
 'love',
 'hidden',
 'gem',
 'brett',
 'bar',
 'tender',
 'rick',
 'wonderful',
 'portion',
 'size',
 'perfect',
 'fact',
 'entertainment',
 'weekend',
 'small',
 'dance',
 'floor',
 'definitely',
 'side',
 'town',
 'enjoy',
 'lounge',
 'closer',
 'worth',
 'cheer',
 'pork',
 'fork',
 'easily',
 'bbq',
 'northwest',
 'valley',
 'building',
 'scent',
 'meat',
 'let',
 'right',
 'walk',
 'met',
 'easy',
 'menu',
 'board',
 'highly',
 'recommend',
 'sampler',
 'everything',
 'feature',
 'chicken',
 'sausage',
 'link',
 'far',
 'ever',
 'glaze',
 'complement',
 'mac',
 'cheese',
 'favorite',
 'meal',
 'shade',
 'fed',
 'twice',
 'potato',
 'dough',
 'bread',
 'bowl',
 'panera',
 'bacon',
 'however',
 'chewy',
 'caught',
 'teeth',
 'turkey',
 'avocado',
 'focaccia',
 'thrown',
 'together',
 'cut',
 'sandwich',
 'french',
 'dip',
 'attention',
 'detail',
 'warranty',
 'company',
 'nation',
 'apartment',
 'coverage',
 'person',
 'anything',
 'hsa',
 'faulty',
 'system',
 'fix',
 'speak',
 'simply',
 'care',
 'bottom',
 'use',
 'rooftop',
 'night',
 'decorate',
 'area',
 'coconut',
 'candle',
 'palm',
 'tree',
 'fixture',
 'tiki',
 'ate',
 'shrimp',
 'skewer',
 'hawaiian',
 'pizza',
 'drank',
 'cocktail',
 'glassware',
 'fun',
 'oishi',
 'kimchee',
 'rice',
 'miso',
 'shoyu',
 'bite',
 'lunch',
 'break',
 'cold',
 'confusion',
 'kid',
 'serious',
 'training',
 'process',
 'evaluation',
 'wendy',
 'location',
 'hit',
 'visit',
 'team',
 'army',
 'staff',
 'wait',
 'waitress',
 'seem',
 'particularly',
 'unfilled',
 'pepper',
 'salad',
 'pasta',
 'atmosphere',
 'minus',
 'tv',
 'last',
 'unremarkable',
 'flour',
 'calamari',
 'seal',
 'completely',
 'likely',
 'return',
 'alert',
 'married',
 'party',
 'bus',
 'bridal',
 'wedding',
 'extreme',
 'heat',
 'practically',
 'temperature',
 'absolutely',
 'conditioning',
 'unfair',
 'shady',
 'conditioner',
 'condition',
 'owner',
 'different',
 'occasion',
 'warn',
 'guest',
 'photo',
 'appearance',
 'addition',
 'uncomfortable',
 'plain',
 'unsafe',
 'law',
 'limo',
 'operating',
 'nevada',
 'cab',
 'living',
 'mountain',
 'edge',
 'hurry',
 'work',
 'errand',
 'etc',
 'rush',
 'spacious',
 'cute',
 'petite',
 'energy',
 'woman',
 'adorable',
 'normal',
 'drip',
 'coffee',
 'drinker',
 'us',
 'pour',
 'method',
 'cup',
 'fresh',
 'press',
 'rich',
 'super',
 'flavorful',
 'maybe',
 'intricate',
 'simple',
 'often',
 'job',
 'sometimes',
 'private',
 'space',
 'massage',
 'relaxed',
 'old',
 'beat',
 'continue',
 'sit',
 'relax',
 'boy',
 'pedicure',
 'chair',
 'professional',
 'hygienic',
 'nail',
 'salon',
 'vaughan',
 'shellac',
 'colour',
 'powder',
 'thief',
 'hard',
 'broke',
 'exam',
 'brought',
 'corporation',
 'george',
 'university',
 'campus',
 'computer',
 'laptop',
 'part',
 'technician',
 'display',
 'component',
 'college',
 'street',
 'scammer',
 'avoid',
 'simon',
 'tremendous',
 'pain',
 'biopsy',
 'breast',
 'dark',
 'purple',
 'sends',
 'icky',
 'hung',
 'begin',
 'unbelievable',
 'truly',
 'stand',
 'hire',
 'tonight',
 'rain',
 'miserable',
 'weather',
 'dinner',
 'saturday',
 'website',
 'checked',
 'fruitless',
 'adventure',
 'disappointed',
 'advertised',
 'horrible',
 'presume',
 'yell',
 'value',
 'moly',
 'korean',
 'rib',
 'crowd',
 'unfortunately',
 'trash',
 'est',
 'literally',
 'front',
 'casually',
 'start',
 'listen',
 'desire',
 'scream',
 'haha',
 'terrible',
 'overall',
 'afternoon',
 'postino',
 'special',
 'sangria',
 'hip',
 'door',
 'roll',
 'spectacular',
 'expensive',
 'furniture',
 'mover',
 'held',
 'damage',
 'broken',
 'item',
 'weight',
 'measure',
 'federal',
 'regulation',
 'stole',
 'blanket',
 'fantastic',
 'beautiful',
 'balloon',
 'ride',
 'gourmet',
 'field',
 'cooked',
 'breakfast',
 'availability',
 'intake',
 'mucho',
 'burrito',
 'eat',
 'anywhere',
 'incompetent',
 'skimpy',
 'mexican',
 'server',
 'prepared',
 'yikes',
 'uber',
 'become',
 'regular',
 'destination',
 'tip',
 'bye',
 'poutine',
 'sun',
 'hot',
 'dog',
 'greasiest',
 'spoon',
 'product',
 'decade',
 'passing',
 'montreal',
 'online',
 'late',
 'show',
 'quite',
 'described',
 'tech',
 'control',
 'downstairs',
 'fan',
 'output',
 'replacement',
 'cost',
 'alone',
 'install',
 'diagnostic',
 'card',
 'info',
 'charge',
 'circuit',
 'fee',
 'folk',
 'memorable',
 'regularly',
 'stood',
 'incredible',
 'steak',
 'later',
 'bento',
 'box',
 'chocolate',
 'souffle',
 'dessert',
 'attentive',
 'empty',
 'dietary',
 'insurance',
 'adjuster',
 'dispatcher',
 'plumber',
 'matt',
 'quickly',
 'plumbing',
 'fuku',
 'spring',
 'jones',
 'jazz',
 'asian',
 'fusion',
 'refill',
 'table',
 'plenty',
 'parking',
 'nickname',
 'mine',
 'garlic',
 'gravy',
 'sauce',
 'standard',
 'american',
 'style',
 'definately',
 'bring',
 'court',
 'countless',
 'main',
 'vendor',
 'lasagna',
 'breadstick',
 'swiss',
 'chalet',
 'basic',
 'normally',
 'pita',
 'healthier',
 'wrap',
 'particular',
 'forever',
 'coworker',
 'touch',
 'non',
 'surface',
 'glove',
 'straight',
 'previous',
 'encounter',
 'hygiene',
 'safety',
 'regimen',
 'afterall',
 'hospital',
 'hate',
 'visitor',
 'related',
 'illness',
 'unclean',
 'procedure',
 'extra',
 'dollar',
 'entree',
 'version',
 'ingredient',
 'limit',
 'chance',
 'potential',
 'safer',
 'wow',
 'torta',
 'thinly',
 'hoagie',
 'chip',
 'guacamole',
 'somewhat',
 'phoenix',
 'outstanding',
 'solid',
 'group',
 'format',
 'action',
 'leader',
 'verbal',
 'communication',
 'event',
 'sport',
 'worthwhile',
 'soft',
 'equipment',
 'military',
 'gear',
 'oak',
 'underwhelmed',
 'match',
 'honestly',
 'tasty',
 'presentation',
 'interior',
 'salty',
 'slightly',
 'spending',
 'forgettable',
 'steakhouse',
 'average',
 'solely',
 'scottish',
 'girl',
 'comped',
 'appetizer',
 'scope',
 'busy',
 'pickle',
 'butter',
 'pie',
 'blowing',
 'hubby',
 'rob',
 'roy',
 'beer',
 'selection',
 'whiskey',
 'snack',
 'bit',
 'purchase',
 'answer',
 'sushi',
 'teppanyaki',
 'sens',
 'basically',
 'screw',
 'gyoza',
 'rainbow',
 'katsu',
 'dry',
 'ramen',
 'scallop',
 'mussel',
 'yelp',
 'frankly',
 'aside',
 'sister',
 'doc',
 'choice',
 'sourdough',
 'rediculous',
 'instead',
 'whatevs',
 'sandwhich',
 'forgot',
 'smokey',
 'joe',
 'broccoli',
 'cheddar',
 'serve',
 'crust',
 'point',
 'lick',
 'plate',
 'finish',
 'soon',
 'dijon',
 'finger',
 'lol',
 'peanut',
 'banana',
 'longer',
 'piece',
 'fruit',
 'kiddos',
 'hungry',
 'stuff',
 'personal',
 'speaks',
 'quote',
 'picked',
 'somewhere',
 'luggage',
 'big',
 'gift',
 'dresser',
 'packed',
 'professionally',
 'mattress',
 'fold',
 'crack',
 'mark',
 ...]
In [ ]:
 
In [506]:
#show similarity words
from sklearn.decomposition import PCA

X = model.wv[model.wv.vocab] # word vectors, one row per vocabulary word
pca = PCA(n_components=2) #PCA for two dimensions
result = pca.fit_transform(X)
In [513]:
#display similarity words
plt.figure(figsize=(9,6))
plt.scatter(result[:50, 0], result[:50, 1]) #scatter plot
words = list(model.wv.vocab)[:50] # 50 words
for i, word in enumerate(words):
    #print(word ,':', (result[i, 0], result[i, 1]))
    plt.annotate(word, xy=(result[i, 0], result[i, 1])) #words with position(x,y)
    
plt.show()
In [ ]:
 
In [514]:
#show top 10 similar words
model.wv.most_similar('food', topn=10)
Out[514]:
[('meal', 0.5858632922172546),
 ('restaurant', 0.5699456930160522),
 ('takeout', 0.5561284422874451),
 ('eat', 0.5066172480583191),
 ('dish', 0.4767340421676636),
 ('sushi', 0.47122249007225037),
 ('quantity', 0.46635472774505615),
 ('diner', 0.4634217619895935),
 ('hungry', 0.4623253345489502),
 ('dinner', 0.45887452363967896)]
In [516]:
model.wv.most_similar('bad', topn=10)
Out[516]:
[('terrible', 0.7569934129714966),
 ('horrible', 0.7525511980056763),
 ('awful', 0.7255657911300659),
 ('unfortunately', 0.6952295899391174),
 ('sad', 0.6716634035110474),
 ('pathetic', 0.6319237351417542),
 ('poor', 0.6158181428909302),
 ('lousy', 0.6153324842453003),
 ('crappy', 0.6118960380554199),
 ('slowest', 0.6104698777198792)]

Split Train and Test Data

In [543]:
# split train and test dataset with stratify fashion
X_train_w2v, X_test_w2v, y_train, y_test = train_test_split(final_data_shuffle['new_text_wordlist'],
                                                    final_data_shuffle['star'], 
                                                    test_size=0.2,
                                                    shuffle= True,
                                                    stratify= final_data_shuffle['star'],
                                                    random_state=37)
In [544]:
X_train_w2v.head(3)
Out[544]:
46119    [so, today, lunch, first, thing, notice, walk,...
46411    [chipotle, food, definitely, food, chipotle, t...
52034    [tiabi, new, aliante, location, quite, speed, ...
Name: new_text_wordlist, dtype: object
In [ ]:
 

Note: We have to build a final vector (document vector) for each document to feed to the classifiers. Thus, we average the vectors of the words in each document to get the final vector that the classifiers are applied to.
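
As a toy illustration of this averaging step, the sketch below uses made-up 3-dimensional vectors in a plain dictionary rather than a trained Word2Vec model; `toy_vectors` and `average_vector` are hypothetical names. Words without a vector are skipped, and a document with no known words falls back to a zero vector.

```python
# Hypothetical 3-dimensional word vectors, made up for illustration.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "food": [3.0, 2.0, 0.0],
}


def average_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(column) / len(known) for column in zip(*known)]


print(average_vector(["good", "food", "unknownword"], toy_vectors))  # [2.0, 1.0, 1.0]
print(average_vector(["unknownword"], toy_vectors))                  # [0.0, 0.0, 0.0]
```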

In [545]:
#average word vectors in each document
def compute_w2v_vector(word2vec, document):
    """
    Average the vectors of the words in a document

    parameters:
        word2vec : Word2VecKeyedVectors
        document : list of tokens

    return:
        document vector

    """

    list_of_word_vectors = [word2vec[w] for w in document if w in word2vec.vocab] #vectors of in-vocabulary words

    if len(list_of_word_vectors) == 0: #empty document or no known words
        document_vec = np.zeros(word2vec.vector_size)
    else:
        document_vec = np.mean(list_of_word_vectors, axis=0) #average the vectors

    return document_vec
In [ ]:
 
In [546]:
#average the vectors for each word in the document
X_train_w2v1 = X_train_w2v.apply(lambda x: compute_w2v_vector(model.wv, x))
X_test_w2v1 = X_test_w2v.apply(lambda x: compute_w2v_vector(model.wv, x))
In [ ]:
 
In [547]:
#show in dataframe
X_train_w2v = pd.DataFrame(X_train_w2v1.values.tolist(), index= X_train_w2v1.index)
X_test_w2v = pd.DataFrame(X_test_w2v1.values.tolist(), index= X_test_w2v1.index)
In [548]:
X_train_w2v.head(3)
Out[548]:
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
46119 0.070542 0.044412 -0.466224 -0.111124 -0.564145 -0.43579 0.747601 0.057139 0.614724 0.650845 ... 0.110486 0.027037 -0.518195 -0.744523 -0.504551 -0.189931 0.535778 0.056806 -0.156436 -0.079884
46411 -0.114609 0.475144 0.184554 0.079465 -0.874726 0.16165 0.599514 -0.192039 1.483526 0.186847 ... 0.639947 0.015811 -0.639189 -0.618532 -0.194666 0.236109 -0.176950 0.399007 0.383095 -0.281874
52034 -0.042295 0.218893 -0.350009 -0.327345 -0.369176 -0.41402 0.026938 -0.072279 0.523898 -0.153059 ... 0.308277 0.199971 -0.571325 -0.504209 -0.174392 -0.342263 0.417420 -0.134307 -0.067929 0.192630

3 rows × 100 columns

In [ ]:
 

LogisticRegression

In [549]:
%%time
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'newton-cg')
best_logic_param_w2v= grid_param_tune( clf_log, para_log, X_train_w2v, X_test_w2v, y_train, y_test)
Performing grid search...
pipeline: ['clf']
parameters:
{'clf__C': (0.25, 0.5, 1.0)}

Best cv score: 0.961

Best Parameter set:
	clf__C: 0.5
Score on test data with best parameter: 0.960

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.97      0.95      0.96      6000
           3       0.93      0.95      0.94      6000
           5       0.98      0.97      0.98      6000

    accuracy                           0.96     18000
   macro avg       0.96      0.96      0.96     18000
weighted avg       0.96      0.96      0.96     18000

CPU times: user 48 s, sys: 596 ms, total: 48.6 s
Wall time: 26.5 s
In [ ]:
 

3.2.5 Doc2Vec

Unlike Word2Vec, Doc2Vec converts each document directly into a document vector. It comes with two implementations: 1) Paragraph Vector - Distributed Memory (PV-DM) and
2) Paragraph Vector - Distributed Bag of Words (PV-DBOW)

More details can be found here: Doc2Vec

Note: Doc2Vec requires each document to be tagged with a number/tag. So, we have to tag each document before feeding it to the model.

In [552]:
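#associate with tag/number for each document
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tag_documents_train = [TaggedDocument(doc, [i]) for i, doc in enumerate(X_train)]
tag_documents_test = [TaggedDocument(doc, [i]) for i, doc in enumerate(X_test)] #used below to infer test vectors
tag_documents_train[0]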
Out[552]:
TaggedDocument(words=['so', 'today', 'lunch', 'first', 'thing', 'notice', 'walk', 'salad', 'pre', 'prepared', 'case', 'much', 'salad', 'menu', 'meatball', 'sandwich', 'tomato', 'basil', 'soup', 'financier', 'pick', 'food', 'basket', 'meatball', 'sandwich', 'meatball', 'bread', 'tomato', 'basil', 'soup', 'similar', 'tomato', 'sauce', 'meatball', 'sandwich', 'so', 'part', 'meal', 'financier', 'choose', 'pastry', 'never', 'really', 'food', 'sandwich', 'big', 'eat', 'however', 'lot', 'bird', 'bee', 'ant', 'hill', 'etc', 'customer', 'afterwards', 'morning', 'cafe', 'really', 'busy'], tags=[0])
In [ ]:
 
In [457]:
%%time
#train Doc2Vec model
model_doc2vec_train =Doc2Vec(tag_documents_train, vector_size=100, window=5, min_count=5, workers=4, epochs=40)
CPU times: user 10min 57s, sys: 1min 23s, total: 12min 20s
Wall time: 7min 46s
In [ ]:
 
In [469]:
#get doc vector for train and test before applying into classifiers
X_train_d2v = np.array([model_doc2vec_train.docvecs[x] for x in range(len(tag_documents_train))]) #get doc2vec and convert numpy
X_test_d2v = np.array([model_doc2vec_train.infer_vector(x[0]) for x in tag_documents_test]) #convert test data into doc2vec
In [ ]:
 

Logistic Regression

In [531]:
%%time
#logisticRegression for multi-class
clf_log = LogisticRegression(multi_class= 'multinomial', solver= 'lbfgs')
best_logic_param_d2v= grid_param_tune( clf_log, para_log, X_train_d2v, X_test_d2v, y_train, y_test)
Performing grid search...
pipeline: ['clf']
parameters:
{'clf__C': (0.25, 0.5, 1.0)}

Best cv score: 0.947

Best Parameter set:
	clf__C: 0.25
Score on test data with best parameter: 0.948

Classification Reports on Test Data:
              precision    recall  f1-score   support

           1       0.98      0.92      0.95      6000
           3       0.92      0.94      0.93      6000
           5       0.94      0.98      0.96      6000

    accuracy                           0.95     18000
   macro avg       0.95      0.95      0.95     18000
weighted avg       0.95      0.95      0.95     18000

CPU times: user 9.05 s, sys: 194 ms, total: 9.24 s
Wall time: 4.8 s
In [ ]:
 

Predict new document(review)

In [648]:
#new document
final_data.iloc[30001]
Out[648]:
text         bathroom dirty dead bug spend entire clean rig...
star                                                         1
sentiment                                             negative
Name: 30001, dtype: object
In [ ]:
 
In [ ]:
#tag all documents (train data + test data)
tag_documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(final_data_shuffle['new_text_wordlist'])] 

#train Doc2Vec model
model_doc2vec_final =Doc2Vec(tag_documents, vector_size=100, window=5, min_count=5, workers=4, epochs=40)
In [642]:
#get doc2vec and convert numpy
X_train_d2v_final = np.array([model_doc2vec_final.docvecs[x] for x in range(len(tag_documents))]) 
In [646]:
#logisticRegression final model with doc2vec
log_final_doc2vec = LogisticRegression(multi_class= 'multinomial', solver= 'lbfgs', C= 0.25)
log_final_doc2vec.fit(X_train_d2v_final, final_data_shuffle.star) #fit with all data
Out[646]:
LogisticRegression(C=0.25, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [650]:
#predict new document
doc2vec_newdocument= model_doc2vec_final.infer_vector(word_tokenize(final_data.iloc[30001].text)) #tokenize, then convert into doc2vec
log_final_doc2vec.predict([doc2vec_newdocument]) #predict
Out[650]:
array([1])
In [ ]:
 

4. Conclusion

In this project, different word embedding techniques (which convert words into numbers) are applied to sentiment classification. With logistic regression, the scores on test data were 0.972 for CountVectorizer, 0.969 for TF-IDF, 0.968 for HashingVectorizer, 0.960 for Word2Vec and 0.948 for Doc2Vec. Hashing and Word2Vec/Doc2Vec perform well, and CountVectorizer and TF-IDF also do very well after the negative and positive words are removed from the opposite class.

The final model is built with TF-IDF and LogisticRegression. It is used for the web demo, which you can find here: Web Demo
