IDS 572: Text Mining Project

Introduction

Yelp is a platform through which users can review their experiences with a wide variety of businesses. Each review consists of a text portion as well as a star rating on a 1-to-5 scale. This project takes a subset of the restaurant reviews on Yelp and uses text mining to draw conclusions about the relationship between the words in the text portion of a review and its star rating. Specifically, we will utilize the "bag of words" approach to text mining and apply three individual dictionaries. The objective is to identify which reviews are positive and which are negative based on the content of the text portion of the review.

Data Exploration and Preparation

To begin, we explored the data in order to determine some basic information about the ratings in the provided dataset. The star ratings are distributed somewhat unevenly throughout the dataset, as demonstrated in the following histogram.

In order to perform sentiment analysis, the star ratings must be transformed into a binary classification with two classes indicating positive and negative reviews. To do so, we will eliminate all reviews with a 3-star rating, since these reviews are considered neutral rather than positive or negative. Ratings of 1 and 2 stars will be considered negative, while ratings of 4 and 5 stars will be considered positive.

Before continuing to the sentiment analysis, though, we will examine a few words which are present in the text reviews and see if they relate to specific star ratings. Specifically, we will focus on the words 'funny', 'cool' and 'useful', all of which we would expect to be related to positive reviews.

It's evident in the above plot that the word 'funny' is most commonly used in 4 star reviews. It's not very common in the negative reviews, which makes sense considering funny is generally a positive quality.

Similarly, 'cool' is generally associated with positive reviews. It is interesting that this word seems to be used in 3-star ratings even more than 5-star ratings, but it is clearly most common in the 4-star ratings.

Finally, the word useful is also commonly used within the positive reviews. The patterns are similar to the previous words.

Before performing sentiment analysis, we also made some modifications to the data. We removed all reviews which did not come from a 5-digit postal code. We then tokenized the reviews, which converts each review from one long string of text into individual words. This prepares the data for sentiment analysis, as each individual word from the review will need to be compared to the dictionary. The order of the words is not relevant since we will use the "bag of words" approach.

Next, we removed stop words because they are not helpful in understanding the meaning behind the review. Stop words include words such as 'and', 'a', and 'the', which are present across all reviews regardless of the review's content. Removal of the stop words decreased the total number of tokens (unique words) from 68,204 to 67,505.
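The tokenization and stop-word removal steps can be sketched with the tidytext workflow in R; the data frame name resReviews and its columns (review_id, stars, text) are assumptions about how the review data is stored.

library(dplyr)
library(tidytext)

# Tokenize each review into one word per row (bag-of-words representation)
rrTokens <- resReviews %>%
  select(review_id, stars, text) %>%
  unnest_tokens(word, text)

# Remove stop words ('a', 'and', 'the', ...) using tidytext's built-in list
rrTokens <- rrTokens %>% anti_join(stop_words, by = "word")

# Number of distinct tokens remaining
n_distinct(rrTokens$word)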

We also wanted to remove any additional words which are either present in most reviews or in very few reviews. Rare words included several words with numbers such as '12pm', as well as some words in other languages and words that are not very relevant to restaurants, such as 'courthouse'. The most common word, used in over 20,000 reviews, is 'food', which does not indicate a positive or negative sentiment since a review could either compliment or complain about the food. Therefore, we removed all words used in fewer than 10 reviews or in more than 15,000 reviews.

We also removed any other numeric words since they don't have much meaning in terms of sentiment analysis. All of this brought down the number of tokens to 8,649. This is a significant reduction in tokens from the initial set of 68,204.
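A minimal sketch of this pruning step, assuming the tokenized data frame rrTokens from the earlier sketch and the document-frequency cutoffs described above:

library(dplyr)
library(stringr)

# Count how many distinct reviews each word appears in
word_docs <- rrTokens %>%
  group_by(word) %>%
  summarise(n_reviews = n_distinct(review_id), .groups = "drop")

# Keep words used in at least 10 and at most 15,000 reviews
keep_words <- word_docs %>%
  filter(n_reviews >= 10, n_reviews <= 15000)

rrTokens <- rrTokens %>%
  semi_join(keep_words, by = "word") %>%
  filter(!str_detect(word, "[0-9]"))   # drop remaining numeric tokens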

Data Analysis

Next, we analyzed the frequency of words in each rating and calculated the proportion of word occurrences by star rating. We checked the proportions of 'love' (positive sentiment) and 'worst' (negative sentiment) across the star ratings. We can clearly see that ratings 4 and 5 represent positive reviews, while ratings 1 and 2 are more related to negative reviews.

Afterwards, we computed the number of occurrences and the probability of each word by rating. We plotted the top 20 words in each rating to understand the differences between ratings. According to the plot shown below, some words are common across ratings, such as 'service', 'restaurant', 'menu', 'table', 'people' and 'time'. We have to prune this set of common words from the token list since these words are not useful for understanding differences among reviews. As expected, ratings 1 and 2 include negative words such as 'bad', 'worst', 'horrible' and 'wait'. On the other hand, the higher ratings (4 and 5) contain positive words such as 'delicious', 'amazing', 'pretty' and 'nice'.

To understand which words are generally related to higher and lower ratings, we calculated the average star rating associated with each word, based on the star ratings of the reviews in which the word occurs. On that basis, the top 20 words with the highest average ratings include general words ('restaurant', 'service', 'menu') and positive words ('nice', 'delicious', 'love' and 'friendly').

In contrast, the words with the lowest average star ratings are generally negative, such as 'disgust', 'disrespectful', 'unwilling' and 'bullshit'. This analysis highlights the fundamental difference between the ratings.

As a result of the exploratory data analysis, we removed common words such as 'food', 'chicken', 'service', 'time' and 'restaurant' from the token list. We eliminated these words to improve the performance of the sentiment analysis, since they do not help in distinguishing positive documents from negative ones.

Stemming and Lemmatization

After tokenizing and removing stopwords, we are able to perform stemming or lemmatization. Stemming reduces inflected words to their root forms, while lemmatization groups together the diverse inflected forms of a word so they can be analyzed as a single item. When converting a word to its root form, stemming can create non-existent words, whereas lemmatization generates real dictionary words. The table below shows the difference between stemmed and lemmatized words, and a short sketch of both operations follows the table:

Original Stemming Lemma
night night night
friends friend friend
improved improv improve
servers server server
attentive attent attentive
previous previou previous
experience experi experience
recommended recommend recommend
clams clam clam
lemon lemon lemon
pepper pepper pepper
seasoning season season
salty salti salty
louisiana louisiana louisiana
style style style
tasted tast taste
hot hot hot
juicy juici juicy
seasoning season season
pretty pretti pretty
recommend recommend recommend
hot hot hot
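The comparison above can be reproduced with, for example, the SnowballC stemmer and the textstem lemmatizer; these package choices are assumptions, since the report does not name the tools used, and results can differ slightly between stemmers and lemmatizers.

library(SnowballC)   # Porter-style stemming
library(textstem)    # dictionary-based lemmatization

words <- c("friends", "improved", "attentive", "experience", "seasoning", "tasted")

data.frame(
  original = words,
  stem     = wordStem(words, language = "english"),
  lemma    = lemmatize_words(words)
)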

Term-Frequency

We used the lemmatized words rather than the stemmed words and filtered out tokens shorter than 3 characters or longer than 15 characters to decrease the number of tokens. Then, we computed tf-idf scores in order to run the sentiment analysis. Tf-idf is a statistic which reflects how important a word is to a document in a collection of documents. Term frequency (tf) identifies the frequency of an individual term within a document. We also need to capture the importance that words carry across documents: inverse document frequency (idf) decreases the weight of commonly used words and increases the weight of words that are rarely used in the collection of documents. The tf-idf score is calculated by multiplying these two scores. The table below demonstrates the tf-idf scores of the first review.

review_id stars word n tf idf tf_idf
--9qM_dRW4rrKTWO_SX_qQ 1 buffet 1 0.1111111 4.075566 0.4528407
--9qM_dRW4rrKTWO_SX_qQ 1 copper 1 0.1111111 6.862339 0.7624822
--9qM_dRW4rrKTWO_SX_qQ 1 kettle 1 0.1111111 7.373165 0.8192406
--9qM_dRW4rrKTWO_SX_qQ 1 price 1 0.1111111 1.919269 0.2132522
--9qM_dRW4rrKTWO_SX_qQ 1 soo 1 0.1111111 6.048240 0.6720266
--9qM_dRW4rrKTWO_SX_qQ 1 star 1 0.1111111 2.466480 0.2740534
--9qM_dRW4rrKTWO_SX_qQ 1 suck 1 0.1111111 4.652496 0.5169440
--9qM_dRW4rrKTWO_SX_qQ 1 tempe 1 0.1111111 5.565657 0.6184064
--9qM_dRW4rrKTWO_SX_qQ 1 top 1 0.1111111 2.768948 0.3076609
--9vqlJ0xGKY2L1Uz-L9Eg 3 appetizer 1 0.0212766 3.145984 0.0669358
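A sketch of this computation with tidytext's bind_tf_idf(), which uses idf(t) = ln(total documents / documents containing t) and tf-idf = tf × idf; the lemmatized token data frame rrTokens_lemma is an assumed name.

library(dplyr)
library(stringr)
library(tidytext)

# Count each lemmatized word within each review, then attach tf, idf and tf-idf
review_tfidf <- rrTokens_lemma %>%
  filter(str_length(word) >= 3, str_length(word) <= 15) %>%
  count(review_id, stars, word, name = "n") %>%
  bind_tf_idf(word, review_id, n)

head(review_tfidf)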

Sentiment Analysis

In the sentiment analysis, we applied three different dictionaries:

  • Bing
  • NRC
  • AFINN

Bing Dictionary

We first focused on the Bing dictionary, which includes 6,786 words and their sentiments. Sentiments are described as either positive or negative, and most of the words (4,781) are negative.

To assign sentiments to the words in the documents, we applied the Bing dictionary using an inner join. There were 935 distinct words in the review data matching the Bing dictionary. We then added up the occurrences of positive and negative sentiment words in the reviews. The plot demonstrates the most common positive and negative words in the reviews: while 'love', 'nice', 'delicious', 'friendly' and 'pretty' are the most popular positive words, 'bad', 'disappoint', 'die', 'hard' and 'cold' represent the most common negative sentiments.
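A minimal sketch of the matching step, assuming the tidytext lexicon interface and the rrTokens data frame from earlier:

library(dplyr)
library(tidytext)

# Bing lexicon: columns word and sentiment ("positive" / "negative")
bing <- get_sentiments("bing")

# Keep only review tokens that appear in the lexicon
rrSenti_bing <- rrTokens %>%
  inner_join(bing, by = "word")

# Most frequent positive and negative words across all reviews
rrSenti_bing %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10)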

Up to this point we have analyzed overall sentiment across reviews; now we concentrate on sentiment by review to understand how it relates to the review's star rating. For each review, we counted the positive and negative words and converted these counts into the probabilities of being positive and negative. Lastly, we created a sentiment score by taking the difference between the positive and negative scores of the review.

stars avgPos avgNeg avgSentiSc
1 0.2924336 0.7075664 -0.4151328
2 0.4464544 0.5535456 -0.1070913
3 0.5964308 0.4035692 0.1928615
4 0.7373631 0.2626369 0.4747263
5 0.8109949 0.1890051 0.6219898

Using the sentiment scores of the reviews, we computed the average positive and negative scores for each rating, as sketched below. According to the table above, star ratings 1 and 2 represent negative reviews since their average sentiment score is below zero, while star ratings 4 and 5 correspond to positive reviews.
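A sketch of the per-review scoring just described, assuming the Bing-matched tokens rrSenti_bing from the earlier sketch:

library(dplyr)
library(tidyr)

# Per-review counts of positive and negative words, their proportions, and the difference
rrSenti_score <- rrSenti_bing %>%
  count(review_id, stars, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(posProp = positive / (positive + negative),
         negProp = negative / (positive + negative),
         sentiSc = posProp - negProp)

# Average proportions and sentiment score by star rating
rrSenti_score %>%
  group_by(stars) %>%
  summarise(avgPos = mean(posProp), avgNeg = mean(negProp),
            avgSentiSc = mean(sentiSc))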

We built a document-term matrix of the reviews dataset. In a document-term matrix, rows correspond to reviews and columns correspond to words. We filtered out reviews with a rating of 3, since the sentiment score of rating 3 is positive but low (these reviews include both negative and positive content). Then, when the star rating of a review is 1 or 2, we assigned the review to class -1; all other reviews belong to class 1. Based on the table, most reviews are assigned to class 1 and only 6,670 reviews correspond to negative reviews (see the sketch after the table). The document-term matrix has 27,473 rows and 937 columns.

hiLo n
-1 6670
1 20803
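A sketch of this step, building a wide document-term representation with tf-idf values and the class label; the object names follow the earlier sketches and are assumptions.

library(dplyr)
library(tidyr)

# Wide document-term matrix with tf-idf values, plus the hiLo class label
revDTM_bing <- rrSenti_bing %>%
  inner_join(select(review_tfidf, review_id, word, tf_idf),
             by = c("review_id", "word")) %>%
  filter(stars != 3) %>%                         # drop neutral 3-star reviews
  mutate(hiLo = ifelse(stars <= 2, -1, 1)) %>%   # 1-2 stars -> -1, 4-5 stars -> 1
  pivot_wider(id_cols = c(review_id, hiLo),
              names_from = word, values_from = tf_idf,
              values_fill = 0, values_fn = sum)

table(revDTM_bing$hiLo)   # class distribution, as in the table above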

NRC Dictionary

For the second dictionary, we used the NRC dictionary, again with an inner join. Rather than just identifying words as positive or negative, this dictionary assigns one or more specific sentiments to each word. For example, a word may portray 'anger', 'trust' or 'negative'. Within the dictionary, the most common sentiment class for the words is 'negative'.

The dictionary had 2,831 matching terms with the review data. The occurrences of these sentiments in the dataset are summarized in the following table. 'Count' is the number of unique words associated with that sentiment in the dataset, and 'sumn' is the sum of all occurrences of those words across the different reviews.

sentiment count sumn
anger 186 30079
anticipation 249 71521
disgust 164 23847
fear 201 27001
joy 237 87548
negative 499 76956
positive 636 168393
sadness 182 31661
surprise 145 33290
trust 332 85883

We checked in more detail the words which were associated with the different emotions in the dataset. Here are the top three words for each of the sentiments within the reviews. It's obvious that some of the words are associated with multiple sentiments.
  • Anger: bad, hot, disappoint
  • Anticipation: wait, friendly, pretty
  • Disgust: bad, disappoint, finally
  • Fear: bad, die, war
  • Joy: love, delicious, friendly
  • Negative: wait, bad, bite
  • Positive: love, delicious, friendly
  • Sadness: bad, die, leave
  • Surprise: amaze, leave, sweet
  • Trust: friendly, pretty, star

We can consider some of these emotions such as anger, disgust, fear, sadness and negative to be 'bad', while positive, joy, anticipation and trust are 'good.' This way, we can group together all of the positive and negative emotions to determine which words in the review are most commonly good and bad. After grouping the emotions together, we can view the most positive and negative reviews in the same way as with the Bing dictionary.

It can be seen that some of the words are both positive and negative. This is possible with the NRC dictionary since words can fall into more than one sentiment, as demonstrated above. For example, the word 'wait' could be used in multiple contexts. One review could say 'the restaurant had a long wait' and leave a negative star rating. Alternatively, someone could say 'I cannot wait to eat here again!' and leave 5 stars. This dictionary helps to acknowledge this fact.
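A sketch of the NRC matching and the good/bad grouping described above; get_sentiments("nrc") may prompt a one-time download via the textdata package, and the grouping mirrors the emotion sets listed earlier (surprise is left unassigned in this sketch).

library(dplyr)
library(tidytext)

# NRC lexicon: one row per word-emotion pair
nrc <- get_sentiments("nrc")

rrSenti_nrc <- rrTokens %>%
  inner_join(nrc, by = "word") %>%
  mutate(goodBad = case_when(
    sentiment %in% c("anger", "disgust", "fear", "sadness", "negative") ~ "bad",
    sentiment %in% c("positive", "joy", "anticipation", "trust")        ~ "good",
    TRUE ~ NA_character_)) %>%
  filter(!is.na(goodBad))

# Most frequent 'good' and 'bad' words across reviews
rrSenti_nrc %>%
  count(word, goodBad, sort = TRUE) %>%
  group_by(goodBad) %>%
  slice_max(n, n = 10)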

Once again, we created a document-term matrix using the terms in the NRC dictionary and filtered out the reviews with a rating of 3. We again assigned ratings of 1 and 2 stars to the -1 class and ratings of 4 and 5 stars to the 1 class. Below is the distribution of the classes, which is still highly imbalanced, as with the Bing dictionary. The document-term matrix has 28,061 rows and 1,310 columns.

hiLo n
-1 6893
1 21168

AFINN Dictionary

The third dictionary that we will consider is the AFINN dictionary. Rather than simply labeling the words as positive or negative in this dictionary, AFINN assigns a score from -5 to 5 to every word. This way, a word scored at -5 can be thought of as more negative than -3, and 5 as more positive than 3. For example, the word 'breathtaking' is very positive with a rating of 5, and 'ability' is still positive with a score of 2, but less positive than the 5. As is the case with the other two dictionaries, this one contains more words in the negative scores than the positive ones, with -2 being the most frequent score within the dictionary.

We used an inner join for this dictionary, and there were 518 matching terms with the reviews. Within these specific reviews, we can see which values are most commonly used in the following table. Despite the fact that the dictionary contains the most words with the -2 score, words with a score of 2 are most commonly used within the reviews with a total of 40,681 occurrences. This is expected, since the distribution of the reviews is skewed as shown earlier, and there are more positive reviews than negative within the dataset.

value count sumn
-5 1 13
-4 9 927
-3 46 11867
-2 134 17098
-1 67 14476
1 64 20967
2 125 40681
3 57 31282
4 13 7747
5 2 696

The values can be used to create an overall score for the review. This can occur by adding up the values associated with each of the individual words within the review. Then, we can examine the average score for each of the star ratings.
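A sketch of this scoring, assuming the tidytext AFINN lexicon (accessed via the textdata package) and the rrTokens data frame from earlier; the averages correspond to the sentiment-score column of the table below.

library(dplyr)
library(tidytext)

# AFINN lexicon: columns word and value (integer scores from -5 to 5)
afinn <- get_sentiments("afinn")

# Sum the word values within each review to get an overall review score
review_afinn <- rrTokens %>%
  inner_join(afinn, by = "word") %>%
  group_by(review_id, stars) %>%
  summarise(sentiSc = sum(value), .groups = "drop")

# Average score by star rating
review_afinn %>%
  group_by(stars) %>%
  summarise(avgSentiSc = mean(sentiSc))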

As with the other dictionaries, we checked which words were associated with the most positive and negative sentiments. To do so, we designated all words with a value of -1 to -5 as 'bad', and all words with a value between 1 and 5 as 'good'; the resulting sentiment analysis is shown below. One interesting observation is that 'love' is the most positive word and 'bad' the most negative word across all three dictionaries.

Stars Avg Length Avg Sentiment Score
1 4.003976 -2.3493326
2 4.313061 0.7138584
3 4.314566 3.1422384
4 4.246678 5.5467348
5 3.963570 6.4570960

Something surprising is that the 2-star reviews have a very slightly positive average score. As expected, though, the average score increases consistently from 1 to 5 star reviews.

Finally, we completed one more document-term matrix and assigned reviews to the two classes. For AFINN, we removed the 2-star reviews, as their average sentiment score was rather neutral at only 0.71. Then, we assigned the 1-star reviews to the '-1' class, and the 3, 4, and 5-star reviews to the '1' class. This leads to a less balanced dataset than with the other dictionaries, but is the best choice considering the average sentiment score across the different star ratings. The document-term matrix has 28297 rows and 520 columns.

hiLo n
-1 3521
1 24776

Model Design

For each dictionary, we inner joined the dictionary with the terms in the reviews to form a document-term matrix, which we used to assign the data to two classes. We used the tf-idf score of each word in each review as an independent variable. Tf-idf is a statistic which represents how important a word is to a document in a collection of documents; it is the product of the tf and idf scores. By applying the tf-idf score we considered both the frequency of individual terms within a document and the importance of words across documents: the importance grows proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. We then took a sample of 12,000 reviews from the original datasets to decrease complexity, and divided the sample data into training and test data at a ratio of 65:35.

We built several models, including a random forest, a generalized linear model and a Naive Bayes classifier, as sketched below. For each model, we present the results of the model on the training and test datasets.
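A sketch of this setup under the split described above, using one plausible set of R packages (ranger for the random forest, glmnet for the regularized GLM and e1071 for Naive Bayes); the report does not name the packages, the seed, or the object names, so all of these are assumptions.

library(dplyr)
library(ranger)   # random forest
library(glmnet)   # lasso / ridge logistic regression
library(e1071)    # Naive Bayes

set.seed(123)

# Sample 12,000 reviews and split 65:35 into training and test sets
revSample <- revDTM_bing %>% sample_n(12000)
trainIdx  <- sample(nrow(revSample), size = 0.65 * nrow(revSample))
revTrain  <- revSample[trainIdx, ]
revTest   <- revSample[-trainIdx, ]

xTrain <- as.matrix(select(revTrain, -review_id, -hiLo))
yTrain <- factor(revTrain$hiLo)

# Random forest with class probabilities (here 120 trees)
rfModel <- ranger(x = xTrain, y = yTrain, num.trees = 120, probability = TRUE)

# Lasso-regularized logistic regression with 5-fold cross-validation
glmModel <- cv.glmnet(xTrain, yTrain, family = "binomial", alpha = 1, nfolds = 5)

# Naive Bayes with Laplace smoothing
nbModel <- naiveBayes(x = xTrain, y = yTrain, laplace = 1)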

Machine Learning Models

Bing Dictionary

Random Forest

We implemented a random forest model using different numbers of trees: 70, 120 and 180. We found that there is no significant improvement (just 0.5%) when we increase the number of trees, so we decided on 120 trees. The table shows the confusion matrix on the training dataset (threshold = 0.5):

##       preds
## actual FALSE TRUE
##     -1  1629  221
##     1     54 5896

The following table shows the confusion matrix on the test dataset:

##       preds
## actual FALSE TRUE
##     -1   694  361
##     1    152 2993

This table shows the performance metrics on the test dataset:

Metric Value
Accuracy 0.878
Recall 0.952
Precision 0.892

The plot below shows the ROC curves of the random forest model on the training and test datasets; the blue line represents the model's performance on the training data, and the red line corresponds to the test data.

Then, we determined the best threshold (0.6193) from the ROC curve and recreated the confusion matrix on the test data. It can be clearly seen that the random forest model with the optimal threshold separates the two classes of reviews much better.

##       preds
## actual FALSE TRUE
##     -1   772  283
##     1    253 2892

The table shows the performance metrics on the test dataset at the optimal threshold:

Metric Value
Accuracy 0.872
Recall 0.92
Precision 0.911
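The threshold selection can be sketched with the pROC package (an assumption; the report does not name the package used): coords(..., "best") returns the cutoff that best balances sensitivity and specificity, which is then applied to the test-set probabilities.

library(dplyr)
library(pROC)

# Predicted probability of the positive class (hiLo = 1) on the test set
testProbs <- predict(rfModel,
                     data = as.matrix(select(revTest, -review_id, -hiLo)))$predictions[, "1"]

rocTest <- roc(response = revTest$hiLo, predictor = testProbs)

# Threshold that best balances sensitivity and specificity
bestThresh <- as.numeric(coords(rocTest, "best", ret = "threshold")[[1]][1])

# Confusion matrix at the optimal threshold instead of 0.5
table(actual = revTest$hiLo, preds = testProbs > bestThresh)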

We checked which variables carried more weight in the model when predicting the class of a review. According to the importance measures produced by the random forest, the 10 most significant words include both positive and negative words, but the majority of them are positive ('bland', 'fresh', 'friendly', 'fun', 'awesome', etc.).

Generalized Linear Model

We ran a generalized linear model using different regularization parameters and observed that the model with lasso performs slightly better than the model with ridge. We also used 5-fold cross-validation to avoid overfitting, and then selected the best regularized model on the test data. The table shows the confusion matrix on the training dataset:

##       preds
## actual FALSE TRUE
##     -1  1260  590
##     1    107 5843

This table shows the confusion matrix on the test dataset:

##       prediction
## actual   -1    1
##     -1  666  389
##     1   132 3013

The following table shows the performance metrics of the model on the test dataset:

Metric Value
Accuracy 0.876
Recall 0.958
Precision 0.886

The plot below shows the ROC curves of the generalized linear model on the training and test datasets; the blue line represents the model's performance on the training data, and the red line corresponds to the test data.

Naive-Bayes Model

The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem. We implemented the Naive Bayes classifier with Laplace smoothing. The table shows the confusion matrix on the training dataset (threshold = 0.5):

##       pred
## actual FALSE TRUE
##     -1  1525  325
##     1   3929 2021

The table shows the confusion matrix on the test dataset:

##       preds
## actual FALSE TRUE
##     -1   928  127
##     1   2089 1056

The plot below shows the ROC curves of the Naive Bayes classifier on the training and test datasets; the blue line represents the model's performance on the training data, and the red line corresponds to the test data.

Then, we determined the best threshold from the ROC curve and recreated the confusion matrix on the test data. It can be clearly seen that the Naive Bayes model with the optimal threshold separates the two classes of reviews much better.

The table shows the performance metrics of the model on the test dataset:

Metric Value
Accuracy 0.71
Recall 0.727
Precision 0.864

NRC dictionary

Random Forest

We once again tried random forest models with three sizes, 70, 120 and 180 trees, in order to compare their performance. The performance of the three is essentially the same, with accuracy differences of only 0.3%, so we selected the 70-tree model as the best because it requires less computation time than the others. The confusion matrix of this model on the training data, using the 0.5 threshold as the cutoff between classes:

##       preds
## actual FALSE TRUE
##     -1  1702  222
##     1     43 6001

The confusion matrix from this model on the testing data:

##       preds
## actual FALSE TRUE
##     -1   616  489
##     1    172 3012

On the testing data, the model with 70 trees has the following performance metrics:

Metric Value
Accuracy 0.846
Recall 0.946
Precision 0.86

In order to determine the best threshold value rather than just using 0.5, we can plot the ROC curve and use that to determine the optimal value to use. We plotted together the ROC curves of the training and testing:

This determines that the optimal threshold value is 0.6356, so we will use this value as the cutoff in the confusion matrix rather than 0.5 as we did originally. When using the optimal threshold on the testing data, we get the following confusion matrix:

##       preds
## actual FALSE TRUE
##     -1   782  323
##     1    354 2830

And the performance metrics change:

Metric Value
Accuracy 0.842
Recall 0.889
Precision 0.898

The precision increases with the new threshold, at the cost of some recall. Finally, we checked the variable importance for the best random forest model of 70 trees. The most important word is 'suck.'

Generalized Linear Model

We next ran the GLM model on the NRC dictionary. We ran both ridge and lasso, but we found that lasso had better performance across all of the performance metrics that we used. Therefore, this is the best GLM model of the two.

Using lasso, and 5 fold cross validation, the resulting training confusion matrix:

##       preds
## actual FALSE TRUE
##     -1  1167  757
##     1    114 5930

And the testing confusion matrix:

##       preds
## actual FALSE TRUE
##     -1   570  535
##     1    121 3063

The performance metrics on the testing data for this model:

Metric Value
Accuracy 0.847
Recall 0.962
Precision 0.851

We also plotted the ROC curves for this model for both the training and testing datasets. The blue line represents the training, and the red line the testing data.

Naive-Bayes Model

Finally, we ran Naive Bayes as a third model for the NRC dictionary. Using the threshold of 0.5, and including Laplace smoothing, this is the confusion matrix for the training data:

##       pred
## actual FALSE TRUE
##     -1  1639  285
##     1   4767 1277

And for the testing data:

##       preds
## actual FALSE TRUE
##     -1   983  122
##     1   2486  698

Upon looking at the testing confusion matrix, it seems that the model predicts the negative class well, but is not very good at predicting the positive class. We can further observe the ROC curves for the training and testing sets, shown in blue and red respectively.

We can use the ROC curve to determine a better threshold in an attempt to improve the performance of the classification. The new confusion matrix for the testing data using the optimal cutoff threshold is:

##       preds
## actual FALSE TRUE
##     -1   703  402
##     1   1015 2169

And the performance metrics:

Metric Value
Accuracy 0.67
Recall 0.681
Precision 0.844

AFINN dictionary

Random Forest

For the AFINN dictionary, we tried three random forest models: 70, 120, and 180 trees in order to compare the performance between the three. The accuracy of all three models is the same, and the precision and recall vary only by 0.1% between the three, so we chose 70 trees as the best.

For the training data, here is the confusion matrix when using a 0.5 threshold:

##       preds
## actual FALSE TRUE
##     -1   767  258
##     1     48 6961

And for the testing data:

##       preds
## actual FALSE TRUE
##     -1   262  289
##     1     69 3706

The performance metrics are very high. This is likely because we have more information from this dictionary than we do with the other ones, as instead of just a sentiment, we have a sentiment score indicating how strong the sentiment is.

Metric Value
Accuracy 0.917
Recall 0.982
Precision 0.928

And the ROC curves:

We can still try to improve the performance, though, by using the optimal threshold obtained from the ROC curves, which is 0.747. The testing confusion matrix then changes as follows:

##       preds
## actual FALSE TRUE
##     -1   387  164
##     1    293 3482

This improved the precision of the model, resulting in the following metrics:

Metric Value
Accuracy 0.894
Recall 0.922
Precision 0.955

We also checked the variable importance for this model, and found that the most important variables were 'love', 'suck', and 'leave'.

Generalized Linear Model

We also ran a GLM model on the third dictionary. We found again that lasso has better performance than ridge, so we used the lasso regression with 5 fold cross validation. This resulted in the following confusion matrix using the training data.

##       preds
## actual FALSE TRUE
##     -1   410  615
##     1     63 6946

And using the testing data:

##       prediction
## actual FALSE TRUE
##     -1   198  353
##     1     48 3727

On the testing data, the GLM performance metrics for the AFINN dictionary are also very high.

Metric Value
Accuracy 0.907
Recall 0.987
Precision 0.913

You can see the corresponding ROC curves for the training model in blue and the testing model in red.

Naive-Bayes Model

Finally, we ran Naive Bayes as a third model on the AFINN dictionary to compare the performance. We chose to use Laplace smoothing again in order to achieve the best results. The confusion matrix using the training data and a threshold of 0.5:

##       pred
## actual FALSE TRUE
##     -1   834  191
##     1   5130 1879

And using the same threshold of 0.5, the confusion matrix for the testing data:

##       preds
## actual FALSE TRUE
##     -1   498   53
##     1   2768 1007

The accuracy and recall of this model using the cutoff threshold value of 0.5 are very low, demonstrating the low performance of the model using this threshold.

Metric Value
Accuracy 0.348
Recall 0.267
Precision 0.95

Instead, we can use the ROC curve to determine a better threshold value in order to reclassify some of the cases in the testing set. In the graph below, the ROC curve of the training set is shown in blue and the ROC curve of the testing set in red.

Because the optimal threshold value based on the ROC curve is so different from 0.5, and is in fact 1.195512e-68, the new confusion matrix for the testing data using this cutoff is very different from the original one.

##       preds
## actual FALSE TRUE
##     -1   375  176
##     1    980 2795

The correctly identified members of the negative class have gone down slightly which lowers the precision a little bit, but many more occurrences of the positive class are now correctly identified, bringing up the overall accuracy and recall scores a lot.

Metric Value
Accuracy 0.733
Recall 0.74
Precision 0.941

Combined Dictionaries

We updated the data using a combination of the dictionaries and trained all three models once again: random forest, GLM, and Naive Bayes.

Random Forest

For random forest, we again trained three individual random forest models using 70, 120 and 180 trees to compare their performance. We found that the performance metrics (accuracy, precision, and recall) were identical for 70 and 120 trees, and increased only a small amount when using 180 trees. Therefore, we chose the 70-tree model as the best, because the 180-tree model is very time intensive to run yet does not lead to much better results (only a 0.3% increase in accuracy).

Then, using the 70 trees model and a threshold of 0.5 for the classification, this is the confusion matrix for the training data:

##       preds
## actual FALSE TRUE
##     -1   837  172
##     1     12 7450

And for the testing data:

##       preds
## actual FALSE TRUE
##     -1   210  362
##     1     42 3946

And the performance metrics for this model are very high on the testing data:

Metric Value
Accuracy 0.911
Recall 0.989
Precision 0.916

We also checked the variable importance for this random forest model, and we found that the word 'suck' is by far the most important word, with words such as 'star' and 'concern' coming behind.

Next, we wanted to see if we can use a better threshold to improve the performance of the model, even though the performance metrics were already very high. To do this, we used the ROC curves, which are shown below. The ROC curve for training is blue, and the one for testing is red.

We found that the best threshold is not 0.5 but actually 0.7600604, so we created the testing confusion matrix again using this cutoff, with the following result.

##       preds
## actual FALSE TRUE
##     -1   367  205
##     1    216 3772

We can see from the confusion matrix that the number of correct -1 class predictions increased, but there are slightly fewer correct positive class predictions. The data is very imbalanced, with few negative class instances to begin with, so being able to predict the negative class better is an improvement. You can see this in the performance metrics, because the precision of the model increased.

Metric Value
Accuracy 0.908
Recall 0.946
Precision 0.948

Generalized Linear Model

Next, we ran GLM on the combined dictionary. We ran both lasso and ridge with 5-fold cross validation, but the performance was better with lasso, so as with the other dictionaries, we used lasso as the best model for GLM. The confusion matrix for the training data:

##       preds
## actual FALSE TRUE
##     -1   607  402
##     1     42 7420

And the testing data:

##       prediction
## actual FALSE TRUE
##     -1   276  296
##     1     83 3905

The performance metrics for the GLM model also seem to be very high when using the combined dictionaries. This makes sense since we have more information from the combined dictionaries than we do from any of the single individual ones, so we would expect for the model trained on the combined dictionaries to outperform the individual dictionaries.

Metric Value
Accuracy 0.917
Recall 0.979
Precision 0.93

We also plotted the ROC curves, with the training in blue and the testing in red.

Naive-Bayes Model

The third model we ran on the combined dictionaries is the Naive Bayes model. We ran the model using Laplace smoothing in order to get the best results, and then obtained the following confusion matrix on the training dataset when using a threshold of 0.5.

##       pred
## actual FALSE TRUE
##     -1   975   34
##     1   7331  131

And using the same threshold on the testing data:

##       preds
## actual FALSE TRUE
##     -1   572    0
##     1   3914   74

The performance measures on the testing data are as follows:

Metric Value
Accuracy 0.142
Recall 0.019
Precision 1

Looking at the ROC curves, the testing curve is in red and the training in blue.

We can use the ROC curve to find a better threshold than 0.5, and see if this can increase the performance of the model. The ROC curve determined that the best threshold to use for the cutoff is 3.259095e-307, which is very different from 0.5. If we use this threshold instead, the confusion matrix for testing changes:

##       preds
## actual FALSE TRUE
##     -1   526   46
##     1   3011  977

The false negative count improves considerably, but as a sacrifice the precision is no longer perfect.

Metric Value
Accuracy 0.33
Recall 0.245
Precision 0.955

Performance Evaluation

We worked on a text mining problem and implemented three different algorithms on this dataset. We then computed several performance evaluation metrics to compare the models:

  • Accuracy
  • Precision
  • Recall
  • F1 score

Since the data is so imbalanced, we considered the F1 score rather than accuracy, since it takes into account both the precision and the recall.
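For reference, the F1 score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)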

Across the dictionaries, the AFINN and combined dictionaries consistently provide the best performance across all three models: random forest, generalized linear model, and Naive Bayes classifier. These dictionaries have the highest performance metrics for all three models and for all three performance measures that we calculated (accuracy, precision, and recall), with the exception of the Naive Bayes classifier for the combined dictionary, which did not perform as well as the others.

Conclusion

In this project, we analyzed Yelp reviews related to restaurants and applied four different dictionary configurations (Bing, AFINN, NRC, and combined) to score the sentiment of each document. In addition, we built random forest, generalized linear model, and Naive Bayes classifiers. Overall, the generalized linear model performs slightly better than the others, and the performance of the models with the AFINN dictionary is much better than with the other dictionaries.

As a next step, we plan to explore other dictionaries to improve model performance and to concentrate on improving the Naive Bayes classifier. We will also try multi-class classification (positive, negative and neutral), since many words do not make a significant enough difference to differentiate positives from negatives.

Britney Scott
Masters Candidate

I am a Masters Candidate at the University of Illinois at Chicago. I will graduate with a Masters of Science in Business Analytics in December 2020. I earned a Bachelor of Science in Statistics and a Bachelor of Science in Marketing in December 2019.