Amazon Alexa Reviews


Can we correctly predict the negative feedback and gather the information from customer reviews to improve the quality of the product design and user experience?

Programming Language: Python

Packages: Scikit-Learn, Keras

Methods: Bag of Words, Word2Vec

Models: Logistic Regression, SVM, XGBoost


Alexa is Amazon’s cloud-based voice service.  Alexa-enabled devices let users interact with technology by natural voice, which gives them a more intuitive way to use it in their daily lives.  Although many people think that voice assistants are gradually making users lazier, I believe that the benefits of using them outweigh the negative effects.  According to a survey by Kunst (2019), over 50% of respondents use Amazon Alexa devices daily.  With that in mind, I wanted to research the topic and apply data mining with related machine learning algorithms to a public dataset called “Amazon Alexa Reviews.”


Problem Statement

Customer experience and customer satisfaction are central to Amazon’s mission.  Thus, the question that I set for this project is “Can we correctly predict the negative feedback and gather the information from customer reviews to improve the quality of product design and user experience?”  The sections of this report work toward answering that question.


Data Analysis Process

I start with an overview of the dataset and an Exploratory Data Analysis (EDA) to better understand the data and look for valuable patterns and insights.  After that, I build several predictive models using machine learning algorithms and Natural Language Processing (NLP) methods.  Finally, I compare the performance of those models and carry out an error analysis, which will help me improve my future work.


Dataset Overview with EDA

The dataset is from Kaggle.  It has 5 columns and 3150 observations.  There are no missing values in this dataset (Figure 1).  Figure 2 shows the first few rows of the original dataset.

Figure 1. Missing Values


Figure 2. Original Dataset
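
As a rough illustration of this overview step, the sketch below checks the shape and the missing values of a small stand-in table with pandas.  The rows are invented, and the column names “variation” and “verified_reviews” are assumptions that do not appear in this write-up.

```python
import pandas as pd

# Tiny stand-in for the Kaggle file. The rows are made up, and the column
# names "variation" and "verified_reviews" are assumptions.
df = pd.DataFrame({
    "rating": [5, 4, 1, 2, 5],
    "date": ["31-Jul-18"] * 5,
    "variation": ["Black Dot", "Black Dot", "White Dot", "Charcoal Fabric", "Black Dot"],
    "verified_reviews": ["Love it", "Works well", "Stopped working", "Cheap sound", "Great speaker"],
    "feedback": [1, 1, 0, 0, 1],
})

shape = df.shape                         # (number of rows, number of columns)
missing_per_column = df.isnull().sum()   # count of missing values in each column

print(shape)
print(missing_per_column)
print(df.head())                         # first few rows, as in Figure 2
```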


Overall, there are 16 products with the Alexa function in the dataset.  The Black Echo Dot was the most reviewed product, with over 500 reviews (Figure 3).  Grouping by “rating” shows that over 2200 reviews received 5 stars from users.  I re-assigned the ratings into just two groups: one and two stars became negative feedback, and the rest became positive feedback (Figure 4).  The resulting data is clearly imbalanced, and I adjust for that later.

Figure 3. Distribution by Products


Figure 4. Distribution by Ratings
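
The regrouping of star ratings into two classes could look roughly like the sketch below.  The ratings are invented, and coding negative feedback as 1 is an assumption chosen so that negative feedback is the class being predicted.

```python
import pandas as pd

# Invented star ratings standing in for the "rating" column
ratings = pd.Series([5, 5, 4, 3, 2, 1, 5, 5], name="rating")

# 1 = negative feedback (1-2 stars), 0 = positive feedback (3-5 stars)
label = (ratings <= 2).astype(int)

# Share of each class, which makes any imbalance visible
class_share = label.value_counts(normalize=True).sort_index()
print(class_share)
```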


During preprocessing, I removed the feature “date” because I don’t need it.  In addition, the original “feedback” column was not useful as is, so I overwrote it with 0 for positive feedback and 1 for negative feedback and used it as my label column.  Finally, I created some new features for the later analysis and modeling.


The boxplot (Figure 5) shows that negative feedback tends to have more words per review.  That is good for this case: the more words in a review, the more information we can use for our analysis and prediction models.

Figure 5. Length of Review by Feedback
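
A review-length feature like the one behind this boxplot can be sketched as follows.  The reviews and the column names “review”, “label”, and “word_count” are hypothetical; the write-up does not list the features that were actually created.

```python
import pandas as pd

# Invented reviews; 1 = negative feedback, 0 = positive feedback
df = pd.DataFrame({
    "review": [
        "Love it",
        "Great sound",
        "It stopped working after a week and support did not help",
        "The speaker is cheap and I had to reset it many times",
    ],
    "label": [0, 0, 1, 1],
})

# Hypothetical feature: number of whitespace-separated words per review
df["word_count"] = df["review"].str.split().str.len()

# Median length per class, the centre line a boxplot would draw
median_length = df.groupby("label")["word_count"].median()
print(median_length)
```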


For the WordClouds, I used a general stop-words set plus a series of additional stop-words that I added through trial and error (Figure 6).  In the WordCloud of positive feedback (Figure 7), we can see words like love, good, and great.  We can also see the words “music” and “speaker”, which suggests that users are satisfied with some features of the device.  On the other hand, the WordCloud of negative feedback (Figure 8) suggests that users might also have problems with the Echo Dot, screen, light, and speaker.

Figure 6. The List of Stop-words


Figure 7. WordCloud of Positive Feedback


Figure 8. WordCloud of Negative Feedback
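
A word cloud is essentially a picture of word frequencies after stop-word removal.  The sketch below shows only that counting step with the standard library; the stop-word sets and reviews are hypothetical stand-ins, not the lists shown in Figure 6.

```python
import re
from collections import Counter

# Hypothetical stop-word lists; the real ones are shown in Figure 6.
GENERAL_STOP_WORDS = {"the", "a", "is", "it", "i", "my", "and", "to", "this"}
EXTRA_STOP_WORDS = {"alexa", "echo", "amazon"}   # added by trial and error
STOP_WORDS = GENERAL_STOP_WORDS | EXTRA_STOP_WORDS

def top_words(reviews, n=3):
    """Return the n most frequent non-stop-words across a list of reviews."""
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Invented positive reviews
positive_reviews = [
    "I love my Echo, the music is great",
    "Great speaker and I love the music",
    "Love this Alexa speaker",
]
print(top_words(positive_reviews))
```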


Predictive Models

Before fitting the models, I split the data into training data (80%) and test data (20%).  I fit three algorithms: Logistic Regression, SVM, and XGBoost.  For the text features, I used Bag of Words with and without TF-IDF.  Finally, I also tried Word2Vec, but only with Logistic Regression, since Logistic Regression was slightly better than the others in the first run.  Therefore, I have 7 models in total (Figure 9).

Figure 9. Machine Learning Models
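
A minimal scikit-learn sketch of this setup is shown below, assuming a stratified 80/20 split and default model settings.  The toy corpus, the random seed, and the stratification are assumptions; only Logistic Regression is shown, and another classifier could be swapped into the same pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the reviews; 1 = negative feedback
positive = ["love it great sound", "works great love the music",
            "amazing speaker easy setup", "awesome product good price"] * 4
negative = ["stopped working after a week", "cheap speaker bad sound"] * 2
texts = positive + negative
labels = [0] * len(positive) + [1] * len(negative)

# 80% training data, 20% test data; stratify keeps the class ratio similar
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Bag of Words without and with TF-IDF weighting, each followed by the same model
predictions = {}
for name, vectorizer in [("bag of words", CountVectorizer()),
                         ("tf-idf", TfidfVectorizer())]:
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

print(predictions)
```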


As we can see on the pie chart, 92% of the data are positive reviews and only 8% are negative reviews.  We can also see a visualization of the word embeddings, scaled down to 2 dimensions for reference (Figure 10).  To address the imbalance, I used the Synthetic Minority Oversampling Technique (SMOTE) and got a much more balanced dataset.

Figure 10. Dealing with Imbalanced Data
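
The core idea of SMOTE is to create new minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours.  The NumPy sketch below is a simplified illustration of that idea only, not the implementation used in this project; the function name and the toy points are hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours (requires
    k < number of minority samples)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))           # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its k neighbours
        gap = rng.random()                             # position on the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Four toy minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_sketch(X_min, n_new=6)
print(new_rows.shape)
```

Because every synthetic row lies on a segment between two existing minority points, the new rows stay inside the region that the minority class already occupies.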


Model Comparison

Let’s take a look at the comparison.  My goal is to correctly predict the negative feedback, so I treat negative feedback as the positive class and want as many “True Positives” and as few “False Negatives” as possible in the confusion matrix.  Therefore, Recall is the main metric in this case.  Word2Vec with Logistic Regression is the best model, with a recall of 81.6% (Figure 11).

Figure 11. The Comparison of the Models
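
Recall measures how many of the truly negative reviews the model catches, that is TP / (TP + FN), with negative feedback coded as class 1.  The sketch below works this out with scikit-learn on invented labels and predictions, not on the project’s results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented ground truth and predictions; 1 = negative feedback
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# For labels [0, 1], ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(tn, fp, fn, tp)
print(recall, precision)
```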


We can also see the comparison with the ROC curves and AUC scores.  The best model has an AUC score of 0.93 (Figure 12).

Figure 12. The ROC Curves and AUC Scores
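
The AUC score can be read as the probability that a randomly chosen negative review receives a higher score than a randomly chosen positive one.  A minimal sketch with scikit-learn on four invented scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented labels and predicted scores for class 1 (negative feedback)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve: false positive rate and true positive rate per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under that curve; 3 of the 4 (class 1, class 0) pairs are ranked correctly
auc = roc_auc_score(y_true, scores)
print(auc)
```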


After the model comparison, I was interested in how Bag of Words and Word2Vec differ as word embeddings.  As the plots below show, the Bag of Words embeddings are mixed together, and it is hard to separate the two classes.  In contrast, the plot on the right shows that the two classes can be separated much more easily with Word2Vec (Figure 13).

Figure 13. Bag of Words vs. Word2Vec Embeddings
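
Word2Vec produces one vector per word, so each review still has to be turned into a single vector before a classifier can use it.  The sketch below assumes mean pooling, a common choice; the write-up does not say how the review vectors were actually built, and the 3-dimensional word vectors here are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained Word2Vec model would
# supply real ones with many more dimensions.
word_vectors = {
    "love":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "cheap": np.array([-0.7, 0.3, 0.2]),
    "reset": np.array([-0.6, 0.1, 0.5]),
}

def review_vector(text, vectors, dim=3):
    """Average the vectors of the known words in a review (mean pooling)."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)      # no known word: fall back to a zero vector
    return np.mean(known, axis=0)

positive_vec = review_vector("Love great", word_vectors)
negative_vec = review_vector("cheap reset", word_vectors)
print(positive_vec, negative_vec)
```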


Error Analysis

In this section, I did some error analysis with Local Interpretable Model-Agnostic Explanations (LIME).  Sometimes we don’t know whether we can trust a machine learning prediction, especially from more advanced algorithms, because their outputs are hard to explain.  LIME can help us understand the reasons behind a model’s individual predictions.
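
To give a feel for perturbation-based explanations, the sketch below drops one word at a time from a review and records how the predicted probability of negative feedback changes.  This is a much simpler leave-one-word-out technique in the spirit of LIME, not the LIME algorithm or library itself; the toy corpus and the function name are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus; 1 = negative feedback
texts = ["love it great sound", "awesome great music", "amazing love it",
         "cheap speaker stopped working", "tried reset not working",
         "cheap and not great"]
labels = [0, 0, 0, 1, 1, 1]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

def word_influence(text, model):
    """Drop each word in turn and record how much the predicted probability
    of negative feedback falls. A positive value means the word pushed the
    prediction toward negative feedback."""
    words = text.split()
    base = model.predict_proba([text])[0, 1]
    influence = {}
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        influence[word] = base - model.predict_proba([reduced])[0, 1]
    return influence

scores = word_influence("cheap speaker not working", model)
print(scores)
```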


In the two cases below (Figure 14), the model correctly predicts that the reviews are positive feedback.  As we can see, the text includes words such as awesome, love, amazing, and great.

Figure 14. LIME with Correct Prediction on Positive Feedback


In the two examples below (Figure 15), the model correctly predicts that the reviews are negative feedback.  The sentences include words such as cheap, not, reset, and tried.  These words are not only negative but also offer some insight.  For instance, users might have had issues with the product or service because they reset the device and tried some functions again and again.

Figure 15. LIME with Correct Prediction on Negative Feedback


In the case below (Figure 16), the model predicts the review incorrectly: the true class is negative feedback, but the prediction is positive feedback.  We can analyze the error from the text.  The user used some positive words to describe other functions but still gave negative feedback on the major function that he or she was complaining about.  That is why the model got this review wrong.  This case shows that a long review might also introduce more noise that hurts the accuracy of our prediction.  We should therefore improve the text preprocessing to avoid these kinds of issues.

Figure 16. LIME with Incorrect Prediction on Negative Feedback



Conclusion

In the EDA section, music and speaker appear to be the most liked features based on the reviews.  However, some users might have issues with the speaker and screen on some devices.  With this dataset, we can predict the negative feedback and gather information from customer reviews to improve the quality of the products and the user experience.  The Word2Vec plus Logistic Regression model achieved a good recall of 81.6% with an AUC score of 0.93.  The error analysis, however, showed that a long review, while carrying more information, might also add noise to the training data and hurt the accuracy of our prediction.  Therefore, we could improve the accuracy of predicting negative reviews by fine-tuning the text preprocessing.


Future Work

There are several things that I could keep working on.  First, I could fine-tune the text preprocessing of the training data to increase the accuracy of the models’ predictions.  Additionally, tuning the parameters of the models that I created might slightly improve the prediction accuracy.  Moreover, applying some pre-trained neural networks could be a good next step.



Kunst, A. (2019). How often do you use Alexa? Statista. Retrieved from Statista database.