When life gives you lemons: Analyzing negative reviews to improve your mobile app

Jiaye Pan

Aug 21, 2019 • 6 min read

Table of Contents

Break down sentences and lemmatize everything
Time to fit the model
Results: The crash rate is key, among other things...
Conclusion
Limitations and future improvements

Your product is good. Your mobile app has great features, but your app store rating is low. What happened?

In a previous blog post, we demonstrated that high crash rates can cause low app ratings. In this post, we explain how we analyzed negative reviews to determine what motivated customers to write them.

We collected the most relevant negative app reviews (rated 2 stars or lower) for the apps of six major US retailers on one of the major mobile application platforms. We then used natural language processing (NLP) to analyze the reviews and distill the specific topics behind the complaints. For this topic modeling phase we used Latent Dirichlet Allocation (LDA), which showed us how the negative reviews were distributed across the common topics users tended to bemoan.

Let’s dive into the details of how we analyzed negative reviews with NLP.

Break down sentences and lemmatize everything

The first step in topic modeling is to convert sentences to individual words and phrases. These words and phrases will then be fed to the model to generate common topics. Typically, sentence breakdown consists of two parts: lemmatization, and removal of stop words and punctuation.

One key feature of human language (English in particular) is that a single word can take several inflected forms. For instance, “I ran five miles yesterday”, “I will run five miles tomorrow”, “She is running”, and “He runs” all describe the same action. English changes the form of “run” to indicate when the action takes place and who performs it. For analysis, however, all variations of “run” should point to the same action, i.e. “run”: “run” is the lemma of “runs”, “ran”, “running”, etc. The process of converting all inflected forms of a word to their lemma is called lemmatization. Fortunately, Python’s NLTK library provides WordNetLemmatizer, which looks up lemmas in the WordNet lexical database. Note that it treats each word as a noun unless given a part-of-speech tag, so “ran” becomes “run” only when lemmatized as a verb.

Having lemmatized the reviews, we remove stop words: words that occur very frequently but contribute little meaning to a sentence. Stop words include, but are not limited to, “and”, “but”, “as”, “whom”, and “at”. We generally omit punctuation for model fitting as well. What remains after removing stop words and punctuation is mostly content words such as nouns, verbs, and adjectives, which carry far more meaning per word than the unfiltered text.

Below is a code snippet that demonstrates lemmatization and removal of stop words and punctuation:
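The sketch below shows both steps in order: lemmatize first, then drop stop words and punctuation. It deliberately swaps NLTK’s WordNetLemmatizer and stop-word corpus (which need the WordNet and stopwords data downloaded first) for a tiny hand-written lemma table and stop-word set, so it runs with the standard library alone. `LEMMAS`, `STOP_WORDS`, and `preprocess` are hypothetical names; a real pipeline would use the full WordNet lookup and NLTK’s English stop-word list.

```python
import re
import string

# Hypothetical stand-in for a WordNet lookup: inflected form -> lemma.
# A real pipeline would call nltk.stem.WordNetLemmatizer().lemmatize(word, pos).
LEMMAS = {
    "ran": "run", "runs": "run", "running": "run",
    "miles": "mile", "crashes": "crash", "crashed": "crash",
    "cards": "card", "payments": "payment",
    "is": "be", "was": "be", "are": "be", "were": "be",
}

# Small illustrative stop-word set; NLTK's English list is much longer.
STOP_WORDS = {
    "i", "you", "he", "she", "it", "we", "they", "my", "the", "a", "an",
    "and", "but", "as", "whom", "at", "to", "of", "with", "every", "will", "be",
}
PUNCTUATION = set(string.punctuation)

def preprocess(review):
    """Tokenize a review, lemmatize each token, then drop stop words and punctuation."""
    # Words (allowing apostrophes, as in "can't") or single non-space symbols.
    tokens = re.findall(r"[a-z']+|[^\sa-z']", review.lower())
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [t for t in lemmas if t not in STOP_WORDS and t not in PUNCTUATION]

for sentence in ["I ran five miles yesterday.", "I will run five miles tomorrow.",
                 "She is running.", "He runs."]:
    print(preprocess(sentence))
```

All four “run” sentences reduce to token lists built around the lemma `run`, and the last two reduce to `run` alone, which is exactly the normalization the topic model needs.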

Time to fit the model

Now we are ready to carry out topic modeling with Latent Dirichlet Allocation, using the Gensim library for Python. The process consists of the following key steps:

Create a dictionary from the processed review data. A dictionary collects every unique word in a collection of texts and assigns each one an integer ID.

Convert each processed review into a bag-of-words representation using the dictionary, then save the dictionary as well as the resulting corpus for analysis. A bag-of-words model is a simplified representation that reduces a text to the unique words in the dictionary and the number of times each one appears (its multiplicity). For instance, “I like running. You like running.” becomes {“I”:1, “you”:1, “like”:2, “running”:2}. The bag-of-words model is well suited to analysis because it is easily represented in matrix form: {“I”:1, “you”:1, “like”:2, “running”:2} becomes [1, 1, 2, 2], with each entry corresponding to a unique word in the dictionary in a known order. If desired, different bag-of-words corpora can easily be combined with matrix operations. This conversion can be done in the following way:
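The sketch below mirrors what Gensim’s `corpora.Dictionary` and its `doc2bow` method do (build a word-to-ID mapping, then turn each token list into sparse `(token_id, count)` pairs), written in plain Python so the mechanics are visible. The helper names are hypothetical, and the IDs Gensim assigns may be ordered differently from the first-appearance order used here. In Gensim itself, saving would go through methods such as `Dictionary.save` and `corpora.MmCorpus.serialize`.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each unique token to an integer ID, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Sparse bag of words: sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

def to_dense(bow, num_words):
    """Dense count vector with one entry per dictionary word, in ID order."""
    vector = [0] * num_words
    for token_id, count in bow:
        vector[token_id] = count
    return vector

# "I like running. You like running." after lowercasing and tokenizing.
doc = ["i", "like", "running", "you", "like", "running"]
token2id = build_dictionary([doc])
bow = doc2bow(doc, token2id)
print(token2id)                      # {'i': 0, 'like': 1, 'running': 2, 'you': 3}
print(bow)                           # [(0, 1), (1, 2), (2, 2), (3, 1)]
print(to_dense(bow, len(token2id)))  # [1, 2, 2, 1]
```

The dense vector reads [1, 2, 2, 1] rather than [1, 1, 2, 2] only because this dictionary orders words by first appearance (I, like, running, you); the counts themselves are identical.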

Run LDA and print the topics. Here, we print the top 20 topics extracted from all the negative user reviews. For instance, the first topic, topic 0, consists of the keywords “app”, “card”, “pay”, “can’t”, “get”, and “make”. This suggests that users are having difficulty making online payments with debit or credit cards in the mobile app. Similarly, topic 2, with “slow”, “app”, “very”, “make”, and “crash”, indicates sub-par app performance and a high crash rate. Because these topics are generated automatically, some are less intuitive than others. This can often be improved by tuning hyperparameters such as the number of topics and the number of training iterations. In any case, the topics give us a rough idea of the main issues that unsatisfied users frequently bring up in app reviews.
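In Gensim, this step is a call to `gensim.models.LdaModel` with the bag-of-words corpus, `id2word=dictionary`, and `num_topics=20`, followed by `print_topics`; its `passes` and `iterations` parameters are the iteration knobs mentioned above. Gensim fits LDA with online variational Bayes. To show what the model actually estimates without depending on Gensim, the sketch below swaps in a different inference method, collapsed Gibbs sampling, in plain Python. The hyperparameter defaults (`alpha`, `beta`, `iterations`) and the toy reviews are illustrative assumptions, not the settings or data used in this study.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to token lists with collapsed Gibbs sampling.

    Returns (topic_word, doc_topic): topic_word[k][w] estimates P(word w | topic k),
    and doc_topic[d][k] estimates P(topic k | document d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V, K = len(vocab), num_topics
    n_dk = [[0] * K for _ in docs]          # topic counts per document
    n_kw = [Counter() for _ in range(K)]    # word counts per topic
    n_k = [0] * K                           # total words assigned to each topic
    z = []                                  # z[d][i]: topic of word i in doc d

    def update(d, w, k, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)   # remove the current assignment
                # Unnormalized full conditional P(z = k | all other assignments).
                weights = [(n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                           for k in range(K)]
                z[d][i] = rng.choices(range(K), weights=weights)[0]
                update(d, w, z[d][i], +1)

    topic_word = [{w: (n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in vocab}
                  for k in range(K)]
    doc_topic = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
                 for d, doc in enumerate(docs)]
    return topic_word, doc_topic

def top_words(topic_word, k, n=5):
    """The n most probable words of topic k, like the keyword lists quoted above."""
    return sorted(topic_word[k], key=topic_word[k].get, reverse=True)[:n]

# Toy, already-preprocessed reviews (hypothetical data).
reviews = [
    ["app", "card", "pay", "can't", "payment"],
    ["card", "pay", "payment", "balance", "card"],
    ["slow", "app", "crash", "crash", "slow"],
    ["crash", "slow", "freeze", "app", "crash"],
]
topic_word, doc_topic = lda_gibbs(reviews, num_topics=2, iterations=300)
for k in range(2):
    print(f"topic {k}:", top_words(topic_word, k))
```

Whatever the inference method, the outputs have the same shape as Gensim’s: a word distribution per topic (read off as keyword lists) and a topic mixture per review.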

We can also zero in on a specific user review. LDA represents each review as a mixture of the generated topics, so a review can be classified under the topic that makes up the largest share of its mixture. For instance, one user review reads:

“inputed[sic] first Macy’s card in and no issues to pay or check balance. once card was upgraded it would not show any balance, or able to mqke[sic] payment.”

LDA assigns this review to Dominant_Topic 0 (the credit card issues described above), with that topic contributing 78% of the review’s topic mixture. Manually double-checking assignments like this one gives a rough idea of how well the algorithm performs.
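Picking the dominant topic is a small step on top of the fitted model. Gensim’s `LdaModel.get_document_topics` returns a review’s topic mixture as `(topic_id, probability)` pairs; the sketch below takes pairs in that shape and returns the largest one. The 0.78 weight mirrors the review above, but the other mixture values and the review-flagging threshold in the hypothetical `needs_manual_check` helper are assumptions, included to show how a manual spot check could be prioritized.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, weight) pair with the largest weight in a topic mixture."""
    return max(doc_topics, key=lambda pair: pair[1])

def needs_manual_check(doc_topics, threshold=0.5):
    """Flag reviews whose dominant topic covers less than `threshold` of the mixture."""
    _, weight = dominant_topic(doc_topics)
    return weight < threshold

# Hypothetical mixture for the card review quoted above.
mixture = [(0, 0.78), (2, 0.12), (5, 0.10)]
print(dominant_topic(mixture))       # (0, 0.78)
print(needs_manual_check(mixture))   # False
```

Reviews with a flat mixture, where no single topic dominates, are the natural candidates for the manual double-check.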

Results: The crash rate is key, among other things…

According to our topic modeling results, 64.4% of all negative reviews fall into three broad categories:

Crashes / Slow response
Buggy checkout experience
Missing items in shopping cart

Here is the complete breakdown of the issues:

Crashes and slow response time accounted for 29.3% of all negative reviews. Crash rates should obviously be as low as technically possible. Psychologically, a negative review carries more weight than a positive one, so the industry-accepted crash rate of 1% may still be too high for large enterprises. We generally work with customers to bring the crash rate down to 0.2% or lower, which prevents crashes almost entirely and keeps them from dragging down app ratings.
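To put the 1% and 0.2% thresholds in perspective, the sketch below converts a crash rate into a count of crashed sessions. It assumes crash rate is measured per session (crashed sessions divided by total sessions), and the one million daily sessions is a hypothetical traffic figure, not data from this study.

```python
def crash_rate(crashed_sessions, total_sessions):
    """Fraction of sessions that ended in a crash (per-session definition assumed)."""
    return crashed_sessions / total_sessions

def expected_crashes(rate, total_sessions):
    """Number of crashed sessions implied by a given crash rate."""
    return round(rate * total_sessions)

DAILY_SESSIONS = 1_000_000  # hypothetical traffic
for target in (0.01, 0.002):
    crashes = expected_crashes(target, DAILY_SESSIONS)
    print(f"{target:.1%} crash rate -> {crashes:,} crashed sessions per day")
```

At this assumed traffic level, moving from 1% to 0.2% means 8,000 fewer crashed sessions every day, each of them a chance for a frustrated user to leave a negative review.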

The checkout process generated 18.4% of the negative reviews. This includes long and complicated checkout processes, a bug-ridden payment experience, etc. Users become extremely frustrated when they have selected the items they want only to find that they cannot smoothly complete the transaction. Features like a smooth, one-page checkout not only close the proverbial circle for potential customers, but also reflect the brand image and the company’s attention to detail.

Shopping cart issues accounted for 16.7% of negative reviews. Unsatisfied users report shopping cart items disappearing, issues with non-inventory items, etc. According to the Baymard Institute, around 70% of virtual carts are abandoned before purchase, for reasons such as unexpected taxes and fees, being required to create an account, or an overly complicated checkout process.
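A breakdown like this can be tabulated once each review has a dominant topic and each topic has been mapped, by reading its keywords, to a broad category. In the sketch below, the topic IDs, the `TOPIC_TO_CATEGORY` mapping, and the ten labels are hypothetical; only the category names come from this post. As a sanity check on the reported figures, 29.3% + 18.4% + 16.7% = 64.4%, the share attributed to the three categories.

```python
from collections import Counter

# Hypothetical mapping from LDA topic IDs to broad categories, assigned by hand
# after reading each topic's keywords. Unmapped topics fall into "Other".
TOPIC_TO_CATEGORY = {
    0: "Buggy checkout experience",
    2: "Crashes / Slow response",
    5: "Missing items in shopping cart",
}

def category_shares(dominant_topics, topic_to_category):
    """Percentage of reviews per category, given one dominant topic ID per review."""
    counts = Counter(topic_to_category.get(t, "Other") for t in dominant_topics)
    total = len(dominant_topics)
    return {category: round(100 * n / total, 1) for category, n in counts.most_common()}

# Dominant topic of each of ten hypothetical reviews.
labels = [2, 2, 0, 5, 7, 2, 0, 9, 5, 2]
for category, share in category_shares(labels, TOPIC_TO_CATEGORY).items():
    print(f"{category}: {share}%")
```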

Conclusion

Every site receives negative feedback. This investigation used data from several large national retailers, but the results are likely typical of most mobile retailers. Retailers should analyze their own negative reviews in detail, using the techniques described in this blog post, to find the root causes. This “free QA” is particularly valuable because it is unsolicited and the writers expect nothing in return, so these reviews tend to represent the “naked truth”. Fixing the issues this analysis uncovers addresses users’ concerns head-on, ultimately leading to higher app ratings, a salvaged brand reputation, and increased revenue.

Limitations and future improvements

Topic modeling with Latent Dirichlet Allocation is a powerful tool for uncovering hidden commonalities in a vast swath of information. It identifies hidden topics and relates each sentence to the topic it is most closely associated with. In this study, we condensed over 4,000 negative reviews into eight categories. Note, however, that LDA does not take the correlation between topics into account. For instance, we might want discounts and promotions to be grouped with the payment process: discounts are applied at the time of payment, so problems with discounts and promotions are highly correlated with problems with payment. Standard LDA, however, has no built-in way to recognize the correlation between these topics.

Additionally, the bag-of-words model on which LDA operates only considers unique words and how often they occur; because it ignores word order, it cannot capture higher-level concepts such as semantic structure. Lastly, LDA is an unsupervised algorithm, so it may not be the best option when a tagged (labeled) dataset is available for training and testing.

We hope to revisit mobile application reviews with other natural language processing algorithms in the future, to better parse and understand the user intent behind each review. In any case, we hope this study proves illuminating for you and your business, helping you reach more users through your mobile applications and boost your mobile revenue in the years to come.

For assistance conducting analysis on your site, please contact Grid Dynamics.

Tags

Digital engagement

Retail

User interface and mobile app engineering
