The next big thing in customer service, a deep learning question-answering system

Mariia Fedorova

Sep 04, 2019 • 6 min read

Table of Contents

Question answering background
Transfer learning applied to question answering
You shall know a word by the company it keeps
Using BERT and XLNet for question answering
From theory to practice
Conclusion

“It takes months to find a customer, but only seconds to lose one.”

Maintaining a high-quality customer service experience while minimizing costs is high on the priority list of any e-commerce enterprise. An AI-based question-answering system can help do just that. But how would one approach building such a system? Recent advancements in deep learning and natural language processing (NLP) hold much promise in this field.

Question answering background

Products come with a collection of structured and unstructured attributes, such as catalog descriptions, specifications, and customer reviews. This information can be used to answer a customer’s query with a relevant text snippet. This task is known as Question Answering (QA) within the domain of natural language processing.

Much research has already been done in this field, and several datasets have been developed, such as WikiQA and TriviaQA. The Papers With Code site maintains a review of datasets. Models based on the state-of-the-art Transformer architecture, like BERT, GPT-2, XLNet, or SpanBERT, show impressive performance. The best results are achieved by ensembling these models with models of other architectures.

Transfer learning applied to question answering

Question Answering requires large datasets for training. One of the most popular is SQuAD (the Stanford Question Answering Dataset), developed at Stanford University. SQuAD contains more than one hundred thousand questions, with answers in the form of text snippets from Wikipedia articles. All questions and answers in the dataset were written and selected by humans. There are two versions of SQuAD. In version 1.1, the answer to every question is a span of the corresponding text. Version 2.0 adds questions that cannot be answered from the text, so a model must also recognize when no answer is present.
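To make the dataset layout concrete, here is a minimal sketch of a simplified SQuAD-style record. The field names (context, qas, answers, answer_start, is_impossible) follow the published SQuAD JSON files, with is_impossible appearing only in version 2.0; the passage, questions, and the answer_span helper are invented for illustration.

```python
# Simplified SQuAD-style record; the text itself is made up for illustration.
context = "The laptop ships with Windows 10 Home and a 256 GB SSD."

record = {
    "context": context,
    "qas": [
        {   # Answerable question: the answer is a span of the context.
            "id": "q1",
            "question": "What operating system does the laptop ship with?",
            "answers": [{"text": "Windows 10 Home",
                         "answer_start": context.index("Windows 10 Home")}],
            "is_impossible": False,
        },
        {   # SQuAD 2.0 only: the context does not contain an answer.
            "id": "q2",
            "question": "Does the laptop have a fingerprint sensor?",
            "answers": [],
            "is_impossible": True,
        },
    ],
}

def answer_span(context, qa):
    """Return the answer text cut from the context, or None if unanswerable."""
    if qa["is_impossible"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]

for qa in record["qas"]:
    print(qa["question"], "->", answer_span(record["context"], qa))
```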

Transfer learning is a machine learning technique initially developed in computer vision. The main idea of transfer learning is to reuse knowledge learned on large-scale generic tasks for different, often more specific, tasks. Transfer learning allows us to successfully apply a deep learning approach to domains and problems where dataset sizes are relatively limited. Both BERT (Bidirectional Encoder Representations from Transformers) and XLNet utilize transfer learning, with XLNet reported to produce more accurate results.

In deep learning, every consecutive layer of the neural network relies on the representation learned by the previous layer in the effort to achieve a training goal. The starting layers of the network learn fairly generic data representations, while the final layers learn representations specific to the task at hand, which are therefore not very useful for other tasks. A common practice in transfer learning, called fine-tuning, is to retrain the final, task-specific layers of the network for a new task while keeping the starting, generic layers intact.
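The idea of retraining only the final layer can be sketched in a few lines of plain Python. This is a toy illustration, not how BERT or XLNet are actually fine-tuned: the two-layer network, its weights, and the data are all hypothetical, and a real system would use a deep learning framework.

```python
# Frozen "pre-trained" layer (2 inputs -> 3 ReLU features); values are arbitrary.
frozen_w = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
# Task-specific output layer (3 features -> 1 output), retrained from zero.
head_w = [0.0, 0.0, 0.0]

def features(x):
    """Generic representation produced by the frozen layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in frozen_w]

def predict(x):
    return sum(w * f for w, f in zip(head_w, features(x)))

def mse(data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def fine_tune(data, steps=200, lr=0.1):
    """Gradient descent that updates head_w only; frozen_w is never touched."""
    for _ in range(steps):
        grads = [0.0] * len(head_w)
        for x, y in data:
            err = predict(x) - y
            for i, fi in enumerate(features(x)):
                grads[i] += 2 * err * fi / len(data)
        for i, g in enumerate(grads):
            head_w[i] -= lr * g

# Toy "new task": targets are expressible from the frozen features.
inputs = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
data = [(x, 2.0 * features(x)[0] - features(x)[1] + 0.5 * features(x)[2])
        for x in inputs]

frozen_before = [row[:] for row in frozen_w]
loss_before = mse(data)
fine_tune(data)
loss_after = mse(data)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```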

You shall know a word by the company it keeps

A deep learning approach to natural language processing relies on a language model based on the principle, “you shall know a word by the company it keeps.” This principle states that, in human language, it is possible to predict a word by looking at its context, and that learning to do so also captures the semantics of words.

Language models (LMs) are trained on a large-scale unannotated dataset, such as Wikipedia, to capture statistical relations between words, character and word n-grams, or even sentences. In NLP, fine-tuning is done by retraining the generic language model on data from a particular domain of interest, thus modifying the learned distribution of word sequences. For example, the word “mouse” has different semantics, and therefore different contexts, in Wikipedia and on a computer store site. Further fine-tuning of a language model occurs when training a new model in the context of a domain-specific task, such as QA.
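The “mouse” example can be illustrated by counting which words appear near it in two tiny corpora. This is only a sketch of the underlying statistics: the sentences are invented, and context_counts is a hypothetical helper, far simpler than a neural language model.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo)
                              if j != i)
    return counts

# Two made-up corpora standing in for an encyclopedia and a store catalog.
encyclopedia = [
    "the mouse is a small rodent",
    "a mouse eats seeds and grain",
]
computer_store = [
    "wireless mouse with usb receiver",
    "gaming mouse with rgb lighting",
]

print(context_counts(encyclopedia, "mouse").most_common(3))
print(context_counts(computer_store, "mouse").most_common(3))
```

The two context distributions barely overlap, which is the signal a language model picks up when it is retrained on domain data.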

Using BERT and XLNet for question answering

Modern NLP architectures, such as BERT and XLNet, employ a variety of tricks to train the language model better. When predicting a word based on its context, it is essential to avoid self-interference. The words should not be able to “see themselves.”

In the case of BERT, some words in the input are masked, and the model is conditioned on the remaining words to predict them. A special mask token or a random word is used as the replacement. Since the relations between sentences matter as well as the relations between words, BERT is also trained to predict whether one sentence follows another. The paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” and the article “The Illustrated BERT, ELMo, and co.” contain more detailed explanations of BERT. Open-source BERT implementations and pre-trained models are available for both TensorFlow and PyTorch.
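A minimal sketch of the masking step might look like the following. The proportions (15% of tokens selected; of those, 80% replaced by the mask token, 10% by a random word, 10% left unchanged) come from the BERT paper rather than from this article, and mask_tokens is a hypothetical helper working on whitespace tokens instead of BERT's subword vocabulary.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                     # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # special mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # random word
            else:
                corrupted.append(token)              # left unchanged
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

rng = random.Random(0)
tokens = "battery life is strong and the screen is bright".split()
# A higher mask_prob than BERT's 0.15, so the short demo sentence shows some masks.
corrupted, labels = mask_tokens(tokens, sorted(set(tokens)), rng, mask_prob=0.3)
print(corrupted)
print(labels)
```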

Instead of predicting masked words, XLNet maximizes the expected log likelihood of a sequence over possible permutations of the word order, so that the model “sees” both left and right contexts. While BERT suffers from a discrepancy between pre-training and fine-tuning (masks are not present in fine-tuning data), XLNet does not. An XLNet implementation and pre-trained models are available on GitHub.
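The permutation idea can be sketched as follows: sample an order in which tokens are predicted, and let each token see only the tokens that come earlier in that order. Across many sampled orders, a token ends up conditioned on words from both its left and its right, yet never on itself. The helper name visible_context is hypothetical, and the sketch ignores XLNet's attention mechanics entirely.

```python
import random

def visible_context(num_tokens, rng):
    """Sample a prediction order; map each position to the positions it may see."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    visible = {}
    for step, position in enumerate(order):
        # Only tokens predicted earlier in the sampled order are visible.
        visible[position] = sorted(order[:step])
    return order, visible

tokens = "it comes with Windows 10".split()
order, visible = visible_context(len(tokens), random.Random(1))
for position in order:
    seen = [tokens[i] for i in visible[position]]
    print(f"predict {tokens[position]!r} from {seen}")
```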

In the case of QA, the dataset typically consists of the question, document text, and starting and ending indices within the document text related to the correct answer to the question. The trained QA model, therefore, predicts where to cut a continuous snippet from the document based on the question asked.
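As a sketch of how a snippet can be cut from the document, assume the model produces one score per token for "the answer starts here" and one for "the answer ends here" (the usual setup for BERT-style QA models). The scores below are invented, and best_span is a hypothetical helper that picks the highest-scoring valid pair.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

tokens = "it comes with Windows 10 Home".split()
start_scores = [0.1, 0.2, 0.1, 2.5, 0.3, 0.2]  # hypothetical model outputs
end_scores = [0.0, 0.1, 0.2, 0.4, 0.6, 2.8]    # hypothetical model outputs

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```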

So, how do we apply this theory in practice? Let’s dive into a concrete example.

From theory to practice

In our example implementation, we use the DeepPavlov library, an open-source NLP library developed at the Moscow Institute of Physics and Technology that contains many pre-trained NLP models with a common API, including some Question Answering models. There are two models available for the SQuAD 1.1 task, BERT and R-Net. The DeepPavlov team reported BERT provided more accurate answers than R-Net.
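Querying a pre-trained DeepPavlov model might look roughly like the sketch below. It assumes the build_model and configs entry points and the squad_bert config name from the DeepPavlov documentation of that period, and a model that returns answers, start character positions, and scores as three parallel lists; newer releases may name or structure these differently, so treat every identifier here as an assumption to check against the installed version. The wrapper functions load_squad_bert and answer_question are hypothetical.

```python
def load_squad_bert():
    """Load DeepPavlov's pre-trained BERT SQuAD model (assumed config name)."""
    # Imported lazily: requires `pip install deeppavlov` and downloads weights.
    from deeppavlov import build_model, configs
    return build_model(configs.squad.squad_bert, download=True)

def answer_question(model, context, question):
    """Ask one question about one document; return (answer, start position)."""
    # Assumption: the model takes batches and returns three parallel lists.
    answers, starts, _scores = model([context], [question])
    return answers[0], starts[0]
```

With the library installed, a call such as answer_question(load_squad_bert(), review_text, "How long is the battery life?") would return the predicted snippet and its position in review_text.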

We collected a dataset of user reviews, questions, and answers about laptops from the online catalog and labeled it according to SQuAD 2.0 standards. We started by testing some pre-trained models on it: one from the DeepPavlov library and one described in the Xu et al. 2019 paper. We proceeded by fine-tuning XLNet on SQuAD 2.0, because no XLNet weights fine-tuned for Question Answering were publicly available.

Xu et al.’s model (called BERT for RRC, or Review Reading Comprehension) was trained first on SQuAD 1.1 and then on a dataset of user reviews of laptops.

All three models make reasonable predictions that can be used as pre-moderated answers to customers’ questions. For example, to the question “How long is the battery life?” a model outputs “Battery life is strong.” To the question “What version of Windows comes with this laptop?” a model outputs “it comes with Windows 10 Home.” (Words in bold are model predictions, and the rest is their context. Including context in the answer makes it easier to understand.)

The table shows more examples of outputs from the three models:

Question	DeepPavlov’s BERT answer	BERT-for-RRC answer	XLNet answer	User answer
Is this model backlit?	It doesn’t have a backlit keyboard	It doesn’t have a backlit keyboard	It doesn’t have a backlit keyboard	No
Can I game on this?	Kudos to Dell on this one.*	It is not recommended for gaming or editing of any kind.	It is not recommended for gaming or editing of any kind.	You can’t ‘game’ much at all… this is slower than a pixel phone.
Does this model have a fingerprint sensor?	There is a problem that it doesn’t have a separate numeric pad	laptops / asus – vivobook – s 15-S510UN / features / ) , it does NOT include a fingerprint sensor.	laptops/ASUS-VivoBook-S15-S510UN/Features/), it does NOT include a fingerprint sensor.	None that I have found.

* DeepPavlov’s model gave a wrong answer because Wikipedia contains an article about a game called Kudos.

BERT that was post-trained on a dataset for a particular task (not only on SQuAD) provided more accurate answers than BERT trained on SQuAD alone. XLNet outperformed BERT even without fine-tuning on a domain-specific dataset. The problem of underrepresentation of some domain-specific knowledge in SQuAD can be solved by post-training on a small domain-specific dataset (as in the case of BERT-for-RRC) or by adding much more data while training the language model (as in the case of XLNet). XLNet handled long sequences better and produced longer answers than BERT. Document texts in our dataset are typically longer than SQuAD’s because we merged multiple reviews of the same product into one. This merging, however, did not affect performance.
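The merging step mentioned above can be sketched as a simple group-by. The product identifiers, review texts, and the merge_reviews helper are hypothetical; the actual preprocessing used for the dataset is not described in more detail here.

```python
from collections import defaultdict

def merge_reviews(reviews):
    """Group (product_id, review_text) pairs into one document per product."""
    grouped = defaultdict(list)
    for product_id, text in reviews:
        grouped[product_id].append(text.strip())
    return {pid: " ".join(texts) for pid, texts in grouped.items()}

reviews = [
    ("laptop-a", "Battery life is strong."),
    ("laptop-b", "It doesn't have a backlit keyboard."),
    ("laptop-a", "It comes with Windows 10 Home. "),
]
documents = merge_reviews(reviews)
print(documents["laptop-a"])
```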

Conclusion

Neural network architectures based on the Transformer have been reported to outperform humans on the SQuAD 2.0 dataset. Real-life datasets can be more challenging than datasets developed in laboratories for NLP competitions, as customers have unlimited imagination in what they can ask. A well-designed question answering system can offer significant assistance to customer service by augmenting support personnel. As these systems continue to improve, they may soon be capable of performing as well as, or even better than, their human counterparts.
