
Improving product ranking in e-commerce with sparse neural search

Stanislav Stolpovskiy

Jan 23, 2024 • 7 min read


Table of Contents

Nature of product retrieval
Pros and cons of sparse and dense retrieval systems
Sparse neural search
The best of both worlds
Query expansion
Application to e-commerce
Conclusion

In the ever-growing world of e-commerce, providing customers with an efficient and relevant product ranking experience is crucial for driving sales and maintaining customer satisfaction. Online retailers invest significant resources into optimizing their search and recommendation systems to ensure that users find what they are looking for quickly and effortlessly. However, with the constant expansion of product catalogs and the increasing complexity of user queries, traditional algorithms may fall short of meeting these needs.

This blog post explores an application of SPLADE (SParse Lexical AnD Expansion Model) for addressing the product retrieval challenge in e-commerce. Developed as a sparse model for first-stage ranking in information retrieval, SPLADE leverages sparse lexical representations and query expansion techniques to better understand user intent and deliver more relevant search results.

Nature of product retrieval

Let’s take a step back and dissect the nature of existing product retrieval approaches. Product retrieval is the process of finding products that are relevant to a user’s query. There are several ways to achieve this, each with its own strengths and weaknesses.

Boolean retrieval, or the “classical” bag of words (BOW) approach, is a simple and straightforward method. For example, a user might search for “black straight short dress”. With an inverted index, each search term maps to a list of products containing that specific term. These lists are then intersected, ultimately returning products that contain all the query terms or a relevant subset of them.

A search result is an exact match when all query terms are found in the product, and a partial match when only some of them are. The approach is called Boolean retrieval because it is based on Boolean logic and set intersection.
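The intersection logic described above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the toy catalog, the whitespace tokenizer, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy catalog: product id -> product text.
PRODUCTS = {
    1: "black straight short dress",
    2: "black long dress",
    3: "red short skirt",
    4: "black short dress with belt",
}

def build_inverted_index(products):
    """Map each term to the set of product ids whose text contains it."""
    index = defaultdict(set)
    for product_id, text in products.items():
        for term in text.lower().split():
            index[term].add(product_id)
    return index

def boolean_search(index, query):
    """Return (exact, partial): ids matching all terms, and a
    {id: matched term count} dict for ids matching only some terms."""
    # Deduplicate query terms while keeping their order.
    terms = list(dict.fromkeys(query.lower().split()))
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set(), {}
    exact = set.intersection(*postings)
    counts = defaultdict(int)
    for posting in postings:
        for product_id in posting:
            counts[product_id] += 1
    partial = {pid: n for pid, n in counts.items() if pid not in exact}
    return exact, partial

index = build_inverted_index(PRODUCTS)
exact, partial = boolean_search(index, "black straight short dress")
```

Here only product 1 contains all four terms, while the other products are partial matches that can be ordered by how many query terms they share.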

However, instead of representing a query as a set of arrays with documents, a more efficient approach involves creating a vector for each search query and document. Each vector is designed with a length corresponding to the size of the word dictionary. Taking our example search query into account, the representation will look like this:

Mapping between numeric array where each index represents a word

In this scenario, each position in the vector corresponds to the index of a word in the dictionary, with a value of 1 denoting the presence of the term in the query and 0 indicating its absence. Because word dictionaries are usually pretty big, often exceeding 30k terms, such vectors will be very sparse. In mathematics and computer science, a dense vector is a vector where most of the elements are non-zero.

A sparse vector, on the other hand, is a vector where most of the elements are zero, so it contains very few non-zero elements relative to its total length. That is why such a retrieval approach is also called sparse retrieval.
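As a small illustration of this representation, the sketch below builds a binary vector over a deliberately tiny, made-up dictionary and then stores only its non-zero positions. The vocabulary and helper names are assumptions for the example; a real dictionary would have tens of thousands of entries.

```python
# Hypothetical tiny dictionary; real ones often exceed 30k terms.
VOCABULARY = ["black", "blue", "dress", "long", "red", "short", "straight"]
TERM_TO_INDEX = {term: i for i, term in enumerate(VOCABULARY)}

def to_binary_vector(text):
    """Return a 0/1 vector with one position per dictionary word."""
    vector = [0] * len(VOCABULARY)
    for term in text.lower().split():
        position = TERM_TO_INDEX.get(term)
        if position is not None:  # out-of-vocabulary terms are dropped
            vector[position] = 1
    return vector

def to_sparse(vector):
    """Keep only the non-zero positions as {index: value}."""
    return {i: value for i, value in enumerate(vector) if value != 0}

query_vector = to_binary_vector("black straight short dress")
sparse_query = to_sparse(query_vector)
```

With a realistic dictionary size, the `{index: value}` form stays as small as the query itself, which is what makes this kind of vector cheap to store and look up.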

TF/IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25) are sometimes described as forms of sparse vectors. It is important to note, however, that both of these algorithms are relevance scoring approaches: they change the values assigned to terms, not how the vector itself is composed. Here is an example:

Word	Boolean	TF/IDF	BM25
Black	1	0.068	20.66
Dress	1	0.023	5.01
Red	0	0	0
Blue	0	0	0
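To make the scoring column more concrete, here is a minimal BM25 sketch. It assumes the commonly used formula with a Lucene-style IDF and the typical defaults k1 = 1.2 and b = 0.75; the toy documents are invented and are not the data behind the table above, so the numbers will differ.

```python
import math

# Hypothetical tokenized toy catalog (not the data behind the table).
DOCS = [
    ["black", "dress"],
    ["black", "black", "shoes"],
    ["red", "dress", "long"],
    ["blue", "jeans"],
]

def idf(term, docs):
    """Lucene-style inverse document frequency (always non-negative)."""
    n = sum(1 for doc in docs if term in doc)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))

def bm25_term_score(term, doc, docs, k1=1.2, b=0.75):
    """BM25 contribution of a single term for a single document."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0  # absent terms contribute nothing, as in the table
    avg_len = sum(len(d) for d in docs) / len(docs)
    length_norm = k1 * (1 - b + b * len(doc) / avg_len)
    return idf(term, docs) * tf * (k1 + 1) / (tf + length_norm)

def bm25_score(query_terms, doc, docs):
    """Sum of per-term BM25 contributions."""
    return sum(bm25_term_score(t, doc, docs) for t in query_terms)

score_first = bm25_score(["black", "dress"], DOCS[0], DOCS)
score_third = bm25_score(["black", "dress"], DOCS[2], DOCS)
```

The vector layout is unchanged from the Boolean case; only the values stored at the non-zero positions differ.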

Dense retrieval, on the other hand, works very differently. Dense retrieval is a type of vector search where features from both the query and products are represented as compressed dense vectors. Due to the compression, the conventional retrieval approach employed with sparse features is not applicable. Instead, a nearest neighbor search algorithm is used. In this algorithm, the distance metric between the query vector and the product vectors is pivotal: the products closest to the search query, as determined by this metric, are considered the most relevant ones.
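A brute-force version of this search can be sketched as follows. Cosine similarity is used here as one possible metric, and the three-dimensional vectors are invented for the example; real embeddings have hundreds of dimensions, and production systems typically rely on approximate nearest neighbor indexes rather than scanning every product.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_products(query_vec, product_vecs, k=2):
    """Exhaustively score every product and return the k most similar ids."""
    scored = [
        (cosine_similarity(query_vec, vec), product_id)
        for product_id, vec in product_vecs.items()
    ]
    scored.sort(reverse=True)
    return [product_id for _, product_id in scored[:k]]

# Hypothetical dense embeddings, made up for illustration.
PRODUCT_VECTORS = {
    "black_dress": [0.9, 0.1, 0.3],
    "red_skirt": [0.1, 0.8, 0.2],
    "black_shoes": [0.7, 0.2, 0.9],
}
QUERY_VECTOR = [0.8, 0.1, 0.4]

top = nearest_products(QUERY_VECTOR, PRODUCT_VECTORS, k=2)
```

Note that nothing in the result says which words caused a product to rank where it did, which is the explainability gap discussed in the next section.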

Pros and cons of sparse and dense retrieval systems

Both retrieval systems have their pros and cons. Modern dense retrieval systems show better accuracy than classic TF/IDF and BM25 retrieval methods in general domain retrieval tasks. Sparse systems, such as Boolean retrieval, fail to deliver relevant results in the case of out-of-vocabulary queries. Moreover, in scenarios where multiple matches occur in different sections of documents, Boolean retrieval may experience confusion with ranking. However, in the context of exact matches, vector search behavior in dense retrieval systems can be unstable due to the inherent complexities of the multi-dimensional space retrieval process. Additionally, dense retrieval lacks the provision of explanations for why a particular product is ranked at a specific position. On the other hand, classical approaches provide more predictable behavior and results can be explained.

Sparse neural search

Sparse neural search represents a departure from the static nature of classic sparse retrieval approaches, which necessitate constant resources for maintenance and tuning. Recognizing the challenges posed by this static nature, researchers have actively sought solutions, resulting in numerous publications and models integrating machine learning (ML) components into Boolean retrieval processes. Among these, the SPLADE model stands out as a significant advancement.

SPLADE, or SParse Lexical AnD Expansion Model, was introduced by Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant in their paper presented at the SIGIR 2021 conference.

This paper proposes an approach that attempts to harmonize dense and sparse retrieval methods, leveraging the strengths of both. In this approach, a neural network based on the transformer architecture plays a pivotal role. This network is adept at receiving input in the form of queries or product descriptions and is trained using a methodology akin to dense embedding models, incorporating insights gleaned from clickstream data. The model inference flow unfolds as follows:
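One step of that flow, turning the transformer's per-token outputs into a single vector of term weights, can be sketched as below. The sketch assumes the log-saturated sum pooling described in the SIGIR 2021 paper, where the weight of vocabulary term j is the sum over input tokens i of log(1 + ReLU(w_ij)); later SPLADE variants reportedly use max pooling instead. The five-word vocabulary and the logits are invented, and no real model is called.

```python
import math

# Hypothetical five-term vocabulary; a BERT vocabulary has ~30k entries.
VOCAB = ["black", "dress", "gown", "short", "the"]

# Hypothetical logits: one row per input token, one column per vocab term.
TOKEN_LOGITS = [
    [2.0, 0.5, -1.0, -0.5, -2.0],  # input token "black"
    [0.3, 2.5, 1.2, -0.2, -1.5],   # input token "dress"
]

def splade_term_weights(token_logits):
    """Sum log(1 + ReLU(logit)) over input tokens for each vocab term."""
    weights = [0.0] * len(token_logits[0])
    for row in token_logits:
        for j, logit in enumerate(row):
            weights[j] += math.log1p(max(0.0, logit))
    return weights

weights = splade_term_weights(TOKEN_LOGITS)
# Keep only non-zero entries, keyed by the term they stand for.
sparse = {VOCAB[j]: round(w, 3) for j, w in enumerate(weights) if w > 0}
```

In this toy case "gown" receives a non-zero weight even though it never appears in the input, while "short" and "the" drop out entirely.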

The biggest difference is that the model does not produce a dense vector for text embedding, but a sparse vector with the same length and order as the transformer's vocabulary. In other words, the model predicts two things: which terms to search for and how important each term is. Using this term importance, SPLADE can achieve the following:

Query expansion: By assigning weights for terms that are not present in the original query but deemed important by the model, SPLADE enhances the query’s inclusiveness, capturing potentially relevant nuances.
Query relaxation: Conversely, SPLADE can refine queries by adjusting weights or removing terms below a specified threshold. This process aids in the elimination of potentially misleading or irrelevant terms, contributing to a more focused and accurate search.
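Both effects can be illustrated with a small sketch that compares the original query against a set of term weights. The weights, the threshold value, and the function name are hypothetical; in practice the weights would come from the model, and the threshold would be tuned.

```python
# Hypothetical term weights, standing in for real model output.
MODEL_WEIGHTS = {
    "black": 1.36,
    "dress": 1.66,
    "short": 0.92,
    "gown": 0.79,      # not in the original query
    "evening": 0.31,   # not in the original query
    "straight": 0.08,  # in the query, but weighted very low
}

def split_query_terms(original_query, term_weights, threshold=0.2):
    """Return (kept weights, expansion terms, relaxed terms)."""
    original = set(original_query.lower().split())
    kept = {t: w for t, w in term_weights.items() if w >= threshold}
    # Expansion: weighted terms the user never typed.
    expanded = sorted(t for t in kept if t not in original)
    # Relaxation: typed terms that fell below the threshold.
    relaxed = sorted(t for t in original if t not in kept)
    return kept, expanded, relaxed

kept, expanded, relaxed = split_query_terms(
    "black straight short dress", MODEL_WEIGHTS
)
```

Here "evening" and "gown" are expansion terms, and "straight" is relaxed away because its weight falls below the threshold.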

Boolean and SPLADE retrieval combination

After that, the output vector can be used to compose a search query and run it against an inverted index.
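A minimal sketch of that last step, assuming document-side term weights are stored in the index postings and relevance is the dot product of query and document weights, might look like this. The index contents, weights, and function names are all invented for the example.

```python
from collections import defaultdict

# Hypothetical index: term -> {product id: document-side term weight}.
WEIGHTED_INDEX = {
    "black": {1: 1.2, 2: 1.0},
    "dress": {1: 1.5, 3: 1.4},
    "gown": {3: 1.1},
}

def score_products(query_weights, index):
    """Rank products by the dot product of query and document weights."""
    scores = defaultdict(float)
    for term, query_weight in query_weights.items():
        for product_id, doc_weight in index.get(term, {}).items():
            scores[product_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def explain(product_id, query_weights, index):
    """Per-term contributions to one product's score."""
    return {
        term: query_weight * index[term][product_id]
        for term, query_weight in query_weights.items()
        if product_id in index.get(term, {})
    }

QUERY_WEIGHTS = {"black": 1.0, "dress": 2.0, "gown": 0.5}
ranking = score_products(QUERY_WEIGHTS, WEIGHTED_INDEX)
why_first = explain(1, QUERY_WEIGHTS, WEIGHTED_INDEX)
```

Because every score is a sum of per-term contributions, a helper like `explain` can show exactly which terms placed a product at its position.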

The best of both worlds

Such an approach gives us benefits from both classic Boolean retrieval and vector search, but it actually inherits some cons as well.

From the classic side, the approach reuses existing infrastructure and keeps results explainable, thanks to the inverted index. From dense retrieval, it gains automatic query expansion through a self-learning mechanism, along with fine-grained ranking through learned term weights, which improves the overall effectiveness of the retrieval process.

Query expansion

The SPLADE model automatically expands the user’s query by adding synonyms, related words, and other lexical elements. This process helps the model better grasp user intent and identify more relevant documents (or products). Query expansion is based on the analysis of document structures and their content, as well as the utilization of external knowledge sources like semantic networks.

A notable distinction from Dense retrieval is the transparency SPLADE offers regarding the terms employed for expansion. This unique feature provides invaluable insights, serving as a wellspring for linguistic enrichment. Importantly, these insights permeate across all system components, spanning search functionality and autosuggest features.

Application to e-commerce

A prominent challenge observed in much of the publicly available research is the reliance on general datasets, so results often fail to hold up when applied to structured data. The ubiquitous MS MARCO dataset is a frequent culprit in this regard.

To address this limitation, our research delved into the efficacy of utilizing pre-trained and fine-tuned SPLADE models on domain-specific data, specifically in the context of product catalog search use cases. A crucial aspect of our investigation was to assess the feasibility of automatic training for the SPLADE model without delving into extensive hyperparameter tuning.

A significant hurdle encountered during our exploration was the misalignment between the SPLADE model, designed with a BERT input, and the structural nature of e-commerce data. To streamline the process, we opted for a simplified approach, consolidating product title, category, and key attributes into a singular product string.
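The consolidation step might look something like the sketch below. The field names, the choice of attributes, and the separator are assumptions made for illustration; the exact format used in the experiments is not specified here.

```python
def product_to_text(product, attribute_keys=("color", "length", "material")):
    """Flatten a structured product record into one string for the model.

    Field names, attribute keys, and the " | " separator are hypothetical.
    """
    parts = [product.get("title", ""), product.get("category", "")]
    attributes = product.get("attributes", {})
    for key in attribute_keys:
        value = attributes.get(key)
        if value:  # skip attributes that are missing or empty
            parts.append(f"{key}: {value}")
    return " | ".join(part.strip() for part in parts if part and part.strip())

# Hypothetical catalog record.
PRODUCT = {
    "title": "Straight Short Dress",
    "category": "Women > Dresses",
    "attributes": {"color": "black", "length": "short"},
}

product_text = product_to_text(PRODUCT)
```

Keeping the attribute names in the string is one possible design choice: it gives the model a hint about what each value means, at the cost of a longer input.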

The evaluation results of these models on our test dataset yielded compelling insights:

Efficacy of pre-trained and fine-tuned SPLADE models on product catalog search data

The clear trend in our observations is that SPLADE, whether pre-trained on MS MARCO or fine-tuned on our proprietary dataset, consistently outperforms BM25.

Conclusion

The SPLADE model stands out as a superior alternative to traditional ranking algorithms like BM25, boasting faster performance, enhanced scalability, and heightened accuracy. These advantages position the model as an enticing choice for diverse information retrieval applications, particularly in the domain of product ranking within e-commerce platforms.

What makes SPLADE particularly appealing is its seamless integration into existing retrieval system workflows based on inverted indexes. This integration comes without the need for significant infrastructure modifications, allowing for swift implementation with immediate impact. The model aligns with Lucene-based or legacy retrieval systems, eliminating the necessity for a separate indexing pipeline and vector database. Consequently, SPLADE proves versatile for both creating new search solutions and enhancing existing ones.

However, it’s crucial to recognize the inherent trade-offs associated with SPLADE. While it offers efficiency gains, the model may yield less precise results due to query expansion and relaxation. Therefore, we recommend caution when considering it as a primary stage in the search pipeline or in scenarios where precision is paramount.

Despite these considerations, the ongoing advancements in term-based search approaches underscore the relevance and value that SPLADE brings to search systems. The inclusion of this model in our Semantic Vector Search Starter Kit further exemplifies its practical application and efficacy.

