
Tiered machine learned ranking improves relevance for the retail search

Eugene Steinberg

Mar 03, 2020 • 8 min read

Online retailers operating large catalogs are always looking to improve the quality of their product discovery functionality. The keyword search and category browse experience powered by their search engine is the first stage of the sales funnel, where fast and relevant search results can produce a delightful and seamless shopping experience that leads to higher conversion rates.

Traditional search engines rely on matching words in the query to words in the products, and make ranking decisions based on the frequency of the matches. Search relevance engineers spend a lot of time tuning ranking formulae to carefully balance the value of words matching in different attributes, making decisions like “a match in the brand field is twice as important as a match in the category field” or “a partial match in the product name is 5 times less important than a full match”. They also take into account business signals such as price, margins, sales and inventory.

Even when the ranking formulae are well tuned, in many search sessions customers use broad terms, such as “jeans”, “shirt” or “Ralph Lauren”, which match many products equally well. There is simply not enough information about shopping intent in the query to make a good ranking decision. At the same time, customer behavior data on the site provides a trove of information on which products customers really interact with when searching for particular phrases.

Using machine learning, we can use this information to automatically devise optimal ranking formulae which also self-learn and adapt to changes in customer behavior. This approach, known as learning-to-rank (LTR), is a de facto standard ranking feature in modern retail search systems. However, because of the high computational complexity of ML models, it can only be applied to dozens or hundreds of top search results, which limits its potential.

In this blog post, we will describe a tiered learning-to-rank approach which allows us to work around this limitation.

First, let’s review how learning-to-rank is implemented in popular open source search engines such as Solr and Elasticsearch.

Learning to rank in a typical search system

A search query sweeps through the search index, producing candidate matches. Every match is scored according to the ranking formulae and collected into a priority queue to find the top matches. Without learning-to-rank, those top matches are returned to the customer. With learning-to-rank, the collected top matches are re-scored by the ranking model. A ranking model requires features, which are the data points that fuel the decisions made by the model. Features can be as simple as the value of a particular product attribute or as complex as the result of a search subquery which has to be executed separately. The ranking model takes all those features into account to re-score the top N products, which are then returned to the customer.

As you can see, from a performance standpoint the feature extraction and ranking phases can be quite intensive, and their cost directly depends on how many products they have to examine and rank. This is fine for dozens or hundreds of products, but not for thousands or millions. This is why you can’t run a full LTR model against all matching products: feature extraction and ranking would be far too slow.

In some search systems, re-ranking happens not inside the search engine but in a dedicated microservice or even a third-party system. Still, this doesn’t change the fact that only the top few dozen results get re-sorted by the machine learning algorithm. It also adds serialization and wiring overhead to the system.

What if we want to run LTR ranking against all the products that match the query? Our goal is to surface products which are relevant from the LTR perspective but can be missed by conventional ranking formulae. We can still do it, but for performance reasons the model has to be simple enough to quickly score every matching product.

This way, we will have a two-tier ranking system. The first tier will quickly evaluate every matching product with a simple model and find the top N matches, while the more sophisticated second-tier model will take its time to exhaustively re-rank those matches and promote the most relevant results.
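The two tiers can be sketched in a few lines of Python. This is only an illustration of the control flow, not how Solr or Elasticsearch implement it: the names `linear_score` and `tiered_rank`, the product dictionaries and the stand-in second-tier model are all assumptions made for the example.

```python
import heapq
from typing import Any, Callable, Dict, List, Sequence

Product = Dict[str, Any]


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """First-tier score: a dot product, cheap enough to run on every match."""
    return sum(w * x for w, x in zip(weights, features))


def tiered_rank(
    matches: List[Product],
    weights: Sequence[float],
    second_tier: Callable[[Product], float],
    top_n: int,
) -> List[Product]:
    # Tier 1: score every matching product and keep only the top_n best.
    shortlist = heapq.nlargest(
        top_n, matches, key=lambda p: linear_score(weights, p["features"])
    )
    # Tier 2: run the expensive model on the shortlist only.
    return sorted(shortlist, key=second_tier, reverse=True)


# Hypothetical toy data: four matches with two features each.
toy_matches = [
    {"id": "a", "features": [0.9, 0.1]},
    {"id": "b", "features": [0.2, 0.8]},
    {"id": "c", "features": [0.1, 0.1]},
    {"id": "d", "features": [0.7, 0.6]},
]
# Stand-in for a heavy model: here it simply reads the second feature.
ranked = tiered_rank(toy_matches, [1.0, 0.5], lambda p: p["features"][1], top_n=3)
print([p["id"] for p in ranked])  # ['b', 'd', 'a']
```

Product "c" never reaches the second tier, which is the point: the expensive model only ever sees `top_n` candidates, however many products match the query.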

Let’s review a practical case study of application of this technique to online commerce search.

If we look at learning-to-rank as a machine learning problem, we are trying to predict the correct ranking (order) of the products based on the query features q and the product features d. In other words, we are trying to learn a ranking function f(q,d) which estimates a product’s relevance score for the query.

As with any supervised learning problem, we need some sort of ground truth about the relevance of particular products for a particular query. Such ground truth can be based on expert judgement or on crowdsourcing, but by far the most practical way of obtaining the ground truth signal is to use customer behavior as a proxy for relevance.

Generally, there are three approaches used in learning-to-rank models: pointwise, pairwise and listwise.

The pointwise approach tries to directly predict the relevance score, the pairwise approach predicts the relative order of two candidate products, and the listwise approach deals with a whole list of products at once.

As usual in machine learning tasks, before we do any training we need to vectorize our data, i.e. convert our queries and documents to vectors, matrices or tensors.

Types of features in learning to rank

In general, we will have 3 kinds of features:

query features, which can be as simple as the number of words in the query or as complex as a deep learning word or sentence embedding.
document features, which in our e-commerce case can be things like price, units sold, sales, popularity, number of views, number of reviews, newness, etc.
query-document features, which represent how well the document matches the query. There are numerous techniques to engineer those features, and we will discuss some of them here.

Let’s assume we have the following dataset capturing our queries and their relationships with discovered products:

Relationship between queries and discovered products

In this dataset we introduce several query-document features:

Based on the intent classification model, we capture the probability of this item’s product type as predicted by the query intent model. For example, if the product type of this product is “jeans” and the customer is searching for “dress”, this score will be low. The same approach is applied to the more granular “item type”.
We analyze attribute-level matches and capture whether the search engine matched the query to each of the product attributes. We encode those match patterns as a binary vector with one flag per attribute, e.g. if our attributes are (Brand, Type, Occasion, Size, Color) then (0,1,0,0,1) denotes a match in the Type and Color fields.
We scale all product-level features (sales, views, etc.) towards a normal distribution using a Box-Cox transformation followed by a standard scaler, as our ML model expects normally distributed features.
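As a rough illustration of the last two steps, here is a minimal standard-library sketch of the match-pattern encoding and of Box-Cox scaling followed by standardization. The function names are made up for this example, and the Box-Cox parameter `lam` is passed in by hand as an assumption. In practice it would normally be estimated from the data by a library routine.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence

ATTRIBUTES = ("Brand", "Type", "Occasion", "Size", "Color")


def encode_matches(matched: Sequence[str]) -> List[int]:
    """Return one 0/1 flag per attribute, in ATTRIBUTES order."""
    matched_set = set(matched)
    return [1 if name in matched_set else 0 for name in ATTRIBUTES]


def box_cox(values: Sequence[float], lam: float) -> List[float]:
    """Box-Cox power transform; defined for strictly positive values only."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical values: a query matched in Type and Color, and heavy-tailed sales.
match_vector = encode_matches(["Type", "Color"])  # [0, 1, 0, 0, 1]
scaled_sales = standardize(box_cox([1, 10, 100, 1000], lam=0.0))
```

With `lam=0.0` the transform reduces to a logarithm, which is often a reasonable starting point for heavy-tailed counts such as sales or views.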

Now, we need to develop and train a model which the search engine can use to directly score the matching products. To achieve that, we will use a neat trick: the pairwise linear transformation. Mathematically inclined readers can refer to the original article.

For the rest of us, the main idea can be described as follows:

For each query, we generate pairs of products which were interacted with after searching this query.
We linearly transform their corresponding features to get “delta-features”, which describe the differences between the products for document features, and the differences in how the products match the query for query-document features.
We define a training target to reflect a judgement on which of the products in the pair is more relevant for the query.
We train the linear model to predict the target, obtaining the weights for the delta-features.
… and here is the neat trick: as both the model and the transformation are linear, we can apply the learned delta-feature weights to the original features to predict a score which will still correctly order the products in the pairs. This allows us to use the learned weights directly in the search engine.
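The trick works because a linear score distributes over differences: w·(a − b) = w·a − w·b, so whenever the model says the delta of a pair is positive, the first product also gets the higher individual score. The sketch below shows this end to end on hypothetical toy data. It is not the article's training code: the small gradient-descent logistic regression (without an intercept, since an intercept cancels out when two scores are compared), the feature values and the 1/0 label encoding are all assumptions chosen to keep the example dependency-free.

```python
import math
from itertools import combinations


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_delta_pairs(features, relevance):
    """Build (delta-feature, label) rows for every pair with different relevance."""
    rows = []
    for a, b in combinations(features, 2):
        if relevance[a] == relevance[b]:
            continue  # ties carry no ordering information
        delta = [xa - xb for xa, xb in zip(features[a], features[b])]
        label = 1 if relevance[a] > relevance[b] else 0
        rows.append((delta, label))
        # Add the mirrored pair so both classes are equally represented.
        rows.append(([-d for d in delta], 1 - label))
    return rows


def train_logreg(rows, epochs=200, lr=1.0):
    """Batch gradient descent for logistic regression without an intercept."""
    w = [0.0] * len(rows[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for delta, label in rows:
            err = sigmoid(dot(w, delta)) - label
            for k, dk in enumerate(delta):
                grad[k] += err * dk
        w = [wk - lr * gk / len(rows) for wk, gk in zip(w, grad)]
    return w


# Hypothetical products for one query: [intent probability, scaled sales].
features = {
    "p1": [1.0, 0.3],
    "p2": [0.7, 0.4],
    "p3": [0.4, 0.2],
    "p4": [0.1, 0.3],
}
relevance = {"p1": 3.0, "p2": 2.0, "p3": 1.0, "p4": 0.0}

weights = train_logreg(make_delta_pairs(features, relevance))
# Weights learned on delta-features are applied to the ORIGINAL features.
ranked = sorted(features, key=lambda p: dot(weights, features[p]), reverse=True)
print(ranked)  # ['p1', 'p2', 'p3', 'p4']
```

In a search engine, `weights` would be plugged into the scoring function as per-feature boosts, so no pair generation is needed at query time.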

Before we can proceed, we should figure out how to create a learning target, aka relevance judgement. There is a separate art and science behind relevance judgement models. Researchers and practitioners devise different kinds of learning targets from clickstream data, from simply counting clicks on products to sophisticated models which incorporate post-click behavior.

Here, we will consider the following simple technique for generating relevance judgements. We group all events by query and define the relevance score of a product as

S = log(clicks) * carts / clicks.

This formula takes into account the general attractiveness of the document in the results as log(clicks), as well as the add-to-cart ratio (carts / clicks) as an indicator of relevance.
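A minimal sketch of this score, assuming a clickstream of `(query, product, event_type)` tuples with event types `"click"` and `"cart"` (that event format is hypothetical). The article does not state the base of the logarithm, so the natural logarithm is assumed here, and products with no clicks are given a score of 0 to avoid dividing by zero.

```python
import math
from collections import defaultdict


def relevance_score(clicks: int, carts: int) -> float:
    """S = log(clicks) * carts / clicks, with 0.0 for products without clicks."""
    if clicks <= 0:
        return 0.0
    return math.log(clicks) * carts / clicks


def score_events(events):
    """Aggregate (query, product, event_type) tuples into per-query scores."""
    counts = defaultdict(lambda: {"click": 0, "cart": 0})
    for query, product, event_type in events:
        if event_type in ("click", "cart"):
            counts[(query, product)][event_type] += 1
    scores = defaultdict(dict)
    for (query, product), c in counts.items():
        scores[query][product] = relevance_score(c["click"], c["cart"])
    return scores
```

One property of the formula worth keeping in mind: a product with exactly one click scores 0 regardless of carts, because log(1) = 0, so very sparse query-product pairs contribute no signal.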

Now, let’s look at the example of pairwise transformation. We will start with the following simple dataset.

First, we generate pairs of products from it, encode differences between their features and create binary and float prediction targets.

The target after the pairwise transformation shows which product in the pair is considered more relevant for the query, the first (1) or the second (-1), based on the score.
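The transformation itself can be sketched as follows. The row layout (`query`, `product`, `features`, `score`) is a hypothetical stand-in for the dataset, and the float target is assumed here to be the score difference within the pair, which the text does not spell out.

```python
from itertools import combinations


def pairwise_transform(rows):
    """Turn per-product rows into per-pair rows with delta-features and targets."""
    by_query = {}
    for row in rows:
        by_query.setdefault(row["query"], []).append(row)

    pairs = []
    for query, items in by_query.items():
        # Pairs are only formed between products seen for the same query.
        for first, second in combinations(items, 2):
            diff = first["score"] - second["score"]
            if diff == 0:
                continue  # equal scores give no preference
            pairs.append({
                "query": query,
                "pair": (first["product"], second["product"]),
                "delta": [a - b for a, b in zip(first["features"], second["features"])],
                "binary_target": 1 if diff > 0 else -1,
                "float_target": diff,  # assumption: score difference
            })
    return pairs
```

A query with n scored products yields up to n(n-1)/2 pairs, so for popular queries it may be worth sampling pairs rather than generating all of them.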

Next, we train the “usual suspects” among linear models: logistic regression, linear regression and linear SVM. The resulting weights are transferred to the search engine to rank products in the search results. To evaluate the quality of the model, we used the following technical search quality metrics:

Search quality technical metrics formula

We define “binary target” as a simple relevance metric which indicates whether a product was added to cart after the query. We consider products which are never added to cart as irrelevant for the query.

“Float target” is the score normalized by the maximum score within the query, and is used as a proxy for relevance when calculating NDCG and precision metrics.
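For reference, here is a sketch of the textbook definitions of NDCG@k and precision@k, taking the float target as graded relevance and the binary target as the hit indicator. The exact variants used in the evaluation were given in a figure, so treat the linear-gain DCG below as an assumption rather than the article's formula. Ranked precision is not covered.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the ranking as returned, divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def precision_at_k(binary_targets: Sequence[int], k: int) -> float:
    """Share of the top k results whose binary target is 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(binary_targets[:k]) / k
```

Both functions expect the targets listed in the order the search engine returned the products, so `ndcg_at_k(float_targets, 12)` would correspond to NDCG@12 for a single query. Averaging over queries gives the reported metric.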

For the precision metrics, the following diagram explains them visually:

After evaluating the model results, we selected logistic regression as the winning model. More complicated models didn’t show substantial uplift over logistic regression on our dataset.

We compared our metrics with a baseline of standard search results (without the existing LTR re-sorting of the first few pages) and observed a massive uplift in all metrics:

NDCG@48 improved by 32%, while NDCG@12 improved by 50%
Precision improved by 20-35%
Ranked precision@12 improved by 52%

This means that the new model is much better than the baseline at predicting attractive and relevant products and pushing those products to the first pages of search results.

Conclusion

In this blog post, we made the case for, and described a practical approach to, building a tiered learning-to-rank search system which extends the coverage of machine-learned ranking to the whole set of matching products, not only to the first few pages of results. Because of the linear nature of the model, this can happen without significant impact on search performance. As a result, a larger number of relevant products are pushed to the top few pages by the search engine, where ranking can be further optimized by more sophisticated algorithms.

If you are interested in search quality and learning-to-rank and want to have a discussion with us, don’t hesitate to reach out!
