Home Insights Articles Harnessing multi-armed bandits for optimized coupon recommendations

Harnessing multi-armed bandits for optimized coupon recommendations

Vladimir Eremin

Oct 05, 2023 • 8 min read

Table of Contents

Recommendation systems are all about data
Mastering multi-armed bandits
Choosing the right bandit
Data for bandits
Defining the learner
Running the model
Conclusion

Envision yourself as a store owner looking to boost foot traffic. You hatch a plan: stand at the front and actively promote your items to those walking by. With enthusiasm, you wear a striking outfit and hold up a blank board, ready to publicize a special deal. However, when it’s time to pen the actual offer, you’re stumped. A “10% discount on everything” could work, but you’re torn between it being too small to matter or too large to remain profitable. A flat “$2 off your initial buy” crosses your mind, but could this be overly generous for just any passerby? Then the idea of “Free gift with every purchase” springs forth. It’s intriguing, but will it genuinely enhance the day’s revenue?

You quickly realize that what you truly need is some sort of method, capable of predicting the most appealing offer for each potential customer. Welcome to the intricate and intriguing realm of recommendation systems, where algorithms help to match the right offer to the right customer, improving your chances of conversion and sales.

Recommendation systems are all about data

At Grid Dynamics, we have profound expertise in recommendation systems to meet the needs of our enterprise clients. And so when we talk about recommendation systems, we talk big data! The abundance of data ensures a higher level of confidence about user preferences. In contrast, when there is limited data, there is a lack of certainty about what users might prefer. However, even in the face of such uncertainty, recommenders tend to prioritize and suggest items that have received a higher level of engagement previously.

For instance, with collaborative filtering, a most popular method typically used by recommender systems, users are matched based on their past behaviors or ratings. A user may be recommended a movie that other similar users have enjoyed.

Meanwhile, item-based similarity approaches would recommend items that are similar to ones the user has previously interacted with or rated highly. For instance, if a user liked a certain science fiction book, they would be recommended other books in the same genre or by the same author.

However, these methods can contribute to a feedback loop, where items that aren’t initially recommended continue to receive low to no engagement. This results in a cycle where those items that were initially popular and recommended continue to get promoted, while potentially relevant items get ignored. This might lead to the problem of overfitting, where only a limited set of items are recommended over and over, while a large variety of other potentially interesting items are not discovered or recommended. As a result, the system fails to explore the breadth of user preferences, limiting the overall user experience.

Mastering multi-armed bandits

There are several approaches that can solve this type of overfitting, and the multi-armed bandit approach is one of the most powerful and commonly used ones. Bandit algorithms propose a solution to this feedback-loop problem by modeling uncertainty, and intentionally exploring to minimize it. This strategy embodies the concept of ‘exploration vs exploitation’, where ‘exploitation’ is using the knowledge we already have to recommend the most engaging items, while ‘exploration’ involves suggesting less-known items to gather more information about user preferences.

By acknowledging the inherent uncertainty in the data, bandits actively engage in exploration to reduce it, thereby learning about the potential relevance of previously unexplored or underexplored items. For instance, instead of continually recommending the best-performing item (a best-selling menu item, for example), a bandit algorithm might occasionally recommend a less popular item (perhaps a new item or previously unpopular). This explorative approach provides an opportunity to discover whether users might like this less-known item, thereby expanding the scope of recommendations.

Moreover, bandits incorporate a system of rewards, where a successful recommendation (like a user clicking on a recommended item) garners a ‘reward’. Over time, these rewards inform the algorithm about the likelihood of different items being appreciated by different users, further refining the recommendation process. This harmonious blend of uncertainty, exploration, and learning makes bandit algorithms a compelling approach to mitigating the issues prevalent in the realm of recommendation systems.

Reward is usually denoted as $R_t$. Every time an action is taken, a reward is obtained. Each time an action is taken the reward returned can have a different value.

Below are a few more definitions that are used to define the Bandit model:

The number of available actions is denoted by the letter $k$
Action: the action taken at time-step $t$ is denoted as $A_t$
Each of the available $k$ actions has an expected reward: $Q(a)$:

$${Q_a}={E}[{R_t}|{A_t}={a}]$$

The most simple example of the reward function is a Sample Average Estimate:

$${Q}_{t+1}{(a)}= \frac{1}{n}\sum_{i=1}^{n} R{_t}$$

Choosing the right bandit

Choosing a different design for the reward function yields different types of bandits. In the following sections, we will touch upon three fundamental bandit algorithms: Epsilon-greedy, Upper Confidence Bound (UCB), and Thompson Sampling.

1.Epsilon-Greedy: This algorithm operates by using a probability to determine if it will explore or exploit. So, the reward function could be represented as:

$${R_t}=\text{argmax}_{a}{Q_t}{(a)}\text{, with probability}\space{1} − {\varepsilon}\space{a}\space\text{random action, with probability}\space{\varepsilon}$$

Where

$R_t$ represents the chosen action at time $t$,
$Q_t(a)$ is the estimated value of action $a$ at time $t$, and
$\varepsilon$ is the probability of choosing to explore rather than exploit.

2. Upper Confidence Bound (UCB): The UCB algorithm adds a confidence interval to the estimated reward for each action. The action chosen is the one with the highest upper confidence bound. This could be represented as:

$$A{_t}=\text{argmax}{_a}\left( Q{_t}(a)+c\sqrt[]\frac{In(t)}{N{_t}(a)} \right)$$

Where

$A_t$ is the action chosen at time $t$,
$Q_ta$ is the estimated value of action $a$ at time $t$,
$c$ is a constant determining the level of exploration,
$N_t (a)$ is the number of times action a has been chosen up to time t, and ln is the natural logarithm.

3. Thompson Sampling:

In this algorithm, an action is chosen based on the highest sample from each action’s reward distribution. This could be represented as:

$$A_t = \text{argmax}_{a} \left[ \text{Sample from } β(R_{1:t-1}(a), T_{1:t-1}(a)) \right]$$

Where

$A_t$ is the action chosen at time $t$,
represents a Beta distribution (commonly used in Thompson Sampling when rewards are binary),
${R}_{1:t-1}(a)$ is the sum of rewards for action a up to time $t−1$, and
${T}_{1:t-1}(a)$ is the total number of times action a has been chosen up to time $t−1$.

Our primary focus will be on Thompson Sampling, as it frequently surpasses both Epsilon-Greedy and UCB algorithms in performance.

Data for bandits

Returning to our shopkeeper scenario, after acquiring a solid understanding of multi-armed bandits and different reward functions, our shopkeeper resolves to employ the Thompson Sampling technique to identify the optimal offer for each of his patrons. His store, being somewhat removed from the bustling main street, sees a regular flow of three individuals passing by daily–we’ll denote these as Person $P_1$, $P_2$, and $P_3$.

For each individual $P_i$ (where $i$ ranges from 1 to 3), there are five potential offers (or actions) that the shopkeeper could extend, denoted as $a_1$, $a_2$…, $a_5$. Notably, he also posits that not all individuals may necessarily need an offer to visit the store. To verify this hypothesis, he decides to occasionally abstain from extending any offers at all, thereby establishing a control observation for his experiment.

With each passing day, the shopkeeper observes the interactions of each individual with the varying offers (including the no-offer scenario), aiming to discern the optimal action $A_i$ for each person $i$. This optimal action would not only encourage a higher frequency of store visits (improving conversion rates), but also exhibits an uplift over the control scenario of no offers. In doing so, the shopkeeper can ensure his marketing campaign is cost-efficient, only extending offers when they demonstrably enhance store visitation.

Defining the learner

With a clear strategy in mind, the shopkeeper set forth on a two-month experiment. Each day, he displayed a distinct offer (or no offer at all) to each of the three individuals, and meticulously recorded the outcome or “reward” $(R)$ as either a conversion $(1)$ or non-conversion $(0)$:

$${R}_{t,i,a}=\{\text{1 if conversion 0 if non-conversion}\}$$

In this formula, $t$ denotes the time step or day index, $i$ signifies the person, and $a$ indicates the offer index.

The shopkeeper then updated the parameters of the Beta distribution for each person’s offer preference using the following rules:

$$\alpha_{t,i,a} := \alpha_{t-1,i,a}+R_{t,i,a} (\text{representing the number of successes})$$
$$\beta_{t,i,a} := \beta_{t-1,i,a}+\big(1-R_{t,i,a}\big) (\text{representing the number of failures})$$

Here, $\alpha_{t,i,a}$ and $\beta_{t,i,a}$ denote the parameters of the Beta distribution for person $i$ action $a$ at time step $t$.

These updates provided the shopkeeper an estimated conversion probability $P_i(X|a)$ for each individual and offer. This probability followed a Beta distribution:

$$P_i(X|a)\sim\text{Beta}\bigg(\alpha_{t,i,a}\text{,}\beta_{t,i,a}\bigg)$$

Finally, to decide which offer to show next, the shopkeeper chose the action a that yielded the maximum expected reward. He used the following strategy to select the action:

$$A_{t+1,i} = \text{argmax}_{a} \left[ \text{Sample from } \text{Beta}(\alpha_{t,i,a}, \beta_{t,i,a}) \right]$$

Running the model

After two months of diligent experimentation, the shopkeeper had amassed a rich collection of data for each passerby, shedding light on their distinct preferences. The shopkeeper meticulously recorded a sequence of observations:

This table delineates the data he recorded. The day_index column begins from 0, representing the first day of the experiment. The customer_id could be either 0, 1, or 2, indicating the individual to whom an offer was presented. The offer_index reflects the specific offer shown–ranging from 1 to 5, with ‘-1’ indicating a control experiment where no offer was presented. Finally, the success column is a binary marker showing whether the individual visited the shop after seeing the offer, with ‘1’ indicating a visit and ‘0’ representing no visit.

Armed with this data, the shopkeeper dusted off his seldom-used laptop and entered these records into a Jupyter Notebook to analyze and visualize the results. His analysis culminated in a series of animated plots demonstrating the convergence of the Beta distribution for each individual over time.

For the first customer, the plot reveals a surprising trend. This customer demonstrated the largest response in the control observations, suggesting that the mere presentation of an offer discouraged them from entering the store. In marketing terminology, this segment of customers is known as ‘sleeping dogs,’ who are best left undisturbed.

In stark contrast, the second customer showed a strong preference for a specific offer. The offer $a^{(1)}$ yielded a clear uplift in store visits compared to other offers and the control group. This data suggests that the individual was persuadable and therefore worth the investment in marketing efforts.

The third customer, however, painted a completely different picture. The response rate was uniform across all offers, including the control group. This was a clear indication that regardless of the marketing efforts or offers, this individual’s decision to visit the store remained unchanged. As such, including this individual in the target audience would not result in incremental net sales and would be an inefficient use of resources.

Conclusion

In conclusion, the shopkeeper’s innovative application of Thompson Sampling provided him with invaluable insights. It empowered him to tailor his marketing strategy for each individual customer, enhancing efficiency and cost-effectiveness.

By mapping customer responses to offers, he could discern the impact of different marketing efforts. This data-driven approach enabled more effective customer segmentation. Instead of a one-size-fits-all strategy, he could now tailor offers based on individual preferences, optimizing the value of his marketing budget. Moreover, he could identify customers who remained unaffected by offers, thereby avoiding unnecessary marketing expenditure. On the flip side, he recognized customers who would purchase irrespective of promotions, saving resources on unneeded incentives.

In essence, the shopkeeper’s strategic use of Thompson Sampling refined his marketing from broad-strokes to precision-guided. This case underscores the potential of machine learning algorithms in enhancing marketing effectiveness and customer segmentation while ensuring optimal return on marketing investment.

Tags

AI-driven search and experiences

Artificial intelligence

Customer experience

Digital engagement

Retail

Inventory management system featuring a central storefront surrounded by delivery vans, shopping carts, stacked packages, and digital screens. The scene depicts the integration of online and physical retail, logistics, and automated inventory processes, all connected within a seamless, technology-driven supply chain

Article

Beyond multichannel: The competitive edge of omnichannel order management

Retail

Article Beyond multichannel: The competitive edge of omnichannel order management

You know the feeling: you walk into a store only to find out that the product you saw online is out of stock! This is one of the most common and problematic experiences for customers who shop multichannel retail. The problem for you? Disconnected sales channels, lost income, frustrated custom...

Retail

A shopping cart surrounded by silhouetted people in a vibrant, digital marketplace with hexagonal icons floating above, representing B2B composable commerce.

Article

Composable commerce for B2B: Overkill or delivers big?

High-tech

Article Composable commerce for B2B: Overkill or delivers big?

The buzzword “composable commerce” has dominated digital strategy conversations since Gartner popularized the term in 2020. But behind the marketing hype lies a longstanding, proven practice of integrating specialized, best-of-breed technology components into a flexible and scalable ecosystem....

High-tech

Multicolor whisps of smoke on a black background

Article

Headless CMS for the AI era with Grid Dynamics, Contentstack, and Google Cloud

Cross-industry

Article Headless CMS for the AI era with Grid Dynamics, Contentstack, and Google Cloud

For many businesses, moving away from familiar but inherently unadaptable legacy suites is challenging. However, eliminating this technical debt one step at a time can bolster your confidence. The best starting point is transitioning from a monolithic CMS to a headless CMS. This shift to a modern c...

Cross-industry

Ecommerce interface showing clothing on a rack to represent merchandising

Article

How a merchandising experience platform puts retailers in control of search, browse, and recommendations

Retail

Article How a merchandising experience platform puts retailers in control of search, browse, and recommendations

As a retail leader, are you in complete control of your search, browse, and recommendation strategies? Do your digital experiences align with your business goals while delivering what customers expect? Can you control product rankings to highlight specific items in search results, adjust categories...

Retail

Yellow bubbles coming out of a purple box

Article

10 reasons to migrate to a headless CMS with Contentstack and Grid Dynamics

Retail

Article 10 reasons to migrate to a headless CMS with Contentstack and Grid Dynamics

The headless CMS market is experiencing unprecedented growth as organizations recognize its potential for delivering flexible, personalized digital experiences. Recent market analysis reveals striking momentum—the global headless CMS software market, valued at $851.48 million in 2024, is projected...

Retail

Silhouette of a person standing on stairs in front of a large glass ball against a sunset background

Article

Probabilistic forecasting for enhanced demand prediction

Manufacturing

Article Probabilistic forecasting for enhanced demand prediction

In today's fast-paced and data-driven world, accurately predicting demand is more critical than ever for businesses aiming to stay competitive. Traditional forecasting methods often provide a single-point estimate, which can be useful but falls short in accounting for the inherent uncertainties and...

Manufacturing

Virtual model wearing a series of different clothing items to represent virtual try-on capabilities

Article

Digital dressing rooms: How generative AI is redefining virtual try-ons

Retail

Article Digital dressing rooms: How generative AI is redefining virtual try-ons

Have you come across a retail marketing message lately that states, 'Bring the fitting room home and find what you love'? Many retail brands today showcase their customer-first mindset through 'try before you buy' experiences, allowing customers to order products online, try everything, and return...

Retail

Harnessing multi-armed bandits for optimized coupon recommendations

Recommendation systems are all about data

Mastering multi-armed bandits

Choosing the right bandit

Data for bandits

Defining the learner

Running the model

Conclusion

Tags

You might also like

Get in touch

Thank you!

Something went wrong...

Harnessing multi-armed bandits for optimized coupon recommendations

Recommendation systems are all about data

Mastering multi-armed bandits

Choosing the right bandit

Data for bandits

Defining the learner

Running the model

Conclusion

Tags

You might also like

Subscribe to Grid Dynamics insights now

Get in touch

Thank you!

Something went wrong...

Subscribe to Grid Dynamics
insights now