By: Deborah O'Malley, M.Sc | Last updated May 2021
When you hear the name "multi-armed bandit," it might conjure up a lot of interesting imagery for you.
Perhaps you envision a bank robber armed with many guns, or a Bonnie and Clyde-type figure about to embark on a thrilling cross-country car chase after an armed robbery.
In fact, the title "multi-armed bandit" references the old-school style slot machines that used to line Las Vegas casino halls.
These slot machines affectionately garnered the name "one-armed bandits" because with the simple pull of a lever, or arm, you could run away with millions or lose your life savings.
So, what does a slot machine have to do with A/B testing, anyway?
Well, with a little luck, you might just see the connection. If you do, there's a big payout at the end. 😉
When running an A/B test using the multi-armed bandit methodology, the situation is somewhat akin to selecting the casino slot machine with the highest chance of winning.
If you were to go to the casino to play the slots, you'd want to zero in on the machine with the highest likelihood of paying out and head straight to it.
You'd probably figure out the best slot machine by watching others play for a little while and observing the activity around you.
A machine that just gave a big payout wouldn't be as likely to win again, so you'd probably be more tempted to go to a machine that hadn't hit the jackpot in a while.
But there's a definite tension between sticking with the arms that have performed well in the past and trying a new, or seemingly inferior, arm that you can only hope will perform better.
In multi-armed bandit testing, this same analogy can be applied.
However, instead of slot machines, each "arm" is a test variant.
With test variants, sticking with the tried-and-true control version is like staying on the same slot machine that's already paid out many times before. The chances of winning aren't quite as high -- but you never know what will come until you pay to play.
In contrast, each test variant is like a different slot machine that offers an exciting chance of winning.
As an optimizer, your goal is to find the version, or slot machine, with the best payout rate, while also maximizing your winnings.
Much easier said than done, but luckily, there are highly developed mathematical models for managing this multi-armed conundrum.
The Classic A/B Testing Approach
To fully understand the multi-armed bandit approach, you first need to be able to compare it against classical hypothesis-based A/B testing.
A classic A/B test positions a control against one or more variants.
In this standard A/B test set-up, traffic is typically split evenly: 50% is directed to the control and the other half to the variant.
The experiment runs with this traffic allocation until one version reaches a statistically significant conversion difference.
A winner is then declared, and if the winning version is implemented, all traffic will be directed to it -- until another experiment begins.
In this standard A/B testing methodology, even if one version far outperforms the other from the get-go, traffic will still be allocated evenly -- with 50% of visitors seeing the underperforming version and the other half directed to the version that is pulling way ahead.
An A/B testing purist will staunchly advocate that only a test which splits traffic equally (usually 50/50) -- and keeps that allocation throughout the duration of the test -- will yield accurate, reliable, statistically significant results.
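For comparison, here's a minimal sketch of the kind of fixed-split significance check this classic set-up relies on, assuming a two-proportion z-test; the function and conversion counts are illustrative, not taken from any particular platform.

```python
# Minimal sketch: evaluating a fixed 50/50 A/B test with a two-proportion z-test.
# All conversion counts below are illustrative, not real data.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, visitors_a, conv_b, visitors_b):
    """Return the z-score and two-sided p-value for the difference in conversion rates."""
    p_a = conv_a / visitors_a
    p_b = conv_b / visitors_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: the control converts 500/10,000 visitors; the variant converts 590/10,000
z, p = two_proportion_z_test(500, 10_000, 590, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # declare a winner only if p falls below 0.05
```

The key point is that the 50/50 split stays fixed for the entire run; the data are only used to declare a winner at the end, never to shift traffic along the way.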
The Multi-Armed Bandit Approach
However, one could justifiably argue that the notion of equal traffic allocation is a waste of time and visitor views. If one version is clearly pulling ahead, more traffic should be allocated to it sooner.
This is the approach multi-armed bandit testing takes.
In a multi-armed bandit experiment, traffic is initially equally split across the variants; however, as a winner begins to emerge, traffic is re-weighted and diverted to the best performing version.
The philosophy behind this approach is there's no logical reason to keep sending only half the traffic to a version that's garnering 80% of all conversions, for example.
So why not re-allocate the traffic to the winning version mid-test? That way, you give the version that appears to be the winner the best chance to pull ahead more quickly.
Tell this idea to an A/B testing purist and they'll balk at it!
They'll adamantly argue that, anytime traffic is unevenly allocated -- and especially if it's re-allocated mid-test -- it completely throws off the statistical validity and rigour of the test results.
As a result, unequal allocation of traffic will not yield statistically sound results -- and the approach is faulty. It should not be done!
However, there is a large and growing cohort of optimizers who argue the multi-armed bandit method -- in which traffic is weighted towards the winning version mid-test -- is perfectly legitimate.
This faction of testers will argue not only for the legitimacy of multi-armed bandit testing, but will also sing its praises as a completely valid, efficient, and effective way to run a study.
This method, they will argue, yields results more quickly and efficiently, with less traffic -- for the express reason that traffic is shifted, mid-test, to the version that appears to be pulling ahead as the winner.
In fact, according to Google:
"Experiments based on multi-armed bandits are typically much more efficient than 'classical' A/B experiments based on statistical-hypothesis testing. They’re just as statistically valid, and in many circumstances they can produce answers far more quickly."
Given the growing number of A/B testing platforms, like CrazyEgg and ABTesting.ai, that use the multi-armed bandit methodology, testers can have strong confidence this approach is sound.
In a multi-armed bandit test set-up, the conversion rates of the control and variants are continuously monitored.
A complex algorithm is applied to determine how to split the traffic to maximize conversions. The algorithm sends more traffic to the best-performing version.
In most multi-armed bandit testing platforms, each variation in a given test is assigned a weight and a creation date, and the platform tracks its number of views and conversions.
The number of views, the number of conversions, and the creation date are all assessed to determine the weight, that is, what percentage of visitors will see the version. The weight is adjusted daily based on the previous cumulative results.
When a new variant is added, the traffic percentages change. To explore the new variant's viability, the system initially gives it more traffic so it can be fairly tested against the variants that have already been running.
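Platforms don't all disclose the exact algorithm behind this re-weighting, but a common choice for this kind of problem is Thompson sampling, where each variant's cumulative conversions and views define a Beta posterior and the traffic weights come from posterior draws. The sketch below is a simplified illustration of that idea, not any specific platform's implementation; the function name and all counts are hypothetical.

```python
# Illustrative sketch: traffic re-weighting with Thompson sampling.
# Each variant's cumulative conversions/views feed a Beta posterior; the share of
# posterior draws in which a variant comes out on top becomes its traffic weight
# for the next day. All counts below are hypothetical, not real platform data.
import random

def thompson_weights(variants, draws=10_000):
    """variants: dict of name -> (conversions, views). Returns name -> traffic share."""
    wins = dict.fromkeys(variants, 0)
    for _ in range(draws):
        sampled = {
            # Beta(conversions + 1, non-conversions + 1) posterior for each arm
            name: random.betavariate(conv + 1, views - conv + 1)
            for name, (conv, views) in variants.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}

counts = {
    "control":   (120, 2400),   # 5.0% conversion rate so far
    "variant_b": (150, 2350),   # ~6.4% conversion rate so far
}
print(thompson_weights(counts))
# Roughly {'control': 0.02, 'variant_b': 0.98}: most traffic shifts to variant_b.

# A brand-new variant enters with no history; its wide Beta(1, 1) prior means it
# wins many draws at first, so it automatically receives a generous share of
# traffic until its own data accumulates.
counts["variant_c"] = (0, 0)
print(thompson_weights(counts))
```

Re-running the calculation each day on the cumulative counts reproduces the daily weight adjustment described above, and removing a variant from the dictionary is all it takes to turn it off and redistribute its traffic.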
Depending on how visitors respond to the control or variants, if one version starts to take the lead, the traffic percentages will swing and more traffic will be weighted towards it.
If, after a period of time, visitor behavior changes and that version begins to underperform, the traffic percentage will adjust again.
If a certain version appears to be strongly underperforming, that variant can be turned off or stopped without any consequence. The system will simply re-calculate the weights for the remaining versions still running.
Eventually, if a test is left to run long enough, a clear winner should emerge.
While the multi-armed bandit testing methodology is very flexible and allows for variations to be added, eliminated, or re-weighted mid-experiment, just like any study, testers can get into trouble stopping a multi-armed bandit test too early.
Because there's no specific level of confidence or statistical significance to achieve, experimenters may be tempted to declare a winning result prematurely.
This outcome occurs when, for example, an experiment is stopped when only 85% of traffic has been allocated to the leading version.
Although 85% of traffic represents a strong majority of visitors, and gives a sound indication that the leading version should indeed be declared the ultimate winner, the algorithm is still hedging by sending 15% of traffic elsewhere.
As a result, a true winner should be declared, and implemented, only when 95% or more of visitors have been reallocated to the leading version.
Once this 95% traffic benchmark is reached, it's recommended the test run for an additional two days to ensure there are no further traffic fluctuations.
By following the 95% guideline, testers can be confident their results are sound, roughly comparable to a 95% level of confidence in classical hypothesis-based A/B testing.
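As a rough sketch of that guideline, the check below encodes the 95% threshold and the two-day hold described above; the helper function and its daily weight history are otherwise hypothetical.

```python
# Hypothetical helper reflecting the stopping guideline above: stop only once the
# same leading variant has held at least 95% of traffic for two consecutive days.
from typing import Dict, List

def ready_to_stop(daily_weights: List[Dict[str, float]],
                  threshold: float = 0.95,
                  hold_days: int = 2) -> bool:
    """daily_weights: one dict of variant -> traffic share per day, oldest first."""
    if len(daily_weights) < hold_days:
        return False
    recent = daily_weights[-hold_days:]
    leaders = {max(day, key=day.get) for day in recent}
    # The same variant must lead, at or above the threshold, on every recent day.
    return len(leaders) == 1 and all(max(day.values()) >= threshold for day in recent)

history = [
    {"control": 0.30, "variant_b": 0.70},
    {"control": 0.05, "variant_b": 0.95},
    {"control": 0.03, "variant_b": 0.97},
]
print(ready_to_stop(history))  # True: variant_b has held >= 95% for the last 2 days
```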
With this divide between A/B testing approaches, you may be caught in the middle, wondering which approach to use for your own studies.
While an equally allocated A/B test may uphold a strong level of statistical rigour, a multi-armed bandit test is highly likely to yield reasonably accurate results -- valid enough from which you can make sound business decisions.
These results, however, are more of an approximation of what is likely to work rather than absolute evidence of what has proven to work.
Therefore, you're trading some degree of statistical certainty for a quicker, more efficient experiment that's likely to yield reasonably accurate test results.
So, if you can't tolerate even an iota of extra uncertainty, don't go with a multi-armed bandit approach; stick with a classical hypothesis-based A/B testing approach instead.
Conversely, if you're okay with a reasonably high level of accuracy, and are willing to accept a small margin of error as a trade-off for faster, more efficient test results, go with a multi-armed bandit methodology.
2. There's really not any certainty anyway...
Keep in mind, even if you're absolutely staunchly committed to upholding the highest level of statistical rigour, testing itself is an approximation of what you think is likely to continue to occur based on evidence gathered through past data.
While previous data provides a good indication of what has likely happened and will probably continue to occur, you can't put your utmost faith in it.
First off, the numbers aren't always accurate and don't always tell the true story. For example, your Google Analytics data may not be tracking all visitors, or accurately capturing all users in our emerging cookie-less world.
Secondly, the act of testing itself is speculative and trying to forecast future outcomes based on current data is even more problematic.
Case in point: when acclaimed A/B testing expert David Mannheim posed the question, "Do you believe you can accurately attribute and forecast revenue to A/B testing?", a strong 58% of poll respondents answered, "no, A/B testing cannot accurately forecast revenue."
The rationale, as Mannheim explains in his article: experiment results are speculative in the first place, so how can you speculate upon a speculation and obtain complete accuracy?
You can't!
Testing only provides a reasonable level of certainty no matter how statistically significant and valid the test.
So going with a quicker, more efficient testing method makes sense. After all, testing is put in place to mitigate risk, and you can always conduct follow-up experiments to further mitigate risk and confirm the ongoing accuracy of your assumptions.
Running an experiment using the multi-armed bandit methodology seems to be a perfectly legitimate way to more quickly derive reasonably accurate test results.
It should be strongly considered as a viable alternative to a traditional hypothesis-based testing approach.
What do you think?
Is the multi-armed bandit testing approach sound and reliable enough to base important conversion decisions on?
Will you consider using the methodology for your future tests? Why or why not?
Share your thoughts in the Comments section below.