By: Deborah O'Malley, M.Sc. | Last updated November, 2022
If you've been in the A/B testing field for a while, you've probably heard the term Sample Ratio Mismatch (SRM) thrown around.
Maybe you've talked with people who tell you that, if you're not looking at it, you're not testing properly. And you need to correct for it.
But, in order to do so, you first need to know what SRM is and how to spot it.
This article outlines the ins and outs of SRM, describing what it is, why you need to look for it when testing, and how to correct for it if an SRM issue occurs.
The term Sample Ratio Mismatch (SRM) sounds really intimidating.
It's a big mumbo jumbo of words. With "sample," and "ratio," and what exactly is a "mismatch" anyway?
Well, let's break it all down, starting with the word sample.
In A/B testing, a sample applies to 2 separate but related concepts that impact test traffic:
1) The traffic allocation
2) The test sample
What's the difference?
Traffic allocation is the way traffic is split.
Visually, equally split traffic looks something like this:
If a test has more than one variant, for example in an A/B/C test, traffic can still be equally split if all versions receive approximately the same amount of traffic.
In a test with 3 variants, equally allocated traffic would be split 33/33/33.
Visually it would look something like this:
While traffic can be divided in other ways, say 70/30 or 40/30/30, for example, this unequal allocation is not considered best practice. As explained in this article, traffic should, ideally, always be equally allocated.
However, whether traffic has been equally or unequally allocated, SRM can still occur and should always be calculated.
Regardless of how traffic is allocated, if the sample of traffic is routed so one variant receives many more visitors than planned, the observed ratio doesn't match the intended ratio. And you have a Sample Ratio Mismatch (SRM) issue.
Visually, the test sample should be routed like this:
If it's not, an SRM issue occurs. One version has far more traffic routed to it than the other, and the ratio of traffic is off. Visually, an SRM issue looks like this:
According to Georgi Georgiev, of Analytics-Toolkit, in his article, Does Your A/B Test Pass the Sample Ratio Mismatch Check?
"Sample ratio mismatch (SRM) means that the observed traffic split does not match the expected traffic split. The observed ratio will very rarely match the expected ratio exactly."
And Ronny Kohavi, in his book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, adds:
If the ratio of users (or any randomization unit) between the variants is not close to the designed ratio, the experiment suffers from a Sample Ratio Mismatch (SRM).
Note that the SRM check is only valid for the randomization unit (e.g., users). The term "traffic" may mislead, as page views or sessions do not need to match the design.
In other words, very simply stated: SRM occurs when one test variant receives noticeably more users than expected.
If that happens, you've got an SRM issue. And that's a big problem: when SRM occurs, test results can't be fully trusted because the uneven distribution of traffic skews conversion numbers.
Take this real-life case study example:
An A/B test was run with traffic equally split, 50/50. The test ran for 4 weeks, achieving a large sample of 579,286 total visitors.
At the end of the experiment, it would have been expected that, based on this 50/50 split, each variation should have received roughly 289,643 visitors (579,286/2=289,643):
But, SRM ended up occurring and the variation received noticeably more users than the control:
The ratio of the traffic was clearly mismatched – even though it was supposed to be split evenly.
This finding was itself problematic.
But the real issue was that, because the sample was mismatched, the conversion rates were completely skewed.
The test aimed to increase conversions. The control achieved 4,735 conversions, but the variant slightly outperformed it with 5,323 conversions.
Conversion rate without and with SRM
Without SRM, the variant appears to be the winner:
But, with SRM, the ratio of traffic became mismatched and altered the conversion rates, so the control appeared to outperform the variant, making the test seem like a loser:
Which was the real winner or loser? With SRM, it became unknown.
SRM leaves your real results in the dark so you can't be certain if you truly have a test win or loss. The data is inaccurate.
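To make the skew concrete, here's a quick back-of-envelope sketch in Python using the case study's conversion counts. The per-variant visitor counts under SRM weren't given above, so the mismatched split below is a hypothetical one, chosen only to match the story (the variant receiving noticeably more traffic):

```python
# Conversion counts from the case study above
control_conv, variant_conv = 4_735, 5_323
total_visitors = 579_286

# Expected 50/50 split: 289,643 visitors per arm
expected_per_arm = total_visitors / 2

# Hypothetical SRM split (assumption): variant got noticeably more traffic
control_visitors, variant_visitors = 260_000, 319_286

# Without SRM, the variant wins on conversion rate...
print(f"control: {control_conv / expected_per_arm:.2%}")  # 1.63%
print(f"variant: {variant_conv / expected_per_arm:.2%}")  # 1.84%

# ...but under the mismatched split, the variant's inflated denominator
# deflates its rate, and the control appears to win instead
print(f"control: {control_conv / control_visitors:.2%}")  # 1.82%
print(f"variant: {variant_conv / variant_visitors:.2%}")  # 1.67%
```

Same conversion counts, opposite conclusion: that's why a mismatched sample makes the "winner" unknowable.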
So SRM is definitely something you want to avoid in your own testing practice.
However, by checking for SRM, and verifying your test doesn't have an SRM issue, you can be more confident results are trustworthy.
In fact, according to this insightful SRM research paper, "one of the most useful indicators of a variety of data quality issues is (calculating) Sample Ratio Mismatch."
Furthermore, according to Search Discovery,
"When you see a statistically significant difference between the observed and expected sample ratios, it indicates there is a fundamental issue in your data (and even Bayesian doesn't correct for that). This bias in the data causes it to be in violation of our statistical test's assumptions."
Simply put: if you're not looking out for SRM in your A/B tests, you might think your data is reliable or valid when it actually isn't.
So make sure to check for SRM in your tests!
Calculating SRM should be a standard part of your data confirmation process before you declare any test a winner.
The good news: detecting an SRM issue is actually pretty easy to do. In fact, some testing platforms, like Convert.com, will now automatically tell you if you have an SRM issue.
But, if the testing platform you're using doesn't automatically detect SRM, no worries. You can easily calculate it yourself, ideally before you've finished running your test.
If you're a stats guru, you can determine if there's an SRM issue with a Chi-Square Calculator for Goodness of Fit. However, if the name alone scares you off, don't worry.
These calculators can be used for both Bayesian and Frequentist methods and can be used whether your traffic is equally or unequally allocated.
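If you'd rather script the check than use a calculator, the math is only a few lines. Here's a minimal Python sketch for a two-variant test; with one degree of freedom, the chi-square p-value reduces to erfc(sqrt(stat / 2)), so only the standard library is needed. The function name and the lopsided visitor counts are illustrative:

```python
import math

def srm_check(observed, expected_ratio=(0.5, 0.5)):
    """Chi-square goodness-of-fit check for a two-variant split (df = 1).

    observed: visitor counts per variant, e.g., (control, variant)
    expected_ratio: the designed traffic split
    Returns (chi-square statistic, p-value).
    """
    total = sum(observed)
    stat = sum((obs - total * ratio) ** 2 / (total * ratio)
               for obs, ratio in zip(observed, expected_ratio))
    # With 1 degree of freedom, the chi-square survival function
    # simplifies to erfc(sqrt(stat / 2))
    return stat, math.erfc(math.sqrt(stat / 2))

# Perfectly balanced split: p-value of 1.0, no SRM
print(srm_check((289_643, 289_643)))

# Hypothetical lopsided split on the same total traffic
stat, p = srm_check((294_000, 285_286))
print(p < 0.01)  # True: the split is suspect
```

Flag SRM when p ≤ 0.01, the same threshold the calculators use; a stricter cutoff than the usual 0.05 keeps false alarms rare.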
If you don't have an SRM issue, you'll see a message that looks like this:
Assuming everything checks out and no SRM is detected, you're golden.
So long as the test has been run properly, your sample size is large enough to be adequately powered to determine a significant effect, and you've achieved statistical significance at a high level of confidence, you can declare your test a winner.
However, if you have an SRM issue, the calculator will alert you with a message that looks like this:
A p-value of ≤ 0.01 shows a significant result. The lower the p-value, the more likely an SRM issue has occurred.
If SRM is detected, that's a problem. It shows the ratio of traffic hasn't been directed to each variant as designed, and results might be skewed.
According to experimentation specialist, Iqbal Ali, in his article The essential guide to Sample Ratio Mismatch for your A/B tests, SRM is a common issue.
In fact, it happens in about 6-10% of all A/B tests run.
And, in redirect tests, where a portion of traffic is allocated to a new page, SRM can be even more prevalent.
So, if any test you've run has an SRM issue, you need to be aware of it, and you need to be able to mitigate the issue.
Yes, SRM can occur with samples of all sizes.
The Type I error rate is always 1% with a chi-square test and a p-value threshold of < 0.01. This means it doesn't matter if we check with 100 users or 100,000 users: our false-positive rate remains at about 1 out of 100 tests. (See the green line on the chart below).
Having said that, we need to be wary: with low volumes of traffic, we can see larger differences happen by random chance, WITHOUT them being flagged by our SRM check. (See the red line on the chart below).
It should be rare to see those extremes though.
It should be even rarer to see even larger outliers, where the SRM alert triggers (false positives). (See the yellow line on the chart below). I wanted to see this number as it's a good indication of volatility: the smaller the traffic volume, the larger the volatility.
At 10,000+ users assigned, the % difference between test groups before SRM triggers is <5%.
At 65,000+ users, this % difference drops to < 2%. So, the chi-square test becomes more sensitive as traffic grows.
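Those figures can be sanity-checked with a little algebra. For a 50/50 test where the group counts are n/2 + d/2 and n/2 - d/2, the chi-square statistic simplifies to d²/n, and the p < 0.01 critical value at one degree of freedom is about 6.635. A rough Python sketch (the exact percentages in the source chart may be computed slightly differently):

```python
import math

CHI2_CRIT_DF1 = 6.635  # chi-square critical value at p = 0.01, df = 1

def min_flagged_diff(n_total):
    """Smallest difference between two groups, as a fraction of one
    group's expected size, that a 50/50 SRM check flags at p < 0.01.
    For groups of n/2 + d/2 and n/2 - d/2, the statistic is d**2 / n."""
    d = math.sqrt(CHI2_CRIT_DF1 * n_total)
    return d / (n_total / 2)

print(f"{min_flagged_diff(10_000):.1%}")  # ~5.2%: smaller gaps go unflagged
print(f"{min_flagged_diff(65_000):.1%}")  # ~2.0%: the check tightens up
```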
So beware: no matter how big or small your traffic volume, SRM is a possibility. But the larger the sample, the more likely an SRM issue is to be accurately detected.
If your test shows an SRM issue, the first step is to figure out why it may have occurred.
SRM usually happens because of an issue with:
1) Variant assignment, e.g., a faulty randomizer or targeting rule
2) Test execution, e.g., a redirect that drops visitors from one variant
3) Data tracking, e.g., analytics tags firing inconsistently across variants
4) Data processing, e.g., bot filtering that removes more users from one variant
Or a combination of these factors.
When any test shows an SRM issue, the first step is to review the setup and confirm everything has been allocated and is tracking properly. You may find a simple (and often hard-to-detect) issue led to the error.
However, if you can't isolate any particular problems, it's worthwhile reviewing a larger collection of your tests to see if you can find trends across a broader swath of studies.
For example, for one client who had a lot of tests with SRM issues, I did a meta-analysis of all tests run within a 1-year period. Through the analysis, I noticed that an SRM issue occurred in every test with over 10,000 visitors per variant run on a particular testing platform.
While this finding was clearly problematic, it meant the SRM issue could be isolated to this variable.
After this issue was detected, the testing platform was notified of the issue; I've since seen updates that the platform is now working to fix this problem.
In his paper Raise the Bar on Shared A/B Tests: Make Them Trustworthy, data guru, Ronny Kohavi, describes that a trustworthy test must meet 5 criteria, including checking for SRM.
In this LinkedIn post, Ronny compares not checking for SRM to a car without seatbelts. Seatbelts save lives; SRM is a guardrail that saves you from incorrectly declaring untrustworthy results as trustworthy.
If your study shows an SRM issue, the remedy is simple: re-run the test.
Make sure you get similar results, but without the SRM issue.
As the GuessTheTest article explains, an A/A test is exactly as it sounds: a split test that pits two identical versions against each other.
To perform an A/A test, you show half of visitors version 1, and the other half version 2.
The trick here is both versions are exactly the same!
Doing so enables you to validate that you've set up and run the test properly and that the data coming back is clean, accurate, and reliable -- helping you ensure there isn't an SRM problem.
With an A/A test, you expect to see both variants receive roughly equal traffic with about the same conversion rate. Neither version should be a winner.
In an A/A test, when you don't get a winner, you should celebrate -- and know you're one step closer to being able to run a test without an SRM issue.
An important note: an A/A test won't fix an SRM issue, but it will tell you if you have one, bringing you one step closer to diagnosing the problem.
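If you want to see the mechanics end to end, you can also simulate an A/A assignment and run the SRM check on it; a healthy randomizer should almost never trip the alert. A small sketch, where a fair coin flip stands in for your platform's real bucketing logic:

```python
import math
import random

random.seed(7)  # fixed seed so the simulation is reproducible
n_users = 100_000

# Simulated A/A assignment: every user flips the same fair coin
# (the boolean indexes the list: False -> variant 0, True -> variant 1)
counts = [0, 0]
for _ in range(n_users):
    counts[random.random() < 0.5] += 1

# Chi-square goodness-of-fit check against the designed 50/50 split (df = 1)
expected = n_users / 2
stat = sum((c - expected) ** 2 / expected for c in counts)
p_value = math.erfc(math.sqrt(stat / 2))

print(counts)
print(p_value)  # with a fair coin, p <= 0.01 should occur only ~1% of the time
```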
To mitigate against undetected SRM issues, you might also want to consider building an SRM alerting system.
As experimentation mastermind Lukas Vermeer of Vista remarked in this LinkedIn post, his team created their own SRM alerting system in addition to the testing platform they're using. (Lukas has also created a free SRM Chrome extension checker, available here).
The progressive experimentation folks at Microsoft have done something similar:
If you're not willing to invest in any of these strategies because they're too time or resource intensive, fine.
You then need to willingly accept the tests you're declaring as winners aren't fully trustworthy. And you really shouldn't be making implementation decisions based on the results.
In that case, you need to choose your poison: either accept untrustworthy data, or put in the extra effort required to obtain accurate test results.
If your A/B tests have fallen victim to an SRM issue, the results aren't as valid and reliable as they should be. And that's a problem.
Because, if you're making data-driven decisions without accurate data, you're clearly not in a good position.
Testing for SRM is like putting on your seatbelt. It could save you from implementing a so-called "winning" test that actually amounts to a conversion crash.
So remember: always be testing. And, always be checking for SRM.
Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below.