Last updated September, 2022
Written by Deborah O'Malley & Timothy Chan
With special thanks to Ronny Kohavi for providing feedback on an earlier draft of this article. Ronny's Accelerating Innovation With A/B testing course provided much of the inspiration for this piece and is a must-take for every experimenter!
Awesome. You’ve just run an A/B test, crunched the numbers, and achieved a statistically significant result.
It’s time to celebrate. You’ve got another test win. 🙂
Or have you?
As this article explains, understanding and calculating statistical significance is actually quite complex.
To properly call a winning (or losing) test, you need to understand what a statistically significant result really means.
Otherwise, you’re left making incorrect conclusions, random decisions, or money-losing choices.
Many experimenters don’t truly know what statistical significance is or how to derive a statistically significant test result.
So, in plain English, this guide is here to set it all straight so you can declare and interpret a statistically significant A/B test with accuracy and ease.
Before we get too far in, it’s important to lay the groundwork so you’re clear on a few important points:
Because this article has been written with the express purpose of simplifying a complex topic, we’ve extracted only the most meaningful concepts to present to you.
We’re trying to avoid bogging you down with all the nitty, gritty details that often cause more confusion than clarity.
As such, this article does not offer an in-depth examination of every aspect of statistics. Instead, it covers only the top topics you need to know so you can confidently declare a statistically significant test.
Aspects like sample size, power, and its relationship to Minimum Detectable Effect (MDE) are separate topics not included in this article. While all tied to getting statistically significant results, these concepts need a whole lot more explanation outside of statistical significance.
If you're interested, you can read more about sample size here, and its relationship to power and MDE here.
Scientists use statistical significance to evaluate everything from differences in bunny rabbit populations to whether a certain dietary supplement lowers obesity rates.
That’s all great. And important.
But as experimenters interested in A/B testing, this article focuses on the most common evaluation criterion in online testing – and the one that usually matters most: conversion rates.
Although, in other scenarios there may be a more complicated Overall Evaluation Criterion (OEC), in this article, anytime we talk about a metric or result, we’re referring, specifically, to conversion rates. Nothing else.
So, you can drop bunny rabbits and diet pills out of your head. At least for now.
And while conversion rates are important, what's usually key is increasing them!
Therefore, in this article, we focus only on one-sided tests – which is a fancy stats way of saying, we’re looking to determine if the treatment has a higher conversion rate than the control.
A one-sided test, also known as a one-tailed test, measures conversions going only one way compared to the control. So either just up OR just down.
In contrast, a two-sided test measures conversion results up AND down, compared to the control.
One-sided vs. two-sided tests are a whole topic to explore in and of themselves.
In this article, we’ve focussed just on one-sided tests because they're used for detecting only a positive (or negative) result.
But, you should know, there’s no clear consensus on whether a one-sided or two-sided test is best to use in A/B testing. For example, this in-depth article states one-tailed tests are better. But, this one argues two-tailed tests are best.
As a general suggestion, if you care only whether the test is better than the control, a one-sided test will do. But if you want to detect whether the test is better or worse than the control, you should use a two-sided test.
It’s worth mentioning that there are other types of tests better suited for other situations. For example, non-inferiority tests check that a test is not any worse than the control.
But, this article assumes that, at the end of the day, what most experimenters really want to know is: did the test outperform the control?
Statistical significance helps us answer this question and lets us accurately identify a true conversion difference not just caused by random chance.
Now that we’ve got all these stipulations out of the way, we’re ready to have some fun learning about statistical significance for A/B testing.
Let’s dig in. . .
Alright, so what is statistical significance anyway?
Google the term. Go ahead, we dare you!
You’ll be bombarded with pages of definitions that may not make much sense.
Here’s one: according to Wikipedia, “in statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, a study's defined significance level, denoted by alpha, is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true.”
If you find this definition seems confusing, you’re not alone!
Here’s what’s important to know: a statistically significant finding is the standard, accepted way to declare a winning test. It provides evidence to suggest you’ve found a winner.
This evidence is important!
Because A/B testing itself is imperfect. With limited evidence, there’s always the chance you'll get random, unlucky results misleading you to wrong decisions.
Statistical significance helps us manage the risk.
To better understand this concept, it can be helpful to think of a coin flip analogy.
Imagine you have a regular coin with heads and tails. You make a bet with your friend to flip a coin 5 times. Each time it lands on heads, you win $20. But every time it lands on tails, you pay them $20.
Sounds like an alright deal. You can already see the dollar signs in your eyes.
But what happens when you flip the coin and 4 out of 5 tosses it lands on tails? Your friend is happy.
But you're left pretty upset. You might start to wonder if the coin is rigged or it’s just bad luck.
In this example, the unfavorable coin flips were just due to random chance. Or what we call, in fancy stats speak, statistical noise.
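To put a number on that bad luck, here's a minimal sketch that works out how often a perfectly fair coin lands on tails at least 4 times in 5 flips. The helper name prob_at_least is ours, made up for illustration.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more 'successes' in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of 4 or more tails in 5 flips of a fair coin
p_unlucky = prob_at_least(4, 5)
print(f"{p_unlucky:.4f}")  # 0.1875
```

So nearly 1 in 5 times, a fair coin will treat you this badly. No rigging required.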
While the results of this simple coin bet may not break the bank, in A/B testing, you don’t want to be at the whim of random chance because incorrectly calling A/B tests may have much bigger financial consequences.
Fortunately, you can turn to statistical significance to help you minimize the risk of poor decisions.
To fully understand, you’re going to need to wrap your head around a few fancy stats definitions and their relationship to each other. These terms include:
While these terms may not make much sense yet, don’t worry.
In the sections below, we’re going to clearly define them and show you how each concept ties into statistical significance so you can accurately declare a winning A/B test with absolute confidence.
Let’s start with the basics.
In A/B testing, there are two main statistical frameworks, Bayesian and Frequentist statistics.
Don’t worry. While these words sound big and fancy, they’re nothing to be intimidated by.
They simply describe the statistical approach taken to answer the question: does one A/B test version outperform another?
Bayesian statistics measures how likely, or probable, it is that one version performs better than another.
In a Bayesian A/B test, results are NOT measured by statistical significance.
This fact is important to realize because popular testing platforms like VWO's SmartStats mode and Google Optimize currently use the Bayesian framework.
So if you’re running a test on these platforms, know that, when determining if you have a winner, you’ll be evaluating the probability of the variant outperforming – not whether the result is statistically significant.
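To give a feel for what "the probability of the variant outperforming" means, here's a rough sketch. It is not how VWO or Google Optimize actually compute their numbers. It assumes a simple uniform prior on each conversion rate and estimates the probability by random sampling. The function name and the visitor and conversion counts are hypothetical.

```python
import random

def prob_treatment_beats_control(conv_c, n_c, conv_t, n_t, draws=20000, seed=42):
    """Estimate P(treatment rate > control rate) by sampling from Beta posteriors."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins = 0
    for _ in range(draws):
        # Beta(1 + conversions, 1 + non-conversions): posterior under a uniform prior
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        wins += rate_t > rate_c
    return wins / draws

# Hypothetical test: control 100/1000 conversions, treatment 130/1000
prob = prob_treatment_beats_control(100, 1000, 130, 1000)
print(f"Probability the treatment beats the control: {prob:.1%}")
```

Notice the output is a probability that one version is better. There's no p-value and no significance level anywhere in sight.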
Statistical significance, however, is entirely based on frequentist statistics.
And, as a result, frequentist statistics takes a comparatively different approach.
Using the frequentist method is more popular and involves fewer assumptions. But, the results are more challenging to understand.
Which is why this article is here for you.
In A/B testing, frequentist statistics asks the question, “is one version better than the other?”
To answer this question, and declare a statistically significant result, a whole bunch of checkpoints have to be met along the way, starting with the null hypothesis.
Say what? What the heck is a null hypothesis?
To better understand this term, let’s break it down into its two words, starting with hypothesis.
Hypothesis testing is the very basis of the scientific method.
Wikipedia defines hypothesis as a proposed explanation for a phenomenon. But, more simply stated, we can call it an educated guess that can be proven wrong.
In A/B testing, you make an educated guess about which variant you think will win and try to prove or disprove your belief through testing.
Through the hypothesis, it's assumed the current state of scientific knowledge is true. And, until proven otherwise, everything else is just speculation.
Which means, in A/B testing, you need to start with the assumption that the treatment is not better than the control. The real conversion difference is null or nothing at all; it's ≤ zero.
And you've gotta stick with this belief until there's enough evidence to reject it.
Without enough evidence, you fail to reject the null hypothesis, and, therefore, deem the treatment is not better than the control. (Note, you can only reject or fail to reject the null hypothesis; you can't accept it).
There’s not a large enough conversion rate difference to call a clear winner.
However, as experimenters, our goal is to decide whether we have sufficient evidence – known as a statistically significant result – to reject the null hypothesis.
With strong evidence, we can conclude, with high probability, that the treatment is indeed better than the control.
Our amazing test treatment has actually outperformed!
In this case, we can reject the null hypothesis and accept an alternative viewpoint.
This alternate view is aptly called the alternative hypothesis.
When this exciting outcome occurs, it’s deemed noteworthy and surprising.
To know whether it really has occurred, we need to calculate the probability that we made an error in rejecting, or failing to reject, the null hypothesis, such as calling a winner when there really isn’t one.
Because as mere mortals struggling to understand all this complex stats stuff, it’s certainly possible to make an incorrect call.
In fact, it happens so often, there are names for these kinds of mistakes. They’re called type I and type II errors.
A type I error occurs when the null hypothesis is rejected – even though it was correct and shouldn’t have been.
In A/B testing, it occurs when the treatment is incorrectly declared a winner (or loser) but it’s not – there's actually no real conversion difference between versions; you were misled by statistical noise, an outcome of random chance.
This error can be thought of as a false positive since you’re claiming a winner that’s not there.
While there’s always a chance of getting misleading results, sound statistics can help us manage this risk.
Because calling a test a winner – when it’s really not – is dangerous.
It can send you chasing false realities, push you in the wrong direction and leave you doubling down on something you thought worked but really didn’t.
In the end, a type I error can drastically drag down revenue or conversions. So, it’s definitely something you want to avoid.
A type II error is the opposite.
It occurs when you incorrectly fail to reject the null hypothesis and instead reject the alternative hypothesis by declaring that there’s no conversion difference between versions – even though there actually is.
This kind of error is also known as a false negative. Again, a mistake you want to avoid.
Because, obviously, you want to be correctly calling a winner when it’s right there in front of you. Otherwise, you’re leaving money on the table.
If it’s hard to keep these errors all straight in your head, fear not.
Here’s a diagram summarizing this concept:
Making errors sucks. As an experimenter, your goal is to try to minimize them as much as possible.
Unfortunately, given a fixed number of users, reducing type I errors will end up increasing the chances of a type II error. And vice versa. So it’s a real tradeoff.
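Here's a small sketch of that tradeoff. It assumes a one-sided test on two conversion rates and uses the standard normal approximation. The conversion rates (10% vs. 12%), the 5,000 users per group, and the function name approx_power are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_treatment, n_per_group, alpha):
    """Approximate power of a one-sided two-proportion z-test (normal approximation)."""
    z = NormalDist()
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treatment * (1 - p_treatment) / n_per_group)
    z_crit = z.inv_cdf(1 - alpha)  # stricter alpha -> higher bar to clear
    return z.cdf((p_treatment - p_control) / se - z_crit)

# Same users, same true lift; only the type I risk appetite changes
for alpha in (0.10, 0.05, 0.025, 0.01):
    power = approx_power(0.10, 0.12, 5000, alpha)
    print(f"alpha={alpha:<5}  type I risk={alpha:.1%}  type II risk={1 - power:.1%}")
```

With the number of users held fixed, each step down in type I risk pushes the type II risk up.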
But there is, again, a silver lining in the clouds. Because in statistics, you get to set your risk appetite for type I and II errors.
Through two stats safeguards known as alpha (α) and beta (β).
To keep your head from spinning, we’re only gonna focus on alpha (α) in this article. Because it’s most critical to accurately declaring a statistically significant result.
While Beta (β) is key to setting the power of an experiment, we’ll assume the experiment is properly powered with beta ≤ 0.2, or power of ≥ 80%. Cause, if we delve into beta's relationship with significance, you may end up with a headache. So, we'll skip it for now.
With alpha (α) top of mind, you may be wondering what function it serves.
Well, significance level alpha (α) helps us set our risk tolerance of a type I error.
Remember, a type I error, also known as a false positive, occurs when we incorrectly reject the null hypothesis, claiming the test treatment converts better than the control. But, in actuality, it doesn’t.
Since type I errors are bad news, we want to mitigate the risk of this mistake.
We do so by setting a cut-off point at which we’re willing to accept the possibility of a type I error.
This cut-off point is known as significance level alpha (α). But most people often just call it significance or refer to it as alpha (denoted α).
Experimenters can choose to set α wherever they want.
The closer it is to 0, the lower the probability of a type I error. But, as mentioned, a low α is a trade-off. Because, in turn, the higher the probability of a type II error.
So, it’s a best practice to set α at a happy middle ground.
A commonly accepted level used in online testing is 0.05 (α = 0.05) for a two-tailed test, and 0.025 (α = 0.025) for a one-tailed test.
For a two-tailed test, this level means we accept a 5% (0.05) chance of a type I error, or of incorrectly rejecting the null hypothesis by calling a winner when there isn’t one.
In turn, if the null hypothesis is correct (the test treatment is indeed no better than the control), there’s a 95% probability we’ll correctly fail to reject it. Whether we can detect a real effect when one exists is a separate question that depends on the experiment being adequately powered, with a large enough sample size to detect a meaningful effect in the first place.
That’s it. That’s all α tells us: the probability of making a type I error, assuming the null hypothesis is correct.
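One way to see this for yourself is to simulate lots of A/A tests, where both groups share the same true conversion rate, so the null hypothesis is true by construction. The sketch below assumes a one-sided z-test with α = 0.05. The 10% conversion rate, group size, and function name are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(true_rate=0.10, n=500, alpha=0.05, runs=1000, seed=7):
    """Share of simulated A/A tests that wrongly declare the treatment a winner."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided cut-off implied by alpha
    false_positives = 0
    for _ in range(runs):
        # Both groups share the SAME true conversion rate, so the null is true
        conv_c = sum(rng.random() < true_rate for _ in range(n))
        conv_t = sum(rng.random() < true_rate for _ in range(n))
        pooled = (conv_c + conv_t) / (2 * n)
        se = sqrt(pooled * (1 - pooled) * 2 / n)
        if se == 0:
            continue  # no variation at all in either group: nothing to compare
        z = (conv_t / n - conv_c / n) / se
        if z >= z_crit:
            false_positives += 1              # a "winner" that isn't real
    return false_positives / runs

rate = false_positive_rate()
print(f"False positive rate: {rate:.1%}")
```

The share of false winners should land close to α, give or take some simulation noise. Set alpha to 0.025 to mimic the stricter one-tailed level mentioned above.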
That said, it’s important to clear a couple misconceptions:
At α = 0.05, the chance of making a type I error is not 5%; it’s only 5% if the null hypothesis is correct. Take note because a lot of experimenters miss the last part – and incorrectly believe it means there's a 5% chance of making an error, or wrong decision.
Also, a 5% significance level does not mean there’s a 5% chance of finding a winner. That’s another misconception.
Some experimenters extrapolate even further and think this result means there's a 95% chance of making a correct decision. Again, this interpretation is not correct. Without introducing subjective bias, it’s very difficult to know the probability of making a right or wrong decision. Which is why we don’t go down this route in stats.
So, if α is the probability of making a type I error, assuming the null hypothesis is correct, how do we possibly know if the null hypothesis is accurate?
To determine this probability, we rely on alpha’s close sibling, p-value.
According to Wikipedia, p-value is the probability of obtaining a result at least as extreme as the one observed in your data, assuming the null hypothesis is true.
But, if that definition doesn’t help much, don’t worry.
Here’s another way to think of it:
P-value tells us how likely it is that the outcome, or a more extreme result occurred, if the null hypothesis is true.
Remember, the null hypothesis assumes the test group’s conversion rate is no better than the control group’s conversion rate.
So we wanna know the likelihood of results if the test variant is truly no better than the control.
Because, just like a coin toss, all outcomes are possible. But some are less probable.
P-value is how we measure this probability.
In fact, some fun cocktail trivia knowledge for you, the “p” in p-value stands for probability!
And as you can probably imagine, the value part of p-value is because we express the probability using a specific value. This value is directly tied into its sibling, alpha (α). Remember, we can set α at anything we want. But, for two-tailed tests, the cut-off is generally set at α = 0.05.
Well, guess what?
When the p-value is less than or equal to α (p ≤ α), it means the chance of getting the result is really low – assuming the null hypothesis is true. And, if it’s really low, well then, we have strong evidence that the null hypothesis is incorrect.
So we can reject the null hypothesis!
In rejecting the null hypothesis, we accept the less likely alternative: that the test variant is truly better than the control; our results are not just due to random chance or error.
Hallelujah! We’ve found a winner. 🥳
An unusual and surprising outcome, the result is considered significant and noteworthy.
Therefore, a p-value of ≤0.05 means the result is statistically significant.
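For the curious, here's a minimal sketch of how a one-sided p-value could be calculated for conversion rates. It assumes a pooled two-proportion z-test, which is one common choice and not the only one. The function name and the visitor and conversion numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(visitors_c, conv_c, visitors_t, conv_t):
    """P(seeing a lift at least this large | treatment is truly no better)."""
    rate_c = conv_c / visitors_c
    rate_t = conv_t / visitors_t
    # Under the null, both groups share one conversion rate, so pool the data
    pooled = (conv_c + conv_t) / (visitors_c + visitors_t)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_c + 1 / visitors_t))
    z = (rate_t - rate_c) / se
    return 1 - NormalDist().cdf(z)  # upper tail only: "treatment is better"

alpha = 0.05  # the threshold used in the text; 0.025 is the stricter one-tailed level
p = one_sided_p_value(1000, 100, 1000, 130)
print(f"p-value = {p:.4f}")
print("statistically significant" if p <= alpha else "not statistically significant")
```

In this made-up example, a 10% vs. 13% conversion rate with 1,000 visitors per group gives a p-value of roughly 0.018, which clears both the 0.05 and the 0.025 bar.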
However, while a p-value of ≤0.05 is, typically, accepted as significant, many data purists will balk at this threshold.
They’ll tell you a significant test should have a p-value ≤0.01. And some data scientists even argue it should be lower.
However low you go, it seems everyone can agree: the closer the p-value to 0, the stronger the evidence the conversion lift is real – not just the outcome of random chance or error.
Which, in turn, means you have stronger evidence you’ve actually found a winner.
Or written visually: ↓ p-value, ↑ significant the finding.
And that’s really it!
A very simplified, basic, incredibly stripped down way to tell you the essentials of what you absolutely need to know to declare a statistically significant A/B test.
Once it’s made clear, it’s really not that hard. Is it?
Of course, there’s SO MUCH we’ve left out. And much, much more to explain to properly do the topic justice.
So consider this article your primer. There’s plenty more to learn. . .
But, now that you more clearly understand the basics, you probably get how the null hypothesis ties into a type I error, based on significance level α, expressed as a p-value.
So it should be clear:
When calling a result “statistically significant,” what you’re really saying in stats-speak is: the test showed a statistically significantly better conversion rate than the control. You know this outcome occurred because the p-value is less than the significance level (α) which means the test group’s higher conversion rate was unlikely under the null hypothesis.
In other words, the test variant does, in fact, seem to be better than the control.
You’ve found a winner! 🙂
Although most standard A/B testing platforms may declare a statistically significant winning test for you, not all get it right.
So it’s strongly advocated that you verify test results yourself.
To do so, you should use a test validation calculator, like this one from AB Testguide.
Simply plug in your visitor and conversion numbers, declare if your hypothesis was one- or two-sided, and input your desired level of confidence.
If the result is statistically significant, and you can confidently reject the null hypothesis, you’ll get a notice with a nicely visualized chart. It will look something like this:
If the result is not significant, you’ll see a similar notification informing you a winning test was not observed. It will look like this:
However, an important word of caution: just because your test result is statistically significant doesn’t always mean it’s truly trustworthy.
There are lots of ways to derive a statistically significant result by “cheating.” But that’s a whole topic for another conversation. . .
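If you'd rather double-check the arithmetic yourself, here's a rough sketch that takes the same kinds of inputs described above: visitors, conversions, one- or two-sided, and a confidence level. It is not the AB Testguide calculator's actual code. It assumes a pooled two-proportion z-test, and the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def check_significance(visitors_c, conv_c, visitors_t, conv_t,
                       two_sided=False, confidence=0.95):
    """Return the z-score, p-value, alpha, and verdict for an A/B test."""
    pooled = (conv_c + conv_t) / (visitors_c + visitors_t)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_c + 1 / visitors_t))
    z = (conv_t / visitors_t - conv_c / visitors_c) / se
    upper_tail = 1 - NormalDist().cdf(z)
    if two_sided:
        # Count surprises in both directions: better OR worse than the control
        p_value = 2 * min(upper_tail, 1 - upper_tail)
    else:
        p_value = upper_tail
    alpha = 1 - confidence
    return {"z": z, "p_value": p_value, "alpha": alpha,
            "significant": p_value <= alpha}

one_sided = check_significance(1000, 100, 1000, 130)
strict = check_significance(1000, 100, 1000, 130, two_sided=True, confidence=0.99)
print(one_sided)
print(strict)
```

The same data can pass a one-sided check at 95% confidence and fail a two-sided check at 99%, which is why it matters to pick the test type and confidence level before you look at the results.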
In essence, evaluating statistical significance in A/B testing asks the question: is the treatment actually better than the control?
To determine the answer, we assume the null hypothesis: that the treatment is no better than the control. So there is no winner – until proven otherwise.
We then use significance level alpha (α) to safeguard against the probability of committing a Type I error and to set a reasonable bar for the proof we need.
P-value works with α as a truth barometer. It answers the question: if the null hypothesis is true, how likely is it the outcome will occur?
The answer can be stated with a specific value.
A p-value lower than or equal to α, usually set at 0.05, means the outcome is unlikely to be due to random chance alone. The result is surprising and, therefore, significant.
The lower the p-value, the more significant the finding.
A statistically significant result shows us: the result is unlikely due to just random chance or error under the null hypothesis. There’s strong evidence to reject the null hypothesis and declare a winner.
So we can take our winning result and implement it with confidence, knowing that it will likely lift conversions on our website.
Without a statistically significant result, we’re left in the dark.
We have no way to know whether we’ve actually, accurately called a winner, made an error, or achieved our results just due to random chance.
That’s why understanding what statistical significance is and how to apply it is so important. We don’t want to implement supposed winners without strong evidence and rigour.
Especially when money, and possibly our jobs, are on the line.
Currently, statistical significance is the standard, accepted way to declare a winner in frequentist A/B testing. Whether it should be or not is a whole other topic. One that’s up for serious debate.
But for now, statistical significance is what we use. So apply it. Properly.
Congrats! You’ve just made it through this article, and have hopefully learned some things along the way.
Now, you deserve a nice break.
Grab yourself a bubble tea, kick your feet up, and breathe out a sigh of relief.
The next time you go to calculate a winning test, you’ll know exactly what to look for and how to declare a statistically significant result. If you’ve actually got one. 😉
Hope this article has been helpful for you. Please share widely and post your questions or comments in the section below.
Deborah O’Malley is one of the few people who’s earned a Master’s of Science (M.Sc.) degree with a specialization in eye tracking technology. She thought she was smart when she managed to avoid taking a single math class during her undergrad. But confronted reality when required to take a master’s-level stats course. Through a lot of hard work, she managed to earn an “A” in the class and has forever since been trying to wrap her head around stats, especially in A/B testing. She figures if she can learn and understand it, anyone can! And has written this guide, in plain English, to help you quickly get concepts that took her years to grasp.
Timothy Chan is one of the select few who hold a Ph.D. and an MBA. Well-educated and well-spoken, he’s a former Data Scientist at Facebook. Today, he’s the Lead Data Scientist at Statsig, a data-driven experimentation platform. With this experience, there truly isn’t anyone better to explain statistical significance in A/B testing.
Ronny Kohavi is one of the world’s foremost experts on statistics in A/B testing. In fact, he’s, literally, written the book on it! With a Ph.D. from Stanford University, Ronny's educational background is as impressive as his career experience. A former vice president and technical fellow at Microsoft and Airbnb, and director of data mining and personalization at Amazon, Ronny now provides data science consulting to some of the world’s biggest brands. When not working directly with clients, he’s teaching others how to accelerate innovation in A/B testing. A course absolutely every experimenter needs to take!