
Finally! Statistical significance clearly explained

By: Deborah O'Malley, M.Sc. & Timothy Chan, Ph.D. | Last updated May, 2022


Overview

Awesome! You’ve just run an A/B test, crunched the numbers, and achieved a statistically significant result.

It’s time to celebrate. You’ve got another test win. 🙂

But what does a statistically significant result really mean?

As this article explains, understanding and calculating statistical significance is actually quite complex.

Many experimenters don’t truly know what statistical significance is or how to derive a statistically significant test result.

Statistical significance is the standard, accepted way to declare a winning test. It helps us identify real effects from noise. 

So to properly call a winning (or losing) test, it’s really important to understand what a statistically significant result is and means. 

Otherwise, you risk making incorrect conclusions, random decisions, and money-losing choices.

In plain English, this guide is here to set it all straight so you can accurately declare a statistically significant A/B test with ease and confidence. 

Let’s dig in. . .


Statistical significance defined

In A/B testing, statistical significance is defined as a result that provides evidence there is an actual difference between variants, and that measured differences are unlikely due to random chance.

Say what?!?

If this definition seems confusing, hang tight. We’re going to decode it. . .

But to fully understand, you’re going to need to wrap your head around a few fancy stats terms and their relationship to each other. These terms include:

  • Null hypothesis and alternative hypothesis
  • Type I and type II errors
  • Statistical power and beta (β)
  • Significance level alpha (α)
  • P-value
  • Confidence level and confidence interval

While these terms may not make much sense yet, don’t worry.

In the sections below, we’re going to clearly define them and show you how each concept ties into statistical significance so you can accurately declare a winning A/B test.


Statistical methods in A/B testing

But first, we need to start with the basics.

In A/B testing, there are two main statistical frameworks: Bayesian and Frequentist statistics.

While these words sound big and fancy, they’re nothing to be intimidated by. 

They simply describe the statistical approach taken to answer the question: does one A/B test version outperform another?

Bayesian statistics 

Bayesian statistics measures how likely, or probable, it is that one version performs better than another.

In a Bayesian A/B test, results are NOT measured by statistical significance. 

This fact is important to realize because popular testing platforms like VWO’s SmartStats mode and Google Optimize currently use the Bayesian framework.

So if you’re running a test on these platforms, you shouldn’t claim your test is statistically significant. That’s not accurate.

Instead, you should state the probability at which the variant may beat the control.

Frequentist statistics

Frequentist statistics takes an entirely different approach.

In A/B testing, the Frequentist method is more popular and involves fewer assumptions. But the results are more challenging to interpret, which is why this article is here for you.

Frequentist statistics asks, “Is one variant better than the other?” and attempts to answer this question by identifying how unlikely or unusual the results are under a specific assumption.

This assumption is called the null hypothesis.


Null hypothesis

Say what? What the heck is a null hypothesis?

To better understand this term, let’s break it down across the two words, starting with hypothesis.

Hypothesis: most of us have a vague sense of what a hypothesis is. Simply stated, we can call it an educated guess that can be proven wrong.

In A/B testing, you make an assumption about which variant you think will win and try to validate your belief through testing.

Hypothesis testing is the very basis of the scientific method. Under this approach, the current state of knowledge is assumed to be true until the evidence says otherwise.

In A/B testing, this assumption means it’s initially believed there’s no conversion difference between test versions. Until proven otherwise, the control and variant(s) are thought to perform equally. 

And, here’s where the null, in null hypothesis, comes into play.

The null hypothesis, denoted H0, assumes the difference between test variants is null or amounts to nothing at all.

The null hypothesis and Frequentist testing take a guilty-until-proven-innocent approach.

As experimenters using this approach, we need to start by assuming our tests won’t win; when they don’t, it’s considered normal.

But, sometimes, a result is unusual or surprising enough that we suspect the null hypothesis is wrong. 

This event occurs when we find a statistically significant result. We, then, reject the null hypothesis and accept the alternative viewpoint.


Alternative hypothesis

This other viewpoint is known as the alternative hypothesis (denoted Ha or H1). 

It’s the alternative to the null hypothesis.

As an experimenter, your goal is actually to reject the null hypothesis and accept the alternative hypothesis.

In doing so, you show that there is, in fact, a real detectable effect, or conversion difference, between variants. A difference that’s not due to random chance.
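
To make this concrete, here’s how the two competing hypotheses can be written out for a conversion-rate test (a sketch using hypothetical symbols, with p_control and p_variant standing for the true conversion rates of each version):

H0 (null hypothesis): p_variant = p_control, meaning both versions truly convert at the same rate.
Ha (alternative hypothesis): p_variant ≠ p_control, meaning there is a real conversion difference.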

Your amazing test variant has actually outperformed!

When this exciting outcome occurs, it’s deemed noteworthy and surprising.

How surprising?

To determine that answer, we need to calculate the probability of whether we made an error in calling a winner when there really wasn’t one.

Because we are just human after all. 

And, as mere mortals struggling to understand all this complex stats stuff, it’s certainly possible to make an incorrect call.

In fact, it happens so often, there are names for these kinds of mistakes. They’re called type I and type II errors.


Errors in testing

Type I error, false positive

A type I error occurs when the null hypothesis is rejected – even though it was correct and shouldn’t have been.

In A/B testing, this error happens when you incorrectly claim there's a conversion lift, but, in actuality, there's not. What you've really detected is just statistical noise -- an outcome of random chance.

This error can be thought of as a false positive since you’re claiming a winner that’s not really there.

Oops. You don’t want to do that!

Calling a test a winner when it’s not may drag down revenue or conversions.

Type II error, false negative

A type II error is the opposite. It occurs when you incorrectly declare that there’s no conversion difference between versions – even though there actually is.

This kind of error is also known as a false negative. Again, a mistake you want to avoid. 

Because, obviously, you want to be correctly calling a winner when it’s right there in front of you. Otherwise, you’re leaving money on the table.

If it’s hard to keep these errors all straight in your head, fear not. Here’s a diagram summarizing this concept:


Statistical noise & errors in testing 

Making errors sucks. And, as an experimenter, your goal is to minimize errors as much as possible.

The thing is, A/B testing itself is imperfect, and statistical noise can lead you to make wrong decisions. There’s no way around that.

To better understand this concept of statistical noise, it can be helpful to think of a coin flip analogy.

Imagine you have a regular coin with heads and tails. You make a bet with your friend and flip a coin 5 times. For each head, you win $20.  For each tail, you pay them $20.

Sweet! You're already seeing dollar bills in your eyes.

But what happens when you flip the coin 5 times and 4 of the 5 tosses come up tails? You might start to wonder whether it’s just bad luck or if the coin is actually rigged.

In this example, the unfavorable coin flips were just due to random chance. Or what we call, in fancy stats speak, statistical noise.

This kind of streak is unavoidable. And, as the quick calculation below shows, it’s more common than you might think.
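
Curious how likely that streak actually is? Here’s a minimal sketch, assuming Python with SciPy installed, that works out the binomial probabilities for the fair-coin bet above:

```python
# A quick check of how often a fair coin produces the streak from the bet above.
from scipy.stats import binom

n_flips = 5      # total coin flips in the bet
p_tails = 0.5    # a fair coin lands tails 50% of the time

# Probability of exactly 4 tails in 5 flips
p_exactly_4 = binom.pmf(4, n_flips, p_tails)

# Probability of 4 or more tails (4 or 5) in 5 flips
p_4_or_more = binom.pmf(4, n_flips, p_tails) + binom.pmf(5, n_flips, p_tails)

print(f"P(exactly 4 tails) = {p_exactly_4:.3f}")   # about 0.156
print(f"P(4 or more tails) = {p_4_or_more:.3f}")   # about 0.188
```

In other words, a perfectly fair coin will hand you a streak like that in nearly 1 bet out of 5. That’s statistical noise at work.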

While the results of this simple coin toss bet may not break the bank, incorrectly calling A/B test winners or losers may have much bigger financial consequences.

You don’t want to be at the whim of random chance. 

Fortunately, you can turn to statistics to help you minimize poor decisions.

That’s the good news.

The bad news?

Unfortunately, when you turn to statistics and take steps to minimize noise and type I errors, you end up increasing the chances of a type II error. And vice versa. So it’s a real tradeoff.

But there is a silver lining in the cloud. Because in statistics, you get to set your risk appetite for type I and II errors.

How? 

Through statistical cousins, Alpha (α) and Beta (β).


Statistical Power Beta (β)

Let’s start with beta (β). Beta assesses the likelihood of committing a type II error.

Remember, a type II error, also known as a false negative, happens when you incorrectly claim there’s no difference between variants when, in fact, there is.

You can safeguard against making this mistake by setting the risk level at which you’re willing to accept a type II error.

This risk level can be calculated through beta’s twin brother, power.

Power

Power assesses how well you’ve managed to avoid a type II error and correctly call a winner (or loser) when there is, in fact, a real effect, or conversion difference between variants.

Like good twin brothers, power and beta (β) work together in unison.

In fact, power is calculated as 1 - β. The result is, typically, expressed as a decimal value. A power of 0.80 is the standard (1 - β = 0.80). 

This amount means there's at least an 80% likelihood of successfully detecting a meaningful effect, or conversion difference.

To fully grasp this concept, though, we should probably go back to grade school math where we learned that, to convert a decimal to a percentage, we simply multiply it by 100, or move the decimal two places to the right.

So, in that case, a power of 0.80 * 100 = 80%.

As such, there's only a 20% (0.20) chance of missing the effect and ending up with a false negative. A risk we’re willing to take.

The higher the power, the higher the likelihood of accurately detecting a real effect, or conversion difference.

Although increasing power sounds like a great idea, it can be tricky, especially for lower traffic sites. The higher the power, the larger the sample size needs to be. And, often, the longer the test needs to run.
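
To see how power, significance level, and the size of the lift you hope to detect drive sample size, here’s a minimal sketch, assuming Python with statsmodels installed and purely illustrative conversion rates of 5% for the control and 6% for the variant:

```python
# Estimate the visitors needed per variant for a two-proportion A/B test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05   # hypothetical control conversion rate
variant_rate = 0.06    # hypothetical lift we want to be able to detect

# Convert the two rates into a standardized effect size (Cohen's h)
effect_size = proportion_effectsize(variant_rate, baseline_rate)

analysis = NormalIndPower()
n_per_variant = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,                 # significance level: accepted type I error risk
    power=0.80,                 # 1 - beta: accepted type II error risk of 20%
    alternative="two-sided",
)
print(f"Visitors needed per variant at 80% power: {n_per_variant:,.0f}")

# Raising power to 90% demands a noticeably larger sample
n_more_power = analysis.solve_power(
    effect_size=effect_size, alpha=0.05, power=0.90, alternative="two-sided"
)
print(f"Visitors needed per variant at 90% power: {n_more_power:,.0f}")
```

Run with these made-up numbers, the required sample grows as you push power higher, which is exactly the traffic-versus-certainty tradeoff described above.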


Significance level alpha (α)

While power helps us measure the risk of a type II error, its close cousin, alpha (α), or significance, as it’s commonly called, helps us measure the risk of a type I error.

Remember, a type I error, also known as a false positive, occurs when we think there’s a difference between versions. But, in actuality, there isn’t. The variants don’t convert better than the control.

To have a statistically significant test, we need to confirm there’s a low probability we’ve made a type I error.

So, here, again, we set a risk level, or cut-off, at which we’re willing to accept this chance of error.

This cut-off point is known as significance level, also just known as alpha (α).

Experimenters can choose to set their risk level wherever they want. But a commonly accepted level used in online testing is 0.05.

Or visually written, α = 0.05.

This amount means we accept a 5% (0.05) chance of a type I error in claiming a conversion difference when there’s actually no real difference between variants. 

The closer α is to 0, the lower the probability of a type I error. And vice versa.

However, we’ll almost never be absolutely certain there’s no mistake. So we can almost never set the significance level at 0.

At a significance level of 0.05, if the null hypothesis is true, there’s a 95% probability the test will correctly avoid declaring a false winner.
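
To see what α = 0.05 means in practice, here’s a minimal simulation sketch, assuming Python with NumPy and statsmodels installed. It runs many hypothetical A/A tests, where both “variants” share the exact same true conversion rate, and counts how often a “winner” gets flagged anyway:

```python
# Simulate A/A tests: any "significant" result here is, by construction, a type I error.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(seed=42)
true_rate = 0.05             # identical conversion rate for both arms
visitors_per_arm = 10_000    # illustrative traffic per arm
n_simulations = 2_000
alpha = 0.05

false_positives = 0
for _ in range(n_simulations):
    conversions_a = rng.binomial(visitors_per_arm, true_rate)
    conversions_b = rng.binomial(visitors_per_arm, true_rate)
    _, p_value = proportions_ztest([conversions_a, conversions_b],
                                   [visitors_per_arm, visitors_per_arm])
    if p_value <= alpha:
        false_positives += 1

# Expect something close to 5%: the type I error rate alpha allows for.
print(f"False positive rate: {false_positives / n_simulations:.1%}")
```

With no real difference to find, roughly 5% of these simulated tests still come out “significant.” That is the risk level α lets you accept.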

But how do we actually calculate this probability? 

The answer is through family ties. 

Just like beta’s twin brother is power, cousin alpha also has a sibling. Its name is p-value.


P-value explained

While significance level (α) acts as a safety check to ensure we haven’t made a type I error, by incorrectly rejecting the null hypothesis, p-value serves as an even more scrupulous guardrail.

Officially defined, p-value is the probability of obtaining a result at least as extreme as the one observed in your data, assuming the null hypothesis is true.

But, if that definition doesn’t help much, don’t worry.

Here’s another way to think of it:

P-value tells us how likely, or probable, it is that we’d see an outcome like ours, if the null hypothesis is true.

Remember, the null hypothesis assumes there is no difference in conversion rates between variants; and, we start off assuming this viewpoint is correct – until proven otherwise.

The probability of the outcome occurring, assuming the null hypothesis is true, is expressed as a value. In fact, here’s some fun cocktail trivia: the “p” in p-value stands for probability!

And as you can probably imagine, the value part of p-value is because we express this probability as a specific value.

This value is directly tied into its sibling, alpha (α).

Remember, we can set α (the probability of making a type I error) at anything we want. But, generally, the cut-off is 0.05 (α = 0.05).

Well, guess what?

When the p-value is less than or equal to α (p ≤ α), it means the result is fairly unlikely under the null hypothesis. In fact, it’s unlikely enough that we believe the null hypothesis is incorrect.

As such, we make a bold move and reject the null hypothesis.

In rejecting the null hypothesis, we accept the alternative: that there is truly a conversion difference between variants, and the difference we found is not just due to random chance or error. 

Hallelujah! We’ve found a winner.

An unusual and surprising outcome, the result is considered significant.

Therefore, a p-value of ≤0.05 means the result is deemed statistically significant.

However, while a p-value of ≤0.05 is, typically, accepted, many data purists will balk at this threshold.

They’ll tell you a significant test should have a p-value ≤0.01. And some data scientists even argue it should be lower.

However low you go, it seems everyone can agree: the closer the p-value to 0, the stronger the evidence the conversion lift is real – not just the outcome of random chance or error. 

Which, in turn, means there’s a higher probability you’ve actually found a winner. 

Or written another way: ↓ p-value, ↑ significant the finding.
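
Here’s what that looks like in practice: a minimal sketch, assuming Python with statsmodels installed and made-up numbers of 10,000 visitors per version, with 500 control conversions (5.0%) against 580 variant conversions (5.8%):

```python
# Compute the p-value for a hypothetical two-proportion A/B test.
from statsmodels.stats.proportion import proportions_ztest

conversions = [580, 500]      # variant, control (illustrative numbers)
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)

alpha = 0.05
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
if p_value <= alpha:
    print("Reject the null hypothesis: the lift is statistically significant.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")
```

With these made-up numbers, the p-value lands below 0.05, so the null hypothesis gets rejected; shrink the lift or the traffic and it quickly stops being significant.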

Now, what you've just taken in is a lot! If you need a little help putting all these concepts together, here's a "family tree" that summarizes all these relationships for you:


Inverse relationship of p-value and statistical significance

As you've probably gathered, p-value and statistical significance have an inverse relationship.

The lower the p-value, the more likely the effect, or results, are real and you’ve actually found a winner. 

Or written another way: ↓ p-value, ↑ significant the finding.

Conversely, the higher the p-value, the lower the likelihood there’s really a difference between versions – and therefore, the less likely the result is significant.

Or visually written: ↑ p-value, ↓ significant the result.

But even if you achieve a p-value of ≤0.05, how can you really be sure that you truly found a difference between variants and can claim a winner?


Confidence Level defined

To answer this question, you can lean on confidence level.

With a p-value of ≤0.05, you can be pretty confident you’ve really found a meaningful difference between versions.

But. . . notice that nuance? We said that you can be pretty confident.

When money’s on the line, being reasonably sure isn’t good enough. You want to be quite certain!

And that’s why we also use the guardrail of confidence level as the final factor in achieving statistical significance.

Confidence level indicates the degree of certainty that you’ve actually found a winner and you’re not just making a mistake, or committing a type I error.

Fittingly, it shows how confident you are in your results.

Calculating confidence

Confidence level is calculated through the equation 1 - α. 

If that formula looks scary to you, don’t fret. 

Remember, α is the significance level, and a result is significant when the p-value is ≤0.05.

So, when the p-value = 0.05, the confidence level (1 - α) is 1 - 0.05 = 0.95. 

But here’s the problem, and a big source of confusion for some. . . Confidence level is stated as a percentage. Yet, here, we’ve arrived at a decimal, 0.95.

So what gives?

Again, we need to go back to grade school math where we learned that, to convert a decimal to a percentage, we multiply it by 100, or simply move the decimal two places to the right.

Let’s try it: (1 - α) = 1 - 0.05 = 0.95 which means 0.95 * 100 = 95%.

Amazing! We've now arrived at a 95% level of confidence.

And, now, more fun cocktail trivia for you. . . 

Did you know, the word percent is actually a Latin phrase meaning by the hundred? And it can be broken down into two root words:

  • Per, meaning out of
  • And cent = 100 in French (from the Latin centum)

So, when something is 100%, it, literally, means out of 100.

A 95% confidence level means that, if there’s truly no effect, 95 times out of 100 the test will correctly avoid flagging a winner. Or, to use fancy stats speak, you've correctly failed to reject the null hypothesis when there is no effect.

At 95%, those are pretty high odds that you can feel confident in.

Notice something else?

At first glance, a p-value of ≤0.05 and 95% confidence level might not seem like they have anything to do with each other. But, they’re actually interdependent.

Because if we convert the p-value decimal 0.05 to a percentage (0.05 *100 = 5%), we get 5%.

Which means, out of 100%, with a 95% confidence level, there’s just a 5% (100% - 95% = 5%) chance we’ve incorrectly called a winner, or made a type I error. 

The higher the confidence level, the lower the probability the result is just random chance or our own interpretation error. And, therefore, the more significant the finding is.

Or visually written:

↑ confidence level, ↓ p-value, ↑ statistical significance. 

And conversely:

↓ confidence level, ↑ p-value, ↓ statistical significance.

Confidence level best practices

As a general best practice, you should aim to achieve a minimum 95% level of confidence at a significance level of 0.05. 

However, setting the confidence level and significance level really does depend on the experiment and your testing needs. 

If you’re just trying to get a quick win, or test for proof of concept, you might be able to get away with a lower confidence. 

But if you need to show, with absolute certainty, that your million dollar test idea is undeniably a winner, you’re going to want a highly significant result at a very high level of confidence. For example, if you're like Amazon, you're probably not gonna wanna change your checkout flow without being REALLY confident in results. There’s just too much on the line.

Also, a subtle but important point to note, a 95% level of confidence does not mean you’re 95% sure the results are accurate. Rather, it means you’re 95% sure observed results are not just an outcome of random chance or error. 

As well, it’s important to note that even though confidence level is calculated out of 100%, it’s actually impossible to reach 100% confidence. You can only ever be 99.99% confident, as there’s always some possibility of error.


Confidence Interval defined

Test results are commonly reported with a margin of error, known as the confidence interval.

A confidence interval is a range of values you expect your test results to fall between if you were to redo the experiment.

For example, if the conversion lift is 12% ± 4%, the confidence interval would be a range of values anywhere between an 8% (12% - 4%) and a 16% (12% + 4%) conversion lift.

But, in stats, you can't just arbitrarily declare you think the test results would fall within this range. You have to have a certain level of confidence.

A confidence interval answers the question: how probable is it the conversion lift would be close to the same, if the experiment was repeated?

The answer, is, not surprisingly, calculated with the help of our close friend, significance level (α).

Typically, confidence intervals are set to 1 - the significance level.

So if the significance level is 0.05, we report a 95% (1 - 0.05 = 0.95) confidence interval.

Which is different from, and not to be confused with, the confidence level.

It’s a 95% confidence interval because the real result should be within that range 95% of the time. But may be outside that range 5% of the time. A risk we're willing to take.

Although this range can be substantial, it’s set by the statistical power and sample size of your experiment.

Smaller confidence intervals give more precise and trustworthy estimates.
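
To see a confidence interval come out of real numbers, here’s a minimal sketch, assuming Python with statsmodels installed, reusing the same made-up test of 500/10,000 control conversions versus 580/10,000 variant conversions:

```python
# 95% confidence interval for the absolute difference in conversion rates.
from statsmodels.stats.proportion import confint_proportions_2indep

low, high = confint_proportions_2indep(
    count1=580, nobs1=10_000,   # variant conversions and visitors
    count2=500, nobs2=10_000,   # control conversions and visitors
    compare="diff",             # difference in conversion rates
    alpha=0.05,                 # 1 - 0.05 = a 95% confidence interval
)
print(f"95% CI for the absolute lift: {low:.2%} to {high:.2%}")
```

The point estimate of a 0.8 percentage-point lift sits inside a range; a bigger sample would tighten that range, just as described above.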

Confidence interval ≠ statistical significance

Confidence interval and significance are tightly coupled. The lower the p-value, the tighter the confidence interval.

However, confidence interval is not the metric used to declare a winning test. 

Statistical significance is. 

Many times, I’ve heard experimenters boast they’ve achieved a winning test result based on a strong confidence interval. That’s not right.

You should only gloat when you’ve achieved a statistically significant win.


Correctly calling a winning test

What’s in a name: calling your test a winner

Now that you understand how the null hypothesis ties into a type I error, based on significance level α, expressed as a p-value, within a given confidence level, based on a specific confidence interval, it should be clear:

When calling a result “statistically significant,” what you’re really saying, in scientific stats speak, is: the test achieved a p-value of ≤0.05 against the chosen significance level (α = 0.05), giving you 95% confidence that you can reject the null hypothesis without committing a type I error.

Or much more simply stated, the test achieved stat sig at a 95% level of confidence.

In other words, there does, in fact, seem to be a real difference between variants. 

You’ve found a winner! 🙂


Calculating and verifying if your test is a winner

Although most standard A/B testing platforms may declare a statistically significant winning test for you, not all get it right. 

So it’s strongly advocated that you verify test results yourself.

To do so, you should use a test validation calculator, like this one from AB Testguide.

Simply plug in your visitor and conversion numbers, declare if your hypothesis was one- or two-sided, and input your desired level of confidence.

One-sided vs. two-sided tests are a whole topic to explore in and of themselves. There’s no clear consensus on which is best to use for A/B testing. For example, this in-depth article states one-tailed tests are better. But, this one argues two-tailed tests are best.

A one-sided test measures conversions going only one way compared to the control, so either just up OR just down. Whereas a two-sided test measures conversion results going both up AND down, compared to the control.

As a general suggestion, if you care about negative results – and want to measure if the variant performed worse than the control – you should use a two-sided test because you want to be able to evaluate both positive AND negative results.

Conversely, if you only care about detecting a positive result, you can use a one-sided test. A one-sided test lets you more easily evaluate if you’re going to implement the variant because it’s a clear winner.
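
To make the difference concrete, here’s a minimal sketch, assuming Python with statsmodels installed, that runs both flavors of test on the same made-up numbers:

```python
# Compare two-sided and one-sided p-values on the same hypothetical data.
from statsmodels.stats.proportion import proportions_ztest

conversions = [580, 500]      # variant, control (illustrative numbers)
visitors = [10_000, 10_000]

_, p_two_sided = proportions_ztest(conversions, visitors,
                                   alternative="two-sided")
_, p_one_sided = proportions_ztest(conversions, visitors,
                                   alternative="larger")   # tests variant > control only

print(f"Two-sided p-value: {p_two_sided:.4f}")
print(f"One-sided p-value: {p_one_sided:.4f}")  # roughly half the two-sided value here
```

Because a one-sided test only looks for movement in one direction, its p-value comes out smaller here.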

If the result is statistically significant, and you can confidently reject the null hypothesis, you’ll get a notice with a nicely visualized chart. It will look something like this:

If the result is not significant, you’ll see a similar notification informing you a winning test was not observed. It will look like this:

However, an important word of caution: just because your test result is statistically significant doesn’t always mean it’s truly trustworthy. 

There are lots of ways to derive a statistically significant result by “cheating.” In fact, we’re working on a whole new article related to this topic! Stay tuned.


Statistical significance summarized

In A/B testing, we use statistical significance to answer the question: do I have a winning test?  

This question is answered by testing the null hypothesis – which assumes there’s no conversion difference between variants.

If this assumption is found to be false, we declare the test as a winner.

We use significance level alpha (α) to quantify how likely it is we’ve made an error in claiming there’s a difference between versions – when there’s really not; all versions perform the same.

P-value works with α as a truth barometer.

It answers the question: if the null hypothesis is true, how likely is it the outcome will occur?

The answer can be stated with a specific value. A p-value lower than or equal to α, usually set at 0.05, means the outcome is unlikely due to random chance. The result is surprising and, therefore, significant.

The lower the p-value, the more significant the finding. 

When we’re very confident (95%+) about our result, we’re able to put our money where our mouth is and know we’ve truly found a winner, not just an outcome of error or random chance. 

So, in summary, a statistically significant result shows us the result is unlikely due to just random chance or error. There’s strong evidence we’ve found a winner. That's surprising. And, therefore, significant.

We can, then, take our winning result and implement it with confidence, knowing there’s strong evidence it will lift conversions on our website.

Without achieving a statistically significant result, we’re left in the dark.

We have no way to know whether we’ve actually, accurately called a winner, made an error, or got results due to just random chance or noise.

For this reason, understanding what statistical significance is and how to apply it is so important. We don’t want to be wrong in calling and implementing supposed winners.

Currently, statistical significance is the standard, accepted way to declare a winner in frequentist A/B testing. 

Whether it should be or not is a whole other topic. One that’s up for serious debate. . .

But for now, statistical significance is what we use. So apply it. Properly.


Final thoughts

Congrats! You’ve just made it through this article, and have hopefully learned some things along the way.

Now, you deserve a nice break. 

Grab yourself a bubble tea, kick your feet up, and breathe out a sigh of relief.

The next time you go to calculate a winning test, you’ll know exactly what to look for and how to confidently declare a statistically significant result. If you’ve actually got one. 😉

Hope this article has been helpful for you. Please share widely and post your questions or comments in the section below.


Glossary: key terms to know

  • Frequentist testing - a statistical method of calculation used in traditional hypothesis-based A/B testing in which you aim to prove or disprove the null hypothesis.
  • Null hypothesis - the assumption there is no difference between test versions; both perform equally effectively. Your aim is to disprove, or nullify, the null hypothesis.
  • Type I error - a fault that occurs when the null hypothesis is rejected even though it shouldn’t be. Happens when you incorrectly declare a difference between versions when, really, there isn’t.
  • Significance level alpha (α) - a measure used to determine the likelihood of a type I error, expressed as a probability. The standard significance level (α) = 0.05. 
  • P-value - answers the question, if the null hypothesis is true, how likely is it the outcome will occur? The answer can be stated with a specific value. A p-value lower than or equal to α, usually set at 0.05, means the outcome is unlikely due to random chance. The result is surprising and, therefore, significant. The lower the p-value, the more significant the finding.
  • Confidence level - how certain you can be the conversion uplift detected isn’t just the result of random chance or error. Has an inverse relationship to p-value and is expressed as a percentage. A p-value of 0.05 (5%) = 95% confidence.


