# Everything You’ve Learned About A/B Testing Is Wrong!

*By: Deborah O’Malley | 2019*

There’s some bad news and good news.

Let’s get the bad news out of the way first. Then, we’ll get to the good news.

The bad news is: what you’ve been taught about how to calculate a valid A/B test is probably wrong!

That’s because you’ve probably been taught using **Frequentist Statistics**.

# Frequentist Statistics

While there’s nothing *wrong* with Frequentist Statistics, the model doesn’t work that well when applied to A/B testing.

The reason is that the Frequentist approach can’t directly answer the question “What’s the probability version B will perform better than version A?” Or vice versa.

In Frequentist statistics, the only way to validly address this question is by stating a null hypothesis. And here’s where it gets more mind-boggling...

# Null Hypothesis

A null hypothesis is actually the **opposite** of the result you’re trying to show. In A/B testing, the goal is to show one variant outperforms the other. But a null hypothesis assumes there’s no difference between variants A and B; both perform equally well.

When you run an A/B test and discover, for example, that variant B converts 5% better than variant A, and that difference is statistically significant, you reject the null hypothesis.

You conclude there is indeed a difference between versions A and B.

Since you’ve rejected the null hypothesis, you’re able to conclude your winning variant has a measurable impact. The evidence behind that decision is summarized in a p-value, which is calculated from the observed conversion rates.

# P-Value

But, here again, most marketers have been doing it wrong. We’ve been incorrectly led to believe that the p-value is the probability that a mistake was made while running the test. For example, if the p-value is 0.01, we’ve been taught to assume our test is 99% accurate. But, this interpretation isn’t correct.

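The p-value is actually the probability of seeing a difference at least as large as the one you observed if the null hypothesis were true. The smaller it is, the stronger the evidence for rejecting the null hypothesis and declaring significance. (By convention, a test is called significant when the p-value is below 0.05.) On its own, though, the p-value gives no indication of how large the difference is or how uncertain your estimate of it is.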
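To make the p-value concrete, here’s a minimal sketch of one common way to calculate it for two conversion rates: a pooled two-proportion z-test. The function name and the visitor and conversion counts are made up for illustration, and your testing tool may use a different method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the null hypothesis that A and B convert equally."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one pooled conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a difference at least this extreme, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: 4,000 visitors per variant, 200 vs. 250 conversions.
p = two_proportion_p_value(conv_a=200, n_a=4000, conv_b=250, n_b=4000)
print(f"p-value: {p:.4f}")
```

A small p-value here says the observed gap would be unusual if A and B truly performed the same. It does not say how likely it is that B is better than A.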

# Confidence Intervals

To overcome this drawback, we’ve been taught to use confidence intervals, which let us calculate uncertainty. But, here too, there are problems.

A 95% confidence level actually means that if you repeated the experiment many times, about 95% of the intervals you calculated would contain the true parameter. That’s not the same as saying there’s a 95% chance the true parameter falls within the one interval you happened to get. Confidence intervals also shift from one experiment to the next. So, you can’t *confidently* assert your results are fully accurate all the time.
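A quick simulation can make the “repeated experiments” idea easier to see. The sketch below assumes a made-up true conversion rate of 5% and uses the simple normal-approximation interval. It reruns the same experiment many times and counts how often the interval captures the true rate. All names and numbers are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(true_rate=0.05, visitors=2000, experiments=1000, level=0.95, seed=1):
    """Share of simulated experiments whose interval contains the true rate."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    hits = 0
    for _ in range(experiments):
        # Simulate one experiment: each visitor converts with probability true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        rate = conversions / visitors
        margin = z * sqrt(rate * (1 - rate) / visitors)
        if rate - margin <= true_rate <= rate + margin:
            hits += 1
    return hits / experiments


print(f"Intervals containing the true rate: {coverage():.1%}")
```

The share should land close to 95%, but any single interval either contains the true rate or it doesn’t. You just can’t tell which from one experiment.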

# Fixed Sample Size

The next faux pas marketers make is not calculating the required sample size before running the test. Typically, we run tests and stop them as soon as we see a significant difference between variants, without taking sample size into account. But most testing software is based on Frequentist Statistics and assumes the sample size was determined in advance.

Since factors like statistical power and significance level are dependent on sample size, you absolutely need to properly calculate required sample size before running your test. To accurately calculate your sample size you can use a tool like this one.
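If you’d like to see what such a calculation involves, here’s a rough sketch using a standard normal-approximation formula for comparing two conversion rates. The function name, the 5% baseline, and the 10% relative lift are assumptions chosen for illustration. Online calculators may use slightly different formulas and give somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical test: 5% baseline conversion rate, hoping to detect a 10% relative lift.
n = sample_size_per_variant(baseline=0.05, lift=0.10)
print(f"Visitors needed per variant: {n:,}")
```

Notice how quickly the requirement grows as the lift you want to detect shrinks. Small improvements on low conversion rates need a lot of traffic.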

Based on your pre-determined sample size, you should then let the test run its full course — until you’ve collected observations from the entire sample population.

# Peeking At Results

But, most of us don’t do that. We peek at our test results. The problem is that in many testing tools, statistical significance is recalculated every time you look at your results.

The more you peek, the more the *reported* statistical significance will vary from the *actual* statistical significance. Each look is another chance for random noise to cross the significance threshold.

So, your results aren’t going to be accurate.
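You can see the effect of peeking with a simulated A/A test, where both variants share the same true conversion rate and any “winner” is a false positive. The sketch below is only an illustration under made-up settings (a 10% conversion rate, 2,000 visitors per variant, and a pooled z-test at each look). It compares checking once at the end with checking 20 times and stopping at the first significant reading.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Pooled two-proportion z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled == 0 or pooled == 1:
        return False  # no variation yet, so nothing to test
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / std_err
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def false_positive_rate(peeks, tests=500, visitors=2000, rate=0.10, seed=7):
    """Share of A/A tests declared significant when stopping at the first 'win'."""
    rng = random.Random(seed)
    step = visitors // peeks  # assumes visitors is divisible by peeks
    false_positives = 0
    for _ in range(tests):
        conv_a = conv_b = 0
        for n in range(1, visitors + 1):
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            # Peek at evenly spaced checkpoints and stop as soon as we "win".
            if n % step == 0 and is_significant(conv_a, conv_b, n):
                false_positives += 1
                break
    return false_positives / tests


print(f"1 look:   {false_positive_rate(peeks=1):.1%} false positives")
print(f"20 looks: {false_positive_rate(peeks=20):.1%} false positives")
```

With a single look, the false-positive rate should sit near the 5% you signed up for. With 20 looks, it typically comes out several times higher, even though nothing about the variants changed.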

If you’re interested in learning more about the reasoning behind this outcome, check out this article. It’s an oldie, but goodie.

# What We’re Doing Wrong

Because marketers tend to peek and stop their tests when they reach significance, rather than when they reach an adequate sample size, results may be flawed.

Add to that the fact that Frequentist statistics can’t directly tell you the probability that version B is better than version A, only whether to reject the null hypothesis, and you can see that many A/B tests rest on incorrectly applied assumptions, not solid evidence.

That’s the bad news.

The good news is, there’s a better approach. It’s called Bayesian statistics.


What are your thoughts on using Frequentist statistics? Let us know in the comments section below.

I was surprised to learn that, until now, marketers have been carrying out A/B testing the wrong way. Still, I believe the marketing automation platform I use does it in a better way, as it lets me improve conversions.

@Matthew – Thanks for your comment. Yes, it’s surprising to know that many testing software platforms are based on a statistical model that requires specific user knowledge and behavior. Enhancing conversions is fantastic, but before implementing any changes, you’ll want to ensure your data is completely accurate and valid. Otherwise, you risk implementing results which aren’t fully accurate.