By: Deborah O’Malley | 2019
There’s some bad news and good news.
Let’s get the bad news out of the way first, then move on to the good news.
The bad news is: what you’ve been taught about how to calculate a valid A/B test is probably wrong!
That’s because you’ve probably been taught using Frequentist statistics.
While there’s nothing wrong with Frequentist Statistics, the model doesn’t work that well when applied to A/B testing.
An A/B test asks a simple question: does one variant outperform the other? In Frequentist statistics, the only valid way to address this question is by stating a null hypothesis. And here’s where it gets more mind-boggling...
A null hypothesis is actually the opposite of the result you’re trying to prove. In A/B testing, the goal is to show one variant outperforms the other. But, a null hypothesis assumes there’s no difference between variants A and B; both perform equally effectively.
Suppose you run the test and find there is indeed a difference between versions A and B.
If that difference is large enough to reject the null hypothesis, you’re able to conclude your winning variant has a measurable impact. The evidence comes from the data: you empirically calculate each variant’s conversion rate and, from those rates, arrive at a p-value.
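To make the steps from conversion rates to p-value concrete, here is a minimal sketch using a two-proportion z-test. That test is one common Frequentist choice, not necessarily the one your testing tool uses, and the function name and visitor counts below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the null hypothesis that A and B convert equally."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one pooled conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # Probability of a difference at least this extreme if the null were true.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical data: A converts 200 of 10,000 visitors, B converts 250 of 10,000.
p = two_proportion_p_value(200, 10_000, 250, 10_000)
print(f"p-value: {p:.4f}")
```

With these illustrative numbers the p-value lands below 0.05, so the null hypothesis of "no difference" would be rejected at the usual threshold.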
P-Value and Confidence Intervals
But, here again, most marketers have been doing it wrong. We’ve been incorrectly led to believe that the p-value is the probability that a mistake was made while running the test. For example, if the p-value is 0.01 we’ve been taught to assume our test is 99% accurate. But, this interpretation isn’t correct.
The p-value actually represents the strength of evidence we have to reject the null hypothesis and declare significance: it is the probability of seeing a difference at least as large as the one observed if the variants truly performed the same. (A test is conventionally called significant when its p-value is <0.05.) When you just calculate the p-value, there’s no indication of how large the difference is or how uncertain the estimate is.
To overcome this drawback, we’ve been taught to use confidence intervals, which let us calculate uncertainty. But, here too, there are problems.
A 95% confidence level actually means that if you repeated the experiment many times, about 95% of the intervals calculated this way would contain the true parameter. That’s not the same as saying there’s a 95% chance the true parameter falls within the one interval you calculated. Confidence intervals also shift from one experiment to the next. So, you can’t confidently assert your results are fully accurate all the time.
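A small simulation can make the "repeated experiments" reading of a confidence level easier to see. The sketch below assumes a simple normal-approximation (Wald) interval for a single conversion rate and an invented true rate of 10%; both are illustrative choices, not something prescribed by any particular testing tool.

```python
import random
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for one conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


def coverage(true_rate, visitors, experiments, seed=0):
    """Share of simulated experiments whose interval contains the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        # Simulate one experiment: each visitor converts with probability true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        low, high = wald_interval(conversions, visitors)
        hits += low <= true_rate <= high
    return hits / experiments


# Hypothetical setup: true conversion rate 10%, 500 visitors per experiment.
share = coverage(true_rate=0.10, visitors=500, experiments=1000)
print(f"Intervals containing the true rate: {share:.1%}")
```

Each simulated experiment produces a different interval, and it is the long-run share of intervals capturing the true rate that sits near 95%, not the probability attached to any single interval.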
Fixed Sample Size
The next faux pas marketers make is not calculating the required sample size before running the test. Typically, we run tests and stop them once we’ve achieved a significant difference between variants, without taking sample size into account. But, most testing software is based on Frequentist statistics and assumes the sample size was determined in advance.
Since factors like statistical power and significance level are dependent on sample size, you absolutely need to properly calculate required sample size before running your test. To accurately calculate your sample size you can use a tool like this one.
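For a sense of what such a calculation involves, here is a rough sketch of one textbook formula for the sample size needed to compare two conversion rates. The baseline rate, the expected rate, and the defaults of 5% significance and 80% power are assumptions chosen for illustration; a dedicated calculator may use a slightly different formula and give a somewhat different number.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical goal: detect a lift from a 2.0% to a 2.5% conversion rate.
n = sample_size_per_variant(0.02, 0.025)
print(f"Visitors needed per variant: {n}")
```

Under these assumptions the answer comes out at about 13,800 visitors per variant. Smaller expected lifts push the requirement up quickly, which is why the calculation belongs before the test, not after.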
Based on your pre-determined sample size, you should then let the test run its full course, until you’ve collected observations from the full sample you planned for.
Peeking At Results
But, most of us don’t do that. We peek at our test results. The problem is, in many testing tools, statistical significance is recalculated every time you look at your results.
The more you peek, the more the reported statistical significance will vary from the actual statistical significance.
So, your results aren’t going to be accurate.
If you’re interested in learning more about the reasoning behind this outcome, check out this article. It’s an oldie, but goodie.
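The effect of peeking can also be sketched with a simulated A/A test, where both variants are identical and any "significant" result is a false alarm. Everything here is an illustrative assumption: the 10% conversion rate, the ten evenly spaced peeks, and the two-proportion z-test used to recompute significance at each look.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 1.0  # no conversions (or all conversions): no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(rate=0.10, visitors=2000, looks=10,
                         experiments=500, seed=0):
    """Compare stopping at the first significant peek with waiting for the end."""
    rng = random.Random(seed)
    step = visitors // looks
    flagged_while_peeking = 0
    flagged_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        any_peek_significant = False
        for look in range(1, looks + 1):
            # Both variants share the same true rate, so there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = p_value(conv_a, look * step, conv_b, look * step)
            if p < 0.05:
                any_peek_significant = True
        flagged_while_peeking += any_peek_significant
        flagged_at_end += p < 0.05  # p now holds the final look's value
    return flagged_while_peeking / experiments, flagged_at_end / experiments


with_peeking, final_only = false_positive_rates()
print(f"False positives when stopping at any significant peek: {with_peeking:.1%}")
print(f"False positives when checking only at the end: {final_only:.1%}")
```

Because the final look is itself one of the peeks, the peeking rate can never be lower than the end-only rate. In a run like this it typically comes out well above the nominal 5%.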
What We’re Doing Wrong
Because marketers tend to peek and stop their studies when the test has reached significance, not when it has reached an adequate sample, results may be flawed.
Add to that the fact that a Frequentist test can’t tell you the probability that version B is better than version A (it can only reject, never prove, the null hypothesis), and you can see that many A/B tests rest on incorrectly applied assumptions, not solid evidence.
That’s the bad news.
The good news is, there’s a better approach. It’s called Bayesian statistics.
What are your thoughts on using Frequentist statistics? Let us know in the comments section below.