By: Deborah O'Malley, M.Sc | Last updated July, 2021
Very simply stated, in A/B testing, sample size describes the number of visitors you need to accurately run a valid test.
The group of users who take part in your experiment makes up the sample population.
When running a split-test, it’s important to have a large enough sample so the portion of users accurately represents your entire audience.
If your sample size is too small, your test results will not be statistically valid or reliable at a high level of confidence. In other words, the results may not accurately represent how your entire audience actually behaves.
What’s the minimum sample size needed to run an accurate A/B test?
It’s such a simple question -- with a very difficult answer.
Ask 100 skilled testers their thoughts and they’ll all tell you the same thing: it depends!
In short, the larger the sample size, the better.
And, as a result, the more certain you can be that your test findings are representative and truly reflect your overall population.
But that answer doesn’t really help at all. Does it?
The problem is, the population you’re sampling needs to, in theory, be representative of the entire audience you’re trying to target -- which is of course an abstract and impossible task.
You can’t truly ever tap every single individual in your entire audience. Especially since, over time, your audience will presumably grow and change.
But what you can do is capture a broad, or large enough slice of your audience to get a reasonably accurate picture of how most users are likely to behave.
There will, of course, always be outliers, or users who behave entirely differently than anyone else.
But, again, with a large enough sample, these individual discrepancies become smoothed out and larger patterns become apparent -- like whether your audience responds better to version A or B.
Keep in mind, though, the actual answer is still: “it depends.”
A general rule of thumb is: for a highly reliable test, you need a minimum of 1,000 visitors and 100 conversions per variant.
If you follow this guideline, you’ll generally achieve enough traffic and conversions to derive statistically significant results at a high level of confidence.
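The rule of thumb above is easy to turn into a quick readiness check. Here’s a minimal sketch -- the function name and thresholds simply restate the guideline and aren’t an official formula:

```python
def meets_rule_of_thumb(visitors, conversions, min_visitors=1000, min_conversions=100):
    """Check whether a variant has hit the rule-of-thumb minimums:
    at least 1,000 visitors AND 100 conversions."""
    return visitors >= min_visitors and conversions >= min_conversions

# A variant with 1,500 visitors and 120 conversions passes;
# one with only 80 conversions doesn't, no matter its traffic.
print(meets_rule_of_thumb(1500, 120))  # True
print(meets_rule_of_thumb(1500, 80))   # False
```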
However, any testing purist will balk at this suggested guideline and tell you, you absolutely must calculate your sample size requirements.
A bit of good news.
If you’re able to wrap your head around a few statistical terms, it’s actually relatively easy to calculate your A/B test sample size requirements.
The fastest and most effective way to do so is using a sample size calculator.
There are many free, online split-test sample size calculators available.
Some require more detailed information than others, but all will yield the same answer: the minimum number of people you need to yield accurate A/B test results.
My favorite, go-to sample size calculator is Evan Miller’s calculator because it is so flexible and thorough.
You have the ability to tweak a number of different parameters to yield an accurate sample size calculation.
Every calculator will require you to plug in different inputs or numbers.
This article specifically focuses on the inputs required in Evan Miller’s calculator because it’s one of the most complex.
If you can competently use his calculator, you can accurately use any of them.
Here’s a screenshot of the calculator:
And here’s where it gets a bit tricky. . .
In order to properly use this calculator, you need to understand every aspect of it. Let’s explore what each term means:
If you don’t have or don’t know the baseline conversion rate, make your best educated guess.
Very simply stated, in A/B testing, an “effect” essentially shows one version indeed outperformed the other(s).
The minimum detectable effect is the smallest conversion lift you’re hoping to achieve with the winning variant. The smaller the expected gain, the bigger your sample size needed.
Unfortunately, there’s no magic number; your MDE is going to depend on your specific needs.
As a starting point, ask yourself: what is the minimum improvement needed to make running this test worthwhile for myself or the client? Testing takes a lot of time, energy, and resources. You want it to pay off.
The actual raw number difference between the conversion rates of the control and variant.
For example, if the baseline conversion rate is 0.77% and you’re expecting the variant to achieve an MDE of 1%, the absolute difference is 0.23% (Variant: 1% - Control: 0.77% = 0.23% absolute difference).
The percentage difference between the baseline conversion rate and the MDE of the variant.
For example, if the baseline conversion rate is 0.77% and you’re expecting the variant to achieve an MDE of 1%, the relative difference between the percentages is 29.87% (increase from Control: 0.77% to Variant: 1% = 29.87% gain).
In general, clients are used to seeing a relative percentage lift, so it’s typically best to use this calculation and report results this way.
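The two difference calculations above are simple enough to verify by hand. A small sketch using the example numbers (0.77% baseline, 1% target for the variant):

```python
baseline = 0.0077  # 0.77% baseline conversion rate
target = 0.0100    # 1% expected conversion rate for the variant

# Absolute difference: the raw percentage-point gap between the two rates.
absolute_diff = target - baseline

# Relative difference: the gain expressed as a proportion of the baseline.
relative_lift = (target - baseline) / baseline

print(f"Absolute difference: {absolute_diff * 100:.2f} percentage points")  # 0.23
print(f"Relative lift: {relative_lift * 100:.2f}%")                         # 29.87%
```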
Very simply stated, the probability of finding an “effect,” or difference between the performance of the control and variant(s), assuming there is one.
A power of 0.80 is considered standard best practice, so you can leave it at the default value in this calculator.
A power of 0.80 means there's an 80% chance that, if there is an effect, you'll detect it. Meaning there's only a 20% chance you'd miss detecting the effect.
In testing, your aim is to ensure you have enough power to meaningfully detect a difference in conversion rates. Therefore, a higher power is always better. But the trade-off is, it requires a larger sample size.
In turn, the larger your sample size, the higher your power -- which is part of why a large sample size is required for accurate A/B testing.
As a very basic definition, the significance level, alpha, describes how trustworthy the result is.
As an A/B testing best practice, your significance level should be 5% or lower.
This number means there’s less than a 5% chance you’ve happened to find a difference between the control and variant -- when no difference actually exists.
As such, you’re 95% confident results are accurate, reliable, and repeatable.
It’s important to note that results can never actually achieve 100% statistical significance.
Instead, you can only ever be up to 99.9% confident that a measurable conversion rate difference truly exists between the control and variant.
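With all the terms defined -- baseline rate, MDE, power, and significance level -- the required sample size can be approximated with the textbook two-proportion formula. This sketch uses only the Python standard library; Evan Miller’s calculator uses a similar but not identical computation, so expect small differences:

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed PER VARIANT to detect a lift from
    baseline rate p1 to target rate p2, using a two-sided,
    two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    p_bar = (p1 + p2) / 2                          # pooled average rate
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# The running example: 0.77% baseline, 1% target, alpha = 5%, power = 80%
print(required_sample_size(0.0077, 0.0100))  # roughly 26,000 visitors per variant
```

Note how quickly the requirement grows for small lifts: detecting a 0.23 percentage-point difference at these low conversion rates takes tens of thousands of visitors per variant.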
Once you’ve used the sample size calculator and crunched the numbers, you’re ready to start running your A/B test.
Now, you might be thinking, wait a minute. . . Why do I need to calculate the sample size I need BEFORE I run my A/B test?
The short answer is: so you don’t prematurely stop the test and incorrectly declare a winner before one has truly emerged.
Remember: you need a large enough sample size, or amount of traffic, to adequately represent your entire audience.
For most tests and websites, getting enough traffic takes time. But waiting is hard.
If you don’t set the sample size ahead of time, you might be tempted to “peek” at the results prematurely and declare a winner when, in reality, it’s far too early to tell.
Peeking at a test result is like baking a cake, opening the oven and deciding the cake is ready before it’s fully finished.
Even though it’s tempting to stop the test early and dig into that delicious win, the results are only half-baked.
Before pulling a cake out of the oven, a good baker pokes a toothpick, or fork, into the cake to make sure it’s truly ready. If the batter sticks to the fork, the cake needs to stay in the oven longer.
The same goes for an A/B test.
In testing, calculating sample size ahead of time is like an experimenter’s tuning fork.
It’s a definitive, quantitative metric you can use to know when you’ve achieved enough traffic to stop the test and reliably declare a test version has won.
When your "fork" shows the targeted sample size hasn't yet been reached, you need to keep the test running longer.
If you’re running a test using the Bayesian methodology, you don’t need to worry about calculating sample size ahead of time.
If you’re interested in learning more about these testing methods, check out this GuessTheTest article.
(And, if you’re into baking, love cake, and want to A/B test the world’s best chocolate icing recipe, check out this recipe post).
Again, it depends.
What does it depend on? The short answer is: everything.
The longer answer is: a myriad of factors, including but not limited to:
Because every website is different, there’s no set time period for how long a test should run.
The right answer is: long enough to accurately take into account factors like seasonality and your company’s sales cycles.
The general A/B testing best practice is to let a test run for a minimum of 2 weeks, but no longer than 6-8 weeks.
This time period is selected so that any trends observed over a one-week period, or less, can be confirmed and validated over again.
For example, if users behave differently on the weekend, you’ll want to see that same pattern repeat across two weekends, rather than just once, to smooth out any questions or discrepancies in the data.
However, if you need to run a test for longer than 6 weeks, you likely don’t have enough traffic to get a significant result.
As well, after about 6 weeks, the data starts to become muddied. Things like user patterns may shift, or cookies may get deleted, introducing a whole new set of variables into the equation.
And, as a result, you won’t know whether it’s changing user behavior, or something else, that’s contributing to the test results.
Plus, it can be expensive or frustrating when tests need to run for several months.
So, if you have to run your test longer than the ideal 2-6 week timeframe, you should question how worthwhile it is to run the test.
A testing duration calculator, like this one, can help determine how long it will likely take for a test to conclude, based on projected site traffic.
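A duration estimate of this kind essentially divides the total required sample by expected daily traffic. Here’s a minimal sketch -- the traffic and sample figures are made-up illustrations, not benchmarks:

```python
from math import ceil

def estimate_test_days(sample_per_variant, n_variants, daily_visitors):
    """Estimate how many days a test needs to run, assuming site traffic
    is split evenly across all variants."""
    total_needed = sample_per_variant * n_variants
    return ceil(total_needed / daily_visitors)

# e.g. ~26,000 visitors per variant, a standard A/B test (2 variants),
# and 2,000 visitors per day -> 26 days, just under 4 weeks
print(estimate_test_days(26_000, 2, 2_000))  # 26
```

If the estimate lands well past the 6-8 week mark, that’s an early signal the site may not have enough traffic for the test to be worthwhile.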
Previously collected analytics data can give you advanced insight into traffic trends.
If you have the time, resources, and budget, running a low-traffic A/B test is still better than not testing anything at all.
A low traffic test that takes longer to run will still give you some indication of how site visitors are likely to act and how a variant will most probably perform.
Results, however, are more of an approximation of what is likely to work rather than absolute evidence of what has proven to work.
So, keep this outcome in mind when deciding to implement any so-called “winning” designs.
Hope you found this article helpful! If so, please share widely.
Do you have any questions or thoughts? Give your feedback in the comments section below: