How to Correctly Calculate Sample Size in A/B Testing

By: Deborah O'Malley, M.Sc | Last updated December, 2023

What is sample size?

Very simply stated, in A/B testing, sample size describes the number of visitors you need to accurately run a valid test.

The group of users who take part in your experiment makes up the sample population.

When running an experiment, it’s important to have a large enough sample so the portion of users accurately represents your entire audience.

If your sample size is too small, your test will not be adequately powered to detect a meaningful effect. In other words, the results may appear exaggerated or inaccurate and will not truly represent how your entire audience actually behaves.

How large does your sample size need to be to run a valid A/B test?

What’s the minimum sample size needed to run an accurate A/B test?

It’s such a simple question -- with a very difficult answer.

Ask 100 skilled testers their thoughts and they’ll all tell you the same thing: it depends!

In short, the larger the sample size, the better.

And, as a result, the more certain you can be that your test findings are representative and truly reflect your overall population.

But that answer doesn’t really help at all. Does it?

The problem with trying to calculate sample size

The problem is, the population you’re sampling needs to, in theory, be representative of the entire audience you’re trying to target -- which is of course an abstract and impossible task.

You can’t truly ever tap every single individual in your entire audience. Especially since, over time, your audience will presumably grow and change.

But what you can do is capture a broad, or large enough slice of your audience to get a reasonably accurate picture of how most users are likely to behave.

There will, of course, always be outliers, or users who behave entirely differently than anyone else.

But, again, with a large enough sample, these individual discrepancies become smoothed out and larger patterns become apparent -- like whether your audience responds better to version A or B.

So how big does my sample size need to be?

Keep in mind, the actual answer is, “it depends.”

A general rule of thumb is: for a highly reliable test, you need a minimum of 30,000 visitors and 3,000 conversions per variant.

If you follow this guideline, you’ll generally achieve enough traffic and conversions to derive statistically significant results at a high level of confidence.

However, any testing purist will balk at this suggested guideline and tell you that you absolutely must calculate your sample size requirements.

How do you calculate sample size requirements?

A bit of good news.

If you’re able to wrap your head around a few statistical terms, it’s actually relatively easy to calculate your A/B test sample size requirements.

The fastest and most effective way to do so is using a sample size calculator.

Using a sample size calculator

There are many free, online A/B test sample size calculators available.

Some require more detailed information than others, but all answer the same question: what is the minimum number of people you need to get accurate A/B test results?

My favorite sample size calculator

My favorite, go-to sample size calculator is Evan Miller’s calculator because it is so flexible and thorough.

You have the ability to tweak a number of different parameters to yield an accurate sample size calculation.

But, there are many calculators out there. Here are some other good ones:

Every calculator will require you to plug in different inputs or numbers.

This article specifically focuses on the inputs required in Evan Miller’s calculator because it’s one of the most complex.

If you can competently use his calculator, you can accurately use any.

Here’s a screenshot of the calculator:

And here’s where it gets a bit tricky. . .

Sample size calculator terminology

In order to properly use this calculator, you need to understand every aspect of it. Let’s explore what each term means:

Baseline conversion rate: 

The conversion rate for your control or original version. (Usually labelled “version A” when setting up an A/B test).

You should be able to find this conversion rate within your analytics platform.

If you don’t have or don’t know the baseline conversion rate, make your best educated guess.

For an eCommerce site, the average conversion rate across desktop and mobile hovers between 2-5%. So, if you're at a loss, plug in a number between 2-5%.

If you want to be conservative in your guess, go with the lower end -- which will push your sample size requirements up, and vice versa.

Minimum Detectable Effect (MDE): 

The MDE sounds complicated, but it's actually quite simple if you break the concept down into each of the three terms:

  • Minimum = smallest
  • Effect = conversion difference between the control and treatment
  • Detectable = what you want to see from running the experiment

Therefore, the minimum detectable effect is the smallest conversion lift you’re hoping to achieve.

Unfortunately, there’s no magic number for your MDE. Again, it depends.

What does it depend on?

As this article explains, a few things:

  • Historical data: observations you've made over time that show, in general, most tests tend to achieve a certain lift, so this one should too.
  • What's worth it: a number you choose, based on what you consider worth it to take the time and resources to run the experiment. For example, a testing agency may, by default, set the MDE at 10% for every experiment because that's the minimum needed to declare a win for the client.
  • Organizational maturity: a large, mature testing organization, with a lot of traffic, may set the MDE at 1-3% because, through ongoing optimization, getting gains any higher would be unrealistic.

As a very general rule of thumb, an MDE of 2-5% is reasonable.

Therefore, if you don't have enough data to historically inform your MDE, plug in a value between 2-5%.

If you don't have power to detect an MDE of 5%, the test results aren’t trustworthy. The larger the organization and traffic, the smaller the MDE will likely be.

And, in turn, the smaller the effect, the bigger the sample size needed.
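To make the relationship between MDE and sample size concrete, here is a minimal sketch of the arithmetic a calculator performs. It assumes the standard normal-approximation formula for a two-sided test of two proportions; the function name `sample_size_per_variant` and its parameters are illustrative, and the exact numbers may differ slightly from any particular online calculator.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80, relative=True):
    """Approximate visitors needed per variant (two-sided, two-proportion z-test).

    baseline: control conversion rate as a fraction (0.03 means 3%)
    mde: minimum detectable effect, relative (0.10 means a 10% lift)
         or absolute (0.003 means 0.3 percentage points)
    """
    p1 = baseline
    p2 = baseline * (1 + mde) if relative else baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical 3% baseline: shrinking the MDE inflates the required sample.
for mde in (0.10, 0.05, 0.02):
    print(f"MDE {mde:.0%}: {sample_size_per_variant(0.03, mde):,} visitors per variant")
```

Running the loop shows the required sample growing quickly as the MDE shrinks, which is why low-traffic sites often cannot realistically test for very small lifts.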

An MDE can be expressed as an absolute or relative amount.

Absolute: 

The actual raw number difference between the conversion rates of the control and variant.

For example, if the baseline conversion rate is 0.77% and you’re expecting the variant to convert at 1%, the absolute difference is 0.23 percentage points (Variant: 1% - Control: 0.77% = 0.23%). If you instead enter an absolute MDE of 1%, you’re asking to detect a variant converting at 1.77% (Control: 0.77% + 1% = 1.77%).

Relative:

The percentage difference between the baseline conversion rate and the MDE of the variant.

For example, if the baseline conversion rate is 0.77% and you’re expecting the variant to convert at 1%, the relative difference is 29.87% (increase from Control: 0.77% to Variant: 1% = 29.87% gain). Going the other way, a drop from 1% to 0.77% is a relative decrease of 23%.

In general, clients are used to seeing a relative percentage lift, so it’s typically best to use a relative percentage calculation and report results this way.
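The absolute and relative figures can be checked with a few lines of arithmetic. This is a small sketch using the 0.77% and 1% rates from the examples; the helper names are illustrative.

```python
def absolute_difference(control, variant):
    """Raw gap between two conversion rates (as fractions)."""
    return variant - control


def relative_difference(control, variant):
    """Gap expressed as a share of the control's conversion rate."""
    return (variant - control) / control


control, variant = 0.0077, 0.01  # 0.77% and 1%

abs_pp = absolute_difference(control, variant) * 100
rel_pct = relative_difference(control, variant) * 100

print(f"Absolute: {abs_pp:.2f} percentage points")  # Absolute: 0.23 percentage points
print(f"Relative: {rel_pct:.2f}%")                  # Relative: 29.87%
```

The same 0.23-point gap reads as a nearly 30% lift in relative terms, so it's worth stating which one you mean whenever you report an MDE or a result.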

Statistical power 1−β:

Very simply stated, this concept is the probability of finding an “effect,” or difference between the performance of the control and variant(s), assuming there is one.

A power of 0.80 is considered standard best practice. So, you can leave it at the default value on this calculator.

A power of 0.80 means there's an 80% chance that, if there is an effect, you'll accurately detect it without error. Meaning there's only a 20% chance you'd miss properly detecting the effect. A risk worth taking.

In testing, your aim is to ensure you have enough power to meaningfully detect a difference in conversion rates. Therefore, a higher power is always better. But the trade-off is, it requires a larger sample size.

In turn, the larger your sample size, the higher your power, which is part of why a large sample size is required for accurate A/B testing.
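As a rough illustration of how power climbs with sample size, the sketch below estimates power for a two-sided two-proportion z-test using a normal approximation. The function name `approximate_power` and the 3% baseline with a 10% relative lift are assumptions made for the example, not values from any specific calculator.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline, relative_mde, n_per_variant, alpha=0.05):
    """Approximate chance of detecting a true lift of `relative_mde`.

    Uses a normal approximation and ignores the negligible chance of a
    significant result in the wrong direction.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    se = sqrt(p1 * (1 - p1) / n_per_variant + p2 * (1 - p2) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p2 - p1) / se
    return NormalDist().cdf(z_effect - z_alpha)


# Hypothetical scenario: 3% baseline, hoping to detect a 10% relative lift.
for n in (5_000, 20_000, 50_000, 100_000):
    print(f"{n:>7,} visitors per variant -> power {approximate_power(0.03, 0.10, n):.2f}")
```

With small samples the estimated power sits far below the 0.80 target, meaning a real lift would usually go undetected.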

Significance Level α: 

Significance level α, also called "alpha (α)", is like a checkpoint set before calling a test win or loss.

It acts as a tool to control how often we make incorrect conclusions.

As a very basic definition, significance level alpha represents the probability of committing a type I error -- a false positive, which happens when you think you've spotted a conversion difference that doesn't actually exist.

The closer α is to 0, the lower the probability of a type I error/false positive, but the higher the probability of a type II error/false negative.

Therefore, a happy middle ground, and commonly accepted level for α is 0.05.

This level means accepting a 5% (0.05) chance of a false positive. In other words, if the treatment is truly no better than the control, there's a 95% chance the test will correctly show no significant difference.
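To show how α is used once a test finishes, here is a minimal sketch of a two-sided two-proportion z-test. It assumes a pooled standard error and the normal approximation; the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # the commonly accepted significance level

# Hypothetical results: 30,000 visitors per variant.
p_value = two_proportion_p_value(conv_a=3000, n_a=30000, conv_b=3150, n_b=30000)
print(f"p-value: {p_value:.4f}")
print("Significant at alpha = 0.05" if p_value < ALPHA else "Not significant")
```

A p-value below α is the "checkpoint" being passed: the observed difference would be unlikely if the two versions truly performed the same.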

Calculating sample size ahead of running an A/B test

Once you’ve used the sample size calculator and crunched the numbers, you’re ready to start running your A/B test.

Now, you might be thinking, wait a minute. . . Why do I need to calculate the sample size I need BEFORE I run my A/B test?

The answer is two-fold:

  1. So you know you have a large enough sample size for your test to be adequately powered to accurately detect a meaningful effect.
  2. So you don’t prematurely stop the test and incorrectly declare a winner before one has truly emerged.

Remember: you need a large enough sample size, or amount of traffic, to adequately represent your entire audience.

What is peeking?

For most tests and websites, getting enough traffic takes time. But waiting is hard.

If you don’t set the sample size ahead of time, you might be tempted to “peek” at the results prematurely and declare a winner when, in reality, it’s far too early to tell.

Peeking at a test result is like baking a cake, opening the oven and deciding the cake is ready before it’s fully finished.

Even though it’s tempting to stop the test early and dig into that delicious win, the results are only half-baked.

Before pulling a cake out of the oven, a good baker pokes a toothpick, or fork, into the cake to make sure it’s truly ready. If the batter sticks to the fork, the cake needs to stay in the oven longer.

The same goes for an A/B test.

In testing, calculating sample size ahead of time is like an experimenter’s tuning fork.

It’s a definitive, quantitative metric you can use to know when you’ve achieved enough traffic to stop the test and reliably declare a test version has won.

When your "fork" shows the targeted sample size hasn't yet been reached, you need to keep the test running longer.
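The cost of peeking can be illustrated with a small simulation. The sketch below runs many A/A tests, where both versions share the same true conversion rate, so every "winner" is a false positive. It compares checking significance at every interim look against checking only once at the planned sample size. All names and parameters (10 looks, 200 visitors per look, a 10% conversion rate) are assumptions chosen for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided threshold for alpha = 0.05


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs(conv_b - conv_a) / n / se > Z_CRIT


def simulate_aa_tests(n_tests=500, looks=10, batch=200, rate=0.10, seed=42):
    """Return (false positive rate with peeking, rate when checking once at the end)."""
    rng = random.Random(seed)
    peeking_wins = 0
    final_wins = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        significant = False
        for _look in range(looks):
            # Each look adds a fresh batch of visitors to both arms.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            significant = is_significant(conv_a, conv_b, n)
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_wins += flagged_at_any_look  # stopped at the first "win"
        final_wins += significant            # judged only at the planned size
    return peeking_wins / n_tests, final_wins / n_tests


peeking_rate, final_rate = simulate_aa_tests()
print(f"False positives when peeking at every look: {peeking_rate:.1%}")
print(f"False positives when checking only at the end: {final_rate:.1%}")
```

In simulations like this one, the peeking rate typically lands well above the 5% you signed up for, while the single end-of-test check stays close to it.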

(If you liked this analogy, are into baking, and love cake, you've got to A/B test the world's best chocolate icing recipe! Check out the recipe post here).

An important caveat is you only need to calculate sample size ahead of time if you’re running a traditional hypothesis-based test using the frequentist testing methodology.

If you’re running a test using the Bayesian methodology, you don’t need to worry about calculating sample size ahead of time.

(If you're interested in learning more about the frequentist and Bayesian testing methods, check out this GuessTheTest article).

As well, occasionally, you may find your test variant is drastically losing.

You may, then, decide you want to stop the test early to mitigate losses. Doing so should be done with caution, however, and only after you've given the test variant a fair chance.

In the first few days of running a test, especially on lower traffic sites, test results can shift radically.

So, as a rule of thumb, it's best to let a test run its full course, for a minimum of two weeks.

If you don't have the risk tolerance to do so, remember, testing itself is a way to mitigate risk since, in a standard 50/50 split, only 50% of your traffic is exposed to the treatment.

How long do you need to run your test to achieve a valid sample size?

Again, it depends.

What does it depend on? The short answer is: everything.

The longer answer is a myriad of factors including, but not limited to:

  • The type of test you're running
  • How many variants you're testing
  • Seasonal factors
  • Sales cycles
  • And more. . .

However, the general A/B testing best practice is to let a test run for a minimum of 2 weeks but no longer than 6-8 weeks.

This time period is selected so that any trends observed over a one-week period, or less, can be confirmed and validated over again.

To learn more about the ideal timeframe to run your A/B test, what factors to consider, and how to calculate test duration, check out this GuessTheTest article.
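As a rough starting point, test duration can be estimated from the required sample size and your daily traffic. The sketch below is a back-of-the-envelope calculation that assumes traffic is split evenly across variants and rounds up to whole weeks, with the two-week minimum applied as a floor. The function name and traffic figures are hypothetical.

```python
from math import ceil


def estimated_test_weeks(sample_per_variant, variants, daily_visitors, min_weeks=2):
    """Estimate how many full weeks a test must run to reach its sample size."""
    total_needed = sample_per_variant * variants
    days = ceil(total_needed / daily_visitors)
    weeks = ceil(days / 7)  # run whole weeks so every weekday is covered equally
    return max(weeks, min_weeks)


# Hypothetical site: 2,500 visitors a day entering a two-variant test.
weeks = estimated_test_weeks(sample_per_variant=30_000, variants=2, daily_visitors=2_500)
print(f"Estimated duration: {weeks} weeks")  # Estimated duration: 4 weeks
```

If the estimate comes back longer than 6-8 weeks, that is a signal to test fewer variants or accept a larger MDE rather than to let the test drag on.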

Summary

  • To run a valid A/B test, the larger the sample size, the better.
  • As a general guideline, test results are valid when you achieve at least 30,000 visitors per variant with at least 3,000 conversions on that variant. However, generally even higher numbers are preferred.
  • A sample size calculator, like this one, will help you determine how large your sample size needs to be to run a valid test that yields statistically significant results.
  • You should always calculate your sample size needs AHEAD of actually running your A/B test and stop the test only when you've achieved a large enough sample size, as pre-determined by your sample size calculations.
  • In situations with lower traffic, it's best to limit the number of variants tested, usually to 2 (A vs. B), so each version receives enough traffic to draw conclusive results in an adequate timeframe.
  • The ideal testing time period is generally a minimum of 2 weeks and no longer than 6-8 weeks. Anything shorter and results may not hold true over time. Anything longer and other factors may start to confound and muddy test results.

Happy testing!

Hope you found this article helpful! If so, please comment and share widely.

Your thoughts?

Do you have any questions or thoughts? Give your feedback in the comments section below:

KateR
2 years ago

Hi, cool post, you described this question very clearly, thanks! In the post "how to do a/b testing", calculators from Optimizely, AB Tasty, Unbounce, AB Test Guide, and tools.driveback.ru are also given as examples. In general, this is a good guide about testing, its goals, and types.

Keith
8 months ago

I've been trying to understand and apply these concepts for 6 months. This article actually helped me get it
