
By: Deborah O'Malley | Last updated September, 2022

What is Minimum Detectable Effect (MDE)?

If you've been into experimentation long enough, you've likely come across the term MDE -- which stands for Minimum Detectable Effect.

The MDE sounds big and fancy, but the concept is actually quite simple when you break it down. It's the:

  • Minimum = smallest
  • Effect = conversion difference
  • Detectable = you want to see from running the experiment

Why is MDE important?

As this GuessTheTest article explains, in order to run a trustworthy experiment -- one that's properly powered, based on an adequate sample -- it's crucial you calculate the MDE.

But not just calculate it.

Calculate it AHEAD of running the experiment.

The problem is, doing so can feel like a tricky, speculative exercise.

After all, how can you possibly know what effect, or conversion lift you want to detect from the experiment?! If you knew that, you wouldn't need to run the experiment to begin with!

Adding insult to injury, things get even more hazy because the MDE is directly tied into your sample size requirements.

The larger the MDE, the smaller the sample size needed to run your experiment. And vice versa. The smaller the MDE, the bigger the sample required for your experiment to be adequately powered.

But if your sample size requirements are tied into your MDE, and you don't know your MDE, how can you possibly know the required sample size either?

The answer is: you calculate them. Both. At the same time.

There are lots of head-spinning ways to do so. This article outlines a few.

But, if you're not mathematically inclined, here's the good news. . .

You can use a pre-test analysis calculator, like this one, to do all the hard work for you:

Now, as said, that's the good news!

The bad news is, even a calculator like this one isn't all that intuitive.

So, to help you out, this article breaks down exactly what you need to input into an MDE calculator, with step-by-step directions and screenshots so you'll be completely clear and feel fully confident every step of the way.

Let's dig in:

Working the MDE calculator

To work this calculator, you’ll need to know your average weekly traffic and conversion numbers.

If you’re using an analytics platform, like Google Analytics, you’ll be able to easily find this data by looking at your traffic and conversion trends.


In Google’s current Universal Analytics, traffic data can be obtained by going to the Audience/Overview tab:

It’s typically best to take a snapshot of at least 3 months to get a broader, bigger-picture view of your audience over time.

For this example, let’s set our time frame from June 1 - Aug. 31.

Now, you can decide to look at these numbers three ways:

  • Users: the total number of users, or visitors, coming to your site during the date range.
  • New users: those visitors who come to your site for the first time during that date range.
  • Sessions: users who interact with your website within a particular timeframe. As this article explains, the same user can have multiple sessions on your website.

Given these differences, calculating the total number of users will probably give you the most accurate indication of your traffic trends.

With these data points in mind, over the 3-month period, this site saw 67,678 users. There are, typically, about 13 weeks in 3 months, so to calculate users per week you’d divide 67,678/13=5,206.

In other words, the site received about 5,206 users/week.

You’d then plug this number into the calculator.


To calculate the number of conversions over this time period, you’ll need to have already set-up conversion goals in Google Analytics. Here’s more information on how to do so.

Assuming you’ve set-up conversion goals, you’ll next assess the number of conversions by going to the Goals/Overview tab, selecting the conversion goal you want to measure for your test, and seeing the number of conversions:

In this example, there were 287 conversions over the 3-month time period which amounts to an average of 287/13=22 conversions/week. 

Now, imagine you want to test two variants: version A (the control, or original version) and B (the variant).

You’d now plug the traffic, conversion, and variant numbers into the calculator:

Now you can calculate your baseline conversion rate, which is the rate at which your current (control) version is converting.

This calculator will automatically calculate your baseline conversion rate for you, based on the numbers above.

However, if you want to confirm the calculations, simply divide the number of goals completed by the traffic which, in this case, is 22 conversions per week/5,206 visitors per week (22/5,206=0.0042). To get a percentage, multiply this amount by 100 (0.0042*100=0.42%).

You’d end up with a baseline conversion rate of 0.42%:
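If you'd like to sanity-check these numbers, the arithmetic is easy to script. Here's a minimal Python sketch using the figures from this example:

```python
# Reproduce the pre-test numbers from this example
total_users = 67_678        # users over the 3-month window
total_conversions = 287     # goal completions over the same window
weeks = 13                  # roughly 13 weeks in 3 months

users_per_week = round(total_users / weeks)              # 5,206
conversions_per_week = round(total_conversions / weeks)  # 22

# Baseline conversion rate = conversions / traffic, as a percentage
baseline_rate = conversions_per_week / users_per_week * 100

print(users_per_week, conversions_per_week, round(baseline_rate, 2))
# 5206 22 0.42
```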

Next, plug in the confidence level and power at which you want to obtain results.

As a general A/B testing best practice, you want a confidence level of at least 95% and statistical power of at least 80%:

Based on these numbers, the pre-test sample size calculator indicates you’ll want to run your test for:

  • At least 6 weeks
  • With at least 15,618 visitors/variant
  • Based on a relative MDE of at least 46.43%
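As a rough cross-check, you can back the relative MDE out of the required sample size using a common rule-of-thumb formula (n = 16·p(1−p)/(p·MDE)² visitors per variant, for roughly 80% power at 95% confidence). This is only an approximation; the calculator uses a different leading constant, so the numbers won't match exactly:

```python
import math

p = 0.0042   # baseline conversion rate (0.42%)
n = 15_618   # required visitors per variant from the calculator

# Rearranging n = 16 * p * (1 - p) / (p * mde)^2 for the relative MDE:
mde = math.sqrt(16 * (1 - p) / (p * n))
print(f"relative MDE ~ {mde:.1%}")  # roughly 49%
```

The ~49% it returns is in the same ballpark as the calculator's 46.43%; the gap comes from the slightly different constants the two approaches use.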

The optimal MDE

As a very basic rule of thumb, some optimization experts, like Ronny Kohavi, suggest setting the relative MDE up to a maximum of 5%.

If the experiment isn't powered enough to detect a 5% effect, the test results can't be considered trustworthy.

However, it's also dangerous to go much beyond 5% because, at least in Ronny's experience, most trustworthy tests don't yield more than a 5% relative conversion lift.

As such, for a mature testing organization with large amounts of traffic and an aggressive optimization program, a relative 1-2% MDE is more reasonable and is still reason to celebrate.

MDE guidelines to follow

In the example shown above, the relative MDE was 46.43%, which is clearly above the 5% best practice.

This MDE indicates traffic is on the lower side and your experiment may not be adequately powered to detect a meaningful effect in a reasonable timeframe.

In this case, if you do decide to proceed with running the test, make sure to follow these guidelines:

  • Calculate the sample size requirements ahead of time. Make sure you have enough traffic to reach the suggested sample size in an adequate timeframe.
  • Don't stop the experiment early before you've reached this calculated sample size target -- even if results appear significant earlier.
  • Run the test for the minimum stated testing time period recommended by the calculator, or at the very least two weeks to round out any discrepancies in user behavior. 
  • Consider if the test is truly worth running, and use the outcome only as an indicator of results, not gospel. Sites with low samples (traffic or conversion numbers) are tricky to test on.
  • Focus on making more pronounced changes that should, hopefully, create a bigger positive impact and have a larger effect on conversions.

Hope this article has been useful for you. Share your thoughts and comments below:


Written by Deborah O'Malley

Deborah O'Malley is a top A/B testing influencer who founded GuessTheTest to connect digital marketers interested in A/B testing with helpful resources and fun, gamified case studies that inspire and validate testing ideas.

With a special contribution from Ishan Goel

Ishan Goel is a data scientist and statistician. He's currently leading the data science team at Wingify (the parent company of VWO) to develop the statistical algorithms powering A/B testing. An avid reader and writer, Ishan shares his learnings about experimentation on his personal blog, Bagels for Thought.

Special thanks to Ronny Kohavi

Ronny Kohavi is an esteemed A/B testing consultant who provided valuable feedback on earlier drafts of this article and raised some of the key points presented through his Accelerating Innovation with A/B Testing class and recent paper on A/B Testing Intuition Busters.


An astounding +364% lift in conversions, an enormous +337% improvement in clickthrough rate, and a cruelly disappointing -60% drop in form submissions.

What do all these jaw-dropping conversion rates have in common?

They’re all extreme results!

And, as this article explains, they’re probably not real.

Because, according to something known as Twyman’s Law, any figure that's interesting or different is usually wrong. 

Its trustworthiness is suspect.

Example of +364% lift in conversions

A great example of this concept comes from Ryan Thomas, co-founder of the optimization agency Koalatative.

He shared a scathingly sarcastic LinkedIn post announcing he had achieved a record-breaking +364% lift running a client A/B test:

Sounds impressive!

But, as Ryan, and others aptly explained, the tiny sample of 17 vs. 11 visitors, with 1 vs. 3 conversions, was so low, the results were incredibly skewed.

This extreme result is, unfortunately, not a one-off case.

Example of a +337% conversion lift

In fact, as a learning exercise for experimenters, GuessTheTest recently published a similar case study in which the testing organization claimed to achieve a +337% conversion lift.

That’s massive!

But, taking a closer look at the conversion figures, you can see at just 3 vs. 12, the traffic and conversion numbers were so low, the lift appeared artificially huge: 

And, according to Twyman's Law, that makes the test's trustworthiness suspect.

Example of a -60% drop in conversions

Okay, so you’re probably thinking, yeah, but these examples are of very low traffic tests.

And everyone knows, you shouldn’t test with such low traffic!


But high traffic sites aren’t immune to this issue either.

In fact, in a study recently run for a prominent SEO site -- with thousands of daily visitors -- one test yielded an extremely disappointing -60% drop in conversions.

However, on closer inspection, the seemingly enormous drop was the difference between just 2 vs. 5 conversions. 

Although the page had thousands of visitors per variant, very, very few were converting. 

And of those that did, there were far too few conversions to know if one version truly outperformed or if the conversion difference was just due to random chance.

Statistically significant results aren't always trustworthy

The obvious problem with all these tests was that the sample -- either the traffic and/or the conversion numbers -- was so low, the estimate of the lift was unreliable.

It appeared enormous when, in reality, it was just the difference between a few random conversions.

But, the problem is, you can get these kinds of test outcomes and still achieve statistically significant results.


Take this example of 3 vs. 12 conversions based on a sample size of 82 vs. 75 visitors.

As you can see, plugging the numbers into a statistical significance calculator shows the result is indeed significant at a 95% level of confidence with a p-value of 0.009:

Which goes to show: a test can appear significant after only a few conversions, but that doesn't actually equate to a trustworthy A/B test result.
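You can verify this result yourself with a standard two-proportion z-test. Here's a self-contained sketch using only Python's standard library (the pooled-variance z-test is one common way to compute this; your testing tool may use a slightly different method):

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_b - p_a) / se
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

# 3 conversions from 82 visitors vs. 12 conversions from 75 visitors
p = two_proportion_p_value(3, 82, 12, 75)
print(round(p, 3))  # 0.009 -- "significant," despite only 15 total conversions
```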

How is this outcome possible?

It's all about the power 

As highly regarded stats guru, Ronny Kohavi, explains in his class, Accelerating Innovation With A/B Testing, it’s all about the power.

A result can appear statistically significant, yet in an underpowered experiment, the lift will be exaggerated.

What is power?

Power measures the likelihood of accurately detecting a real effect, or conversion difference, between the control and treatment(s), assuming a difference exists.

Power is a function of delta.

Delta describes the statistical sensitivity or ability to pick up a conversion difference between versions. 

This conversion difference is the minimum effect size, or smallest conversion difference, you want to detect. 

Smaller values make the test more sensitive but require more users.

If these terms all seem a bit confusing, Ishan Goel, lead data scientist at the A/B testing platform VWO’s parent company, Wingify, offers a more relatable, real-life example.

He suggests you can best understand power, and its relationship to delta, by thinking of a thermometer used to detect a fever.

If you have a low fever, a cheap thermometer that's not very sensitive to slight temperature changes might not pick up your mild fever. It's too low-powered.

You need a thermometer that's very sensitive or high-powered.

The same is true in testing.

To accurately detect small conversion differences, you need high power. The higher the power, the higher the likelihood of accurately detecting a real effect, or conversion difference.

But, there’s a trade off. The higher the power, the larger the sample size also needs to be. 

In turn, when sample sizes are low, power is reduced.

Underpowered Experiments

A test is what's known as underpowered when the sample is so low the effect, or conversion difference detected, isn't accurate. 

All the examples we just saw were of low sample, underpowered tests.

Sure, the results may have been statistically significant. But the conversion rate was artificially skewed because the samples were so low the test wasn't adequately powered.

The sample – whether it be traffic, conversions, or both – was too low to be adequately powered. 

The lower the power, the more exaggerated the effect.

The winner's curse

Statistically significant, low-powered tests happen more than most experimenters would like to admit.

In fact, as highly regarded experimentation expert, Ronny Kohavi explains, because of the way statistics works, if you pick the standard alpha (p-value threshold) of 0.05, you’ll get a statistically significant result at least 5% of the time – whether there’s a true difference or not. 

In those +5% of cases, the estimated effect will be exaggerated.

Here’s a document Ronny created showing this phenomenon:

With a p-value of 0.05, 1 in 20 experiments will be statistically significant even when there is actually no real difference between the control and treatment.
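To see this phenomenon in action, here's a hypothetical A/A simulation (an illustrative sketch, not from Ronny's document): both variants share the exact same true conversion rate, yet roughly 1 in 20 comparisons still comes out statistically significant at alpha = 0.05.

```python
import math
import random

random.seed(42)  # fixed seed so the simulation is repeatable

def significant(conv_a, conv_b, n):
    """Two-sided, pooled two-proportion z-test at alpha = 0.05."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    z = abs(conv_b - conv_a) / (n * se)
    return math.erfc(z / math.sqrt(2)) < 0.05

TRUE_RATE, N, RUNS = 0.10, 1_000, 2_000
false_positives = 0
for _ in range(RUNS):
    # Both "variants" are identical -- any significant result is a false positive
    conv_a = sum(random.random() < TRUE_RATE for _ in range(N))
    conv_b = sum(random.random() < TRUE_RATE for _ in range(N))
    if significant(conv_a, conv_b, N):
        false_positives += 1

print(false_positives / RUNS)  # hovers around 0.05
```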

Ronny remarks that, for experienced experimenters running trustworthy tests, it’s rare to see lifts of more than a few percentage points. 

In fact, Ronny recalls, in the history of Bing, which ran 10’s of thousands of experiments, only 2 impacted revenue by more than 10%. 

He adds that it’s very unlikely that simple design changes, like in the examples above, can create over a 10% lift -- let alone a 20% gain!

It’s a lot more likely the lift is the outcome of a poorly designed, very underpowered experiment.

This phenomenon is known as the winner’s curse.

Statistically significant results from under-powered experiments exaggerate the lift, so the so-called "winning result" is not as rosy as initially believed.

The apparent win is a curse that becomes more worthy of a cry than a celebration.

Overcoming low powered tests

Great. So, other than not running an experiment, how do you overcome the pitfalls of underpowered, low sample tests? 🤔

To answer this question, we’ve turned to several experts. 

Here’s what they advise:

1. Do a power calculation AHEAD of running the experiment

According to Ronny, the first and most important step is to do a power calculation before running the test.

Remember, power is the percentage of time the minimum effect will be detected, assuming a conversion difference actually exists. 

A power of 0.80 (80%) is the standard. 

This amount means you’ll successfully detect a meaningful conversion difference at least 80% of the time.

As such, there's only a (0.20) 20% chance of missing this effect and ending up with a false negative -- a risk we’re willing to take.

To calculate power, you need to:

  1. Estimate the variance of the conversion metric from historical data. Variance is expressed as the historical conversion rate*(1-historical conversion rate).
    • For example, If you're looking at conversions, and historical data shows a (0.04) 4% conversion rate, then the variance is conversion rate*(1-historical conversion rate) = 0.04*(1-0.04) which is: 0.04*(0.96) = 0.0384.
  2. Decide on the absolute delta you want to detect by multiplying the relative lift with the conversion rate.
    • For example, let's assume a (0.04) 4% conversion rate and that you want to detect a (0.05) 5% relative lift. Your absolute delta is (conversion rate*relative lift) 0.04*0.05 = 0.002.
  3. Plug these numbers into the power formula. For 80% power and a p-value of 0.05, the formula is N = 16*variance/delta^2, which expands to N = 16*P*(1-P)/(P*RD)^2, where P is the baseline conversion rate and RD is the relative lift (so delta = P*RD)
    • Using these numbers as an example, 16*0.0384/0.002^2 = 153,600 users

You now know you need at least 153,600 users, per variant, for the experiment to be adequately powered.
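The three steps above can be sketched in a few lines of Python. This is the rule-of-thumb formula only (the constant 16 corresponds to roughly 80% power at alpha = 0.05), not an exact power analysis:

```python
# Rule-of-thumb sample size: N = 16 * variance / delta^2 per variant
conversion_rate = 0.04   # historical baseline (4%)
relative_lift = 0.05     # smallest relative lift worth detecting (5%)

variance = conversion_rate * (1 - conversion_rate)  # 0.0384
delta = conversion_rate * relative_lift             # 0.002 (absolute)

users_per_variant = 16 * variance / delta ** 2
print(round(users_per_variant))  # 153600 users per variant
```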

The lower the delta, the more users needed since you're trying to detect a smaller effect.

The opposite is also true. The higher the delta, the fewer users needed to detect a larger effect.

2. Determine sample size requirements AHEAD of running the experiment

If this power calculation comes across as complex, the good news is, you can come at it another way and instead first calculate your required sample size.

After all, there are many ways to skin a cat.

But pay attention here: the key is to calculate your sample size AHEAD of running the experiment. 

If you don’t, you may fall into the trap of stopping the test early if results appear significant -- even if the study is underpowered.

So. . . just how large of a sample do you need so your test isn't underpowered?

Here’s where it gets tricky. In true CRO style, it depends.

Some experimenters will tell you that, as a basic rule of thumb, you need at least 1,000 visitors per variant with at least 100 conversions per variant.

Others will say you need a whole lot more.

In this insightful Twitter post, Carl Weische, founder of the eCommerce testing agency Accelerated – who has run thousands of successful experiments for clients – claims you need 20,000-50,000 users per variant and at least 1,000 conversions per variant: 

Talk to other experimenters and they may tell you otherwise.

So, really, across most experimenters, there’s no clear consensus. . .

Unless you do the math. Then, according to Ronny, you just need to know your variance and delta where the variance is p*(1-p) for conversion rate p.

But if formulas and math calculations leave your head spinning a bit, take heart.

You can also use a sample size calculator, like this one, to do all the hard work for you:

Here, you’ll just need to input:

  • Baseline conversion rate: the current conversion rate of the control.
  • Minimum detectable effect (MDE): the smallest conversion rate lift you expect to see.
  • Power (1−β): the percentage of time the minimum effect will be detected, assuming an effect exists.
  • Significance level alpha (α): the percentage of time a difference will be detected, assuming it doesn’t exist. With a p-value of 0.05, 5% of experiments will yield a false positive showing a difference when one doesn’t really exist.

So, in this example, assuming a baseline conversion rate of 4%, based on an MDE of 5%, with a standard power of 80% and a significance level of 5%, you’ll need a sample of 151,776 visitors per variant.

*Note, Evan Miller's calculator uses a slightly different value than 16 as the leading constant, so the estimates from his calculator are also slightly smaller.

3. Select the Minimum Detectable Effect (MDE)

The problem is, by relying on this calculator to determine the sample size, you now also need to consider and input the Minimum Detectable Effect, MDE (which is referred to as delta in the power formula above).

The MDE sounds big and fancy, but the concept is actually quite simple when you break it down. It's the:

  • Minimum = smallest
  • Effect = conversion difference
  • Detectable = you want to see from running the experiment

But, now it becomes a bit of a catch-22 because sample size requirements change based on the MDE you input. 

The lower the MDE, the greater the sample needed. And vice versa. 

So, the challenge becomes setting a realistic MDE that will accurately detect small differences between the control and treatments you’re testing, based on a realistic sample. 

Clearly, the larger the sample, the more time it will take to obtain it. So you also need to consider traffic constraints and what’s realistic for your site.

As a result, the MDE also presents a bit of an it depends scenario.

Every test you run may have a different MDE, or each may have the same MDE. There's no hard rule. It's based on your testing needs.

The MDE may be calculated from historical data in which you've observed that, in general, most tests tend to achieve a certain effect, so this one should too.

Or it can be a number you choose, based on what you consider worth it to take the time and resources to run an experiment.

For example, a testing agency may, by default, set the MDE at 5% because that's the minimum threshold needed to declare a test a winner for a client.

In contrast, a mature testing organization may set the MDE at 3% because, through ongoing optimization, eking out gains any higher would be unrealistic.

What's the optimal MDE?

As a rule of thumb, Ronny suggests setting the MDE to a maximum of 5% for online A/B tests, but lowering it as the organization grows or gets more traffic.

For an established company, or mature testing organization, a 1-2% MDE is realistic. But it's hard to imagine an executive that doesn't care about big improvements to the business, so 5% is a reasonable upper bound.

This 5% upper limit exists because if you don't have the power to detect at least a 5% effect, the test results aren’t trustworthy.

However, take note, if you’re thinking your MDE should be 10% or higher, Ronny remarks it ain’t likely going to happen.

Most trustworthy tests -- that are properly powered -- just don't achieve that kind of lift.

In fact, Ronny recalls, across the thousands of experiments he was involved in, including at Bing, the sum of all relevant tests over a year was targeted at a 2% improvement. A 3% improvement was a reason for true celebration!

How to calculate your MDE

If honing in on an MDE sounds like a highly speculative exercise that's going to leave you super stressed, don't worry!

You can simply use a pre-test analysis calculator like this one to determine the MDE for you:

And, if looking at this screenshot leaves you wondering how you’re possibly supposed to calculate all these inputs, here’s an in-depth GuessTheTest article outlining exactly how to use this calculator.

Note, this calculator is similar to Evan Miller's, referenced above, but gives you the MDE as a specific variable, along with the number of weeks you'll need to run the test, making it a useful tool for calculating the MDE. The constant this calculator uses is also different from Evan Miller's, so if you compare the two directly, you may get slightly different results.

MDE pitfalls to avoid

With your MDE and sample size calculations worked out, the trick, then, is to:

  • Make sure you have enough traffic to reach the suggested sample size in an adequate time frame.
  • Not stop the experiment early before you've reached this calculated sample size target -- even if results appear significant earlier.
  • Run the test for the minimum stated timeframe, or at the very least two weeks to round out any discrepancies in user behavior. 

In case of very low or high traffic

If your pre-test sample size calculator shows you need thousands of visitors, and you only get hundreds per month, you'll want to consider whether the experiment is truly worth running since your data won't yield truly valid or significant results.

Additionally, Ronny cautions that, if you decide to proceed, a low sample test may give you statistically significant results, but the results may be a false positive and the lift highly exaggerated.

As this diagram (originally published in Ronny’s Intuition Busters paper) shows, when power is any lower than 10%, the effect, or conversion lift, detected is incorrect up to half of the time!

So be aware of this pitfall before going into low sample testing. 

And if your sample is low, but you still decide to test, Ishan Goel recommends focusing on more pronounced changes that will, hopefully, have a bigger impact and create a larger, more measurable effect.

Conversely, as your traffic increases, you’ll get the luxury of testing more nuanced changes. 

For instance, Google ran the famous 41 shades of blue experiment which tested the perfect shade of blue to get users clicking more. The experiment was only possible due to large traffic resulting in high-powered testing.

4. Assess the p-value and try to replicate it

Assuming you plan ahead, pre-calculate your required sample size and input the MDE but still get surprising results, what do you do?

According to Ronny, in his paper A/B Testing Intuition Busters, extreme or surprising results require a lower p-value to override the prior probability of an extreme result.

Typically, a p-value of <0.05 is considered adequate to confidently declare statistically significant results.

The lower the p-value, the more evidence you have to show the results didn't just occur through random chance.

As Ronny explains in this post on p-values and surprising results, the more extreme the result, the more important it is to be skeptical, and the more evidence you need before you can trust the result.

An easy way to get a lower p-value, Ronny details, is to do a replication run and combine the p-values. In other words, repeat the experiment and statistically combine the p-values obtained across runs.

To compute the combined p-values, a tool like this one can be used.

As this example shows, if the experiment was run twice, and a p-value of 0.0044 was achieved twice, the combined p-value would be 0.0001. With this p-value, you could then declare, with greater certainty, that the results are truly significant:
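One standard way to combine p-values from replication runs is Stouffer's method: convert each one-sided p-value to a z-score, sum the z-scores, and rescale. The sketch below reproduces this example's numbers under that method; the linked tool may implement a different combining rule:

```python
import math
from statistics import NormalDist

nd = NormalDist()

def stouffer_combined_p(p_values):
    """Combine one-sided p-values from independent replications (Stouffer's method)."""
    z_scores = [nd.inv_cdf(1 - p) for p in p_values]
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    return 1 - nd.cdf(z_combined)

# The experiment was run twice, yielding p = 0.0044 both times
combined = stouffer_combined_p([0.0044, 0.0044])
print(round(combined, 4))  # 0.0001, matching the example above
```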

5. Don't throw cats in tumble dryers

But even with all these safety checks in place, you never can be quite sure test results will be completely accurate.

As Ishan aptly points out, it can be valuable to second-guess all findings and question anything that looks too good to be true.

Because, as Ishan says, “in experimentation, you can find ways to disprove the validity of a result, but you can never find a way to prove the validity of the result." 

Or, as Maurice Beerthuyzen of the agency ClickValue puts it, it’s possible to put a cat in a tumble dryer, but that doesn’t mean it gives the right output:

So, the moral of the story is: don’t put cats in dryers. And don’t do low-sample, underpowered testing – at least not without following all these checkpoints!

Share your thoughts

Hope this article has been useful for you in explaining the pitfalls of low sample testing – and how to avoid them.

Share your thoughts and comments below:


Written by Deborah O'Malley & Timothy Chan

With special thanks to Ronny Kohavi for providing feedback on an earlier draft of this article. Ronny's Accelerating Innovation With A/B testing course provided much of the inspiration for this piece and is a must-take for every experimenter!


Awesome. You’ve just run an A/B test, crunched the numbers, and achieved a statistically significant result.

It’s time to celebrate. You’ve got another test win. 🙂

Or have you?

As this article explains, understanding and calculating statistical significance is actually quite complex.

To properly call a winning (or losing) test, you need to understand what a statistically significant result really means. 

Otherwise, you’re left making incorrect conclusions, random decisions, or money-losing choices.

Many experimenters don’t truly know what statistical significance is or how to derive a statistically significant test result.

So, in plain English, this guide is here to set it all straight for you so you can declare and interpret a statistically significant A/B test with accuracy and ease.

Conventions used

Before we get too far in, it’s important to lay the groundwork so you’re clear on a few important points:

#1. Simplify the complex

Because this article has been written with the express purpose of simplifying a complex topic, we’ve extracted only the most meaningful concepts to present to you.

We’re trying to avoid bogging you down with all the nitty-gritty details that often cause more confusion than clarity.

As such, this article does not offer an in-depth examination of every aspect of statistics. Instead, it covers only the top topics you need to know so you can confidently declare a statistically significant test.

Aspects like sample size, power, and its relationship to Minimum Detectable Effect (MDE) are separate topics not included in this article. While all tied to getting statistically significant results, these concepts need a whole lot more explanation outside of statistical significance. 

If you're interested, you can read more about sample size here, and its relationship to power and MDE here.

#2. Everything is tied to conversions

Scientists use statistical significance to evaluate everything from differences in bunny rabbit populations to whether a certain dietary supplement lowers obesity rates.

That’s all great. And important.

But as experimenters interested in A/B testing, this article focuses on the most common evaluation criteria in online testing – and the one that usually matters most: conversion rates. 

Although, in other scenarios there may be a more complicated Overall Evaluation Criterion (OEC), in this article, anytime we talk about a metric or result, we’re referring, specifically, to conversion rates. Nothing else.

So, you can drop bunny rabbits and diet pills out of your head. At least for now.

#3. It’s all about winning 

And while conversion rates are important, what's usually key is increasing them!

Therefore, in this article, we focus only on one-sided tests – which is a fancy stats way of saying, we’re looking to determine if the treatment has a higher conversion rate than the control.

A one-sided, also known as a one-tailed test, measures conversions only going one-way, compared to the control. So either just up OR just down.

In contrast, a two-sided test measures conversion results up AND down, compared to the control.

One-sided vs. two-sided tests are a whole topic to explore in and of themselves. 

In this article, we’ve focused just on one-sided tests because they're used for detecting only a positive (or negative) result.

But, you should know, there’s no clear consensus on whether a one-sided or two-sided test is best to use in A/B testing. For example, this in-depth article states one-tailed tests are better. But, this one argues two-tailed tests are best.

As a general suggestion, if you care only whether the test is better than the control, a one-sided test will do. But if you want to detect whether the test is better or worse than the control, you should use a two-sided test.

It’s worth mentioning that there are other types of tests better suited for other situations. For example, non-inferiority tests check that a test is not any worse than the control.

But, this article assumes that, at the end of the day, what most experimenters really want to know is: did the treatment outperform the control?

Statistical significance helps us answer this question and lets us accurately identify a true conversion difference not just caused by random chance.


Now that we’ve got all these stipulations out of the way, we’re ready to have some fun learning about statistical significance for A/B testing.

Let’s dig in. . .

Statistical significance defined

Alright, so what is statistical significance anyway?

Google the term. Go ahead, we dare you!

You’ll be bombarded with pages of definitions that may not make much sense.

Here’s one: according to Wikipedia, “in statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, a study's defined significance level, denoted by alpha, is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true.”

Say what?!?

If this definition seems confusing, you’re not alone!

Here’s what’s important to know: a statistically significant finding is the standard, accepted way to declare a winning test. It provides evidence to suggest you’ve found a winner. 

Statistical noise

This evidence is important! 

Because A/B testing itself is imperfect. With limited evidence, there’s always the chance you'll get random, unlucky results misleading you into wrong decisions.

Statistical significance helps us manage the risk.

To better understand this concept, it can be helpful to think of a coin flip analogy.

Imagine you have a regular coin with heads and tails. You make a bet with your friend to flip a coin 5 times. Each time it lands on heads, you win $20. But every time it lands on tails, you pay them $20.

Sounds like an alright deal. You can already see the dollar signs in your eyes.

But what happens when you flip the coin and 4 out of 5 tosses it lands on tails? Your friend is happy.

But you're left pretty upset. You might start to wonder if the coin is rigged or it’s just bad luck. 

In this example, the unfavorable coin flips were just due to random chance. Or what we call, in fancy stats speak, statistical noise. 

While the results of this simple coin bet may not break the bank, in A/B testing, you don’t want to be at the whim of random chance because incorrectly calling A/B tests may have much bigger financial consequences.
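You can even put a number on how likely that streak of bad luck was. Here's a quick stdlib-only sketch, assuming the coin really is fair:

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Binomial tail: probability of k or more 'tails' in n flips."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of landing 4 or more tails in 5 flips of a perfectly fair coin
print(prob_at_least(4, 5))  # 0.1875 -- nearly a 1-in-5 chance
```

So an outcome unlucky enough to feel "rigged" happens almost one time in five with a perfectly fair coin. That's exactly the kind of statistical noise significance testing is designed to guard against.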

Fortunately, you can turn to statistical significance to help you minimize the risk of poor decisions.


To fully understand, you’re going to need to wrap your head around a few fancy stats definitions and their relationship to each other. These terms include: 

  • Bayesian and frequentist testing
  • Null hypothesis 
  • Type I and Type II errors
  • Significance level alpha (α)
  • P-value

While these terms may not make much sense yet, don’t worry.

In the sections below, we’re going to clearly define them and show you how each concept ties into statistical significance so you can accurately declare a winning A/B test with absolute confidence.

Statistical methods in A/B testing

Let’s start with the basics.

In A/B testing, there are two main statistical frameworks, Bayesian and Frequentist statistics.

Don’t worry. While these words sound big and fancy, they’re nothing to be intimidated by. 

They simply describe the statistical approach taken to answer the question: does one A/B test version outperform another?

Bayesian statistics 

Bayesian statistics measures how likely, or probable, it is that one version performs better than another.

In a Bayesian A/B test, results are NOT measured by statistical significance. 

This fact is important to realize because popular testing platforms like VWO's SmartStats mode and Google Optimize currently use the Bayesian framework.

So if you’re running a test on these platforms, know that, when determining if you have a winner, you’ll be evaluating the probability of the variant outperforming – not whether the result is statistically significant.

Frequentist statistics

Statistical significance, however, is entirely based on frequentist statistics.

And, as a result, frequentist statistics takes a different approach. 

The frequentist method is more popular and involves fewer assumptions. But its results are more challenging to interpret. 

Which is why this article is here for you.

In A/B testing, frequentist statistics asks the question, “is one version better than the other?” 

To answer this question, and declare a statistically significant result, a whole bunch of checkpoints have to be met along the way, starting with the null hypothesis.

Null hypothesis

Say what? What the heck is a null hypothesis?

To better understand this term, let’s break it down into its two words, starting with hypothesis.


Hypothesis testing is the very basis of the scientific method.

Wikipedia defines hypothesis as a proposed explanation for a phenomenon. But, more simply stated, we can call it an educated guess that can be proven wrong.

In A/B testing, you make an educated guess about which variant you think will win and try to prove or disprove your belief through testing.

In hypothesis testing, the current state of scientific knowledge is assumed to be true. And, until proven otherwise, everything else is just speculation.

Null hypothesis (H0)

Which means, in A/B testing, you need to start with the assumption that the treatment is not better than the control. The real conversion difference is null or nothing at all; it's ≤ zero.

And you've gotta stick with this belief until there's enough evidence to reject it.

Without enough evidence, you fail to reject the null hypothesis, and, therefore, deem the treatment is not better than the control. (Note, you can only reject or fail to reject the null hypothesis; you can't accept it).

There’s not a large enough conversion rate difference to call a clear winner.

However, as experimenters, our goal is to decide whether we have sufficient evidence – known as a statistically significant result – to reject the null hypothesis.

With strong evidence, we can conclude, with high probability, that the treatment is indeed better than the control.

Our amazing test treatment has actually outperformed! 

Alternative hypothesis (Ha or H1)

In this case, we can reject the null hypothesis and accept an alternative viewpoint.

This alternate view is aptly called the alternative hypothesis.

When this exciting outcome occurs, it’s deemed noteworthy and surprising.

How surprising?

Errors in testing

To determine that answer, we need to calculate the probability of whether we made an error in rejecting, or failing to reject the null hypothesis, and calling a winner when there really isn’t one.

Because as mere mortals struggling to understand all this complex stats stuff, it’s certainly possible to make an incorrect call.

In fact, it happens so often, there are names for these kinds of mistakes. They’re called type I and type II errors.

Type I error, false positive

A type I error occurs when the null hypothesis is rejected – even though it was correct and shouldn’t have been.

In A/B testing, it occurs when the treatment is incorrectly declared a winner (or loser) but it’s not – there's actually no real conversion difference between versions; you were misled by statistical noise, an outcome of random chance.

This error can be thought of as a false positive since you’re claiming a winner that’s not there.

While there’s always a chance of getting misleading results, sound statistics can help us manage this risk.

Thank goodness! 

Because calling a test a winner – when it’s really not – is dangerous. 

It can send you chasing false realities, push you in the wrong direction and leave you doubling down on something you thought worked but really didn’t. 

In the end, a type I error can drastically drag down revenue or conversions. So, it’s definitely something you want to avoid.

Type II error, false negative

A type II error is the opposite.

It occurs when you incorrectly fail to reject the null hypothesis, declaring that there’s no conversion difference between versions – even though there actually is.

This kind of error is also known as a false negative. Again, a mistake you want to avoid. 

Because, obviously, you want to be correctly calling a winner when it’s right there in front of you. Otherwise, you’re leaving money on the table.

If it’s hard to keep these errors all straight in your head, fear not.

Here’s a diagram summarizing this concept:

Source: Reddit

Statistical noise & errors in testing 

Making errors sucks. As an experimenter, your goal is to try to minimize them as much as possible.

Unfortunately, given a fixed number of users, reducing type I errors will end up increasing the chances of a type II error. And vice versa. So it’s a real tradeoff.

But here, again, is a silver lining in the clouds. Because in statistics, you get to set your risk appetite for type I and II errors.


Through two stats safeguards known as alpha (α) and beta (β).

To keep your head from spinning, we’re only gonna focus on alpha (α) in this article. Because it’s most critical to accurately declaring a statistically significant result.

While beta (β) is key to setting the power of an experiment, we’ll assume the experiment is properly powered with beta ≤ 0.2, or power of ≥ 80%. Because if we delve into beta's relationship with significance, you may end up with a headache. So, we'll skip it for now. 
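For the curious, alpha and power together determine how big a sample you need before the test even starts. Here's a rough stdlib-only sketch of the standard two-proportion sample-size approximation (the baseline and target conversion rates below are hypothetical):

```python
from math import sqrt, erf, ceil

def norm_ppf(q):
    """Inverse standard-normal CDF, found by bisection (stdlib-only)."""
    cdf = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_per_variant(p1, p2, alpha=0.025, power=0.80):
    """Approximate visitors needed per variant for a one-sided test."""
    z_alpha = norm_ppf(1 - alpha)  # ~1.96 for alpha = 0.025
    z_beta = norm_ppf(power)       # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical: detect a lift from a 2.0% to a 2.5% conversion rate
print(sample_size_per_variant(0.02, 0.025))  # roughly 13,800 visitors per variant
```

Notice how a tiny expected lift demands a huge sample. Double the lift you're looking for, and the required sample shrinks dramatically.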

Significance level alpha (α)

With alpha (α) top of mind, you may be wondering what function it serves.

Well, significance level alpha (α) helps us set our risk tolerance of a type I error.

Remember, a type I error, also known as a false positive, occurs when we incorrectly reject the null hypothesis, claiming the test treatment converts better than the control. But, in actuality, it doesn’t.

Since type I errors are bad news, we want to mitigate the risk of this mistake.

We do so by setting a cut-off point at which we’re willing to accept the possibility of a type I error. 

This cut-off point is known as significance level alpha (α). But most people often just call it significance or refer to it as alpha (denoted α).

Experimenters can choose to set α wherever they want. 

The closer it is to 0, the lower the probability of a type I error. But, as mentioned, a low α is a trade-off because, in turn, it raises the probability of a type II error. 

So, it’s a best practice to set α at a happy middle ground. 

A commonly accepted level used in online testing is 0.05 (α = 0.05) for a two-tailed test, and 0.025 (α = 0.025) for a one-tailed test.

For a two-tailed test, this level means we accept a 5% (0.05) chance of a type I error, or of incorrectly rejecting the null hypothesis by calling a winner when there isn’t one.  

In turn, if the null hypothesis is correct – the test treatment is indeed no better than the control – there’s a 95% probability we’ll correctly fail to reject it. Assuming, of course, the experiment is not under-powered and we have a large enough sample size to adequately detect a meaningful effect in the first place.

That’s it. That’s all α tells us: the probability of making a type I error, assuming the null hypothesis is correct.

That said, it’s important to clear a couple misconceptions:

At α = 0.05, the chance of making a type I error is not 5%; it’s only 5% if the null hypothesis is correct. Take note because a lot of experimenters miss the last part – and incorrectly believe it means there's a 5% chance of making an error, or wrong decision.

Also, a 5% significance level does not mean there’s a 5% chance of finding a winner. That’s another misconception.

Some experimenters extrapolate even further and think this result means there's a 95% chance of making a correct decision. Again, this interpretation is not correct. Without introducing subjective bias, it’s very difficult to know the probability of making a right or wrong decision. Which is why we don’t go down this route in stats.

So, if α is the probability of making a type I error, assuming the null hypothesis is correct, how do we possibly know if the null hypothesis is accurate?

To determine this probability, we rely on alpha’s close sibling, p-value.

P-value explained

According to Wikipedia, p-value is the probability of obtaining a result at least as extreme as the one observed in your data, assuming the null hypothesis is true.

But, if that definition doesn’t help much, don’t worry.

Here’s another way to think of it: 

P-value tells us how likely it is that the observed outcome, or a more extreme one, would occur if the null hypothesis is true.

Remember, the null hypothesis assumes the test group’s conversion rate is no better than the control group’s conversion rate.

So we wanna know the likelihood of results if the test variant is truly no better than the control.


Because, just like a coin toss, all outcomes are possible. But some are less probable.

P-value is how we measure this probability. 

In fact, some fun cocktail trivia knowledge for you, the “p” in p-value stands for probability! 

And as you can probably imagine, the value part of p-value is because we express the probability using a specific value. This value is directly tied into its sibling, alpha (α). Remember, we can set α at anything we want. But, for two-tailed tests, the cut-off is generally α ≤ 0.05.

Well, guess what?

When the p-value is less than or equal to α (p ≤ α), it means the chance of getting the result is really low – assuming the null hypothesis is true. And, if it’s really low, well then, the null hypothesis must be incorrect, with high probability.

So we can reject the null hypothesis!

In rejecting the null hypothesis, we accept the less likely alternative: that the test variant is truly better than the control; our results are not just due to random chance or error. 

Hallelujah! We’ve found a winner. 🥳

An unusual and surprising outcome, the result is considered significant and noteworthy.

Therefore, a p-value of ≤0.05 means the result is statistically significant.

However, while a p-value of ≤0.05 is, typically, accepted as significant, many data purists will balk at this threshold.

They’ll tell you a significant test should have a p-value ≤0.01. And some data scientists even argue it should be lower.

However low you go, it seems everyone can agree: the closer the p-value to 0, the stronger the evidence the conversion lift is real – not just the outcome of random chance or error. 

Which, in turn, means you have stronger evidence you’ve actually found a winner. 

Or written visually: ↓ p-value, ↑ significant the finding.

And that’s really it! 

A very simplified, basic, incredibly stripped down way to tell you the essentials of what you absolutely need to know to declare a statistically significant A/B test. 

Once it’s made clear, it’s really not that hard. Is it?

Of course, there’s SO MUCH we’ve left out. And much, much more to explain to properly do the topic justice. 

So consider this article your primer. There’s plenty more to learn. . .

Correctly calling a winning test

But, now that you more clearly understand the basics, you probably get how the null hypothesis ties into a type I error, based on significance level α, expressed as a p-value. 

So it should be clear:

When calling a result “statistically significant,” what you’re really saying in stats-speak is: the test showed a statistically significant better conversion rate against the control. You know this outcome occurred because the p-value is less than the significance level (α), which means the test group’s higher conversion rate was unlikely under the null hypothesis.

In other words, the test variant does, in fact, seem to be better than the control.

You’ve found a winner! 🙂

Calculating and verifying if your test is a winner

Although most standard A/B testing platforms may declare a statistically significant winning test for you, not all get it right. 

So it’s strongly advocated that you verify test results yourself.

To do so, you should use a test validation calculator, like this one from AB Testguide.

Simply plug in your visitor and conversion numbers, declare if your hypothesis was one- or two-sided, and input your desired level of confidence.
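If you'd rather check the math yourself, the same kind of verification can be sketched in a few lines of Python with a pooled two-proportion z-test, the standard frequentist check for conversion-count data. All the visitor and conversion numbers below are hypothetical:

```python
from math import sqrt, erf

def norm_sf(z):
    """Survival function of the standard normal: P(Z > z)."""
    return 0.5 * (1 - erf(z / sqrt(2)))

def ab_test_p_value(visitors_a, conv_a, visitors_b, conv_b, one_sided=True):
    """Pooled two-proportion z-test on raw A/B conversion counts."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    return norm_sf(z) if one_sided else 2 * norm_sf(abs(z))

# Hypothetical test: 10,000 visitors per variant, 200 vs. 250 conversions
p = ab_test_p_value(10_000, 200, 10_000, 250, one_sided=True)
print(f"p = {p:.4f}, significant at alpha = 0.025: {p <= 0.025}")
```

Plugging the same numbers into a validation calculator should land on the same p-value, give or take rounding.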

If the result is statistically significant, and you can confidently reject the null hypothesis, you’ll get a notice with a nicely visualized chart. It will look something like this:

If the result is not significant, you’ll see a similar notification informing you a winning test was not observed. It will look like this:

However, an important word of caution: just because your test result is statistically significant doesn’t always mean it’s truly trustworthy. 

There are lots of ways to derive a statistically significant result by “cheating.” But that’s a whole topic for another conversation. . .

Statistical significance summarized

In essence, evaluating statistical significance in A/B testing asks the question: is the treatment actually better than the control?

To determine the answer, we assume the null hypothesis: that the treatment is no better than the control. So there is no winner – until proven otherwise.

We then use significance level alpha (α) to safeguard against the probability of committing a Type I error and to set a reasonable bar for the proof we need.

P-value works with α as a truth barometer. It answers the question: if the null hypothesis is true, how likely is it the outcome will occur?

The answer can be stated with a specific value.

A p-value lower than or equal to α, usually set at 0.05, means the outcome is unlikely due to random chance. The result is surprising and, therefore, significant.

The lower the p-value, the more significant the finding. 

A statistically significant result shows us: the result is unlikely due to just random chance or error under the null hypothesis. There’s strong evidence to reject the null hypothesis and declare a winner.

So we can take our winning result and implement it with confidence, knowing that it will likely lift conversions on our website.

Without a statistically significant result, we’re left in the dark. 

We have no way to know whether we’ve actually, accurately called a winner, made an error, or achieved our results just due to random chance.

That’s why understanding what statistical significance is and how to apply it is so important. We don’t want to implement supposed winners without strong evidence and rigour.

Especially when money, and possibly our jobs, are on the line.

Currently, statistical significance is the standard, accepted way to declare a winner in frequentist A/B testing. Whether it should be or not is a whole other topic. One that’s up for serious debate.

But for now, statistical significance is what we use. So apply it. Properly.

Final thoughts

Congrats! You’ve just made it through this article, and have hopefully learned some things along the way.

Now, you deserve a nice break. 

Grab yourself a bubble tea, kick your feet up, and breathe out a sigh of relief.

The next time you go to calculate a winning test, you’ll know exactly what to look for and how to declare a statistically significant result. If you’ve actually got one. 😉

Hope this article has been helpful for you. Please share widely and post your questions or comments in the section below.

Glossary: simplified key terms to know

  • Statistical significance: a type of result that validates the conversion difference between variants is real – not just due to error or random chance.
  • Frequentist testing: a statistical method of calculation used in traditional hypothesis-based A/B testing in which you aim to prove or disprove the null hypothesis. Attempts to answer the question: Is one variant better than another?
  • Null hypothesis: in a one-sided hypothesis, the assumption the test group is no better than the control. Your aim is to reject the null hypothesis.
  • Type I error: a fault that occurs when the null hypothesis is incorrectly rejected and you claim the treatment is better when, really, it isn’t. What appears a winner is just statistical noise or random chance.
  • Type II error: a fault that occurs when the treatment is actually better than the control, but you fail to reject the null hypothesis.
  • Significance level alpha (α): the probability of a type I error when the null hypothesis is true. The standard significance level (α) = 0.05 for a two-tailed test. For a one-tailed test, 0.025 is generally recommended. 
  • P-value: A measure of how likely a result could have occurred under the null hypothesis. It is used to safeguard against making a decision based on random chance. It’s a probability expressed as a specific value. A p-value lower than α (usually set at  ≤0.05) means the results are extreme enough to call the null hypothesis into question and reject it. The result is surprising and, therefore, significant. The lower the p-value, the more significant the finding. 

Happy testing!

About the authors

Deborah O’Malley  is one of the few people who’s earned a Master’s of Science (M.Sc.) degree with a specialization in eye tracking technology. She thought she was smart when she managed to avoid taking a single math class during her undergrad. But confronted reality when required to take a master’s-level stats course. Through a lot of hard work, she managed to earn an “A” in the class and has forever since been trying to wrap her head around stats, especially in A/B testing. She figures if she can learn and understand it, anyone can! And has written this guide, in plain English, to help you quickly get concepts that took her years to grasp.

Timothy Chan is among the select few who hold both a Ph.D. and an MBA. Well-educated and well-spoken, he’s a former Data Scientist at Facebook. Today, he’s the Lead Data Scientist at Statsig, a data-driven experimentation platform. With this experience, there truly isn’t anyone better to explain statistical significance in A/B testing.

Ronny Kohavi is one of the world’s foremost experts on statistics in A/B testing. In fact, he’s, literally, written the book on it! With a Ph.D. from Stanford University, Ronny's educational background is just as impressive as his career experience. A former vice president and technical fellow at Microsoft and Airbnb, and director of data mining and personalization at Amazon, Ronny now provides data science consulting to some of the world’s biggest brands. When not working directly with clients, he’s teaching others how to accelerate innovation in A/B testing. A course absolutely every experimenter needs to take!

By: Deborah O'Malley, M.Sc. | Last updated May, 2022

Nav bars defined

A Navigational Menu, also known as a navigational bar, or nav bar, is a menu system, usually located at the top of a website.

Its purpose is to help users find categories of interest and access key pages on a website.

The nav bar is usually organized into relevant sections and may have a dropdown menu or sub-sections directly below.

Most have clear section and sub-section menu titles.

These titles should state what the users will get, or where the user will arrive, by clicking into the menu category.

A typical nav menu may look something like this:

The nav menu is, usually, present across all pages of your site. That means optimizing it can pay big returns, impacting every stage of your conversion funnel.

Testing the organization, presentation, and wording in the nav menu presents a great optimization opportunity with potentially incredible conversion returns.

In fact, there aren't many other site-wide, high-ticket, low-effort tests like nav menu formatting.

However, in order to optimize a nav bar effectively, there are several important "do's" and "don'ts" you must follow.

Here are the top-10 key do's, don'ts, and test ideas to try:

1. Think about SEO & keywords

Before redesigning or reorganizing any nav bar, always think about SEO and the keywords your visitors are using to find your site and navigate through.

Don't remove or rename any nav menu titles that will lower your SEO value, SERP, or keyword rankings.

This advice is especially important if a large portion of your site traffic comes from paid ads. You want to be using and showcasing the keywords you're paying for or that are bringing visitors to your site.

Once on your site, users will expect to see these keywords and will look for them to orient themselves on your website.

So cater to visitors' needs and expectations by showcasing these keywords in your nav menu.

2. Assess your existing site structure

Don't add or remove nav bar links without thinking about the site structure and internal linking. Otherwise, you risk adding or removing access to pages that link throughout your site.

Ideally, review an XML sitemap before making any changes involving internal links. An XML sitemap will look something like this:

Here's a good article on how to create an XML sitemap.

3. Use heatmapping data

Heatmapping data is a powerful way to see how visitors are interacting with your site and the nav menu.

As you can see in this example heatmap, the bigger and darker the hotspot, the more likely it is visitors are interacting with that section of the site:

Use heatmapping data to see what nav categories your visitors are clicking on most or least.

But don't stop there. Explore trends across both desktop and mobile.

Take note if users are heavily clicking on the home icon, menu, or search box as all of this behavior may indicate users are not able to easily find the content they're looking for. Of course, eCommerce clients with a lot of products to search are the exception.

Also take note if a user is clicking on the nav page link, say pricing, when they're already within the pricing section. This trend provides clues the nav or page structure may be confusing.

And visitors might not realize where they are on site.

If you detect this type of behavior, test the effect of making specific changes to optimize the interaction for users.

One great way to do so is through breadcrumb links.

4. Test including breadcrumb navigation

Breadcrumb links, or breadcrumbs, are links placed directly under the main nav bar. They look something like this:

Breadcrumbs can be helpful to orient users and better enable them to navigate throughout the site. Because when users know where they are, they can navigate to where they want to be. Especially on large sites with a lot of pages.

If you don't already include breadcrumbs on your site, consider testing the addition of breadcrumb navigation, especially if you have a large site with many pages and sections.

5. Determine if ALL CAPS or Title Case is optimal

At their essence, words are just squiggly lines on a page or website, but they have a visual presence that can be subconsciously perceived and understood.

That's why whether you put your nav menu titles in ALL CAPS <-- (like this) or Title Case <-- (like this) may make a surprising conversion difference.



In fact, according to renowned usability expert, Jakob Nielsen, deciphering ALL CAPS on a screen reduces reading speed by 35% compared to the same font in Title Case on paper.

The reason ALL CAPS is so difficult, tiring, and time consuming to read is because of the letter shape.

In ALL CAPS format, the height of every letter is the same, making every word form a uniform rectangular shape.

Because the shapes of all the letters are the same, readers are forced to decipher every letter, reducing readability and processing speed.

Need proof? Take a look at this example:

With Title Case, the varying heights and shapes of the letters help us decipher the text, increasing readability and reading speed.


That said, ALL CAPS does have its place. It's great for drawing attention and making you take notice. That's why it works well for headings and highlighting key points.

But to be most effective, it needs to be used sparingly -- and only when readers don't need to quickly decipher a big chunk of text.

With these points in mind, test whether it's best to use ALL CAPS or Title Case with your audience on your site.


If you need some inspiration, check out this real-life caps case study. Can you guess which test won?

6. Optimize menu categorization

Sometimes the navigational format we think will be simplest for our users actually ends up causing confusion.

Because nav menus are such a critical aspect of website usability, testing and optimizing their formatting is critical to improving conversions.

Which brings up the question, should you use a top "callout," that looks something like this, with the bolded text to categorize, highlight and make certain categories pop at the top?

Doing so may save visitors from having to hunt down and search each item in the menu.

But, it also may create confusion if you're repeating the same product categories below. Users may be uncertain whether the callout section leads to a different page on the site than the rest of the menu does.

An optimal site, and nav system, is one that leaves no questions in the user’s mind.

So test the effect of adding or removing a top callout section in your nav menu. See this real-life case study for inspiration.

Can you guess right?

7. Test using text or "hamburger" menu

Back in the olden days, all websites had text menus that stretched across the screen.

But that's because mobile wasn't much of a thing. As mobile usage exploded, a nifty little thing called a "hamburger" menu developed.

A funny name, but so-called because those stacked little bars look kinda like a hamburger:

Hamburger menus on mobile make a lot of sense. The menu system starts small and expands bigger. So it's a good use of space on a small-screened device.

Hamburger menus have become the standard format on mobile.

But does that mean it should also be used on desktop? Instead of a traditional text-based menu?

It's a question worth testing!

In fact, the big-name computer brand, Intel, tested this very thing and came away with interesting findings. Can you guess which test won?

And while you're at it, if you are going to use a hamburger menu -- whether on your mobile or desktop site -- test placement.

Many a web developer puts the hamburger menu on the right side. But it may not be best placed there. For two reasons:

i. The F-shaped pattern and golden triangle

In English, we read from left to right and our eyes naturally follow this reading pattern.

In fact, eye tracking studies show we, typically, start at the top of the screen, scan across, and dash our attention down the page. This reading pattern forms what's known as an "F-shaped pattern."

And because of this viewing behavior, the top left of the screen is the most coveted location. Known as the "golden triangle," it's the place where users focus most.

Here's a screenshot so you can visualize both the F-shaped pattern and golden triangle:


Given our reading patterns, it makes sense to facilitate user behavior by placing the hamburger menu in the top left corner.

ii. Unfolds from left outwards

As well, as this article by Adobe design succinctly says, because we read (in English) from left to right, it naturally follows that the nav menu should slide open from left to right.

Otherwise, the experience is unexpected and unintuitive; that combination usually doesn't convert well.

But do test for yourself to know for sure!

8. Test if a persistent menu is optimal

A persistent, or "sticky" nav menu is one that stays with users as they scroll up or down.

How well does a persistent or sticky nav bar work?

It can work wonders, especially if you include a sticky CTA within the nav bar, like this:

But a sticky nav bar doesn't always have to appear upon page load.

It can also appear upon scroll or upon a set scroll depth. What works best?

See this case study for timing ideas and challenge yourself. Can you guess the right test?

9. Determine which wording wins

The best wording for your nav menu is titles and categories that resonate with your users.

To know what's going to win, you need to deeply understand your audience and the terminology they use.

This advice will differ for each client and site. But, as a starting point, delve into your analytics data.

If you have Google Analytics set up, you can find this information by going to the Behavior > Site Search > Search Terms tab. (Note: you'll need to have previously set up this search function to get data.)

For example, here are the search terms used for a client covering divorce in Canada:

Notice how the keyword "divorce" barely makes it into the top-5 keywords users are searching? In times of distress, these users are looking for other resources.

Showing keywords that resonate in the top nav can facilitate the user journey and will make it more likely visitors click into your site and convert.

Pro tip: combining analytics data with heatmapping data can provide a great indication of whether users are clicking on the nav menu titles, giving you a good clue as to whether the wording resonates.

Need more inspiration for high-converting copy?

See this case study examining whether the nav menu titles "Buy & Sell" or "Classifieds" won for a paid newspaper ad site. Which one won? The results may surprise you:

10. Test button copy with your sticky nav

Testing whether a sticky nav bar works best is a great, easy test idea. Looking at what nav menu copy resonates is also a simple, effective way to boost conversions.

So why not combine both and test the winning wording within a sticky nav bar with a Call To Action (CTA) button!?

Here, again, you'll need to really understand your audience and assess their needs to determine and test what wording will win.

For example, does "Get a Quote" or "Get Pricing" convert better? The answer is, it depends who's clicking from where. . . See this case study for testing inspiration:

But just because an overall trend appears, doesn't mean it will hold true for all audiences.

If you have distinct audiences across different provinces, states, or countries, optimal copy may differ because the terminology, language, and needs may also change.

When catering to a local audience, try to truly understand your audience so you can best home in on their needs.

Test the optimal copy that will most resonate with the specific cohort.

Segment results by geolocation to determine which tailored approach converts with each audience segment.


It may seem like a small change, but optimizing a navigational menu can have a big impact on conversions.

To optimize, start by using analytics data to inform your assumptions. Then test the highest-converting copy, format, organization, and design. And segment results by audience type.

Doing so will undoubtedly lift conversions.

Hope these test ideas are helpful for you!

Please share this post widely and comment below.

By: Tim Waldenback, co-founder Zutobi | Last updated April, 2022


Apps are everywhere. Many businesses have them, consumers expect them, and more and more businesses are creating them.

Apps are a great business opportunity, but also come with stiff competition. Making an app isn’t enough -- you also have to market it effectively and ensure it offers a top-notch user experience.

On top of these challenges, even small changes in an app’s user experience can have a detrimental impact on conversion and engagement rates.

So it’s best to test your features with A/B testing.

In this comprehensive article, you'll learn everything you need to know about A/B testing apps to get appealing results.

A/B Test Mobile Apps

When most people think of A/B testing, they imagine testing websites, webpages, or landing pages.

A/B testing for mobile apps is really no different. It involves testing individual variables and seeing how the audience responds to these variables.

The audience may comprise a single cohort or multiple, segmented audiences, but the goal is the same: to identify which option provides the best user experience.

For example, say you want to test your app for your driver’s permit practice test among teens in New Jersey.

Let's imagine your goal is to drive more app downloads, so you start A/B testing to see which variables entice users to download.

You may start with the icon displayed in the store to determine if one version gets more attention and leads to more downloads. Everything else stays the same.

Once you have results for this test, you may move onto testing keywords, the product title, description, screenshots, and more.

Benefits of A/B Testing Apps

A/B testing is a technique used by many app creators because it provides valuable, verifiable results.

With testing, and the subsequent result, you’re no longer relying on assumptions. Instead, you have concrete data to inform your decisions.

There are several other benefits of A/B testing apps, including:

  • Optimizing in-app engagement
  • Observing the impact of features
  • Learning what works for different audiences and segments
  • Gaining insights into user preferences and behavior

The benefit of each example goes back to data.

You're no longer basing decisions on assumptions, personal preferences, or bias. Rather, you know exactly what works and have the numbers to prove it.

Types of A/B Testing for Mobile Apps

Most mobile apps are tested using two different types of A/B testing.

1. In-App A/B Testing

This method is primarily used by developers and tests the UX and UI impact, including retention rate, engagement, session time, and lifetime value. You may want to add other metrics to test for specific functions.

2. A/B Testing for Marketing

For marketing, A/B testing can optimize conversion rates, retarget users, and drive downloads. You can test which creative ad is more effective, down to the call-to-action, font, images, and every other granular detail.

For example, this GuessTheTest case study tested the optimal app thumbnail image while this study looked at the best CTA button format.

How to Conduct A/B Testing

One of the best aspects of A/B testing is that it’s repeatable and scalable. You can use it to continuously optimize your app and its marketing campaigns. Here’s how:

Start with a Hypothesis

Your testing should always have a hypothesis that you’re trying to prove or disprove. Clearly stating a hypothesis is how you know which variables to test.

For example, you may want to test whether having screenshots of your practice permit test inspires more people to download your app. Testing the number of screenshots in the app page of the store gives you a starting point, orienting you to know where to begin.

Create a Checklist

You should also create a checklist to ensure you cover all the information you need, including:

  • What are you testing?
  • What audience(s) are you testing with?
  • What will you do if your hypothesis is proven or disproven?

If you can’t come up with a defined testing variable, begin with the problem for which you’re seeking a solution. Then investigate what testing approach can help you best solve the problem.

Segment the Audience

Once you know what to test, you need to segment and define the audiences on which you’re testing.

Ideally, in A/B testing, you should isolate one variable to test at a time against one audience cohort.

The reason is that testing across different or diverse audiences adds another variable to the mix, which makes it more difficult to accurately define what worked and what didn't.

It's also valuable to segment your audience by factors like traffic source or new versus returning visitors.

However, when segmenting your audience, it's important the sample size is large enough to glean important insights. If the sample size is too small, your A/B test data won't be valid, and you’ll miss out on part of the big picture.

Analyze the Results

Now is the time to analyze your results and determine which variable offers better results. Consider all the available data to get a comprehensive picture. For example, if you notice that your change increased session time, rather than conversions as you hoped, that’s still a valuable learning.

Implement the Changes

You’ve determined your best results from each variable, so now you can implement the winning variant across your entire audience. If you didn’t get clear results from A/B testing, you should still have data you can use to inform your next test.

Adjust Your Hypothesis and Repeat the Test

A/B testing is repeatable, so you can keep refining and testing until you get optimal results. It’s important to continue testing regularly, no matter what, and use the information to improve your app and user experience.

A/B Testing Best Practices

Always Understand Why You’re Testing

You always need to understand why you’re testing a variable with a clear hypothesis and how you will move forward once you have an outcome. This statement may seem obvious, but knowing why you’re testing ensures that you’re not wasting time and money on a test that won’t serve your larger goals.

Don’t Stop Short

A/B tests have a lot of value. Even if things don’t go the way you hoped or you get results earlier than expected, it’s important to stick with your tests long enough to be confident in your decisions.

Stay Open

Don’t get too invested in the result you hope to get. User behavior is anything but simple, and sometimes your testing will show you something unexpected. You need to stay open-minded and implement the necessary changes, then test again. Testing is an iterative process towards continual optimization.

Test Seasonally

Ideally, with A/B testing, you're changing only one variable at a time, but you can’t control everything. The season in which you test matters, and you can’t control that. So, be sure to test the same variables in different seasons to see what results you get.

Summary: Leverage A/B Testing for Success

A/B testing is essential to ensuring your app delivers the best possible experience for users and validates your assumptions and ideas. Include A/B testing in your app development and updates to keep users downloading and engaging with your app.

About the Author

By: Deborah O'Malley & Shawn David | Last updated April, 2022


In A/B testing, planning and prioritizing which tests to run first is a process mired in mystery.

It seems there's no great system for organizing ideas and turning them into executable action items.

Keeping track of which tests you've run, plan to run, or are currently running is an even bigger challenge.

That is until now.

In this short 12-minute interview, you'll hear from Shawn David, Operations Engineer at Speero, CXL's sister site and a leading A/B testing firm.

He shares the secrets on how Speero tracks, manages, plans, and prioritizes their A/B tests.

Check out the video to:

  • Get an inside view into Speero's planning and prioritization methodology, based on the CXL framework known as PXL.
  • Watch a demo of Speero's custom-made test planning and prioritization tool in action, and apply the insights to optimize your own planning and prioritization process.
  • See how this tool, built through Airtable, can be customized and used to develop or enhance your own test planning and prioritization model.

Hope you’ve found this content useful and informative. Please share your thoughts and comments in the section below. 

By: Deborah O'Malley, M.Sc. | Last updated November, 2022


If you've been in the A/B testing field for a while, you've probably heard the term Sample Ratio Mismatch (SRM) thrown around.

Maybe you've talked with people who tell you that, if you're not looking at it, you're not testing properly. And you need to correct for it.

But, in order to do so, you first need to know what SRM is and how to spot it.

This article outlines the ins and outs of SRM, describing what it is, why you need to look at it when testing, and how to correct for it if an SRM issue occurs.

Let's go!

Understanding Sample Ratio Mismatch (SRM)

The term Sample Ratio Mismatch (SRM) sounds really intimidating.

It's a big jumble of words. With "sample," and "ratio," and what exactly is "mismatch" anyway?

Well, let's break it all down, starting with the word sample.

In A/B testing, the notion of a sample relates to 2 separate but related concepts that impact test traffic:

1) Traffic allocation

2) The test sample

What's the difference?

1) Traffic allocation

Traffic allocation is the way traffic is split.

Typically, in an A/B test, traffic is split equally, or 50/50, so half of users see the control version and the other half the variant.

Visually, equally split traffic looks something like this:

If a test has more than one variant, for example in an A/B/C test, traffic can still be equally split if all versions receive approximately the same amount of traffic.

In a test with 3 variants, equally allocated traffic would be split 33/33/33.

Visually it would look something like this:

While traffic can be divided in other ways, say 70/30 or 40/30/10, for example, this unequal allocation is not considered best practice. As explained in this article, traffic should, ideally, always be equally allocated.

However, whether traffic has been equally or unequally allocated, SRM can still occur and should always be calculated.
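Under the hood, most platforms implement this allocation by deterministically hashing each user into a bucket, so each user always lands in the same variant and the designed split holds over a large sample. Here's a minimal sketch of that common technique (the function, experiment, and variant names are my own, not from any particular platform):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, weights: dict[str, float]) -> str:
    """Deterministically bucket a user into a variant.

    Hashing the user ID together with the experiment name gives each
    user a stable, uniformly distributed position in [0, 1), so a
    50/50 (or 33/33/33) allocation holds over a large sample.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform in [0, 1)
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variant  # guard against float rounding at the upper edge

# A 50/50 split: over many users, each arm gets roughly half the traffic.
counts = {"control": 0, "variant": 0}
for i in range(10_000):
    arm = assign_variant(f"user-{i}", "nav-menu-test", {"control": 0.5, "variant": 0.5})
    counts[arm] += 1
```

Because assignment is a pure function of the user and experiment IDs, a returning visitor is always routed to the same variant, which is exactly what SRM checks assume.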

2) Test sample

In addition to traffic allocation, there's the sample of traffic, also known as the sample population.

In an A/B test, the sample of traffic comprises the sample size, or the number of visitors in a test.

Regardless of how traffic is allocated, if the sample of traffic is routed so one variant receives many more visitors than the other, the ratio of traffic is not equal. And you have a Sample Ratio Mismatch (SRM) issue.

Visually, the test sample should be routed like this:

If it's not, a SRM issue occurs. One version has far more traffic routed to it than the other and the ratio of traffic is off. Visually, a SRM issue looks like this:

Sample Ratio Mismatch (SRM) defined

According to Georgi Georgiev, of Analytics-Toolkit, in his article, Does Your A/B Test Pass the Sample Ratio Mismatch Check?

"Sample ratio mismatch (SRM) means that the observed traffic split does not match the expected traffic split. The observed ratio will very rarely match the expected ratio exactly."

And Ronny Kohavi, in his book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, adds:

If the ratio of users (or any randomization unit) between the variants is not close to the designed ratio, the experiment suffers from a Sample Ratio Mismatch (SRM).

Note that the SRM check is only valid for the randomization unit (e.g., users). The term "traffic" may mislead, as page views or sessions do not need to match the design.

In other words, very simply stated: SRM occurs when one test variant receives noticeably more users than expected.

In this case, you've got an SRM issue. And that's a big problem because it means results aren't fully trustworthy.

Why SRM is such a problem

When SRM occurs, test results can’t be fully trusted because the uneven distribution of traffic skews conversion numbers.

Take this real-life case study example:

An A/B test was run with traffic equally split, 50/50. The test ran for 4 weeks achieving a large sample of 579,286 total visitors.

At the end of the experiment, it would have been expected that, based on this 50/50 split, each variation should have received roughly 289,643 visitors (579,286/2=289,643):

But, SRM ended up occurring and the variation received noticeably more users than the control:

The ratio of the traffic was clearly mismatched – even though it was supposed to be split evenly.

This finding was itself problematic.

But the real issue was that, because the sample was mismatched, the conversion rates were completely skewed. 

The goal was to increase conversions. The control achieved 4,735 conversions, but the variant slightly outperformed it with 5,323 conversions.

Conversion rate without and with SRM

Without SRM, the variant appears to be the winner:

But, with SRM, the ratio of traffic became unevenly mismatched and altered the conversion rate so the control appeared to outperform the variant, making the test seem like a loser:

Which was the real winner or loser? With SRM, it became unknown.

SRM leaves your real results in the dark so you can't be certain if you truly have a test win or loss. The data is inaccurate.
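The skew is easy to see with quick arithmetic. Here's a sketch using the case study's conversion counts; since the article doesn't report the exact observed visitor split, the mismatched numbers below are hypothetical:

```python
# Conversion counts from the case study; the mismatched visitor split
# below is hypothetical, chosen only to illustrate the skew.
control_conv, variant_conv = 4_735, 5_323
total_visitors = 579_286

# Without SRM: traffic split evenly, as designed.
even = total_visitors / 2                  # 289,643 visitors per arm
cvr_control_even = control_conv / even     # ~1.63%
cvr_variant_even = variant_conv / even     # ~1.84%  -> variant wins

# With SRM: the variant is routed far more traffic (hypothetical numbers).
control_visitors, variant_visitors = 259_286, 320_000
cvr_control_srm = control_conv / control_visitors  # ~1.83%
cvr_variant_srm = variant_conv / variant_visitors  # ~1.66%  -> control "wins"
```

Same conversion counts, opposite conclusion: that's why a mismatched sample ratio makes the result untrustworthy.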

So SRM is definitely something you want to avoid in your own testing practice. 

SRM: a gold standard of reliable test results

However, by checking for SRM, and verifying your test doesn't have an SRM issue, you can be more confident results are trustworthy.

In fact, according to this insightful SRM research paper, "one of the most useful indicators of a variety of data quality issues is (calculating) Sample Ratio Mismatch."

Furthermore, according to Search Discovery,

"When you see a statistically significant difference between the observed and expected sample ratios, it indicates there is a fundamental issue in your data (and even Bayesian doesn't correct for that). This bias in the data causes it to be in violation of our statistical test's assumptions."

Simply put: if you're not looking out for SRM in your A/B tests, you might think your data is reliable or valid when it actually isn't.

So make sure to check for SRM in your tests!

How to spot SRM in your A/B tests

Calculating SRM should be a standard part of your data confirmation process before you declare any test a winner.

The good news: detecting an SRM issue is actually pretty easy to do. In fact, some testing platforms will now automatically tell you if you have an SRM issue.

But, if the testing platform you're using doesn't automatically detect SRM, no worries. You can easily calculate it yourself, ideally before you've finished running your test.

If you're a stats guru, you can determine if there's an SRM issue using a Chi-Square Calculator for Goodness of Fit. However, if the name alone scares you off, don't worry.
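If you'd rather script the check than use a calculator, the goodness-of-fit test is only a few lines. A minimal sketch in Python, assuming a designed 50/50 split (the function name is mine; for 1 degree of freedom, the chi-square p-value reduces to erfc(sqrt(chi2/2)), so no stats library is needed):

```python
import math

def srm_check(visitors_a: int, visitors_b: int, alpha: float = 0.01) -> tuple[float, bool]:
    """Chi-square goodness-of-fit test for a designed 50/50 split.

    Returns the p-value and whether an SRM issue is flagged at the
    given threshold (p < 0.01 is the commonly used cutoff).
    """
    expected = (visitors_a + visitors_b) / 2
    chi2 = ((visitors_a - expected) ** 2 + (visitors_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2 with 1 df
    return p_value, p_value < alpha
```

Plugging in a heavily mismatched split like `srm_check(260_000, 319_286)` flags SRM, while a near-even split like `srm_check(289_000, 290_286)` does not.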

You can also simply plug your traffic and conversion numbers into an existing SRM calculator, like this free calculator or Analytics-Toolkit's SRM calculator.

These calculators work for both Bayesian and Frequentist methods, whether your traffic is equally or unequally allocated.

If you don't have a SRM issue, you'll see a message that looks like this:

Assuming everything checks out and no SRM is detected, you're golden.

So long as the test has been run properly, your sample size is large enough to be adequately powered to determine a significant effect, and you've achieved statistical significance at a high level of confidence, you can declare your test a winner.

However, if you have an SRM issue, the calculator will alert you with a message that looks like this:

A p-value below 0.01 shows a significant result. The lower the p-value, the more likely an SRM issue has occurred.

If SRM is detected, that's a problem. It shows the ratio of traffic hasn't been directed to each variant equally and results might be skewed.

How frequent is SRM?

According to experimentation specialist, Iqbal Ali, in his article The essential guide to Sample Ratio Mismatch for your A/B tests, SRM is a common issue.

In fact, it happens in about 6-10% of all A/B tests run.

And, in redirect tests, where a portion of traffic is allocated to a new page, SRM can be even more prevalent.

Whatever the cause, if any test you've run has an SRM issue, you need to be aware of it, and you need to be able to mitigate the issue.

Does SRM occur with all sample sizes?

Yes, SRM can occur with samples of all sizes.

According to Iqbal in this LinkedIn post:

The Type I error rate is always 1% with a Chi Test with the p-value at < 0.01. This means it doesn't matter if we check with 100 users or 100,000 users. Our false-positive rate remains at about 1 out of 100 tests. (See the green line on the chart below).

Having said that, we need to be wary as with low volumes of traffic, we can see larger differences happen by random chance, WITHOUT it being flagged by our SRM check. (See the red line on the chart below).

It should be rare to see those extremes though.

It should be even rarer to see even larger outliers, where SRM alert triggers (false positives). (See the yellow line on the chart below). I wanted to see this number as it's a good indication of volatility. The smaller the traffic volumes, the larger the volatility.

At 10,000+ users assigned, the % difference between test groups before SRM triggers is <5%.

At 65,000+ users this % difference drops to < 2%. So, the Chi Test becomes more accurate.

So beware: no matter how big or small your traffic volume, SRM is a possibility. But the larger the sample, the more likely an SRM issue is to be accurately detected.
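You can see this roughly 1-in-100 false-positive behavior for yourself with a quick simulation: run many simulated A/A splits and count how often a p < 0.01 chi-square check falsely flags SRM. A sketch (the names are mine, and a normal approximation to the binomial stands in for per-user coin flips, for speed):

```python
import math
import random

def srm_p_value(a: int, b: int) -> float:
    # Chi-square goodness-of-fit p-value for a designed 50/50 split (1 df).
    expected = (a + b) / 2
    chi2 = ((a - expected) ** 2 + (b - expected) ** 2) / expected
    return math.erfc(math.sqrt(chi2 / 2))

random.seed(42)
n_users, n_tests, flagged = 10_000, 2_000, 0
for _ in range(n_tests):
    # Fair 50/50 assignment; the binomial count of users in arm A is
    # approximated with a normal draw (mean n/2, std sqrt(n)/2).
    a = round(random.gauss(n_users / 2, math.sqrt(n_users) / 2))
    if srm_p_value(a, n_users - a) < 0.01:
        flagged += 1

false_positive_rate = flagged / n_tests  # expect roughly 0.01
```

Since every simulated test here is a genuine 50/50 split, any flag is a false positive; the observed rate should hover near the 1% threshold, as the quote above describes.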

Why SRM happens

If your test shows a SRM issue, the first step is to figure out why it may have occurred.

SRM usually happens because:

  • The test is not set up properly or is not randomizing properly
  • There's a bug in the test or test set-up
  • There are tracking or reporting issues

Or a combination of all these factors.

What to do about SRM issues

1. Review your test set-up

When any test shows an SRM issue, the first step is to review the set-up and confirm everything has been allocated and is tracking properly. You may find a simple (and often hard to detect) issue led to the error.

2. Look for broader trends

However, if you can't isolate any particular problems, it's worthwhile reviewing a larger collection of your tests to see if you can find trends across a broader swath of studies.

For example, for one client who had a lot of tests with SRM issues, I did a meta-analysis of all tests run within a 1-year period. Through the analysis, I noticed that every test with over 10,000 visitors per variant, run on a particular testing platform, had an SRM issue.

While this finding was clearly problematic, it meant the SRM issue could be isolated to this variable.

After this issue was detected, the testing platform was notified of the issue; I've since seen updates that the platform is now working to fix this problem.

3. Re-run the study

In his paper Raise the Bar on Shared A/B Tests: Make Them Trustworthy, data guru, Ronny Kohavi, describes that a trustworthy test must meet 5 criteria, including checking for SRM.

In this LinkedIn post, Ronny compares not checking for SRM to a car without seatbelts. Seatbelts save lives; SRM is a guardrail that saves you from incorrectly declaring untrustworthy results as trustworthy.

If your study shows an SRM issue, the remedy is simple: re-run the test.

Make sure you get similar results, but without the SRM issue.

4. Run an A/A test

As this GuessTheTest article explains, an A/A test is exactly what it sounds like: a split-test that pits two identical versions against each other.

To perform an A/A test, you show half of visitors version 1, and the other half version 2.

The trick here is both versions are exactly the same!

Doing so enables you to validate that you’ve set-up and run the test properly and that the data coming back is clean, accurate, and reliable -- helping you ensure there isn't an SRM problem.

With an A/A test, you expect to see both variants receive roughly equal traffic with about the same conversion rate. Neither version should be a winner.

In an A/A test, when you don't get a winner, you should celebrate -- and know you're one step closer to being able to run a test without an SRM issue.

An important note: an A/A test won't fix an SRM issue, but it will tell you if you have one and bring you one step closer to diagnosing the problem.

5. Create or use an SRM alert system

To mitigate against undetected SRM issues, you might also want to consider building an SRM alerting system.

As experimentation mastermind Lukas Vermeer of Vista remarked in this LinkedIn post, his team created their own SRM alerting system in addition to the testing platform they're using. (Lukas has also created a free SRM Chrome extension checker, available here.)

The progressive experimentation folks at Microsoft have done something similar:

Some platforms, like Kameleoon and Convert also now have built-in SRM systems that send you alerts if SRM is suspected or detected.

6. Choose your poison

If you're not willing to invest in any of these strategies because they're too time or resource intensive, fine.

You then need to willingly accept that the tests you're declaring as winners aren't fully trustworthy. And you really shouldn't be making implementation decisions based on the results.

In that case, you need to choose your poison: either accept untrustworthy data or attempt to obtain accurate test results that may take more effort to yield.


If your A/B tests have fallen victim to an SRM issue, the results aren't as valid and reliable as they should be. And that's a problem.

Because, if you're making data-driven decisions without accurate data, you're clearly not in a good position.

Testing for SRM is like putting on your seatbelt. It could save you from implementing a so-called "winning" test that actually amounts to a conversion crash.

So remember: always be testing. And, always be checking for SRM.

Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below. 

By: Lawrence Canfield | Last updated April, 2022


Virtually any new web venture today can benefit from A/B testing as a means of curating content to visitor preferences.

Without a doubt, personalization holds true for gaming platforms, too.

As a segment of platforms that relies on directly interfacing with a large volume of active consumers on a day-to-day basis, gaming sites benefit from testing in significant and direct ways.

With this outlook in mind, here are some elements those managing gaming sites might want to consider for A/B testing.

Game Previews: Visual Covers vs. Video

A popular method for many gaming sites is to showcase a variety of image-only click-throughs when giving options, with video previews available on game-specific pages.

Some sites also opt for embedded GIFs as ways of quickly previewing gameplay (although these might actually be better replacements for images than video due to lack of audio).

Generally speaking, the marketing industry has trended toward video of late, but it’s a good idea to brainstorm and test how users might best engage with game previews. This GuessTheTest A/B test case study shows just how important it is to test your optimal app, video, or gaming and product thumbnail images.

Pull-Down Menus vs. Horizontal Sliders

This comparison needs to be considered in the context of both desktop and mobile versions of gaming platforms since the transition between them has, historically, been a bit of a nightmare.

The reality is, the majority of gamers are now on mobile, and UX designers often opt to adapt mobile sites to display pull-downs as horizontal sliders.

While the choice is going to depend partly upon the architecture of the site, it’s important to know how users perceive accessibility and enticement.

With a lot of CSS templates being easy to copy and paste nowadays, quickly mocking up a couple of menu options to test shouldn’t take too long.

As these GuessTheTest case studies show, testing and finding the optimal navigational format and presentation for your audience has a big impact on conversion rates across desktop and mobile devices.

Purchase Recommendations

Deciding how to order and display buying options is a key business decision for any game site operator.

As this GuessTheTest case study, conducted on behalf of T-Mobile, shows, testing how different subscription tiers are displayed illustrates how careful design may yield substantially different results in terms of what users consider the “default” and how willing they are to be upsold.

For gaming sites, the test could be specifically between tiers of bundles or other models.

Methods of Payment

The moment when a potential gamer takes the plunge is crucial to optimize. And on the user side, a balance needs to be struck between convenient options and perceived safety, assuming all options are safe.

With gaming sites, there are both new and established options for users to choose between.

Regarding the new, the notion of poker sites considering cryptocurrency for deposits and payouts has become very real in recent years.

Crypto is becoming a fresh option alongside more long-standing alternatives –– such that an A/B/C test could differentiate between preferences for a processor (like PayPal or Skrill), paying directly via credit card, or paying via crypto.

It may be that platforms are best off supporting all three options, but testing for preferences will nevertheless influence what’s most sensible to emphasize.

Displaying Offers

Timing and method are keys when it comes to displaying special offers.

Pop-ups can create a bit of an allergic reaction for internet users -- even gamers who are accustomed to fairly “busy” casino sites. To this point, many users still take active steps to block them.

For this reason, it's worthwhile testing whether it's better to use a pop-up or display the offer as a banner carousel.

If you do go for pop-ups, test the impact of using less intrusive formats and minimizing the number of overall pop-ups used.

As this GuessTheTest notification bar case study shows, timing and placement are important variables to get right to improve the effectiveness and conversion rates of pop-ups or other notification displays.


There are a lot of decisions that need to be made with a mass user-base platform like a gaming site.

There are many elements that can be optimized to appeal to more users. A/B testing is a powerful tool to determine what works best with your users.

For more A/B testing inspiration, check out the full library of tests. There are hundreds of inspiring A/B tests for you to look at and game the system with.

Take your best guess, see if you win, and apply the findings to optimize your own success.

Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below. 

By: Deborah O’Malley and Joseph A. Hayon | Last updated January, 2022


Did you know that 96% of adult users report being concerned about their privacy on the Internet? And 74% say they’re wary of how their privacy and personal data is being used.

With rising fears over Internet privacy and security, there’s been a push to protect users and tighten privacy policies.

As a result, 3rd party cookies will soon be blocked by most major web browsers, leaving optimizers and experimenters desperately picking up the crumbled pieces.

We'll have no easy way to track user events or behaviors -- let alone measure conversions.

What's an experimenter to do?

In this article, you’ll learn everything you need to know about cookies, their implications for A/B testing, and how to ensure continued success in your testing practice.

Specifically, you’ll find out:

  • What cookies are
  • The two most important types of cookies
  • How a cookieless world may impact A/B testing
  • Server-side tag management and what it means for your testing practice
  • What you should do to adapt to the changing landscape

What are cookies? 

Beside those delicious, tasty snacks, what are cookies anyway?

Well, in the context of the World Wide Web, cookies are text files sent from a website to a visitor’s computer or other Internet connected device.

That text file identifies when a visitor is browsing a website and collects data about the user and their behavior.

Collected data may include elements like a user's personal interests, geographical location, or device type.

These data points are stored on the user's browser. The information storage occurs at the time the user is engaging with a website, app, or online ad.

Through this collection and storage process, cookies achieve three things.

They can:

  1. Gather and store information about the user
  2. Capture the actions or events a user takes on a website or app
  3. Track a user’s activity across multiple websites
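Under the hood, a cookie is just a small piece of named text. As a rough illustration, here's a minimal JavaScript sketch of parsing the kind of name=value data a cookie carries (the cookie names and values are hypothetical examples, not any real site's cookies):

```javascript
// A cookie is plain text sent as "name=value" pairs. The values
// below are hypothetical examples of data points a site might store.
function parseCookies(cookieHeader) {
  const cookies = {};
  for (const pair of cookieHeader.split("; ")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip malformed pairs
    cookies[pair.slice(0, eq)] = decodeURIComponent(pair.slice(eq + 1));
  }
  return cookies;
}

// Example: geographic location, device type, and interests
const parsed = parseCookies("geo=CA; device=mobile; interests=sushi%2Ctravel");
console.log(parsed.interests); // "sushi,travel"
```

In a real browser, these pairs are written by the server's `Set-Cookie` header or by script via `document.cookie`; the sketch just shows how little magic there is in the format itself.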

Cookies – the good, the bad, and the ugly

The good

Cookies attempt to provide users with a more personalized experience when engaging with a website or app.

When marketers analyze tracking data, it can provide guidance on where optimizers should focus. The insights available are often richer than what standalone spreadsheet results or visual dashboards can provide.

So, despite their bad rep, cookies can, in fact, help digital marketers detail what, where, and why certain web events happen.

For these reasons, cookies can be a valuable (and delicious) part of the web.

The bad

However, because cookies track and store a user’s browser activity, they’ve been demonized as invasive and intrusive monsters. Cookie monsters, that is.

The ugly

With growing concerns over user privacy, there’s been a call to arms for the abolishment of cookies.

Getting rid of cookies may spell bad news for digital advertisers, marketers, experimenters, optimizers, and A/B testers reliant on using tracked data to understand user behavior and measure conversions.

The two types of cookies you need to know

While there are plenty of tasty cookies out there, as a digital marketer, there are really only two kinds you need to know: 1st and 3rd party cookies.

The difference between these two types of cookies is based on how a user's browsing activity is collected and where that data gets sent. 

1st Party Cookies Defined

1st party cookies are stored directly by the website or domain a user visits, or the app they've logged into.

Once the user’s data has been captured, it is sent to an internal server.

3rd Party Cookies Defined

3rd party cookies have a slightly different flavor.

3rd party cookies are created by outside domains and relayed back to an outside third-party server, like Google, Facebook, or LinkedIn.

It's because the data is collected and sent out to a 3rd party that 3rd party cookies are so named.
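To make the distinction concrete, here's an illustrative JavaScript sketch that classifies a cookie by comparing the domain that set it against the domain of the page the user is on. The domains are hypothetical examples, and the base-domain check is deliberately naive (real browsers consult the Public Suffix List, not just the last two labels):

```javascript
// Naive base-domain comparison, for illustration only.
function cookieParty(pageDomain, cookieDomain) {
  const base = (d) => d.split(".").slice(-2).join(".");
  return base(pageDomain) === base(cookieDomain) ? "1st-party" : "3rd-party";
}

console.log(cookieParty("shop.example.com", "example.com"));  // "1st-party"
console.log(cookieParty("shop.example.com", "facebook.com")); // "3rd-party"
```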

While this information might be interesting, as a digital marketer, what you really need to know is, 3rd party cookies are the troublesome ones of the batch.

They’re essentially tracking pixels.

And because they track and send data back to advertisers’ servers, they're seen as intrusive and highly invasive of user privacy.

For this reason, privacy protection advocates don’t like them and want them abolished.

And so, big-name brands, like Google and Facebook, are being forced to shift into a cookieless world -- one where 3rd party cookies will no longer exist.

In fact, Google is planning to eliminate the use of 3rd party cookies as soon as early 2022 and will likely move instead to Privacy Sandbox initiatives.

How do 1st and 3rd party cookies work?

Now that you understand what a cookie is, and the difference between 1st and 3rd party cookies, you can explore the intricacies of how each one works.

Grasping this concept will help you aptly shift from a reliance on 3rd party cookies to adopting other solutions that still enable the data tracking, collection, and conversion measurements you need for A/B testing.

The inner workings of a 1st party cookie

To fully understand how a 1st party cookie works, you have to start with realizing the internet has what’s known as a “stateless protocol.”

This term means, websites inherently “forget” what a user has done, or what activity they’ve taken -- unless they’re given a way to remember this information.

1st party cookies solve this forgetfulness problem.

They track and capture user activity information and, in doing so, aim to deliver a highly identifiable and authenticated user experience.

For example. . .

Imagine a user goes to your eCommerce website, adds an item to their cart, but doesn’t end up making a purchase during their initial session.

Well, the cookie is what enables the items added to cart to be saved in the user’s basket.

So, if the visitor later returns to your site, they can continue their shopping experience exactly where they left off --with the items still in their basket from the previous session.

All thanks to 1st party cookies.

This process can aid users and provide a better user experience.

So, you can see, 1st party cookies really aren’t that bad!

And, the good news is, most everyone seems to agree, 1st party cookies are a-okay. So they're here to stay. At least for now. . .

How 3rd party cookies work

An easy way to understand how 3rd party cookies work is by imagining this scenario:

Pretend you’re searching online for a local sushi restaurant and land on a “sushi near me” website.

But then, your kid abruptly grabs your phone out of your hand and insists they play you their latest, favorite Facebook video.

It’s 3rd party cookies that will have tracked this behavior and will try to serve you ad content believed relevant to your browsing activity.

The mystery for why you're getting Nyan Cat memes in your Facebook feed is now explained. . .

And, 3rd party cookies are also the reason why you might get served Facebook ads for that local sushi restaurant you just searched.

What happens to user data after a 3rd party vendor gets ahold of it?

After the data is relayed back to the 3rd party, it’s then pooled together by all the data warehouses owned by the various ad tech vendors.

This data collection and pooling process is what enables marketers to:

  1. Deliver targeted ads relevant to users’ interests
  2. Develop look-a-like audiences
  3. Build machine learning processes aimed at predictive ads
  4. Control the number of times a user sees an ad through any given ad network

With these capabilities, it’s possible to measure ad effectiveness, assess key conversion metrics, and gauge user engagement across multiple website networks.

None of these assessments would be easily possible without 3rd party cookies. Which means, without them, A/B testing may become significantly more difficult!

How a cookieless world may impact A/B testing

If you’re concerned about protecting your data and privacy, the abolishment of 3rd party cookies may be welcomed news.

But, if you’re like most marketers, you might be a little freaked out and wondering how the abolishment of 3rd-party tracking cookies is going to impact the tracking and data collection capabilities in your Conversion Rate Optimization (CRO) and A/B testing work.

How will the abolishment of 3rd party cookies affect optimizers?

The answer is -- likely in three main ways:

  1. Legislatively
  2. Through cookie banner requirements
  3. Within browser tracking

Let’s look at each concept in more depth. . .

1) Legislative Reforms

When the Internet went mainstream in the late 1990s, www might just as well have stood for wild, wild west.

There weren’t many laws or privacy controls enacted back then.

But, as the web has matured, more privacy policies have been instituted.

GDPR requirements were just the starting point:

“Governments are creating laws and regulations designed to encourage tracking transparency. These laws are geared towards preventing companies from unethical or unregulated user tracking. In a nutshell, it's all for the sake of user privacy. As a result of GDPR, CCPA, and ePR, websites that don't reveal which user data is collected or how it's used can face civil penalties.”

Here's a timeline showing recently implemented privacy policies:

From a legal perspective, digital marketers are being required to comply with more, and more aggressive, data-privacy regulations.

And this trend will likely only continue as time marches on.

2) Cookie banner requirements

Website owners must now also provide transparency at the start of a user’s journey by prominently displaying a cookie banner on their website.

We’ve all seen these cookie notification banners. They look something like this:

Now you know why they’re on so many websites! 🙂

And, while cookie banners may slightly disrupt an A/B test experience, they're not the worst offender.

3) Browser tracking

Worse yet are increasingly stringent browser privacy policies.

For example, Apple’s Safari browser recently instituted aggressive policies against cookie tracking.

Safari now “kills” cookies after 7 days, and kills cookies from ad networks after just 24 hours!

These timelines are a big problem, especially for testers following best practice and running an experiment for 2+ weeks.


Let's say, for example, a Safari user enters a website being tested, leaves the site, and comes back 8 days later.

By this time, the cookie has been dropped, and, as a result, the user is now classified as a new visitor and may see a different test variation.

The result: the experiment data may be dramatically skewed and unreliable.
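The failure mode above can be sketched in JavaScript. Assume, for illustration, a testing tool that stores the assigned variant in a cookie (a Map stands in for the browser's cookie jar; the cookie name is hypothetical):

```javascript
// The testing tool stores the assigned variant in a cookie so the
// visitor keeps seeing the same version across sessions.
function getVariant(cookieJar) {
  if (!cookieJar.has("ab_variant")) {
    // no cookie found -> (re-)assign at random
    cookieJar.set("ab_variant", Math.random() < 0.5 ? "A" : "B");
  }
  return cookieJar.get("ab_variant");
}

const jar = new Map();
const day1 = getVariant(jar); // visitor assigned a variant on day 1
jar.delete("ab_variant");     // Safari expires the cookie before day 8
const day8 = getVariant(jar); // re-assigned: may differ from day1
```

While the cookie survives, the visitor is sticky to one variant; once the browser deletes it, the same person is re-randomized and can land in the other variant, polluting the results.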

What's going to happen to data collection with A/B testing?!

If this kind of scenario sounds doomsday to you, you might be wondering how you're now possibly going to accurately track and measure browser behavior, events, or conversions.

Here’s the likely outcome:

Fragmented data collection

Data collection and measurement will likely become more fragmented.

As 3rd party cookies are phased out, experimenters will be forced into relying largely on 1st party cookie data.

But, this data will be accessible only from users who consent to the site’s cookie policies and data collection or login terms.

Since fewer users are likely to accept and consent to every privacy policy, data profiles will become more anonymous and less complete.

Net sum: there will be big gaps in the data.

These gaps will likely only grow larger as privacy laws evolve and become more stringent.

Modeled user behavior

Currently, the digital marketing realm seems to be moving towards more of an event-style tracking model, similar to Google Analytics 4 (GA4) data collection reports.

This movement means, in the end, we’ll have more algorithms pointing to modeled user behavior -- rather than actual individual user data.

And, we’ll likely need to rely more on data modeled through statistical calculations, methodologies, and technologies.

Essentially, we’ll move from an individual data collection approach to an aggregated, non-user-segmented overview.

This approach will work if we compile enough data points and invest in developing robust predictive analytics data collection and measurement infrastructure.

Untrackable user data

As privacy policies tighten, tracking user behavior is going to get far more challenging.

For example, Apple recently released its newest operating system, iOS 15.

Built into this OS are a host of security measures that are great for user privacy, but prevent the tracking of any user behavior.

In fact, right now, the only new Apple device without privacy controls is the AirPods!

In addition to Apple, other browser brands are also cracking down. As such, an anonymous user browsing experience -- with untrackable data -- is becoming the norm.

In turn, private, secure browsers, like Brave, will likely become a mainstay.

Meaning, yet again, less data and metrics available to marketers.

The final outlook?

In the end, A/B testing likely won’t die. It will just change.

Conversions are most likely going to be based on broader machine-learning algorithms that model users, predict and analyze expected behavior, then provide anticipated outcomes.

Data will be less about observed insights and more about predicting how we think users are likely to behave.

Which, when you really get down to it is kinda like how most experimenters declare A/B test winners now anyway. 😉

Server-side tag management

Big changes are coming.

Major privacy and security enhancements will shift the way experimenters track, access, and measure conversion data.

3rd party cookies will soon be abolished, and marketers will be left with big holes in their data collection and analysis capabilities.

For data purists, the situation sounds a bit scary.

But, the good news is, server-side tag management may be a potential solution. 

So what is server-side tag management, how does it work, and how might it help A/B testers?

What is server-side tag management?

According to data wizard, Simo Ahava, server-side tag management occurs when you run a Google Tag Manager (GTM) container in a server-side environment.

If that terminology doesn't mean much to you, don't fret.

Simply stated, it means server-side tagging is a bypass around 3rd party cookies.

How does server-side tag management work?

You'll recall, with a 1st party cookie, the cookie is deployed by the owner of the website domain.

Well, with server-side tracking, the data is sent to your own server infrastructure. A cloud server is then used to send data back to 3rd party platforms.

Doing so gets around tracking restrictions and privacy laws. At least for now.
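As a rough sketch of that data flow, assuming a hypothetical first-party collection endpoint: the browser sends events to your own server, which preps and cleans them before anything is forwarded to a 3rd party (the vendor name and field names below are made up for illustration):

```javascript
// Events arrive at YOUR (first-party) server, which preps the
// payload and decides what gets forwarded to a 3rd party.
function handleEvent(event) {
  // 1. Collected server-side, on infrastructure you control
  const payload = { ...event, receivedAt: Date.now() };

  // 2. Prep/clean: e.g. strip PII before anything leaves your server
  delete payload.email;

  // 3. In production, a cloud server would now POST this to the
  //    3rd-party platform; here we just return the outgoing message.
  return { forwardTo: "analytics-vendor", payload };
}
```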

How does server-side tag management help A/B testers?

This solution helps experimenters take more control of their data. With server-side tag management, you can choose how to prep and clean the data, as well as where to offload users' data.

As a result, there are 3 main advantages. With server-side tagging, you can continue to:

  1. Serve 3rd party content
  2. Achieve better, faster site performance
  3. Reduce data gaps

Let's explore each advantage in a bit more depth. . .

1. Serve 3rd party content

In server-side tag management, typically, the data is ported into a 3rd party ad network. This data flow means, marketers can serve and target content based on users’ indicated interests. 

Marketers can also mask the identity of the browser by changing the container name or hostname.

As a result, an adblocker won't recognize it as something trying to identify users, so the ad won't be as readily blocked. 

2. Achieve better performance

From a performance standpoint, server-side tagging is faster and more efficient because less code needs to load on the client-side.

So, A/B tests may be able to run more quickly potentially without lag or flicker.

3. Reduce data gaps

Server-side tag management can also prevent the data from "leaking" or getting lost in the shuffle.

But, most importantly, server-side tag management can reduce the gap between conversion events and actual user behavior.

Final thoughts on server-side tagging

While the shift to server-side tagging may not be the end-all solution, it is a means currently available to help digital marketers take better control of how user data is captured.

For now, it's a viable solution that's worth investing in and learning more about.

Adapting to change

To stay ahead of the curve, here are the top 3 things you can do in this changing environment as we shift into a cookieless world.

1. Build your relationship with your customers

Don’t rely solely on your business web analytics data.

Learn about your users.

Really get to know your audience. Find out what they like and what keeps them away from your site.

You don’t need to rely on analytics data to do so.

You can talk to your customers, poll them, apply qualitative data, or do User eXperience (UX) studies to learn about your customers: their needs, wants, desires, and how they behave on your website.

2. Get to know your analytics account

Audit the data collection integrity of your analytics data once or twice a month, depending on your organization’s needs.

Note any visible changes, and try to dig into elements that are less guided by analytics tools. 

3. Stay up to date

Keep up with new updates to ad tech platforms, like Facebook’s new API implementation for tracking Facebook events, as well as updates with GTM server side tagging.

Some good resources for staying on top of the latest trends include Simo Ahava’s blog and Google’s Developer Resources.


We're moving into a cookieless world where user privacy and personal data protection are paramount.

This paradigm shift may present many challenges for data-driven marketers reliant on tracking users to personalize the customer experience and accurately calculate conversions.

Despite the changing landscape, as an experimenter, you can still continue to build relationships with your customers within a first-party environment.

Also remember, as data collection methods change, you can’t expect to achieve 100% data accuracy.

Rather, you’ll be using more sampled data to piece together an approximation of how customers behave.

In the end, A/B testing likely won’t die. It will just change.

For even more information about A/B testing in a cookieless world, check out this webinar replay.

Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below. 

The complete guide: a comprehensive step-by-step tutorial to create and launch A/B tests

By: Deborah O'Malley, M.Sc | Last updated December, 2021


Good news!

In the free A/B testing platform, Google Optimize, it's easy to set up your first A/B test.

But, so that you can accurately, and confidently, perform the set-up, this article breaks down exactly what you need to do -- and know -- with definitions, examples, and screenshots.

Setting Up a Test Experience

Imagine that you're about to set up and run your first A/B test. Exciting times!

Let's say you've decided to test the button copy on your sign-up form. You want to know if the text "sign up now" converts better than "get started".

Well, by running an A/B test in Google Optimize, you can determine whether one text variant outperforms the other.

Creating Your Experience

To begin, simply open Google Optimize, ensure it's set up for your specific site, and click the "Create" button in the upper right-hand corner of your screen.

Doing so will create a new test experience:

A screen will slide out from the right side, asking you both to name your experiment, as well as define what kind of test you’re going to run:

Naming Your Test

You'll first need to name the test.

The test name should be something that is clear and recognizable to you. 

*Hint* - the name should be both memorable and descriptive. A title both you and your colleagues or clients will immediately understand and recognize when you come back to it in several months' or even years' time.

For this example, we’ll call the test “CTA Button Test”.

Site URL

You’ll next be prompted to type in the URL of the webpage, or website, you’d like to use. 

This URL is very important to get right. 

It’s crucial you input the URL of the control -- which is the original version you want to test against the variant(s).

If you’re running a redirect test and you select the URL of the variant, the variant will be labeled as the control in Google Optimize. That will totally mess up your data analysis, since you won’t be able to accurately compare the baseline performance of the original version against the variant.

Trust me! I’ve seen it happen before; it’s a mistake you want to avoid.

So, make sure you properly select the URL of the page where your original (the control) version currently sits on your site. 

If the element you want to test is a global element, for example, a top nav bar button that shows on all pages across the site, you’ll use the homepage URL.

In this example, the button we’re testing is in the top nav header and shows on every page of the site. So, we’ll use the website's homepage URL:

We’ll enter that URL into the URL field:

Defining Your Test Type

In order to accurately run the proper test, you'll need to choose the correct type of test to run.

In Google Optimize, there are four different test type options:

  • A/B test
  • Multivariate test
  • Redirect test
  • Personalization

Most of the time, you'll be running a straight-up A/B test. But to know for sure which test type is best for your needs, check out this article.

For our example, we just want to set up a simple A/B test looking at the effect of changing the CTA button.

For your test, once you've confirmed the type of test you're going to run, you're ready to officially set up the test, or as Google Optimize calls it, "create the experience."

To do so, simply press the blue "Create" button in the upper right hand corner:

Adding the Variants

Once you've created the experience, you’re now ready to continue setting up the test, starting with adding your variant(s).

The variant, or variants, are the versions you want to test against your control. Typically, in an A/B test, you have one variant (Version B).

But, as long as you have enough traffic, it's perfectly legitimate to run an A/B/n test with more than one variant (where n stands for any number of other variants.)

To set up your variant(s), simply click the blue “Add variant” button:

Next, name the variant with a clear, descriptive, memorable title that makes sense. Then click “Done”:

Woo hoo! You’ve just set up your A/B test! 🙂

But, your work isn’t done yet. . . now you need to edit your variant. 

Editing the Variant

Remember, the variant is the version you want to test against the control.

To create the variant, you can use Google Optimize’s built-in WYSIWYG visual editor.

Or, for more complex tests and redesigns, you might want to inject code to create the new version.

To do so, you click “Edit”:

Note, the original is the version currently on your site. You cannot edit this page in any way through Optimize. The name cannot be changed, nor can the page URL.

You’re now brought into an editor where you can inject code or get a visual preview of the webpage itself. 

Using the visual editor, you can click to select the element you want to edit, and make the change directly. 

Optimize’s visual editor is pretty intuitive, but if you’re unsure what elements to edit, you can always refer to this guide.

In this example, you see the visual editor.

To make changes, you'd first click to select the “Get a Free Analysis” button text, and then click to edit the text:

Now, type in the new text, “Request a Quote” and click the blue “Done” button at the bottom right of the preview editor screen:

When you're happy with all changes, click the top right "Done" button again to exit out of the preview mode:

You're now brought back into the Optimize set-up:

Here, you could continue adding additional variants, in the same way, if you wanted.

You could also click the "Preview" button to preview the variants in real-time.

Assigning Traffic Weight

Once you've assigned and defined your variants, you're going to want to state the weight of traffic, or what percentage of traffic will be allocated to each variant.

The default percentage is a 50/50% split meaning half of visitors (50%) will see the original version and the other half (50%) will see the variant.

As a general testing best practice, traffic should be evenly split, or when testing two versions, weighted 50/50.

As explained in this article, unequal allocation of traffic can lead to data discrepancies and inaccurate test results.

So, as a best practice, don't change this section.

But, if for some reason you do need to change the traffic weight, you can do so by clicking on the link that says "50% weight":

A slide-out will then appear in which you can click to edit the weight to each variant. Click the "Custom percentages" dropdown and assign the weight you want:

If you were to assign an 80/20% split, for example, that would mean the bulk, 80% of traffic, would be directed towards the control and just 20% of visitors would see the variant.

This traffic split is very risk averse because so much of the traffic is diverted to the control -- where there is no change.

If you find yourself wanting to allocate traffic in this way, consider if the test itself should be run.

Testing is itself a way to mitigate risk.

So, if you feel you further need to decrease risk by only showing a small portion of visitors the test variant, the test may not actually be worth doing.

After all, testing takes time and resources. So, you should do it properly. Start with evenly splitting traffic.
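For intuition on how an even split works under the hood, here's an illustrative JavaScript sketch of "sticky" 50/50 assignment by hashing a visitor id. Optimize handles this internally, so nothing here reflects its actual implementation; the hash is a toy example:

```javascript
// Hash the visitor id to a pseudo-uniform number in [0, 1), then
// compare against the variant weights. The same id always hashes
// the same way, so a returning visitor keeps their variant.
function assignVariant(visitorId, weights = { A: 0.5, B: 0.5 }) {
  let h = 0;
  for (const ch of visitorId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  const r = (h % 1000) / 1000; // value in [0, 1)
  return r < weights.A ? "A" : "B";
}

// Same visitor always lands in the same variant:
console.log(assignVariant("visitor-42") === assignVariant("visitor-42")); // true
```

Changing the weights (say, to `{ A: 0.8, B: 0.2 }`) shifts the cutoff, which is all an 80/20 split really is.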

Setting Up Page Targeting

You're now ready to set up the page targeting.

Page targeting means the webpage that is being "targeted" or tested in the experiment.

You can target a specific, single page, like the homepage, a subset of pages, like all pricing pages, or all pages on the site.

In this test example, we want to select any URL that contains, or includes a certain URL path within the site.

We're, therefore, going to set up our Google Optimize test with a URL "Contains" substring match. To do so, click the pencil (edit) symbol:

And, selecting “Contains” from the dropdown menu:

This rule says: include in the test every URL that contains this URL path. 

In contrast, if we had selected “Matches”, the test would only be on the homepage, because it would be matching that URL.

If you’re unsure which parameter to select for your test, you can consult this Google article.
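The difference between the two rules can be illustrated in a couple of lines of JavaScript (the URLs are hypothetical examples; Optimize evaluates these rules itself):

```javascript
// "Contains" is a substring check; "Matches" is an exact comparison.
const contains = (pageUrl, rule) => pageUrl.includes(rule);
const matches  = (pageUrl, rule) => pageUrl === rule;

const rule = "https://example.com";
console.log(contains("https://example.com/pricing", rule)); // true: pricing page is in the test
console.log(matches("https://example.com/pricing", rule));  // false: only the exact URL would match
```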

If you want to verify the URL will work, you can check your rule. Otherwise, click “Save”:

Audience Targeting

Now, you can customize the test to display for only certain audiences, or behavior.

To do so, simply click “Customize”:

Here you can set a variety of parameters, like device type and location, if you only want certain viewers taking part in the test:

Note, if you want to parse out results by device type, that reporting is done in Google Analytics, and should NOT be activated within Google Optimize.

However, if you only wanted mobile visitors, for example, to take part in the test, then you’d select the “Device Category” option and choose only mobile visitors.

In this test example, we don’t have any rules we’d like to segment by, so we’ll leave everything as is.

Describing the Test

Next, you can add a “Description” about the test.

This step is optional, but is a good practice so you can see your test objective and remind yourself of the hypothesis. 

Adding a description also helps keep colleagues working with you on the same page with the test.

To add a “Description” simply, click the pencil to edit:

Then, add your description text:

Defining Test Goals

You're now ready to input your test goals.

"Goals" are the conversion objectives or Key Performance Indicators (KPIs) you're measuring and hoping to improve as a result of the experiment.

You may have one single goal, like to increase form submissions. Or many goals, like increasing Clickthrough Rates (CTRs) and form submissions.

Your goals may be conversion objectives that you've newly set, or might tie-in to the existing goals you've already created, defined, and are measuring in Google Analytics.

To set up a new goal, or select from your existing goals, simply click the “Add experiment objective” button:

You'll then have the option to either choose from already populated goals in Google Analytics, or custom create new goals.

Note, if you're using existing goals, they need to have already been set up in Google Analytics and integrated with Google Optimize. Here are detailed instructions on how to link Google Analytics into Google Optimize.

For this example, we want to “Choose from list” and select from the goals already created in Google Analytics (GA):

The GA goals now show-up as well as other default goals that you can select:

In this example, we want to measure those people who reached the thank you page, indicating they filled out the contact form. We, therefore, select the "Contact Us Submission" goal:

We can now add an additional objective. Again, we’ll “Choose from list”:

In this case, we also want to see if the button text created a difference in Clickthrough rate (CTR) to the form page.

Although this goal is very important, it's labelled as the secondary goal because contact submissions are deemed more important than CTR conversions:

Email Notifications

It's completely optional, but under the “Settings” section, you can also select to receive email notifications, by sliding the switch to on:

Traffic Allocation

Traffic allocation is the percentage of all visitors coming to your site who will take part in the test. 

Note, this allocation is different than the weight of traffic you assign to each variant. As described above, weight is the way you split traffic to each variant, usually 50/50. 

Of that weighted traffic, you can allocate a percentage of overall visitors to take part in the test.

As a general best practice, you should plan to allocate all (100%) of traffic coming to your site to the test experiences, as you’ll get the most representative sample of web visitors. 

Therefore, you shouldn't need to change any of the default settings.

However, if you’re not confident the test will reveal a winner, you might want to direct less than 100% of the traffic to the experiment.

In this case, you can change the traffic allocation from 100% of visitors arriving at your site to a smaller percentage by clicking the pencil to edit the value here (simply drag the slider up or down):

Note that the smaller the percentage of total traffic you allocate to your test, the longer it will take for you to reach a statistically significant result.

As well, as explained in this article, unequal allocation of traffic, or reallocation of traffic mid-test, can lead to data discrepancies and inaccurate test results. So once you've allocated, ideally, 100% of your traffic to the test, it's best to set it and forget it.

Activation Event

By default, “Page load” is the activation event, meaning the experiment you've set up will show when the webpage loads. 

So long as you want your test to show when the page loads, you’ll want to stick with the default “page load” setting.

If you’re testing a dynamic page -- one which changes after loading -- or a single page application that loads data after the page itself has populated, you’ll want to use a custom “Activation event” by clicking the pencil tool and selecting the activation event from the dropdown menu that fits best for you:

An activation event requires a data layer push using this code: dataLayer.push({'event': 'optimize.activate'}); You can learn more here.
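For a single-page app, that push would typically fire after the dynamic content has rendered. Here's a minimal, self-contained sketch; the handler name is hypothetical, and a plain array stands in for the page's window.dataLayer:

```javascript
// A plain array stands in for the page's window.dataLayer.
const dataLayer = [];

function onRouteChange() {
  // ...render the new view here, then tell Optimize it's ready:
  dataLayer.push({ event: "optimize.activate" });
}

onRouteChange();
```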

Note, in the free version of Google Optimize, you can choose up to one primary objective and two secondary objectives. 

Once these objectives are selected and your experiment launched, you can’t go back and change them. 

So, make sure you think about how you want to track and monitor conversions before you launch your test!

Prior to Launch

With all your settings optimized, you’re nearly ready to start your test! 

Preview Your Experiences

But, before launching, it’s always a good idea to preview your experiences to make sure everything looks good.

To do so, click on the “Preview” button in the Variants section and select the appropriate dropdown menu for the view you want to see:

My recommendation is to individually preview each variant in web, tablet, and mobile preview mode:

Confirm and Debug Your Tags

Next, you’ll want to click the “Debug” option within the Preview mode:

Clicking into the Debug mode will bring up the website you’re testing and will show you a Google Tag Manager (GTM) screen, with the targeting rules for the tags that will be firing:

If there are any issues, you can debug them now -- before launching your experiment.

Get Stakeholder Approval

If you’re working with stakeholders, or clients, on the test, it’s a good idea to let them know the test is set-up and ready to launch, then get their approval before you start the test.

Previewing the test variants, and sending them screenshots of the preview screens will enable you to quickly and efficiently gain stakeholder approval.

You're then ready to launch the test! 🙂

Launching the Test

With all your i’s dotted and t’s crossed, you’re ready to launch your test!


To do so, simply click the “Start” button at the top right of the screen.

And, ta da! You’ve just set up an A/B test in Google Optimize.

Congratulations. Well done!

Your Thoughts

Hope you found this article helpful!

Do you have any thoughts, comments, questions?

Share them in the Comments section below.
