The answer isn’t so simple. . .

In this article, you’ll see several historic A/B test case studies that support the notion that sliders don’t convert. And you’ll find out the one crucial exception to the rule.

One design element that has stood the test of time is the image slider.

Sliders, also called image carousels, or rotating offers, are web elements where the pictures displayed automatically rotate through a series of images on the web page.

This popular women’s clothing retailer, Sammy Dress, shows an example of a slider:


By: Deborah O'Malley & Shawn David | Last updated March, 2023

Test planning and prioritizing requires more than just figuring out the best tests to run first.

To be done well, you need a system in which you can track, map, and show accountability for your testing roadmap.

In this informative 14-minute interview, you'll hear from Shawn David, Operations Engineer at Speero, a leading A/B testing consulting firm.

Shawn shares how Speero researches test ideas, turns insights into actions, and runs experiments.

Check out the video to:

- Get an inside glimpse into Speero's *Experiments, Insights, Action* planning and prioritization methodology.
- Understand why "receipts," or keeping a track record of test ideas, is crucial for managing client relationships.
- See how this tool can be used to develop or enhance your own test planning and prioritization model.

**Hope you’ve found this content useful and informative. Please share your thoughts and comments in the section below.**

*By: Deborah O'Malley, M.Sc | Last updated January, 2023*

It’s official!

Google Optimize, the free A/B testing platform, is shutting down.

According to an email statement released by Google, as of September 30, 2023, Google Optimize and Optimize 360 will no longer be available:

As an experimenter running tests on Google Optimize, this could spell bad news for you -- especially if you’re heavily invested in using Optimize.

What should you do now?

Here are the top three next steps I recommend you take. Now.

If you're a testing agency, or an organization that has run loads of tests on Google Optimize, you may have a large testing repository housed within Optimize.

Now is the time to take those test results and start documenting them in a place where you'll be able to access them in the future.

You may be thinking, why would I need old tests?

The reason is you're going to want to be able to look back on tests you've run, see what worked, what you've already tried, and show clients or management already vetted ideas.

Documenting your old tests may take a lot of time, but it will be one of the most valuable things you can do, long-term.

To begin the documentation process, Google recommends you export your Optimize data by exporting a CSV file. They provide more info here.

But I recommend you take it a step further and capture screenshots of the test images, plus all your test data.

Yes, it'll take some time, but it'll be worth it in the end!

Cause, trust me, there's nothing worse than having all the tests you want access to just gone!

To make this job easier for you, I've created a free test repository template you can use. Simply download and fill in the data relevant to your tests.

You can access the free template here. It looks like this:

Note, if you don't have time to be filling in the testing template yourself, Fiverr and Upwork are great places to hire someone to do this task cheaply.

Right now, it remains in question if and when Google will provide a viable alternative to Optimize.

Officially, Google has said they're “committed” to bringing effective solutions to customers:

But what that *really* means is unclear.

As **Johann Van Tonder** eloquently put it in this LinkedIn thread, that statement is light on detail and heavy on vague BS:

Analytics experimentation experts like **Shawn David** say that, to work, any new Optimize alternative will have to be structured radically differently because the current Optimize format is cookie-based, whereas anything new will be session-based and so won't work the same way:

However, Data Analytics Architect, **Juliana Jackson** of **Media.Monks** suggests Google's statement means you'll be able to make Google Analytics 4 (GA4) your hub to measure and analyze test results, but will need to find a new testing platform altogether. See this LinkedIn thread:

At this point, nobody really knows for certain.

We can only speculate and wait till the almighty Google tells all.

But waiting too long is a dangerous move. You risk getting left far behind.

So, it’s highly recommended you start exploring alternative testing platform options now.

As optimizer, Rich Page, outlined in this helpful LinkedIn post, right now both Convert and VWO are offering new plans to help you migrate from and replace Google Optimize:

And, if you’re willing to go paid, there are certainly plenty of options out there for you.

Although not an exhaustive list, in this post, Speero CEO, Ben Labay, suggests several A/B testing vendors to consider, including Optimizely, Kameleoon, AB Tasty, Adobe Target, and SiteSpect, to name just a few:

The best testing platform for you depends on your unique testing needs.

Most platforms are priced on traffic, so the more traffic you have, the more you’re likely to pay.

But, beyond cost, there are factors to consider like ease of migration, features and functionality, and trustworthiness of the data collection methods.

In this post, Ben suggests you should make your platform choice by evaluating how it assigns variables and metrics:

Some experimenters are delighting at the downfall of Google Optimize.

They’ve stated things like only amateurs used the free testing platform, and used it incorrectly, and, according to stats guru Georgi Georgiev, such users don’t get accurate results anyway:

While some have argued that Google Optimize's user base skews amateur, with only junior testing teams reliant on the free platform, Convert has compelling evidence showing 61% of experimenters have, at the very least, signed up for an Optimize account – even if it’s not active:

And, according to this post, Optimize users include big-name brands like Swarovski crystals and Gartner:

That said, not every experimentation team using Optimize is a big-name brand. Many are, most certainly, smaller.

If you’re one of them, now is the time to be evaluating if A/B testing is truly for you.

As I’ve shared before, truly trustworthy test results are hard to obtain. Properly-powered A/B tests need huge sample sizes, typically 120,000+ visitors per variant.

If you’re in the 88% of organizations that don’t have anywhere near this type of traffic, you might want to consider other experimentation methods, including user experience testing, consumer surveys, exit polls or customer interviews.

None of which need any A/B testing software whatsoever.

I also recommend engaging in what I’ve coined “B/A” testing.

B/A testing is the opposite of A/B testing.

In B/A testing, you take the **B**aseline conversion rate, form data-driven hypotheses, implement changes, and **A**ssess the conversion rate over a reasonable period of time, based on your sales cycle.

If the conversion rate is flat or down, your assessment shows you tweaks need to be made.

You then go back to the drawing board, form data-driven hypotheses on what you think is lagging, and change those aspects.

Then, again, assess, comparing the **B**efore to the **A**fter.

Baseline/assess. Before/after. Baseline/assess. Before/after. B/A. B/A.

You continue this iterative process until you see conversion rates increase.

This solution is particularly good for lower traffic sites that don’t have the volume to run trustworthy tests, but still want to experiment and optimize conversions.

The downside is, all traffic is exposed to the “variant” – the change(s) you’re implementing on the site. But, the upside is, by exclusively monitoring the conversion rate, you know for certain sales are increasing. Or not.

There’s no more guesswork or relying on data you can’t fully trust. Just results.

Here’s a real-life client example of B/A testing in action:

As you can see, over the time period measured, the eCommerce conversion rate went from 2.32% to 3.14%, contributing to a solid 35.48% increase for the company -- which resulted in many more thousands of dollars over the sales period! And counting.

And this example isn't isolated. I’ve used this method for many happy clients and have seen conversion rates and sales figures increase +30% in just a couple months.

So, B/A testing may be a viable alternative when A/B testing just doesn’t make sense.

If you’re interested in learning more about how B/A testing could help you, get in touch for a free conversion audit.

These results show A/B testing isn't the only way to optimize and increase conversion rates.

The testing landscape is quickly changing. Experimenters need to be on their toes, and ready to spring into action.

In the immediate term, experimenters should:

- Start saving their testing data
- Look for viable experimentation platform alternatives
- Assess if A/B testing is best or if B/A testing is a better fit

While the sun may be setting on Google Optimize, the sun also rises, providing innovative experimenters the chance to experiment with new optimization opportunities.

Hope you found this article helpful!

Do you have any thoughts, comments, questions?

Share them in the *Comments* section below.

By: Deborah O'Malley, M.Sc | Last updated January, 2023

Google the question, "How long should you let an A/B test run" and you'll get a variety of responses. Most of them incorrect:

In actuality, how long you need to run your A/B test is determined by your sample size requirements.

To run a properly-powered test, you need to begin by calculating your sample size requirements AHEAD of running the study.

You can use a sample size calculator, like this one, to calculate your required sample size. (See this GuessTheTest article on how to best use the calculator).

You can then calculate approximately how long it'll take to reach this sample size requirement. This calculation can be most easily done by using a test duration calculator like this one.
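If you'd rather see the arithmetic behind a test duration calculator, here's a minimal sketch. It assumes all of your weekly traffic enters the test and is split evenly across variants; the function name and the input numbers are hypothetical, chosen only for illustration.

```python
import math

def estimate_test_weeks(sample_per_variant: int, num_variants: int, weekly_visitors: int) -> int:
    """Weeks needed for every variant to reach its pre-calculated sample size."""
    total_needed = sample_per_variant * num_variants
    # Round up: a partial week still has to be run in full.
    return math.ceil(total_needed / weekly_visitors)

# Hypothetical inputs: 20,000 visitors needed per variant, 2 variants,
# and 10,000 visitors per week reaching the tested page.
weeks = estimate_test_weeks(sample_per_variant=20_000, num_variants=2, weekly_visitors=10_000)
print(f"Plan for roughly {weeks} weeks")
```

If the answer comes out far beyond the 6-week mark, that's a signal the test may not be feasible with your current traffic.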

Once calculated, don't stop your test ahead of reaching the pre-calculated sample size requirement – even if results appear significant sooner.

Prematurely declaring a winner, or loser, before meeting sample size requirements is a dangerous testing practice that can cause you to make incorrect calls before the results are fully fleshed out.

Assuming sample size requirements can be met, on average an A/B test should run between 2 and 6 weeks.

A 2-week timeframe ensures the test runs all days of the week and smooths out any data discrepancies in consumer shifts, for example, over the weekend.

Much longer than 6 weeks and the data may start to become muddied.

Things like user patterns may shift or cookies may be deleted, introducing a whole new set of variables into the equation. As a result, you won't know if it's changing user behavior or something else that's contributing to the test results.

That said, testing timing depends not only on sample size requirements, but also the type of test you're running.

For example, an email test may run just once over 1 hour. As long as the test has a large enough email list to achieve properly-powered, statistically significant results, you're covered.

Other tests may need to run for different durations to take into account factors like seasonality or sales cycles.

In the end, how long your A/B test should run is an "it depends" scenario -- which can be clearly calculated ahead of starting your study.

Do you have any questions or thoughts? Give your feedback in the comments section below:

By: Deborah O'Malley | Last updated September, 2022

If you've been into experimentation long enough, you've likely come across the term *MDE* -- which stands for **M**inimum **D**etectable **E**ffect.

The MDE sounds big and fancy, but the concept is actually quite simple when you break it down. It's the:

- *Minimum* = smallest
- *Effect* = conversion difference
- *Detectable* = you want to see from running the experiment

As this GuessTheTest article explains, in order to run a trustworthy experiment -- one that's properly powered and based on an adequate sample -- it's crucial you calculate the MDE.

But not just calculate it.

Calculate it AHEAD of running the experiment.

The problem is, doing so can feel like a tricky, speculative exercise.

After all, how can you possibly know what effect, or conversion lift you want to detect from the experiment?! If you knew that, you wouldn't need to run the experiment to begin with!

Adding insult to injury, things get even more hazy because the MDE is directly tied into your sample size requirements.

The larger the MDE, the smaller the sample size needed to run your experiment. And vice versa. The smaller the MDE, the bigger the sample required for your experiment to be adequately powered.

But if your sample size requirements are tied into your MDE, and you don't know your MDE, how can you possibly know the required sample size either?

The answer is: you calculate them. Both. At the same time.

There are lots of head spinning ways to do so. This article outlines a few.

But, if you're not mathematically inclined, here's the good news. . .

You can use a pre-test analysis calculator, like this one, to do all the hard work for you:

Now, as said, that's the good news!

The bad news is, even a calculator like this one isn't all that intuitive.

So, to help you out, this article breaks down exactly what you need to input into an MDE calculator, with step-by-step directions and screenshots so you'll be completely clear and feel fully confident every step of the way.

Let's dig in:

To work this calculator, you’ll need to know your average weekly traffic and conversion numbers.

If you’re using an analytics platform, like Google Analytics, you’ll be able to easily find this data by looking at your traffic and conversion trends.

In Google’s current Universal Analytics, traffic data can be obtained by going to the Audience/Overview tab:

It’s, typically, best to take a snapshot of at least 3 months to get a broader, or bigger picture view of your audience over time.

For this example, let’s set our time frame from June 1 - Aug. 31.

Now, you can decide to look at these numbers three ways:

- **Users:** the total number of users, or visitors, coming to your site during the date range.
- **New users:** those visitors who come to your site for the first time during that date range.
- **Sessions:** groups of user interactions with your website within a particular timeframe. As this article explains, the same user can have multiple sessions on your website.

Given these differences, calculating the total number of users will probably give you the most accurate indication of your traffic trends.

With these data points in mind, over the 3-month period, this site saw 67,678 users. There are, typically, about 13 weeks in 3 months, so to calculate users per week you’d divide 67,678/13=5,206.

In other words, the site received about 5,206 users/week.

You’d then plug this number into the calculator.

To calculate the number of conversions over this time period, you’ll need to have already set-up conversion goals in Google Analytics. Here’s more information on how to do so.

Assuming you’ve set-up conversion goals, you’ll next assess the number of conversions by going to the Goals/Overview tab, selecting the conversion goal you want to measure for your test, and seeing the number of conversions:

In this example, there were 287 conversions over the 3-month time period which amounts to an average of 287/13=22 conversions/week.

Now, imagine you want to test two variants: version A (the control, or original version) and B (the variant).

You’d now plug the traffic, conversion, and variant numbers into the calculator:

Now you can calculate your baseline conversion rate, which is the rate at which your current (control) version is converting at.

This calculator will automatically calculate your baseline conversion rate for you, based on the numbers above.

However, if you want to confirm the calculations, simply divide the number of goals completed by the traffic which, in this case, is 22 conversions per week/5,206 visitors per week (22/5,206=0.0042). To get a percentage, multiply this amount by 100 (0.0042*100=0.42%).

You’d end up with a baseline conversion rate of 0.42%:
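If you'd like to double-check the calculator's inputs yourself, the weekly averages and baseline conversion rate above can be reproduced with a few lines. This is just a sketch of the arithmetic in this example; the variable names are my own.

```python
# Figures from the 3-month (13-week) example above
period_users = 67_678
period_conversions = 287
weeks_in_period = 13

# Weekly averages, rounded to whole users and conversions
users_per_week = round(period_users / weeks_in_period)
conversions_per_week = round(period_conversions / weeks_in_period)

# Baseline conversion rate = weekly conversions / weekly users
baseline_rate = conversions_per_week / users_per_week

print(f"{users_per_week:,} users/week")
print(f"{conversions_per_week} conversions/week")
print(f"Baseline conversion rate: {baseline_rate:.2%}")
```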

Next, plug in the confidence level and power at which you want to obtain results.

As a general A/B testing best practice, you want a confidence level of +95% and statistical power of +80%:

Based on these numbers, the pre-test sample size calculator is indicating to you that you’ll want to run your test for:

- At least 6 weeks
- With at least 15,618 visitors/variant
- Based on a relative MDE of at least 46.43%
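The visitors-per-variant figure follows directly from the traffic and duration numbers. Here's a small sketch of that step, assuming all weekly traffic enters the test and is split evenly between the variants (the function name is hypothetical). The MDE itself comes from the calculator's own formula, which isn't reproduced here.

```python
def visitors_per_variant(weekly_visitors: int, weeks: int, num_variants: int) -> int:
    """Visitors each variant collects over the test, assuming an even split."""
    return weekly_visitors * weeks // num_variants

# 5,206 users/week, a 6-week test, and 2 variants (A and B)
print(visitors_per_variant(weekly_visitors=5_206, weeks=6, num_variants=2))  # 15618
```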

As a very basic rule of thumb, some optimization experts, like Ronny Kohavi, suggest setting the relative MDE up to a maximum of 5%.

If the experiment isn't powered enough to detect a 5% effect, the test results can't be considered trustworthy.

However, it's also dangerous to go much beyond 5% because, at least in Ronny's experience, most trustworthy tests don't yield more than a 5% relative conversion lift.

As such, for a mature testing organization with large amounts of traffic and an aggressive optimization program, a relative 1-2% MDE is more reasonable and is still reason to celebrate.

In the example shown above, the relative MDE was 46.43%, which is clearly above the 5% best practice.

This MDE indicates traffic is on the lower side and your experiment may not be adequately powered to detect a meaningful effect in a reasonable timeframe.

In this case, if you do decide to proceed with running the test, make sure to follow these guidelines:

- **Calculate the sample size requirements ahead of time.** Make sure you have enough traffic to reach the suggested sample size in an adequate timeframe.
- **Don't stop the experiment early** before you've reached this calculated sample size target -- even if results appear significant earlier.
- **Run the test for the minimum stated testing time period** recommended by the calculator, or at the very least two weeks to round out any discrepancies in user behavior.
- **Consider if the test is truly worth running**, and use the outcome only as an indicator of results, not gospel. Low sample sites (traffic or conversion numbers) are tricky to test on.
- **Focus on making more pronounced changes** that should, hopefully, create a bigger positive impact and have a larger effect on conversions.

Hope this article has been useful for you. Share your thoughts and comments below:

*Last updated September, 2022*

Written by Deborah O'Malley

**Deborah O'Malley** is a top A/B testing influencer who founded GuessTheTest to connect digital marketers interested in A/B testing with helpful resources and fun, gamified case studies that inspire and validate testing ideas.

With a special contribution from Ishan Goel

**Ishan Goel** is a data scientist and statistician. He's currently leading the data science team at Wingify (the parent company of VWO) to develop the statistical algorithms powering A/B testing. An avid reader and writer, Ishan shares his learnings about experimentation on his personal blog, Bagels for Thought.

Special thanks to Ronny Kohavi

**Ronny Kohavi** is an esteemed A/B testing consultant who provided valuable feedback on earlier drafts of this article and raised some of the key points presented through his Accelerating Innovation with A/B Testing class and recent paper on A/B Testing Intuition Busters.

An astounding +364% lift in conversions, an enormous +337% improvement in clickthrough rate, and a cruelly disappointing -60% drop in form submissions.

What do all these jaw-dropping conversion rates have in common?

They’re all extreme results!

And, as this article explains, they’re probably not real.

Because, according to something known as Twyman’s Law, any figure that's interesting or different is usually wrong.

The trustworthiness is suspect.

A great example of this concept comes from Ryan Thomas, co-founder of the optimization agency Koalatative.

He shared a seethingly sarcastic LinkedIn post announcing he had achieved a record-breaking +364% lift running a client A/B test:

Sounds impressive!

But, as Ryan, and others aptly explained, the tiny sample of 17 vs. 11 visitors, with 1 vs. 3 conversions, was so low, the results were incredibly skewed.
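To see just how fragile a lift like this is, here's a quick sketch of the relative lift calculation. It assumes the first number in each pair (17 visitors, 1 conversion) is the control and the second (11 visitors, 3 conversions) is the variant; the function name is my own.

```python
def relative_lift(control_conv: int, control_visitors: int,
                  variant_conv: int, variant_visitors: int) -> float:
    """Relative difference between the variant and control conversion rates."""
    control_rate = control_conv / control_visitors
    variant_rate = variant_conv / variant_visitors
    return (variant_rate - control_rate) / control_rate

# The reported test: 1 conversion from 17 visitors vs. 3 conversions from 11 visitors
print(f"{relative_lift(1, 17, 3, 11):+.0%}")  # +364%

# The same test if the variant had converted just one visitor fewer
print(f"{relative_lift(1, 17, 2, 11):+.0%}")  # +209%
```

A single conversion more or less swings the "lift" by well over a hundred percentage points, which is exactly why such a result can't be trusted.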

This extreme result is, unfortunately, not a one-off case.

In fact, as a learning exercise for experimenters, GuessTheTest recently published a similar case study in which the testing organization claimed to achieve a +337% conversion lift.

That’s massive!

But, taking a closer look at the conversion figures, you can see at just 3 vs. 12, the traffic and conversion numbers were so low, the lift appeared artificially huge:

And that, according to Twyman's Law, makes the test's trustworthiness suspect.

Okay, so you’re probably thinking, yeah, but these examples are of very low traffic tests.

And everyone knows, you shouldn’t test with such low traffic!

True.

But high traffic sites aren’t immune to this issue either.

In fact, in a study recently run for a prominent SEO site -- with thousands of daily visitors -- one test yielded an extremely disappointing -60% drop in conversions.

However, on closer inspection, the seemingly enormous drop was the difference between just 2 vs. 5 conversions.

Although the page had thousands of visitors per variant, very, very few were converting.

And of those that did, there were far too few conversions to know if one version truly outperformed or if the conversion difference was just due to random chance.

The obvious problem with all these tests was that the sample -- the traffic, the conversion numbers, or both -- was so low, the estimate of the lift was unreliable.

It appeared enormous when, in reality, it was just the difference between a few random conversions.

But, the problem is, you can get these kinds of test outcomes and still achieve statistically significant results.

Surprised?

Take this example of 3 vs. 12 conversions based on a sample size of 82 vs. 75 visitors.

As you can see, plugging the numbers into a statistical significance calculator shows the result is indeed significant at a 95% level of confidence with a p-value of 0.009:

Which goes to show: a test can **appear** significant after only a few conversions, but that doesn't mean the result is trustworthy.
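If you want to sanity-check the significance figure for the 3 vs. 12 example yourself, here's a minimal sketch. It assumes a two-sided, pooled two-proportion z-test, which is one common choice for significance calculators; the calculator used here may apply a different method, so treat this as an approximation rather than its exact algorithm.

```python
import math

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# 3 conversions from 82 visitors vs. 12 conversions from 75 visitors
p = two_proportion_p_value(3, 82, 12, 75)
lift = (12 / 75) / (3 / 82) - 1

print(f"p-value: {p:.3f}")
print(f"relative lift: {lift:+.0%}")
```

Under these assumptions, the p-value clears the 0.05 threshold comfortably, even though the whole result rests on a difference of nine conversions.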

How is this outcome possible?

As highly regarded stats guru Ronny Kohavi explains in his class, Accelerating Innovation With A/B Testing, it’s all about the power.

A result can appear statistically significant, and in an underpowered experiment, the lift will be exaggerated.

Power measures the likelihood of accurately detecting a real effect, or conversion difference, between the control and treatment(s), assuming a difference exists.

Power is a function of *delta.*

*Delta* describes the statistical sensitivity or ability to pick up a conversion difference between versions.

This conversion difference is the *minimum effect* size, or smallest conversion difference, you want to detect.

Smaller values make the test more sensitive but require more users.

If these terms all seem a bit confusing, Ishan Goel, lead data scientist at the A/B testing platform VWO’s parent company, Wingify, offers a more relatable, real-life example.

He suggests you can best understand power, and its relationship to delta, by thinking of a thermometer used to detect a fever.

If you have a low fever, a cheap thermometer that's not very sensitive to slight temperature changes might not pick up your mild fever. It's too low-powered.

You need a thermometer that's very sensitive or high-powered.

The same is true in testing.

To accurately detect small conversion differences, you need high power. The higher the power, the higher the likelihood of accurately detecting a real effect, or conversion difference.

But, there’s a trade off. The higher the power, the larger the sample size also needs to be.

In turn, when sample sizes are low, power is reduced.

A test is what's known as *underpowered* when the sample is so low the *effect*, or conversion difference detected, isn't accurate.

All the examples we just saw were of low sample, underpowered tests.

Sure, the results may have been statistically significant. But the conversion rate was artificially skewed because the samples were so low the test wasn't adequately powered.

The sample – whether it be traffic, conversions, or both – was too low to be adequately powered.

The lower the power, the more highly exaggerated the effect.

Statistically significant, low-powered tests happen more than most experimenters would like to admit.

In fact, as highly regarded experimentation expert Ronny Kohavi explains, because of the way statistics works, if you pick the standard alpha (p-value threshold) of 0.05, you’ll get a statistically significant result **at least** 5% of the time – whether there’s a true difference or not.

In those +5% of cases, the estimated effect will be exaggerated.

Here’s a document Ronny created showing this phenomenon:

*With a p-value of 0.05, 1 in 20 experiments will be statistically significant even when there is actually no real difference between the control and treatment.*
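You can watch this happen with a small simulation. The sketch below is not Ronny's document; it's a hypothetical illustration that runs 1,000 simulated A/A tests, where both versions share the same true conversion rate, and counts how often a pooled two-proportion z-test calls the result significant. The conversion rate, traffic, and seed are arbitrary assumptions.

```python
import math
import random

def is_significant(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05) -> bool:
    """Two-sided pooled z-test for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0, 1):
        return False  # no variation at all, so nothing to test
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2)) < alpha

def simulate_aa_tests(true_rate: float = 0.10, visitors: int = 500,
                      runs: int = 1000, seed: int = 42) -> float:
    """Share of A/A tests (no real difference) that still look significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        false_positives += is_significant(conv_a, visitors, conv_b, visitors)
    return false_positives / runs

rate = simulate_aa_tests()
print(f"{rate:.1%} of A/A tests looked significant")
```

The share should hover around 5%, even though, by construction, there is no real difference to find.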

Ronny remarks that, for experienced experimenters running trustworthy tests, it’s rare to see lifts of more than a few percentage points.

In fact, Ronny recalls, in the history of Bing, which ran 10’s of thousands of experiments, only 2 impacted revenue by more than 10%.

He adds that it’s very unlikely that simple design changes, like in the examples above, can create over a 10% lift -- let alone a 20% gain!

It’s a lot more likely the lift is the outcome of a poorly designed, very underpowered experiment.

This phenomenon is known as *the winner’s curse*.

Statistically significant results from under-powered experiments exaggerate the lift, so the so-called "winning result" is not as rosy as initially believed.

The apparent win is a *curse* that becomes more worthy of a cry than a celebration.
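Here's a rough simulation sketch of the winner's curse. All the numbers are hypothetical assumptions: a 4% baseline conversion rate, a true 5% relative lift, and only 2,000 visitors per variant, which is far too few to detect a lift that small. The code keeps only the tests where the variant "won" at the 95% level and looks at the lift those winners reported.

```python
import math
import random
from statistics import mean

def simulate_winning_lifts(base_rate: float = 0.04, true_lift: float = 0.05,
                           visitors: int = 2000, runs: int = 1000, seed: int = 7) -> list:
    """Observed relative lifts of simulated tests that 'won' significantly."""
    rng = random.Random(seed)
    variant_rate = base_rate * (1 + true_lift)
    winning_lifts = []
    for _ in range(runs):
        conv_a = sum(rng.random() < base_rate for _ in range(visitors))
        conv_b = sum(rng.random() < variant_rate for _ in range(visitors))
        pooled = (conv_a + conv_b) / (2 * visitors)
        if conv_a == 0 or pooled == 0:
            continue  # skip degenerate runs with nothing to compare
        std_err = math.sqrt(pooled * (1 - pooled) * 2 / visitors)
        z = (conv_b - conv_a) / visitors / std_err
        if z > 1.96:  # significant at the 95% level, in the variant's favor
            winning_lifts.append(conv_b / conv_a - 1)
    return winning_lifts

lifts = simulate_winning_lifts()
print(f"True lift: 5%. Average lift among {len(lifts)} 'winners': {mean(lifts):.0%}")
```

Because an underpowered test can only reach significance when the observed difference is large, every "winner" it produces overstates the true 5% lift.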

Great. So, other than not running an experiment, how do you overcome the pitfalls of underpowered, low sample tests? 🤔

To answer this question, we’ve turned to several experts.

Here’s what they advise:

According to Ronny, the first and most important step is to do a power calculation before running the test.

Remember, power is the percentage of time the minimum effect will be detected, assuming a conversion difference actually exists.

A power of 0.80 (80%) is the standard.

This amount means you’ll successfully detect a meaningful conversion difference at least 80% of the time.

As such, there's only a (0.20) 20% chance of missing this effect and ending up with a false negative. A risk we’re willing to take.

To calculate power, you need to:

- Estimate the variance of the conversion metric from historical data. For a conversion rate, the variance is the historical conversion rate*(1-historical conversion rate).
  - For example, if historical data shows a 4% (0.04) conversion rate, then the variance is 0.04*(1-0.04) = 0.04*0.96 = 0.0384.

- Decide on the absolute delta you want to detect by multiplying the relative lift by the conversion rate.
  - For example, let's assume a 4% (0.04) conversion rate and that you want to detect a 5% (0.05) relative lift. Your absolute delta is (conversion rate*relative lift) 0.04*0.05 = 0.002.

- Plug these numbers into the power formula. For 80% power and a p-value threshold of 0.05, the formula is N = 16*variance/delta^2, or equivalently N = 16*P*(1-P)/(P*RD)^2, where P is the conversion rate and RD is the relative lift.
  - Using these numbers as an example, 16*0.0384/0.002^2 = 153,600 users

You now know you need at least 153,600 users, per variant, for the experiment to be adequately powered.

The lower the delta, the more users needed since you're trying to detect a smaller effect.

The opposite is also true. The higher the delta, the fewer users needed to detect a larger effect.
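To make the relationship between delta and sample size concrete, here's a short sketch of the rule-of-thumb formula above, applied to a 4% conversion rate across a few relative lifts. The function name and the list of lifts are my own choices for illustration.

```python
def users_per_variant(conversion_rate: float, relative_lift: float) -> float:
    """Rule-of-thumb sample size for 80% power at a 0.05 significance threshold."""
    variance = conversion_rate * (1 - conversion_rate)
    delta = conversion_rate * relative_lift  # absolute delta to detect
    return 16 * variance / delta ** 2

# Same 4% conversion rate, different relative lifts to detect
for lift in (0.01, 0.02, 0.05, 0.10):
    n = users_per_variant(0.04, lift)
    print(f"{lift:.0%} relative lift -> {n:,.0f} users per variant")
```

The 5% row reproduces the 153,600 users calculated above. Halving the delta quadruples the sample needed, because delta is squared in the denominator.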

If this power calculation comes across as complex, the good news is, you can come at it another way and instead first calculate your required sample size.

After all, there are many ways to skin a cat.

But pay attention here: the key is to calculate your sample size AHEAD of running the experiment.

If you don’t, you may fall into the trap of stopping the test early if results appear significant -- even if the study is underpowered.

So. . . just how large of a sample do you need so your test isn't underpowered?

Here’s where it gets tricky. In true CRO style, it depends.

Some experimenters will tell you that, as a basic rule of thumb, you need at least 1,000 visitors per variant with at least 100 conversions per variant.

Others will say you need a whole lot more.

In this insightful Twitter post, Carl Weische, founder of the eCommerce testing agency, Accelerated – who has run thousands of successful experiments for clients – claims you need 20,000-50,000 users per variant and at least 1,000 conversions per variant:

Talk to other experimenters and they may tell you otherwise.

So, really, across most experimenters, there’s no clear consensus. . .

Unless you do the math. Then, according to Ronny, you just need to know your variance and delta where the variance is p*(1-p) for conversion rate p.

But if formulas and math calculations leave your head spinning a bit, take heart.

You can also use a sample size calculator, like this one, to do all the hard work for you:

Here, you’ll just need to input:

- **Baseline conversion rate:** the current conversion rate of the control.
- **Minimum detectable effect (MDE):** the smallest conversion rate lift you expect to see.
- **Power (1−β):** the percentage of time the minimum effect will be detected, assuming an effect exists.
- **Significance level alpha (α):** the percentage of time a difference will be detected, assuming it doesn’t exist. With a p-value of 0.05, 5% of experiments will yield a false positive showing a difference when one doesn’t really exist.

So, in this example, assuming a baseline conversion rate of 4%, based on an MDE of 5%, with a standard power of 80% and a significance level of 5%, you’ll need a sample of 151,776 visitors per variant.

*Note, Evan Miller's calculator uses a slightly different value than 16 as the leading constant, so the estimates from his calculator are also slightly smaller.*
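For the curious, here's a sketch of one standard textbook version of the two-proportion sample size formula, which uses exact normal quantiles instead of the rounded constant 16. Whether the calculator implements exactly this formula is an assumption on my part, but for these inputs it lands very close to the figure quoted above.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> float:
    """Two-sided, two-proportion sample size using exact normal quantiles."""
    delta = baseline * relative_mde            # absolute difference to detect
    treated = baseline + delta                 # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    sd_null = math.sqrt(2 * baseline * (1 - baseline))
    sd_alt = math.sqrt(baseline * (1 - baseline) + treated * (1 - treated))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / delta ** 2

n = sample_size_per_variant(baseline=0.04, relative_mde=0.05)
print(f"{n:,.0f} visitors per variant")

# The leading constant implied by exact quantiles, versus the rounded 16
constant = 2 * (NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)) ** 2
print(f"leading constant: {constant:.2f}")
```

The implied constant works out to roughly 15.7 rather than 16, which is why this kind of calculation gives a slightly smaller sample size than the rule of thumb.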

The problem is, by relying on this calculator to determine the sample size, you now also need to consider and input the Minimum Detectable Effect, MDE *(which is referred to as delta in the power formula above).*

The MDE sounds big and fancy, but the concept is actually quite simple when you break it down. It's the:

- *Minimum* = smallest
- *Effect* = conversion difference
- *Detectable* = you want to see from running the experiment

But, now it becomes a bit of a catch-22 because sample size requirements change based on the MDE you input.

The lower the MDE, the greater the sample needed. And vice versa.

So, the challenge becomes setting a realistic MDE that will accurately detect small differences between the control and treatments you’re testing, based on a realistic sample.

Clearly, the larger the sample, the more time it will take to obtain it. So you also need to consider traffic constraints and what’s realistic for your site.
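One way to ground this is to flip the rule-of-thumb formula from earlier (N = 16*variance/delta^2) around and ask: given the sample I can realistically collect, what's the smallest relative lift I could detect? The sketch below does that rearrangement. The traffic numbers are hypothetical, and a dedicated calculator will use slightly different constants.

```python
import math

def detectable_relative_mde(conversion_rate: float, users_per_variant: float) -> float:
    """Smallest relative lift detectable at 80% power and alpha of 0.05 (rule of thumb).

    Rearranged from N = 16 * p * (1 - p) / (p * relative_mde) ** 2.
    """
    return math.sqrt(16 * (1 - conversion_rate) / (conversion_rate * users_per_variant))

# Hypothetical site: 4% conversion rate, 10,000 visitors/week,
# 2 variants, and a 6-week test -> 30,000 visitors per variant
mde = detectable_relative_mde(conversion_rate=0.04, users_per_variant=30_000)
print(f"Smallest detectable relative lift: {mde:.1%}")
```

If the number that comes back is well above 5%, the site likely doesn't have the traffic to run a trustworthy test on that metric in that timeframe.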

As a result, the MDE also presents a bit of an "it depends" scenario.

Every test you run may have a different MDE, or each may have the same MDE. There's no hard rule. It's based on your testing needs.

The MDE may be calculated from historical data in which you've observed that, in general, most tests tend to achieve a certain effect, so this one should too.

Or it can be a number you choose, based on what you consider *worth it* to take the time and resources to run an experiment.

For example, a testing agency may, by default, set the MDE at 5% because that's the minimum threshold needed to declare a test a winner for a client.

In contrast, a mature testing organization may set the MDE at 3% because, through ongoing optimization, eking out gains any higher would be unrealistic.

As a rule of thumb, Ronny suggests setting the MDE to a maximum of 5% for online A/B tests, but lowering it as the organization grows or gets more traffic.

For an established company, or mature testing organization, a 1-2% MDE is realistic. But it's hard to imagine an executive that doesn't care about big improvements to the business, so 5% is a reasonable upper bound.

This 5% upper limit exists because if you don't have the power to detect at least a 5% effect, the test results aren’t trustworthy.

However, take note, if you’re thinking your MDE should be 10% or higher, Ronny remarks it ain’t likely going to happen.

Most trustworthy tests -- that are properly powered -- just don't achieve that kind of lift.

In fact, Ronny recalls, across the thousands of experiments he was involved in, including at Bing, the sum of all relevant tests, over the year, was targeted at 2% improvement. A 3% improvement was a reason for true celebration!

If homing in on an MDE sounds like a highly speculative exercise that's going to leave you super stressed, don't worry!

You can simply use a pre-test analysis calculator like this one to determine the MDE for you:

And, if looking at this screenshot leaves you wondering how you’re possibly supposed to calculate all these inputs, here’s an in-depth GuessTheTest article outlining exactly how to use this calculator.

*Note, this calculator is similar to Evan Miller's, referenced above, but gives you the MDE as a specific variable with the number of weeks you'll need to run the test making it a useful tool to calculate the MDE. The constant for this calculator is also different from Evan Miller's, so if you compare directly, you might get slightly different results.*

With your MDE and sample size calculations worked out, the trick, then, is to:

- Make sure you have enough traffic to reach the suggested sample size in an adequate time frame.
- Not stop the experiment early before you've reached this calculated sample size target -- even if results appear significant earlier.
- Run the test for the minimum stated timeframe, or at the very least two weeks to round out any discrepancies in user behavior.

If your pre-test sample size calculator shows you need thousands of visitors, and you only get hundreds per month, you'll want to consider whether the experiment is truly worth running since your data won't yield truly valid or significant results.

Additionally, Ronny cautions that, if you decide to proceed, a low sample test may give you statistically significant results, but the results may be a false positive and the lift highly exaggerated.

As this diagram (originally published in Ronny’s *Intuition Busters* paper) shows, when power is any lower than 10%, the effect, or conversion lift, detected is incorrect up to half of the time!

So be aware of this pitfall before going into low sample testing.

And if your sample is low, but you still decide to test, Ishan Goel recommends focussing on making more pronounced changes that will, hopefully, create a bigger impact and create a larger, more measurable effect.

Conversely, as your traffic increases, you’ll get the luxury of testing more nuanced changes.

For instance, Google ran the famous 41 shades of blue experiment which tested the perfect shade of blue to get users clicking more. The experiment was only possible due to large traffic resulting in high-powered testing.

Assuming you plan ahead, pre-calculate your required sample size and input the MDE but still get surprising results, what do you do?

According to Ronny, in his paper *A/B Testing Intuition Busters*, extreme or surprising results require a lower p-value to override the prior probability of an extreme result.

Typically, a p-value of <0.05 is considered adequate to confidently declare statistically significant results.

The lower the p-value, the more evidence you have to show the results didn't just occur through random chance.

As Ronny explains in this post on p-values and surprising results, the more extreme the result, the more important it is to be skeptical, and the more evidence you need before you can trust the result.

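An easy way to get a lower p-value, Ronny details, is to do a replication run and combine the p-values. In other words, repeat the experiment and statistically combine the p-values obtained each time the experiment was run. Note that combining is not the same as simply averaging: a meta-analysis method, such as Fisher's or Stouffer's, is used.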

To compute the combined p-values, a tool like this one can be used.

As this example shows, if the experiment was run twice, and a p-value of 0.0044 was achieved twice, the combined p-value would be 0.0001. With this p-value, you could then declare, with greater certainty, that the results are truly significant:
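For a feel of how combining works, here is a minimal sketch using Stouffer's method, one standard way to combine p-values from independent replications. Which method the linked tool uses isn't stated here, so treat the choice of Stouffer's method, and the function name, as assumptions. The sketch also assumes the p-values are one-sided and point in the same direction.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(p_values):
    """Combine one-sided p-values from independent runs (Stouffer's method)."""
    std_normal = NormalDist()
    # Convert each p-value to a z-score
    z_scores = [std_normal.inv_cdf(1 - p) for p in p_values]
    # Sum the z-scores and rescale so the result is again standard normal
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    # Convert the combined z-score back to a one-sided p-value
    return 1 - std_normal.cdf(combined_z)


combined = stouffer_combined_p([0.0044, 0.0044])
print(f"Combined p-value: {combined:.4f}")
```

With two runs at 0.0044, this approach gives a combined p-value that rounds to 0.0001, in line with the example above. A plain average of the two p-values would have stayed at 0.0044.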

But even with all these safety checks in place, you never can be quite sure test results will be completely accurate.

As Ishan aptly points out, it can be valuable to second-guess all findings and question anything that looks too good to be true.

Because, as Ishan says, “in experimentation, you can find ways to disprove the validity of a result, but you can never find a way to prove the validity of the result.”

Or, as Maurice Beerthuyzen of the agency ClickValue puts it, *it’s possible to put a cat in a tumble dryer, but that doesn’t mean it gives the right output:*

So, the moral of the story is, don’t put cats in dryers. And don’t do low sample, underpowered testing – at least not without following all these checkpoints!

Hope this article has been useful for you in explaining the pitfalls of low sample testing – and how to avoid them.

Share your thoughts and comments below:

Last updated May, 2023

Written by Deborah O'Malley & Timothy Chan

With special thanks to Ronny Kohavi for providing feedback on an earlier draft of this article. Ronny's Accelerating Innovation With A/B testing course provided much of the inspiration for this piece and is a must-take for every experimenter!

Awesome. You’ve just run an A/B test, crunched the numbers, and achieved a statistically significant result.

It’s time to celebrate. You’ve got another test win. 🙂

Or have you?

As this article explains, understanding and calculating statistical significance is actually quite complex.

To properly call a winning (or losing) test, you need to understand what a statistically significant result really means.

Otherwise, you’re left making incorrect conclusions, random decisions, or money-losing choices.

Many experimenters don’t truly know what statistical significance is or how to derive a statistically significant test result.

So, in plain English, this guide is here to set it all straight for you so you can declare and interpret a statistically significant A/B test with accuracy and ease.

Before we get too far in, it’s important to lay the groundwork so you’re clear on a few important points:

Because this article has been written with the express purpose of simplifying a complex topic, we’ve extracted only the most meaningful concepts to present to you.

We’re trying to avoid bogging you down with all the nitty, gritty details that often cause more confusion than clarity.

As such, this article does not offer an in-depth examination of every aspect of statistics. Instead, it covers only the top topics you need to know so you can confidently declare a statistically significant test.

Aspects like sample size, power, and its relationship to Minimum Detectable Effect (MDE) are separate topics not included in this article. While all tied to getting statistically significant results, these concepts need a whole lot more explanation outside of statistical significance.

If you're interested, you can read more about sample size here, and its relationship to power and MDE here.

Scientists use statistical significance to evaluate everything from differences in bunny rabbit populations to whether a certain dietary supplement lowers obesity rates.

That’s all great. And important.

But as experimenters interested in A/B testing, this article focuses on the most common evaluation criteria in online testing – and the one that usually matters most: conversion rates.

Although, in other scenarios there may be a more complicated Overall Evaluation Criterion (OEC), in this article, anytime we talk about a metric or result, we’re referring, specifically, to conversion rates. Nothing else.

So, you can drop bunny rabbits and diet pills out of your head. At least for now.

And while conversion rates are important, what's usually key is increasing them!

Therefore, in this article, we focus only on *one-sided tests* – which is a fancy stats way of saying, we’re looking to determine if the treatment has a higher conversion rate than the control.

A one-sided test, also known as a one-tailed test, measures conversions going only one way, compared to the control. So either just up OR just down.

In contrast, a two-sided test measures conversion results up AND down, compared to the control.

One-sided vs. two-sided tests are a whole topic to explore in and of themselves.

In this article, we’ve focussed just on one-sided tests because they're used for detecting only a positive (or negative) result.

But, you should know, there’s no clear consensus on whether a one-sided or two-sided test is best to use in A/B testing. For example, this in-depth article states one-tailed tests are better. But, this one argues two-tailed tests are best.

As a general suggestion, if you care only whether the test is better than the control, a one-sided test will do. But if you want to detect whether the test is better or worse than the control, you should use a two-sided test.

It’s worth mentioning that there are other types of tests better suited for other situations. For example, non-inferiority tests check that a test is not any worse than the control.

But, this article assumes that, at the end of the day, what most experimenters really want to know is: did the test outperform the control?

Statistical significance helps us answer this question and lets us accurately identify a true conversion difference not just caused by random chance.

Great!

Now that we’ve got all these stipulations out of the way, we’re ready to have some fun learning about statistical significance for A/B testing.

Let’s dig in. . .

Alright, so what is statistical significance anyway?

Google the term. Go ahead, we dare you!

You’ll be bombarded with pages of definitions that may not make much sense.

Here’s one: according to Wikipedia, “*in statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. More precisely, a study's defined significance level, denoted by alpha, is the probability of the study rejecting the null hypothesis, given that the null hypothesis is true; and the p-value of a result, is the probability of obtaining a result at least as extreme, given that the null hypothesis is true*.”

Say what?!?

If you find this definition seems confusing, you’re not alone!

Here’s what’s important to know: a statistically significant finding is the standard, accepted way to declare a winning test. It provides **evidence** to suggest you’ve found a winner.

This evidence is important!

Because A/B testing itself is imperfect. With limited evidence, there’s always the chance you'll get random, unlucky results misleading you to wrong decisions.

Statistical significance helps us manage the risk.

To better understand this concept, it can be helpful to think of a coin flip analogy.

Imagine you have a regular coin with heads and tails. You make a bet with your friend to flip a coin 5 times. Each time it lands on heads, you win $20. But every time it lands on tails, you pay them $20.

Sounds like an alright deal. You can already see the dollar signs in your eyes.

But what happens when you flip the coin and 4 out of 5 tosses it lands on tails? Your friend is happy.

But you're left pretty upset. You might start to wonder if the coin is rigged or it’s just bad luck.

In this example, the unfavorable coin flips were just due to random chance. Or what we call, in fancy stats speak, statistical noise.
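To put a number on "just bad luck," here is a tiny sketch that works out how often a perfectly fair coin lands on tails at least 4 times out of 5. The function name is made up for illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more 'successes' in n independent flips."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance of 4 or 5 tails in 5 flips of a fair coin
print(prob_at_least(4, 5))  # prints 0.1875
```

So, even with nothing rigged, this unlucky streak shows up nearly 1 time in 5. That's plenty often enough to fool you into suspecting the coin.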

While the results of this simple coin bet may not break the bank, in A/B testing, you don’t want to be at the whim of random chance because incorrectly calling A/B tests may have much bigger financial consequences.

Fortunately, you can turn to statistical significance to help you minimize the risk of poor decisions.

How?

To fully understand, you’re going to need to wrap your head around a few fancy stats definitions and their relationship to each other. These terms include:

- **Bayesian and frequentist testing**
- **Null hypothesis**
- **Type I and Type II errors**
- **Significance level alpha (α)**
- **P-value**

While these terms may not make much sense yet, don’t worry.

In the sections below, we’re going to clearly define them and show you how each concept ties into statistical significance so you can accurately declare a winning A/B test with absolute confidence.

Let’s start with the basics.

In A/B testing, there are two main statistical frameworks, Bayesian and Frequentist statistics.

Don’t worry. While these words sound big and fancy, they’re nothing to be intimidated by.

They simply describe the statistical approach taken to answer the question: does one A/B test version outperform another?

Bayesian statistics measures how likely, or probable, it is that one version performs better than another.

In a Bayesian A/B test, results are NOT measured by statistical significance.

This fact is important to realize because popular testing platforms like VWO's SmartStats mode and Google Optimize currently use the Bayesian framework.

So if you’re running a test on these platforms, know that, when determining if you have a winner, you’ll be evaluating the probability of the variant outperforming – not whether the result is statistically significant.

Statistical significance, however, is entirely based on frequentist statistics.

And, as a result, frequentist statistics takes a comparatively different approach.

Using the frequentist method is more popular and involves fewer assumptions. But, the results are more challenging to understand.

Which is why this article is here for you.

In A/B testing, frequentist statistics asks the question, “is one version better than the other?”

To answer this question, and declare a statistically significant result, a whole bunch of checkpoints have to be met along the way, starting with the null hypothesis.

Say what? What the heck is a **null hypothesis**?

To better understand this term, let’s break it down into its two words, starting with **hypothesis**.

Hypothesis testing is the very basis of the scientific method.

Wikipedia defines **hypothesis** as a proposed explanation for a phenomenon. But, more simply stated, we can call it an educated guess that can be proven wrong.

In A/B testing, you make an educated guess about which variant you think will win and try to prove, or disprove, your belief through testing.

Through the hypothesis, it's assumed the current state of scientific knowledge is true. And, until proven otherwise, everything else is just speculation.

Which means, in A/B testing, you need to start with the assumption that the treatment is not better than the control, so you can't count on it winning.

Instead, you have to assume the real conversion difference between variants is *null* or nothing at all; it's ≤ *zero*.

And you've gotta stick with this belief until there's enough evidence to reject it.

Without enough evidence, you fail to reject the null hypothesis, and, therefore, deem the treatment is not better than the control. (Note, you can only reject or fail to reject the null hypothesis; you can't accept it).

There’s not a large enough conversion rate difference to call a clear winner.

However, as experimenters, our goal is to decide whether we have sufficient evidence – known as a statistically significant result – to reject the null hypothesis.

With strong evidence, we can conclude, with high probability, that the treatment is indeed better than the control.

Our amazing test treatment has actually outperformed!

In this case, we can reject the null hypothesis and accept an **alternative** viewpoint.

This alternate view is aptly called the **alternative hypothesis**.

When this exciting outcome occurs, it’s deemed noteworthy and *surprising*.

How surprising?

To determine that answer, we need to calculate the probability of whether we made an error in rejecting, or failing to reject the null hypothesis, and in calling a winner when there really isn’t one.

Because as mere mortals struggling to understand all this complex stats stuff, it’s certainly possible to make an incorrect call.

In fact, it happens so often, there are names for these kinds of mistakes. They’re called type I and type II errors.

A **type I error** occurs when the null hypothesis is rejected – even though it was correct and shouldn’t have been.

In A/B testing, it occurs when the treatment is incorrectly declared a winner (or loser) but it’s not – there's actually no real conversion difference between versions; you were misled by statistical noise, an outcome of random chance.

This error can be thought of as a *false positive* since you’re claiming a winner that’s not there.

While there’s always a chance of getting misleading results, sound statistics can help us manage this risk.

Thank goodness!

Because calling a test a winner – when it’s really not – is dangerous.

It can send you chasing false realities, push you in the wrong direction and leave you doubling down on something you thought worked but really didn’t.

In the end, a type I error can drastically drag down revenue or conversions. So, it’s definitely something you want to avoid.

A **type II error** is the opposite.

It occurs when you incorrectly fail to reject the null hypothesis and instead reject the alternative hypothesis by declaring that there’s no conversion difference between versions – even though there actually is.

This kind of error is also known as a *false negative*. Again, a mistake you want to avoid.

Because, obviously, you want to be correctly calling a winner when it’s right there in front of you. Otherwise, you’re leaving money on the table.

If it’s hard to keep these errors all straight in your head, fear not.

Here’s a diagram summarizing this concept:

*Source: Reddit*

Making errors sucks. As an experimenter, your goal is to try to minimize them as much as possible.

Unfortunately, given a fixed number of users, reducing type I errors will end up increasing the chances of a type II error. And vice versa. So it’s a real tradeoff.

But there is, again, a silver lining in the clouds. Because in statistics, you get to set your risk appetite for type I and II errors.

How?

Through two stats safeguards known as alpha (α) and beta (β).

To keep your head from spinning, we’re only gonna focus on alpha (α) in this article. Because it’s most critical to accurately declaring a statistically significant result.

While Beta (β) is key to setting the power of an experiment, we’ll assume the experiment is properly powered with beta ≤ 0.2, or power of ≥ 80%. Cause, if we delve into beta's relationship with significance, you may end up with a headache. So, we'll skip it for now.

With alpha (α) top of mind, you may be wondering what function it serves.

Well, **significance level alpha (α)** helps us set our risk tolerance of a type I error.

Remember, a type I error, also known as a false positive, occurs when we incorrectly reject the null hypothesis, claiming the test treatment converts better than the control. But, in actuality, it doesn’t.

Since type I errors are bad news, we want to mitigate the risk of this mistake.

We do so by setting a cut-off point at which we’re willing to accept the possibility of a type I error.

This cut-off point is known as significance level alpha (α). But most people often just call it *significance* or refer to it as *alpha* (denoted *α*).

Experimenters can choose to set α wherever they want.

The closer it is to 0, the lower the probability of a type I error. But, as mentioned, a low α is a trade-off. Because, in turn, the higher the probability of a type II error.
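Here is a rough sketch of that trade-off, assuming a one-sided test on conversion rates, a normal approximation, and made-up numbers: a 4% control rate, a true 4.2% treatment rate, and 150,000 visitors per variant. The function name and inputs are illustrative assumptions only.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error_rate(alpha, p_control, p_treatment, n_per_variant):
    """Approximate beta for a one-sided two-proportion z-test."""
    std_normal = NormalDist()
    # Standard error of the difference between the two conversion rates
    se = sqrt(
        p_control * (1 - p_control) / n_per_variant
        + p_treatment * (1 - p_treatment) / n_per_variant
    )
    # Cut-off the observed z-score must clear to be called significant
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Power: chance the test clears the cut-off when the lift is real
    power = std_normal.cdf((p_treatment - p_control) / se - z_crit)
    return 1 - power


for alpha in (0.10, 0.05, 0.025, 0.01):
    beta = type_ii_error_rate(alpha, 0.04, 0.042, 150_000)
    print(f"alpha = {alpha:<5} -> type II error rate ~ {beta:.2f}")
```

With the sample size held fixed, every step down in α pushes the type II error rate up. The only way to lower both at once is to collect more data.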

So, it’s a best practice to set α at a happy middle ground.

A commonly accepted level used in online testing is 0.05 (α = 0.05) for a two-tailed test, and 0.025 (α = 0.025) for a one-tailed test.

For a two-tailed test, **this level means we accept a 5% (0.05) chance of a type I error**/false positive, or of incorrectly calling a winner (by rejecting the null hypothesis) when there isn’t one.

In turn, when the null hypothesis is correct, and the test treatment is indeed no better than the control, there’s a 95% probability we’ll correctly avoid calling a winner. Whether we can detect a real effect when one exists is a separate matter: that depends on the experiment not being under-powered and having a large enough sample size to adequately detect a meaningful effect in the first place.

That’s it.

That’s all α tells us: the probability of making a type I error/false positive, or incorrectly calling a winner, assuming the null hypothesis is correct and there is no real conversion difference between variants.

That said, it’s important to clear up a couple of misconceptions:

At α = 0.05, the chance of making a type I error is not 5%; it’s only 5% **if** the null hypothesis is correct. Take note because a lot of experimenters miss the last part – and incorrectly believe it means there's a 5% chance of making an error, or wrong decision.

Also, a 5% significance level does not mean there’s a 5% chance of finding a winner. That’s another misconception.

Some experimenters extrapolate even further and think this result means there's a 95% chance of making a correct decision. Again, this interpretation is not correct. Without introducing subjective bias, it’s very difficult to know the probability of making a right or wrong decision. Which is why we don’t go down this route in stats.
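One way to see what α does, and doesn't, promise is to simulate a batch of A/A tests, where both groups share the same true conversion rate, so the null hypothesis holds by construction. The sketch below assumes a one-sided pooled z-test with α = 0.05. The 10% conversion rate, sample sizes, and function names are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(conv_c, n_c, conv_t, n_t):
    """P-value for H0: treatment is no better than control (pooled z-test)."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0  # no variation in the data, so no evidence either way
    z = (conv_t / n_t - conv_c / n_c) / se
    return 1 - NormalDist().cdf(z)


def simulate_false_positive_rate(true_rate=0.10, n=500, runs=2000, alpha=0.05, seed=1):
    """Share of A/A tests that get (wrongly) called winners."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both groups convert at the same true rate: there is no real winner
        conv_c = sum(rng.random() < true_rate for _ in range(n))
        conv_t = sum(rng.random() < true_rate for _ in range(n))
        if one_sided_p_value(conv_c, n, conv_t, n) <= alpha:
            false_positives += 1
    return false_positives / runs


rate = simulate_false_positive_rate()
print(f"Share of A/A tests wrongly called winners: {rate:.3f}")
```

The share should hover around 5%, because every one of these simulated tests has a true null. It says nothing about how often you're wrong across real tests, where you don't know which nulls are true.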

So, if α is the probability of making a type I error, assuming the null hypothesis is correct, how do we possibly know if the null hypothesis is accurate?

To determine this probability, we rely on alpha’s close sibling, p-value.

According to Wikipedia, **p-value** is the probability of the test producing a result at least as extreme as the one observed in your data, assuming the null hypothesis is true.

But, if that definition doesn’t help much, don’t worry.

Here’s another way to think of it:

P-value tells us how likely it is that the outcome, or a more extreme result, would occur if the null hypothesis is true.

Remember, the null hypothesis assumes the test group’s conversion rate is no better than the control group’s conversion rate.

So we want to know the likelihood of results if the test variant is truly no better than the control.

Why?

Because, just like a coin toss, all outcomes are possible. But some are less probable.

P-value is how we measure this probability.

In fact, some fun cocktail trivia knowledge for you, the “*p*” in *p*-value stands for *p*robability!

And as you can probably imagine, the *value* part of p-value reflects that this probability is expressed as a specific value.

Well, guess what?

When the p-value is less than or equal to α (p ≤ α), it means the chance of getting the result is really low – assuming the null hypothesis is true. And, if it’s really low, well then, the null hypothesis must be incorrect, with high probability.

So we can reject the null hypothesis!

In rejecting the null hypothesis, we accept the less likely alternative: that the test variant is truly better than the control; our results are not just due to random chance or error.

Hallelujah! We’ve found a winner. 🥳

An unusual and surprising outcome, the result is considered significant and noteworthy.

Therefore, a p-value of ≤0.05 means the result is **statistically significant**.
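If you're curious what this check looks like in practice, here is a minimal sketch of a one-sided test on conversion rates. It assumes a pooled two-proportion z-test with a normal approximation, which is one common approach but not necessarily what your testing platform uses. The visitor and conversion numbers are hypothetical, and the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_z_test(visitors_c, conversions_c, visitors_t, conversions_t, alpha=0.05):
    """Test H0: the treatment converts no better than the control."""
    rate_c = conversions_c / visitors_c
    rate_t = conversions_t / visitors_t
    # Under the null hypothesis both groups share one conversion rate
    pooled = (conversions_c + conversions_t) / (visitors_c + visitors_t)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_c + 1 / visitors_t))
    z = (rate_t - rate_c) / se
    # Chance of a lift at least this large if the null hypothesis is true
    p_value = 1 - NormalDist().cdf(z)
    return p_value, p_value <= alpha


# Hypothetical test: 4.0% control vs. 4.6% treatment, 10,000 visitors each
p_value, significant = one_sided_z_test(10_000, 400, 10_000, 460)
print(f"p-value = {p_value:.4f}, significant: {significant}")
```

In this made-up example, the p-value comes out at roughly 0.02, below the 0.05 cut-off, so the null hypothesis would be rejected.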

However, while a p-value of ≤0.05 is, typically, accepted as significant, many data purists will balk at this threshold.

They’ll tell you a significant test should have a p-value ≤0.01. And some data scientists even argue it should be lower.

However low you go, it seems everyone can agree: the closer the p-value to 0, the stronger the evidence the conversion lift is real – not just the outcome of random chance or error.

Which, in turn, means you have stronger evidence you’ve actually found a winner.

Or written visually: ↓ p-value, ↑ significant the finding.

And that’s really it!

A very simplified, basic, incredibly stripped down way to tell you the essentials of what you absolutely need to know to declare a statistically significant A/B test.

Once it’s made clear, it’s really not that hard. Is it?

Of course, there’s SO MUCH we’ve left out. And much, much more to explain to properly do the topic justice.

So consider this article your primer. There’s plenty more to learn. . .

But, now that you more clearly understand the basics, you probably get how the null hypothesis ties into a type I error, based on significance level α, expressed as a p-value.

So it should be clear:

When calling a result “statistically significant,” what you’re really saying in stats-speak is: the test showed a statistically significantly better conversion rate than the control. You know this outcome occurred because the p-value is less than the significance level (α), which means the test group’s higher conversion rate was unlikely under the null hypothesis.

In other words, the test variant does, in fact, seem to be better than the control.

You’ve found a winner! 🙂

Although most standard A/B testing platforms may declare a statistically significant winning test for you, not all get it right.

So it’s strongly advocated that you verify test results yourself.

To do so, you should use a test validation calculator, like this one from AB Testguide.

Simply plug in your visitor and conversion numbers, declare if your hypothesis was one- or two-sided, and input your desired level of confidence.

If the result is statistically significant, and you can confidently reject the null hypothesis, you’ll get a notice with a nicely visualized chart. It will look something like this:

If the result is not significant, you’ll see a similar notification informing you a winning test was not observed. It will look like this:

However, an important word of caution: just because your test result is statistically significant doesn’t always mean it’s truly trustworthy.

There are lots of ways to derive a statistically significant result by “cheating.” But that’s a whole topic for another conversation. . .

In essence, evaluating statistical significance in A/B testing asks the question: is the treatment actually better than the control?

To determine the answer, we assume the null hypothesis: that the treatment is no better than the control. So there is no winner – until proven otherwise.

We then use significance level alpha (α) to safeguard against the probability of committing a Type I error and to set a reasonable bar for the proof we need.

P-value works with α as a truth barometer. It answers the question: if the null hypothesis is true, how likely is it the outcome will occur?

The answer can be stated with a specific value.

A p-value lower than or equal to α, usually set at 0.05, means the outcome is unlikely due to random chance. The result is surprising and, therefore, significant.

The lower the p-value, the more significant the finding.

A statistically significant result shows us: the result is unlikely due to just random chance or error under the null hypothesis. There’s strong evidence to reject the null hypothesis and declare a winner.

So we can take our winning result and implement it with confidence, knowing that it will likely lift conversions on our website.

Without a statistically significant result, we’re left in the dark.

We have no way to know whether we’ve actually, accurately called a winner, made an error, or achieved our results just due to random chance.

That’s why understanding what statistical significance is and how to apply it is so important. We don’t want to implement supposed winners without strong evidence and rigour.

Especially when money, and possibly our jobs, are on the line.

Currently, statistical significance is the standard, accepted way to declare a winner in frequentist A/B testing. Whether it should be or not is a whole other topic. One that’s up for serious debate.

But for now, statistical significance is what we use. So apply it. Properly.

Congrats! You’ve just made it through this article, and have hopefully learned some things along the way.

Now, you deserve a nice break.

Grab yourself a bubble tea, kick your feet up, and breathe out a sigh of relief.

The next time you go to calculate a winning test, you’ll know exactly what to look for and how to declare a statistically significant result. If you’ve actually got one. 😉

Hope this article has been helpful for you. Please share widely and post your questions or comments in the section below.

- **Statistical significance:** a type of result that validates the conversion difference between variants is real – not just due to error or random chance.
- **Frequentist testing:** a statistical method of calculation used in traditional hypothesis-based A/B testing in which you aim to prove or disprove the null hypothesis. Attempts to answer the question: Is one variant better than another?
- **Null hypothesis:** in a one-sided hypothesis, the assumption the test group is no better than the control. Your aim is to reject *the* null hypothesis.
- **Type I error:** also known as a false positive. Occurs when the null hypothesis is incorrectly rejected and you claim the treatment is better when, really, it isn’t. What appears a winner is just statistical noise or random chance.
- **Type II error:** also known as a false negative. Occurs when the treatment is actually better than the control, but you fail to reject the null hypothesis claiming no conversion difference.
- **Significance level alpha (α):** the probability of a type I error. The standard significance level (α) = 0.05 for a two-tailed t-test. For a one-tailed t-test, 0.025 is generally recommended.
- **P-value:** A measure of how likely a result could have occurred under the null hypothesis. It is used to safeguard against making a decision based on random chance. It’s a probability expressed as a specific value. A p-value lower than α (usually set at ≤0.05) means the results are extreme enough to call the null hypothesis into question and reject it. The result is surprising and, therefore, significant. The lower the p-value, the more significant the finding.

Happy testing!

Deborah O’Malley is one of the few people who’s earned a Master’s of Science (M.Sc.) degree with a specialization in eye tracking technology. She thought she was smart when she managed to avoid taking a single math class during her undergrad. But confronted reality when required to take a master’s-level stats course. Through a lot of hard work, she managed to earn an “A” in the class and has forever since been trying to wrap her head around stats, especially in A/B testing. She figures if she can learn and understand it, anyone can! And has written this guide, in plain English, to help you quickly get concepts that took her years to grasp.

Timothy Chan is among the select few who hold a Ph.D. and an MBA. Well-educated and well-spoken, he’s a former Data Scientist at Facebook. Today, he’s the Lead Data Scientist at Statsig, a data-driven experimentation platform. With this experience, there truly isn’t anyone better to explain statistical significance in A/B testing.

Ronny Kohavi is one of the world’s foremost experts on statistics in A/B testing. In fact, he’s literally written the book on it! With a Ph.D. from Stanford University, Ronny's educational background is as impressive as his career experience. A former vice president and technical fellow at Microsoft and Airbnb, and director of data mining and personalization at Amazon, Ronny now provides data science consulting to some of the world’s biggest brands. When not working directly with clients, he’s teaching others how to accelerate innovation in A/B testing. It's a course absolutely every experimenter needs to take!

*By: Deborah O'Malley, M.Sc. | Last updated May, 2022*

A Navigational Menu, also known as a navigational bar, or nav bar, is a menu system, usually located at the top of a website.

Its purpose is to help users find categories of interest and access key pages on a website.

The nav bar is usually organized into relevant sections and may have a dropdown menu or sub-sections directly below.

Most have clear section and sub-section menu titles.

These titles should state what the users will get, or where the user will arrive, by clicking into the menu category.

A typical nav menu may look something like this:

The nav menu is, usually, present across all pages of your site. That means optimizing it can pay big returns, impacting every stage of your conversion funnel.

Testing the organization, presentation, and wording in the nav menu presents a great optimization opportunity with potentially incredible conversion returns.

In fact, there aren't many other site-wide, high-ticket, low-effort tests like nav menu formatting.

However, in order to optimize a nav bar effectively, there are several important "do's" and "don'ts" you must follow.

Before redesigning or reorganizing any nav bar, always think about SEO and the keywords your visitors are using to find your site and navigate through.

Don't remove or rename any nav menu titles that will lower your SEO value, SERP, or keyword rankings.

This advice is especially important if a large portion of your site traffic comes from paid ads. You want to be using and showcasing the keywords you're paying for or that are bringing visitors to your site.

Once on your site, users will expect to see these keywords and will look for them to orient themselves on your website.

So cater to visitors' needs and expectations by showcasing these keywords in your nav menu.

Don't add or remove nav bar links without thinking about the site structure and internal linking. Otherwise, you risk adding or removing access to pages that link throughout your site.

Ideally, create an XML sitemap before making any changes involving internal links. An XML sitemap will look something like this:

Here's a good article on how to create an XML sitemap.
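For a rough idea of what goes into such a file, here is a minimal sketch that builds a bare-bones XML sitemap with Python's standard library. The `build_sitemap` helper and the example.com URLs are hypothetical; a real sitemap would list your own pages and may include optional fields such as last-modified dates.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per page URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages reachable from the nav menu.
sitemap_xml = build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/pricing",
    "https://www.example.com/contact",
])
print(sitemap_xml)
```

Comparing a list like this before and after a nav change is one simple way to check that no page has lost its only internal link.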

Heatmapping data is a powerful way to see how visitors are interacting with your site and the nav menu.

As you can see in this example heatmap, the bigger and darker the hotspot, the more likely it is visitors are interacting with that section of the site:

Use heatmapping data to see what nav categories your visitors are clicking on most or least.

But don't stop there. Explore trends across both desktop and mobile.

Take note if users are heavily clicking on the home icon, menu, or search box as all of this behavior may indicate users are not able to easily find the content they're looking for. Of course, eCommerce clients with a lot of products to search are the exception.

Also take note if a user is clicking on the nav page link, say pricing, when they're already within the pricing section. This trend provides clues the nav or page structure may be confusing.

And visitors might not realize where they are on the site.

If you detect this type of behavior, test the effect of making specific changes to optimize the interaction for users.

One great way to do so is through breadcrumb links.

Breadcrumb links, or breadcrumbs, are links placed directly under the main nav bar. They look something like this:

Breadcrumbs can be helpful to orient users and better enable them to navigate throughout the site. Because when users know where they are, they can navigate to where they want to be. Especially on large sites with a lot of pages.

If you don't already include breadcrumbs on your site, consider testing the addition of breadcrumb navigation, especially if you have a large site with many pages and sections.

At their essence, words are just squiggly lines on a page or website, but they have a visual presence that can be subconsciously perceived and understood.

That's why whether you put your nav menu titles in ALL CAPS <-- (like this) or Title Case <-- (like this) may make a surprising conversion difference.

PUTTING A SMALL AMOUNT OF TEXT IN ALL CAPS IS FINE FOR EMPHASIS. OR IF YOU'RE TRYING TO COMMUNICATE A POINT IN A LOUD, OFTEN ANGRY, BLATANT TONE.

BUT LARGE BLOCKS OF TEXT IN ALL CAPS ARE DIFFICULT AND TIRING TO READ.

In fact, according to renowned usability expert, Jakob Nielsen, deciphering ALL CAPS on a screen reduces reading speed by 35% compared to the same font in Title Case on paper.

The reason ALL CAPS is so difficult, tiring, and time consuming to read is because of the letter shape.

In ALL CAPS format, the height of every letter is the same, making each letter in every word create a rectangular shape.

Because the shapes of all the letters are the same, readers are forced to decipher every letter, reducing readability and processing speed.

Need proof? Take a look at this example:

With Title Case, the tops of the letters help us decipher the text, increasing readability and reading speed.

That said, ALL CAPS DOES HAVE UTILITY!

It's great for drawing attention and making you take notice. That's why it works well for headings and highlighting key points.

But to be most effective, it needs to be used sparingly -- and when it's not necessary to quickly decipher a big chunk of text.

With these points in mind, test whether it's best to use ALL CAPS or Title Case with your audience on your site.

GOT THAT!? 😉

If you need some inspiration, check out this real-life caps case study. Can you guess which test won?

Sometimes the navigational format we *think* will be simplest for our users actually ends up causing confusion.

Because nav menus are such a critical aspect of website usability, testing and optimizing their formatting is critical to improving conversions.

Which brings up the question, should you use a top "callout," that looks something like this, with the bolded text to categorize, highlight and make certain categories pop at the top?

Doing so may save visitors from having to hunt down and search each item in the menu.

But it also may create confusion if you're repeating the same product categories below. Users may be uncertain whether the callout section leads to a different page on the site than the rest of the menu does.

An optimal site, and nav system, is one that leaves no questions in the user’s mind.

So test the effect of adding or removing a top callout section in your nav menu. See this real-life case study for inspiration.

Can you guess right?

Back in the olden days, all websites had text menus that stretched across the screen.

But that's because mobile wasn't much of a thing. As mobile usage exploded, a nifty little thing called a "hamburger" menu developed.

A funny name, but so-called because those stacked little bars look kind of like a hamburger:

Hamburger menus on mobile make a lot of sense. The menu system starts small and expands bigger. So it's a good use of space on a small-screened device.

Hamburger menus have become the standard format on mobile.

But does that mean it should also be used on desktop? Instead of a traditional text-based menu?

It's a question worth testing!

In fact, the big-name computer brand, Intel, tested this very thing and got interesting results. Can you guess which test won?

And while you're at it, if you are going to use a hamburger menu -- whether on your mobile or desktop site -- test placement.

Many web developers put the hamburger menu on the right side. But it may not be best placed there, for two reasons:

In English, we read from left to right and our eyes naturally follow this reading pattern.

In fact, eye tracking studies show we typically start at the top of the screen, scan across, and dash our attention down the page. This reading behavior forms what's known as an "F-shaped pattern."

And because of this viewing behavior, the top left of the screen is the most coveted location. Known as the "golden triangle," it's the place where users focus most.

Here's a screenshot so you can visualize both the F-shaped pattern and golden triangle:

Given our reading patterns, it makes sense to facilitate user behavior by placing the hamburger menu in the top left corner.

As well, as this article by Adobe design succinctly says, because we read (in English) from left to right, it naturally follows that the nav menu should slide open from left to right.

Otherwise, the experience is unexpected and unintuitive; that combination usually doesn't convert well.

But do test for yourself to know for sure!

A persistent, or "sticky" nav menu is one that stays with users as they scroll up or down.

How well does a persistent or sticky nav bar work?

It can work wonders, especially if you include a sticky CTA within the nav bar, like this:

But a sticky nav bar doesn't always have to appear upon page load.

It can also appear upon scroll or upon a set scroll depth. What works best?

See this case study for timing ideas and challenge yourself. Can you guess the right test?

The best wording for your nav menu is titles and categories that resonate with your users.

To know what's going to win, you need to deeply understand your audience and the terminology they use.

This advice will differ for each client and site. But, as a starting point, delve into your analytics data.

If you have Google Analytics set up, you can find this information by going to the Behavior > Site Search > Search Terms tab. (*Note: you'll need to have previously set up this search function to get data.*)

For example, here are the search terms used for a client covering divorce in Canada:

Notice how the keyword "divorce" barely makes it into the top-5 keywords users are searching? In times of distress, these users are looking for other resources.

Showing keywords that resonate in the top nav can facilitate the user journey and will make it more likely visitors click into your site and convert.

**Pro tip:** Using analytics data combined with heatmapping data can provide a great indication of whether users are clicking on the nav menu title, giving a good clue as to whether the wording resonates.

Need more inspiration for high-converting copy?

See this case study examining whether the nav menu titles "Buy & Sell" or "Classifieds" won for a paid newspaper ad site. Which one won? The results may surprise you:

Testing whether a sticky nav bar works best is a great, easy test idea. Looking at what nav menu copy resonates is also a simple, effective way to boost conversions.

So why not combine both and test the winning wording within a sticky nav bar with a Call To Action (CTA) button!?

Here, again, you'll need to really understand your audience and assess their needs to determine and test what wording will win.

For example, does "Get a Quote" or "Get Pricing" convert better? The answer is, it depends who's clicking from where. . . See this case study for testing inspiration:

But just because an overall trend appears, doesn't mean it will hold true for all audiences.

If you have distinct audiences across different provinces, states, or countries, optimal copy may differ because the terminology, language, and needs may also change.

When catering to a local audience, try to truly understand your audience so you can best hone in on their needs.

Test the optimal copy that will most resonate with the specific cohort.

Segment results by geolocation to determine which tailored approach converts with each audience segment.
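As a rough illustration of segmenting by geolocation, the sketch below tallies conversion rates per region and variant. The record format, region names, and the `conversion_by_segment` helper are all assumptions made for this example; in practice the data would come from your analytics or testing platform's export.

```python
from collections import defaultdict

def conversion_by_segment(records):
    """Return {(region, variant): conversion rate}.

    records: iterable of (region, variant, converted) tuples,
    where converted is True if the visitor converted.
    """
    totals = defaultdict(lambda: [0, 0])  # (region, variant) -> [visitors, conversions]
    for region, variant, converted in records:
        stats = totals[(region, variant)]
        stats[0] += 1
        stats[1] += int(converted)
    return {key: conversions / visitors
            for key, (visitors, conversions) in totals.items()}

# Hypothetical visitor-level results for two CTA wordings in two regions.
records = [
    ("Ontario", "Get a Quote", True),
    ("Ontario", "Get a Quote", False),
    ("Ontario", "Get Pricing", False),
    ("Ontario", "Get Pricing", False),
    ("Texas", "Get a Quote", False),
    ("Texas", "Get Pricing", True),
]
rates = conversion_by_segment(records)
for (region, variant), rate in sorted(rates.items()):
    print(f"{region:8} {variant:12} {rate:.0%}")
```

With real traffic, each segment needs enough visitors on its own before differences between regions can be trusted.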

It may seem like a small change, but optimizing a navigational menu can have a big impact on conversions.

To optimize, start by using analytics data to inform your assumptions. Then test the highest-converting copy, format, organization, and design. And segment results by audience type.

Doing so will undoubtedly lift conversions.

Hope these test ideas are helpful for you!

Please share this post widely and comment below.

*By: Tim Waldenback, co-founder Zutobi | Last updated April, 2022*

Apps are everywhere. Many businesses have them, consumers expect them, and more and more businesses are creating them.

Apps are a great business opportunity, but also come with stiff competition. Making an app isn’t enough -- you also have to market it effectively and ensure it offers a top-notch user experience.

On top of these challenges, even small changes in an app’s user experience can have a detrimental impact on conversion and engagement rates.

So it’s best to test your features with A/B testing.

In this comprehensive article, you'll learn everything you need to know about A/B testing apps to get *app*ealing results.

When most people think of A/B testing, they imagine testing websites, webpages, or landing pages.

A/B testing for mobile apps is really no different. It involves testing individual variables and seeing how the audience responds to these variables.

The audience may comprise a single cohort or multiple, segmented audiences, but the goal is the same: to identify which option provides the best user experience.

For example, say you want to test your app for your driver’s permit practice test among teens in New Jersey.

Let's imagine your goal is to drive more app downloads, so you start A/B testing to see which variables entice users to download.

You may start with the icon displayed in the store to determine if one gets more attention and leads to more downloads. Everything else stays the same.

Once you have results for this test, you may move onto testing keywords, the product title, description, screenshots, and more.

A/B testing is a technique used by many app creators because it provides valuable, verifiable results.

With testing, and the subsequent result, you’re no longer relying on assumptions. Instead, you have concrete data to inform your decisions.

There are several other benefits of A/B testing apps, including:

- Optimizing in-app engagement
- Observing the impact of features
- Learning what works for different audiences and segments
- Gaining insights into user preferences and behavior

The benefit of each example goes back to data.

You're no longer basing decisions on assumptions, personal preferences, or bias. Rather, you know exactly what works and have the numbers to prove it.

Most mobile apps are tested using two different types of A/B testing.

The first type, in-app A/B testing, is primarily used by developers and tests the UX and UI impact, including retention rate, engagement, session time, and lifetime value. You may want to add other metrics to test for specific functions.

The second type is A/B testing for marketing. It can optimize conversion rates, retarget users, and drive downloads. You can test which creative ad is more effective, down to the call-to-action, font, images, and every other granular detail.

For example, this GuessTheTest case study tested the optimal app thumbnail image while this study looked at the best CTA button format.

One of the best aspects of A/B testing is that it’s repeatable and scalable. You can use it to continuously optimize your app and its marketing campaigns. Here’s how:

Your testing should always have a hypothesis that you’re trying to prove or disprove. Clearly stating a hypothesis is how you know which variables to test.

For example, you may want to test whether having screenshots of your practice permit test inspires more people to download your app. Testing the number of screenshots in the app page of the store gives you a starting point, orienting you to know where to begin.

You should also create a checklist to ensure you cover all the information you need, including:

- What are you testing?
- What audience(s) are you testing with?
- What will you do if your hypothesis is proven or disproven?

If you can’t come up with a defined testing variable, begin with the problem for which you’re seeking a solution. Then investigate what testing approach can help you best solve the problem.

Once you know what to test, you need to segment and define the audiences on which you’re testing.

Ideally, in A/B testing, you should isolate one variable to test at a time against one audience cohort.

The reason is that testing across different, or diverse, audiences adds another variable to the mix, which makes it more difficult to accurately define what worked and what didn’t.

It's also valuable to segment your audience by factors like traffic source or new versus returning visitors.

However, when segmenting your audience, it's important the sample size is large enough to glean important insights. If the sample size is too small, your A/B test data won't be valid, and you’ll miss out on part of the big picture.
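To get a feel for what "large enough" can mean, here is a minimal sketch of a common sample-size approximation for comparing two conversion rates. It assumes a two-sided test at α = 0.05 with 80% power, both conventional defaults rather than values from this article, and the baseline and expected rates are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect the given lift.

    Uses the normal approximation for the difference of two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical goal: detect a lift from a 10% to a 12% conversion rate.
needed = sample_size_per_variant(baseline_rate=0.10, expected_rate=0.12)
print(f"Roughly {needed} visitors per variant")
```

Because the required sample grows quickly as the expected lift shrinks, every additional segment you split out needs to reach this kind of volume by itself.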

Now is the time to analyze your results and determine which variable offers better results. Consider all the available data to get a comprehensive picture. For example, if you notice that your change increased session time, rather than conversions as you hoped, that’s still a valuable learning.

You’ve determined your best results from each variable, so now you can take the winning variable and implement it across your entire audience. If you didn’t get clear results from A/B testing, you should still have data that you can use to inform your next test.

A/B testing is repeatable, so you can keep refining and testing until you get optimal results. It’s important to continue testing regularly, no matter what, and use the information to improve your app and user experience.

You always need to understand why you’re testing a variable with a clear hypothesis and how you will move forward once you have an outcome. This statement may seem obvious, but knowing why you’re testing ensures that you’re not wasting time and money on a test that won’t serve your larger goals.

A/B tests have a lot of value. Even if things don’t go the way you hoped or you get results earlier than expected, it’s important to stick with your tests long enough to be confident in your decisions.

Don’t get too invested in the result you hope to get. User behavior is anything but simple, and sometimes your testing will show you something unexpected. You need to stay open-minded and implement the necessary changes, then test again. Testing is an iterative process towards continual optimization.

Ideally, with A/B testing, you're changing only one variable at a time, but you can’t control everything. The season in which you test matters, and you can’t control that. So, be sure to test the same variables in different seasons to see what results you get.

A/B testing is essential to ensuring your app delivers the best possible experience for users and validates your assumptions and ideas. Include A/B testing in your app development and updates to keep users downloading and engaging with your app.

By: Deborah O'Malley & Shawn David | Last updated April, 2022

In A/B testing, planning and prioritizing which tests to run first is a process mired in mystery.

It seems there's no great system for organizing ideas and turning them into executable action items.

Keeping track of which tests you've run, plan to run, or are currently running is an even bigger challenge.

That is until now.

In this short 12-minute interview, you'll hear from Shawn David, Operations Engineer, of CXL's sister site, Speero, a leading A/B testing firm.

He shares the secrets on how Speero tracks, manages, plans, and prioritizes their A/B tests.

Check out the video to:

- Get an inside view into Speero's planning and prioritization methodology, based on the CXL framework known as PXL.
- Watch a demo of Speero's custom-made test planning and prioritization tool in action, and apply the insights to optimize your own planning and prioritization process.
- See how this tool, built through Airtable, can be customized and used to develop or enhance your own test planning and prioritization model.

**Hope you’ve found this content useful and informative. Please share your thoughts and comments in the section below.**
