
By: Deborah O'Malley, M.Sc. & Timothy Chan, Ph.D. | Last updated May, 2022


Overview

Awesome! You’ve just run an A/B test, crunched the numbers, and achieved a statistically significant result.

It’s time to celebrate. You’ve got another test win. 🙂

Or have you?

As this article explains, understanding and calculating statistical significance is actually quite complex.

Many experimenters don’t truly know what statistical significance is or how to derive a statistically significant test result.

Statistical significance is the standard, accepted way to declare a winning test. It helps us identify real effects from noise. 

So to properly call a winning (or losing) test, it’s really important to clearly understand what a statistically significant result is and means. 

Otherwise, you risk making incorrect conclusions, random decisions, or money-losing choices.

In plain English, this guide is here to set it all straight so you can accurately declare a statistically significant A/B test with ease and confidence. 

Let’s dig in. . .


Statistical significance defined

In A/B testing, statistical significance is defined as a result that provides evidence there is an actual difference between variants, and that measured differences are unlikely due to random chance.

Say what?!?

If this definition seems confusing, hang tight. We’re going to decode it. . .

But to fully understand, you're going to need to wrap your head around a few fancy stats terms and their relationship to each other. These terms include:

  • The null hypothesis and alternative hypothesis
  • Type I and type II errors
  • Statistical power and beta (β)
  • Significance level alpha (α)
  • P-value
  • Confidence level and confidence interval

While these terms may not make much sense yet, don’t worry.

In the sections below, we’re going to clearly define them and show you how each concept ties into statistical significance so you can accurately declare a winning A/B test.


Statistical methods in A/B testing

But first, we need to start with the basics.

In A/B testing, there are two main statistical frameworks: Bayesian and Frequentist statistics.

While these words sound big and fancy, they’re nothing to be intimidated by. 

They simply describe the statistical approach taken to answer the question: does one A/B test version outperform another?

Bayesian statistics 

Bayesian statistics measures how likely, or probable, it is that one version performs better than another.

In a Bayesian A/B test, results are NOT measured by statistical significance. 

This fact is important to realize because popular testing platforms like VWO's SmartStats mode and Google Optimize currently use the Bayesian framework.

So if you’re running a test on these platforms, you shouldn’t claim your test is statistically significant. That’s not accurate.

Instead, you should state the probability at which the variant may beat the control.

Frequentist statistics

Frequentist statistics takes an entirely different approach.

In A/B testing, the Frequentist method is the more popular of the two and involves fewer assumptions. But the results are more challenging to understand. Which is why this article is here for you.

Frequentist statistics attempts to answer the question, "what's the probability one version will win?" by identifying how unlikely or unusual the results are.

Unexpected and notable results are deemed statistically significant.

But in order to declare a statistically significant result, a whole bunch of checkpoints have to be met along the way, starting with rejecting the null hypothesis.


Null hypothesis

Say what? What the heck is a null hypothesis?

To better understand this term, let’s break it down across the two words, starting with hypothesis.

Hypothesis: most of us have a vague sense of what a hypothesis is. Simply stated, we can call it an educated guess that can be proven wrong.

In A/B testing, you make an assumption about which variant you think will win and try to validate your belief through testing.

Hypothesis testing is the very basis of the scientific method. Through the hypothesis, it's usually assumed the current state of scientific knowledge is true. 

In A/B testing, this assumption means it’s initially believed there’s no conversion difference between test versions. Until proven otherwise, the control and variant(s) are thought to perform equally. 

And, here’s where the null, in null hypothesis, comes into play.

The null hypothesis, denoted H0, assumes the difference between test variants is null or amounts to nothing at all.

The null hypothesis and Frequentist testing take a guilty-until-proven-innocent approach.

As experimenters using this approach, we need to start by assuming our tests won't win; when they don't, it's considered normal.

But, sometimes, a result is unusual or surprising enough that we suspect the null hypothesis is wrong. 

This event occurs when we find a statistically significant result. We, then, reject the null hypothesis and accept the alternative viewpoint.


Alternative hypothesis

This other viewpoint is known as the alternative hypothesis (denoted Ha or H1). 

It’s the alternative to the null hypothesis.

As an experimenter, your goal is actually to reject the null hypothesis and accept the alternative hypothesis.

In doing so, you show that there is in fact, a real detectable effect, or conversion difference between variants. A difference that’s not due to random chance. 

Your amazing test variant has actually outperformed!

When this exciting outcome occurs, it’s deemed noteworthy and surprising.

How surprising?

To determine that answer, we need to calculate the probability of whether we made an error in calling a winner when there really wasn’t one.

Because we are just human after all. 

And, as mere mortals struggling to understand all this complex stats stuff, it’s certainly possible to make an incorrect call.

In fact, it happens so often, there are names for these kinds of mistakes. They're called type I and type II errors.


Errors in testing

Type I error, false positive

A type I error occurs when the null hypothesis is rejected – even though it was correct and shouldn’t have been.

In A/B testing, this error happens when you incorrectly claim there's a conversion lift, but, in actuality, there's not. What you've really detected is just statistical noise -- an outcome of random chance.

This error can be thought of as a false positive since you’re claiming a winner that’s not really there.

Oops. You don’t want to do that!

Calling a test a winner when it's not may drag down revenue or conversions.

Type II error, false negative

A type II error is the opposite. It occurs when you incorrectly declare that there’s no conversion difference between versions – even though there actually is.

This kind of error is also known as a false negative. Again, a mistake you want to avoid. 

Because, obviously, you want to be correctly calling a winner when it’s right there in front of you. Otherwise, you’re leaving money on the table.

If it’s hard to keep these errors all straight in your head, fear not. Here’s a diagram summarizing this concept:


Avoiding errors in testing 

Making errors sucks. And, as an experimenter, your goal is to minimize errors as much as possible.

The thing is, A/B testing itself is imperfect, and statistical noise can lead you to make wrong decisions. There’s no way around that.

To better understand this concept of statistical noise, it can be helpful to think of a coin flip analogy.

Imagine you have a regular coin with heads and tails. You make a bet with your friend and flip a coin 5 times. For each head, you win $20.  For each tail, you pay them $20.

Sweet! You're already seeing dollar bills in your eyes.

But what happens when you flip the coin 5 times and 4 out of 5 tosses land on tails? You might start to wonder whether it's just bad luck or if the coin is actually rigged.

In this example, the unfavorable coin flips were just due to random chance. The outcome is unavoidable. 
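If you're curious just how often plain bad luck produces a streak like that, here's a minimal Python sketch (using scipy; the flip count matches the hypothetical bet above) that works out the odds of seeing 4 or more tails in 5 tosses of a perfectly fair coin:

```python
from scipy.stats import binom

# A fair coin: probability of tails on any single flip is 0.5
n_flips = 5
p_tails = 0.5

# P(4 or more tails) = 1 - P(3 or fewer tails)
prob_4_or_more = 1 - binom.cdf(3, n_flips, p_tails)
print(f"Chance of 4+ tails in 5 fair flips: {prob_4_or_more:.1%}")  # about 18.8%
```

Roughly one bet in five, a perfectly fair coin will look this "rigged" – exactly the kind of statistical noise that can fool you in an A/B test.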

While the results of this simple coin toss bet may not break the bank, incorrectly calling A/B test winners or losers may have much bigger financial consequences.

You don’t want to be at the whim of random chance. 

Fortunately, you can turn to statistics to help you minimize poor decisions.

That’s the good news.

The bad news?

Unfortunately, when you turn to statistics and take steps to minimize noise and type I errors, you end up increasing the chances of a type II error. And vice versa. So it’s a real tradeoff.

But there is a silver lining in the cloud. Because in statistics, you get to set your risk appetite for type I and II errors.

How? 

Through our close family friends, Alpha (α) and Beta (β).


Statistical Power Beta (β)

Let's start with beta (β). Beta assesses the likelihood of committing a type II error.

Remember, a type II error, also known as a false negative, happens when you incorrectly claim there’s no difference between variants when, in fact, there is.

You can safeguard against making this mistake by setting the risk level at which you’re willing to accept a type II error.

This risk level can be calculated through beta’s twin brother, power.

Power

Power assesses how well you’ve managed to avoid a type II error and correctly call a winner (or loser) when there is, in fact, a real effect, or conversion difference between variants.

Like good twin brothers, power and beta (β) work together in unison.

In fact, power is calculated as 1 - β. The result is, typically, expressed as a decimal value. A power of 0.80 is the standard (1 - β = 0.80). 

This amount means there's at least an 80% likelihood of successfully detecting a meaningful effect, or conversion difference.

To fully grasp this concept, though, we should probably go back to grade school math where we learned that, to convert a decimal to a percentage, we simply multiply it by 100, or move the decimal two places to the right.

So, in that case, a power of 0.80 * 100 = 80%.

As such, there's only a 20% (0.20) chance of missing the effect and ending up with a false negative. A risk we're willing to take.

The higher the power, the higher the likelihood of accurately detecting a real effect, or conversion difference.

Although increasing power sounds like a great idea, it can be tricky, especially for lower traffic sites. The higher the power, the larger the sample size needs to be. And, often, the longer the test needs to run.
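To make that trade-off concrete, here's a rough Python sketch of a standard sample-size approximation for comparing two conversion rates. The baseline rate, expected lift, and function name are hypothetical, chosen just for illustration:

```python
from scipy.stats import norm

def visitors_per_variant(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate sample size per variant for a two-sided test of two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value tied to the significance level
    z_beta = norm.ppf(power)            # critical value tied to power (1 - beta)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return (z_alpha + z_beta) ** 2 * variance / effect ** 2

# Hypothetical numbers: 5% baseline conversion rate, hoping to detect a lift to 6%
print(round(visitors_per_variant(0.05, 0.06)))              # power = 0.80 -> about 8,200 per variant
print(round(visitors_per_variant(0.05, 0.06, power=0.90)))  # power = 0.90 -> about 10,900 per variant
```

Bumping power from 0.80 to 0.90 pushes the required sample up by roughly a third, which is exactly why higher power often means a longer test.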


Significance level alpha (α)

While power helps us measure the risk of a type II error, its close cousin, alpha (α), or significance, as it’s commonly called, helps us measure the risk of a type I error.

Remember, a type I error, also known as a false positive, occurs when we think there’s a difference between versions. But, in actuality, there isn’t. The variants don’t convert better than the control.

To have a statistically significant test, we need to confirm there’s a low probability we’ve made a type I error.

So, here, again, we set a risk level, or cut-off, at which we’re willing to accept this chance of error.

This cut-off point is known as significance level. The standard significance level is 0.05.

This amount means we accept a 5% (0.05) chance of a type I error in calling a winner when there’s actually no real difference between variants. 

As such, there’s a 95% probability we’ve made the correct call. 

But how do we actually calculate this probability? 

The answer is through family ties. 

Just like beta’s twin brother is power, cousin alpha also has a sibling. A big brother. His name is p-value. 


P-value explained

Officially defined, the p-value is the probability of observing a result at least as extreme as the one in your data, assuming the null hypothesis is true.

But, if that definition doesn’t help much, don’t worry.

Here’s another way to think of it: 

P-value shows how likely it is you'd see a conversion difference as big as the one observed if there were really no difference between variants at all. The smaller that probability, the stronger the evidence the difference is real.

True to its name, this probability is expressed as a value. In fact, some fun cocktail trivia knowledge for you, the “p” in p-value stands for probability! 

To determine this probability, we have to remember that, as experimenters, we've gotta go into things accepting the null hypothesis: that, until proven otherwise, there’s no conversion difference between variants. 

Assuming no difference, we lean on sibling significance level alpha to determine the likelihood we’ve made a type I error. 

Remember, an alpha of 0.05 is the accepted standard. 

Anytime the p-value is less than alpha (p < α), the result is considered statistically significant. Therefore, a p-value of ≤0.05 means the result is statistically significant.

However, many data purists will balk at this threshold.

They’ll tell you a significant test should have a p-value ≤0.01. And some data scientists even argue it should be lower.

However low you go, it seems everyone can agree: the closer the p-value to 0.00, the more likely that the observed effect, or conversion lift, is real – not just the outcome of random chance, noise, or error. 

Which, in turn, means there’s a high probability you’ve actually found a winner. 

Since, according to the null hypothesis, that outcome is surprising and unexpected, it’s significant.
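If you want to see where that number comes from, here's a minimal Python sketch of a two-proportion z-test, the kind of calculation many Frequentist A/B testing tools run under the hood. The visitor and conversion counts are invented purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical results: visitors and conversions for the control (A) and variant (B)
visitors_a, conversions_a = 10_000, 500   # 5.0% conversion rate
visitors_b, conversions_b = 10_000, 580   # 5.8% conversion rate

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Pooled conversion rate under the null hypothesis (no real difference between variants)
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

z_score = (rate_b - rate_a) / std_err
p_value = 2 * norm.sf(abs(z_score))   # two-sided p-value

print(f"z = {z_score:.2f}, p-value = {p_value:.4f}")
print("Statistically significant" if p_value < 0.05 else "Not significant")
```

With these made-up numbers, the p-value comes out to about 0.012 – comfortably below the 0.05 cut-off – so the null hypothesis would be rejected.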

Now, what you've just taken in is a lot! If you need a little help putting all these concepts together, here's a "family tree" that summarizes all these relationships for you:


Inverse relationship of p-value and statistical significance

As you've probably gathered, p-value and statistical significance have an inverse relationship.

The lower the p-value, the more likely the effect, or results, are real and you’ve actually found a winner. 

Or written another way: ↓ p-value, ↑ significant the finding.

Conversely, the higher the p-value, the lower the likelihood there’s really a difference between versions – and therefore, the less likely the result is significant.

Or visually written: ↑ p-value, ↓ significant the result.

But even if you achieve a p-value of ≤0.05, how can you really be sure that you truly found a difference between variants and can claim a winner?


Confidence Level defined

To answer this question, you can lean on confidence level.

With a p-value of ≤0.05, you can be pretty confident you’ve really found a meaningful difference between versions.

But. . . notice that nuance? We said that you can be pretty confident.

When money’s on the line, being reasonably sure isn’t good enough. You want to be quite certain!

And that’s why we also use the guardrail of confidence level as the final factor in achieving statistical significance.

Confidence level indicates the degree of certainty that you’ve actually found a winner and you’re not just making a mistake, or committing a type I error.

Fittingly, it shows how confident you are in your results.

Calculating confidence

Confidence level is calculated through the equation 1 - α. 

If that formula looks scary to you, don’t fret. 

Remember, α is the significance level, and a result is significant when the p-value is ≤0.05.

So, when the p-value = 0.05, the confidence level (1 - α) is 1 - 0.05 = 0.95. 

But here's the problem, and a big source of confusion for some. . . Confidence level is stated as a percentage. Yet, here, we've arrived at a decimal: 0.95.

So what gives?

Again, we need to go back to grade school math where we learned that, to convert a decimal to a percentage, we multiply it by 100, or simply move the decimal two places to the right.

Let’s try it: (1 - α) = 1 - 0.05 = 0.95 which means 0.95 * 100 = 95%.

Amazing! We've now arrived at a 95% level of confidence.

And, now, more fun cocktail trivia for you. . . 

Did you know, the word percent is actually a Latin phrase meaning by the hundred? And it can be broken down into two root words:

  • Per meaning out of 
  • And cent = 100 in French

So, when something is 100%, it, literally, means out of 100.

A 95% confidence level means that, 95 times out of 100, you’ve called the test correctly without error. Or to use fancy stats speak, you've correctly accepted the null hypothesis when there is no effect.

At 95%, you've got pretty high odds that you can feel confident in.

Notice something else?

At first glance, a p-value of ≤0.05 and 95% confidence level might not seem like they have anything to do with each other. But, they’re actually interdependent.

Because if we convert the p-value decimal 0.05 to a percentage (0.05 *100 = 5%), we get 5%.

Which means, out of 100%, with a 95% confidence level, there’s just a 5% (100% - 95% = 5%) chance we’ve incorrectly called a winner, or made a type I error. 

The higher the confidence level, the lower the probability the result is just random chance or our own interpretation error. And, therefore, the more significant the finding is.

Or visually written:

↑ confidence level, ↓ p-value, ↑ statistical significance. 

And conversely:

↓ confidence level, ↑ p-value, ↓ statistical significance.

Confidence level best practices

As a general best practice, you should aim to achieve a minimum 95% level of confidence at a significance level of 0.05. 

However, setting the confidence level and significance level really does depend on the experiment and your testing needs. 

If you’re just trying to get a quick win, or test for proof of concept, you might be able to get away with a lower confidence. 

But if you need to show, with absolute certainty, that your million dollar test idea is undeniably a winner, you’re going to want a highly significant result at a very high level of confidence. For example, if you're like Amazon, you're probably not gonna wanna change your checkout flow without being REALLY confident in results. There’s just too much on the line.

Also, a subtle but important point to note, a 95% level of confidence does not mean you’re 95% sure the results are accurate. Rather, it means you’re 95% sure observed results are not just an outcome of random chance or error. 

As well, it's important to note, even though confidence level is calculated out of 100%, it's actually impossible to reach 100% confidence. You can only ever be 99.99% confident, as there's always some possibility of error.


Confidence Interval defined

Test results are commonly reported with a margin of error, known as the confidence interval.

A confidence interval is a range of values you expect your test results to fall between if you were to redo the experiment.

For example, if the conversion lift is 12% ± 4%, the confidence interval would be a range of values anywhere between an 8% (12 - 4%) to a 16% (12 + 4%) conversion lift.

But, in stats, you can't just arbitrarily declare you think the test results would fall within this range. You have to have a certain level of confidence.

A confidence interval answers the question: how probable is it the conversion lift would be close to the same, if the experiment was repeated?

The answer, is, not surprisingly, calculated with the help of our close friend, significance level (α).

Typically, confidence intervals are set to 1 - the significance level.

So if the significance level is 0.05, we report a 95% (1 - 0.05 = 0.95) confidence interval.

Which is different from, and not to be confused with, the confidence level.

It’s a 95% confidence interval because the real result should be within that range 95% of the time. But may be outside that range 5% of the time. A risk we're willing to take.

Although this range can be substantial, it’s set by the statistical power and sample size of your experiment.

Smaller confidence intervals give more precise and trustworthy estimates.
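As a rough illustration, here's how a confidence interval for the lift (the difference between two conversion rates) could be computed in Python. The numbers are hypothetical, and the normal approximation used here is just one of several accepted ways to build the interval:

```python
from math import sqrt
from scipy.stats import norm

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for the absolute lift (B - A)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for a 95% interval
    lift = rate_b - rate_a
    return lift - z * std_err, lift + z * std_err

low, high = lift_confidence_interval(500, 10_000, 580, 10_000)
print(f"95% CI for the lift: {low:.2%} to {high:.2%}")   # roughly +0.2% to +1.4%
```

Notice the interval here is on the absolute difference in conversion rates and excludes zero, which ties directly into the relationship between confidence intervals and significance covered next.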

Confidence interval ≠ statistical significance

Confidence interval and significance are tightly coupled. The lower the p-value, the tighter the confidence interval.

However, confidence interval is not the metric used to declare a winning test. 

Statistical significance is. 

Many times, I’ve heard experimenters boast they’ve achieved a winning test result based on a strong confidence interval. That’s not right.

You should only gloat when you’ve achieved a statistically significant win.


Correctly calling a winning test

What’s in a name: calling your test a winner

Now that you understand how the null hypothesis ties into a type I error, based on significance level α, expressed as a p-value, within a given confidence level, based on a specific confidence interval, it should be clear:

When calling a result “statistically significant,” what you’re really saying in scientific, stats speak is: the test achieved statistical significance based on significance level (α), expressed as a p-value of ≤0.05 at a 95% level of confidence that you can reject the null hypothesis and don’t have a type I error.

Or much more simply stated, the test achieved stat sig at a 95% level of confidence.

In other words, there does, in fact, seem to be a real difference between variants. 

You’ve found a winner! 🙂


Calculating and verifying if your test is a winner

Although most standard A/B testing platforms may declare a statistically significant winning test for you, not all get it right. 

So it’s strongly advocated that you verify test results yourself.

To do so, you should use a test validation calculator, like this one from AB Testguide.

Simply plug in your visitor and conversion numbers, declare if your hypothesis was one- or two-sided, and input your desired level of confidence.

One-sided vs. two-sided tests are a whole topic to explore in and of themselves. There’s no clear consensus on which is best to use for A/B testing. For example, this in-depth article states one-tailed tests are better. But, this one argues two-tailed tests are best.

A one-sided test measures conversions going only one way compared to the control, so either just up OR just down, whereas a two-sided test measures conversion results going both up AND down, compared to the control.

As a general suggestion, if you care about negative results – and want to measure if the variant performed worse than the control – you should use a two-sided test because you want to be able to evaluate both positive AND negative results.

Conversely, if you only care about detecting a positive result, you can use a one-sided test. A one-sided test lets you more easily evaluate if you’re going to implement the variant because it’s a clear winner.
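To see how the two options differ mechanically, here's a small hedged Python sketch. It reuses a hypothetical z-score (a variant that appears ahead of the control); the only difference between the two p-values is whether one or both tails of the distribution are counted:

```python
from scipy.stats import norm

z_score = 2.50   # hypothetical z-score where the variant looks better than the control

p_two_sided = 2 * norm.sf(abs(z_score))   # extreme results in EITHER direction count
p_one_sided = norm.sf(z_score)            # only extreme results in the positive direction count

print(f"Two-sided p-value: {p_two_sided:.4f}")   # about 0.0124
print(f"One-sided p-value: {p_one_sided:.4f}")   # about 0.0062
```

The one-sided p-value is half the two-sided one, which is why one-sided tests reach significance more easily – and also why they can't tell you anything about the variant performing worse than the control.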

If the result is statistically significant, and you can confidently reject the null hypothesis, you’ll get a notice with a nicely visualized chart. It will look something like this:

If the result is not significant, you’ll see a similar notification informing you a winning test was not observed. It will look like this:

However, an important word of caution: just because your test result is statistically significant doesn’t always mean it’s truly trustworthy. 

There are lots of ways to derive a statistically significant result by “cheating.” In fact, we’re working on a whole new article related to this topic! Stay tuned.


Statistical significance summarized

In essence, evaluating statistical significance in A/B testing asks the question: what’s the probability I’ve truly got a winning test?

To determine the answer, we assume the null hypothesis: that there’s no conversion difference between variants. So there is no winner – until proven otherwise.

We use significance level alpha (α) to quantify how likely it is we've made an error in claiming there's a difference between versions – when there's really not; all versions perform the same.

P-value tells us how likely it is we'd see the observed conversion difference if there were really no difference at all – in other words, whether the result can be explained by random chance or noise.

A p-value lower than α, usually set at ≤0.05, means the outcome is unlikely due to random chance. The result is surprising and, therefore, significant.

The lower the p-value, the more significant the finding. 

When we’re very confident (95%+) about our result, we’re able to put our money where our mouth is and know we’ve truly found a winner, not just an outcome of error or random chance. 

So, in summary, a statistically significant result shows us the result is unlikely due to just random chance or error. There’s strong evidence we’ve found a winner. That's surprising. And, therefore, significant.

We can, then, take our winning result and implement it with confidence, knowing that it will indeed lift conversions on our website.

Without achieving a statistically significant result, we’re left in the dark.

We have no way to know whether we've actually, accurately called a winner, made an error, or got results due to just random chance or noise.

For this reason, understanding what statistical significance is and how to apply it is so important. We don’t want to be wrong in calling and implementing supposed winners.

Currently, statistical significance is the standard, accepted way to declare a winner in frequentist A/B testing. 

Whether it should be or not is a whole other topic. One that’s up for serious debate. . .

But for now, statistical significance is what we use. So apply it. Properly.


Final thoughts

Congrats! You’ve just made it through this article, and have hopefully learned some things along the way.

Now, you deserve a nice break. 

Grab yourself a bubble tea, kick your feet up, and breathe out a sigh of relief.

The next time you go to calculate a winning test, you’ll know exactly what to look for and how to confidently declare a statistically significant result. If you’ve actually got one. 😉

Hope this article has been helpful for you. Please share widely and post your questions or comments in the section below.


Glossary: key terms to know

  • Frequentist testing - a statistical method of calculation used in traditional hypothesis-based A/B testing in which you aim to prove or disprove the null hypothesis.
  • Null hypothesis - the assumption there is no difference between test versions; both perform equally effectively. Your aim is to disprove, or nullify, the null hypothesis.
  • Type I error - a fault that occurs when the null hypothesis is rejected even though it shouldn’t be. Happens when you incorrectly declare a difference between versions when, really, there isn’t.
  • Significance level alpha (α) - a measure used to determine the likelihood of a type I error, expressed as a probability. The standard significance level (α) = 0.05. 
  • P-value - the probability of seeing results at least as extreme as yours when there's really no conversion difference, i.e., when the outcome is due to random chance alone. A p-value below the significance level (α), typically ≤0.05, is considered significant and suggests you truly have a winning variant.
  • Confidence level - how certain you can be the conversion uplift detected isn’t just the result of random chance or error. Has an inverse relationship to p-value and is expressed as a percentage. A p-value of 0.05 (5%) = 95% confidence.

About the authors:

By: Deborah O'Malley, M.Sc. | Last updated May, 2022


Nav bars defined

A Navigational Menu, also known as a navigational bar, or nav bar, is a menu system, usually located at the top of a website.

Its purpose is to help users find categories of interest and access key pages on a website.

The nav bar is usually organized into relevant sections and may have a dropdown menu or sub-sections directly below.

Most have clear section and sub-section menu titles.

These titles should state what the users will get, or where the user will arrive, by clicking into the menu category.

A typical nav menu may look something like this:

The nav menu is, usually, present across all pages of your site. That means optimizing it can pay big returns, impacting every stage of your conversion funnel.

Testing the organization, presentation, and wording in the nav menu presents a great optimization opportunity with potentially incredible conversion returns.

In fact, there aren't many other site-wide, high-ticket, low-effort tests like nav menu formatting.

However, in order to optimize a nav bar effectively, there are several important "do's" and "don'ts" you must follow.

Here are the top-10 key do's, don'ts, and test ideas to try:

1. Think about SEO & keywords

Before redesigning or reorganizing any nav bar, always think about SEO and the keywords your visitors are using to find your site and navigate through it.

Don't remove or rename any nav menu titles that will lower your SEO value, SERP, or keyword rankings.

This advice is especially important if a large portion of your site traffic comes from paid ads. You want to be using and showcasing the keywords you're paying for or that are bringing visitors to your site.

Once on your site, users will expect to see these keywords and will look for them to orient themselves on your website.

So cater to visitors' needs and expectations by showcasing these keywords in your nav menu.


2. Assess your existing site structure

Don't add or remove nav bar links without thinking about the site structure and internal linking. Otherwise, you risk adding or removing access to pages that link throughout your site.

Ideally, create an XML sitemap before making any changes involving internal links. An XML sitemap will look something like this:

Here's a good article on how to create an XML sitemap.


3. Use heatmapping data

Heatmapping data is a powerful way to see how visitors are interacting with your site and the nav menu.

As you can see in this example heatmap, the bigger and darker the hotspot, the more likely it is visitors are interacting with that section of the site:

Use heatmapping data to see what nav categories your visitors are clicking on most or least.

But don't stop there. Explore trends across both desktop and mobile.

Take note if users are heavily clicking on the home icon, menu, or search box as all of this behavior may indicate users are not able to easily find the content they're looking for. Of course, eCommerce clients with a lot of products to search are the exception.

Also take note if a user is clicking on the nav page link, say pricing, when they're already within the pricing section. This trend provides clues the nav or page structure may be confusing.

And visitors might not realize where they are on site.

If you detect this type of behavior, test the effect of making specific changes to optimize the interaction for users.

One great way to do so is through breadcrumb links.


4. Test including breadcrumb navigation

Breadcrumb links, or breadcrumbs, are links placed directly under the main nav bar. They look something like this:

Breadcrumbs can be helpful to orient users and better enable them to navigate throughout the site. Because when users know where they are, they can navigate to where they want to be. Especially on large sites with a lot of pages.

If you don't already include breadcrumbs on your site, consider testing the addition of breadcrumb navigation, especially if you have a large site with many pages and sections.


5. Determine if ALL CAPS or Title Case is optimal

At their essence, words are just squiggly lines on a page or website, but they have a visual presence that can be subconsciously perceived and understood.

That's why whether you put your nav menu titles in ALL CAPS <-- (like this) or Title Case <-- (like this) may make a surprising conversion difference.

PUTTING A SMALL AMOUNT OF TEXT IN ALL CAPS IS FINE FOR EMPHASIS. OR IF YOU'RE TRYING TO COMMUNICATE A POINT IN A LOUD, OFTEN ANGRY, BLATANT TONE.

BUT LARGE BLOCKS OF TEXT IN ALL CAPS IS DIFFICULT AND TIRING TO READ.

In fact, according to renowned usability expert, Jakob Nielsen, deciphering ALL CAPS on a screen reduces reading speed by 35% compared to the same font in Title Case on paper.

The reason ALL CAPS is so difficult, tiring, and time consuming to read is because of the letter shape.

In ALL CAPS format, the height of every letter is the same, so every word forms the same rectangular shape.

Because the shapes of all the letters are the same, readers are forced to decipher every letter, reducing readability and processing speed.

Need proof? Take a look at this example:

With Title Case, the varying tops of the letters help us decipher the text, increasing readability and reading speed.

That said, ALL CAPS DOES HAVE UTILITY!

It's great for drawing attention and making you take notice. That's why it works well for headings and highlighting key points.

But to be most effective, it needs to be used sparingly -- and when it's not necessary to quickly decipher a big chunk of text.

With these points in mind, test whether it's best to use ALL CAPS or Title Case with your audience on your site.

GOT THAT!? 😉

If you need some inspiration, check out this real-life caps case study. Can you guess which test won?


6. Optimize menu categorization

Sometimes the navigational format we think will be simplest for our users actually ends up causing confusion.

Because nav menus are such a critical aspect of website usability, testing and optimizing their formatting is critical to improving conversions.

Which brings up the question: should you use a top "callout" that looks something like this, with bolded text to categorize, highlight, and make certain categories pop at the top?

Doing so may save visitors from having to hunt down and search each item in the menu.

But, it also may create confusion if you're repeating the same product categories below. Users may be uncertain whether the callout section leads to a different page than the rest of the menu does.

An optimal site, and nav system, is one that leaves no questions in the user’s mind.

So test the effect of adding or removing a top callout section in your nav menu. See this real-life case study for inspiration.

Can you guess right?


7. Test using text or "hamburger" menu

Back in the olden days, all websites had text menus that stretched across the screen.

But that's because mobile wasn't much of a thing. As mobile usage exploded, a nifty little thing called a "hamburger" menu developed.

A funny name, but so-called because those stacked little bars look kinda like a hamburger:

Hamburger menus on mobile make a lot of sense. The menu system starts small and expands bigger. So it's a good use of space on a small-screened device.

Hamburger menus have become the standard format on mobile.

But does that mean it should also be used on desktop? Instead of a traditional text-based menu?

It's a question worth testing!

In fact, the big-name computer brand, Intel, tested this very thing and came away with some interesting findings. Can you guess which test won?

And while you're at it, if you are going to use a hamburger menu -- whether on your mobile or desktop site -- test placement.

Many a web developer puts the hamburger menu on the right side. But it may not be best placed there. For two reasons:

i. The F-shaped pattern and golden triangle

In English, we read from left to right and our eyes naturally follow this reading pattern.

In fact, eye tracking studies show we, typically, start at the top of the screen, scan across, and then dart our attention down the page. This reading pattern forms what's known as an "F-shaped pattern."

And because of this viewing behavior, the top left of the screen is the most coveted location. Known as the "golden triangle," it's the place where users focus most.

Here's a screenshot so you can visualize both the F-shaped pattern and golden triangle:


Given our reading patterns, it makes sense to facilitate user behavior by placing the hamburger menu in the top left corner.

ii. Unfolds from left outwards

As well, as this article by Adobe design succinctly says, because we read (in English) from left to right, it naturally follows that the nav menu should slide open from left to right.

Otherwise, the experience is unexpected and unintuitive; that combination usually doesn't convert well.

But do test for yourself to know for sure!


8. Test if a persistent menu is optimal

A persistent, or "sticky" nav menu is one that stays with users as they scroll up or down.

How well does a persistent or sticky nav bar work?

It can work wonders, especially if you include a sticky CTA within the nav bar, like this:

But a sticky nav bar doesn't always have to appear upon page load.

It can also appear upon scroll or upon a set scroll depth. What works best?

See this case study for timing ideas and challenge yourself. Can you guess the right test?


9. Determine which wording wins

The best wording for your nav menu is titles and categories that resonate with your users.

To know what's going to win, you need to deeply understand your audience and the terminology they use.

This advice will differ for each client and site. But, as a starting point, delve into your analytics data.

If you have Google Analytics set up, you can find this information by going to the Behavior > Site Search > Search Terms tab. (Note: you'll need to have previously set up this search function to get data).

For example, here are the search terms used for a client covering divorce in Canada:

Notice how the keyword "divorce" barely makes it into the top-5 keywords users are searching? In times of distress, these users are looking for other resources.

Showing keywords that resonate in the top nav can facilitate the user journey and will make it more likely visitors click into your site and convert.

Pro tip: combining analytics data with heatmapping data can provide a great indication of whether users are clicking on the nav menu titles, giving a good clue as to whether the wording resonates.

Need more inspiration for high-converting copy?

See this case study examining whether the nav menu titles "Buy & Sell" or "Classifieds" won for a paid newspaper ad site. Which one won? The results may surprise you:


10. Test button copy with your sticky nav

Testing whether a sticky nav bar works best is a great, easy test idea. Looking at what nav menu copy resonates is also a simple, effective way to boost conversions.

So why not combine both and test the winning wording within a sticky nav bar with a Call To Action (CTA) button!?

Here, again, you'll need to really understand your audience and assess their needs to determine and test what wording will win.

For example, does "Get a Quote" or "Get Pricing" convert better? The answer is, it depends who's clicking from where. . . See this case study for testing inspiration:

But just because an overall trend appears doesn't mean it will hold true for all audiences.

If you have distinct audiences across different provinces, states, or countries, optimal copy may differ because the terminology, language, and needs may also change.

When catering to a local audience, try to truly understand that audience so you can best home in on their needs.

Test the optimal copy that will most resonate with the specific cohort.

Segment results by geolocation to determine which tailored approach converts with each audience segment.


Summary

It may seem like a small change, but optimizing a navigational menu can have a big impact on conversions.

To optimize, start by using analytics data to inform your assumptions. Then test the highest-converting copy, format, organization, and design. And segment results by audience type.

Doing so will undoubtedly lift conversions.

Hope these test ideas are helpful for you!

Please share this post widely and comment below.

By: Tim Waldenback, co-founder Zutobi | Last updated April, 2022


Overview

Apps are everywhere. Many businesses have them, consumers expect them, and more and more businesses are creating them.

Apps are a great business opportunity, but also come with stiff competition. Making an app isn’t enough -- you also have to market it effectively and ensure it offers a top-notch user experience.

On top of these challenges, even small changes in an app’s user experience can have a detrimental impact on conversion and engagement rates.

So it’s best to test your features with A/B testing.

In this comprehensive article, you'll learn everything you need to know about A/B testing apps to get appealing results.


A/B Test Mobile Apps

When most people think of A/B testing, they imagine testing websites, webpages, or landing pages.

A/B testing for mobile apps is really no different. It involves testing individual variables and seeing how the audience responds to these variables.

The audience may comprise a single cohort or multiple, segmented audiences, but the goal is the same: to identify which option provides the best user experience.

For example, say you want to test your app for your driver’s permit practice test among teens in New Jersey.

Let's imagine your goal is to drive more app downloads, so you start A/B testing to see which variables entice users to download.

You may start with the icon displayed in the store to determine if one gets more attention and leads to more downloads. Everything else stays the same.

Once you have results for this test, you may move on to testing keywords, the product title, description, screenshots, and more.

Benefits of A/B Testing Apps

A/B testing is a technique used by many app creators because it provides valuable, verifiable results.

With testing, and the subsequent result, you’re no longer relying on assumptions. Instead, you have concrete data to inform your decisions.

There are several other benefits of A/B testing apps, including:

  • Optimizing in-app engagement
  • Observing the impact of features
  • Learning what works for different audiences and segments
  • Gaining insights into user preferences and behavior

The benefit of each example goes back to data.

You're no longer basing decisions on assumptions, personal preferences, or bias. Rather, you know exactly what works and have the numbers to prove it.


Types of A/B Testing for Mobile Apps

Most mobile apps are tested using two different types of A/B testing.

1. In-App A/B Testing

This method is primarily used by developers and tests the UX and UI impact, including retention rate, engagement, session time, and lifetime value. You may want to add other metrics to test for specific functions.

2. A/B Testing for Marketing

For marketing, A/B testing can optimize conversion rates, retarget users, and drive downloads. You can test which creative ad is more effective, down to the call-to-action, font, images, and every other granular detail.

For example, this GuessTheTest case study tested the optimal app thumbnail image while this study looked at the best CTA button format.


How to Conduct A/B Testing

One of the best aspects of A/B testing is that it’s repeatable and scalable. You can use it to continuously optimize your app and its marketing campaigns. Here’s how:

Start with a Hypothesis

Your testing should always have a hypothesis that you’re trying to prove or disprove. Clearly stating a hypothesis is how you know which variables to test.

For example, you may want to test whether having screenshots of your practice permit test inspires more people to download your app. Testing the number of screenshots on your app's store page gives you a starting point, orienting you to know where to begin.

Create a Checklist

You should also create a checklist to ensure you cover all the information you need, including:

  • What are you testing?
  • What audience(s) are you testing with?
  • What will you do if your hypothesis is proven or disproven?

If you can’t come up with a defined testing variable, begin with the problem for which you’re seeking a solution. Then investigate what testing approach can help you best solve the problem.

Segment the Audience

Once you know what to test, you need to segment and define the audiences on which you’re testing.

Ideally, in A/B testing, you should isolate one variable to test at a time against one audience cohort.

The reason is that testing across different or diverse audiences adds another variable to the mix, which makes it more difficult to accurately define what worked and what didn't.

It's also valuable to segment your audience by factors like traffic source or new versus returning visitors.

However, when segmenting your audience, it's important the sample size is large enough to glean important insights. If the sample size is too small, your A/B test data won't be valid, and you’ll miss out on part of the big picture.

Analyze the Results

Now is the time to analyze your results and determine which variable offers better results. Consider all the available data to get a comprehensive picture. For example, if you notice that your change increased session time, rather than conversions as you hoped, that’s still a valuable learning.

Implement the Changes

You've determined the best result for each variable, so now you can take that winning variable and implement it across your entire audience. If you didn't get clear results from A/B testing, you should still have data that you can use to inform your next test.

Adjust Your Hypothesis and Repeat the Test

A/B testing is repeatable, so you can keep refining and testing until you get optimal results. It’s important to continue testing regularly, no matter what, and use the information to improve your app and user experience.


A/B Testing Best Practices

Always Understand Why You’re Testing

You always need to understand why you’re testing a variable with a clear hypothesis and how you will move forward once you have an outcome. This statement may seem obvious, but knowing why you’re testing ensures that you’re not wasting time and money on a test that won’t serve your larger goals.

Don’t Stop Short

A/B tests have a lot of value. Even if things don’t go the way you hoped or you get results earlier than expected, it’s important to stick with your tests long enough to be confident in your decisions.

Stay Open

Don’t get too invested in the result you hope to get. User behavior is anything but simple, and sometimes your testing will show you something unexpected. You need to stay open-minded and implement the necessary changes, then test again. Testing is an iterative process towards continual optimization.

Test Seasonally

Ideally, with A/B testing, you're changing only one variable at a time, but you can't control everything. The season in which you test matters, and you can't control that. So, be sure to test the same variables in different seasons to see what results you get.


Summary: Leverage A/B Testing for Success

A/B testing is essential to ensuring your app delivers the best possible experience for users and validates your assumptions and ideas. Include A/B testing in your app development and updates to keep users downloading and engaging with your app.


About the Author

By: Deborah O'Malley & Shawn David | Last updated April, 2022


Overview

In A/B testing, planning and prioritizing which tests to run first is a process shrouded in mystery.

It seems there's no great system for organizing ideas and turning them into executable action items.

Keeping track of which tests you've run, plan to run, or are currently running is an even bigger challenge.

That is until now.

In this short 12-minute interview, you'll hear from Shawn David, Operations Engineer at Speero, CXL's sister site and a leading A/B testing firm.

He shares the secrets on how Speero tracks, manages, plans, and prioritizes their A/B tests.

Check out the video to:

  • Get an inside view into Speero's planning and prioritization methodology, based on the CXL framework known as PXL.
  • Watch a demo of Speero's custom-made test planning and prioritization tool in action, and apply the insights to optimize your own planning and prioritization process.
  • See how this tool, built through Airtable, can be customized and used to develop or enhance your own test planning and prioritization model.

Hope you’ve found this content useful and informative. Please share your thoughts and comments in the section below. 

By: Deborah O'Malley, M.Sc. | Last updated April, 2022


Overview

If you've been in the A/B testing field for a while, you've probably heard the term Sample Ratio Mismatch (SRM) thrown around.

Maybe you've talked with people who tell you that, if you're not looking at it, you're not testing properly. And you need to correct for it.

But, in order to do so, you first need to know what SRM is and how to spot it.

This article outlines the ins and outs of SRM, describing what it is, why you need to be looking at it when testing, and how to correct for it if an SRM issue occurs.

Let's go!


Understanding Sample Ratio Mismatch (SRM)

The term Sample Ratio Mismatch (SRM) sounds really intimidating.

It's a big jumble of words. With "sample," and "ratio," and what exactly is "mismatch" anyway?

Well, let's break it all down, starting with the word sample.

In A/B testing, the word sample applies to two separate but related concepts that impact test traffic:

1) Traffic allocation

2) The test sample

What's the difference?

1) Traffic allocation

Traffic allocation is the way traffic is split.

Typically, in an A/B test, traffic is split equally, or 50/50, so half of users see the control version and the other half the variant.

Visually, equally split traffic looks something like this:

If a test has more than one variant, for example in an A/B/C test, traffic can still be equally split if all versions receive approximately the same amount of traffic.

In a test with 3 variants, equally allocated traffic would be split 33/33/33.

Visually it would look something like this:

While traffic can be divided in other ways, say 70/30 or 40/30/30, for example, this unequal allocation is not considered best practice. As explained in this article, traffic should, ideally, always be equally allocated.

However, whether traffic has been equally or unequally allocated, SRM can still occur and should always be calculated.

2) Test sample

In addition to traffic allocation, there's the sample of traffic, also known as the sample population.

In an A/B test, the sample of traffic comprises the sample size, or the number of visitors in a test.

Regardless of how traffic is allocated, if the sample of traffic is routed so one variant receives many more visitors than the other, the ratio of traffic is not equal. And you have a Sample Ratio Mismatch (SRM) issue.

Visually, the test sample should be routed like this:

If it's not, an SRM issue occurs. One version has far more traffic routed to it than the other and the ratio of traffic is off. Visually, an SRM issue looks like this:

Sample Ratio Mismatch (SRM) defined

According to Georgi Georgiev, of Analytics-Toolkit, in his article, Does Your A/B Test Pass the Sample Ratio Mismatch Check?

"Sample ratio mismatch (SRM) means that the observed traffic split does not match the expected traffic split. The observed ratio will very rarely match the expected ratio exactly."

And Ronny Kohavi, in his book Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, adds:

If the ratio of users (or any randomization unit) between the variants is not close to the designed ratio, the experiment suffers from a Sample Ratio Mismatch (SRM).

Note that the SRM check is only valid for the randomization unit (e.g., users). The term "traffic" may mislead, as page views or sessions do not need to match the design.

In other words, very simply stated: SRM occurs when one test variant receives noticeably more users than expected.

In this case, you've got an SRM issue. And that's a big problem because it means results aren't fully trustworthy.


SRM a gold standard of reliable test results

However, by checking for SRM, and verifying your test doesn't have an SRM issue, you can be more confident results are trustworthy.

In fact, according to this insightful SRM research paper, "one of the most useful indicators of a variety of data quality issues is (calculating) Sample Ratio Mismatch."

Furthermore, according to Search Discovery,

"When you see a statistically significant difference between the observed and expected sample ratios, it indicates there is a fundamental issue in your data (and even Bayesian doesn't correct for that). This bias in the data causes it to be in violation of our statistical test's assumptions."

Simply put: if you're not looking out for SRM in your A/B tests, you might think your data is reliable or valid when it actually isn't.

So make sure to check for SRM in your tests!


How to spot SRM in your A/B tests

Calculating SRM should be a standard part of your data confirmation process before you declare any test a winner.

The good news: determining an SRM issue is actually pretty easy to do. In fact, some testing platforms, like Convert.com, will now automatically tell you if you have a SRM issue.

But, if the testing platform you're using doesn't automatically detect SRM, no worries. You can easily calculate it yourself, ideally before you've finished running your test.

If you're a stats guru, you can determine if there's a SRM issue through a Chi-Square Calculator for Goodness of Fit. However, if the name alone scares you off, don't worry.

You can also simply plug your traffic and conversion numbers into an existing SRM calculator, like this free calculator or Analytics-Toolkit's SRM calculator.

These calculators can be used for both Bayesian and Frequentist methods and can be used whether your traffic is equally or unequally allocated.
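If you'd rather run the numbers yourself, here's a minimal Python sketch of the chi-square goodness-of-fit check that sits behind these calculators. The visitor counts are made up, and a 50/50 designed split is assumed; for an unequal allocation you'd adjust the expected counts accordingly:

```python
from scipy.stats import chisquare

# Hypothetical A/B test: how many users were actually assigned to each variant
observed = [50_700, 49_300]

# Expected counts under the designed 50/50 split (swap in 0.7/0.3, etc., if that's your design)
total = sum(observed)
expected = [total * 0.5, total * 0.5]

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A very small p-value (commonly < 0.01) flags a likely Sample Ratio Mismatch
print(f"chi-square = {chi2_stat:.2f}, p-value = {p_value:.6f}")
print("Possible SRM issue!" if p_value < 0.01 else "No SRM detected")
```

With these made-up numbers, the 1,400-user gap is large enough to flag a likely SRM issue.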

If you don't have a SRM issue, you'll see a message that looks like this:

Assuming everything checks out and no SRM is detected, you're golden.

So long as the test has been run properly, your sample size is large enough to be adequately powered to determine a significant effect, and you've achieved statistical significance at a high level of confidence, you can declare your test a winner.

However, if you have an SRM issue, the calculator will alert you with a message that looks like this:

A p-value of 0.01 or lower shows a significant result. The lower the p-value, the more likely an SRM issue has occurred.

If SRM is detected, that's a problem. It shows the ratio of traffic hasn't been directed to each variant equally and results might be skewed.


How frequent is SRM?

According to experimentation specialist, Iqbal Ali, in his article The essential guide to Sample Ratio Mismatch for your A/B tests, SRM is a common issue.

In fact, it happens in about 6-10% of all A/B tests run.

And, in redirect tests, where a portion of traffic is allocated to a new page, SRM can be even more prevalent.

However, if any test you've run has an SRM issue, you need to be aware of it, and you need to be able to mitigate the issue.

Does SRM occur with all sample sizes?

Yes, SRM can occur with samples of all sizes.

According to Iqbal in this LinkedIn post:

The Type I error rate is always 1% with a Chi Test with the p-value at < 0.01. This means it doesn't matter if we check with 100 users or 100,000 users. Our false-positive rate remains at about 1 out of 100 tests. (See the green line on the chart below).

Having said that, we need to be wary as with low volumes of traffic, we can see larger differences happen by random chance, WITHOUT it being flagged by our SRM check. (See the red line on the chart below).

It should be rare to see those extremes though.

It should be even rarer to see even larger outliers, where SRM alert triggers (false positives). (See the yellow line on the chart below). I wanted to see this number as it's a good indication of volatility. The smaller the traffic volumes, the larger the volatility.

At 10,000+ users assigned, the % difference between test groups before SRM triggers is <5%.

At 65,000+ users this % difference drops to < 2%. So, the Chi Test becomes more accurate.

So beware: no matter how big or small your traffic volume, SRM is a possibility. But the larger the sample, the more likely an SRM issue will be accurately detected.
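To see roughly where those thresholds come from, here's a hedged Python sketch. It assumes a 50/50 designed split and the p < 0.01 rule quoted above, and estimates the smallest gap between groups that would trip the chi-square SRM check at a few traffic volumes; the helper function name is made up for illustration:

```python
from math import sqrt
from scipy.stats import chi2

def min_detectable_gap(total_users, alpha=0.01):
    """Smallest % difference between two equal-split groups that trips the SRM check."""
    critical = chi2.ppf(1 - alpha, df=1)          # chi-square critical value, 1 degree of freedom
    deviation = sqrt(critical * total_users) / 2  # users one group needs above the 50/50 line
    smaller = total_users / 2 - deviation
    return 2 * deviation / smaller                # gap relative to the smaller group

for users in (10_000, 20_000, 130_000):
    print(f"{users:>7} users: SRM check triggers at a gap of about {min_detectable_gap(users):.1%}")
```

The exact percentages depend on how you measure the gap, but the pattern matches the quote above: the bigger the sample, the smaller the imbalance the check can reliably flag.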

Why SRM happens

If your test shows a SRM issue, the first step is to figure out why it may have occurred.

SRM usually happens because:

  • The test is not set-up properly or is not randomizing properly
  • There's a bug in the test or test set-up
  • There are tracking or reporting issues

Or a combination of all these factors.


What to do about SRM issues

1. Review your test set-up

When any test shows an SRM issue, the first step is to review the set-up and confirm everything has been allocated and is tracking properly. You may find a simple (and often hard to detect) issue led to the error.

2. Look for broader trends

However, if you can't isolate any particular problems, it's worthwhile reviewing a larger collection of your tests to see if you can find trends across a broader swath of studies.

For example, for one client who had a lot of tests with SRM issues, I did a meta analysis of all tests run within a 1-year period. Through the analysis, I noticed for every test with over 10,000 visitors per variant, an SRM issue occurred on a particular testing platform.

While this finding was clearly problematic, it meant the SRM issue could be isolated to this variable.

After the issue was detected, the testing platform was notified; I've since seen updates indicating the platform is now working to fix the problem.

3. Re-run the study

In his paper Raise the Bar on Shared A/B Tests: Make Them Trustworthy, data guru Ronny Kohavi explains that a trustworthy test must meet 5 criteria, including checking for SRM.

In this LinkedIn post, Ronny compares not checking for SRM to driving a car without seatbelts. Seatbelts save lives; the SRM check is a guardrail that saves you from declaring untrustworthy results trustworthy.

If your study shows an SRM issue, the remedy is simple: re-run the test.

Make sure you get similar results, but without the SRM issue.

4. Create an SRM alert system

To guard against undetected SRM issues, you might also want to consider building an SRM alerting system.

As experimentation mastermind Lukas Vermeer of Vista remarked in this LinkedIn post, his team created their own SRM alerting system in addition to the testing platform they're using. (Lukas has also created a free SRM Chrome extension checker, available here.)

The progressive experimentation folks at Microsoft have done something similar:
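
If you want a rough feel for what such an alerting system boils down to, here's a minimal sketch; the experiment names, visitor counts, and the notification step are placeholders you'd swap for your own data pipeline:

    # Minimal, illustrative sketch of an SRM alerting loop -- not a production system.
    from scipy.stats import chisquare

    # Hypothetical running experiments: name -> (observed counts, intended split).
    running_experiments = {
        "homepage_cta_test": ([12_480, 11_902], [0.5, 0.5]),
        "pricing_page_redirect": ([8_105, 2_004], [0.8, 0.2]),
    }

    def check_srm(observed, intended_split, threshold=0.01):
        total = sum(observed)
        expected = [total * share for share in intended_split]
        _, p_value = chisquare(f_obs=observed, f_exp=expected)
        return p_value, p_value < threshold

    for name, (observed, split) in running_experiments.items():
        p_value, has_srm = check_srm(observed, split)
        if has_srm:
            # Swap this print for your own notification hook (email, Slack, dashboard flag).
            print(f"ALERT: {name} shows possible SRM (p = {p_value:.4f})")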

5. Choose your poison

If you're not willing to invest in any of these strategies because they're too time- or resource-intensive, fine.

You then need to accept that the tests you're declaring as winners aren't fully trustworthy. And you really shouldn't be making implementation decisions based on the results.

In that case, you need to choose your poison: either accept untrustworthy data, or put in the extra effort it takes to obtain accurate test results.


Summary

If your A/B tests have fallen victim to an SRM issue, the results aren't as valid and reliable as they should be. And that's a problem.

Because, if you're making data-driven decisions without accurate data, you're clearly not in a good position.

Testing for SRM is like putting on your seatbelt. It could save you from implementing a so-called "winning" test that actually amounts to a conversion crash.

So remember: always be testing. And, always be checking for SRM.


Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below. 

By: Lawrence Canfield | Last updated April, 2022

Overview

Virtually any new web venture today can benefit from A/B testing as a means of curating content to visitor preferences.

Without a doubt, this kind of personalization holds true for gaming platforms, too.

As platforms that directly interface with a large volume of active consumers on a day-to-day basis, gaming sites benefit from testing in significant and direct ways.

With this outlook in mind, here are some elements those managing gaming sites might want to consider for A/B testing.


Game Previews: Visual Covers vs. Video

A popular method for many gaming sites is to showcase a variety of image-only click-throughs when giving options, with video previews available on game-specific pages.

Some sites also opt for embedded GIFs as a way of quickly previewing gameplay (although, lacking audio, these might actually be better replacements for images than for video).

Generally speaking, the marketing industry has trended toward video of late, but it's a good idea to brainstorm and test how users might best engage with game previews. These GuessTheTest A/B test case studies show just how important it is to test and find the optimal app, video, gaming, or product thumbnail images.


Pull-Down Menus vs. Horizontal Sliders

This comparison needs to be considered in the context of both desktop and mobile versions of gaming platforms since the transition between them has, historically, been a bit of a nightmare.

The reality is, the majority of gamers are now on mobile, and UX designers often opt to adapt mobile sites to display pull-downs as horizontal sliders.

While the choice is going to depend partly upon the architecture of the site, it’s important to know how users perceive accessibility and enticement.

With a lot of CSS templates being easy to copy and paste nowadays, quickly mocking up a couple of menu options to test shouldn’t take too long.

As these GuessTheTest case studies show, testing and finding the optimal navigational format and presentation for your audience has a big impact on conversion rates across desktop and mobile devices.


Purchase Recommendations

Deciding how to order and display buying options is a key business decision for any game site operator.

This GuessTheTest case study, conducted on behalf of T-Mobile, tested how to display different subscription tiers. It illustrates how careful design may yield substantially different results in terms of what users consider the "default" and how willing they are to be upsold.

For gaming sites, the test could be specifically between tiers of bundles or other models.


Methods of Payment

The moment when a potential gamer takes the plunge and pays is crucial to optimize. And on the user side, a balance needs to be struck between convenient options and perceived safety, assuming all options are safe.

With gaming sites, there are both new and established options for users to choose between.

Regarding the new, the notion of poker sites considering cryptocurrency for deposits and payouts has become very real in recent years.

Crypto is becoming a fresh option alongside more long-standing alternatives -- such that an A/B/C test could differentiate between preferences for a processor (like PayPal or Skrill), paying directly via credit card, or paying via crypto.

It may be that platforms are best off supporting all three options, but testing for preferences will nevertheless influence what’s most sensible to emphasize.


Displaying Offers

Timing and method are keys when it comes to displaying special offers.

Pop-ups can create a bit of an allergic reaction for internet users -- even gamers who are accustomed to fairly “busy” casino sites. To this point, many users still take active steps to block them.

For this reason, it's worthwhile testing whether it's better to use a pop-up or display the offer as a banner carousel.

If you do go for pop-ups, test the impact of using less intrusive formats and minimizing the number of overall pop-ups used.

As this GuessTheTest notification bar case study shows, timing and placement are important variables to get right to improve the effectiveness and conversion rates of pop-ups or other notification displays.


Summary

There are a lot of decisions that need to be made with a mass user-base platform like a gaming site.

There are many elements that can be optimized to appeal to more users. A/B testing is a powerful tool to determine what works best with your users.

For more A/B testing inspiration, check out the full library of tests. There are hundreds of inspiring A/B tests for you to look at and game the system with.

Take your best guess, see if you win, and apply the findings to optimize your own success.


Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below. 

By: Deborah O’Malley and Joseph A. Hayon | Last updated January, 2022

Overview

Did you know that 96% of adult users report being concerned about their privacy on the Internet? And 74% say they’re wary of how their privacy and personal data is being used.

With rising fears over Internet privacy and security, there’s been a push to protect users and tighten privacy policies.

As a result, 3rd party cookies will soon be blocked by most major web browsers, leaving optimizers and experimenters desperately picking up the crumbled pieces.

We'll have no easy way to track user events or behaviors -- let alone measure conversions.

What's an experimenter to do?

In this article, you’ll learn everything you need to know about cookies, their implication on A/B testing, and how to ensure continued success in your testing practice.

Specifically, you’ll find out:

  • What cookies are
  • The two most important types of cookies
  • How a cookieless world may impact A/B testing
  • Server-side tag management and what it means for your testing practice
  • What you should do to adapt to the changing landscape

What are cookies? 

Beside those delicious, tasty snacks, what are cookies anyway?

Well, in the context of the World Wide Web, cookies are text files sent from a website to a visitor’s computer or other Internet connected device.

That text file identifies when a visitor is browsing a website and collects data about the user and their behavior.

Collected data may include elements like a user's personal interests, geographical location, or device type.

These data points are stored on the user's browser. The information storage occurs at the time the user is engaging with a website, app, or online ad.

Through this collection and storage process, cookies achieve three things.

They can:

  1. Gather and store information about the user
  2. Capture the actions or events a user takes on a website or app
  3. Track a user’s activity across multiple websites

Cookies – the good, the bad, and the ugly

The good

Cookies attempt to provide users with a more personalized experience when engaging with a website or app.

When marketers analyze tracking data, it can provide them with guidance on where optimizers' focus should be placed. The insights available are often richer than what standalone spreadsheet results or visual dashboards can provide.

So, despite their bad rep, cookies can, in fact, help digital marketers detail what, where, and why certain web events happen.

For these reasons, cookies can be a valuable (and delicious) part of the web.

The bad

However, because cookies track and store a user’s browser activity, they’ve been demonized as invasive and intrusive monsters. Cookie monsters, that is.

The ugly

With growing concerns over user privacy, there’s been a call to arms for the abolishment of cookies.

Getting rid of cookies may spell bad news for digital advertisers, marketers, experimenters, optimizers, and A/B testers reliant on using tracked data to understand user behavior and measure conversions.


The two types of cookies you need to know

While there are plenty of tasty cookies out there, as a digital marketer, there are really only two kinds you need to know: 1st and 3rd party cookies.

The difference between these two types of cookies is based on how a user's browsing activity is collected and where that data gets sent. 

1st Party Cookies Defined

1st party cookies are stored directly by the website or domain the user visits, or the app the user has logged into.

Once the user’s data has been captured, it is sent to an internal server.

3rd Party Cookies Defined

3rd party cookies have a slightly different flavor.

3rd party cookies are created by outside domains and relayed back to an outside third-party server, like Google, Facebook, or LinkedIn.

It's because the data is collected and sent out, to a 3rd party, that 3rd party cookies are so named.

While this information might be interesting, as a digital marketer, what you really need to know is, 3rd party cookies are the troublesome ones of the batch.

They’re essentially tracking pixels.

And because they track and send data back to advertisers’ servers, they're seen as intrusive and highly invasive of user privacy.

For this reason, privacy protection advocates don’t like them and want them abolished.

And so, big-name brands, like Google and Facebook, are being forced to shift into a cookieless world -- one where 3rd party cookies will no longer exist.

In fact, Google is planning to eliminate the use of 3rd party cookies as soon as early 2022 and will likely move instead into privacy sandbox initiatives.

How do 1st and 3rd party cookies work?

Now that you understand what a cookie is, and the difference between 1st and 3rd party cookies, you can explore how each type of cookie works.

Grasping this concept will help you aptly shift from a reliance on 3rd party cookies to adopting other solutions that still enable the data tracking, collection, and conversion measurements you need for A/B testing.

The inner workings of a 1st party cookie

To fully understand how a 1st party cookie works, you have to start with realizing the internet has what’s known as a “stateless protocol.”

This term means, websites inherently “forget” what a user has done, or what activity they’ve taken -- unless they’re given a way to remember this information.

1st party cookies solve this forgetfulness problem.

They track and capture user activity information and, in doing so, aim to deliver a highly identifiable and authenticated user experience.

For example. . .

Imagine a user goes to your eCommerce website, adds an item to their cart, but doesn’t end up making a purchase during their initial session.

Well, the cookie is what enables the items added to cart to be saved in the user’s basket.

So, if the visitor later returns to your site, they can continue their shopping experience exactly where they left off -- with the items still in their basket from the previous session.

All thanks to 1st party cookies.
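
Under the hood, that "remembering" boils down to the site's own server setting a cookie on its own domain. Here's a minimal sketch of the idea in Python using Flask; the route names, cookie format, and 30-day lifetime are illustrative assumptions, not a production pattern:

    # Minimal, illustrative sketch of a 1st party "cart" cookie -- not production code.
    from flask import Flask, request, make_response

    app = Flask(__name__)

    @app.route("/add-to-cart/<item_id>")
    def add_to_cart(item_id):
        # Read whatever the 1st party "cart" cookie already remembers about this visitor.
        cart = request.cookies.get("cart", "")
        items = cart.split(",") if cart else []
        items.append(item_id)

        # Set-Cookie is issued by the site's own domain, so this is a 1st party cookie.
        response = make_response(f"Cart now holds: {items}")
        response.set_cookie("cart", ",".join(items), max_age=60 * 60 * 24 * 30)
        return response

    @app.route("/cart")
    def view_cart():
        # On a return visit, the browser sends the cookie back and the "state" is restored.
        cart = request.cookies.get("cart", "")
        return f"Welcome back! Your saved items: {cart or 'none'}"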

This process can aid users and provide a better user experience.

So, you can see, 1st party cookies really aren’t that bad!

And, the good news is, most everyone seems to agree, 1st party cookies are a-okay. So they're here to stay. At least for now. . .

How 3rd party cookies work

An easy way to understand how 3rd party cookies work is by imagining this scenario:

Pretend you’re searching online for a local sushi restaurant and land on “sushi near me” website.

But then, your kid abruptly grabs your phone out of your hand and insists they play you their latest, favorite Facebook video.

It’s 3rd party cookies that will have tracked this behavior and try to serve you ad content believed relevant based on your browsing activity.

The mystery of why you're getting Nyan Cat memes in your Facebook feed is now explained. . .

And, 3rd party cookies are also the reason why you might get served Facebook ads for that local sushi restaurant you just searched.

What happens to user data after a 3rd party vendor gets ahold of it?

After the data is relayed back to the 3rd party, it's then pooled together in the data warehouses owned by the various ad tech vendors.

This data collection and pooling process is what enables marketers to:

  1. Deliver targeted ads relevant to users’ interests
  2. Develop look-alike audiences
  3. Build machine learning processes aimed at predictive ads
  4. Control the number of times a user sees an ad through any given ad network

With these capabilities, it’s possible to measure ad effectiveness, assess key conversion metrics, and gauge user engagement across multiple website networks.

None of these assessments would be easily possible without 3rd party cookies. Which means, without them, A/B testing may become significantly more difficult!


How a cookieless world may impact A/B testing

If you’re concerned about protecting your data and privacy, the abolishment of 3rd party cookies may be welcomed news.

But, if you’re like most marketers, you might be a little freaked out and wondering how the abolishment of 3rd-party tracking cookies is going to impact the tracking and data collection capabilities in your Conversion Rate Optimization (CRO) and A/B testing work.

How will the abolishment of 3rd party cookies affect optimizers?

The answer is -- likely in three main ways:

  1. Legislatively
  2. Through cookie banner requirements
  3. Within browser tracking

Let’s look at each concept in more depth. . .

1) Legislative Reforms

When the Internet surfaced to the mainstream in the late 1990s, www might just as well have stood for wild, wild west.

There weren’t many laws or privacy controls enacted back then.

But, as the web has matured, more privacy policies have been instituted.

GDPR requirements were just the starting point. As Stape.io states,

“Governments are creating laws and regulations designed to encourage tracking transparency. These laws are geared towards preventing companies from unethical or unregulated user tracking. In a nutshell, it's all for the sake of user privacy. As a result of GDPR, CCPA, and ePR, websites that don't reveal which user data is collected or how it's used can face civil penalties.”

Here's a timeline showing recently implemented privacy policies:

From a legal perspective, digital marketers are being required to comply with more, and more aggressive, data-privacy regulations.

And this trend will likely only continue as time marches on.

2) Cookie banner requirements

Website owners must now also provide transparency at the start of a user’s journey by prominently displaying a cookie banner on their website.

We’ve all seen these cookie notification banners. They look something like this:

Now you know why they’re on so many websites! 🙂

And, while cookie banners may slightly disrupt an A/B test experience, they're not the worst offender.

3) Browser tracking

Worse yet are increasingly stringent browser privacy policies.

For example, Apple’s Safari browser recently instituted aggressive policies against cookie tracking.

Safari now “kills” cookies after 7 days, and kills cookies from ad networks after just 24 hours!

These timelines are a big problem, especially for testers following best practice and running an experiment for 2+ weeks.

Why?

Well, let's say, for example, a Safari user enters a website being tested, leaves the site, and comes back 8 days later.

By this time, the cookie has been deleted, and, as a result, the user is now classified as a new visitor and may see a different test variation.

The result: the experiment data may be dramatically skewed and unreliable.

What's going to happen to data collection with A/B testing?!

If this kind of scenario sounds doomsday to you, you might be wondering how you're now possibly going to accurately track and measure browser behavior, events, or conversions.

Here’s the likely outcome:

Fragmented data collection

Data collection and measurement will likely become more fragmented.

As 3rd party cookies are phased out, experimenters will be forced into relying largely on 1st party cookie data.

But, this data will be accessible only from users who consent to the site’s cookie policies and data collection or login terms.

Since fewer users are likely to accept and consent to all privacy policies, data profiles will become more anonymous and less complete.

Net sum: there will be big gaps in the data.

These gaps will likely only grow larger as privacy laws evolve and become more stringent.

Modeled user behavior

Currently, the digital marketing realm seems to be moving towards more of an event-style tracking model, similar to Google Analytics 4 (GA4) data collection reports.

This movement means, in the end, we’ll have more algorithms pointing to modeled user behavior -- rather than actual individual user data.

And, we’ll likely need to rely more on data modeled through statistical calculations, methodologies, and technologies.

Essentially, we’ll move from an individual data collection approach to an aggregated, non-user-segmented overview.

This approach will work if we compile enough data points and invest in developing robust predictive analytics data collection and measurement infrastructure.

Untrackable user data

As privacy policies tighten, tracking user behavior is going to get far more challenging.

For example, Apple recently released its newest operating system, iOS 15.

Built into this OS are a host of security measures that are great for user privacy, but prevent the tracking of any user behavior.

In fact, right now, the only new Apple devices without privacy controls are the AirPods!

In addition to Apple, other browser brands are also cracking down. As such, an anonymous user browsing experience -- with untrackable data -- is becoming the norm.

In turn, private, secure browsers, like Brave, will likely become mainstream.

Meaning, yet again, less data and metrics available to marketers.

The final outlook?

But, in the end, A/B testing likely won’t die. It will just change.

Conversions are most likely going to be based on broader machine-learning algorithms that model users, predict and analyze expected behavior, then provide anticipated outcomes.

Data will be less about observed insights and more about predicting how we think users are likely to behave.

Which, when you really get down to it, is kinda like how most experimenters declare A/B test winners now anyway. 😉


Server-side tag management

Big changes are coming.

Major privacy and security enhancements will shift the way experimenters track, access, and measure conversion data.

3rd party cookies will soon be abolished, and marketers will be left with big holes in their data collection and analysis capabilities.

For data purists, the situation sounds a bit scary.

But, the good news is, server-side tag management may be a potential solution. 

So what is server-side tag management, how does it work, and how might it help A/B testers?

What is server-side tag management?

According to data wizard, Simo Ahava, server-side tag management occurs when you run a Google Tag Manager (GTM) container in a server-side environment.

If that terminology doesn't mean much to you, don't fret.

Simply stated, it means server-side tagging is a bypass around 3rd party cookies.

How does server-side tag management work?

You'll recall, with a 1st party cookie, the cookie is deployed by the owner of the website domain.

Well, with server-side tracking, the data is sent to your own server infrastructure. A cloud server is then used to send data back to 3rd party platforms.

Doing so gets around tracking restrictions and privacy laws. At least for now.
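
As a rough illustration of that flow, here's a minimal sketch in Python using Flask and the requests library. The /collect path and the 3rd party collection URL are placeholder assumptions; in practice, you'd typically run a server-side GTM container rather than hand-rolling this:

    # Minimal, illustrative sketch of a server-side event collector -- not production code.
    from flask import Flask, request, jsonify
    import requests

    app = Flask(__name__)

    # Placeholder for wherever you choose to forward events (ad platform, analytics, etc.).
    THIRD_PARTY_ENDPOINT = "https://example-analytics-vendor.com/collect"

    @app.route("/collect", methods=["POST"])
    def collect():
        # The browser talks only to your own domain, so the request rides on 1st party context.
        event = request.get_json(force=True)

        # You decide what to keep, clean, or strip before anything leaves your infrastructure.
        cleaned = {
            "event_name": event.get("event_name"),
            "page": event.get("page"),
            # Deliberately dropping anything personally identifiable in this sketch.
        }

        # The server, not the browser, relays the event on to the 3rd party platform.
        requests.post(THIRD_PARTY_ENDPOINT, json=cleaned, timeout=5)
        return jsonify({"status": "ok"})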

How does server-side tag management help A/B testers?

This solution enables experimenters to take more control of their data. With server-side tag management, you can choose how to prep and clean the data, as well as where to send users' data.

As a result, there are 3 main advantages. With server-side tagging, you can continue to:

  1. Serve 3rd party content
  2. Achieve better, faster site performance
  3. Reduce data gaps

Let's explore each advantage in a bit more depth. . .

1. Serve 3rd party content

In server-side tag management, typically, the data is ported into a 3rd party ad network. This data flow means, marketers can serve and target content based on users’ indicated interests. 

Marketers can also mask the identity of the browser by changing the container name or hostname.

As a result, an ad blocker won't recognize it as something trying to identify the user, so the ad won't be as readily blocked.

2. Achieve better performance

From a performance standpoint, server-side tagging is faster and more efficient because less code needs to load on the client-side.

So, A/B tests may be able to run more quickly, potentially without lag or flicker.

3. Reduce data gaps

Server-side tag management can also prevent the data from "leaking" or getting lost in the shuffle.

But, most importantly, server-side tag management can reduce the gap between conversion events and actual user behavior.

Final thoughts on server-side tagging

While the shift to server-side tagging may not be the end-all solution, it is a means currently available to help digital marketers take better control of how user data is captured.

For now, it's a viable solution that's worth investing in and learning more about.


Adapting to change

To stay ahead of the curve, here are the top 3 things you can do in this changing environment as we shift into a cookieless world.

1. Build your relationship with your customers

Don’t just rely solely on your business web analytics data.

Learn about your users.

Really get to know your audience. Find out what they like and what keeps them away from your site.

You don’t need to rely on analytics data to do so.

You can talk to your customers, poll them, apply qualitative data, or run User eXperience (UX) studies to learn about your customers -- their needs, wants, and desires -- and how they behave on your website.

2. Get to know your analytics account

Audit the data collection integrity of your analytics data once or twice a month, depending on your organization's needs.

Note any visible changes, and try to dig into elements that are less guided by analytics tools. 

3. Stay up to date

Keep up with new updates to ad tech platforms, like Facebook’s new API implementation for tracking Facebook events, as well as updates with GTM server side tagging.

Some good resources for staying on top of the latest trends include Simo Ahava’s blog and Google’s Developer Resources.

Summary

We're moving into a cookieless world where user privacy and personal data protection are paramount.

This paradigm shift may present many challenges for data-driven marketers reliant on tracking users to personalize the customer experience and accurately calculate conversions.

Despite the changing landscape, as an experimenter, you can still continue to build relationships with your customers within a first-party environment.

Also remember, as data collection methods change, you can’t expect to achieve 100% data accuracy.

Rather, you’ll be using more sampled data to piece together an approximation of how customers behave.

In the end, A/B testing likely won’t die. It will just change.

For even more information about A/B testing in a cookieless world, check out this webinar replay.


Hope you’ve found these suggestions useful and informative. Please share your thoughts and comments in the section below. 

The complete guide: a comprehensive step-by-step tutorial to create and launch A/B tests

By: Deborah O'Malley, M.Sc | Last updated December, 2021

Overview

Good news!

In the free A/B testing platform, Google Optimize, it's easy to set-up your first A/B test.

But, so that you can accurately, and confidently, perform the set-up, this article breaks down exactly what you need to do -- and know -- with definitions, examples, and screenshots.

Setting-Up a Test Experience

Imagine that you're about to set-up and run your first A/B test. Exciting times!

Let's say you've decided to test the button copy on your sign-up form. You want to know if the text "sign up now" converts better than "get started".

Well, by running an A/B test in Google Optimize, you can definitively determine if one text variant will outperform.

Creating Your Experience

To begin, you simply need to open Google Optimize, ensure it's set-up for your specific site, and click the "Create" button in the upper right-hand corner of your screen.

Doing so will create a new test experience:

A screen will slide out from the right side, asking you both to name your experiment, as well as define what kind of test you’re going to run:

Naming Your Test

You'll first need to name the test.

The test name should be something that is clear and recognizable to you. 

*Hint* - the name should be both memorable and descriptive. A title both you and your colleagues or clients will immediately understand and recognize when you come back to it in several months or even years time.

For this example, we’ll call the test “CTA Button Test”

Site URL

You’ll next be prompted to type in the URL of the webpage, or website, you’d like to use. 

This URL is very important to get right. 

It’s crucial you input the URL of the control -- which is the original version you want to test against the variant(s).

If you’re running a re-direct test and you select the URL of the variant, the variant will be labeled as the control in Google Optimize and it will totally mess up your data analysis -- since you won’t be able to accurately compare the baseline, or performance of the original version against the variant.

Trust me! I’ve seen it happen before; it’s a mistake you want to avoid.

So, make sure you properly select the URL of the page where your original (the control) version currently sits on your site. 

If the element you want to test is a global element, for example, a top nav bar button that shows on all pages across the site, you’ll use the homepage URL.

In this example, the button we’re testing is in the top nav header and shows on every page of the site. So, we’ll use the website's homepage URL: https://convertexperts.com/

We’ll enter that URL into the URL field:

Defining Your Test Type

In order to accurately run the proper test, you'll need to choose the correct type of test to run.

In Google Optimize, there are four different test type options:

  • A/B test
  • Multivariate test
  • Redirect test
  • Personalization

Most of the time, you'll be running a straight-up A/B test. But to know for sure which test type is best for your needs, check out this article.

For our example, we just want to set-up a simple A/B test looking at the effect of changing the CTA button.

For your test, once you've confirmed the type of test you're going to run, you're ready to officially set-up the test, or as Google Optimize calls it, "create the experience."

To do so, simply press the blue "Create" button in the upper right hand corner:

Adding the Variants

Once you've created the experience, you’re now ready to continue setting up the test, starting with adding your variant(s).

The variant, or variants, are the versions you want to test against your control. Typically, in an A/B test, you have one variant (Version B).

But, as long as you have enough traffic, it's perfectly legitimate to run an A/B/n test with more than one variant (where n stands for any number of other variants.)

To set-up your variant(s), simply click the blue “Add variant” button:

Next, name the variant with a clear, descriptive, memorable title that makes sense. Then click “Done”:

Woo hoo! You’ve just set-up your A/B test! 🙂

But, your work isn’t done yet. . . now you need to edit your variant. 

Editing the Variant

Remember, the variant is the version you want to test against the control.

To create the variant, you can use Google Optimize’s built-in WYSIWYG visual editor.

Or, for more complex tests and redesigns, you might want to inject code to create the new version.

To do so, you click “Edit”:

Note, the original is the version currently on your site. You cannot edit this page in any way through Optimize. The name cannot be changed, nor can the page URL.

You’re now brought into an editor where you can inject code or get a visual preview of the webpage itself. 

Using the visual editor, you can click to select the element you want to edit, and make the change directly. 

Optimize’s visual editor is pretty intuitive, but if you’re unsure what elements to edit, you can always refer to this guide.

In this example, you see the visual editor.

To make changes, you'd first click to select the “Get a Free Analysis” button text, and then click to edit the text:

Now, type in the new text, “Request a Quote” and click the blue “Done” button at the bottom right of the preview editor screen:

When you're happy with all changes, click the top right "Done" button again to exit out of the preview mode:

You're now brought back into the Optimize set-up:

Here, you could continue adding additional variants, in the same way, if you wanted.

You could also click the "Preview" button to preview the variants in real-time.

Assigning Traffic Weight

Once you've assigned and defined your variants, you're going to want to state the weight of traffic, or what percentage of traffic will be allocated to each variant.

The default percentage is a 50/50% split meaning half of visitors (50%) will see the original version and the other half (50%) will see the variant.

As a general testing best practice, traffic should be evenly split, or when testing two versions, weighted 50/50.

As explained in this article, unequal allocation of traffic can lead to data discrepancies and inaccurate test results.

So, as a best practice, don't change this section.

But, if for some reason you do need to change the traffic weight, you can do so by clicking on the link that says "50% weight":

A slide-out will then appear in which you can click to edit the weight to each variant. Click the "Custom percentages" dropdown and assign the weight you want:

If you were to assign an 80/20% split, for example, that would mean the bulk, 80% of traffic, would be directed towards the control and just 20% of visitors would see the variant.

This traffic split is very risk averse because so much of the traffic is diverted to the control -- where there is no change.

If you find yourself wanting to allocate traffic in this way, consider if the test itself should be run.

Testing is itself a way to mitigate risk.

So, if you feel you further need to decrease risk by only showing a small portion of visitors the test variant, the test may not actually be worth doing.

After all, testing takes time and resources. So, you should do it properly. Start with evenly splitting traffic.

Setting-Up Page Targeting

You're now ready to set-up the page targeting.

Page targeting refers to the webpage being "targeted," or tested, in the experiment.

You can target a specific, single page, like the homepage, a subset of pages, like all pricing pages, or all pages on the site.

In this test example, we want to select any URL that contains, or includes a certain URL path within the site.

We're, therefore, going to set-up our Google Optimize test with a URL rule that uses the “Contains” substring match. We'll do so by clicking on the pencil, or edit, symbol:

And, selecting “Contains” from the dropdown menu:

This rule is saying, I want all the URLs that contain, or have, the www.ConvertExperts.com URL to be part of the test. 

In contrast, if we had selected “Matches”, the test would only be on the homepage www.ConvertExperts.com, because it would be matching that URL.

If you’re unsure which parameter to select for your test, you can consult this Google article.

If you want to verify the URL will work, you can check your rule. Otherwise, click “Save”:

Audience Targeting

Now, you can customize the test to display only for certain audiences or behaviors.

To do so, simply click “Customize”:

Here you can set a variety of parameters, like device type and location, if you only want certain viewers taking part in the test:

Note, if you want to parse out results by device type, that reporting is done in Google Analytics, and should NOT be activated within Google Optimize.

However, if you only wanted mobile visitors, for example, to take part in the test, then you’d select the “Device Category” option and choose only mobile visitors.

In this test example, we don’t have any rules we’d like to segment by, so we’ll leave everything as is.

Describing the Test

Next, you can add a “Description” about the test.

This step is optional, but is a good practice so you can see your test objective and remind yourself of the hypothesis. 

Adding a description also helps keep colleagues working with you on the same page with the test.

To add a “Description” simply, click the pencil to edit:

Then, add your description text:

Defining Test Goals

You're now ready to input your test goals.

"Goals" are defined as the conversion objectives or Key Performance Indicator (KPI) of what you're measuring from and hoping to improve as a result of the experiment.

You may have one single goal, like to increase form submissions. Or many goals, like increasing Clickthrough Rates (CTRs) and form submissions.

Your goals may be conversion objectives that you've newly set, or might tie-in to the existing goals you've already created, defined, and are measuring in Google Analytics.

To set-up a new goal, or select from your existing goals, simply click the “Add experiment objective” button:

You'll then have the option to either choose from already populated goals in Google Analytics, or custom create new goals.

Note, if you're using existing goals, they need to have already been set-up in Google Analytics and integrated with Google Optimize. Here are detailed instructions on how to link Google Analytics into Google Optimize.

For this example, we want to “Choose from list” and select from the goals already created in Google Analytics (GA):

The GA goals now show up, as well as other default goals that you can select:

In this example, we want to measure those people who reached the thank you page, indicating they filled out the contact form. We, therefore, select the "Contact Us Submission" goal:

We can now add an additional objective. Again, we’ll “Choose from list”:

In this case, we also want to see if the button text created a difference in Clickthrough rate (CTR) to the form page.

Although this goal is very important, it's labelled as the secondary goal because contact submissions are deemed more important than CTR conversions:

Email Notifications

It's completely optional, but under the “Settings” section, you can also select to receive email notifications, by sliding the switch to on:

Traffic Allocation

Traffic allocation is the percentage of all visitors coming to your site who will take part in the test. 

Note, this allocation is different than the weight of traffic you assign to each variant. As described above, weight is the way you split traffic to each variant, usually 50/50. 

Of that weighted traffic, you can allocate a percentage of overall visitors to take part in the test.

As a general best practice, you should plan to allocate all (100%) of traffic coming to your site to the test experiences, as you’ll get the most representative sample of web visitors. 

Therefore, you shouldn't need to change any of the default settings.

However, if you’re not confident the test will reveal a winner, you might want to direct less than 100% of the traffic to the experiment.

In this case, you can change the traffic allocation from 100% of visitors arriving at your site to a smaller percentage by clicking the pencil to edit the value here (simply drag the slider up or down):

Note that the smaller the percentage of total traffic you allocate to your test, the longer it will take for you to reach a statistically significant result.
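
To make that trade-off concrete, here's a small back-of-the-envelope sketch; the required sample size and daily traffic numbers are made up purely for illustration:

    def days_to_finish(required_per_variant, daily_visitors, allocation, num_variants=2):
        # Visitors actually entering the experiment each day, split evenly across variants.
        daily_per_variant = daily_visitors * allocation / num_variants
        return required_per_variant / daily_per_variant

    # Example: the test needs 10,000 visitors per variant and the site gets 2,000 visitors/day.
    print(days_to_finish(10_000, 2_000, allocation=1.0))   # 10.0 days at 100% allocation
    print(days_to_finish(10_000, 2_000, allocation=0.25))  # 40.0 days at 25% allocation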

As well, as explained in this article, unequal allocation of traffic, or reallocation of traffic mid-test, can lead to data discrepancies and inaccurate test results. So once you've allocated, ideally, 100% of your traffic to the test, it's best to set it and forget it.

Activation Event

By default, “Page load” is the activation event, meaning the experiment you've set-up will show when the webpage loads. 

So long as you want your test to show when the page loads, you’ll want to stick with the default “page load” setting.

If you’re testing a dynamic page -- one which changes after loading -- or a single page application that loads data after the page itself has populated, you’ll want to use a custom “Activation event” by clicking the pencil tool and selecting the activation event from the dropdown menu that fits best for you:

An activation event requires a data layer push using this code: dataLayer.push({'event': 'optimize.activate'}); You can learn more here.

Note, in the free version of Google Optimize, you can choose up to one primary objective and two secondary objectives. 

Once these objectives are selected and your experiment launched, you can’t go back and change them. 

So, make sure you think about how you want to track and monitor conversions before you launch your test!

Prior to Launch

With all your settings optimized, you’re nearly ready to start your test! 

Preview Your Experiences

But, before launching, it’s always a good idea to preview your experiences to make sure everything looks good.

To do so, click on the “Preview” button in the Variants section and select the appropriate dropdown menu for the view you want to see:

My recommendation is to individually preview each variant in web, tablet, and mobile preview mode:

Confirm and Debug Your Tags

Next, you’ll want to click the “Debug” option within the Preview mode:

Clicking into the Debug mode will bring up the website you’re testing and will show you a Google Tag Manager (GTM) screen, with the targeting rules for the tags that will be firing:

If there are any issues, you can debug them now -- before launching your experiment.

Get Stakeholder Approval

If you’re working with stakeholders, or clients, on the test, it’s a good idea to let them know the test is set-up and ready to launch, then get their approval before you start the test.

Previewing the test variants, and sending them screenshots of the preview screens will enable you to quickly and efficiently gain stakeholder approval.

You're then ready to launch the test! 🙂

Launching the Test

With all your i’s dotted and t’s crossed, you’re ready to launch your test!

Woohoo!

To do so, simply click the “Start” button at the top right of the screen.

And, ta da! You’ve just set-up an A/B test in Google Optimize.

Congratulations. Well done!

Your Thoughts

Hope you found this article helpful!

Do you have any thoughts, comments, questions?

Share them in the Comments section below.

A simplified explanation of the four types of tests you can run in Google Optimize

By: Deborah O'Malley, M.Sc | Last updated December, 2021

Overview

In the free A/B testing platform, Google Optimize, there are four different types of tests you can choose to run.

To help you accurately run the right test for your project needs, this article breaks down the four test types, defining each term with an explanation, examples, and images.

The Four Test Types

In Google Optimize, the four different test type options are:

  • A/B test
  • Multivariate test
  • Redirect test
  • Personalization

A/B Test

In an A/B test, you're testing the control, which can be thought of as the original, or latest, version, against one, or more, new versions.

The goal of an A/B test is to find a new version that outperforms the control. 

When running an A/B test, you can test the control against more than one variant. In this case, you have what's called an A/B/n test. 

The "n" stands for any number of other variants.

If running an A/B/n test, you just need to make sure you have enough traffic to reach a significant result in a reasonable amount of time. To ensure statistically significant results, check out this article.

Here’s a visualization of a very simple A/B test.

As you can see, in version A, there's a green button. In version B, the button color has been changed to red.

If we were to run this A/B test, our goal would be to see if one button color outperforms:

While button color tests are often regarded as the simplest form of studies, they still can yield powerful results and valuable testing lessons as this GuessTheTest case study shows.

But beyond button color tests, the changes you make between the control and variants don't have to be big or drastic to yield a strong uplift.

As this powerful study shows, simply changing the orientation of a form, from the right to left side of a page, for example, can create a huge conversion gain. Which version do you think won? Go ahead and take your best guess on this formidable form test.

Multivariate Test (MVT)

In an MVT test, you're testing multiple elements within a page to see which combination of elements performs best. 

For example, referring back to the A/B button color test example (above), you might decide you'd also like to add a picture and test not only the button color, but also the shape of the picture, either square or circle.

With the addition of picture shape, you're now modifying more than one element on the page. 

An MVT test allows you to accurately compare the performance of all the combinations of changed elements.

This point is particularly important because in regular A/B testing, you're only able to assess the change of one version against another. While you can indeed change multiple elements, the drawback is, in an A/B test, you can't accurately account for which element change created the conversion effect.

However, with an MVT test, if set-up and run properly, you can change, test, and accurately assess any number of different combinations of elements to ascertain which combination performs best.

Here’s a visualization of a MVT test looking two different elements, across two different combinations, button color and picture shape:

A word of caution: since there are more variations, MVT tests require higher traffic and take a longer time to run. Their results can also be more complex to accurately decode. 

So, it’s recommended you don’t run MVT tests unless you have adequate traffic and/or you’re a highly skilled tester with robust data analysis capabilities.

Redirect Test

A redirect test is also called a split URL test.

That's because you’re putting one test variant on one page, with its own URL, and testing it against another variant on a separate page, with its own URL.

While a redirect test is, in essence, an A/B test, the difference is that with an A/B test, version A and B sit on the same URL and traffic is directed to one variant or the other.

In contrast, in a redirect test, version A and B sit on different URLs.

A redirect test is typically used when you have an alternative page, or completely different design you’d like to test against the control. 

For example, you might have a single-page form and want to test that form against a multi-step form. 

To do so, you’d set-up a redirect test and send traffic either to the original form (version A, control), or the multi-step form, (version B, variant) to see which variant performs best. 

Here’s a visualization of a redirect test:

Here’s a good case study example of a multi-step form test, set-up as a split URL test. This test looked at whether a multi-step or single-page registration form converted better. Which version do you think won? Take your best guess.

As a general rule of thumb, it’s best to run redirect tests when:

  • It’s easier to make a completely new page, rather than modify an existing page design.
  • You’re looking to test two (or more) completely different designs against each other, like a new form design, or landing page alternative.
  • You want to add or take out steps in the user journey/conversion path. 
  • You’re testing entire websites against each other.

Personalization Test

A personalization test is a bit different than a standard A/B test.

A personalization experiment allows you to specify changes for a specific group of visitors. 

Unlike traditional experiments, it can run forever, and doesn’t have variants. So, it’s more of a change implemented on a website than an actual “test”.

An example of a personalization would be testing the effect of displaying a different hero image to desktop viewers and mobile viewers. Here’s a visualization of that concept:

In this example, you can see, the desktop and mobile experience are different. The mobile version shows a different shape, or image, than on desktop.

The view is personalized depending on the device type the user is on.

Personalization experiments can be as simple as showing an image on desktop, but not mobile. Or they can be very complex.

Here’s a great real-life personalization case study showing a complex personalization test based on different user cohorts. Can you guess the right test?

Share Your Thoughts

With this primer, you're now ready to set-up your test and begin to "create your experience" in Google Optimize.

Hope you found this article helpful.

Do you have any thoughts, comments, questions? Share them in the Comments section below.

By: Deborah O'Malley, M.Sc. & Joseph A. Hayon | Last updated December, 2021

Overview:

Soon, third-party cookies will be blocked by most major web browsers leaving optimizers and experimenters desperately picking up the crumbled pieces.

We'll have no easy way to track user events or behaviors -- let alone measure conversions.

What's an optimizer to do?

In this informative webinar replay, renowned data analytics consultant, Joseph A. Hayon, will tell you!

He shares the latest on:

  • Cookies. Mmm
  • The two most important kinds of cookies you need to know about
  • Changes happening to cookies
  • Cookies and their impact on A/B testing
  • Server-side tag management
  • Adapting to change

Grab a coffee, enjoy a cookie, and take a bite out of your privacy policy questions.
