By: Georgi Georgiev | 2019
What is “unequal allocation”?
Unequal allocation is most easily understood when contrasted with equal allocation.
If there are 3 variants and one control, we would allocate 25% of users to each of them, under equal allocation.
With unequal allocation, one decides to allocate a smaller or greater proportion of users to either the control or the variants as a whole or to particular variants in a multivariate test (MVT).
For example, one can allocate 80% of traffic to the control and 20% of traffic to the test variant, or 40% to the control and 20% to each variant in an MVT with 3 variants.
People consider unequal allocation for different reasons, but the main concern is the cost/benefit analysis for the duration of the test.
With an 80/20 control/variant allocation, one risks less exposure to an inferior variant during the test, compared to a test of the same duration, but with equal 50/50 allocation.
With a 20/80 control/variant allocation, one usually aims to get more users to experience the tested variant during the test.
Some appear to be under the impression that by running a 20/80 control/variant test they can reach a conclusion faster, while others believe that running an 80/20 control/variant test means they can reach the same conclusion, but with less risk.
Below I’ll address each of these assumptions in detail.
Effects of unequal allocation on the duration of an A/B test
Some people seem to think that running an A/B test with unequally distributed traffic among the variants will allow them to reach conclusions quicker.
This, however, is not the case.
The reason is that statistical significance calculations and confidence interval calculations depend heavily on the number of users allocated to each variant, not on the total for the test as a whole.
In a significance calculation, within-group variance is compared to between-group variance. If one group contains data from significantly fewer users, the variance of whatever outcome metric you are using will be higher for that group, leading to higher uncertainty and thus a higher observed significance level (a lower confidence level).
Similarly, the confidence interval for absolute or relative difference will be wider, reflecting the greater uncertainty associated with the estimate.
For example, if during the test, the control converts at 3% and the test variant at 3.5%, with 80/20 unequal allocation we get this result:
While if the same conversion rates had been observed with 50/50 allocation, we would have roughly half the uncertainty (~5% instead of ~10%):
In precise terms, a relative difference as big as 20%, or bigger, would be observed with probability ~10% in the first case if there were no real improvement, while in the second case the probability of observing such a difference is only about 5%. The 95% confidence interval for relative change spans from -5.3% to +∞ in the first case and from -0.3% to +∞ in the second.
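The arithmetic behind this comparison can be sketched with a simple one-sided two-proportion z-test. The total of 10,000 users, the unpooled standard error, and the z-test itself are illustrative assumptions on my part; the exact figures quoted above come from a dedicated significance calculator and may use slightly different inputs, so the p-values below will be in the same ballpark but not identical.

```python
from math import sqrt, erf

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def one_sided_p(n_control, n_variant, p_control=0.03, p_variant=0.035):
    """One-sided p-value of an unpooled two-proportion z-test
    for H0: variant is no better than control."""
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_variant * (1 - p_variant) / n_variant)
    z = (p_variant - p_control) / se
    return 1.0 - norm_cdf(z)

total = 10_000  # illustrative total sample size (assumption)
p_unequal = one_sided_p(int(total * 0.8), int(total * 0.2))  # 80/20 split
p_equal = one_sided_p(total // 2, total // 2)                # 50/50 split

print(f"80/20 allocation: p = {p_unequal:.3f}")
print(f"50/50 allocation: p = {p_equal:.3f}")
```

The same observed rates yield a noticeably larger p-value under the 80/20 split, purely because the smaller variant group inflates the standard error of the difference.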
Some people find it easier to visualize this in terms of confidence intervals for the conversion rate in the two groups instead of the confidence interval for the difference in conversion rates. Here is how the confidence intervals look for the two groups with unequal vs. equal allocation of 10,000 users:
As you can see, with unequal allocation we have greater accuracy for the conversion rate of the control group compared to the control group under equal allocation; however, this gain is entirely negated by the increased inaccuracy of the variant estimate in the unequal case versus the equal one.
There is no free lunch in life and statistics are no exception. Gaining accuracy for one group comes at the cost of reduced accuracy for the other, everything else being equal.
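This trade-off can be checked numerically. The sketch below assumes the same setup as above (10,000 users total, control converting at 3%, variant at 3.5%) and computes normal-approximation 95% confidence interval half-widths per group under both allocations; the normal approximation is my simplification and slightly cruder than what a dedicated calculator would use.

```python
from math import sqrt

def ci95_half_width(p, n):
    """Half-width of a normal-approximation 95% CI for a proportion."""
    return 1.96 * sqrt(p * (1 - p) / n)

half_widths = {}
for label, n_control, n_variant in [("80/20", 8_000, 2_000),
                                    ("50/50", 5_000, 5_000)]:
    half_widths[label] = (ci95_half_width(0.03, n_control),   # control at 3%
                          ci95_half_width(0.035, n_variant))  # variant at 3.5%
    print(f"{label}: control ±{half_widths[label][0]:.4f}, "
          f"variant ±{half_widths[label][1]:.4f}")
```

Since the variance of the difference is the sum of the two per-group variances, the slightly narrower control interval under 80/20 cannot make up for the much wider variant interval: the combined uncertainty is larger than under 50/50.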
This is why anything other than equal allocation leads to less efficient tests. A/B tests with unequal allocation will require significantly more time to complete than tests with equal allocation. The same, of course, holds for MVT tests.
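The duration penalty can be made concrete with a standard normal-approximation sample-size formula for a two-proportion test. The parameters below (one-sided alpha of 5%, 80% power, a 3% baseline lifted to 3.5%) are illustrative assumptions, not values from the text:

```python
from math import sqrt

# Standard normal quantiles for alpha = 0.05 (one-sided) and 80% power
Z_ALPHA, Z_BETA = 1.6449, 0.8416

def total_sample_size(w, p1=0.03, p2=0.035):
    """Total users needed to detect p1 -> p2 at the given alpha/power
    when a fraction w of traffic goes to the control and 1 - w to the
    variant (normal approximation)."""
    variance_term = p1 * (1 - p1) / w + p2 * (1 - p2) / (1 - w)
    return (Z_ALPHA + Z_BETA) ** 2 * variance_term / (p2 - p1) ** 2

for w in (0.5, 0.6, 0.7, 0.8):
    print(f"control share {w:.0%}: ~{total_sample_size(w):,.0f} users")
```

The required total grows as the split moves away from equal: under these assumptions an 80/20 split needs roughly 60% more users than a 50/50 split for the same power, which translates directly into a longer test. (For metrics with similar variance in both groups, the optimum sits at or very near the equal split.)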
Changing to unequal allocation mid-test
Even after understanding this concept, many are still tempted to resort to unequal allocation if they monitor a test and discover that one variant is performing significantly worse or significantly better than another.
In such cases, practitioners may be tempted to push more traffic to a variant, or to restrict the amount of traffic allocated to it. This is a really bad practice if done ad hoc and without proper statistical methods.
It has similar effects to monitoring a test and then interpreting the results as if no monitoring (peeking) occurred: the nominal p-value no longer reflects the actual p-value, the nominal accuracy of confidence intervals and other estimates no longer holds, and it ultimately leads to a severe increase in the rate of false positives in a testing program.
The proper way to monitor the outcome of an A/B test during its duration is to employ a sequential testing method, such as AGILE. In such cases one constructs efficacy and futility stopping boundaries and only stops the test if one of the boundaries is crossed. However, most such methods are only valid if the allocation does not change during the entire duration of the test.
There is a class of methods for “adaptive sequential design” which allows for changing the allocation between test variants on the go, for dropping variants mid-way, and even for adding new variants to a running A/B test. These methods are considerably more complex computationally and can be more difficult to interpret. Worse, contrary to naïve intuition, such methods are not more efficient than methods based on equal allocation.
In fact, the opposite has been proven: for any adaptive sequential method there exists an equally-efficient or more efficient non-adaptive sequential alternative. To quote Tsiatis & Mehta’s 2003 paper “On the inefficiency of the adaptive design for monitoring clinical trials”: “For any adaptive design, one can always construct a standard group-sequential test based on the sequential likelihood ratio test statistic that, for any parameter value in the space of alternatives will reject the null hypothesis earlier with higher probability, and, for any parameter value not in the space of alternatives will accept the null hypothesis earlier with higher probability.”
Further confirmation can be found in Lee, Chen & Yin’s 2012 “Worth adapting? Revisiting the usefulness of outcome-adaptive randomization”: “In summary, equal randomization maintains balanced allocation throughout the trial and reaches the specified statistical power with a smaller number of patients in the trial [compared to adaptive randomization]”.
Should you use unequal allocation in A/B testing?
While unequal allocation, and especially mid-test reallocation, has an intuitive appeal, it is in fact detrimental to the efficiency of an A/B test when done properly, and when done improperly it compromises the validity of the results in much the same way as other malpractices, such as unaccounted-for peeking.
Even if your tools allow for unequal allocation of users between test variants, your best bet is to stick to equal allocation from the beginning to the end of an A/B test.
If some of the terminology used in this article is unfamiliar you can also refer to the Analytics-Toolkit A/B testing glossary for detailed definitions.
Disclaimer: While unequal allocation can technically be beneficial with more than one treatment if one uses a naive p-value and confidence interval calculation, the efficiency gains obtained by using procedures like Dunnett’s correction far outweigh such benefits. Given that Dunnett’s correction can only be applied to an equal-allocation design, one should stick to equal allocation in order to make the best use of the data.