A/B testing is a useful tool for determining which page layout or copy best drives users toward a given goal. 37signals uses A/B testing to improve conversion rates on its site, HubSpot uses it to increase email conversions, and Zynga uses it to increase engagement in its games [1,2,3]. Whenever you run an A/B test you must decide when you have gathered enough data to pick the winning idea and roll it out to all users. People typically plug the conversion numbers into an online calculator*, and if the result is ‘significant’ they pick the winner.

However, using significance as the criterion for deciding when to stop a test is wrong.

Significance testing is useful when the goal is inference. If we want to make falsifiable statements and draw conclusions through experimentation, then we use statistical significance to measure certainty. However, in a business setting, we want to make a decision with some other goal in mind: increasing conversions, improving ease of use, maximizing profit, or some other objective. In these cases there are better criteria for determining an experiment’s stopping point.

Suppose we are running a test with two ideas, call them A and B, and suppose A is better than B. The longer we run the test, the more precisely we can quantify how much better A is than B. However, the longer we run the test, the more users we expose to the inferior idea.

In a classical testing environment we would decide that we want to detect differences in conversion rate as small as 1 percentage point, pick a confidence level (say 95%), find the appropriate sample size, and run the test. At the end of the test we could say either that A is better than B, that B is better than A, or that A and B are within 1 percentage point of each other, and we would be correct with 95% confidence.
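The classical sample-size calculation can be sketched with the usual normal approximation for a two-proportion test. The 20% baseline rate and 80% power below are assumptions for illustration; only the 1-percentage-point detectable difference and 95% confidence come from the text above.

```python
from statistics import NormalDist

def sample_size_per_arm(p_base, min_detectable, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion
    z-test, using the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for 80% power
    p_avg = p_base + min_detectable / 2
    variance = 2 * p_avg * (1 - p_avg)              # variance of the difference
    n = variance * ((z_alpha + z_beta) / min_detectable) ** 2
    return int(n) + 1

# Detect a 1-percentage-point lift over an assumed 20% baseline.
print(sample_size_per_arm(0.20, 0.01))
```

Note how many visitors a 1-point difference demands: tens of thousands per arm. This is why the choice of stopping rule matters so much in practice.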

However, using a Bayesian approach, we first determine how many people will be exposed to the result of the test. We need to balance the cost of the test against the cost of making the wrong decision. The more users who will be exposed to the result, the higher the cost of a wrong decision, so we can justify running a longer test.

This cost is formally called ‘regret’, and is measured as the difference between the revenue actually realized and the optimal revenue that could have been realized with perfect information.
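Concretely, regret can be computed per decision. The traffic figure below is illustrative, not from the article:

```python
def regret(p_best, p_chosen, future_users):
    """Expected conversions lost by shipping p_chosen instead of p_best.
    Zero when the best idea was chosen."""
    return future_users * (p_best - p_chosen)

# If A truly converts at 21% and we wrongly ship B (20%) to 100,000
# future visitors, we forgo about 1,000 conversions.
print(regret(0.21, 0.20, 100_000))
```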

These two approaches have been debated in the clinical trial literature. In clinical trials, scientists must balance giving a potentially inferior treatment to current patients against the knowledge gained to help future patients. The Bayesian approach was developed by Anscombe in the 1960s and is widely used in clinical trials today [5].

Anscombe provides a formula for determining the stopping point of an experiment: the experiment should be terminated when the following condition holds.

Here y is the difference between the results of A and B, k is the expected number of future users who will be exposed to the winning idea, n is the number of users exposed to the test so far, and Φ⁻¹ is the quantile function of the standard normal distribution.
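Reading those definitions back into symbols, the boundary has roughly the following shape (this is a reconstruction under assumptions, not a quote of Anscombe's exact expression; it also suppresses the variance normalization of y — see [5] for the precise condition): stop as soon as

|y| · √n ≥ Φ⁻¹( k / (k + n) )

Intuitively, when the remaining horizon k is large relative to n the threshold is high, so only strong evidence stops the test early; as n grows the threshold falls toward zero, and progressively weaker evidence suffices to stop.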

So what does this mean? How does using Anscombe’s stopping rule affect actual performance?

In the following example we simulate 100,000 visits to the site with two ideas: idea A has a 21% conversion rate, and idea B has a 20% conversion rate. We compare how the two ideas perform under three stopping methods: repeated significance testing, Anscombe’s stopping rule, and a fixed sample size of 10,000.

We then run 10,000 simulated iterations and calculate the regret for each method.
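As a minimal sketch of one arm of this simulation (the fixed-sample-size method), the code below splits the test traffic evenly, ships the observed winner to the remaining visitors, and measures regret against always showing A. The even split, tie-breaking, and iteration count are assumptions, so the numbers will differ in detail from the table below:

```python
import random

def simulate_fixed(p_a=0.21, p_b=0.20, test_size=10_000,
                   horizon=100_000, iterations=500, seed=0):
    """Sketch of the fixed-sample-size method: test on `test_size`
    visitors, then ship the observed winner to the rest of `horizon`.
    Returns (mean regret in conversions, fraction of runs picking A)."""
    rng = random.Random(seed)
    per_arm = test_size // 2
    total_regret = 0.0
    correct = 0
    for _ in range(iterations):
        conv_a = sum(rng.random() < p_a for _ in range(per_arm))
        conv_b = sum(rng.random() < p_b for _ in range(per_arm))
        pick_a = conv_a >= conv_b
        correct += pick_a
        # During the test, half the visitors saw the inferior idea B.
        run_regret = per_arm * (p_a - p_b)
        # After the test, regret accrues only if we shipped the wrong idea.
        if not pick_a:
            run_regret += (horizon - test_size) * (p_a - p_b)
        total_regret += run_regret
    return total_regret / iterations, correct / iterations

mean_regret, frac_correct = simulate_fixed()
print(mean_regret, frac_correct)
```

The Anscombe and repeated-significance arms would reuse the same visitor stream but check their stopping condition as conversions accumulate.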

| Method | Mean Regret | 95% quantiles | Correct version chosen |
|---|---|---|---|
| Anscombe | 89 | (-6.0, 520) | 87% |
| Repeated Significance | 150 | (-4.0, 620) | 72% |
| Fixed Sample Size | 115 | (10, 225) | 96% |

Below are two plots of a typical path of an experiment. We plot the advantage of A over B. While A, in the long run, is better than B, there are short spells where B performs better than A. We also see a visual representation of the two confidence intervals. Zooming in on the first 2,000 visitors, we see the problem with the repeated significance testing. A short run of conversions on idea B results in B being declared the winner around 100 visitors into the test.

Using Anscombe’s stopping rule is much better than using significance testing, yielding roughly 40% less regret than repeated significance testing. The traditional approach of repeated significance testing leads to higher regret and produces the correct answer less often than either Anscombe’s method or a fixed sample size.

In conclusion, Anscombe’s method minimizes regret at the cost of giving up the ability to make inferences. If your aim is to infer which ideas work best, pick a sample size before the experiment and run the experiment until that sample size is reached. But if you want to maximize conversions, use Anscombe’s rule.

*There are a lot of A/B testing calculators out there. If you do decide to pick your sample size in advance, I personally like ABBA [6], since it uses Agresti-Coull confidence intervals and performs multiple-testing correction, which avoids two other pitfalls not covered in this article.

[2] http://blog.hubspot.com/blog/tabid/6307/bid/31634/A-B-Testing-in-Action-3-Real-Life-Marketing-Experiments.aspx

[3] http://tdwi.org/videos/2010/08/actionable-analytics-at-zynga-leveraging-big-data-to-make-online-games-more-fun-and-social.aspx

[4] http://www.evanmiller.org/how-not-to-run-an-ab-test.html

[5] Anscombe, F. J. (1963). Sequential Medical Trials. *Journal of the American Statistical Association*, 58(302), 365-383.

[6] http://www.thumbtack.com/labs/abba/

**Acknowledgements:**

I’d like to thank Eric Schwartz for introducing me to the idea of optimizing sequential trials.

**Comments:**

I think you may have a bug in your implementation of the “Repeated Significance” test. A correctly designed early-stopping trial will control the overall type-1 error to be below the desired alpha, just like a one-shot trial.

If your alpha is 0.05, then you cannot have a false positive rate of more than 5%.

So if you’re getting termination with the wrong answer 28% of the time (100 − 72), you have probably missed something in the implementation, or I have missed something in your description.

There are many legitimate criticisms of “frequentist” methods, and a failure to minimize regret is probably one of them. Bayesian methods designed to minimize cost or maximize utility will generally do better than classical methods that don’t seek that. But the classical methods do meet the constraints they are designed for.

Hi Keith,

Thanks for your comments.

You are absolutely correct in your assessment. The termination with the wrong answer 28% of the time is due to using a repeated significance test. I was trying to point out the problem of using significance testing as a stopping criterion for experiments.

If you apply classical significance testing once at alpha 0.05, you will limit your false positive rate to 5%; however, if you repeatedly apply this criterion until you reject the null hypothesis, your false positive rate will be much higher.
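This inflation is easy to demonstrate by simulation. The sketch below (peeking interval, sample sizes, and trial count are arbitrary choices, not from my original simulation) runs two identical ideas and peeks with a z-test as data accumulates:

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(p=0.20, n_max=2_000, peek_every=100,
                                alpha=0.05, trials=400, seed=1):
    """Both ideas have the same true conversion rate (the null is true).
    Run a two-proportion z-test every `peek_every` visitors per arm and
    stop at the first 'significant' result. Returns the fraction of
    trials that falsely declare a winner."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, n_max + 1):
            conv_a += rng.random() < p
            conv_b += rng.random() < p
            if n % peek_every == 0:
                pooled = (conv_a + conv_b) / (2 * n)
                se = (2 * pooled * (1 - pooled) / n) ** 0.5
                if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                    false_positives += 1  # winner declared under the null
                    break
    return false_positives / trials

print(peeking_false_positive_rate())
```

With 20 peeks per trial, the false positive rate comes out well above the nominal 5%, which is exactly the inflation described above.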

This is how many people decide when to stop their A/B tests, and is implemented in programs like Optimizely. I was just pointing out the problems with this approach.

There are more valid frequentist approaches to repeated significance testing. These are actually a better choice for experimental designs aimed at inference, but Bayesian methods are better for minimizing regret.

Hi, interesting article, thanks for sharing! I would be curious to see what the decision boundary of the repeated significance test looks like when you apply a Bonferroni correction, and what the simulated regret would be in that case.

Giovanni

Just curious – how do you choose between tests when you can’t get significance?

Hey Kevin,

You just pick the idea that has higher conversions. You won’t be able to make as strong statements about inference, but you end up exposing fewer people to the worse treatment.

Hey, thanks for taking the time to write this great article. One typo though, and it makes things a little confusing, what did you mean by ‘realized actualized’ when you were defining regret?

Thanks, that was a typo.

Basically, it’s the difference between the revenue that you saw and the expected revenue you would have gotten with perfect information.

Hi Aaron, how did you choose ‘k’?

Estimate it based on how many users your site gets. For example, if you have 10K visitors per month and you expect to do a redesign every 6 months, k would be 60K.