Welcome to the fourth and final installment of our series on designing and implementing a successful marketing experiment. We have already covered how to formulate a strong hypothesis, how to control for natural variation between groups, and how to draw valid conclusions from our experimental results. Today, we are going to discuss how you can use some statistical tools to gauge how meaningful your results are. But what does it mean for a result to be meaningful?

A result is meaningful if it is likely to hold in the future and was not due to random chance. As you will remember, our hypothetical marketing department has been testing the following hypothesis:

Emailing customers a 20% discount increases the likelihood that they will make a purchase in the following week.

By now, we have designed and run our experiment using two emails, which we will call A and B. Email A is our company’s standard email with no discount, while B contains the 20% discount that we’re testing. Imagine our results are something like the following:

| | Email A | Email B |
|---|---|---|
| Total Sent | 1000 | 1000 |
| Conversions | 300 | 320 |

Let’s imagine each email was sent to 1000 people. After receiving Email A, 300 customers returned to our store, while 320 customers returned after receiving Email B. That means there is a difference of 2 percentage points between the two emails (30% conversion vs. 32% conversion).
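The arithmetic behind that comparison can be sketched in a few lines of Python. The variable names are illustrative only; the counts are the ones from the results table. The sketch reports both the absolute difference (in percentage points) and the relative lift, since the two are easy to confuse.

```python
# Counts taken from the results table above.
sent_a, sent_b = 1000, 1000
converted_a, converted_b = 300, 320

rate_a = converted_a / sent_a
rate_b = converted_b / sent_b

# Absolute difference between the two conversion rates, in percentage points.
absolute_diff_pp = (rate_b - rate_a) * 100

# Relative lift of Email B over Email A, as a percentage of A's rate.
relative_lift_pct = (rate_b - rate_a) / rate_a * 100

print(f"Email A: {rate_a:.0%}, Email B: {rate_b:.0%}")
print(f"Absolute difference: {absolute_diff_pp:.1f} percentage points")
print(f"Relative lift: {relative_lift_pct:.1f}%")
```

A 2 percentage point gap on a 30% baseline works out to a relative lift of a little under 7%.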

However, response rates are subject to natural variation. We aren’t just interested in which email performed better during this single experiment; we are interested in which email will continue to perform better.

Our test illustrates how ambiguous results can be. Was Email B (which contained the 20% discount) really about 7% more persuasive than A, as 320 conversions versus 300 would suggest? Or were its additional conversions a matter of luck that had nothing to do with the email itself? If we can’t answer this question, we can’t call our results meaningful, and thus can’t conclude that our 20% discount actually helps to drive return customers.

To do the experiment calculation by hand, start by tabulating conversions and non-conversions for each email (or skip the arithmetic and try out the A/B testing calculator):

| | Email A | Email B | Total |
|---|---|---|---|
| Did Not Convert | 700 | 680 | 1380 |
| Convert | 300 | 320 | 620 |
| Total | 1000 | 1000 | 2000 |

From these counts we can compute the chi-squared statistic, which measures how far the observed counts are from the counts we would expect if both emails converted at the same rate (310 conversions and 690 non-conversions each). If our results are significant (the typical threshold is less than a 5% probability that a difference as large as the one we observed arose from random chance), then the chi-squared statistic is greater than 3.84. In our case, the statistic is 0.93, which is much less than 3.84, meaning there’s a substantial chance that the results we saw were due to natural variation rather than the presence (or absence) of a 20% discount in the emails. You can also play with the A/B testing calculator to get a sense for how big a sample size you need to achieve significance.
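For readers who would rather script the hand calculation, here is a minimal sketch in Python using only the standard library. It assumes a plain Pearson chi-squared test with no continuity correction, which is an assumption on our part about how the 0.93 figure was obtained; the names `observed`, `chi_squared`, and `critical_value` are illustrative. It also shows where a threshold of 3.84 can come from: with one degree of freedom, the 5% critical value is the square of the two-sided 5% z-value.

```python
from statistics import NormalDist

# Observed counts per email: [converted, did not convert].
observed = {
    "A": [300, 700],
    "B": [320, 680],
}

row_totals = {email: sum(counts) for email, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Pearson chi-squared: sum over all cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_squared = 0.0
for email, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[email] * col_totals[j] / grand_total
        chi_squared += (obs - expected) ** 2 / expected

# With one degree of freedom, a chi-squared variable is a squared standard
# normal, so the 5% critical value is the squared two-sided 5% z-value.
alpha = 0.05
critical_value = NormalDist().inv_cdf(1 - alpha / 2) ** 2

print(f"chi-squared statistic: {chi_squared:.3f}")
print(f"critical value at alpha={alpha}: {critical_value:.2f}")
print("significant" if chi_squared > critical_value else "not significant")
```

Because the statistic falls below the critical value, the script reports the result as not significant, in line with the conclusion above.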

You might look at these results and think that the solution is to repeat the same test with a much larger sample size. However, as I’ve written before, there are real problems with using significance testing to determine experiment termination time. This means we are going to need to determine in advance how big our groups will be. But what’s the best way to go about doing that?

To figure that out, we first need to decide how small an effect we want to be able to detect. We don’t need a lot of users to detect large, obvious effects. For example, we would only need about 120 users per group to detect the difference between a 10% conversion rate and a 20% conversion rate, whereas we would need about 2200 to detect a difference between 5% and 4.5%.
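One common way to estimate group sizes in advance is the normal-approximation formula for comparing two proportions, sketched below. This is not necessarily the method behind the figures quoted above: the function name `sample_size_per_group` is hypothetical, and the defaults (a two-sided 5% significance level and 80% power) are assumptions, so the output need not match those numbers. What it does illustrate is how quickly the required sample grows as the difference you want to detect shrinks.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided two-proportion z-test.

    alpha and power are assumed defaults, not values taken from the article.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-group variances
    n = (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)


large_effect = sample_size_per_group(0.10, 0.20)
small_effect = sample_size_per_group(0.05, 0.045)

print(f"10% vs 20%:  about {large_effect} users per group")
print(f"5% vs 4.5%:  about {small_effect} users per group")
```

Lowering the assumed power or using a one-sided test reduces the required sample, which is one reason published figures for the same comparison can differ.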

Another way to think about all of this is to decide how precise we want our predictions to be. In our case, we are trying to estimate what the long-run average conversion rate is going to be for a given email; we can never be 100% certain what that rate is going to be, but we can be about 95% certain that it will fall within a certain range. The more that we sample, the smaller that range is going to be.

Notice how the confidence bands shrink in the plot above. They shrink pretty dramatically between sample 0 and 200, and then almost imperceptibly between 800 and 1000. This is because the width of the confidence interval is inversely proportional to the square root of the sample size. So to cut the size of the confidence bands in half, you need to quadruple the number of users in your test.
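The square-root relationship is easy to check numerically. The sketch below assumes the simple normal-approximation interval for a proportion and a 30% conversion rate (Email A's observed rate); the function name `ci_half_width` is illustrative, and the interval used for the original plot may have been computed differently.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(rate, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(rate * (1 - rate) / n)


rate = 0.30  # assumed conversion rate, taken from Email A's result

for n in (50, 200, 800, 1000, 4000):
    half_width = ci_half_width(rate, n)
    print(f"n={n:>5}: {rate:.0%} plus or minus {half_width * 100:.1f} points")

# Quadrupling the sample size halves the width of the interval.
ratio = ci_half_width(rate, 1000) / ci_half_width(rate, 4000)
print(f"width at n=1000 divided by width at n=4000: {ratio:.2f}")
```

The drop in width from 50 to 200 users is far larger than the drop from 800 to 1000, which matches the shape of the confidence bands described above.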

This brings us to the end of our series. We hope this series has given you some insight into just how much goes into developing a quality marketing experiment and guidance in case you want to conduct your own.

If you are interested in doing A/B testing on email marketing, check out Custora for our A/B testing interface for email marketing.

Great article, thank you very much.

But could you please tell me where the 3.84 as the “significance threshold” comes from?

It comes from picking a significance level of 0.05, i.e. accepting a 5% probability that results like the ones we observed were due to chance. 3.84 is the corresponding critical value of the chi-squared distribution with one degree of freedom. You can find this value using a statistical programming environment or Excel, or just look it up in a table here: http://home.comcast.net/~sharov/PopEcol/tables/chisq.html

Thank you very much!