# Author Archives: Aaron Goodman

# Evaluating Significance – Designing a Marketing Experiment Part 4 of 4

Welcome to the fourth and final installment of our series on designing and implementing a successful marketing experiment. We have already covered how to formulate a strong hypothesis, control for natural variation between groups, and how to draw valid conclusions from our experimental results. Today, we are going to discuss how you can use some statistical tools to gauge how meaningful your results are. However, what does it mean for a result to be meaningful?

A result is meaningful if it is likely to hold in the future and was not due to random chance. As you will remember, our hypothetical marketing department has been testing the following hypothesis:

Emailing customers a 20% discount increases the likelihood that they will make a purchase in the following week.

By now, we have designed and run our experiment using two emails, which we will call A and B. Email A is our company’s standard email with no discount, while B contains the 20% discount that we’re testing. Imagine our results are something like the following:

| | Idea A | Idea B |
|---|---|---|
| Total Sent | 1000 | 1000 |
| Conversion | 300 | 320 |

Let’s imagine each email was sent to 1000 people. After receiving Email A, 300 customers returned to our store, while 320 customers returned after receiving Email B. That means there is a difference of 2 percentage points between the two emails (30% conversion vs. 32% conversion).

However, response rates are subject to natural variation. We aren’t just interested in which email performed better during this single experiment; we are interested in which email will continue to perform better.

Our test illustrates how ambiguous results can be. Was Email B (which contained the 20% discount) really about 7% more persuasive than A (320 conversions vs. 300)? Or were its additional conversions a matter of luck that had nothing to do with the email itself? If we can’t answer this question, we can’t call our results meaningful, and thus can’t conclude that our 20% discount actually helps to drive return customers.

To do the experiment calculation by hand, we start from a contingency table of the results (or skip it and try out the A/B testing calculator):

| | Idea A | Idea B | Total |
|---|---|---|---|
| Did Not Convert | 700 | 680 | 1380 |
| Convert | 300 | 320 | 620 |
| Total | 1000 | 1000 | 2000 |

If our results are significant (the typical threshold is less than a 5% probability that we would observe a difference as large as we did due to random chance alone), then the chi-squared statistic is greater than 3.84. In our case, the statistic is 0.93, which is much less than 3.84, meaning there’s a substantial chance that the results we saw were due to natural variation rather than the presence (or absence) of a 20% discount in the emails. You can also play with the A/B testing calculator to get a sense for how big a sample size you need to achieve significance.
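For readers who would rather script the by-hand calculation, here is a minimal sketch of Pearson's chi-squared statistic for a 2×2 table, with no continuity correction. The function name `chi_squared_2x2` is our own for illustration, and the counts are the ones from the example above.

```python
def chi_squared_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-squared statistic for a 2x2 table (no continuity correction)."""
    # Observed counts: one row per email, columns are [converted, did not convert].
    observed = [
        [conv_a, total_a - conv_a],
        [conv_b, total_b - conv_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, grand_total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the email had no effect on conversion.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# 5% threshold for a chi-squared distribution with 1 degree of freedom.
CRITICAL_VALUE_5PCT = 3.84

chi2 = chi_squared_2x2(300, 1000, 320, 1000)
print(f"chi-squared = {chi2:.3f}")
print("significant" if chi2 > CRITICAL_VALUE_5PCT else "not significant")
```

With 300 and 320 conversions out of 1000 each, the statistic comes out at roughly 0.93, well under the 3.84 threshold.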

You might look at these results and think that the solution is to repeat the same test with a much larger sample size. However, as I’ve written before, there are real problems using significance testing to determine experiment termination time. This means we are going to need to determine in advance how big our groups will be. But what’s the best way to go about doing that?

To figure that out, we first need to decide how big we want our effects to be. We don’t need a lot of users to detect large, obvious effects. For example, we would only need about 120 users per group to detect the difference between a 10% conversion rate and a 20% conversion rate, whereas we would need about 2200 to detect a difference between 5% and 4.5%.
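As a rough illustration of how the required group size depends on the effect size, here is a sketch using the standard normal-approximation formula for comparing two proportions. The post does not say which significance level or power its figures assume, so the defaults below (5% two-sided, 80% power) are our own assumptions, and the numbers this sketch produces may differ from the ones quoted above. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p1 vs. p2.

    Normal approximation for two independent proportions; alpha and power
    are assumed defaults, not values taken from the post.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)


large_effect = sample_size_per_group(0.10, 0.20)
small_effect = sample_size_per_group(0.05, 0.045)
print(f"10% vs 20%:  {large_effect} users per group")
print(f"5% vs 4.5%:  {small_effect} users per group")
```

Whatever thresholds are chosen, the pattern is the same: the required sample grows with the inverse square of the difference we want to detect.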

Another way to think about all of this is to decide how precise we want our predictions to be. In our case, we are trying to estimate what the long-run average conversion rate is going to be for a given email; we can never be 100% certain what that rate is going to be, but we can be about 95% certain that it will fall within a certain range. The more that we sample, the smaller that range is going to be.

Notice how the confidence bands shrink in the plot above. They shrink pretty dramatically between sample 0 and 200, and then almost imperceptibly between 800 and 1000. This is because the width of the confidence interval is inversely proportional to the square root of the sample size. So to cut the size of the confidence bands in half, you need to quadruple the number of users in your test.
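The square-root relationship can be checked with a few lines of code. This is a sketch that assumes the simple normal-approximation (Wald) interval for a proportion and uses Email A's 30% conversion rate as the example; `ci_half_width` is a name made up for illustration.

```python
import math

Z_95 = 1.96  # normal quantile for a 95% two-sided interval


def ci_half_width(rate, n):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return Z_95 * math.sqrt(rate * (1 - rate) / n)


for n in (100, 200, 400, 800, 1000, 1600):
    half_width = ci_half_width(0.30, n)
    print(f"n = {n:>4}: 30% plus or minus {half_width:.1%}")
```

Going from 100 to 400 users halves the band, and so does going from 400 to 1600, which is why the later gains look so small on a plot.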

This brings us to the end of our series. We hope this series has given you some insight into just how much goes into developing a quality marketing experiment and guidance in case you want to conduct your own.

If you are interested in doing A/B testing on email marketing, check out Custora for our A/B testing interface for email marketing.

# Setting Up Control Groups – Designing a Marketing Experiment Part 3 of 4

Welcome to Part Three of our series on designing and implementing a successful marketing experiment. In our previous posts, we looked at a few strategies for designing an experiment with a well-formulated hypothesis and a way to control for natural variation between groups. Today, we are going to discuss how to identify the most salient effects of a treatment and draw valid, useful conclusions from experimental results.

Let us return to our hypothetical marketing department. As you will remember, our goal is to determine whether a 20% discount in an email is an effective way to get customers to return to our store. First, we formulated our hypothesis as a falsifiable statement which, if confirmed, also takes the same form as our conclusion, in our case:

Emailing customers a 20% discount increases the likelihood that they will make a purchase in the following week.

Imagine that we send all of our customers a 20% discount and see that many of them return to our store. Thus, we conclude that a 20% discount is an effective way to get customers to return to our store and declare the experiment a success. Our boss congratulates us and we all take the rest of the day off.

Based on the success of our experiment, our company decides to run a similar deal with the same 20% discount, only this time we include the discount as part of a Facebook promotion. Much to our surprise, however, very few customers return to our store. Despite substantial investments of time and money, our promotion seems to have gone belly up. Our boss wants to know how this could happen but we are at a loss to explain why. Was something wrong with our experiment?

Actually, the problem was not with our experiment, but rather our conclusions. Let us examine our hypothesis again:

Emailing^{1} customers a 20%^{3} discount^{2} increases the likelihood that they will make a purchase in the following week.

Although our hypothesis seems pretty straightforward, if we look more closely we will see that our 20% discount email actually consists of three different variables rolled together: 1. It is an email, 2. It is a discount email, 3. It is a 20% discount email. Our challenge then, is to determine which of these variables (or what combination of them) actually brought our customers back to the store. In order to find our answer, we will need to test each of these variables independently.

We can easily parse the effects of our different variables by dividing our population into groups. In this case, we randomly assign the members of our base to one of four groups. Group I receives no email (readers who have been following our series will recognize this as our control group from Part Two), Group II receives an email with no offer, Group III receives an email with a 5% discount, and Group IV receives an email with the original 20% discount.

After defining our groups, we can compare their responses to evaluate our hypothesis. However, we cannot compare our groups directly. Because we sent emails to only a sample of the population, we cannot say for certain how the entire population would have responded. Using the tools of statistical analysis, though, we can estimate the range for what the response rate would have been. Based on those estimates, we can then determine what the likely email response rate would be.

Suppose we have a sample of 40,000 users, divided evenly among our four groups. After sending each group the appropriate email (or not, in the case of Group I), we measure their responses.

We can imagine a few different plausible scenarios, for example:

| Message | Response Rate |
|---|---|
| No Email | 1% |
| Email, No Offer | 5% |
| 5% Discount | 5% |
| 20% Discount | 5% |

Here we see pretty unambiguously that customers who received an email of any kind were much more likely to return to the store than those who received no email. In this case, the magnitude of the discount (or even the presence of a discount) seems to play little or no role in increasing the response rate. Without the proper controls, however, we could have easily attributed the increase in customer returns to the discount.
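To make the idea of an estimated range concrete, here is a sketch that computes an approximate 95% interval for each group in this first scenario. It assumes 10,000 users per group (40,000 split four ways) and the normal-approximation (Wald) interval; the helper name `wald_interval` is ours.

```python
import math

Z_95 = 1.96              # normal quantile for a 95% two-sided interval
USERS_PER_GROUP = 10_000  # 40,000 users divided evenly among four groups


def wald_interval(rate, n, z=Z_95):
    """Normal-approximation interval for a response rate, clipped to [0, 1]."""
    half_width = z * math.sqrt(rate * (1 - rate) / n)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


# Response rates from the first scenario.
scenario = {
    "No Email": 0.01,
    "Email, no offer": 0.05,
    "5% Discount": 0.05,
    "20% Discount": 0.05,
}

intervals = {
    name: wald_interval(rate, USERS_PER_GROUP) for name, rate in scenario.items()
}
for name, (low, high) in intervals.items():
    print(f"{name:<16} {low:.2%} to {high:.2%}")
```

The interval for the no-email group sits entirely below the intervals for the three email groups, while the three email groups are indistinguishable from one another, which matches the reading above.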

Another plausible scenario could look something like this:

| Message | Response Rate |
|---|---|
| No Email | 1% |
| Email, No Offer | 1% |
| 5% Discount | 3% |
| 20% Discount | 6% |

Based on these responses we could conclude that the discounts, rather than a simple email, are what drove customers to return to our store. Even a modest discount increased the response rate somewhat, while a larger discount increased the response rate still further.

Using even these basic tools of statistical analysis, companies can learn more, and more reliable, information from their marketing experiments. This information in turn helps them reach their potential customers through more effective, targeted marketing. As we have seen, controlling for different effects through groups can be a powerful means for identifying the most salient effects in any marketing experiment. In our next post, we will discuss some statistical tools that can help us gauge the significance of the effects we have parsed here.

# Setting a baseline – Designing Marketing Experiments – Part 2 of 4

In Part Two of our series on designing and implementing a successful marketing experiment, we’re going to explain the importance of control groups and how to effectively use them in your own marketing experiments.

In our previous post, we discussed how to construct a well-defined hypothesis. In our hypothetical marketing department, we’ve decided to conduct an experiment to determine whether sending a discount email to our customers is an effective way to increase sales. After a few clarifications and refinements, the hypothesis we settled on was this:

Emailing customers a 20% discount increases the likelihood that they will make a purchase in the following week.

Imagine that we are eager to test our hypothesis and see our results. We decide to send emails to our entire customer base, and then observe open and click rates on the emails. After that, we perform a funnel analysis, and find that 50% of our customers opened the email, 10% clicked on the embedded link, and 2% made a purchase. We show our results to our skeptical boss, who looks at us and says, “And how can we tell that some of these people weren’t going to make purchases anyway, even without the discount email?” The answer is: we can’t.

The problem with a funnel analysis is that it implies a degree of causality that may not really be accurate. In order to build an experiment that produces useful, meaningful results, we have to have a clear picture of what we’re comparing our results to. This is why we need to make a control group.

Let’s return to our hypothesis for a moment. Pay particular attention to the phrase: “increases the likelihood that [customers] will make a purchase.” What this means is that we expect a customer to be more likely to make a purchase if we send him a discount email than if we don’t. However, quantum physics aside, we can’t simultaneously send and not send a discount to the same customer. That means we need to find another way to quantify the change we’re looking for. There are a few good ways to go about doing this, and quite a few bad ways. We’ll begin with some of the more common experimental design fallacies we see.

One (bad) way that we might attempt to include some sort of control is by comparing the repeat purchase rates during the two weeks following the email to the previous two weeks when we did not send an email. Unfortunately, this ignores the inherent week to week variability in sales, which for many retailers tends to be quite high. Thus we could incorrectly call our email campaign a success or failure due to some larger seasonal trends.

Another (also bad) strategy would be to try our experiment with a certain subset of users and then compare our results to another subset. For example, we might send all of our international customers our discount email, and then compare the revenue from those customers to the revenue from our US customers for the same period. While this may appear to control for week to week variability, it does not do so completely, since even this variability is not consistent between countries.

What we need, then, is a way to control for natural variability over time and between groups. Our solution will require two steps. First, we’ll create two groups. One group, which we’ll designate our control group, receives the “status quo.” In our case, we’ll say that we’re not currently sending an email of any kind, which means our control group will receive no email. If, however, you were already sending out a weekly newsletter and wanted to test the effect of including a discount in the newsletter, then your control group would receive the regular newsletter while your experimental group would receive the newsletter with the discount.

Now that we’ve defined our control and experimental groups, we’re going to assign customers to each group at random. By randomly assigning our customers between groups, we create a powerful control against natural variability, as those effects will be present equally in both our control and experimental groups. This will allow us to measure the effect of our treatment (the discount email) directly, by comparing the group that received it with the group that did not. This type of study is referred to as a randomized controlled study.
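Random assignment is simple to do in code. The sketch below shuffles a list of customer IDs and splits it in half; the IDs, the `assign_groups` name, and the fixed seed are all hypothetical, chosen only so the example is reproducible.

```python
import random


def assign_groups(customer_ids, seed=None):
    """Randomly split customers into (control, experimental) halves."""
    rng = random.Random(seed)      # a seed makes the split reproducible
    shuffled = list(customer_ids)  # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Hypothetical customer IDs; a real list would come from your customer database.
customers = range(20_000)
control, experimental = assign_groups(customers, seed=42)
print(len(control), len(experimental))
```

Because the shuffle ignores everything about the customer (country, purchase history, sign-up date), any such differences should be spread evenly across both groups on average.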

For our marketing experiment, let’s take 20,000 customers and divide them evenly between our control and experimental groups. We find that our control group has a 2% conversion rate while the experimental group has a 2.5% conversion rate. With randomized control groups we can quantify the difference between the two responses and figure out how likely we are to see similar results going forward.
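One common way to quantify such a difference is a two-proportion z-test. The sketch below is our own illustration, not a method the post prescribes; it assumes 10,000 customers per group, so 2% and 2.5% correspond to 200 and 250 conversions.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both groups behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Control: 2% of 10,000; experimental: 2.5% of 10,000.
diff, z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"difference = {diff:.2%}, z = {z:.2f}, p-value = {p_value:.3f}")
```

Under these assumptions the p-value falls below the conventional 5% threshold, so a half-point lift on groups this large would be unlikely to come from chance alone.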

With our randomized control groups, we’re able to control effectively for things like natural variability in sales over time and between different groups. However, we’re going to need additional control groups to determine exactly what component of our discount email caused the increase in conversion rates. We’ll also need statistical tests to figure out if our results are likely to hold in the future. We’ll be addressing both of these issues in future blog posts.

# Formulating A Hypothesis – Designing Marketing Experiments – Part 1 of 4

The job of any marketing department is to develop effective ways of reaching customers. In order to do that effectively, however, you must first figure out who your customers are and what they like. You may have learned the standard tactics: upselling, cross-selling, targeted messages, discounts, loss-leaders, and so on. What you haven’t necessarily learned is how best to apply these tactics to your brand and customers. In order to identify and refine an effective marketing strategy, you have to find ways to test it. Once you have identified what tactics work best, you can also use controlled experiments to determine how well a given tactic works, and then develop ways to improve it further.

The benefits of running controlled marketing experiments can be direct and tangible. This can help marketing departments stand out in companies where many people (the CEO included) have only a vague notion of what marketing has to do with the brand, much less how it contributes to the bottom line. By running controlled experiments, marketers can figure out what strategies work, measure their impact on profits, and deliver consistent results. Over the next few weeks, we’ll be discussing how to design a marketing experiment and make sense of the results. We’ll begin our series with how to create a well-defined hypothesis.

Imagine for a moment that we’re the marketing department at a large company. We’re developing a new marketing strategy with the goal of improving customer retention. One of our ideas is to send an email to our customers, perhaps with a discount of some kind. Our boss, who is not the savviest of technology users, says he doesn’t think that customers would respond to emails, and instead wants to go with an expensive direct-mail advert. One way we might convince our boss to join the digital age is by designing an experiment to gauge the effect of email marketing on sales. At the heart of our experiment is our hypothesis. We can start with something simple, such as:

Email marketing increases sales.

Notice that our hypothesis takes the same form as the conclusion we are trying to prove. ^{1}

Our particular hypothesis describes a cause and effect relationship, as in, “Action A leads to Result B.” We could formulate our hypothesis to test other types of relationships, but for now we’ll stick with cause and effect.

At the moment, however, both our cause and effect are only vaguely defined. A good hypothesis has to be specific enough to actually test, and it would take a massive number of experiments to determine if all email marketing increases sales. Furthermore, “sales” is itself a rather difficult objective to quantify. Right now we haven’t defined either a time frame or a target audience, which will make it difficult to measure how effective our email marketing has or hasn’t been.

In order to create a useful and manageable experiment, we need to narrow our focus. Rather than testing “email marketing,” let’s test something more concrete, such as “Emailing customers a 20% discount.” And rather than looking for an increase in sales, we’ll look to see if customers who received the discount made a purchase sometime during the following week. Our revised hypothesis might look something like:

Emailing customers a 20% discount increases the likelihood that they will make a purchase in the following week.

Now we have a well-formed hypothesis that is specific enough to test. Our next step will be to test this hypothesis and quantify how much more likely a customer is to make a purchase after receiving our discount email. In our next post, we will discuss how to set up the proper control groups and make these measurements.

Notes:

1. Technically, we are trying to disprove our hypothesis, and after sufficient failure to do so, we accept it to be true. This subtlety, while interesting, is not especially relevant for our purposes.