A Bayesian Approach to A/B Testing

If you want to learn more about A/B testing and are in New York, I’m teaching at General Assembly on June 12.

A/B testing is a useful tool for determining which page layout or copy works best to drive users toward a given goal. Companies like 37signals use A/B testing to improve conversion rates on their site, HubSpot uses it to increase email conversions, and Zynga uses it to increase engagement in its games [1,2,3]. Whenever you run an A/B test, you must decide when you have gathered enough data to pick the winning idea and implement it for all users. People typically plug the conversion numbers into an online calculator*, and if the result is ‘significant’ they pick the winner.

However, deciding when to stop a test using significance is wrong.

Significance testing is useful when the goal is inference. If we want to make falsifiable statements and draw conclusions through experimentation, then we use statistical significance to measure certainty. However, in a business setting we want to make a decision with some other goal in mind: increasing conversions, improving ease of use, maximizing profit, or some other objective. In these cases there are better criteria for deciding when to stop an experiment.

Screenshot from our Optimizely experiment.

Suppose we are running a test with two ideas, call them A and B, and one idea is better than the other. The longer we run the test, the better we are able to quantify how much better A is than B. However, the longer we run the test, the more users we expose to the inferior idea.

In a classical testing environment we would decide that we want to be able to detect differences in conversion rate as small as 1 percentage point, pick a confidence level (say 95%), find the appropriate sample size, and run the test. At the end of the test we could say that A is better than B, that B is better than A, or that A and B are within 1 percentage point of each other, and we would have 95% confidence that the statement we made was correct.

Cover of Sequential Medical Trials

P. Armitage led the frequentist school in developing methodologies for repeated significance tests.

Using a Bayesian approach, however, we first determine how many people will be exposed to the result of the test. We need to balance the cost of the test against the cost of making the wrong decision. The more users that will be exposed to the result, the higher the cost of making the wrong decision, so we can justify running a longer test.

This cost is formally called ‘regret’, and is measured as the difference between the revenue actually realized and the optimal revenue that could have been realized with perfect information.
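
To make the definition concrete, here is a minimal Ruby sketch of expected regret for a two-variant test. The function name, its arguments, the even traffic split, and the use of expected conversions in place of revenue are all assumptions made for illustration.

```ruby
# Expected regret, measured in conversions, of a two-variant test.
# rate_a, rate_b  : true conversion rates (unknown in practice; assumed here)
# test_visitors   : visitors split evenly between A and B during the test
# future_visitors : visitors who see the chosen variant after the test
# chosen          : :a or :b, the variant picked when the test stops
def expected_regret(rate_a:, rate_b:, test_visitors:, future_visitors:, chosen:)
  best = [rate_a, rate_b].max
  chosen_rate = chosen == :a ? rate_a : rate_b
  optimal  = best * (test_visitors + future_visitors)
  realized = (rate_a + rate_b) / 2.0 * test_visitors + chosen_rate * future_visitors
  optimal - realized
end

scenario = { rate_a: 0.21, rate_b: 0.20, test_visitors: 10_000, future_visitors: 90_000 }
right = expected_regret(**scenario, chosen: :a)
wrong = expected_regret(**scenario, chosen: :b)
puts "regret if we pick A: #{right.round(1)}, if we pick B: #{wrong.round(1)}"
```

Even a correct decision carries some regret, because half of the test traffic saw the weaker idea; a wrong decision adds the loss on every future visitor.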

These two approaches have been debated in the clinical trial literature. In clinical trials scientists must balance providing a potentially inferior treatment to patients, against the learnings that they gain to help future patients. The Bayesian approach was developed by Anscombe in the 60s, and is widely used in clinical trials today [5].

Anscombe provides a formula to determine the stopping point of an experiment. The experiment should be terminated when the following condition is true.

Here, y is the difference between the results of A and B, k is the expected number of future users who will be exposed to the result, n is the number of users exposed to the test so far, and Φ⁻¹ (Phi-inverse) is the quantile function of the standard normal distribution.

So what does this mean? How does using the result from Anscombe’s paper affect actual performance?

In the following example we simulate 100,000 visits to the site with two ideas: idea A has a 21% conversion rate and idea B has a 20% conversion rate. We evaluate how the two ideas perform using a significance test, compared to Anscombe’s stopping rule, and compared to picking a fixed sample size of 10K.

Then we simulate 10,000 different iterations and calculate the regret.

Method                  Mean Regret (95% quantiles)   Correct version chosen
Anscombe                89 (-6.0, 520)                87%
Repeated Significance   150 (-4.0, 620)               72%
Fixed Sample Size       115 (10, 225)                 96%
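
The difference between repeated significance testing and a fixed sample size can be explored with a small simulation. The Ruby sketch below is an illustration under assumed settings (a pooled two-proportion z-test, a peek every 100 visitor pairs, 200 runs); it does not implement Anscombe’s rule and is not expected to reproduce the numbers in the table.

```ruby
# Two-proportion z statistic for conversions in A vs. B (pooled variance).
def z_stat(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  return 0.0 if se.zero?
  (conv_a.to_f / n_a - conv_b.to_f / n_b) / se
end

# Run one simulated experiment. Visitors arrive in pairs (one to A, one to B).
# With peek_every set, we test after every peek_every pairs and stop at the
# first |z| > 1.96. At max_pairs we always stop and pick whichever is ahead.
# Returns [winner, pairs_used], where winner is :a or :b.
def run_experiment(rng, rate_a: 0.21, rate_b: 0.20, max_pairs: 5_000, peek_every: nil)
  conv_a = 0
  conv_b = 0
  (1..max_pairs).each do |n|
    conv_a += 1 if rng.rand < rate_a
    conv_b += 1 if rng.rand < rate_b
    checkpoint = n == max_pairs || (peek_every && (n % peek_every).zero?)
    next unless checkpoint
    z = z_stat(conv_a, n, conv_b, n)
    return [z >= 0 ? :a : :b, n] if n == max_pairs || z.abs > 1.96
  end
end

rng = Random.new(42) # fixed seed so the run is repeatable
runs = 200
peeking = Array.new(runs) { run_experiment(rng, peek_every: 100) }
fixed   = Array.new(runs) { run_experiment(rng) }
wrong_peeking = peeking.count { |winner, _| winner == :b }
wrong_fixed   = fixed.count { |winner, _| winner == :b }
puts "picked B (the worse idea): peeking=#{wrong_peeking}/#{runs}, fixed=#{wrong_fixed}/#{runs}"
```

Feeding each `[winner, pairs_used]` result into a regret calculation would give a rough comparison in the spirit of the table.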

Below are two plots of a typical path of an experiment. We plot the advantage of A over B. While A, in the long run, is better than B, there are short spells where B performs better than A. We also see a visual representation of the two confidence intervals.  Zooming in on the first 2,000 visitors, we see the problem with the repeated significance testing. A short run of conversions on idea B results in B being declared the winner around 100 visitors into the test.

 

Using Anscombe’s stopping rule is much better than using significance testing: it produces about 40% less regret than repeated significance testing. The traditional approach of repeated significance testing leads to higher regret, and produces the correct answer less often than either Anscombe’s method or a fixed sample size.

In conclusion, using Anscombe’s method minimizes regret, at the cost of giving up the ability to make inferences. If you aim to make inferences about which ideas work best, you should pick a sample size prior to the experiment and run the experiment until the sample size is reached. But if you want to maximize conversions, use Anscombe’s rule.

*There are a lot of A/B testing calculators out there. If you do decide to pick your sample size in advance, I personally like ABBA [6], since it uses Agresti-Coull confidence intervals and performs multiple test correction, which avoids two other pitfalls not covered in this article.

Interested in learning more about A/B testing? I’m teaching a tutorial on how to design and analyze online experiments. You can sign up here.

Competition: Projecting Pebble Sales

Two weeks into the month-long campaign, the Pebble has become the highest-grossing Kickstarter project of all time. One of us finally broke down and kicked in, and we started wondering how many they will sell by the time the campaign finishes.

Black Pebble Watch

We decided to take a stab at predicting the final number of backers. Once we got started, we thought it would be even more fun to challenge anyone else out there to come up with a better prediction. (Details at the bottom of the post.)

Here I discuss several ways to forecast new product launches, which should serve as a starting point for your own models. We’ll see which method does best on May 18th, when the project is over. We also encourage readers to contribute guesses and send us their solutions. We’ll do a follow-up post in May to see who gets the closest.

The Pebble project launched on Kickstarter on April 11. Initially it got a lot of traction, and then it got a second wave of backers on April 17-19.
Daily Pebble Sales from Launch to April 26th

One simple approach is to assume constant velocity of sales. They have 44,643 backers in 15 days, so we can project that they will sell 110,119 over 37 days.
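
The constant-velocity arithmetic is just the average daily rate multiplied by the campaign length. A minimal Ruby sketch (the function name is made up for illustration):

```ruby
# Constant-velocity projection: assume the average daily rate observed so far
# continues for the whole campaign.
def project_constant_velocity(backers_so_far, days_elapsed, campaign_days)
  daily_rate = backers_so_far.to_f / days_elapsed
  (daily_rate * campaign_days).round
end

puts project_constant_velocity(44_643, 15, 37) # prints 110119
```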

Clearly this assumption does not hold, as the initial wave of sales has likely already passed.

So let’s investigate some additional models. We’ll evaluate one spline based regression model which handles the problem encountered with the polynomial regression, and then look at three probability models.

We can fit a natural spline to the cumulative number of orders. The natural spline works by fitting local polynomials internally and then enforcing the constraint that the function is linear outside of the bounds. This is equivalent to extrapolating based on the last portion of the sales.

You can do this with the following R code:

library(splines)  # provides ns()
lm(cumulative~ns(day,df=3),data=pebble)

However extrapolation is a very difficult problem. We are trying to use the data that we have observed to project out into the future. Regression based approaches are often very good for interpolation, i.e. predicting a response from predictors that are within an observed range. However, regression is of limited utility in extrapolating much beyond the range of the data.

As an alternative to regression, we can use probability models. Probability models make some relatively strong, yet robust assumptions about customer level behavior, and then are able to extrapolate based on that.

Here we’ll evaluate three successively more complicated probability models and display the results.

Exponential Gamma

In this model we assume that individual customers have an exponentially distributed interpurchase time and that the purchase rates are distributed across the population with a gamma distribution. Breaking this down a little more: an exponential interpurchase time essentially means that a customer who has not yet bought a Pebble has a constant probability of purchasing. There is also heterogeneity across the population: some people are more likely to make a purchase, and others are less likely. This is accounted for by the gamma distribution, which characterizes the spread of purchase rates across the customer base.

peg <- function(x,r,a){
  return(1-((a)/(a+x))^r)
}

Weibull Gamma

This is similar to the previous model, but we replace the exponential distribution with a Weibull distribution. A Weibull distribution is similar to the exponential distribution, but it has an additional term that introduces duration dependence. Duration dependence means that a customer who has not made a purchase after a certain time becomes either more likely or less likely to make a purchase. Again, we assume the scale parameter of the Weibull distribution is distributed across the population with a gamma distribution.

pwg <- function(x,r,a,c){
  1-(a/(a+x^c))^r
}

Weibull Latent Class

Here we again use the Weibull distribution as our underlying probability distribution, but in this case, rather than characterizing the customer base with a continuous distribution, we assume that there are two classes of customers. These are latent (unobserved) classes, and we infer them from the underlying distribution. Using this distribution we can start to capture the second peak in the data.

weibull.latentclass <- function(x,r,a,p, trans=F){
  if(!(length(r) == length(a) && length(p) == length(r)-1))
     stop(paste("r and a must be same length, p must be one",
                "less than number of classes"))
  classes <- 1:length(r)
  p.c <- c(p,0)
  weights <- exp(p.c)/sum(exp(p.c))
  unweighted <- sapply(classes, function(y)
                       weibull.trunc(x,r[y],a[y],trans))
  weighted <- unweighted %*% weights
  return(weighted)
}

Projections

We can visualize the daily and cumulative projections of the different models:

A comparison of different models to project daily Pebble sales over time.

The four models each produce a slightly different prediction for the number of backers:

Method                 Prediction
Spline                 63,854
Exponential Gamma      53,865
Weibull Gamma          63,854
Weibull Latent Class   45,421

Averaging these together, we predict that the Pebble will have 55,583 backers by the time the funding closes on May 18th. Assuming that the average backing size stays constant at $147 per backer, this will yield a total amount raised of $8.19MM.

Just for fun, here are the guesses of the rest of the team:

Person   Backers    Amount Raised
Aaron    55,583     $8,191,220
Corey    60,000     $8,842,800
Jon      65,231     $9,613,744.38
David    66,840.5   $9,850,230.12

Competition

These are some pretty basic models, and none of them fit the data exceptionally well. We could try to improve the models by aggregating data from other Kickstarter projects, or by layering in covariates such as media coverage of the Pebble’s success or its promotion as a Kickstarter featured project. You could also come up with a more sophisticated model for how average contribution size changes over time.

Interested in making a prediction? Send your predictions to pebble@custora.com by 11:59 EST Monday April 30. The winner gets some serious street cred, a beer on us if they’re in town, and we’ll throw in a free black Pebble. We’ll also invite you to write a guest post on our blog describing your solution. To get you started, here is some code you can use to pull data from Kickstarter. The winner is the person who comes closest to predicting the total amount raised and who provides the methodology behind their solution (Excel, MATLAB, R, or whatever tools you want to use are fine).

#!/usr/bin/perl
use LWP::Simple;
use Date::Format;
use DateTime::Format::Strptime;
 
my $maxpage = 901;
my $strp = new DateTime::Format::Strptime(pattern => "%b %d %Y",on_error=>'croak');
my $strf = new DateTime::Format::Strptime(pattern => "%Y-%m-%d",on_error=>'croak');
 
 
for (my $i = 1; $i <=$maxpage; $i++){
    my $url = "http://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android/backers?page=$i";
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;
 
    while($content =~ m/<div class="date">([^<]*)/g){
        my $date = $1;
        if($date =~ m/^\d/){
            my @dt = localtime(time);
 
            print strftime("%Y-%m-%d\n",@dt);
        }
        else{
            my $formatted =  $strp->parse_datetime("$date 2012");
            print $strf->format_datetime($formatted)."\n";
        }
    }
}

Methodologies: Data was pulled from Kickstarter to find the number of backers on a given day. We fit the probability models as truncated models, using maximum likelihood estimation on the difference of CDFs.
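
As an illustration of what a likelihood built from differences of CDFs can look like, here is a Ruby sketch for the exponential-gamma model. It is one plausible reading of the methodology note, not the fitting code we used: the right-truncation at the last observed day, the function names, and the sample counts are all assumptions.

```ruby
# Exponential-gamma CDF: probability a customer has purchased by day x
# (same form as the R function peg above).
def peg(x, r, a)
  1.0 - (a.to_f / (a + x))**r
end

# Log-likelihood of daily backer counts when the model is right-truncated at
# the last observed day T: a backer's day t has probability
# (F(t) - F(t-1)) / F(T). daily_counts[0] is day 1.
def truncated_log_likelihood(daily_counts, r, a)
  total_days = daily_counts.length
  f_end = peg(total_days, r, a)
  daily_counts.each_with_index.sum do |count, i|
    day = i + 1
    prob = (peg(day, r, a) - peg(day - 1, r, a)) / f_end
    count * Math.log(prob)
  end
end

daily = [9_000, 6_000, 4_500, 3_000, 2_500] # made-up daily backer counts
puts truncated_log_likelihood(daily, 1.0, 5.0)
puts truncated_log_likelihood(daily, 1.0, 50.0)
```

A numeric optimizer would search over r and a for the pair that maximizes this value.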


Further Reading

Hardie, Bruce G. S., Peter S. Fader, and Michael Wisniewski (1998), “An Empirical Comparison of New Product Trial Forecasting Models,” Journal of Forecasting, 17 (June–July), 209–229.

Morrison, Donald G. and David C. Schmittlein (1980), “Jobs, Strikes, and Wars: Probability Models for Duration,” Organizational Behavior and Human Performance, 25 (April), 224–251.


Infographic: How different are mobile customers?

It’s no secret that mobile commerce is exploding.  As a result, many of our customers have been using Custora to understand the shopping behavior of their mobile customers.  Looking beyond aggregate metrics, we’ve been surprised to see just how much “the mobile shopper” varies across clients and verticals.  We put together the following infographic to highlight some of the stories we’ve seen:

If you’re curious to understand the customer lifetime value and ordering behavior of your mobile shoppers, let us know!

How Bayesian CLV predictions helped a firm optimize Google Adwords spend

Researchers at the Olin Business School evaluated a firm’s search advertising strategy and found that they were significantly undervaluing their customers and underbidding for ads. Here is a brief summary of their findings.

The researchers looked at a retailer of chemical lab supplies and evaluated its marketing strategy over a three-year period. They found that cost-per-click (CPC) had been rising sharply due to intense competition for keywords and an increase in the popularity of search advertising. The firm was wondering whether it was worthwhile to continue its bidding strategy.

On average, the firm was spending about $190 per customer it acquired from Google. However, looking at the conversions from Google advertising, they realized the average gross margin was only $142. Furthermore, with CPC rising to $0.80, the story was getting worse: cost per acquisition was approaching $266. For Google to be an effective channel, they could only spend $0.37 per click, so they were spending above their break-even point.

The story changed when researchers helped the firm look beyond the initial conversion.  The researchers used predictive models to project the customer lifetime value of customers acquired from Google AdWords and found that customers were worth much more than the value of their first transaction. Customers often returned, and they ordered with increasing frequency the longer they had done business with the firm. The researchers found that the 1-year gross margin from customers acquired through Google was $562.  This raised the break-even cost per click to $1.79, which was well above the current CPC.

Cost Per Click Rises, surpassing the first transaction gross margin, but remains less than the lifetime gross margin.


Furthermore, they found that customers who came from paid search were significantly more valuable than those who came from other channels. The average 1-year gross margin of customers who came from Google was nearly twice that of customers who came from other sources ($562 vs. $289).

Had they considered only the direct results of advertisements and looked at transaction data in aggregate, they would have significantly undervalued their customers and abandoned the CPC channel. But by using advanced analytics, they realized the true value of their customers, and continued to invest in paid search.

“Measuring the Lifetime Value of Customers Acquired from Google Search Advertising,” Marketing Science, Tat Y. Chan, Chunhua Wu and Ying Xie
A note about methodologies: Chan et al. used a probability model derived from the work of Schmittlein et al. They use the same individual-level likelihood function, but with a log-normal mixing distribution. The log-normal mixing distribution allows correlation between order frequency, gross margin, and attrition, although they find most of these to be uncorrelated. At Custora we use the original model developed by Schmittlein to predict customer lifetime value. For more information on different methods of predicting CLV, look at our other posts.

Visualizing “good retention” and “bad retention” in retail

The “layer cake” cohort visualization is one of our favorite forms of historical customer analysis.  Instead of just focusing on top-line revenue numbers, we color-code monthly revenue based on the “acquisition cohort” it belongs to.  Each color represents the revenue earned from a group, or “cohort,” of customers who joined in a given month (hat tip to Roberto Medri from Etsy who suggested this visualization to us).
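
The data behind a layer-cake graph is monthly revenue grouped by acquisition cohort. Here is a minimal Ruby sketch of that aggregation; the order records, field names, and function name are invented for illustration.

```ruby
# Each order: customer id, month of the order ("YYYY-MM"), and revenue.
# The sample data is made up.
orders = [
  { customer: 1, month: "2012-01", revenue: 50.0 },
  { customer: 2, month: "2012-01", revenue: 30.0 },
  { customer: 1, month: "2012-02", revenue: 20.0 },
  { customer: 3, month: "2012-02", revenue: 40.0 },
  { customer: 1, month: "2012-03", revenue: 10.0 },
]

# Revenue per calendar month, split by acquisition cohort (the month of each
# customer's first order). Returns { month => { cohort => revenue } }.
def layer_cake(orders)
  cohort_of = {}
  orders.sort_by { |o| o[:month] }.each { |o| cohort_of[o[:customer]] ||= o[:month] }
  layers = Hash.new { |h, month| h[month] = Hash.new(0.0) }
  orders.each { |o| layers[o[:month]][cohort_of[o[:customer]]] += o[:revenue] }
  layers
end

layers = layer_cake(orders)
# Share of each month's revenue that comes from customers acquired that month.
layers.sort_by { |month, _| month }.each do |month, by_cohort|
  new_share = by_cohort[month] / by_cohort.values.sum
  puts format("%s  new-customer share: %.0f%%", month, 100 * new_share)
end
```

Each inner hash is one vertical slice of the graph, with one band per cohort.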

The layer-cake cohort graph provides three valuable pieces of information:

  1. What % of revenue comes from newer and older users.
  2. How quickly a group of new customers fades away.
  3. Aggregate revenue trends.
A few examples:

This is a layer-cake diagram of Firm A, a retailer who does not have strong customer retention.  Notice how quickly each cohort fades away.  In an average month, roughly 90% of revenue earned comes from new customers.

Firm B is a retailer with much better customer retention.  Notice how each cohort band remains strong over time.  In this graph, roughly 70% of monthly revenue comes from new customers with 30% coming from repeat business.  The cohort bands shrink down quite a bit after their initial month because customers don’t make orders every month – but the customers clearly stick around for a while.

Firm C is a retailer with great customer retention and customers who purchase with high frequency.  Nearly 80% of monthly revenue comes from repeat business.

Taking things further

The layer-cake graph is an informative way to kick off retention analysis, but it doesn’t paint the whole picture.

If we define retention as how long customers stick around, Firm B and Firm C actually have very similar retention.  However, Firm C’s customers purchase with much greater frequency.  Drawing out retention for users who order at different frequencies can provide further insights.  For example, we might discover that retention revenue numbers are all derived from a small pocket of high-value, sticky customers.  Or, we might discover that many customers stick around, each ordering once per quarter.  If it turns out we have a pocket of all-star customers, we’d most certainly want to zoom in on those customers to learn how we can find more of them!

Finally, and perhaps most obviously, the layer-cake graphs don’t provide answers on what to do next – they don’t shed light on how we might improve customer retention.  To change the shape of the layer-cake graph, we need to drill beyond aggregate cohort metrics and home in on individual-level insights.  The more we know about specific customers, the better chance we have to keep our customers engaged and happy.

We’ll write more soon on some of the most effective forms of individual-level insights we’ve seen with retention marketing – from churn detection to behavioral segmentation to customer lifetime value prediction.

If you’re interested in automating these graphs – and drilling from aggregate cohort analysis all the way down to individual-level customer insights, get started with Custora today!

Objective C-like Null Object Pattern in Ruby

At Custora we allow all the numbers on the screen to be lazy-loaded. We precache all of the main page views, but there are too many possible ways that a user can slice the data to pre-compute all of the values.

If you hit one of these pages, you see a loading bar that looks like this:

Writing code to deal with all the lazy loading can be a bit of a challenge. Basically, it means that whenever we get a number from our internal statistics API, we have to accept the possibility that it has not been computed yet. We use the term “statistic” loosely: it can be a float, an array, or some other Ruby class. Anything that is computed from the raw transaction data goes through this API.

So to render the pages we make two passes. The first pass runs through the page and figures out which statistics need to be calculated. We then send these all to delayed job and display a pending page that periodically polls to see if the jobs are done. When the jobs are completed, we display the rendered page.
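
The two-pass flow can be sketched in a few lines of Ruby. This is a simplified stand-in, not our production code: `StatStore`, `PendingStat`, `render_page`, and the in-memory queue are hypothetical names, and a real app would hand the queued work to a background worker.

```ruby
# Placeholder returned for a statistic that has not been computed yet.
class PendingStat
  def to_s
    "..."
  end
end

# Hypothetical stand-in for the statistics API: returns cached values and
# queues a job for anything that is missing.
class StatStore
  attr_reader :queue

  def initialize(cache = {})
    @cache = cache
    @queue = []
  end

  def fetch(name)
    return @cache[name] if @cache.key?(name)
    @queue << name unless @queue.include?(name)
    PendingStat.new
  end

  # Simulate the background workers finishing every queued job.
  def run_jobs
    @queue.each { |name| @cache[name] = yield(name) }
    @queue.clear
  end
end

# One rendering pass: returns the page text and whether it is complete.
def render_page(store, stat_names)
  lines = stat_names.map { |name| "#{name}: #{store.fetch(name)}" }
  [lines.join("\n"), store.queue.empty?]
end

store = StatStore.new({ new_customer_value: 42.5 })
names = [:new_customer_value, :repeat_rate]
_pending_page, done = render_page(store, names) # first pass queues repeat_rate
store.run_jobs { |_name| 0.31 } unless done     # workers compute what is missing
page, done = render_page(store, names)          # second pass: everything is cached
puts page
```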

We have an interesting technical challenge: how do we handle these uncomputed statistics? When we first started we ran into all kinds of errors. We would have an expression like:

number_with_precision(client.new_customer_value)

 

When client.new_customer_value was hit, the job would be enqueued, but then the page would fail to load.

To solve this we tried checking if a statistic was nil, but this made for convoluted code:

(client.new_customer_value ? number_with_precision(client.new_customer_value) : nil)

So to solve this we made a do-nothing class that is designed to fail silently and gracefully no matter what you do with it. It should fail silently if you call

stat * 100

or

stat[1]

or

1 * stat

or even

stat[1][0].first.to_s

 

We could have done something like:

class NilClass
 def method_missing(*)
     return nil
 end
end

But this would have changed how nil behaves all over our application. Instead we decided to make a new class for unevaluated model statistics.

Another solution would have been to use the .try method on all statistics. But this is cumbersome, and doesn’t work when the unevaluated statistic is passed to another function.

So we came up with the unevaluated model statistic class.

Code:

class UnevaluatedModelStatistic
  def ==(_other)
    false
  end
 
  def method_missing(_method,*_args)
    self
  end
 
  def to_f
    0.0
  end
 
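  def to_str
    ""
  end

  # Object#to_s is always defined, so method_missing never sees it;
  # override it so explicit string conversion also yields "".
  def to_s
    ""
  end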
 
  #for cohorts statistics
  def number_of_display_columns(_arg)
    1
  end
 
  def *(_arg)
    0
  end
 
  def +(_arg)
    0
  end
 
  def -(_arg)
    0
  end
 
  # Lets expressions such as 1 * stat dispatch to the operators defined here.
  def coerce(other)
    [self, other]
  end
 
  def /(_arg)
    1
  end
end

On the second pass we end up with the completely rendered page:

This is the way we are attacking the problem. How would you approach it?

Custora and Fab, calculating the lifetime value of iPad customers

Jason Goldberg from Fab.com just wrote up a nice piece covering some customer lifetime value analysis we ran with the Fab team zooming in on iPad customers.

As Jason notes, the expected two-year revenue from iPad customers is over twice that of the average customer!

iPad customers also “convert” from members into paying customers much faster than the average customer.

How Bayesian Probability Models Can Make CLV Predictions 12x More Accurate

This is part three of a three-part series exploring ways to calculate CLV in a retail setting.  Part one discussed the shortcomings of using ARPU to calculate lifetime value, and part two discussed the shortcomings of historical cohort analysis to calculate CLV.  

Imagine you’re the marketing manager of an online retail company.  You’re wondering how much you should spend to acquire new customers.  You know you should look beyond conversion rates and set your budget based on customer lifetime value, so you ask three analysts to calculate CLV.  Each uses a different approach, and the results vary wildly:

  • Analyst A uses an ARPU-based approach, and tells you the CLV is $240.
  • Analyst B uses a historical, cohort-based approach, and tells you the CLV is $150.
  • Analyst C uses probabilistic modeling and tells you the CLV is $108.

You listen to Analyst C, and it’s a good thing you did.  The actual CLV turned out to be $100.  Had you based your acquisition spend on the other analysts’ numbers, you could have lost over one hundred dollars per customer.

Getting predictive with CLV

CLV is a prediction – if I pick up a new customer today, what will he spend over his customer lifetime?  As we pointed out in our previous posts in this series, if you base CLV numbers entirely off the past, you can end up with very inaccurate CLV numbers.

When it comes to predicting CLV, as with any predictive science, there are many approaches you can take.  In this post, we’ll describe how probabilistic modeling can be used to predict lifetime value.

We’ll start with the story behind these models, then describe a mathematical approach.

Probabilistic-based CLV models: the story behind the approach

“Probabilistic models” might sound like a handful, but they begin with a clean, simple story about the customer.  Each customer is unique.  Each one orders at his own pace and frequency.  Each one has a chance to be a loyal customer, but each one might also turn out to be a one-time-buyer.

The two variables we described – order frequency and loyalty – can form the basis of our probabilistic model. We can think of each customer as having a pair of dice that he rolls every month to determine how often he orders, and a coin he flips every month to determine whether or not he will remain a customer.  Because we know the danger of using average rates, we can assume that each customer has his own weighted dice and weighted coin.

To get to accurate CLV predictions, the goal of the models is to try to understand the distribution of those dice and coins.  What percentage of the customers are weekly buyers?  Annual shoppers?  What percentage are loyal?  One-and-done?  Why is this valuable?  There are a few benefits to such modeling techniques:

  1. As we just mentioned, staying away from average retention rates can lead to a massive improvement in CLV accuracy.  By understanding the distribution of loyal and non-loyal customers, we avoid the common “average” problem.
  2. We also gain the ability to make CLV projections for specific customers (more details on this below).
  3. The distribution itself tells us about the customer base.  Do we have a lot of great customers, and a lot of poor ones – a love/hate relationship with our customers?  Or is there an even distribution of customer quality?

The results:

Before digging into some details about the math, we can take a look at how accurate these models are in the “real world.”  We are constantly testing the accuracy of different customer lifetime value techniques.  On average, probabilistic models significantly outperform historical techniques.

If you’re going to make a business decision based on CLV numbers, you absolutely must ask your team how they’re generating their projections – and how accurate those numbers are.

The math behind probabilistic modeling

For the modelers who are reading, there are many forms of probabilistic modeling one can use to project CLV.  We’ll dig into one common approach here.

We follow a two-step process: first, we set a framework to model the individual customer, then we account for customer heterogeneity.

To model the individual, we first think of the customer story mentioned above.  We can think of each customer having two variables: λ (lambda), which represents his order frequency, and μ (mu), which represents his dropout rate (i.e. lambda is our “dice,” and mu is our “coin”).  We can go one step further with our frequency variables and acknowledge that customers rarely order on a strict pattern.  To handle this reality, we can think of λ as the mean number of orders a customer makes in a period – the mean of a Poisson distribution.
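
The dice-and-coin story for a single customer can be simulated directly. The Ruby sketch below makes a simplifying assumption: it treats dropout as a per-period probability (a literal coin flip) rather than a continuous rate, and the parameter values and function names are invented for illustration.

```ruby
# Draw from a Poisson distribution with mean lam (Knuth's multiplication
# method; fine for the small means used here).
def poisson(rng, lam)
  limit = Math.exp(-lam)
  k = 0
  product = rng.rand
  while product > limit
    k += 1
    product *= rng.rand
  end
  k
end

# Simulate one customer: each period he places Poisson(lam) orders (the
# "dice"), then leaves for good with probability dropout (the "coin").
# Returns the number of orders placed in each period he was active.
def simulate_customer(rng, lam:, dropout:, periods: 24)
  orders = []
  periods.times do
    orders << poisson(rng, lam)
    break if rng.rand < dropout
  end
  orders
end

rng = Random.new(7) # fixed seed so the run is repeatable
frequent_loyal = simulate_customer(rng, lam: 2.0, dropout: 0.05)
one_and_done   = simulate_customer(rng, lam: 0.3, dropout: 0.60)
puts "frequent, loyal: #{frequent_loyal.sum} orders over #{frequent_loyal.length} months"
puts "one-and-done:    #{one_and_done.sum} orders over #{one_and_done.length} months"
```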

Next we shift gears to focus on the distribution of λ and μ across our customer population.  We need a distribution that can describe different customer bases, and has only a few parameters so that the model remains powerful.  The gamma distribution is a perfect candidate, since it characterizes most customer bases very well.

Now we have a way to model individuals, and we have a mathematical way to describe how people are different.  We are ready to ask our model, “what distribution of dice and coins would have given us the behavior we see in the past?”  We can use maximum likelihood estimation to find the most likely parameters for the distribution.  We use numeric optimization to figure out the parameters of the two gamma distributions, one for λ and one for μ in a way that best explains the ordering patterns we have seen in the past.  The optimizer does the work of testing different distributions and will eventually converge on shapes that best describe what’s going on in our user base.

Once we have obtained these distributions, we can use the outputs to derive a more accurate, precise projection for the expected number of orders a new customer will make.  We now understand the probability that a customer will be loyal or not, and the probability a customer will be a frequent shopper or a once-a-year-buyer.  By avoiding the dangerous average retention rate in both these cases, we’ll derive much more accurate numbers.
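
If we assume the model takes the Pareto/NBD form of Schmittlein et al. (Poisson purchasing, exponentially distributed lifetimes, and gamma mixing on both λ and μ), the expected number of orders by time t has a closed form, E[X(t)] = rβ / (α(s − 1)) · [1 − (β / (β + t))^(s − 1)]. The Ruby sketch below evaluates it; the parameter values are hypothetical, not fitted to any data.

```ruby
# Expected number of orders by time t for a randomly chosen new customer,
# assuming lambda ~ Gamma(r, alpha) and mu ~ Gamma(s, beta) with rate
# parameterization, and s != 1.
def expected_orders(t, r:, alpha:, s:, beta:)
  survival_term = (beta.to_f / (beta + t))**(s - 1.0)
  (r * beta) / (alpha * (s - 1.0)) * (1.0 - survival_term)
end

# Hypothetical parameters, with time measured in months.
params = { r: 0.8, alpha: 4.0, s: 1.5, beta: 10.0 }
[6, 12, 24].each do |months|
  puts format("%2d months: %.2f expected orders", months, expected_orders(months, **params))
end
```

The curve flattens as t grows because more and more of the cohort has dropped out, which is exactly what a single average retention rate fails to capture.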

Moreover, we can use Bayes’ theorem to make projections for specific customers.  Given what we have seen from a specific customer, and given what we know of the whole customer base, we can make informed probabilistic projections about individual customers.  CLV is no longer a game of the population at large – it’s a figure you can project for each and every customer.

Taking things further

There are some obvious limitations with this model.  As with most models, it assumes the world is static.  Covariates can be added that try to handle things like seasonality and gradual changes to the business in general, but adding these to the model estimation is no small feat.  The base model, often referred to as a “buy ’til you die” or “latent attrition” model, assumes that customers, once they leave, are gone for good.  This isn’t always the case.  Hidden Markov models can be used where we assume customers have latent active/inactive states before actually leaving for good, and simulations can be run to see if the HMM describes the customers better, and makes better predictions, than the latent attrition model.

Things get even trickier when you add other marketing actions into the picture.  If a firm induces a purchase with a 50% discount, should we treat that order like any other order and update our user-level expectations accordingly?  These are the issues that keep our team up at night.

Finally, we’ve been focusing this entire piece on projecting orders.  Separate models are required to get a feel for expected revenue and profit at the user level.

How are you modeling CLV and handling these challenges?  We’d love to chat here in the comments or over in the discussion on Hacker News.

How We Would Do It: Predicting Customer Pregnancy At Target

There has been a lot of buzz recently about how Target was able to predict which of their customers are pregnant. Here is how we would approach the problem.

Target started with a belief that women form brand allegiances when they shop in their third trimester.  As such, they want to be able to predict when their female customers will enter that trimester.  By sending relevant coupons at the end of the second trimester, they want to encourage their customers to visit Target and forge more of those long-lasting relationships.

We can think of approaching the problem in three steps: first predict which customers are pregnant, then predict the due dates, and finally figure out the best coupons to send to get the customer to come back to the store. In this post we’re going to look at the first problem, predicting which customers are pregnant.

The pregnancy prediction problem can be further broken apart as follows:

  1. Establish a training data set made up of pregnant and non-pregnant shoppers
  2. Create ‘market baskets’ of items purchased by these customers
  3. Choose a model, identifying relevant features and generating pregnancy prediction scores
  4. Determine which customers receive mailers

Creating a training data set
To predict pregnancy, we first need to develop a training data set for the models. We filter down the customer data set to women who shop regularly at Target. Target must either have a way of linking gender and guest ID directly, or be able to determine gender from the products that guests buy.  These shoppers need to be fairly regular customers for us to have enough data for accurate predictions.

Target also has some data on which of these women are pregnant. The article says that they have due date information from guests who supply the information with Target’s gift registry. We can use this data as the training set for the model.

Defining ‘pregnant’ and ‘non-pregnant’ market baskets
We can establish a variety of ‘market baskets’ of products that are purchased by pregnant and non-pregnant women. We create the baskets of pregnancy products by looking at what guests buy in their first 26 weeks of pregnancy. We establish a baseline basket of products that non-pregnant women purchase by taking products that women purchase in a randomly selected 26-week period.

We are now armed with the data we need to predict pregnancy due dates. The article says that:

[Target’s statistician] was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.

Picking a model and approach to learn and predict
So the first thing to do is feature selection.  Feature selection is the process of picking which of the possible predictor variables are relevant. In this case the features are the purchase or lack of purchase of specific products. Target has tens of thousands of products, and in order to predict which customers are pregnant we need to determine the subset of products that are purchased more by pregnant women. To figure this out, we could code each product in their portfolio with a boolean indicator variable, so that each market basket is a collection of these variables.  So for n market baskets, and m items in the store, we can encode the problem as an n×1 matrix of response variables where a 1 indicates that the market basket was from a pregnant woman and a 0 indicates the market basket was from a randomly chosen female customer.  We would make an n×m matrix of predictor variables where the rows are the individual market baskets and the m columns are items in Target’s inventory. Cells in the matrix are filled with 1 if the item is present in the market basket, and 0 if the item is absent.
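
A minimal sketch of this encoding, using plain Python lists rather than a matrix library. The baskets, product names and the helper name encode_baskets are made up for illustration.

```python
def encode_baskets(baskets, labels):
    """Return (items, X, y) for a list of market baskets.

    items: sorted list of the m distinct products (the column order)
    X:     n x m matrix of 0/1 indicators, one row per basket
    y:     n responses, 1 for a pregnant shopper's basket, else 0
    """
    items = sorted({item for basket in baskets for item in basket})
    column = {item: j for j, item in enumerate(items)}
    X = []
    for basket in baskets:
        row = [0] * len(items)
        for item in basket:
            row[column[item]] = 1
        X.append(row)
    y = [1 if label else 0 for label in labels]
    return items, X, y


# Made-up baskets: True marks a basket from a known-pregnant shopper.
baskets = [
    {"unscented lotion", "zinc supplement", "milk"},
    {"milk", "bread"},
    {"unscented lotion", "cotton balls"},
]
labels = [True, False, True]
items, X, y = encode_baskets(baskets, labels)
```

With tens of thousands of products most cells are zero, so a real implementation would likely use a sparse matrix format rather than nested lists.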

Then we could use a supervised learning algorithm to predict which baskets belong to pregnant individuals, and then perform feature selection to figure out which products are the most predictive of pregnancy. The more popular supervised learning algorithms include logistic regression, neural nets, support vector machines and random forests. I would start with a regularized logistic regression, which combines the prediction and feature selection steps (Tibshirani et al.). Regularization is a way to avoid overfitting, and works by penalized maximum likelihood estimation.  With a lasso-style (L1) penalty, the coefficients of uninformative products shrink to exactly zero, so the regularization also tells us which products are useful: we pick a regularization parameter, and then keep all products that have non-zero prediction coefficients.
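
To make the idea concrete, here is a small sketch of an L1-penalized logistic regression fitted by proximal gradient descent. It is a simplified stand-in for the coordinate descent method in the cited paper, not a reimplementation of it. The toy data, the penalty value and the function names are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    """Shrink value toward zero; anything within amount becomes exactly 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(X, y, penalty, step=1.0, iterations=5000):
    """Proximal gradient descent for L1-penalized logistic regression."""
    n, m = len(X), len(X[0])
    weights, intercept = [0.0] * m, 0.0
    for _ in range(iterations):
        # Prediction error for every basket at the current coefficients.
        errors = [
            sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - target
            for row, target in zip(X, y)
        ]
        intercept -= step * sum(errors) / n  # the intercept is not penalized
        for j in range(m):
            gradient = sum(e * row[j] for e, row in zip(errors, X)) / n
            weights[j] = soft_threshold(weights[j] - step * gradient, step * penalty)
    return intercept, weights


# Toy data: the first product tracks pregnancy, the second is bought by everyone.
items = ["prenatal vitamins", "milk"]
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

intercept, weights = fit_l1_logistic(X, y, penalty=0.1)
selected = [item for item, w in zip(items, weights) if w != 0.0]
```

The soft-thresholding step is what produces exact zeros, which is why the fitted coefficients double as a feature selector. In practice the penalty would be chosen by cross-validation rather than fixed by hand.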

Choosing to whom to send mailers
At this point we have a pregnancy prediction score for every customer, and need to figure out what the appropriate cut-off is.  We do this by picking a false discovery rate (FDR). Since we will never be able to predict with 100% accuracy who is pregnant and who is not, we need a way to limit the error that we will make. We can set the FDR at 0.05, which is to say that when we send out the mailers we expect 95% of the women who receive them to be pregnant, and 5% to be false positives (Storey, 2002).
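
One simple way to turn this into a cut-off is to measure the false discovery proportion directly on held-out shoppers whose status is known. The sketch below does that. It is an empirical shortcut rather than the estimation procedure in the cited paper, and the scores, labels and function name are hypothetical.

```python
def pick_cutoff(scores, is_pregnant, target_fdr=0.05):
    """Return the lowest score cutoff whose flagged group meets target_fdr.

    Shoppers are ranked by score; for each possible cutoff we compute the
    share of flagged shoppers who are not pregnant. Returns None when no
    cutoff meets the target. Tied scores are not given special handling.
    """
    ranked = sorted(zip(scores, is_pregnant), reverse=True)
    best = None
    false_positives = 0
    for flagged, (score, pregnant) in enumerate(ranked, start=1):
        if not pregnant:
            false_positives += 1
        if false_positives / flagged <= target_fdr:
            best = score
    return best


# Hypothetical held-out scores with known outcomes.
scores = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.60, 0.40]
known = [True, True, True, True, False, True, False, False]

strict_cutoff = pick_cutoff(scores, known)                 # FDR target 0.05
loose_cutoff = pick_cutoff(scores, known, target_fdr=0.2)  # FDR target 0.20
```

Everyone scoring at or above the chosen cutoff would receive a mailer. A looser FDR target lowers the cutoff and reaches more shoppers at the cost of more false positives.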

Conclusion
This is just one method that could be used to identify the pregnant customers. Obviously this is fairly speculative, and not necessarily how Target approached the problem.

How would you approach the problem? Let us know in the comments, or join the discussion on Hacker News.

Rob Tibshirani & Trevor Hastie & Jerome H. Friedman, “Regularization Paths for Generalized Linear Models via Coordinate Descent,” Journal of Statistical Software, American Statistical Association, vol. 33(i01).
Storey, J. D. (2002), A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64: 479–498. doi: 10.1111/1467-9868.00346

Making “engagement” a meaningful metric for your business

This is a guest post by Liz Ryan, President of Relish Tray Media.  Liz has over 12 years of experience in online marketing strategy and implementation, both brand and agency side. Liz previously served as Director of Marketing at Threadless.com.

Marketers have always, formally or informally, used “engagement” as a benchmark of a successful campaign.  Yet the notion of engagement is difficult to define and measure.

Engagement is a poorly defined metric that has different meanings to different marketing teams.  Ask ten people in an organization what engagement means, and you’ll likely hear a variety of answers – open rates, click through rates, time on site, conversions, lifetime value, and more.

Merriam-Webster defines engagement a number of ways:

  1. The state of being in gear: Wouldn’t we all like to think we’re in gear and on top of our game plan, ready to strike with the right tools at the right moment?
  2. A hostile encounter between military forces:  This is what it can feel like, especially when you don’t have a clear strategy or data to help you form a plan.
  3. Emotional involvement:  Emotions; how do we connect with our clients, prospects, subscribers and members on an emotional level?

Emotional involvement is what we, as marketers, are really shooting for.

Define the right engagement metrics for your goal

To make “engagement” meaningful, we first need to define the appropriate engagement goals for a given business challenge.  For example:

  • If you’re looking to “win back” customers by offering a giant discount, conversion rate is probably too short-sighted.  You want to measure engagement by tracking how many of those customers returned after the discount.
  • If you’re looking to generate awareness of a new product offering, open rates and click rates might be sufficient.  Better yet, you might track how many visitors ended up spending a lot of time on the site once they arrive.
  • If you’re looking to make the most of the holiday season, it certainly makes sense to keep an eye on profit and not only look to revenue conversion.

Track your engagement metric on the customer level

In most cases, the goal is to get customers to interact more and more with your brand and store.  Rather than tracking aggregate metrics, we should track engagement metrics on the customer level and monitor how those customer-level numbers change over time.

A simplified retail example: Let’s assume our goal is to get customers purchasing more often.   Instead of focusing on the conversion rates of specific emails, we focus on the “purchase frequency” of each customer.  As we work on new retention marketing ideas, we can take snapshots of our customer set and inspect their purchase frequencies at different points in the year.  We hope to see that our marketing efforts are encouraging customers to purchase more often.

 

GROUP                  Q1    Q2    Q3    Q4
Weekly buyers          10%   11%   13%   13%
Monthly buyers         50%   45%   40%   40%
Quarterly buyers       30%   35%   38%   40%
Twice-a-year buyers    10%    9%    9%    7%

In this example, we see an increase in weekly buyers, which is great.  However, we also see that a lot of our Monthly buyers have turned into Quarterly buyers – not so great.   As we keep trying new ideas and refining our strategies, we want to keep an eye on this type of engagement metric to measure our success.
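
A snapshot like the one above can be produced from customer-level order counts. The sketch below is one hypothetical way to do it: the thresholds that map trailing-12-month order counts to groups, the customer IDs and the function names are all assumptions, and a real program would tune the group definitions to its own business.

```python
from collections import Counter


def frequency_group(orders_last_12_months):
    """Map a customer's trailing-12-month order count to a frequency group."""
    if orders_last_12_months >= 52:
        return "Weekly buyers"
    if orders_last_12_months >= 12:
        return "Monthly buyers"
    if orders_last_12_months >= 4:
        return "Quarterly buyers"
    return "Twice-a-year buyers"


def group_shares(orders_by_customer):
    """Return each group's share of customers as a fraction between 0 and 1.

    Expects a non-empty mapping of customer ID to order count.
    """
    counts = Counter(frequency_group(n) for n in orders_by_customer.values())
    total = len(orders_by_customer)
    return {group: count / total for group, count in counts.items()}


# Made-up snapshot: customer ID -> orders in the trailing 12 months.
snapshot_q1 = {"c1": 60, "c2": 14, "c3": 12, "c4": 5, "c5": 2}
shares_q1 = group_shares(snapshot_q1)
```

Running the same calculation at the end of each quarter and comparing the results gives one column of the table per snapshot.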

As mentioned above, purchase frequency is just one type of engagement metric.  However, it’s a great example of a concrete piece of engagement data that you can track on the individual customer level.  High-level brand engagement metrics are also valuable for numerous reasons, but it’s concrete individual-level data that helps the most for optimizing engagement.

Everything revolves around the Customer and the Customer Lifecycle

Our goal is to deliver a one-to-one engagement experience with each customer.  To reach that goal, we need to start thinking about each customer with a long-term mindset.  Every new customer is, in fact, a new relationship that we can build.

Every time we interact with a customer, we learn.  When a customer makes her first purchase, we gain loads of valuable insight.  The channel, campaign, creative, medium, promotion, and product purchased can be used to help us determine how to follow up.  We can continue to learn with each subsequent interaction.  As we watch a customer evolve, we can learn her purchase frequency and learn whether or not she has seasonal buying habits.  These bits of data are very valuable when we begin to think about personalizing content and promotions for each customer.

So how can we get there?  Marketing tools need to be integrated with data from CRM and customer analytics – this can be done in house, if you have a team for it, or with outside vendors.  The big first step is to put pieces in place so you can track engagement metrics and behavioral characteristics at the customer level.

Use Multiple Channels

Finally, all too often, we look at engagement metrics per channel. Customers very rarely purchase in a channel vacuum.  Email marketing, display retargeting, mobile phone notifications, AdWords, social network advertisements – these tools all work toward the same goal of connecting with your customer.  Experiment to understand how specific individual customers react to different channels.

Continue to test and refine results. Never feel comfortable with your program, because in the modern marketing landscape, personal preferences and behaviors can change as quickly as your products do.