Sparse matrix formats: pros and cons

Under the mathematical hood at Custora we often have to work with sparse vectors and matrices, which are very large but whose elements are mostly zero. A simple example of this type of data structure that occurs in many statistical applications is a model matrix, a matrix whose rows correspond to observations, whose columns correspond to categorical membership, and whose elements indicate each observation’s membership in each category.

 

Customer           From US  From Europe  From Asia
Chester McNebbins  1        0            0
Vesper Lynd        0        1            0
Walter Walterson   0        0            0

 

Storing these data structures with a naive matrix architecture, one allocated variable per element, can be very wasteful. In R, which we use to do our statistical work, there is a package called Matrix that is designed to handle these structures more efficiently. Let’s do some quick comparisons (all numbers that follow have been computed on a 3.4GHz iMac with 32 GB of RAM):

> library(Matrix)
> standard <- matrix(0, nrow=1e6, ncol=100)
> object.size(standard)
800000200 bytes
> csparse <- Matrix(0, nrow=1e6, ncol=100)
> object.size(csparse)
1824 bytes

R in fact cannot handle ordinary matrices with more than 2^31 - 1 elements; try increasing nrow to 1e8 to see this. (The 2^31 - 1 upper limit is the max value of a signed 32-bit integer, and is accessible in R via .Machine$integer.max.) Aside from concerns about memory efficiency, this limit is restrictive for our purposes: if you’re dealing with 3 million users and you want to store 1000 variables per user in your matrix, you’re already out of luck.
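As a quick sanity check on that arithmetic, here is a minimal sketch that compares the element count from the 3-million-user example against the integer limit. The variable names are ours, chosen only for illustration.

```r
# Largest value of a signed 32-bit integer, as reported by R.
limit <- .Machine$integer.max

# Element count for the example in the text: 3 million users x 1000 variables.
# Written with doubles so the product itself cannot overflow.
n_elements <- 3e6 * 1000

limit == 2^31 - 1      # the limit is 2^31 - 1
n_elements > limit     # the example matrix would exceed it
```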

The default sparse matrix architecture for Matrix is the column-oriented format (class dgCMatrix in the Matrix package). There is also a similar row-oriented format that is not as well supported by Matrix (class dgRMatrix). Finally, Matrix also offers a triplet format (class dgTMatrix). Wikipedia’s article on sparse matrix representation explains these formats nicely, but to summarize in brief:

  • The column- and row-oriented formats store data as three arrays (A,B,C). In the column-oriented format, A contains all of the nonzero entries reading top to bottom one column after the other, B contains indices of/pointers to A indicating where each new column begins, and C contains the row index of each element in A. The row-oriented format is similar except that A reads left to right one row after the other, B indicates where each new row begins, and C contains column indices. (I will disregard the row-oriented format from here on, without loss of generality, as it is architecturally the same as the column-oriented format.)
  • The triplet format consists of a list of triplets (row, column, value) for each element. As Matrix’s documentation mentions, this implies that internal representation is not unique; a different reordering of the triplets would correspond to the same matrix.
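To make the two layouts concrete, here is a small sketch that builds a 3 x 4 matrix (values made up for illustration) and inspects the slots Matrix uses for each format. In Matrix's naming, the arrays called A, B, and C above are the slots x, p, and i, and the stored indices are 0-based.

```r
library(Matrix)

# Illustrative 3 x 4 matrix with four nonzero entries.
# matrix() fills column by column, so the columns are
# (0,2,0), (0,0,0), (5,0,7), (0,0,9).
dense <- matrix(c(0, 2, 0,
                  0, 0, 0,
                  5, 0, 7,
                  0, 0, 9), nrow = 3)

# Column-oriented format (class dgCMatrix).
csc <- Matrix(dense, sparse = TRUE)
csc@x  # nonzero values, column after column: 2 5 7 9
csc@i  # 0-based row index of each value:     1 0 2 2
csc@p  # where each column starts in x:       0 1 1 3 4

# Triplet format: one (i, j, x) triplet per nonzero entry.
tri <- as(csc, "TsparseMatrix")
data.frame(row = tri@i, col = tri@j, value = tri@x)
```

Note that the second column is empty, which shows up in p as two equal consecutive pointers (1 and 1).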

These two architectures lead to various pros and cons relative to each other and to the standard matrix format. One disadvantage of both of the sparse formats is that assignment is much slower than for standard matrices, although this is a considerably worse problem for the column-oriented format than for the triplet format:

> standard <- matrix(0, nrow=1e5, ncol=100)
> csparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgCMatrix")
> tsparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgTMatrix")
> system.time(standard[,25] <- 1)
 user system elapsed
 0.001 0.000 0.001
> system.time(csparse[,25] <- 1)
 user system elapsed
 3.949 0.002 3.952
> system.time(tsparse[,25] <- 1)
 user system elapsed
 0.066 0.007 0.075

Standard matrices can just assign new values directly to space allocated in memory and so assignment takes very little time. The triplet format has a little more to do; it has to allocate memory for each new triplet and then assign values. The column-oriented format has the most work to do: it has to allocate for its array of nonzero values and assign values, allocate for its array of row indices and assign those values, and then adjust the pointer array so that the columns start in the right place.

A disadvantage of the triplet format is that it often consumes more memory than the column-oriented format:

> standard <- matrix(c(0,0,0,0,1), nrow=1e5, ncol=100)
> csparse <- as(Matrix(c(0,0,0,0,1), nrow=1e5, ncol=100), "dgCMatrix")
> tsparse <- as(Matrix(c(0,0,0,0,1), nrow=1e5, ncol=100), "dgTMatrix")
> object.size(standard)
80000200 bytes
> object.size(csparse)
24001824 bytes
> object.size(tsparse)
32001416 bytes

These matrices are still fairly sparse, and both sparse formats offer substantial memory improvements over the standard format, but the column-oriented format does better in this respect. Both formats do worse as the matrix becomes less and less sparse and will eventually consume more memory than the standard matrix format (try it out with a matrix consisting entirely of 1s).
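For readers who want to try the all-1s experiment, here is a minimal sketch on a smaller matrix. The dimensions are our own choice, and we force the sparse representation with sparse = TRUE because Matrix() would otherwise pick a dense class for data like this.

```r
library(Matrix)

# A matrix with no zeros at all.
standard <- matrix(1, nrow = 1000, ncol = 100)

# Force the column-oriented and triplet representations.
csparse <- Matrix(standard, sparse = TRUE)
tsparse <- as(csparse, "TsparseMatrix")

# Memory footprint of each representation, in bytes.
sizes <- c(standard = as.numeric(object.size(standard)),
           csparse  = as.numeric(object.size(csparse)),
           tsparse  = as.numeric(object.size(tsparse)))
sizes
```

With every element stored, the sparse formats pay for the index arrays on top of the values, so both should come out larger than the standard matrix.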

One way to deal with these issues in Matrix is to convert between sparse matrix formats as needed, or to deal with smaller matrices in the standard format and use R’s cbind2 and rbind2 functions (which can combine two matrices along columns or rows into a single larger matrix) to attach the data into a sparse matrix. For example, if you are storing sparse matrices on disk you may prefer to keep them in column-oriented format. But when later working with them, you may want to convert them to triplet format, or, if possible, do all your operations with standard-format matrices and take care of conversion to a sparse format separately.

> system.time({
   csparse <- Matrix(0, nrow=1e5, ncol=100)
   csparse[,c(25,50,75,100)] <- 1
 })
 user system elapsed
 16.141 0.016 16.157
> system.time({
   csparse <- Matrix(0, nrow=1e5, ncol=0)
   for (i in 1:4) {
     a <- matrix(0, nrow=1e5, ncol=25)
     a[,25] <- 1
     csparse <- cbind2(csparse, a)
   }
 })
 user system elapsed
 0.076 0.055 0.130

Using cbind/rbind (or in this case cbind2/rbind2) to incrementally increase the size of a matrix, as is done in the above example, is exactly the opposite of traditional good coding style in R, which emphasizes pre-allocation and vectorization in place of loops. Beginners to R or to other languages that emphasize an array programming approach often have to learn specifically not to do this. But in this case, because of the way our underlying data structures are defined, it’s more efficient to use a loop. To add more columns to a column-oriented sparse matrix, all you need to do is append to the three arrays, with no readjustment of the old values needed.
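The "append to the three arrays" claim can be checked directly on a toy case. This is only a sketch with made-up values; it looks at the resulting slots and says nothing about how Matrix implements cbind2 internally.

```r
library(Matrix)

# Two small column-oriented matrices with the same number of rows.
left  <- Matrix(c(0, 1, 0,
                  0, 0, 2), nrow = 3, sparse = TRUE)
right <- Matrix(c(3, 0, 0,
                  0, 4, 0), nrow = 3, sparse = TRUE)

both <- cbind2(left, right)

# Values and row indices of the result are the old arrays followed by the new ones.
values_appended  <- all(both@x == c(left@x, right@x))
indices_appended <- all(both@i == c(left@i, right@i))

# Column pointers: the old pointers are unchanged; the new ones are
# shifted by the number of nonzeros already stored.
pointers_appended <- all(both@p == c(left@p, length(left@x) + right@p[-1]))

c(values_appended, indices_appended, pointers_appended)
```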

Knowing the sparsity of your data, knowing what kinds of operations you will be performing on your data, and thinking about your preferences when trading off memory consumption and speed can help you select appropriate formats and algorithms for statistical work on sparse matrices.


Member Lifetime Value

Up until now, our predictive modeling at Custora has focused on understanding the behavior of paying customers. We’ve traditionally analyzed customer purchase patterns over time – and helped clients answer questions like, which channels, ad networks, or affiliates should I be looking towards to attract more customers like my highest-value shoppers?

However, lots of companies don’t directly acquire new customers – they acquire new members and then convert them over time into paying customers. This business model, often called “free-to-paid,” has become an increasingly common fixture in the e-commerce landscape. The model is now the norm in the daily deal and flash sale industries, where companies tend to sign up lots of unpaid subscribers, hoping to convince them to shell out for goods or services at some later point.

For these companies, homing in on customer lifetime value is obviously still important. But free-to-paid firms tend to spend acquisition dollars on getting new members – who may or may not go on to become customers. The only way for them to maximize their return on acquisition is to be able to predict how much a member will be worth over time, even when her first purchase may be far down the road.

With this challenge in mind, we recently introduced a new feature in Custora called Member Lifetime Value (or MLV). Now our clients operating a free-to-paid business model can see the predicted value of a new member. And they can further break down expected member value on acquisition factors like channel or demographic variables like geography.

This is a big win for clients like LivingSocial. But we’re sharing this to highlight some of the interesting modeling questions that come with changing the frame of reference from customer to member. Predicting member CLV requires the joining together of two separate models – a conversion model, and then a customer behavior model conditional upon conversion. So what are some of the main hurdles?

1) Conversion is all about timing. Across free-to-paid businesses, a certain proportion of members (usually between 5 and 15%) generally make a purchase immediately upon signup. Retailers shouldn’t ignore these customers – they tend to be particularly valuable, and may justify special attention to keep them coming back. But what about the vast majority of members who don’t convert right away?

Consider two identical members who both sign up at time t=0. Once converted to a paying customer, each member will go on to make regular purchases every four months, with each purchase netting $50 in profit. If member A converts at the 8-month mark, his expected two-year profit (starting at time t=0) is $250: five purchases, at months 8, 12, 16, 20, and 24. But if member B converts at the one-year mark, his expected two-year profit (still starting at time t=0) is only $200, from four purchases.
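The arithmetic in this example can be written down as a tiny function. This is a sketch under two assumptions of ours: conversion coincides with the first purchase, and a purchase landing exactly on month 24 still counts toward the two-year total. The function and argument names are hypothetical.

```r
# Profit within the horizon for a member who converts at `conversion_month`
# and then buys every `interval` months (toy model, not Custora's MLV model).
two_year_profit <- function(conversion_month, horizon = 24,
                            interval = 4, profit_per_purchase = 50) {
  if (conversion_month > horizon) return(0)
  purchase_months <- seq(conversion_month, horizon, by = interval)
  length(purchase_months) * profit_per_purchase
}

two_year_profit(8)   # member A: 250
two_year_profit(12)  # member B: 200
```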

In other words, conversion isn’t just a binary, “yes/no” variable with a single probability estimate. In order to predict a member’s long-term value, we need to model the entire distribution of possible times when they’ll convert.

2) Members, like customers, are all different. Customers come in all shapes and sizes: some make small purchases once a week, others splurge on big-ticket items once in a while, and many are “one-and-done” shoppers unlikely to return for a second purchase. It’s precisely this diversity in customer behavior that allows us to effectively use segmentation and targeting tools to get more efficient with marketing dollars.

The same is true of members. Some types of members are likely to convert very soon after signup, whereas others may take much longer to convert – if they ever do. Take a look at the following graph of member conversion behavior by acquisition keyword campaign for an actual free-to-paid company:

[Figure: member conversion over time, by acquisition keyword campaign]

For this company, members acquired through Keyword 3 were much more “conversion-prone” than those acquired through Keyword 2: by the end of one year, almost twice as many of them had converted into paying customers. A robust model of conversion needs to factor in this underlying heterogeneity of conversion propensities across member segments.

3) Factoring in covariates. How do we predict at the moment we acquire a specific member how likely he or she is to convert into a paying customer – and when? We use covariates: the secondary data that a member record is “tagged” with at the time of registration. Variables like what channel or referral site a member came from can provide important clues about her underlying, unobserved likelihood of conversion.

4) Putting the pieces together. The conversion model is an essential component of understanding member value – but it’s only half the story. The other half is what the members actually do once they convert into paying customers. For example, a company might discover that members who sign up through a certain affiliate tend to be quick to convert – but then go on to make infrequent, low-value purchases over time. Both of these pieces need to come together to inform a prediction of the long-term value of members sourced from that affiliate.

 

Ultimately, introducing MLV is a step towards helping marketers at free-to-paid firms make smarter acquisition decisions. Any questions or thoughts on how we tackle MLV? We’d love to hear from you!

 


Beyond Batch and Blast: Getting Started with Smart Email Marketing

Let’s say that you want to run an email marketing campaign. You know the importance of setting up an experiment, establishing a control group, and continuing with the holdout until the results have been validated.

But hold on a moment. How do you take that crucial first step towards actually running the marketing experiment — figuring out which customers to mail, when to mail them, and what message you want to reach them with?

For example, imagine that you believe you can stimulate additional repeat purchases by offering customers a 20% discount. One way to do it would be just to send an email to your entire customer base (minus the control, of course) with the offer, and see whether it results in a lift in repeat purchases.

We often hear this referred to as “batch and blast,” or, more recently, “spray and pray.” (This one time we heard “pow and chow,” but we couldn’t figure out what it actually meant.)

Anyway, here’s how an un-targeted promotion can be potentially risky:

-It’s costly. It’s true that a 20% promotion might help you reconnect with customers who have faded away over time, and stimulate some purchases that they might never otherwise have made. The problem is that plenty of the customers receiving this promotion would have made purchases anyway, with or without the discount. Giving them 20% off is just eating into your margins. What you would really like is a sharper way of targeting those customers who are fading away or at risk: customers for whom any additional purchases will be incremental to what you would have gotten without the promotion.

-It leaves money on the table. The flip side to sending a discount to a shopper who would have made a purchase anyway is missing the chance to send the most relevant promotion to a given customer. Your job is to connect with your customers with the right message at the right moment. It’s possible that all of your customers might be interested in a 20% discount regardless of their relationship history, purchase patterns, and current behavior. Possible…but unlikely. Ideally, you would want to figure out a way to email different customer segments with a message that is crafted to appeal specifically to them.


So how can we move towards a smarter approach to email marketing?

1) Tie email triggers to the customer lifecycle and your customers’ “temperature.”

Consider three customers: Jessica, who has bought jewelry from your website every month for the past five years; Vesper, who used to buy new shoes every half a year or so but hasn’t made a purchase in nine months; and Leanne, a new customer who made her first purchase of jeans last week. These customers don’t look too different at first glance. All three are female and have made purchases in the past year. But each is likely to be most responsive to a different kind of campaign.

For Leanne, a follow-up email at the 30-day mark, possibly with a discount, can help your brand remain top-of-mind and trigger a repeat purchase. Jessica, on the other hand, is an active customer who is “hot” and needs little additional prodding to buy; an email with a sneak peek at the new earring collection might be more meaningful (and cost-effective) for her. And Vesper is a customer who is steering off her normal purchase course (“cooling,” so to speak) and might need additional incentives to reconnect with your brand. A welcome-back message and special deal on shoes could help remind her why she loves you before she becomes truly “cold” (inactive and likely gone for good).

Tying email triggers to specific points in the customer lifecycle and aligning your email marketing efforts with your customers’ “temperatures” can help you serve up more relevant messages and offers.

2) Sharpen your messaging and offers with smarter segmentation.
One of the foundations of advanced customer analytics is the premise that your customers are all different — so they shouldn’t be treated the same. If you know that customers who reach your site through the Google adword “carburetor” are fundamentally different than those who reach you through adword “muffler” (different repeat purchase likelihood, different profit per order, and ultimately different customer lifetime value) consider running separate email campaigns with different messages and offers for each segment. It will help tie your marketing efforts to the real drivers of your company’s performance.

3) Keep on refining.
Email marketing is not “one-and-done.” It’s an ongoing and iterative process; today’s exciting new idea is tomorrow’s status quo. A/B testing — the idea of holding a bake-off between two or more competing ideas to see which produces the best results — is sometimes called the “champion-challenger” model when a new idea is being evaluated against an existing favorite. So make sure that you have a robust pipeline of challengers ready to go up against the current champion for supremacy. Observing that a 15% discount leads to revenue lift over an email with no promotional offer? Why not try a dollar-denominated discount instead or a buy one, get one promotion? Why not experiment with a new subject line or new creative? Effective email marketing is about continuous, incremental improvement rather than putting the “right” option on auto-pilot.
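As a rough sketch of what one champion-challenger comparison might look like in R, the snippet below compares purchase rates between two email variants with a two-sample test of proportions. The counts are hypothetical, made up purely for illustration, and comparing purchase rates is a simplification of the revenue-lift comparison described above.

```r
# Hypothetical campaign results (made-up numbers, for illustration only).
recipients <- c(champion = 5000, challenger = 5000)  # emails sent per variant
purchasers <- c(champion = 400,  challenger = 460)   # recipients who purchased

# Observed purchase rate for each variant.
conversion_rates <- purchasers / recipients
conversion_rates

# Two-sample test for equality of proportions (base R, stats package).
result <- prop.test(x = purchasers, n = recipients)
result$p.value
```

A small p-value would suggest the difference between variants is unlikely to be noise; whether the challenger then replaces the champion is still a business decision.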

 

Ultimately, the promise of email marketing lies in its ability to enable a more personal, individual relationship with the customer. Acknowledging that customers are all different — and linking marketing efforts to the stage of their relationship with your brand — can help ensure that you deliver the right message to the right customer at the right time.

 


Custora 4.0: Your complete “Retention Marketing Lab”

Our latest suite of upgrades includes one of the most frequent requests we’ve heard over the last 6 months: One-Time Campaigns.

Now, in addition to testing and automating trigger-based lifecycle marketing campaigns, it’s just as easy to test virtually any idea, online or offline, within a unique customer segment on a one-time basis.

Throughout the design and development process we worked with Revolve Clothing to gather valuable input and feedback. They used the tool to evaluate a holiday catalogue campaign that was sent to various high-value customer segments. Now Revolve has insight into how different customer segments respond to different catalog versions and promotions, and this knowledge will help shape Revolve’s mailer strategy in upcoming seasons.

Within the new tool, you can select any customer segment — using our segment builder or by uploading a list. Zoom in on high-value customers who first ordered something in your Pants department. Pick a group of customers who live in the Northwest. Choose a group of customers who just made their 5th purchase last week.

Then, once you have your desired customer group, Custora helps you set up a marketing experiment so you can measure how each of your marketing ideas impacts the bottom line.  Custora ensures you establish the proper control groups and runs all the A/B statistical analysis on your campaign.

Beyond the One-Time Campaign tool, we’ve also made refinements to how we display Lifecycle Trigger Marketing results, and overall performance should be zippier across the board.

We’re looking forward to seeing how all our customers use our new marketing lab. It’s one more step in our quest to help brands easily and effectively test marketing ideas that resonate with their customers and drive results that have an impact on the bottom line.

Our ears are always perked for new feature suggestions. Keep them coming.
Thanks,

-Team Custora.


Customer Segmentation in Retail: free online class, from basic to Bayesian

Customer segmentation has been a hot topic with our clients over the past few months: where to start, how to identify segments, and how to apply segmentation to deliver more relevant marketing experiences.

Many tools enable marketing teams to import a variety of “custom attributes” for each user that can be used for customer segmentation. For example, an email provider might enable the team to upload segmentation fields for attributes such as gender, age, spend to date, and more. However, deciding which fields to include is a difficult challenge. The goal of segmentation is to deliver more meaningful experiences to customers, yet there are an almost infinite number of segmentation approaches a firm can take.

This class will be use-case driven. We will introduce common retail marketing challenges, from driving the first repeat purchase to winning back customers who have faded away. For these situations, we’ll discuss techniques that range from simple demographic segmentation to more advanced forms of behavioral segmentation.

Thursday, 2-3pm EST

The agenda is as follows:

1. An introduction to e-commerce segmentation

What is segmentation and why it matters.

2. Segmentation strategy

What defines a “good” or “bad” segment, and common mistakes to avoid.

3. Discovering segments

Techniques and best practices to identify segments.

4. Techniques and case studies

A range of demographic and behavioral segmentation approaches, from the simple to the scientific: pros, cons, and use cases.

5. Q+A

The class will be held from 2-3pm EST on Thursday, Feb 28th.

RSVP HERE.

 

Using Big Data to Craft the Well-timed Email (methodology)

The article “Using Big Data to Craft the Well-timed Email” explores a better way to use time-stamped transaction data. Thinking about data over the calendar year lends itself to generating forecasts of where sales (or revenue) are going in the coming periods. While these sorts of projections are useful to members of a finance team or to a company’s investors, they do little to help the online retailer hone a marketing strategy. However, if one aggregates time-stamped transaction data by the second to analyze how volume and price change over the course of the day, one might be able to know at what time of day or on what day of the week to send email promotions.

The analysis was conducted with a dataset of roughly 4.2 million time-stamped records from one of our clients. Given that the intent of the analysis was to investigate variations in shopper behavior by time of day, the first step towards producing meaningful insights was to adjust the timestamps according to the time zone of the customers. While the primary dataset included a field indicating shipping state, the data was neither consistent nor specific enough to properly adjust the time. As such, a second dataset consisting of 1.4 million records of Google Analytics data was used to plot each transaction id according to the lat/long of the transaction city. These coordinates were spatially joined to a polygon file consisting of world time zones and merged back to the original dataset by transaction id, at which point the timestamps were adjusted. The final step of data preparation entailed removing observations from before 2006 on account of data-collection irregularities.
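The timestamp adjustment itself can be sketched in a few lines of base R. This is not the code used for the analysis: it assumes the spatial join has already attached a time zone name to each transaction and that the raw timestamps are stored in UTC, and the column names and rows are hypothetical.

```r
# Hypothetical transactions: UTC timestamps plus the time zone name
# obtained from the (already completed) spatial join.
orders <- data.frame(
  transaction_id = c("t1", "t2", "t3"),
  utc_time = as.POSIXct(c("2013-01-15 17:30:00",
                          "2013-01-15 17:30:00",
                          "2013-01-16 03:05:00"), tz = "UTC"),
  tz_name = c("America/New_York", "America/Los_Angeles", "America/Chicago"),
  stringsAsFactors = FALSE
)

# format() accepts a single time zone per call, so convert row by row
# and keep the local hour of day as an integer.
orders$local_hour <- vapply(seq_len(nrow(orders)), function(k) {
  as.integer(format(orders$utc_time[k], "%H", tz = orders$tz_name[k]))
}, integer(1))

orders[, c("transaction_id", "tz_name", "local_hour")]

# Order volume by local hour of day.
table(orders$local_hour)
```

The first two rows share the same instant but land in different local hours, which is exactly why the adjustment matters for a time-of-day analysis.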

The analysis was primarily exploratory, using plots to drive intuition and guide future research. For illustrative purposes, the plots displayed below are smoothed lines fitted to the data with a smoothing spline. The first plot shows that sales volume tends to peak right before lunch, remains close to that level throughout the workday, declines through dinner, and peaks again around 10 PM.
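For readers unfamiliar with smoothing splines, here is a minimal sketch using smooth.spline from base R's stats package. The hourly counts are invented for illustration and only loosely mimic the shape described above; they are not the client's data, and we do not know which spline settings the original plots used.

```r
# Hypothetical order counts for each hour of the day (made-up values).
hour   <- 0:23
volume <- c(20, 12, 8, 6, 5, 8, 15, 30, 48, 60, 70, 74,
            72, 71, 70, 69, 66, 60, 50, 48, 58, 72, 78, 45)

# Fit a smoothing spline; the smoothing parameter is chosen automatically.
fit <- smooth.spline(hour, volume)

# Evaluate the fitted curve on a finer grid, ready for plotting.
grid     <- seq(0, 23, by = 0.25)
smoothed <- predict(fit, grid)

# To draw it: plot(hour, volume); lines(smoothed$x, smoothed$y)
head(data.frame(hour = smoothed$x, volume = smoothed$y))
```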

[Figure: sales volume by time of day]

Producing the same sort of plot for price over the course of the day adds more texture to the variation in consumer behavior. We see that item price peaks in the morning and trends downward over the course of the day.

[Figure: item price by time of day]

 

When this analysis was conducted for each time zone independently, we saw similar patterns across all time zones. Similarly, when volume and price were plotted by time of day and day of the week, behavior was relatively consistent from day to day, except that the evening peak on Friday and the morning peak on Sunday were notably absent.

While the plots produced above are not intended to be an in-depth analysis of customer behavior by time of day and day of the week, the intuitions stemming from them are a good first step towards improving marketing efficacy. Additionally, this exploratory analysis will drive more in-depth research into the variations in consumer behavior by time of day.

 

Custora (YC W11) online class: An intro to Customer Lifetime Value

The most common question that online retailers ask us at Custora is, “How can I increase my Customer Lifetime Value?” And not far behind is, “So what exactly is CLV?”

This makes a lot of sense. After all, CLV is by far the most important metric that online retailers should be measuring. Yet despite its apparent simplicity (it’s literally just the average amount of profit generated from each user over their lifetime as a customer), there is a lot of hidden complexity.

That is why we’ve decided to host an online class this Thursday, Jan. 31st, at 2PM EST. This class will be a basic introduction to Customer Lifetime Value, along with the opportunity afterward to ask the founders questions about your specific business.

The class will be about one hour long, and will cover the following:

  1. A Brief CLV Primer

    Customer Lifetime Value explained, and why it’s the single most important metric for online retailers.

  2. Measuring CLV

    We’ll discuss the pros and cons of various methods — with a special focus on the benefits of the new Bayesian statistical approach developed by academics at The Wharton School and Columbia.

  3. Predictive vs. Historical CLV

    Why you can likely spend more to acquire each new customer and still increase profitability.

  4. Customer Segmentation by Source

    Know exactly what to spend to acquire each new customer from Google, Facebook, Groupon, etc.

  5. Case Studies

    Learn how retailers like Fab and Etsy have used CLV analysis to their advantage.

Afterward there will be a Q&A with the opportunity to ask questions specific to your business.

For anyone at a startup (or larger e-commerce company) that sells stuff online, Lifetime Value is invaluable to understand, so we’re excited to give this a try. If you’re able to join us, please fill out the form below to RSVP:

http://www.custora.com/home/clv_online_class

We’ll send you an email before the event with instructions for how to join.

Why average revenue per user is a useless metric

One metric retailers commonly use when acquiring customers is revenue per user. This number is useless for any business that has repeat customers. Here’s why:

If you are trying to evaluate your customer base and want to figure out the value of a customer, a naive approach is to look at total revenue your business has made, and divide that by the number of customers.

Suppose one business sees that the average customer has spent $64. This is not the lifetime value of a customer; it is just the average value observed so far. We know that we can spend at least this much to acquire a new customer, but the figure understates the lifetime value of a customer. This is especially true if our business is rapidly growing, and we have been acquiring many new customers. In this case, our customer database is full of young customers who are far away from realizing their full potential.

Instead, it is important to look at the lifetime value of customers. Rather than dividing by the total number of customers, you can divide by the total number of years that customers have been active. Then we get a number such as: customers spend on average $53/year. Thus, if we are interested in the two-year value of customers, we realize that we can actually spend up to $106 per customer.

However, for many businesses (especially young ones) this is actually an overestimate of customer value, since customers are most engaged right after their first purchase and then slow down their purchasing over time, or take their business elsewhere. In order to accurately predict customer lifetime value, you need either to look only at customers who have been alive for the whole period in question, or to use a customer lifetime value model to make the predictions. In the case of our example, the actual two-year value of a customer turned out to be $81, somewhere in between the two estimates.
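The three estimates can be compared side by side on a toy customer table. The five rows below are hypothetical numbers that we picked so the results land near the figures quoted in this post; they are not real customer data, and the column names are our own.

```r
# Hypothetical customers: total spend to date and years since first purchase.
customers <- data.frame(
  total_spend  = c(90, 72, 70, 48, 40),
  years_active = c(2.0, 2.0, 1.0, 0.5, 0.5)
)

# 1) Naive revenue per user: total revenue divided by number of customers.
revenue_per_user <- sum(customers$total_spend) / nrow(customers)

# 2) Revenue per customer-year, extrapolated to two years.
revenue_per_year <- sum(customers$total_spend) / sum(customers$years_active)
two_year_from_rate <- 2 * revenue_per_year

# 3) Only customers who have actually been around for the full two years.
mature <- customers[customers$years_active >= 2, ]
two_year_observed <- mean(mature$total_spend)

c(per_user = revenue_per_user,
  per_year_x2 = two_year_from_rate,
  observed_two_year = two_year_observed)
```

In this toy table the newer customers spend at a higher yearly rate than the mature ones, which is why the per-year extrapolation overshoots and the per-user average undershoots the observed two-year value.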

Using Big Data to Craft the Well-timed Email

Check out our article in Multichannel Merchant on how to craft the perfectly timed email.

We analyzed the sales data for an online clothing retailer, roughly 1.5 million purchases over a seven-year period. In doing so we were able to build a picture of the temporal patterns of purchases, which can be used in timing your emails:

The average customer is more likely to make purchases after they have settled in at work, answered their emails, and had a productive morning. This level of activity remains constant throughout the workday, before dropping off during dinner, between 6 p.m. and 8 p.m. Activity picks up again and reaches an absolute peak in the hours after dinner and before going to bed. Based on this information, you should send marketing content around mid-morning or sometime in the evening in order to reach the maximum number of likely purchasers.

Check back in on our blog tomorrow, to see a deeper dive into this data.

You can check out the full article in Multichannel Merchant. Or you can learn the sales patterns for your business by checking out Custora’s Customer Segmentation platform.