Custora at DataGotham

In September I had the pleasure of giving a five-minute “lightning talk” at DataGotham, an annual conference that brings together New Yorkers who work with data in their professional lives. Speakers came from private businesses, the public sector, and academia. Videos of the talks are available on YouTube; if you have time for two or three of them, I’d also recommend those by Igor Elbert of Gilt and Jeff Hammerbacher, currently at Mount Sinai.

But if you have time for just one, I’ll (not so) humbly promote my own talk (also embedded at the bottom of this post). I give a quick, high-level view of how we approach data analysis at Custora, intended to be understandable by people without a statistical background or familiarity with the retail and marketing industries.

For those who are new to the idea of data-driven e-commerce, the main takeaway is our longstanding mantra that every customer is different, and that businesses that identify these differences and tailor their strategies accordingly have an advantage over those that do not. We reflect this belief through predictive modeling at the customer level, which allows businesses to identify the value of their existing customer base, budget accurately for new customer acquisition, and customize the way they communicate with different people.

For those more comfortable with statistics, I’d highlight the way we generatively model CLV with a dependency structure of intermediate variables. For example, we treat CLV as a function, in part, of order rate and of “lifetime,” the duration of the customer’s relationship with the business in question. These are in turn assumed to vary across customers according to probability distributions whose parameters we fit to the data. By modeling these intermediate dynamics, we create a framework that can be used to estimate not only CLV but also other statistics with meaningful interpretations in terms of order rate and lifetime: the probability that a customer makes another purchase, the expected number of orders over the next month, and so on.
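For a flavor of what “generative” means here, the toy R sketch below simulates from a model of this general shape. The specific distributions and every parameter are invented for illustration; this is not our production model.

# Toy sketch, NOT our production model: order rates vary across customers
# following a gamma distribution, lifetimes follow an exponential
# distribution. All parameters below are made up for illustration.
set.seed(1)
n <- 1000
order.rate <- rgamma(n, shape = 2, rate = 4)   # latent orders per month
lifetime   <- rexp(n, rate = 1 / 24)           # latent months until churn
avg.order.value <- 50                          # pretend every order is $50

# CLV falls out as a function of the intermediate variables...
clv <- order.rate * lifetime * avg.order.value

# ...and the same intermediates answer other questions, e.g. the probability
# that a still-active customer orders at least once in the next month
# (treating order counts as Poisson given the rate):
p.order.next.month <- 1 - exp(-order.rate)
summary(clv)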

I hope you enjoy the presentation, and if you work with data and will be in the New York area next fall, I encourage you to look out for the next iteration of DataGotham. And if you are interested in learning more about customer lifetime value, check out our Custora U courses on the subject.

Sparse matrix formats: pros and cons

Under the mathematical hood at Custora we often have to work with sparse vectors and matrices, which are very large but whose elements are mostly zero. A simple example of this type of data structure that occurs in many statistical applications is a model matrix, a matrix whose rows correspond to observations, whose columns correspond to categorical membership, and whose elements indicate each observation’s membership in each category.

Customer            From US   From Europe   From Asia
Chester McNebbins      1           0            0
Vesper Lynd            0           1            0
Walter Walterson       0           0            0
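Using the Matrix package discussed below, a matrix like this can be built directly from its nonzero entries alone. Here is a small sketch; the names and dimensions simply mirror the table above:

library(Matrix)
# Only two entries are nonzero: Chester (row 1) is from the US (column 1)
# and Vesper (row 2) is from Europe (column 2); Walter (row 3) contributes
# no nonzero entries at all.
members <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = 1, dims = c(3, 3),
                        dimnames = list(
                          c("Chester McNebbins", "Vesper Lynd", "Walter Walterson"),
                          c("From US", "From Europe", "From Asia")))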

Storing these data structures with a naive matrix architecture, one allocated variable per element, can be very wasteful. In R, which we use for our statistical work, the Matrix package is designed to handle these structures more efficiently. Let’s do some quick comparisons (all numbers that follow were computed on a 3.4 GHz iMac with 32 GB of RAM):

> standard <- matrix(0, nrow=1e6, ncol=100)
> object.size(standard)
800000200 bytes
> csparse <- Matrix(0, nrow=1e6, ncol=100)
> object.size(csparse)
1824 bytes

R in fact cannot handle ordinary matrices with more than 2^31 – 1 elements; try increasing nrow to 1e8 to see this. (The 2^31 – 1 upper limit is the max value of a signed 32-bit integer, and is accessible in R via .Machine$integer.max.) Aside from concerns about memory efficiency, this limit is restrictive for our purposes: if you’re dealing with 3 million users and you want to store 1000 variables per user in your matrix, you’re already out of luck.
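You can check the limit, and the arithmetic, directly:

> .Machine$integer.max
[1] 2147483647
> 3e6 * 1000 > .Machine$integer.max  # 3 million users x 1000 variables each
[1] TRUE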

The default sparse matrix architecture for Matrix is the column-oriented format (class dgCMatrix in the Matrix package). There is also a similar row-oriented format that is not as well supported by Matrix (class dgRMatrix). Finally, Matrix also offers a triplet format (class dgTMatrix). Wikipedia’s article on sparse matrix representation explains these formats nicely, but in brief:

  • The column- and row-oriented formats store the data as three arrays (A, B, C). In the column-oriented format, A contains all of the nonzero entries reading top to bottom, one column after the other; B contains pointers into A indicating where each new column begins; and C contains the row index of each element of A. The row-oriented format is the same except that A reads left to right one row after the other, B marks where each new row begins, and C contains column indices. (I will disregard the row-oriented format from here on, without loss of generality, as it is architecturally identical to the column-oriented format.)
  • The triplet format is a list of (row, column, value) triplets, one per nonzero element. As Matrix’s documentation mentions, this means the internal representation is not unique: reordering the triplets yields the same matrix.
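To make the three arrays concrete, you can inspect the slots where Matrix keeps them: @x, @p, and @i for the column-oriented class, and @i, @j, @x for the triplet class.

> m <- Matrix(c(0, 2, 0, 0, 0, 3, 4, 0, 5), nrow=3, sparse=TRUE)
> m@x  # "A": the nonzero values, reading down one column after the next
[1] 2 3 4 5
> m@p  # "B": where each column begins in m@x (0-based)
[1] 0 1 2 4
> m@i  # "C": the 0-based row index of each value in m@x
[1] 1 2 0 2
> trip <- as(m, "dgTMatrix")
> trip@i; trip@j; trip@x  # (row, column, value) triplets, 0-based
[1] 1 2 0 2
[1] 0 1 2 2
[1] 2 3 4 5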

These two architectures lead to various pros and cons relative to each other and to the standard matrix format. One disadvantage of both of the sparse formats is that assignment is much slower than for standard matrices, although this is a considerably worse problem for the column-oriented format than for the triplet format:

> standard <- matrix(0, nrow=1e5, ncol=100)
> csparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgCMatrix")
> tsparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgTMatrix")
> system.time(standard[,25] <- 1)
   user  system elapsed
  0.001   0.000   0.001
> system.time(csparse[,25] <- 1)
   user  system elapsed
  3.949   0.002   3.952
> system.time(tsparse[,25] <- 1)
   user  system elapsed
  0.066   0.007   0.075

A standard matrix assigns new values directly into memory that is already allocated, so assignment takes very little time. The triplet format has a little more to do: it has to allocate memory for each new triplet and then assign the values. The column-oriented format has the most work to do: it has to grow its array of nonzero values and assign to it, grow its array of row indices and assign those, and then adjust the pointer array so that every column starts in the right place.
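One practical consequence: if you have many assignments to make into a column-oriented matrix, it can be cheaper to convert to the triplet format, assign there, and convert back when you are done. A sketch:

> csparse <- as(Matrix(0, nrow=1e5, ncol=100), "dgCMatrix")
> tsparse <- as(csparse, "dgTMatrix")  # convert to triplet form
> tsparse[,25] <- 1                    # fast triplet assignment (see timings above)
> csparse <- as(tsparse, "dgCMatrix")  # convert back when done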

A disadvantage of the triplet format is that it often consumes more memory than the column-oriented format:

> standard <- matrix(c(0,0,0,0,1), nrow=1e5, ncol=100)
> csparse <- as(Matrix(c(0,0,0,0,1), nrow=1e5, ncol=100), "dgCMatrix")
> tsparse <- as(Matrix(c(0,0,0,0,1), nrow=1e5, ncol=100), "dgTMatrix")
> object.size(standard)
80000200 bytes
> object.size(csparse)
24001824 bytes
> object.size(tsparse)
32001416 bytes

These matrices are still fairly sparse, and both sparse formats offer substantial memory improvements over the standard format, but the column-oriented format does better in this respect. Both formats do worse as the matrix becomes less and less sparse and will eventually consume more memory than the standard matrix format (try it out with a matrix consisting entirely of 1s).
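Taking up that suggestion (the sparse size below is estimated from the 12-bytes-per-entry pattern above rather than measured):

> object.size(matrix(1, nrow=1e5, ncol=100))
80000200 bytes
> # every one of the 1e7 entries is nonzero, so the column-oriented format
> # pays for an 8-byte value plus a 4-byte row index apiece -- roughly
> # 120000000 bytes, half again the size of the dense matrix:
> object.size(as(Matrix(1, nrow=1e5, ncol=100), "dgCMatrix"))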

One way to deal with these issues is to convert between formats as needed: work with smaller pieces in the standard format, then use R’s cbind2 and rbind2 functions (which combine two matrices along columns or rows into a single larger matrix) to attach them to a sparse matrix. For example, if you are storing sparse matrices on disk you may prefer to keep them in the column-oriented format, but convert them to the triplet format when assigning to them, or, if possible, do all your element-level operations on standard-format matrices and convert to a sparse format afterward.

> system.time({
    csparse <- Matrix(0, nrow=1e5, ncol=100)
    csparse[,c(25,50,75,100)] <- 1
  })
   user  system elapsed
 16.141   0.016  16.157
> system.time({
    csparse <- Matrix(0, nrow=1e5, ncol=0)
    for (i in 1:4) {
      a <- matrix(0, nrow=1e5, ncol=25)
      a[,25] <- 1
      csparse <- cbind2(csparse, a)
    }
  })
   user  system elapsed
  0.076   0.055   0.130

Using cbind/rbind (or in this case cbind2/rbind2) to incrementally increase the size of a matrix, as is done in the above example, is exactly the opposite of traditional good coding style in R, which emphasizes pre-allocation and vectorization in place of loops. Beginners to R or to other languages that emphasize an array programming approach often have to learn specifically not to do this. But in this case, because of the way our underlying data structures are defined, it’s more efficient to use a loop. To add more columns to a column-oriented sparse matrix, all you need to do is append to the three arrays, with no readjustment of the old values needed.
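You can see this at work by watching the column-pointer array as columns are attached:

> a <- as(Matrix(c(0, 1, 2, 0), nrow=2), "dgCMatrix")
> a@p
[1] 0 1 2
> wide <- cbind2(a, a)
> wide@p  # the original entries are untouched; new pointers are appended
[1] 0 1 2 3 4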

Knowing the sparsity of your data, knowing what kinds of operations you will perform on it, and weighing the trade-off between memory consumption and speed will help you select appropriate formats and algorithms for statistical work on sparse matrices.
