Study type | Benefits | Problems |
---|---|---|
Survey-based | Experimental variation possible in product attributes. Controlled environment. | Selection-into-survey bias. Stated preferences only |
Observational study | Revealed preference doesn't lie! | Endogenous price-setting. Need a lot of model structure to learn anything. |
Field experiment | Revealed preferences are observed. Controlled setting allows for causal inference | Can be extremely costly or unethical. Harder to get buy-in. |
If you sell multiple products, cannibalization can push the profit-maximizing prices/attributes for the portfolio a long way from the prices/attributes that would maximize profits for each product sold on its own.
Price effects that ignore competitors' responses will produce elasticity estimates that are too large and price recommendations that are too low.
Point-estimates (including Bayesian point-estimates) are of limited use in decision-making. (Posterior) uncertainty should enter our decision-rules directly.
Any decision we make depends on getting a probabilistic estimate of some unknowns \(\theta\) in the general population/future
For instance, the unknowns might be \(\theta = \{\text{part-worths, price elasticities, covariance parameters etc.}\}\)
We form a posterior estimate of these parameters \(p(\theta|\, y)\), which is a multivariate distribution.
This requires coming up with priors \(p(\theta)\) and a likelihood \(p(y|\, \theta)\)
What do we want?
A function \(L(\theta, y)\) that gives us a big number when the \(\theta\) provided "makes sense" of the data, and a small number if \(\theta\) is hopeless.
You might think of this as the reverse of a loss function.
Let's say we have a very simple linear model of \(y\) with a single covariate \(x\). \(\alpha\), \(\beta\) and \(\sigma\) are the model's parameters.
\[ y_{i} = \alpha + \beta x_{i} + \epsilon_{i} \]
with \(\epsilon_{i} \sim \mbox{N}(0, \sigma)\). This is the same as saying that
\[ y_{i} \sim \mbox{N}(\alpha + \beta x_{i}, \sigma) \]
Let's say we propose a set of parameters, \(\alpha = 2\), \(\beta = .5\) and \(\sigma = 1\). Then we have the generative distribution \(y_{i} \sim \mbox{N}(2 + 0.5 x_{i}, 1)\).
Now we have an observation \(y_{i} = 1,\, x_{i} = 1\). Given the parameters and \(x_{i}\), we know how the outcome \(y_{i}\) should be distributed, and we can evaluate the density at the observed value.
The density of the proposed distribution at the observed outcome is known as the likelihood contribution.
Let's call this \(f(y_{1} |x_{1},\, \theta)\)
The likelihood is the product of all likelihood contributions across the sample, \(\prod_{i} f(y_{i}\,|\,x_{i},\, \theta)\)
For example, if we had five observations, all with \(x=1\), \(y = \{1, 3, 4, 0, 2\}\), the likelihood would be (for the proposed parameters)
\[ L(y|\theta) = f(1|\theta)\times f(3|\theta)\times f(4|\theta)\times f(0|\theta) \times f(2|\theta) \approx 0.0000364 \]
As you can see, multiplying densities together makes very small numbers
These numbers get smaller as you have more observations
Computer precision doesn't work well with very small numbers!
Log Likelihood to the rescue
\[ \log L(y|\,\theta) = \log\left(\prod_{i} f(y_{i}\,|\,x_{i},\, \theta)\right) = \sum_{i}\log f(y_{i}\,|\,x_{i},\, \theta) \]
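To make this concrete, here is the toy example in R: the raw likelihood under the proposed parameters is a tiny number, while the log likelihood is perfectly manageable.

```r
# Proposed parameters and data from the toy example above
alpha <- 2; beta <- 0.5; sigma <- 1
x <- rep(1, 5)
y <- c(1, 3, 4, 0, 2)

# Likelihood: the product of normal densities at each observation (~3.6e-05)
prod(dnorm(y, mean = alpha + beta * x, sd = sigma))

# Log likelihood: the sum of log densities (~ -10.2), numerically much better behaved
sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))
```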
Bayes rule says that
\[ p(\theta|\, y) \propto p(\theta)\times p(y|\,\theta) \]
We need good priors \(p(\theta)\).
These depend on your model and, in choice problems, on the dimensionality of your choice space (i.e. how many choices are on offer)
NEVER choose priors by looking at the data. NEVER "estimate priors".
The workflow we strongly recommend is:
For some prior \(p(\theta)\), draw parameter values from \(p(\theta)\), simulate data from the likelihood \(p(y|\,\theta)\) using those draws, and inspect the simulated data.
This is known as generating prior predictive draws. If you have good priors, your prior predictive draws will look vaguely plausible.
Priors interact with the likelihood. Only this workflow will teach you how.
Let's say we have \(j = 1, \dots, 10\) choices, each with utility \(u_{j} = \theta_{j} + \epsilon_{j}\), where \(\epsilon_{j}\) is Gumbel distributed. Let's give each \(\theta_{j}\) the prior \(\text{Normal}(0, 1)\)
What if we change it to \(\text{Normal}(0, 10)\)?
Using wide priors on part-worths is like saying "I think there's a very strong chance this is a winner-takes-all market"
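Here is a minimal prior predictive sketch in R for this example. With Gumbel errors, the market shares implied by one draw of the part-worths are just their softmax; under Normal(0, 1) the most popular option's share is usually modest, while under Normal(0, 10) one option typically takes almost the whole market.

```r
set.seed(42)
softmax <- function(x) exp(x) / sum(exp(x))

# Share of the most popular of 10 choices, across repeated prior draws of the part-worths
max_share <- function(prior_sd, n_choices = 10, n_draws = 1000) {
  replicate(n_draws, {
    theta <- rnorm(n_choices, 0, prior_sd)  # one draw from the prior
    max(softmax(theta))                     # implied share of the biggest option
  })
}

summary(max_share(1))   # modest shares: not winner-takes-all
summary(max_share(10))  # shares near 1: a winner-takes-all market
```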
These priors interact with the scale of the product attributes
Bad priors = bad sampling; bad inference
Always stick to this workflow to properly understand how your priors affect things.
Recall that the posterior
\[ p(\theta|\, y) \propto p(\theta)\times p(y|\,\theta) \]
if we take the log of both sides, we get
\[ \log p(\theta|\, y) = \log p(\theta) + \log p(y|\,\theta) + \text{constant} \] which simply says that, for a given value of \(\theta\) (which we treat as unknown) and data \(y\) (which we treat as known), the log posterior is, up to an additive constant, the log prior plus the log likelihood evaluated at \(\theta\).
This value is what we need to provide to Stan.
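As a sketch (not Stan itself), here is that quantity written as an R function for the toy linear model, with hypothetical priors: Normal(0, 5) on \(\alpha\) and \(\beta\), half-Normal(0, 5) on \(\sigma\). Stan builds up the same log density internally, up to an additive constant.

```r
# Log posterior (up to a constant) for the toy linear model
log_posterior <- function(alpha, beta, sigma, x, y) {
  if (sigma <= 0) return(-Inf)                      # sigma must be positive
  log_prior <- dnorm(alpha, 0, 5, log = TRUE) +
               dnorm(beta,  0, 5, log = TRUE) +
               dnorm(sigma, 0, 5, log = TRUE)       # half-normal, up to a constant
  log_lik   <- sum(dnorm(y, alpha + beta * x, sigma, log = TRUE))
  log_prior + log_lik
}

# Evaluate at the proposed parameters from the earlier example
log_posterior(2, 0.5, 1, x = rep(1, 5), y = c(1, 3, 4, 0, 2))
```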
The random coefficients logit model (called "Hierarchical Bayes" in Sawtooth)
Each customer \(i\) draws their tastes \(\beta_{i}\) from some common distribution
\[ \beta_{i} \sim \mbox{Multi normal}(\beta, \Sigma) \] And each product-market pair has a demand shock \(\xi_{jt}\) drawn from a common distribution
\[ \xi_{jt} \sim \mbox{Normal}(0, \sigma_{\xi}) \]
Finally each customer weighs up utility from each choice
\[ U_{itj} = X_{jt}\beta_{i} + \xi_{jt} + \epsilon_{itj} \text{ with } \epsilon_{itj} \sim \text{Gumbel}() \]
Product attributes are \(X_{jt}\). We assume the demand shock is known to decision-makers but unobserved by us.
The decision-maker \(i\) in market \(t\) simply chooses the product \(j\) from \(J + 1\) products that gives them the greatest utility \(U_{itj}\)
The \(J+1\)'th product is known as the "outside good", which is often the choice to make no choice.
Often in market research we are testing new products (which have no demand shock), so utility simplifies to
\[ U_{itj} = X_{jt}\beta_{i} + \epsilon_{itj} \text{ with } \epsilon_{itj} \sim \text{Gumbel}() \]
We use the Gumbel assumption for the errors because the difference of two independent Gumbel draws has a logistic distribution,
\[ (\epsilon_{ijt} - \epsilon_{ikt}) \sim \text{Logistic}(0, 1) \]
which gives us closed-form choice probabilities.
Because we've made the assumption that \(\epsilon_{itj}\) are IID Gumbel, it follows that for \(U_{it} = (U_{i1t}, \dots, U_{iJt})'\)
\[ p_{ijt}(\beta_{i}, \xi_{jt}) = \text{Prob}(U_{ijt} = \max(U_{it})) = \frac{\exp(X_{jt}\beta_{i} + \xi_{jt})}{1 + \sum_{k=1}^{J}\exp(X_{kt}\beta_{i} + \xi_{kt})} \]
Where the 1 in the denominator comes from the \(J+1\)'th "outside good", which has mean utility 0. In conjoint tasks a choice is necessary, so there is no outside good and the 1 drops out of the denominator.
See Luce and Suppes (1965) and McFadden (1974).
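A quick way to see where these probabilities come from is to simulate the model. The attribute matrix and part-worths below are hypothetical and we set \(\xi = 0\); the simulated choice frequencies match the closed-form formula.

```r
set.seed(3)
rgumbel <- function(n) -log(-log(runif(n)))          # standard Gumbel draws

X      <- cbind(price = c(6.25, 7.29, 8.99, 12.99), hops = c(15, 16, 12, 35))
beta_i <- c(-0.5, 0.05)                               # hypothetical part-worths
V      <- c(as.vector(X %*% beta_i), outside = 0)     # deterministic utilities, outside good = 0

# Closed-form logit probabilities (the 1 in the denominator is exp(0) for the outside good)
probs_formula <- exp(V) / sum(exp(V))

# Simulate many choice occasions: add Gumbel shocks, choose the max-utility option
choices   <- replicate(1e5, which.max(V + rgumbel(length(V))))
probs_sim <- tabulate(choices, nbins = length(V)) / length(choices)

round(cbind(formula = probs_formula, simulated = probs_sim), 3)
```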
We've gone from a set of unknowns \(\beta_{i},\, \xi_{jt}\) and observed data \(X_{jt}\) to choice probabilities at the individual level.
For the individual model, we can now construct a log likelihood
Our choice data from a conjoint exercise will look something like this
Customer | Visit | Beer option | Chosen | Price | Hops | … |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 6.25 | 15 | … |
1 | 1 | 2 | 0 | 7.29 | 16 | … |
1 | 1 | 3 | 0 | 8.99 | 12 | … |
1 | 1 | 4 | 0 | 12.99 | 35 | … |
… | … | … | … | … | … | … |
We can take the product attributes, and for a given set of parameters use our model to generate choice probabilities
Customer | Visit | Beer option | Chosen | Model probability |
---|---|---|---|---|
1 | 1 | 1 | 1 | \(p(U_{111} = max(U_{11}))\) |
1 | 1 | 2 | 0 | \(p(U_{112} = max(U_{11}))\) |
1 | 1 | 3 | 0 | \(p(U_{113} = max(U_{11}))\) |
1 | 1 | 4 | 0 | \(p(U_{114} = max(U_{11}))\) |
… | … | … | … | … |
If we arrange data by stacking products/individuals/visits into a panel, we can calculate the log likelihood for a given \(\theta\) as
\[ \log(L(\theta)) = y'\log(p(y=1|\, \theta)) \]
Here, \(p(y=1|\, \theta)\) is simply the probability vector for a given choice set that comes out of the model. \(y\) is the binary choice vector.
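As a minimal sketch with made-up numbers: rows where \(y=0\) contribute nothing to the sum, so the panel's log likelihood is just this dot product.

```r
# Log likelihood of the stacked panel: dot product of choices with log model probabilities
log_lik_panel <- function(chosen, probs) {
  as.numeric(t(chosen) %*% log(probs))
}

# Hypothetical choice task: four options, option 1 chosen
chosen <- c(1, 0, 0, 0)
probs  <- c(0.45, 0.30, 0.20, 0.05)   # model probabilities for a given theta
log_lik_panel(chosen, probs)          # = log(0.45)
```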
We said before that
\[ \beta_{i} \sim \mbox{Multi normal}(\beta, \Sigma) \]
Now remember that if \(y \sim \mbox{Normal}(\mu,\sigma)\) then \(y = \mu + \sigma z\) for some \(z \sim \mbox{Normal}(0, 1)\)
There's something similar for multivariate distributions. Note that \(\sigma\) above is the square root of the variance. In a multivariate distribution the "variance" is the covariance matrix \(\Sigma\). One "square root" of the covariance matrix is a lower-triangular matrix \(L\) such that \(LL' = \Sigma\).
This allows us to say that
\[ \beta_{i} = \beta + L z_{i} \text{ where } z_{i} \sim \mbox{Normal}(0, 1) \]
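A quick R check of this trick with a made-up \(\beta\) and \(\Sigma\): draws built as \(\beta + L z\) recover the target covariance. Note that R's `chol()` returns the upper factor, so the lower-triangular \(L\) is its transpose.

```r
set.seed(1)
beta  <- c(1, -2, 0.5)
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 2.0, 0.3,
                  0.2, 0.3, 1.5), nrow = 3)
L <- t(chol(Sigma))                        # lower triangular, L %*% t(L) == Sigma

z      <- matrix(rnorm(3 * 1e5), nrow = 3) # standard normal draws, one column per individual
beta_i <- beta + L %*% z                   # each column is one individual's part-worths
round(cov(t(beta_i)), 2)                   # close to Sigma
```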
See https://gist.github.com/khakieconomics/d01b43a50c7904bd73e8c190e0a8ecf8 for a version of the model suitable for extremely large attribute spaces (diagonal covariance matrix).
And https://gist.github.com/khakieconomics/0333a54dff4fabf6204080ca5bf92cb6 for a version with a full covariance matrix for the individual part-worths.
Stan can be called from within R, Python, Matlab, Mathematica, Stata, Julia or at the command line (there are no excuses!).
In R, you would generate a `list()` object containing the data described in the `data{}` block of the `.stan` file, then:
```r
# Load the rstan library
library(rstan)

# Make sure you're using all your cores
options(mc.cores = parallel::detectCores())

# Compile the model
compiled_model <- stan_model("hierarchical_bayes_full_covmat.stan")

# Fit the model
model_fit <- sampling(compiled_model,
                      data = your_data_list_goes_here,
                      iter = 600)
```
\(\xi\) is typically interpreted as a demand shock, observed by customers and sellers. It should at the very least be used by sellers in setting price.
The additional assumptions we have to make to estimate the aggregate random coefficients logit model are that (a) all customers in the same market face the same choice set, or at least that we know the distribution of available choices; (b) all customers face the same price, which we know; and (c) customers don't substitute across markets. These might be strong assumptions.
That is, our aggregate data look more like this:
Market | Product | Sales | Price | Hops | … |
---|---|---|---|---|---|
1 | Beer 1 | 44 | 6.25 | 15 | … |
1 | Beer 2 | 36 | 7.29 | 16 | … |
1 | Beer 3 | 1 | 8.99 | 12 | … |
1 | Beer 4 | 27 | 12.99 | 35 | … |
2 | Beer 1 | 202 | 6.99 | 15 | … |
2 | Beer 2 | 16 | 7.49 | 16 | … |
… | … | … | … | … | … |
Recall that the deterministic part of utility is
\[ V_{itj}(\theta, X_{jt}) = X_{jt}\beta_{i} + \xi_{jt} = (X_{jt}\beta + \xi_{jt}) + X_{jt}\,\text{diag}(\tau)\, L\, z_{i} \] where we now decompose the covariance matrix as \(\Sigma = \text{diag}(\tau)\, L L'\, \text{diag}(\tau)\), with \(L\) the Cholesky factor of the correlation matrix and \(\tau\) the vector of scales. The first parenthesis contains the bits of \(V\) common to all customers in market \(t\), and the remaining term is the idiosyncratic part. For a given \(z_{i}\),
\[ \text{prob}(U_{itj}=\max(U_{it})\,|\, z_{i}) = \text{softmax}(V_{it}(\theta, X_{t}))_{j} \] We can convert this to market shares \(s_{j}(\theta, X_{t})\) by integrating over \(z\) (which has density \(p(z)\)):
\[ s_{j}(\theta, X_{t}) = \int_{z} \text{softmax}(V_{it}(\theta, X_{t}))_{j}\, p(z)\, dz \]
This integral is very easy to approximate numerically: draw many values of \(z_{i}\), compute the softmax for each draw, and average.
The method above is extremely easy to implement in Stan. All we need to do is define a function at the top of the file (in a `functions{}` block) that takes the common utilities \(\delta_{t} = X_{t}\beta + \xi_{t}\), the attribute matrix, a Cholesky factor \(L\), and a matrix of Normal(0, 1) draws \(z\). This allows us to calculate the implied market shares:
```stan
vector get_shares(vector delta,  // delta = X*beta + xi: mean utilities, common across customers
                  matrix x,      // product attributes
                  matrix z,      // standard normal draws, one row per simulated customer
                  matrix L) {    // Cholesky factor for the part-worth covariance
  matrix[rows(z), rows(delta)] utils;
  matrix[rows(z), rows(delta)] probs;
  vector[rows(delta)] shares;
  for(i in 1:rows(z)) {
    // each simulated customer's utilities and choice probabilities
    utils[i] = delta' + z[i] * L * x';
    probs[i] = softmax(utils[i]')';
  }
  for(i in 1:cols(probs)) {
    // market share = average choice probability across simulated customers
    shares[i] = mean(probs[:,i]);
  }
  return(shares);
}
```
Imagine this was the "true" model and we knew all unknowns with certainty. In a limited-sized market, would we expect the observed market shares to line up perfectly with the model's market shares above? No: there is still sampling variation.
With this in mind, a market of size \(N\) (real product sales, i.e. quantities, not nominal revenue) will have observed sales
\[ (\text{Outside sales}_{t}, \text{Sales}_{t}')' \sim \text{Multinomial}(N, s(\theta, X_{t})) \]
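A minimal sketch of this sampling variation, with made-up shares and a market of 500 units:

```r
set.seed(2)
shares <- c(outside = 0.40, beer1 = 0.25, beer2 = 0.20, beer3 = 0.10, beer4 = 0.05)
N <- 500                                   # market size in units, including the outside good

# Three simulated markets drawn from the same "true" shares
sales <- rmultinom(3, size = N, prob = shares)
t(sales) / N                               # observed shares bounce around the true shares
```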
We adopt an approach similar in spirit to Allenby, Chen and Yang (2003) for dealing with \(\xi_{jt}\). Treat it as a latent factor.
One approach is to be theory-agnostic.
So we could model
\[ \text{price}_{jt} \sim \text{Normal}_{+}(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta + \lambda \xi_{jt}, \sigma_{p}) \] Where \(\lambda > 0\) and \(X_{jt-\text{price}}\) are the elements of \(X_{jt}\) excluding price. We could also use a lognormal distribution.
If we believed that \(\xi\) was also correlated with other elements of \(X\), this could be modeled
\[ X_{jt} \sim \text{Multi normal}(A + X_{jt}\Gamma + Z_{jt}\Delta + \Lambda \xi_{jt}, \Sigma_{X}) \] Where \(A\) is a vector of intercepts, \(\Gamma\) is a coefficient matrix with zeros on the diagonal, \(\Delta\) is a coefficient matrix, and \(\Lambda\) is a loading vector with the first element restricted to be positive.
A reduced-form supply side might help us identify latent \(\xi\) fairly robustly, and in turn help us reduce bias in our structural parameter estimates. But there are some issues.
A very simple "structural" supply model might be to assume that each producer in each market maximizes within-period profits. For ease of explanation assume each product is produced by a single firm (this is simple to relax).
\[ \text{profits}_{jt} = (\text{price}_{jt} - mc_{jt})\, s_{j}(\theta, X_{t})\, N \] Where \(mc_{jt}\) is the marginal cost. The first-order condition for price is \(s_{j}(\theta, X_{t}) + (\text{price}_{jt} - mc_{jt})\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}} = 0\); rearranging gives
\[ \text{price}_{jt} = mc_{jt} - \frac{s_{j}(\theta, X_{t})}{\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}}} \]
If we model
\[ \text{mc}_{jt} \sim \text{Normal}_{+}(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta, \sigma_{p}) \] then
\[ \text{price}_{jt} \sim \text{Normal}_{+}\left(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta - \frac{s_{j}(\theta, X_{t})}{\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}}}, \sigma_{p}\right) \]
Here, the error in \(mc\) has two interpretations: it could be a genuine shock to marginal costs, or it could be a pricing error, i.e. the firm deviating from its optimal price.
In reality, there are both pricing errors and marginal cost shocks. It would be an interesting exercise to try to separately identify them.
There is no reason you cannot combine both aggregate and conjoint data into a single model.
In this case, you would use the same parameters (part-worths/covariance matrix) in both models, and have them both contribute to the likelihood.