27 September 2017

Outline

  • A background on choice models
  • The random coefficients logit model
  • Aggregate random coefficients logit
  • Our approach with reduced-form supply side
  • Our approach with a simple structural supply side
  • Pricing a portfolio of substitutable products
  • Modeling preferences of Tanzanian voters in the 1970s
  • Modeling preferences of Australian savers in the 00s

What do we want to model?

Fundamentally, we want to estimate

\[ \text{E}[\Delta \text{sales} |\, \text{do}(\text{price})] \] for a given product, where \(\text{do}(\text{price})\) is an exogenous variation in a good's price. This is a difficult problem:

  • Unless we're running experiments, we may not have observed exogenous variations in price. Historical price changes are often driven by unobserved factors that also affect sales (demand shocks);
  • You'd normally want to estimate this in order to work out profit-maximizing prices. Yet customers substitute between products. If you sell competing products, the revenue-maximizing price for a product can be a long way from the portfolio-maximizing price (cannibalization);
  • Competitors will react to your exogenous price changes (strategic interaction). Assuming that competitors' prices are unchanged by your own will result in poor counterfactual predictions.

How do we want to model it?

  • These models are necessarily complex, and we want to make sure our software works;
  • Helpful to be able to simulate fake data from a model;

\[ \text{Draw } \hat\theta \sim \text{p}\left(\theta\right) \] \[ \text{Draw } \hat{y} \sim \text{p}\left(y |\, \hat\theta\right) \]

  • With these fake data \(\hat{y}\), make sure we can estimate \(\hat\theta\);
  • This process teaches us about sensible candidates for the prior \(\text{p}(\theta)\);
  • This is an analogy to unit testing for models;
  • If you can't do this, don't call your model generative! And without it, there is no way to test your estimation software (a toy sketch of the workflow follows).
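
Here, a one-parameter linear model stands in for the richer choice models described below; everything in this sketch is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: draw theta_hat from the prior p(theta). Here theta is a single slope
# with a Normal(0, 1) prior; the real model's theta is much richer.
theta_hat = rng.normal(0.0, 1.0)

# Step 2: draw fake data y_hat from p(y | theta_hat), given fixed covariates x.
x = rng.normal(size=500)
y_hat = theta_hat * x + rng.normal(size=500)

# Step 3: estimate theta from the fake data and check that we recover theta_hat.
theta_est, *_ = np.linalg.lstsq(x[:, None], y_hat, rcond=None)
print(theta_hat, theta_est[0])   # should be close; repeat over many prior draws
```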

We model utilities that people compare when making choices

Each customer \(i\) in market/time \(t\) is evaluating choices indexed by \(j\in 1:J\). Each customer compares each choice's utility to them, \(U_{itj}\). This utility comprises an additively separable fixed component \(V_{itj}\) and a random component \(\epsilon_{itj}\). We drop the product index to refer to the vector of utilities \(U_{it}\).

\[ U_{itj} = V_{itj} + \epsilon_{itj} \]

The choice to not make a choice (make no purchase) is called the outside option, customarily indexed as choice \(j=0\). For convenience, we normally assume \(V_{it0}=0\). In the following, we assume choices are mutually exclusive and exhaustive.

Clearly, we never observe \(U_{itj}\). But we do observe choices. Each decisionmaker makes the choice that gives them the highest utility (fixed plus random). The random component means that two decisionmakers \(a\) and \(b\) with \(V_{at} = V_{bt}\) can make different choices.

But we observe choices, not utilities

"Beer choice" is \(j\), product characteristics \(X_{itj}\) are the price, hops, malt etc.

Customer   Visit   Beer choice   Price   Hops   Malt
1          1       1              6.25   15     20
1          2       4             12.99   35     25
1          3       4             12.99   35     25
1          4       3              8.99   12      5
2          1       4             12.49   35     25
2          2       2              7.29   16     17

Options

We want to build a model of each customer's purchase probability given their characteristics, product characteristics &c. How might we do this?

  • One approach would be to just use machine learning. Predict Beer choice given everything else.
  • The problem is that the alternatives available to each customer matter. What they do not choose is also informative.
  • Not all customers face the same options. Can't use a standard classification approach as that will give probabilistic weight to choices that a customer does not face.

Maybe it will help to transform the data.

Expressing data as possible choices

We can think of our transformed data as describing the potential choices that each customer will make, with each "observation" (row) being a potential purchase. Note that because choices are mutually exclusive and exhaustive, Chosen can only be 1 for one "observation" in each customer visit.

Let's take the first customer's first visit.

Customer   Visit   Beer option   Chosen   Price   Hops
1          1       1             1         6.25   15
1          1       2             0         7.29   16
1          1       3             0         8.99   12
1          1       4             0        12.99   35

From utilities to probability

Customer   Visit   Beer option   Chosen   Model probability
1          1       1             1        \(p(U_{111} = \max(U_{11}))\)
1          1       2             0        \(p(U_{112} = \max(U_{11}))\)
1          1       3             0        \(p(U_{113} = \max(U_{11}))\)
1          1       4             0        \(p(U_{114} = \max(U_{11}))\)
  • If we have a model that gives us probabilities \(p(U_{itj} = max(U_{it}))\) of each choice given the choices on offer, we can construct a loss function/likelihood for any observed choices. This loss function/likelihood can help us estimate model unknowns.
  • For each visit, these probabilities must sum to less than one (there should be a positive probability of no choice made).

Constructing a likelihood/loss function

In the previous table, we had the observed choice vector Chosen, \(y\), and Model probability given parameters, \(p(y |\, \theta)\). If we arrange data by stacking products/individuals/visits into a panel, we can calculate the log likelihood as

\[ \log(L(\theta)) = y'\log(p(y=1|\, \theta)) \] If we were optimizers, we'd want to maximize the log likelihood, or minimize the negative log likelihood.

All we need to do is come up with a way of going from our utility model to \(p(y=1 |\, \theta)\).
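
As a concrete sketch of the log-likelihood calculation above (the vectors below are made up, stacked in the panel layout just described):

```python
import numpy as np

# Stacked panel: one row per (customer, visit, option); 1 if that option was chosen.
chosen = np.array([1, 0, 0, 0,  0, 0, 1, 0])
# Model probabilities p(y = 1 | theta) for the same rows (made up for illustration).
prob = np.array([0.4, 0.3, 0.2, 0.1,  0.2, 0.1, 0.5, 0.2])

log_lik = chosen @ np.log(prob)   # y' log p(y = 1 | theta)
neg_log_lik = -log_lik            # what an optimizer would minimize
print(log_lik)
```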

Back to utility

We want to build a model of the fixed component \(V_{itj}\) that depends on some model parameters \(\theta\) and information. Let's abuse notation and call it \(V_{itj}(\theta, X_{itj})\), where \(X_{itj}\) is the relevant information about the customer, product and market. For now we assume that all relevant information to the decision is observed.

Our utility is

\[ U_{itj} = V_{itj}(\theta, X_{itj}) + \epsilon_{itj} \]

There are many candidates for \(V_{itj}()\), and you could come up with many more. For the moment, let's take it as fixed and see how we'd go from our model to choice probabilities.

Simulating utilities

In principle (though not in practice), for any \(V_{itj}(\theta, X_{itj})\) we could simulate \(U_{itj}\) using any distribution of \(\epsilon_{itj}\). The maximum utility for each row—the choice that would have been made in that draw—is indicated with parentheses.

Beer 1   Beer 2   Beer 3   Beer 4   Beer 5   Beer 6   Beer 7   \(U_{11}\) (draw)
 1.22    (3.66)   -4.02    -0.23    -1.62     3.05     0.4     \(= V_{11} + e_{11_{1}} = U_{11_{1}}\)
-2.92    (4.58)   -3.55    -0.31    -3.17     0.92    -0.46    \(= V_{11} + e_{11_{2}} = U_{11_{2}}\)
 0.49    (8.62)   -3.5      0.64     1.46     2.83    -1.32    \(= V_{11} + e_{11_{3}} = U_{11_{3}}\)
-0.63     1.82    -7.53    (1.98)   -4.07     0.88    -0.08    \(= V_{11} + e_{11_{4}} = U_{11_{4}}\)

Turning utilities into probabilities

Once we have utility draws, we can create row-max indicators and take column averages. This gives us the proportion of times each column (choice) "won". Over a large number of draws, this gives us the probability that a customer makes choice \(j\) for the given model \(V()\) parameterized at \(\theta\).

Beer 1   Beer 2   Beer 3   Beer 4   Beer 5   Beer 6   Beer 7   Draw
0        1        0        0        0        0        0        \(=I(U_{11_{1}} = \max(U_{11_{1}}))\)
0        1        0        0        0        0        0        \(=I(U_{11_{2}} = \max(U_{11_{2}}))\)
0        1        0        0        0        0        0        \(=I(U_{11_{3}} = \max(U_{11_{3}}))\)
0        0        0        1        0        0        0        \(=I(U_{11_{4}} = \max(U_{11_{4}}))\)
0.04     0.64     0.14     0.03     0.14     0.01     0.04     = Proportion
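
A minimal numpy sketch of this simulation for a single customer visit, using iid standard normal shocks purely for illustration (the fixed utilities are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

V = np.array([0.5, 1.5, -0.5, 0.2, 0.8, -1.0, 0.1])   # fixed utilities V_11 (made up)
n_draws = 5000

# Each draw adds a fresh vector of random shocks to the fixed utilities.
eps = rng.normal(size=(n_draws, V.size))
U = V + eps

# Row-max indicators, then column means = simulated choice probabilities.
winners = (U == U.max(axis=1, keepdims=True)).astype(float)
choice_probs = winners.mean(axis=0)
print(choice_probs)     # estimates p(U_11j = max(U_11)) for each beer
```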

What have we done?

We've gone from \(\theta\rightarrow p(y)\) (for the first visit for the first person for a given \(V_{itj}(\theta, X_{itj})\))

Beer 1   Beer 2   Beer 3   Beer 4   Beer 5   Beer 6   Beer 7
0.04     0.64     0.14     0.03     0.14     0.01     0.04     = Proportion

We can

  1. Use it to generate new values/predictions for any set of model parameters \(\theta\)
  2. Go backwards—evaluate the likelihood \(p(y|\theta)\) using the method described above.

Why you shouldn't simulate choice like this

  • Arbitrary distributions of \(\epsilon_{itj}\) are not necessarily identifiable. See Train's (2009) chapter on Probit. Essentially it's possible to generate choice probabilities for any distribution, but not necessarily possible to estimate the distribution (even within a family) given observed data.
  • The standard error of the probabilities increases quickly with the number of alternatives. This means running more draws to simulate the choice probabilities. You need to simulate probabilities for every choice, and so this becomes computationally demanding.
  • There are more straightforward alternatives!

What to do instead?

  • Assume that the vector \(\epsilon_{it} \sim \text{Multi normal}(0, \Sigma)\), and simulate choice probabilities using GHK or alternatives:
    • This allows the choice shocks to be arbitrarily correlated and the variances of particular shocks to differ; but
    • Can be tricky to have customers facing different numbers of choices;
    • Need to simulate probabilities for every choice made—computationally expensive.
  • Assume that \(\epsilon_{itj} \sim \text{Gumbel}()\) is iid:
    • Shocks won't be correlated (i.e. you need to explicitly model correlation in \(V_{itj}()\)); but
    • No need to simulate probabilities, which take the form

\[ \text{p}(U_{itj}=\max(U_{it})) = \frac{\exp(V_{itj})}{\sum_{k=0}^{J} \exp(V_{itk})} \]

That is, \[ \text{p}(y_{it}=1 | \, \theta) = \mbox{softmax}(V_{it}) \]
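
In code, the simulated probabilities above collapse to a single softmax. A minimal sketch, where `V` is a made-up vector of fixed utilities for the inside goods and the outside option is normalized to zero:

```python
import numpy as np

def choice_probs(V):
    """Logit choice probabilities given fixed utilities V_it for the J inside goods.

    Prepends the outside option (V_it0 = 0), then applies a softmax."""
    u = np.concatenate(([0.0], V))
    u = u - u.max()                   # stabilize the exponentials
    expu = np.exp(u)
    return expu / expu.sum()          # element 0 is the outside option

print(choice_probs(np.array([0.5, 1.5, -0.5])))
```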

Random coefficients logit

The random coefficients logit model (also known as "mixed logit") is

\[ U_{itj} = X_{jt}\beta_{i} + \xi_{jt} + \epsilon_{itj} \text{ with } \epsilon_{itj} \sim \text{Gumbel}() \]

\(\beta_{i}\) are individual-level slope coefficients and \(\xi_{jt}\) is each product's demand shock, assumed to be known to decisionmakers but unobserved by us. This gives

\[ V_{itj}(\theta, X_{itj}) = X_{jt}\beta_{i} + \xi_{jt} \]

We normally model \(\beta_{i}\) as

\[ \beta_{i} \sim \text{Multi normal}(b, \Sigma_{\beta}) \]

Parameterization and priors

Covariance matrices are hard to think about/provide priors for. So we parameterize the model as

\[ \Sigma_{\beta} = \text{diag}(\tau)\, \Omega_{\beta}\, \text{diag}(\tau) \]

where \(\tau\) is a vector of scale parameters and \(\Omega\) is the correlation matrix of the individual slopes.

This parameterization enables us to give easy-to-understand priors to our parameters

\[ b_{j} \sim \text{Normal}(0, 1) \]

\[ \tau_{j} \sim \text{Cauchy}_{+}(0, 1) \] \[ \Omega \sim \text{LKJ}(4) \]

\[ \xi \sim \text{Normal}(0, \sigma_{\xi}) \text{ with } \sigma_{\xi} \sim \text{Normal}_{+}(0, .5) \]

(Prior values are of course context dependent.)

Multi-normal reparameterization trick

Recall that if \(y \sim \text{Normal}(\mu, \sigma)\) then \(y = \mu + \sigma z\) where \(z \sim \text{Normal}(0, 1)\). The same idea works in higher dimensions, only we need the "square root" of the covariance matrix.

If

\[ Y \sim \text{Multi normal}(\mu, \Sigma) = \text{Multi normal}(\mu, \text{diag}(\tau)\, \Omega_{\beta}\, \text{diag}(\tau) ) \] Take the Cholesky factor \(LL' = \Omega\)

Then

\[ Y = \mu + \text{diag}(\tau)\,L z \text{ where each element of } z \sim \text{Normal}(0,1) \]
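
A quick numpy check of this identity, with made-up values of \(\mu\), \(\tau\) and \(\Omega\):

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -0.5, 2.0])
tau = np.array([0.5, 1.0, 2.0])                     # scales
Omega = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, -0.2],
                  [0.1, -0.2, 1.0]])                # correlation matrix
L = np.linalg.cholesky(Omega)                       # L L' = Omega

# Draw Y = mu + diag(tau) L z with each element of z ~ Normal(0, 1).
z = rng.normal(size=(100_000, 3))
Y = mu + z @ (np.diag(tau) @ L).T

# Empirical covariance should approach diag(tau) Omega diag(tau).
print(np.cov(Y, rowvar=False))
print(np.diag(tau) @ Omega @ np.diag(tau))
```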

Parameterization

So we can parameterize our \(\beta_{i}\) as

\[ \beta_{i} = b + \text{diag}(\tau)\,L z_{i} \] with \(z_{ij} \sim \text{Normal}(0,1)\).

This parameterization will come in handy once we move to an aggregate model.

Note that there's nothing to say we can't model \(b\) with information about each individual also.

Putting it together

  • We now have prior distributions for all unknowns \(\theta = (b, \tau, \text{vec}(L), z, \xi)\) and known conditioning information \(X_{jt}\);
  • We can now draw from the prior and calculate \(V_{itj}()\) and thus produce choice probabilities. From a categorical distribution, we could simulate choices.
  • If we observe choices, we can generate penalized likelihood estimates for the unknowns.
  • As we have priors for all unknowns, we can use Stan to run MCMC and obtain Bayesian estimates of all parameters.

Concerning \(\xi\)

\(\xi\) is typically interpreted as a demand shock, observed by customers and sellers. It should at the very least be used by sellers in setting price.

  • As \(\text{E}[\xi \cdot \text{price}] \neq 0\), we cannot just implement this as a random effect. Doing so will bias our estimates of \(b\), \(\tau\) and \(\Omega\).
  • You can think of this like a product's fashionability. Fashionable products will have higher prices and sales, but the researcher doesn't observe fashionability. Ignoring the relationship will lead to estimates of price sensitivity that are too low.
  • The two most common approaches are to treat it as a latent variable that causes price (see Allenby, Chen and Yang 2003 for a Bayesian approach), or to set Bayes aside and use GMM. The latter approach relies on demand shocks \(\xi\) being uncorrelated with supply-shifting instruments.

Modeling aggregate choice

  • What I have described so far is a model of individual choices. Yet often we only have data on aggregate sales.
  • Of course we can't use aggregate data to estimate individual-level \(\beta_{i}\)s, but with a few assumptions we can still identify \(b\), \(\tau\), \(L\), \(z\), and \(\xi_{t}\).
  • These parameters are all that is required to simulate substitution patterns with no strategic response from competitors. We'll model that response soon.

To estimate aggregate random coefficients logit we need some additional assumptions: a) all customers in the same market face the same choice set, or at least we know the distribution of available choices; b) all customers face the same price, which we know; and c) customers don't substitute across markets. These can be strong assumptions.

Modeling aggregate choice

That is, our aggregate data look more like this:

Market   Product   Sales   Price   Hops
1        Beer 1    44       6.25   15
1        Beer 2    36       7.29   16
1        Beer 3    1        8.99   12
1        Beer 4    27      12.99   35
2        Beer 1    202      6.99   15
2        Beer 2    16       7.49   16

From individual choice to market shares

Recall that we have defined

\[ V_{itj}(\theta, X_{itj}) = X_{jt}\beta_{i} + \xi_{jt} = (X_{jt}b + \xi_{jt}) + (X_{jt}\text{diag}(\tau)L z_{i}) \] where the first parenthesis contains the bits of \(V\) common to customers in market \(t\), and the second parenthesis contains the idiosyncratic parts. For a given \(z_{i}\),

\[ \text{prob}(U_{itj}=\max(U_{it})|\, z_{i}) = \text{softmax}(V_{it}(\theta, X_{it}))_{j} \] We can convert this to market shares \(s_{j}(\theta, X_{tj})\) by integrating over \(z\) (which has density \(p(z)\)):

\[ s_{j}(\theta, X_{tj}) = \int_{z} \text{softmax}(V_{it}(\theta, X_{it}))_{j}\, p(z)\, dz \]

From individual choice to market shares

This is very easy to approximate numerically; a code sketch follows the steps below.

  • If \(X_{jt}\) has \(P\) columns, draw an \(NS\) by \(P\) matrix \(z\) from a standard normal distribution, where NS is a large number (500 or so)
  • We now "simulate" NS customers' choice probabilities:
    • For each row in \(z_{i}\) in \(z\), calculate \(V_{itj} = X_{jt}\beta_{i} + \xi_{jt} = (X_{jt}b + \xi_{jt}) + (X_{jt}\text{diag}(\tau)L z_{i})\)
    • Create an \(NS \times (J+1)\) matrix \(\Pi\) whose row \(i\) is \(\text{softmax}((0, V_{it1}, \dots, V_{itJ}))\), the leading 0 being the outside option's fixed utility \(V_{it0}=0\) (note \(V_{itj}\) is a real while \(V_{it}\) is a vector.)
    • Take column means of \(\Pi\)
    • These column means are our estimate of \(s(\theta, X_{tj})\). Each entry is our estimate of a good's share \(s_{j}(\theta, X_{t})\)
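
A minimal numpy sketch of these steps for a single market, with made-up dimensions and parameter values (all names here are illustrative, not our actual code):

```python
import numpy as np

rng = np.random.default_rng(4)

J, P, NS = 4, 3, 500                       # products, characteristics, simulated customers
X = rng.normal(size=(J, P))                # product characteristics for one market (made up)
xi = rng.normal(scale=0.5, size=J)         # demand shocks (made up)
b = np.array([1.0, -2.0, 0.5])             # mean tastes
tau = np.array([0.3, 0.5, 0.2])            # taste scales
Omega = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.0, -0.1],
                  [0.0, -0.1, 1.0]])       # taste correlations
L = np.linalg.cholesky(Omega)
z = rng.normal(size=(NS, P))               # fixed standard normal draws

# Common part of utility plus each simulated customer's idiosyncratic part.
delta = X @ b + xi                         # (X b + xi), shape (J,)
mu = (X @ np.diag(tau) @ L @ z.T).T        # X diag(tau) L z_i for each i, shape (NS, J)
V = delta + mu

# Softmax over (outside option, inside goods) for each simulated customer.
U = np.column_stack([np.zeros(NS), V])
U -= U.max(axis=1, keepdims=True)
Pi = np.exp(U) / np.exp(U).sum(axis=1, keepdims=True)   # NS x (J + 1)

shares = Pi.mean(axis=0)    # first entry: outside share; the rest: s_j(theta, X_t)
print(shares)
```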

From market shares to sales data

Imagine this was the "true" model and we knew all unknowns with certainty. In a limited-sized market, would we expect the observed market share to line up perfectly with the market shares above? No:

  • There will always be a small approximation error in our numerical integration
  • In limited-sized markets, we'll always get randomness in decision-making. In small markets, this measurement error (where we think of measurement error as being observed market share as a measurement of "true" model-consistent share) can be large.

With this in mind, a market of size \(N\) (real product sales—not nominal!) will have observed sales

\[ (\text{Outside sales}_{t}, \text{Sales}_{t}')' \sim \text{Multinomial}(N, s(\theta, X_{t})) \]
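
To close the loop, observed sales are then a single multinomial draw around the simulated shares. A tiny illustrative sketch (the share values are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated shares from the sketch above (first entry is the outside option).
shares = np.array([0.42, 0.25, 0.15, 0.10, 0.08])

N = 1000                               # market size in units, not revenue
sales = rng.multinomial(N, shares)     # (outside sales, product sales)
print(sales)
```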

A model for the supply side

  • We have a pretty nasty problem. We have as many \(\xi_{jt}\)s as we have observations—our model is completely identified by the priors.
  • As before, \(\xi_{jt}\) is probably correlated with price, and possibly correlated with non-price elements of \(X_{itj}\).

We adopt an approach similar in spirit to Allenby, Chen and Yang (2003) for dealing with \(\xi_{jt}\). Treat it as a latent factor.

  • Because each \(\xi_{jt}\) "points at" both \(\text{Sales}_{jt}\) and \(\text{price}_{jt}\), it is no longer completely identified by its prior.

A model for the supply side–reduced form

One approach is to be theory-agnostic.

  • An increase in \(\xi\) increases price and utility;
  • \(\xi\) is uncorrelated with cost-shifting instruments \(Z\)

So we could model

\[ \text{price}_{jt} \sim \text{Normal}_{+}(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta + \lambda \xi_{jt}, \sigma_{p}) \] Where \(\lambda > 0\) and \(X_{jt-\text{price}}\) are the elements of \(X_{jt}\) excluding price. We could also use a lognormal distribution.
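
A sketch of simulating prices from this reduced-form equation with made-up parameter values, using a normal truncated below at zero for \(\text{Normal}_{+}\):

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(6)

J = 4
X_nonprice = rng.normal(size=(J, 2))      # non-price characteristics
Z = rng.normal(size=(J, 1))               # cost-shifting instruments
xi = rng.normal(scale=0.5, size=J)        # demand shocks

alpha = 8.0
gamma = np.array([0.5, -0.3])
delta = np.array([1.0])
lam, sigma_p = 1.5, 0.5                   # lambda > 0 loads xi onto price

mean_price = alpha + X_nonprice @ gamma + Z @ delta + lam * xi

# Normal_+ : a normal truncated below at zero.
a = (0.0 - mean_price) / sigma_p
price = truncnorm.rvs(a, np.inf, loc=mean_price, scale=sigma_p, random_state=1)
print(price)
```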

If we believed that \(\xi\) was also correlated with other elements of \(X\), this could be modeled

\[ X_{jt} \sim \text{Multi normal}(A + X_{jt}\Gamma + Z_{jt}\Delta + \Lambda \xi_{jt}, \Sigma_{X}) \] Where \(A\) is a vector of intercepts, \(\Gamma\) is a coefficient matrix with zero on the diagonal, \(\Delta\) is a coefficient matrix, and \(\Lambda\) is a loading vector with the first element restricted to be positive.

Reduced-form supply side is clearly ridiculous

A reduced-form supply side might help us identify latent \(\xi\) fairly robustly, and in turn help us reduce bias in our structural parameter estimates. But there are some issues.

  • This is kind of like saying that we expect the prices to have no relationship to the market-clearing equilibrium.
  • Competitors do not respond at all to you changing price or introducing a new product.
  • Your uplift estimates from pricing changes will be over-optimistic.

A simple structural model

A very simple "structural" supply model might be to assume that each producer in each market maximizes within-period profits. For ease of explanation assume each product is produced by a single firm (this is simple to relax).

\[ \text{profits}_{jt} = (\text{price}_{jt} - mc_{jt})\,s_{j}(\theta, X_{t})\,N \] Where \(mc_{jt}\) is the marginal cost. The first-order condition for price is \(s_{j}(\theta, X_{t}) + (\text{price}_{jt} - mc_{jt})\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}} = 0\); rearranging gives

\[ \text{price}_{jt} = mc_{jt} - \frac{s_{j}(\theta, X_{t})}{\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}}} \]

A simple structural model

If we model

\[ \text{mc}_{jt} \sim \text{Normal}_{+}(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta, \sigma_{p}) \] then

\[ \text{price}_{jt} \sim \text{Normal}_{+}\left(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta - \frac{s_{j}(\theta, X_{t})}{\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}}}, \sigma_{p}\right) \]
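
To illustrate the markup term, here is a sketch that computes \(\partial s_{j}/\partial \text{price}_{jt}\) by finite differences. A plain logit share function stands in for the full random coefficients share simulator, and all numbers are made up:

```python
import numpy as np

def shares(prices, delta_other, alpha=-2.0):
    """Logit shares of the J inside goods given prices (outside option utility 0).

    A stand-in for the full random coefficients share simulator."""
    v = delta_other + alpha * prices
    u = np.concatenate(([0.0], v))
    expu = np.exp(u - u.max())
    return (expu / expu.sum())[1:]

prices = np.array([6.25, 7.29, 8.99, 12.99])
delta_other = np.array([12.0, 13.5, 16.0, 24.0])   # made-up non-price utility

# Own-price derivative of each good's share by finite differences.
eps = 1e-5
s = shares(prices, delta_other)
ds_dp = np.empty_like(prices)
for j in range(len(prices)):
    bumped = prices.copy()
    bumped[j] += eps
    ds_dp[j] = (shares(bumped, delta_other)[j] - s[j]) / eps

markup = -s / ds_dp                 # the FOC's  -s_j / (ds_j/dp_j) term
implied_mc = prices - markup        # marginal cost implied by the pricing FOC
print(markup, implied_mc)
```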

A simple structural model

Here, the error in \(mc\) has two interpretations:

  1. Firms are perfect pricers (within this game) and marginal costs are known for sure to the firm, so the error reflects marginal-cost shocks we don't observe
  2. Firms make pricing mistakes but we completely describe marginal costs as a linear combination of \(X\) and \(Z\).

In reality, there are both pricing errors and marginal cost shocks. It would be an interesting exercise to try to separately identify them.

Implementation

  • We use Stan to model this. This allows us to estimate via penalized likelihood (fast!) or Hamiltonian Monte Carlo (slow!).
  • Use the same values of \(z\) for every iteration. Provide these as data to your code.
  • Estimation is easy:
    • For a given set of parameters and \(X\), generate predicted shares;
    • Model sales as multinomial distributed around predicted shares;
    • Model price using your choice of price model.
  • We should thank Mike Betancourt and Andy Gelman for helping us out with some fitting issues earlier in the piece.

A big warning

  • Identification of \(\xi\) depends on a sensible supply-side model.
  • There may not be a unique equilibrium to the pricing game, in which case there is no well-defined likelihood for price.
  • This is still the case for policy simulations (like merger simulations) with other approaches like BLP, but our case relies on a good supply-side definition to estimate demand-side parameters.

BLP

The normal way to estimate these models is to use the algorithm of Berry, Levinsohn and Pakes (1995).

It differs considerably from our approach

  • Does not treat \(\xi\) as a random variable; rather, it is unobserved data;
  • No measurement error;
  • Uses GMM; depends on bootstrap or large-sample asymptotics for uncertainty;
  • \(\xi\) is identified without a supply-side model;
  • Computationally expensive fixed-point iteration is required;
  • Not especially robust. Requires blood sacrifice to work properly.

Very influential, 4450 citations, will win them a gong before long.

BLP (very quickly!)

The idea is to perform the following operations

  1. For a given fixed set of parameters \(\theta\) (which now exclude \(\xi\)), back out \(\xi\) using fixed point iteration;
  2. Interpret \(\xi\) as a demand shock, which will be uncorrelated with cost-shifting instruments in equilibrium. Use this fact to construct moment conditions \(g(\theta | \xi, Z)\);
  3. Minimize moment conditions over \(\theta\).

Repeat 1:3 until convergence.
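
For step 1, the contraction mapping looks roughly like the sketch below. Here `predict_shares` is a plain logit stand-in for the full share simulator (so the loop converges immediately); with random coefficients it would integrate over the taste draws and take many iterations:

```python
import numpy as np

def predict_shares(delta):
    """Model-predicted inside-good shares given mean utilities delta.

    Plain logit here for illustration; in BLP this integrates over taste draws."""
    u = np.concatenate(([0.0], delta))
    expu = np.exp(u - u.max())
    return (expu / expu.sum())[1:]

def invert_shares(observed, tol=1e-12, max_iter=10_000):
    """BLP contraction: find delta so that predicted shares match observed shares."""
    delta = np.log(observed) - np.log(1.0 - observed.sum())   # plain logit start
    for _ in range(max_iter):
        new_delta = delta + np.log(observed) - np.log(predict_shares(delta))
        if np.max(np.abs(new_delta - delta)) < tol:
            return new_delta
        delta = new_delta
    return delta

observed = np.array([0.30, 0.25, 0.10])   # observed inside shares; outside share 0.35
print(invert_shares(observed))
```

Given the recovered mean utilities \(\delta\), the demand shock is \(\xi_{jt} = \delta_{jt} - X_{jt}b\) for candidate linear parameters \(b\), and these feed the moment conditions in step 2.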

Why use our method over BLP?

  • Easier to implement;
  • Get full posterior over \(\theta\), which you can use in a decision-theoretic framework;
  • You're dealing with small markets which might have measurement error;
  • You care about cross-validation. With BLP, share predictions require \(\xi\), which is backed out of fixed-point iterations that require shares (which you don't have for held-out data!). By contrast, our method allows you to integrate predictions over the posterior distribution of a new \(\hat\xi\). This is important!

Why use BLP over our method

  • Weird pricing games are being played (price wars, loss-leading, predatory pricing). I really don't want to down-play how important a reasonable price model is to identifying \(\xi\);
  • Very large markets, so effectively no measurement error;
  • You care about being published.

Applications: Estimating known parameters

Applications: Pricing portfolios of competing products

Applications: Tanzanian elections