Study type | Benefits | Problems |
---|---|---|
Survey-based | Experimental variation possible in product attributes. Controlled environment. | Selection-into-survey bias. Stated preferences only |
Observational study | Revealed preference doesn't lie! | Endogenous price-setting. Need a lot of model structure to learn anything. |
Field experiment | Revealed preferences are observed. Controlled setting allows for causal inference | Can be extremely costly or unethical. Harder to get buy-in. |
If you sell multiple products, cannibalization can push the profit-maximizing prices/attributes for the portfolio a long way from the prices/attributes that would maximize profits for each product sold on its own.
Price effects that ignore competitors' responses will produce elasticity estimates that are too large and price recommendations that are too low.
Point-estimates (including Bayesian point-estimates) are of limited use in decision-making. (Posterior) uncertainty should enter our decision-rules directly.
Any decision we make depends on getting a probabilistic estimate of some unknowns \(\theta\) in the general population/future
For instance, the unknowns might be \(\theta = \{\text{part-worths, price elasticities, covariance parameters etc.}\}\)
We form a posterior estimate of these parameters \(p(\theta|\, y)\), which is a multivariate distribution.
This requires coming up with priors \(p(\theta)\) and a likelihood \(p(y|\, \theta)\)
What do we want?
A function \(L(\theta, y)\) that gives us a big number when the \(\theta\) provided "makes sense" of the data, and a small number if \(\theta\) is hopeless.
You might think of this as the reverse of a loss function.
Let's say we have a very simple linear model of \(y\) with a single covariate \(x\). \(\alpha\), \(\beta\) and \(\sigma\) are the model's parameters.
\[ y_{i} = \alpha + \beta x_{i} + \epsilon_{i} \]
with \(\epsilon_{i} \sim \mbox{N}(0, \sigma)\). This is the same as saying that
\[ y_{i} \sim \mbox{N}(\alpha + \beta x_{i}, \sigma) \]
Let's say we propose a set of parameters, \(\alpha = 2\), \(\beta = .5\) and \(\sigma = 1\). Then we have the generative distribution \(y_{i} \sim \mbox{N}(2 + 0.5 x_{i}, 1)\).
Now we have an observation \(y_{i} = 1,\, x_{i} = 1\). Given the parameters and \(x_{i}\), we know how the outcome \(y_{i}\) should be distributed, and we can evaluate the density at the observed value.
The density of the proposed distribution at the observed outcome is known as the likelihood contribution.
Let's call this \(f(y_{1} |x_{1},\, \theta)\)
The likelihood is the product of all likelihood contributions across the sample, \(\prod_{i} f(y_{i}\,|\,x_{i},\, \theta)\)
For example, if we had five observations, all with \(x=1\), \(y = \{1, 3, 4, 0, 2\}\), the likelihood would be (for the proposed parameters)
\[ L(y|\theta) = f(1|\theta)\times f(3|\theta)\times f(4|\theta)\times f(0|\theta) \times f(2|\theta) \approx 0.0000364 \]
As you can see, multiplying densities together makes very small numbers
These numbers get smaller as you have more observations
Computer precision doesn't work well with very small numbers!
Log Likelihood to the rescue
\[ \log L(y|\,\theta) = \log\left(\prod_{i} f(y_{i}\,|\,x_{i},\, \theta)\right) = \sum_{i}\log f(y_{i}\,|\,x_{i},\, \theta) \]
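To make this concrete, here is the toy example in R: the raw likelihood under the proposed parameters is a tiny number, while the log likelihood is perfectly manageable.

```r
# Proposed parameters and data from the toy example above
alpha <- 2; beta <- 0.5; sigma <- 1
x <- rep(1, 5)
y <- c(1, 3, 4, 0, 2)

# Likelihood: the product of normal densities at each observation (~3.6e-05)
prod(dnorm(y, mean = alpha + beta * x, sd = sigma))

# Log likelihood: the sum of log densities (~ -10.2), numerically much better behaved
sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))
```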
Bayes rule says that
\[ p(\theta|\, y) \propto p(\theta)\times p(y|\,\theta) \]
We need good priors \(p(\theta)\).
These depend on your model and, in choice problems, on the dimensionality of your choice space (i.e. how many choices are on offer)
NEVER choose priors by looking at the data. NEVER "estimate priors".
The workflow we strongly recommend is:
For some prior \(p(\theta)\), draw parameter values from \(p(\theta)\), simulate data from the likelihood \(p(y|\,\theta)\) using those draws, and inspect the simulated data.
This is known as generating prior predictive draws. If you have good priors, your prior predictive draws will look vaguely plausible.
Priors interact with the likelihood. Only this workflow will teach you how.
Let's say we have \(j = 1, \dots, 10\) choices, each with utility \(u_{j} = \theta_{j} + \epsilon_{j}\), where \(\epsilon_{j}\) is Gumbel distributed. Let's give each \(\theta_{j}\) the prior \(\text{Normal}(0, 1)\)
What if we change it to \(\text{Normal}(0, 10)\)?
Using wide priors on part-worths is like saying "I think there's a very strong chance this is a winner-takes-all market"
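Here is a minimal prior predictive sketch in R for this example. With Gumbel errors, the market shares implied by one draw of the part-worths are just their softmax; under Normal(0, 1) the most popular option's share is usually modest, while under Normal(0, 10) one option typically takes almost the whole market.

```r
set.seed(42)
softmax <- function(x) exp(x) / sum(exp(x))

# Share of the most popular of 10 choices, across repeated prior draws of the part-worths
max_share <- function(prior_sd, n_choices = 10, n_draws = 1000) {
  replicate(n_draws, {
    theta <- rnorm(n_choices, 0, prior_sd)  # one draw from the prior
    max(softmax(theta))                     # implied share of the biggest option
  })
}

summary(max_share(1))   # modest shares: not winner-takes-all
summary(max_share(10))  # shares near 1: a winner-takes-all market
```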
These priors interact with the scale of the product attributes
Bad priors = bad sampling; bad inference
Always stick to this workflow to properly understand how your priors affect things.
Recall that the posterior
\[ p(\theta|\, y) \propto p(\theta)\times p(y|\,\theta) \]
if we take the log of both sides, we get
\[ \log p(\theta|\, y) = \log p(\theta) + \log p(y|\,\theta) + \text{constant} \] which simply says that, for a given value of \(\theta\) (which we treat as unknown) and data \(y\) (which we treat as known), the log posterior is, up to an additive constant, the log prior plus the log likelihood evaluated at \(\theta\).
This value is what we need to provide to Stan.
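As a sketch (not Stan itself), here is that quantity written as an R function for the toy linear model, with hypothetical priors: Normal(0, 5) on \(\alpha\) and \(\beta\), half-Normal(0, 5) on \(\sigma\). Stan builds up the same log density internally, up to an additive constant.

```r
# Log posterior (up to a constant) for the toy linear model
log_posterior <- function(alpha, beta, sigma, x, y) {
  if (sigma <= 0) return(-Inf)                      # sigma must be positive
  log_prior <- dnorm(alpha, 0, 5, log = TRUE) +
               dnorm(beta,  0, 5, log = TRUE) +
               dnorm(sigma, 0, 5, log = TRUE)       # half-normal, up to a constant
  log_lik   <- sum(dnorm(y, alpha + beta * x, sigma, log = TRUE))
  log_prior + log_lik
}

# Evaluate at the proposed parameters from the earlier example
log_posterior(2, 0.5, 1, x = rep(1, 5), y = c(1, 3, 4, 0, 2))
```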
The random coefficients logit model (called "Hierarchical Bayes" in Sawtooth)
Each customer \(i\) draws their tastes \(\beta_{i}\) from some common distribution
\[ \beta_{i} \sim \mbox{Multi normal}(\beta, \Sigma) \] And each product-market pair has a demand shock \(\xi_{jt}\) drawn from a common distribution
\[ \xi_{jt} \sim \mbox{Normal}(0, \sigma_{\xi}) \]
Finally each customer weighs up utility from each choice
\[ U_{itj} = X_{jt}\beta_{i} + \xi_{jt} + \epsilon_{itj} \text{ with } \epsilon_{itj} \sim \text{Gumbel}() \]
Product attributes are \(X_{jt}\). We assume the demand shock is known to decision-makers but unobserved by us.
The decision-maker \(i\) in market \(t\) simply chooses the product \(j\) from \(J + 1\) products that gives them the greatest utility \(U_{itj}\)
The \(J+1\)'th product is known as the "outside good", which is often the choice to make no choice.
Often in market research we are testing new products (which have no demand shock), so utility simplifies to
\[ U_{itj} = X_{jt}\beta_{i} + \epsilon_{itj} \text{ with } \epsilon_{itj} \sim \text{Gumbel}() \]
We use the Gumbel assumption for the errors because the difference of two independent Gumbel draws has a logistic distribution,
\[ (\epsilon_{ijt} - \epsilon_{ikt}) \sim \text{Logistic}(0, 1) \]
which gives us closed-form choice probabilities.
Because we've made the assumption that \(\epsilon_{itj}\) are IID Gumbel, it follows that for \(U_{it} = (U_{i1t}, \dots, U_{iJt})'\)
\[ p_{ijt}(\beta_{i}, \xi_{jt}) = \text{Prob}(U_{ijt} = \max(U_{it})) = \frac{\exp(X_{jt}\beta_{i} + \xi_{jt})}{1 + \sum_{k=1}^{J}\exp(X_{kt}\beta_{i} + \xi_{kt})} \]
Where the 1 in the denominator comes from the \(J+1\)'th "outside good", which has mean utility 0. In conjoint tasks a choice is necessary, so there is no outside good and the 1 drops out of the denominator.
See Luce and Suppes (1965) and McFadden (1974).
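A quick way to see where these probabilities come from is to simulate the model. The attribute matrix and part-worths below are hypothetical and we set \(\xi = 0\); the simulated choice frequencies match the closed-form formula.

```r
set.seed(3)
rgumbel <- function(n) -log(-log(runif(n)))          # standard Gumbel draws

X      <- cbind(price = c(6.25, 7.29, 8.99, 12.99), hops = c(15, 16, 12, 35))
beta_i <- c(-0.5, 0.05)                               # hypothetical part-worths
V      <- c(as.vector(X %*% beta_i), outside = 0)     # deterministic utilities, outside good = 0

# Closed-form logit probabilities (the 1 in the denominator is exp(0) for the outside good)
probs_formula <- exp(V) / sum(exp(V))

# Simulate many choice occasions: add Gumbel shocks, choose the max-utility option
choices   <- replicate(1e5, which.max(V + rgumbel(length(V))))
probs_sim <- tabulate(choices, nbins = length(V)) / length(choices)

round(cbind(formula = probs_formula, simulated = probs_sim), 3)
```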
We've gone from a set of unknowns \(\beta_{i},\, \xi_{jt}\) and observed data \(X_{jt}\) to choice probabilities at the individual level.
For the individual model, we can now construct a log likelihood
Our choice data from a conjoint exercise will look something like this
Customer | Visit | Beer option | Chosen | Price | Hops | … |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 6.25 | 15 | … |
1 | 1 | 2 | 0 | 7.29 | 16 | … |
1 | 1 | 3 | 0 | 8.99 | 12 | … |
1 | 1 | 4 | 0 | 12.99 | 35 | … |
… | … | … | … | … | … | … |
We can take the product attributes, and for a given set of parameters use our model to generate choice probabilities
Customer | Visit | Beer option | Chosen | Model probability |
---|---|---|---|---|
1 | 1 | 1 | 1 | \(p(U_{111} = max(U_{11}))\) |
1 | 1 | 2 | 0 | \(p(U_{112} = max(U_{11}))\) |
1 | 1 | 3 | 0 | \(p(U_{113} = max(U_{11}))\) |
1 | 1 | 4 | 0 | \(p(U_{114} = max(U_{11}))\) |
… | … | … | … | … |
If we arrange data by stacking products/individuals/visits into a panel, we can calculate the log likelihood for a given \(\theta\) as
\[ \log(L(\theta)) = y'\log(p(y=1|\, \theta)) \]
Here, \(p(y=1|\, \theta)\) is simply the probability vector for a given choice set that comes out of the model. \(y\) is the binary choice vector.
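As a minimal sketch with made-up numbers: rows where \(y=0\) contribute nothing to the sum, so the panel's log likelihood is just this dot product.

```r
# Log likelihood of the stacked panel: dot product of choices with log model probabilities
log_lik_panel <- function(chosen, probs) {
  as.numeric(t(chosen) %*% log(probs))
}

# Hypothetical choice task: four options, option 1 chosen
chosen <- c(1, 0, 0, 0)
probs  <- c(0.45, 0.30, 0.20, 0.05)   # model probabilities for a given theta
log_lik_panel(chosen, probs)          # = log(0.45)
```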
We said before that
\[ \beta_{i} \sim \mbox{Multi normal}(\beta, \Sigma) \]
Now remember that if \(y \sim \mbox{Normal}(\mu,\sigma)\) then \(y = \mu + \sigma z\) for some \(z \sim \mbox{Normal}(0, 1)\)
There's something similar for multivariate distributions. Note that \(\sigma\) above is the square root of the variance. In a multivariate distribution the "variance" is the covariance matrix \(\Sigma\). One "square root" of the covariance matrix is a lower-triangular matrix \(L\) such that \(LL' = \Sigma\).
This allows us to say that
\[ \beta_{i} = \beta + L z_{i} \text{ where } z_{i} \sim \mbox{Normal}(0, 1) \]
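A quick R check of this trick with a made-up \(\beta\) and \(\Sigma\): draws built as \(\beta + L z\) recover the target covariance. Note that R's `chol()` returns the upper factor, so the lower-triangular \(L\) is its transpose.

```r
set.seed(1)
beta  <- c(1, -2, 0.5)
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 2.0, 0.3,
                  0.2, 0.3, 1.5), nrow = 3)
L <- t(chol(Sigma))                        # lower triangular, L %*% t(L) == Sigma

z      <- matrix(rnorm(3 * 1e5), nrow = 3) # standard normal draws, one column per individual
beta_i <- beta + L %*% z                   # each column is one individual's part-worths
round(cov(t(beta_i)), 2)                   # close to Sigma
```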
See https://gist.github.com/khakieconomics/d01b43a50c7904bd73e8c190e0a8ecf8 for a version of the model suitable for extremely large attribute spaces (diagonal covariance matrix).
And https://gist.github.com/khakieconomics/0333a54dff4fabf6204080ca5bf92cb6 for a version with a full covariance matrix for the individual part-worths.
Stan can be called from within R, Python, Matlab, Mathematica, Stata, Julia or at the command line (there are no excuses!).
In R, you would generate a `list()` object containing the data described in the `data{}` block of the `.stan` file, then:
```r
# Load the rstan library
library(rstan)

# Make sure you're using all your cores
options(mc.cores = parallel::detectCores())

# Compile the model
compiled_model <- stan_model("hierarchical_bayes_full_covmat.stan")

# Fit the model
model_fit <- sampling(compiled_model,
                      data = your_data_list_goes_here,
                      iter = 600)
```
\(\xi\) is typically interpreted as a demand shock, observed by customers and sellers. It should at the very least be used by sellers in setting price.
The additional assumptions we have to make to estimate the aggregate random coefficients logit model are that (a) all customers in the same market face the same choice set, or at least that we know the distribution of available choices; (b) all customers face the same price, which we know; and (c) customers don't substitute across markets. These might be strong assumptions.
That is, our aggregate data look more like this:
Market | Product | Sales | Price | Hops | … |
---|---|---|---|---|---|
1 | Beer 1 | 44 | 6.25 | 15 | … |
1 | Beer 2 | 36 | 7.29 | 16 | … |
1 | Beer 3 | 1 | 8.99 | 12 | … |
1 | Beer 4 | 27 | 12.99 | 35 | … |
2 | Beer 1 | 202 | 6.99 | 15 | … |
2 | Beer 2 | 16 | 7.49 | 16 | … |
… | … | … | … | … | … |
Recall that the deterministic part of utility is
\[ V_{itj}(\theta, X_{jt}) = X_{jt}\beta_{i} + \xi_{jt} = (X_{jt}\beta + \xi_{jt}) + X_{jt}\,\text{diag}(\tau)\, L\, z_{i} \] where we now decompose the covariance matrix as \(\Sigma = \text{diag}(\tau)\, L L'\, \text{diag}(\tau)\), with \(L\) the Cholesky factor of the correlation matrix and \(\tau\) the vector of scales. The first parenthesis contains the bits of \(V\) common to all customers in market \(t\), and the remaining term is the idiosyncratic part. For a given \(z_{i}\),
\[ \text{prob}(U_{itj}=\max(U_{it})\,|\, z_{i}) = \text{softmax}(V_{it}(\theta, X_{t}))_{j} \] We can convert this to market shares \(s_{j}(\theta, X_{t})\) by integrating over \(z\) (which has density \(p(z)\)):
\[ s_{j}(\theta, X_{t}) = \int_{z} \text{softmax}(V_{it}(\theta, X_{t}))_{j}\, p(z)\, dz \]
This integral is very easy to approximate numerically: draw many values of \(z_{i}\), compute the softmax for each draw, and average.
The method above is extremely easy to implement in Stan. All we need to do is define a function at the top of the file (in a `functions{}` block) that takes the common utilities \(\delta_{t} = X_{t}\beta + \xi_{t}\), the attribute matrix, a Cholesky factor \(L\), and a matrix of Normal(0, 1) draws \(z\). This allows us to calculate the implied market shares:
```stan
vector get_shares(vector delta,  // delta = X*beta + xi: mean utilities, common across customers
                  matrix x,      // product attributes
                  matrix z,      // standard normal draws, one row per simulated customer
                  matrix L) {    // Cholesky factor for the part-worth covariance
  matrix[rows(z), rows(delta)] utils;
  matrix[rows(z), rows(delta)] probs;
  vector[rows(delta)] shares;
  for(i in 1:rows(z)) {
    // each simulated customer's utilities and choice probabilities
    utils[i] = delta' + z[i] * L * x';
    probs[i] = softmax(utils[i]')';
  }
  for(i in 1:cols(probs)) {
    // market share = average choice probability across simulated customers
    shares[i] = mean(probs[:,i]);
  }
  return(shares);
}
```
Imagine this was the "true" model and we knew all unknowns with certainty. In a limited-sized market, would we expect the observed market shares to line up perfectly with the model's market shares above? No: there is still sampling variation.
With this in mind, a market of size \(N\) (real product sales, i.e. quantities, not nominal revenue) will have observed sales
\[ (\text{Outside sales}_{t}, \text{Sales}_{t}')' \sim \text{Multinomial}(N, s(\theta, X_{t})) \]
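A minimal sketch of this sampling variation, with made-up shares and a market of 500 units:

```r
set.seed(2)
shares <- c(outside = 0.40, beer1 = 0.25, beer2 = 0.20, beer3 = 0.10, beer4 = 0.05)
N <- 500                                   # market size in units, including the outside good

# Three simulated markets drawn from the same "true" shares
sales <- rmultinom(3, size = N, prob = shares)
t(sales) / N                               # observed shares bounce around the true shares
```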
We adopt an approach similar in spirit to Allenby, Chen and Yang (2003) for dealing with \(\xi_{jt}\). Treat it as a latent factor.
One approach is to be theory-agnostic.
So we could model
\[ \text{price}_{jt} \sim \text{Normal}_{+}(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta + \lambda \xi_{jt}, \sigma_{p}) \] Where \(\lambda > 0\) and \(X_{jt-\text{price}}\) are the elements of \(X_{jt}\) excluding price. We could also use a lognormal distribution.
If we believed that \(\xi\) was also correlated with other elements of \(X\), this could be modeled
\[ X_{jt} \sim \text{Multi normal}(A + X_{jt}\Gamma + Z_{jt}\Delta + \Lambda \xi_{jt}, \Sigma_{X}) \] Where \(A\) is a vector of intercepts, \(\Gamma\) is a coefficient matrix with zeros on the diagonal, \(\Delta\) is a coefficient matrix, and \(\Lambda\) is a loading vector with the first element restricted to be positive.
A reduced-form supply side might help us identify latent \(\xi\) fairly robustly, and in turn help us reduce bias in our structural parameter estimates. But there are some issues.
A very simple "structural" supply model might be to assume that each producer in each market maximizes within-period profits. For ease of explanation assume each product is produced by a single firm (this is simple to relax).
\[ \text{profits}_{jt} = (\text{price}_{jt} - mc_{jt})\, s_{j}(\theta, X_{t})\, N \] Where \(mc_{jt}\) is the marginal cost. The first-order condition for price is \(s_{j}(\theta, X_{t}) + (\text{price}_{jt} - mc_{jt})\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}} = 0\); rearranging gives
\[ \text{price}_{jt} = mc_{jt} - \frac{s_{j}(\theta, X_{t})}{\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}}} \]
If we model
\[ \text{mc}_{jt} \sim \text{Normal}_{+}(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta, \sigma_{p}) \] then
\[ \text{price}_{jt} \sim \text{Normal}_{+}\left(\alpha + X_{jt-\text{price}} \gamma + Z_{jt} \delta - \frac{s_{j}(\theta, X_{t})}{\frac{\partial s_{j}(\theta, X_{t})}{\partial \text{price}_{jt}}}, \sigma_{p}\right) \]
Here, the error in \(mc\) has two interpretations: it could be a genuine shock to marginal costs, or it could be a pricing error, i.e. the firm deviating from its optimal price.
In reality, there are both pricing errors and marginal cost shocks. It would be an interesting exercise to try to separately identify them.
There is no reason you cannot combine both aggregate and conjoint data into a single model.
In this case, you would use the same parameters (part-worths/covariance matrix) in both models, and have them both contribute to the likelihood.