Our model chunk is almost done—all we need is some good priors. Choosing priors for logit models is quite difficult, and we have to ask several questions:

We strongly advise against dispersed priors like \(\beta \sim \text{Normal}(0,100)\) as these can invite computational issues, as well as ignoring sources of important prior information that can and should be used in fitting the model. Let’s illustrate these dangers with an example.

Imagine an attribute matrix \(X\) that simply includes a dummy for each choice, like so:

Choice 1 dummy Choice 2 dummy Choice 3 dummy
0 0 0
1 0 0
0 1 0
0 0 1

Let’s set \(\xi = 0\) for all choices, and examine how the implied prior choice probabilities vary with choice of prior for \(\beta\). In the simulation below, we first create the \(X\) matrix as above, then draw \(\hat\beta\) from three priors, a \(\text{Normal}(0, 1)\), \(\text{Normal}(0, 5)\) and \(\text{Normal}(0, 20)\). We then take the maximum of \(\text{softmax}(X\hat\beta)\) and save it for the three draws. We repeat the exercise 2000 times. Finally we plot histograms of the maximum probabilities.

softmax <- function(x) exp(x)/sum(exp(x))
N_rep <- 2000
X <- matrix(c(0, 0, 0,
              1, 0, 0, 
              0, 1, 0, 
              0, 0, 1), 4, 3, byrow= T)

out <- data_frame(`Normal(0,1)` = rep(NA, N_rep),
                  `Normal(0,5)` = NA,
                  `Normal(0,20)` = NA)

for(i in 1:N_rep) {
 beta_1 <- rnorm(3, 0, 1)
 beta_2 <- rnorm(3, 0, 5)
 beta_3 <- rnorm(3, 0, 20)
 out[i,1] <- max(softmax(X %*% beta_1))
 out[i,2] <- max(softmax(X %*% beta_2))
 out[i,3] <- max(softmax(X %*% beta_3))
}

In this toy example, you can see the problem: as our prior on \(\beta\) widens, it’s like we’re saying that we give no weight to the possibility that there can be any world other than one where a single choice has almost complete market share.

How does the dimensionality of \(X\) affect choice probability given our prior?

In marketing applications it’s common to “one-hot encode” both qualitative and (discretized) quantitative product attributes when analyzing data from surveys on experimental product design. This sort of data might look like so, potentially with dozens of dummy columns:

Price = $1.50 Price = $1.99 Price = $2.50 Price = $3.99 Size = Medium Size = Large “…”
0 0 0 0 0 0
0 0 1 0 1 0
0 0 0 1 0 1
0 0 0 0 0 0
0 0 1 0 1 0
0 0 0 1 0 1
0 0 0 0 0 0
0 0 0 1 0 1
1 0 0 0 0 1
0 1 0 0 1 0

How does the scale of prior choice affect the distribution of maximum probabilities as \(P\), the number of columns in \(X\), increases? Let’s repeat the simulation study above, but instead of varying the prior scale, we’ll vary the number of attributes. For the time being we’ll keep on using dummies to describe the product attributes.

N_rep <- 2000
P1 <- 3
P2 <- 10
P3 <- 100

out <- data_frame(`P = 3` = rep(NA, N_rep),
                  `P = 10` = NA,
                  `P = 100` = NA)

for(i in 1:N_rep) {
 beta_1 <- rnorm(P1, 0, 1)
 beta_2 <- rnorm(P2, 0, 1)
 beta_3 <- rnorm(P3, 0, 1)
 
 X1 <- rbind(0, matrix(sample(0:1, 3*P1, replace = T), 3, P1))
 X2 <- rbind(0, matrix(sample(0:1, 3*P2, replace = T), 3, P2))
 X3 <- rbind(0, matrix(sample(0:1, 3*P3, replace = T), 3, P3))
 
 out[i,1] <- max(softmax(X1 %*% beta_1))
 out[i,2] <- max(softmax(X2 %*% beta_2))
 out[i,3] <- max(softmax(X3 %*% beta_3))
}

The lesson here is that as the number of choice attributes increases, if we do not alter our priors on \(\beta\) it’s the same as having a prior of increased concentration in the market. This does not make a great deal of sense from a theoretical perspective, and so indicates the importance of choosing the appropriate prior for the analysis at hand.

How about the scale (and correlation) of X? This has the same effect as a widening of the priors. To see this, let’s repeat the simulation, but this time scaling the magnitude of the attribute matrix X. It shows clearly that larger scales of the attribute matrix imply a more concentrated prior probability of a given choice.

X <- matrix(c(0, 0, 0,
              1, 0, 0, 
              0, 1, 0, 
              0, 0, 1), 4, 3, byrow= T)

out <- data_frame(`Original scale` = rep(NA, N_rep),
                  `2x original scale` = NA,
                  `10x original scale` = NA)

for(i in 1:N_rep) {
 beta <- rnorm(3, 0, 1)
 out[i,1] <- max(softmax(X %*% beta))
 out[i,2] <- max(softmax(2 * X %*% beta))
 out[i,3] <- max(softmax(10 * X %*% beta))
}

We recommend scaling continuous product attributes (like cost), and keeping binary variables as 0-1. This changes the interpretation of the estimated \(\beta\)s (to the marginal) change in utility from a one standard deviation increase in \(x\), but makes prior specification more straightforward, and can aid with computing.

A simulation approach to choosing priors

The results above should convince you that using default priors for all cases is a bad idea. Priors will interact with the specifics of the problem, the scale and covariance of the data, the number of covariates etc. We recommend a simulation-based approach, with one of two objectives.

The first approach we recommend is to start by having a prior over the maximum choice probability for a single option, and then back out priors for the \(\beta\) and \(\xi\) that would imply that choice probabilities larger than this prior are possible but not expected. The second approach—which might be more comfortable for those unwilling to take a stand on what they think the maximum choice probability is going to be—is to figure out the priors on \(\beta\) and \(\xi\) that result in a uniform prior probability in the preference space. Note that shares will necessarily not be uniform along each margin— one choice’s gain is another’s loss, and all shares need to add up to 1. But we can stastically test for uniformity in the joint share model by performing a Dirichlet regression on our simulated shares \(s_{i} \sim \text{Dirichlet}(\alpha)\). When all \(\alpha = 1\), then \(s\) is uniformly distributed.

We first recommend scaling the continuous choice attributes in \(X\) to have a mean of 0 and standard deviation of 1. Dummy variables and the intercept column can be left un-scaled. Then, with an attributes matrix \(\hat{X}\) that contains all possible choices:

  1. Draw \(\beta,\, \xi\) from their priors. Call these draws \(\hat\beta\) and \(\hat\xi\)
  2. Calculate \(\hat\mu = \hat{X}\hat\beta + \hat\xi\) and \(\hat\pi = \text{softmax}(\hat\mu)\), the probability of each choice
  3. Save the maximum probability and the distribution of probabilities across choices
  4. Repeat 1:3 many times.
  5. If using the uniform prior on the outcome space, estimate the concentration vector of a Dirichlet distribution over the simualted shares. Create a loss equal to the sum of squared differences between the estimated \(\alpha\) and 1. Repeat 1-4 varying hyperpriors on \(\beta\) and \(\xi\) until loss is sufficiently small. We illustrate this step below.

For the actual choices of priors for \(\beta\) and \(\xi\):

Let’s impliment a prior choice simulation with the aim of choosing priors that result in a uniform prior in the choice probability space.

First, we’ll write a function that implements a Dirichlet regression on a vector of concentration parameters.

library(MCMCpack) # contains dirichlet density function

# Write out the negative log likelihood of a matrix of shares (each row a simplex)
dirichlet_n_log_lik <- function(alpha, share_matrix) {
  if(length(alpha) != ncol(share_matrix)) stop("alpha is not the same
                                               length as the number of choices")
  out <- NULL
  for(i in 1:nrow(share_matrix)) {
    out[i] <- log(ddirichlet(t(share_matrix[i,]), alpha))
  }
  -sum(out)
}

# Implement Dirichlet regression
dirichlet_reg <- function(share_matrix) {
  # This finds values of par to minimize the negative log likelihood function
  out <- optim(par = rep(1, ncol(share_matrix)), 
               fn = dirichlet_n_log_lik, 
               share_matrix = share_matrix,
               method = "L-BFGS-B", lower = 0.001)
  out
}

Next we simulate. For the purpose of this simulation, let’s keep the feature matrix fixed and vary only the prior. For each prior scale, we save the loss, which is the sum of squared differences between the estimates \(\alpha\) and 1. We run the simulation several times with different prior scales. Notice that as \(\xi\) is not identified until we have a supply-side model, we set it to 0 here.

# Fake attribute matrix of 10 choices with 8 attributes, plus
# an outside good
X <- rbind(0, matrix(sample(0:1, 10*8, replace = T), 10, 8))

N_rep <- 100
beta_scale_grid <- c(0.1, 0.2, 0.5, 0.75, 1, 1.2, 1.5, 2)
loss <- rep(NA, 8)

count <- 0
for(sig in beta_scale_grid) {
  count <- count + 1
  out <- matrix(NA, N_rep, 11)
  
  for(i in 1:N_rep) {
    beta <- rnorm(8, 0, sig)
    out[i,] <- t(softmax(X %*% beta))
  }
  
  reg <- dirichlet_reg(out)
  
  loss[count] <- sum((1 - reg$par)^2)
}

Now that we’ve run the simulation, we can see which prior scale resulted in the most uniform prior probability— this will be the smallest value of the loss function.

This indicates that for this particular problem, a prior for the preferences \(\beta \sim \text{Normal}(0, .75)\) defines a fairly uniform prior over the choice shares.