Discrete choice modeling requires the researcher to make some choices of their own. We need to choose a function to approximate \(\mu_{ij}\), a distribution for \(\epsilon_{ij}\), and a mapping between \(u_{ij}\) and our data, which might be at the level of the individual choice, or might be aggregated choices or choice shares across many individuals.

The Logit model is a basic choice model. It has the advantage of being simple to explain and mathematically clean, but in practical terms it is more a learning tool than one for serious analysis. The model makes two assumptions. First, all decision-makers get the same fixed utility from each choice (Franklin likes BLT sandwiches just as much as Aisha does), and that fixed utility is a linear combination of the attributes of the choice. For example, we’d represent a sandwich in terms of its price, number of calories, whether or not it is suitable for vegetarians, whether or not it contains pork, and so on, and collect these attributes for choice \(j\) in the vector \(X_{j}\). Fixed utility is therefore

\[ \mu_{ij} = \mu_{j} = X_{j}\beta \] A common addition to the model is for each choice to have an unobserved utility shock \(\xi_{j}\). This shock is observed by all decision-makers but unobserved by the researcher. You might think of it as containing unobserved choice attributes, in which case it is almost certainly correlated with the observed choice attributes in \(X\).

\[ \mu_{ij} = \mu_{j} = X_{j}\beta + \xi_{j} \]

The second assumption is that the random component of utility is IID and follows a Gumbel distribution, often called the Type I Generalized Extreme Value distribution (you can perhaps see why we prefer to call it the Gumbel distribution). The Gumbel distribution is a skewed distribution with two parameters, a location \(\mu\) and a scale \(\sigma\). It has the helpful property that if \(X \sim \text{Gumbel}(\mu, \sigma)\) and \(Y \sim \text{Gumbel}(\mu, \sigma)\) are independent, then \(X - Y \sim \text{Logistic}(0, \sigma)\). If we fix \(\sigma = 1\) for the identifiability reasons described above, this property allows us to express the probability of each choice in terms of the \(\mu_{j}\) alone, using the softmax function, a multivariate generalization of the inverse logit function. See Train (2001, Chapter 3) for the derivation.
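Before we define the softmax formally below, we can sanity-check this claim by simulation. The sketch below is our own and uses base R only; the helper r_gumbel draws standard Gumbel shocks via the inverse-CDF transform. If decision-makers pick the alternative with the largest \(\mu_{j} + \epsilon_{j}\), the empirical choice frequencies should approach the softmax of \(\mu\).

set.seed(42)
softmax  <- function(x) exp(x)/sum(exp(x))
r_gumbel <- function(n) -log(-log(runif(n))) # standard Gumbel(0, 1) draws

mu_test <- c(0, 1, 2) # fixed utilities for three alternatives
choices <- replicate(1e5, which.max(mu_test + r_gumbel(3)))

# Empirical choice frequencies vs softmax(mu_test); they should agree closely
round(rbind(simulated = table(choices)/1e5,
            softmax   = softmax(mu_test)), 3)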

The softmax function takes a vector \(x\) of \(P\) unbounded elements and converts it to a simplex: a vector of values, all between 0 and 1, that sum to 1. The \(p\)-th element of this simplex vector, \(\text{softmax}(x)_{p}\), is defined as

\[ \text{softmax}(x)_{p} = \frac{\exp(x_{p})}{\sum_{q=1}^{P}\exp(x_{q})} \]

We typically refer to the full simplex vector as \(\text{softmax}(x)\). Because the vector is a simplex, it can be thought of as a set of probabilities of mutually exclusive events. In the case of the logit model of choice, we take the softmax of the unbounded vector of fixed utilities with elements \(\mu_{j} = X_{j}\beta + \xi_{j}\). The softmax gives us the probability that the decision-maker makes each choice \(p\) out of the available choices. Let’s simulate some utilities to illustrate.

set.seed(13)
# Define the softmax function
softmax <- function(x) exp(x)/sum(exp(x))
P <- 10 # Number of alternative choices
# Create a random characteristic matrix and weights matrix X and beta
X <- matrix(rnorm(P*5), P, 5)
beta <- rnorm(5)
# Draw some random demand shocks for each product
xi <- rnorm(P)
# Generate the fixed utilities
mu <- X%*%beta + xi

# illustrate the probabilities
probs <- softmax(mu)
data.frame(probs_percent = round(probs*100, 2), cumulative_probability = cumsum(probs))
##    probs_percent cumulative_probability
## 1           0.98            0.009779541
## 2           0.75            0.017297128
## 3           4.55            0.062772591
## 4           8.82            0.151005651
## 5           4.56            0.196561651
## 6           0.84            0.204980945
## 7          23.44            0.439378311
## 8           4.02            0.479612967
## 9           2.33            0.502867434
## 10         49.71            1.000000000

What can we make out from this simple example? Given the attributes that describe each choice \(X\), the weights \(\beta\) that determine how these attributes affect utility, and the demand shocks \(\xi\), we can calculate the probability that the decision-maker chooses each of the alternatives. If we had some observed choice data, this would be enough to construct a likelihood.

Yet we’ve not yet gone in the other direction: showing that a given set of choice probabilities maps back to unique values of \(\beta\) and \(\xi\). It turns out that unless we “pin down” the model, these values are not identified. You can see this by adding \(10\) to each fixed utility: we get precisely the same choice probabilities.

round(softmax(mu + 10)*100, 2)
##        [,1]
##  [1,]  0.98
##  [2,]  0.75
##  [3,]  4.55
##  [4,]  8.82
##  [5,]  4.56
##  [6,]  0.84
##  [7,] 23.44
##  [8,]  4.02
##  [9,]  2.33
## [10,] 49.71

Introducing an outside good

The outside good is the “default option”, i.e. the choice to make no choice, or to do without any of the available choices. We typically say that the outside good has \(X_{\text{outside}} = 0\) and \(\xi_{\text{outside}} = 0\), which makes \(\mu_{\text{outside}} = 0\). If we say that the outside good is the first good in the choice set, then the vector of choice probabilities has elements

\[ \text{softmax}(\mu)_{p} = \frac{\exp(X_{p}\beta + \xi_{p})}{\sum_{q=1}^{P}\exp(X_{q}\beta + \xi_{q})} = \frac{\exp(X_{p}\beta + \xi_{p})}{\exp(0) + \sum_{q=2}^{P}\exp(X_{q}\beta + \xi_{q})} = \frac{\exp(X_{p}\beta + \xi_{p})}{1 + \sum_{q=2}^{P}\exp(X_{q}\beta + \xi_{q})} \]

This outside good “pins down” the values of \(\beta\) that could conceivably lead to a given set of choice probabilities. Now, if we add some value to all the fixed utilities except the fixed utility of the outside good, the choice probabilities change. The model is now identified.
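Continuing the simulation from above, we can check this directly: once the outside good’s utility is pinned at zero, adding a constant to the remaining utilities no longer leaves the probabilities unchanged.

# With the outside good normalized to zero, a constant shift of the
# inside goods' utilities changes the implied choice probabilities
probs_pinned  <- softmax(c(0, mu))       # outside good has fixed utility 0
probs_shifted <- softmax(c(0, mu + 10))  # shift only the inside goods
round(cbind(probs_pinned, probs_shifted)*100, 2)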

Summary

The logit model of choice models the latent utility of each choice for every decision-maker as a linear function of the choice characteristics plus a random shock, \(u_{ij} = X_{j}\beta + \xi_{j} + \epsilon_{ij}\). The random component is assumed to be IID according to a Gumbel distribution with fixed scale. If decision-makers make the choice that maximizes their utility, then the probability of each choice can be described using the softmax of the fixed components of utility. The fixed utility of one choice is typically set to 0 in order to identify the model.

In the next three sections we show how to fit this simple choice model using micro (choice-level) data as is typically available from so-called “conjoint” studies in quantitative marketing, as well as using aggregate sales and market shares data which typically arise in economic studies. We will then use the fitted models to examine some counterfactuals, including a price change and a corporate merger.

Microdata model: conjoint studies and observational choice data

The microdata version of the logit model is what we fit to observations of individual choices. In this type of data, we observe what each person chooses as well as what they do not choose; that is, we see their full choice set as well as their choice. This data type is most commonly exploited in quantitative marketing, where such studies are known as “conjoint analysis”. In conjoint studies, participants are repeatedly shown sets of alternative choices and asked to express their preferences, either by choosing their favored option, by selecting their most and least favored options, or by indicating quantities of items (“allocating chips”). Such studies tend to be very high-powered and can help to identify complex substitution patterns. They also have the advantage that the researcher can experimentally vary the choice set and the characteristics of each choice, and so they can overcome some of the endogeneity concerns common in observational studies. Yet they do not actually observe choices; they only measure stated preferences.

Microdata models can also be applied in an observational setting, say, by using scanner or sales data. This form of data has the advantage of measuring revealed rather than stated preferences. Yet it can be harder to observe the complete choice set available to customers. Also, in most observational settings the choice set and choice attributes aren’t varied experimentally, meaning we have to contend with the endogeneity of these attributes. Later in this chapter we discuss approaches to doing so.

The microdata implementation of the Logit choice model is one that could simulate individual choices, which occur with the probabilities described above. The only thing we need is a sampling distribution for the choices. We have this in the categorical distribution, which is a multinomial distribution (a distribution over unordered qualitative outcomes) of size 1. You can think of a categorical draw as the roll of a weighted die.
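As a quick illustration (using the probs vector computed in the simulation above, before the outside good is added below), a categorical draw is just a multinomial draw of size 1:

p <- as.numeric(probs)                    # choice probabilities from above
rmultinom(1, size = 1, prob = p)          # one multinomial trial: a one-hot vector
sample(seq_along(p), size = 1, prob = p)  # equivalent: the index of the chosen product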

Let’s continue to use the simulated data we specified above, though we’ll add \(\mu_{1} = 0\) to signify that the fixed utility of the outside choice is zero. We’ll then simulate the choices of 100 customers who all face precisely the same choices. Once we’ve done that, we’ll derive the likelihood and write the model out in Stan. Finally we’ll see what happens when each customer faces a different set of choices.

First we simulate and plot some choices for 100 customers:

library(tidyverse)
library(ggthemes) # for theme_hc()

# Variable sizes
N_customers <- 100
N_choices <- (P+1) * N_customers
# Probabilities including outside good
probs <- softmax(c(0, mu))
simulated_data <- tibble(choice = rep(0, N_choices),
                         customer_id = rep(1:N_customers, each = P+1),
                         product = rep(1:(P+1), N_customers))
# Draw a choice and indicate this with choice = 1; all unchosen 
# products remain with choice = 0
for(i in 1:N_customers) {
  choice <- sample(1:(P+1), size = 1, prob = probs)
  simulated_data[simulated_data$customer_id == i & simulated_data$product == choice, 1] <- 1
}

# Plot the choices vs the generative probabilities
simulated_data %>%
  group_by(product) %>% 
  summarise(`Proportion of choices` = sum(choice)/N_customers) %>%
  mutate(`Underlying probabilities` = probs) %>%
  ggplot(aes(x = factor(product,ordered = T), y = `Proportion of choices`)) +
  geom_bar(stat = "identity") +
  geom_point(aes(y = `Underlying probabilities`), colour = "red") +
  annotate("text", x = 4, y = 0.15, label = "True preference shares", colour = "red") +
  annotate("text", x = 9, y = 0.35, label = "Market shares", alpha = 0.7) +
  theme_hc() + labs(x = "Product", title = "With 100 customers")

What do we notice here? First, the more popular products (those with higher values of \(\mu_{j}\)) do indeed get chosen more often. But the variation around the underlying probabilities is quite high. How does this change with the number of decision-makers?
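As a quick check (our own addition, reusing probs and P from the simulation above), we can redo the exercise with many more customers and compare the observed shares to the true preference shares; the gaps shrink considerably.

# Simulate 10,000 customers facing the same choice set and compare
# observed shares with the generative probabilities
big_N <- 10000
big_choices <- sample(1:(P+1), size = big_N, replace = TRUE, prob = probs)
observed_shares <- tabulate(big_choices, nbins = P+1)/big_N
round(cbind(true_share = probs, share_10000 = observed_shares), 3)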

The lesson here is that the variation of the observed choice shares around the true preference shares depends strongly on the sample size (or market size, if we’re modeling markets). This shouldn’t surprise you: more observations give more precise estimates. So when you are modeling markets with few sales or few customers, you should use a model that makes use of the number of underlying decisions. Of the three Logit implementations described here, that means the microdata model and the sales model, not the shares model.

Constructing the likelihood of the microdata model

Recall from the likelihood post what a likelihood is in a practical sense: a way of scoring how well some model parameters \(\theta\) make sense of the observed data, under the assumption that the data were generated by our model. If our choice data is a sequence of choices made by decision-makers, the likelihood is simply the product of the probabilities that our model assigns to each of those choices being made (for a given set of parameter values). More formally, let \(y\) be the vector of all customer choices (so it might look like \(y = (\text{Good 12, Good 4, Outside good}, \, \dots)'\)). If we denote the choice probability for individual \(i\)’s choice under the model parameterized by some values \(\theta\) as \(p_{y_i}(\theta)\), then the likelihood is

\[ p(y |\, \theta) = \prod_{i=1}^{N}p_{y_i}(\theta) \] Note that each of these probabilities is between 0 and 1 by construction, so their product will be a very small number and invites numerical underflow. This motivates taking the log likelihood

\[ \log\left(p(y |\, \theta)\right) = \sum_{i=1}^{N}\log\left(p_{y_i}(\theta)\right) \]
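To see the numerical point concretely, here is a tiny base-R illustration with arbitrary numbers of our own: the product of a few hundred probabilities underflows to zero in double precision, while the sum of their logs is perfectly well behaved.

p <- rep(0.1, 400)
prod(p)      # underflows to 0 in double precision
sum(log(p))  # about -921, no problem at all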

To construct the log likelihood, all we need to do is sum the log probabilities (under the model evaluated at a given set of parameters \(\theta\)) of each choice that was made. One straightforward way of doing this is to encode \(y\) as a binary vector which is 1 if a customer made a choice, and 0 otherwise. Note that each decision-maker’s choice set and choice is “stacked” on top of the others. Then we can construct a vector of choice probabilities \(\pi\) for our dataset, both for the choices that were made and for those that were not. The log likelihood is then

\[ \log\left(p(y |\, \theta)\right) = y' \log(\pi) \] Of course, this can only be evaluated where \(\pi \neq 0\). To make this more concrete, it might help to look at the data constructed this way:

| Individual | Possible choice | y | pi_theta | price | Size = Large |
|---|---|---|---|---|---|
| 1 | Outside option | 1 | \(\pi_{\text{Outside option}}(\theta)\) | 0 | 0 |
| 1 | Choice 1 | 0 | \(\pi_{\text{Choice 1}}(\theta)\) | 2.50 | 0 |
| 1 | Choice 2 | 0 | \(\pi_{\text{Choice 2}}(\theta)\) | 3.99 | 1 |
| 2 | Outside option | 0 | \(\pi_{\text{Outside option}}(\theta)\) | 0 | 0 |
| 2 | Choice 1 | 0 | \(\pi_{\text{Choice 1}}(\theta)\) | 2.50 | 0 |
| 2 | Choice 2 | 1 | \(\pi_{\text{Choice 2}}(\theta)\) | 3.99 | 1 |
| 3 | Outside option | 0 | \(\pi_{\text{Outside option}}(\theta)\) | 0 | 0 |
| 3 | Choice 1 | 0 | \(\pi_{\text{Choice 1}}(\theta)\) | 2.50 | 0 |
| 3 | Choice 2 | 1 | \(\pi_{\text{Choice 2}}(\theta)\) | 3.99 | 1 |

The log likelihood of this model is the sum over rows of column y multiplied by the log of column pi_theta. Or, in linear algebra,

\[ \log\left(p(y |\, \theta)\right) = y' \log(\pi) = 1 \times \log(\pi_{\text{Outside option}}(\theta)) + 1 \times \log(\pi_{\text{Choice 2}}(\theta)) + 1 \times \log(\pi_{\text{Choice 2}}(\theta)) \]
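The same calculation is easy to write out in R. Here is a toy version with made-up probabilities (any values forming a simplex within each choice set would do):

# Toy version of y' log(pi): three individuals, each facing the same
# three-option choice set, with hypothetical choice probabilities
y_toy  <- c(1, 0, 0,  0, 0, 1,  0, 0, 1)
pi_toy <- rep(c(0.2, 0.3, 0.5), times = 3)
sum(y_toy * log(pi_toy))  # equivalently: t(y_toy) %*% log(pi_toy)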

In Stan, the log likelihood is added to the log posterior density accumulator (the target) like so:

target += y' * log(choice_probability);

All we need to do is show Stan how to construct the choice_probability vector.

Calculating the choice probabilities

We just saw how to construct the log likelihood from a vector of choices and a set of choice probabilities. And we know that the choice probabilities are just the softmax of the fixed component of utility. If all decision-makers faced precisely the same choice set, then we could construct a common choice probability vector

\[ \pi_{i} = \pi = \text{softmax}(X\beta + \xi) \]

in Stan, like

model {
  // the common probability vector, repeated once per decision-maker
  vector[N] choice_probability = to_vector(rep_matrix(softmax(X * beta + xi), I));
  
  // priors on beta and xi here
  //... 
  
  // evaluate log likelihood
  target += y' * log(choice_probability);
}

Here N is the number of observations in y and I is the number of decision-makers. \(X\) has as many rows as there are choices, including the outside option, as does xi. The first entry of xi and the first row of X are all 0s, so that X[1]*beta + xi[1] is equal to zero, the fixed utility of the outside option. Using the fake data illustrated above, \(X\) would be

| Intercept column | price | Size = Large |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2.50 | 0 |
| 1 | 3.99 | 1 |

Note that we include an intercept column here. In cases where the choice characteristics are collinear with the intercept (such as a full set of brand dummies), you should not use an intercept column. Also note that the intercept entry for the outside good is set to 0, for the identifiability reasons discussed above.

The only tricky part is dealing with the possibility that each individual might face a different choice set. We accomplish this using indexes that indicate where in our data each choice set starts and finishes, and which choice is being evaluated in each row, and then constructing the choice probabilities for each individual in a loop across choice sets.

To illustrate, say that instead our choice data is

| Individual | Possible choice | y | price | Size = Large |
|---|---|---|---|---|
| 1 | Outside option | 1 | 0 | 0 |
| 1 | Choice 1 | 0 | 2.50 | 0 |
| 1 | Choice 2 | 0 | 3.99 | 1 |
| 2 | Outside option | 0 | 0 | 0 |
| 2 | Choice 1 | 0 | 2.50 | 0 |
| 2 | Choice 2 | 1 | 3.99 | 1 |
| 3 | Outside option | 0 | 0 | 0 |
| 3 | Choice 2 | 0 | 3.99 | 1 |
| 3 | Choice 3 | 1 | 1.50 | 1 |
| 3 | Choice 4 | 0 | 1.99 | 0 |

Now each decision-maker faces different choices, or even different numbers of choices. We can no longer describe an \(X\) matrix that applies to all individuals, so instead the \(X\) matrix contains the attributes for all individuals, stacked on top of each other, like so

| Intercept column | price | Size = Large |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2.50 | 0 |
| 1 | 3.99 | 1 |
| 0 | 0 | 0 |
| 1 | 2.50 | 0 |
| 1 | 3.99 | 1 |
| 0 | 0 | 0 |
| 1 | 3.99 | 1 |
| 1 | 1.50 | 1 |
| 1 | 1.99 | 0 |

Recall that \(\mu = X\beta + \xi\). The \(X\beta\) term poses no problem: no matter how many rows \(X\) has, \(X\beta\) has one value per row, and repeated rows for the same product give the same value. We’d ideally want to construct mu in Stan like

vector[N] mu = X * beta + xi;

yet xi has only as many elements as there are distinct choices, while X*beta has as many rows as the stacked data. The trick is to use an indexing vector choice_index which tells us which possible choice is being considered in each row. Then we can define

vector[N] mu = X * beta + xi[choice_index];
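If the indexing is unclear, here is the same operation in R with toy numbers of our own, matching the ten-row dataset above (outside option indexed 1, Choices 1 to 4 indexed 2 to 5):

# One shock per distinct product, expanded to one value per row of the
# stacked data by indexing with the product each row refers to
xi_toy           <- c(0, 0.3, -0.1, 0.5, 0.2)        # xi_toy[1] = 0 is the outside good
choice_index_toy <- c(1, 2, 3, 1, 2, 3, 1, 3, 4, 5)  # product index for each row
xi_toy[choice_index_toy]                             # length matches the stacked X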

This gives us the fixed component of utility for each row, but we still need to convert it into probabilities. Note that we must not simply take the softmax of the whole of mu; we need to take the softmax of mu within each choice set. If each individual has only one choice set, then we can loop across individuals with indexing vectors that tell us on which row each choice set starts and ends. Let’s call these indexes start and end. Our model block is therefore

model {
  vector[N] mu = X * beta + xi[choice_index];
  vector[N] choice_probability;
  
  // fill in the choice probabilities for each choice set
  for(i in 1:I) {
    choice_probability[start[i]:end[i]] = softmax(mu[start[i]:end[i]]);
  }
  
  // priors go here
  // ...
  
  // Evaluate the log likelihood
  target += y' * log(choice_probability);
}

So how best to build up the indexes start, end and choice_index? There are many ways, but we find that this is easy:

library(tidyverse)
# Note that we name "Outside option" "1. Outside option"
my_data <- tibble(Individual = c(rep(1:3, each = 3), 3), 
           `Possible choice` = c(rep(c("1. Outside option", "Choice 1", "Choice 2"), 2), 
                                 c("1. Outside option", "Choice 2", "Choice 3", "Choice 4")),
           y = c(1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
           price = c(rep(c(0, 2.50, 3.99), 2), 0, 3.99, 1.50, 1.99),
           `Size = Large` = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0))

# Ordered will order the factor alphabetically, and we want the outside option to have index = 1
choice_index <- as.numeric(ordered(my_data$`Possible choice`))

# Now construct start and end
tmp <- my_data %>% 
  mutate(row = 1:n()) %>% 
  group_by(Individual) %>%
  summarise(start = first(row),
            end = last(row))
  
start <- tmp$start
end <- tmp$end
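With these indexes in hand, a plausible way to bundle everything for Stan might look like the sketch below. The list-element names are our own assumptions and need to match whatever the Stan program’s data block declares; the attribute matrix is built with an intercept column and the outside-option rows zeroed out, as discussed above.

# Build the attribute matrix and collect the pieces the Stan model needs
X <- model.matrix(~ price + `Size = Large`, data = my_data)
X[my_data$`Possible choice` == "1. Outside option", ] <- 0

stan_data <- list(N = nrow(my_data),     # rows in the stacked data
                  I = length(start),     # number of decision-makers
                  J = max(choice_index), # number of distinct choices (incl. outside good)
                  K = ncol(X),           # number of choice attributes
                  X = X,
                  y = my_data$y,
                  choice_index = choice_index,
                  start = start,
                  end = end)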