The field of discrete choice modeling is concerned with modeling how people make choices between alternatives, typically as a function of the characteristics of the alternatives and of the decision-maker. In doing so, we want our models to allow for the various types of behavior we observe empirically.
A very simple model of a decision being made by people indexed by \(i\) across choice-alternatives indexed by \(j\) combines a score that each individual assigns to each possible choice with a decision rule that maps the scores to the decision that will be made. It is common in discrete choice to call the score utility \(u_{ij}\), and to use the decision rule “make the choice that provides the highest utility to the decision-maker”. Our job is to come up with functions for this score that describe what we see in the data and square with what we know about human behavior.
If each individual \(i\) gives utility \(u_{ij}\) to good \(j\) and this value is fixed, then they will make the same choice whenever presented with the same options. This is not what we observe; rather people will tend to choose the things they like, but mix it up a bit. And so we typically divide utility into two additively separable components: a fixed part \(\mu_{ij}\) and random part \(\epsilon_{ij}\).
\[ u_{ij} = \mu_{ij} + \epsilon_{ij} \]
The fixed part of utility \(\mu_{ij}\) is the same each time person \(i\) is presented with choice \(j\); the random part \(\epsilon_{ij}\) need not be. If we combine this simple model with the decision rule “choose the good \(j\) that provides you with the highest utility”, then we have a model that describes the sort of behavior we observe: people tend to make choices they value highly (i.e. choices with a high value of \(\mu_{ij}\)), but sometimes make different choices. To make the model tractable, and to make statements about the probability of \(i\) making choice \(j\), we need to propose a distribution for \(\epsilon_{ij}\). If there were no limit on computing power, we could propose any distribution for this random component. In practice we use one of two distributions: the normal (Gaussian) distribution, which gives rise to Probit models of choice, or the Gumbel distribution, which gives rise to the Logit models of choice we cover in this chapter. Of the two, Logit models are less computationally expensive to estimate, but make slightly stronger assumptions.
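To make this concrete, here is a minimal R sketch of the setup, assuming illustrative fixed utilities of 0 and 1 and standard normal random components (these specific numbers are assumptions for illustration only):

```r
set.seed(1)  # for reproducibility

n_occasions <- 1e4   # times the decision-maker faces the same pair of options
mu <- c(0, 1)        # illustrative fixed utilities for choices 1 and 2

# On each occasion, draw fresh random utility components and apply the
# decision rule "choose the alternative with the highest utility"
choices <- replicate(n_occasions, which.max(mu + rnorm(2, 0, 1)))

# Choice 2 (the higher fixed utility) is picked more often, but not always
table(choices) / n_occasions
```

The same decision-maker facing identical options makes different choices across occasions, which is exactly the empirical pattern we set out to capture.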
Let’s illustrate with an example. Say consumer 1 is evaluating two choices. Choice 1 has \(\mu_{11} = 1\) and choice 2 has \(\mu_{12} = 3\). The random component \(\epsilon_{ij}\) is normally distributed with a mean of 0 and standard deviation of 1. We illustrate the distributions by drawing random values for \(\epsilon_{ij}\). The marginal distributions of these draws are illustrated on the axes; the orange region illustrates the draws where \(u_{11} > u_{12}\).
If the utilities for the two choices are distributed as above, what is the probability that the decision-maker will make choice 2? This is the probability \(p(u_{12} > u_{11}) = p(u_{12} -u_{11} > 0)\). What is the distribution of \(u_{12} - u_{11}\)? Once we know that, we can ask what proportion of it falls to the right of 0, and that is the probability of the choice being made.
Given we made the assumption that \(\epsilon_{ij} \sim \text{Normal}(0,1)\), we can use a helpful mathematical result about the difference of normally distributed random variables. If \(x \sim \mbox{Normal}(\mu_{1}, \sigma_{1})\) and \(y \sim \mbox{Normal}(\mu_{2}, \sigma_{2})\) are independent, then \(x - y \sim \mbox{Normal}\left(\mu_{1} - \mu_{2}, \sqrt{\sigma_{1}^{2} + \sigma_{2}^{2}}\right)\). We now use the standard normal CDF \(\Phi\) to convert this into the probability that the difference is greater than 0. Plugging in our values for the parameters, this gives us
\[ \Phi\left(\frac{\mu_{12} - \mu_{11}}{\sqrt{\sigma_{1}^{2} + \sigma_{2}^{2}}}\right) = \Phi\left(\frac{3 - 1}{\sqrt{2}}\right) = .92 \] which we could evaluate in R with:
round(pnorm(2, 0, sqrt(2)), 2)
## [1] 0.92
We could also get the same result by simulation. We’d do this like so:
N_sims <- 1e4
u_11 <- rnorm(N_sims, 1, 1)
u_12 <- rnorm(N_sims, 3, 1)
# What proportion of utilities for good 2 are greater than good 1?
round(mean(u_12 > u_11), 2)
## [1] 0.92
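For comparison, the same simulation can be run with Gumbel rather than normal errors, which is the assumption behind the Logit models this chapter covers. A sketch, using the same illustrative means and a hand-rolled `rgumbel` helper (standard Gumbel draws via the inverse-CDF method, \(-\log(-\log(U))\) for \(U\) uniform on \((0,1)\)):

```r
set.seed(42)
N_sims <- 1e4

# Standard Gumbel draws via the inverse-CDF method: -log(-log(U))
rgumbel <- function(n) -log(-log(runif(n)))

u_11 <- 1 + rgumbel(N_sims)
u_12 <- 3 + rgumbel(N_sims)

# Simulated choice probability
mean(u_12 > u_11)

# Closed-form check: the difference of two independent Gumbels is logistic,
# so P(u_12 > u_11) is the inverse logit of mu_12 - mu_11
round(plogis(3 - 1), 2)
## [1] 0.88
```

Note the probability differs from the 0.92 under normal errors: the two distributional assumptions are not interchangeable, even with the same mean utilities.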
Let’s now try to build some intuition about the properties of this approach to modeling choice.
If we add some constant to all choices’ fixed utilities \(\mu_{ij}\), the choice probability is unchanged
Given that our choice rule is “make the choice that has the highest utility”, the actual values of the utilities do not matter, only their values relative to each other. If we add the same constant to the utility of every choice, the choice probabilities are unchanged. Let’s add 50 to both \(\mu_{11}\) and \(\mu_{12}\):
round(pnorm(53 - 51, 0, sqrt(2)), 2)
## [1] 0.92
Why is this important? When it comes to estimating the mean utilities of each choice, the levels of the mean utilities are unidentified; only their differences matter. We typically overcome this by pegging down the utility of a single choice (often called the outside choice). More on this soon.
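As a quick check of that normalization, here is a sketch using the means from the example above: subtracting \(\mu_{11} = 1\) from both alternatives, i.e. pegging choice 1’s utility at zero, leaves the choice probability untouched.

```r
# Original means mu = (1, 3)
round(pnorm(3 - 1, 0, sqrt(2)), 2)
## [1] 0.92

# Same means shifted so choice 1 is pegged at zero: mu = (0, 2)
round(pnorm(2 - 0, 0, sqrt(2)), 2)
## [1] 0.92
```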
For fixed \(\mu_{ij}\), choice probabilities can be influenced by the variance of \(\epsilon_{ij}\)
So far we’ve kept the scale of the random components of utility fixed. What happens if we relax this assumption? It turns out that for given values of the mean utilities \(\mu\), we can generate a wide range of choice probabilities by changing the scale of the random components. For example, if the scales are \(\sigma_{11} = \sigma_{12} = 3\) rather than \(1\), we get a choice probability of
round(pnorm(2, 0, sqrt(3^2 + 3^2)), 2)
## [1] 0.68
If the scale is very large, the two choices are made with almost equal probability
round(pnorm(2, 0, sqrt(100^2 + 100^2)), 2)
## [1] 0.51
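Sweeping over a few scales makes the pattern explicit: as the common scale \(\sigma\) grows, the choice probability decays toward a coin flip.

```r
scales <- c(1, 3, 10, 100)

# For each common scale sigma, P(choice 2) = Phi(2 / sqrt(2 * sigma^2))
probs <- pnorm(2, 0, sqrt(2 * scales^2))
round(probs, 2)
## [1] 0.92 0.68 0.56 0.51
```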
We have also assumed that each decision-maker has the same variance in the random part of their utility. This turns out to be quite a strong assumption; we’ll explore it further when describing the shortcomings of the logit model.
For fixed (finite) variance of \(\epsilon_{ij}\), different values of \(\mu_{ij}\) can imply many choice probabilities
Similarly, now that we have inflated the scales, we can recover the original choice probability by varying the mean utilities. If we set \(\sigma_{11} = \sigma_{12} = 100\), \(\mu_{11} = 100\) and \(\mu_{12} = 300\), then the choice probability is
round(pnorm(300-100, 0, sqrt(100^2 + 100^2)), 2)
## [1] 0.92
Why are these fairly mechanical relationships important? The intuition is that a given choice probability can be generated by an infinitely large set of combinations of \(\mu\) and \(\sigma\), so we cannot identify both the mean utilities and the scale of the random components from choice data alone. We have to fix one of them. In practice, we fix the scale of the random components.
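A quick sketch of this non-identification: scaling the mean-utility difference and the error scales by the same factor \(k\) leaves the choice probability unchanged, because only their ratio is identified.

```r
# Scale the mean-utility difference (2) and both error scales (1) by k:
# the probability depends only on the ratio, so it never changes
ks <- c(1, 10, 100)
probs_k <- pnorm(ks * 2, 0, sqrt(2 * ks^2))
round(probs_k, 2)
## [1] 0.92 0.92 0.92
```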
Summary