Individual \(i\in 1:I\) compares products \(j\in 1:J\) in choice task \(t\in 1:T\). Each product generates utility according to a linear combination of choice-specific attributes \(X_{jt}\) and decision-maker-specific attributes \(X2_{i}\). The conditional logit parameters \(\beta_{i}\) are fixed within an individual but may vary across individuals (a set-up known as “random coefficients logit” in economics, “hierarchical Bayes” in marketing, and “mixed logit” in discrete choice). A matrix of choice-specific parameters \(\Gamma\), with rows \(\Gamma_{j}\), converts individual attributes into utilities for each choice. Each decision-maker in each choice task also experiences an iid Gumbel-distributed utility shock \(\epsilon_{ijt}\).
In each decision task there is an outside good, good \(J\), whose product attributes are all equal to 0; the corresponding row of \(\Gamma\) is also fixed at 0. This restriction identifies the model.
\[ u_{ijt} = X_{jt}\beta_{i} + \Gamma_{j} X2_{i} + \epsilon_{ijt} \] We assume that each individual’s \(\beta_{i}\)s, known in marketing as “individual part-worths”, are multivariate normally distributed around a hypermean \(\beta\), with covariance matrix \(\Sigma\). Standard logit gives all individuals the same \(\beta\).
\[ \beta_{i} \sim \text{Multi normal}(\beta, \Sigma) \]
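In the Stan code below we sample this distribution using a non-centered parameterization (the “reparameterization discussed on slide 30” that the code comments refer to), decomposing \(\Sigma\) into a vector of scales \(\tau\) and the Cholesky factor \(L_{\Omega}\) of a correlation matrix:
\[ \beta_{i} = \beta + \text{diag}(\tau)\, L_{\Omega}\, z_{i}, \qquad z_{i} \sim \text{Normal}(0, I_{P}), \qquad \Sigma = \text{diag}(\tau)\, L_{\Omega} L_{\Omega}'\, \text{diag}(\tau) \]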
Note here that each row of \(\Gamma\) corresponds to a product, so it is a little odd to have fixed products whose attributes vary. You should only use this specification when you are sure the set of choices is the same in every task but their attributes vary. For example, each choice might be a packet of chips, with the \(X\) variables being price and packet size.
We’ll fix this problem in a bit.
The decision-maker simply chooses whichever option in each task provides the highest utility. Because of the Gumbel error distribution, the probability of each choice is
\[ p(u_{ijt} = \max(u_{it})) = \frac{\exp(X_{jt}\beta_{i} + \Gamma_{j} X2_{i})}{\sum_{j=1}^{J} \exp(X_{jt}\beta_{i} + \Gamma_{j} X2_{i})} \]
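If you want to convince yourself of that Gumbel-to-softmax result, here is a quick simulation check (my addition, with made-up utilities): draw standard Gumbel shocks via the inverse CDF \(-\log(-\log(U))\), record which option attains the maximum utility, and compare the win frequencies with the softmax probabilities.
# Not part of the model code: a quick check that Gumbel-max gives softmax probabilities
v <- c(1, 0.5, -0.2) # made-up deterministic utilities
n_sims <- 1e5
# standard Gumbel draws via the inverse CDF
shocks <- matrix(-log(-log(runif(n_sims * length(v)))), n_sims, length(v))
picks <- max.col(sweep(shocks, 2, v, "+")) # which option wins in each draw
rbind(simulated = tabulate(picks, length(v)) / n_sims,
      softmax   = exp(v) / sum(exp(v)))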
Let’s make some fake data:
library(dplyr) # data_frame(), group_by(), summarise(); MASS is called explicitly below

# We'll do 10 tasks per individual for 50 individuals comparing 5 choices
I <- 50
T <- I*10 # total number of tasks (note: T now masks TRUE's shorthand, so spell out TRUE later)
K <- 5
P <- 2 # We'll vary two product attributes for each choice at each task
P2 <- 4 # number of individual-level attributes
N <- T*K

X <- matrix(rnorm(N*P), N, P)
# Make sure that every Kth row is zero (the outside option has no attributes)
X[(1:nrow(X)) %% K == 0, ] <- 0
X2 <- matrix(rnorm(I*P2), I, P2)

beta <- rnorm(P)
# Random correlation matrix to construct a covariance matrix of the individual betas
tau <- rgamma(P, 4, 2)
Sigma <- diag(tau) %*% cor(matrix(rnorm(P*(P+1)), P+1, P)) %*% diag(tau)
beta_individual <- MASS::mvrnorm(I, beta, Sigma)

gamma <- matrix(rnorm(P2*K), K, P2)
gamma[K,] <- 0 # the outside good gets no utility from individual attributes

# A lookup table: for each task, who made the choice and which rows of X it spans
indexes <- data_frame(individual = rep(1:I, each = K*10),
                      task = rep(1:T, each = K),
                      row = 1:(T*K)) %>%
  group_by(task) %>%
  summarise(task_individual = first(individual),
            start = first(row),
            end = last(row))

softmax <- function(x) exp(x)/sum(exp(x))

choice <- rep(0, N)
for(t in 1:T) {
  # Utility from product attributes, using this individual's part-worths
  utility <- X[indexes$start[t]:indexes$end[t],] %*% beta_individual[indexes$task_individual[t],]
  # Plus utility from the individual's attributes for each choice
  utility <- utility + gamma %*% X2[indexes$task_individual[t],]
  # Draw the choice
  choice[indexes$start[t]:indexes$end[t]] <- as.numeric(rmultinom(1, 1, softmax(as.numeric(utility))))
}
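A few optional sanity checks on the fake data never hurt: each task should have exactly one chosen option, and the outside good's rows of X should all be zero.
# Optional sanity checks on the simulated data
sum(choice) == T # exactly one choice per task
all(X[(1:N) %% K == 0, ] == 0) # the outside good has no attributes
mean(choice[(1:N) %% K == 0]) # share of tasks won by the outside good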
Here is some Stan code that will estimate the model:
// saved as mixed_conditional_plus_mnp_logit.stan
data {
  int N; // number of rows
  int T; // number of individual-choice sets/task combinations
  int I; // number of individuals
  int P; // number of covariates that vary by choice
  int P2; // number of covariates that vary by individual
  int K; // number of choices
  vector<lower = 0, upper = 1>[N] choice; // binary indicator for choice
  matrix[N, P] X; // choice attributes
  matrix[I, P2] X2; // individual attributes
  int task[T]; // index for tasks
  int task_individual[T]; // index for individual
  int start[T]; // the starting observation for each task
  int end[T]; // the ending observation for each task
}
parameters {
  vector[P] beta; // hypermeans of the part-worths
  matrix[K-1, P2] Gamma_raw; // coefficient matrix on individual attributes
  vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
  matrix[I, P] z; // individual random effects (unscaled)
  cholesky_factor_corr[P] L_Omega; // the Cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
  // here we use the reparameterization discussed on slide 30
  // (note the transpose on the scale matrix, so that each row of
  // beta_individual has covariance diag(tau) * Omega * diag(tau))
  matrix[I, P] beta_individual = rep_matrix(beta', I) + z * diag_pre_multiply(tau, L_Omega)';
  matrix[K, P2] Gamma; // we set the final row to zero for identification
  Gamma[1:(K-1),:] = Gamma_raw;
  Gamma[K,:] = rep_row_vector(0.0, P2);
}
model {
  // create a temporary holding vector
  vector[N] log_prob;
  // priors on the parameters
  tau ~ normal(0, .5);
  beta ~ normal(0, .5);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(Gamma_raw) ~ normal(0, 1);
  // log probabilities of each choice in the dataset
  for(t in 1:T) {
    vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
    // add utility from product attributes with individual part-worths/marginal utilities
    utilities = X[start[t]:end[t]] * beta_individual[task_individual[t]]';
    // add the utility from the individual attributes for each choice
    utilities += Gamma * X2[task_individual[t]]';
    log_prob[start[t]:end[t]] = log(softmax(utilities));
  }
  // use the likelihood derivation on slide 29
  target += log_prob' * choice;
}
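That last line is worth unpacking: because choice contains a single 1 per task and 0s elsewhere, the dot product log_prob' * choice picks out the log probability of each chosen option and sums them, which is exactly the multinomial log likelihood. A tiny R illustration with made-up numbers:
# Why target += log_prob' * choice works (illustration only)
log_prob_task <- log(c(0.2, 0.5, 0.3)) # log choice probabilities for one task
choice_task <- c(0, 1, 0)              # option 2 was chosen
sum(log_prob_task * choice_task)       # equals log(0.5), the chosen option's log probability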
# Fit the model in Stan ---------------------------------------------------
# So you'll want to have data that look like so:

# choice-level data
bind_cols(choice = choice, as_data_frame(X)) %>%
  head(10)
## # A tibble: 10 x 3
## choice V1 V2
## <dbl> <dbl> <dbl>
## 1 0. 0.471 0.535
## 2 1. 1.10 -1.02
## 3 0. -0.0746 -0.0411
## 4 0. -1.10 1.05
## 5 0. 0. 0.
## 6 0. -2.55 -1.02
## 7 0. 0.189 -0.933
## 8 0. -0.780 -0.615
## 9 1. 1.17 -1.68
## 10 0. 0. 0.
# task-level data
indexes %>%
  head(10)
## # A tibble: 10 x 4
## task task_individual start end
## <int> <int> <int> <int>
## 1 1 1 1 5
## 2 2 1 6 10
## 3 3 1 11 15
## 4 4 1 16 20
## 5 5 1 21 25
## 6 6 1 26 30
## 7 7 1 31 35
## 8 8 1 36 40
## 9 9 1 41 45
## 10 10 1 46 50
# The start and end refer to the rows in the choice-level data that correspond to the choice task
# individual-level data
bind_cols(individual = 1:I, as_data_frame(X2)) %>%
  head(10)
## # A tibble: 10 x 5
## individual V1 V2 V3 V4
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.254 0.804 -0.281 -0.785
## 2 2 0.701 -0.0505 -0.276 0.434
## 3 3 -0.662 0.0145 -0.00667 -0.152
## 4 4 -1.45 2.12 0.631 -0.496
## 5 5 -0.0412 1.64 0.221 -0.482
## 6 6 0.0490 -1.18 -0.466 1.84
## 7 7 2.22 1.34 -0.515 0.851
## 8 8 -0.912 0.736 -1.37 0.421
## 9 9 0.466 -0.515 -0.476 1.37
## 10 10 1.30 -0.964 2.56 0.432
compiled_model <- stan_model("mixed_conditional_plus_mnp_logit.stan")
data_list <- list(N = N,
I = I,
P = P,
P2 = P2,
K = K,
T = T,
X = X,
X2 = X2,
choice = choice,
start = indexes$start,
end = indexes$end,
task_individual = indexes$task_individual,
task = indexes$task)
model_fit <- sampling(compiled_model, data = data_list, cores = 4, iter = 1000)
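Before reading off estimates, it is worth confirming the sampler behaved. For example (assuming a reasonably recent rstan, which provides check_hmc_diagnostics):
# Quick sampler health checks (assumes a recent rstan)
check_hmc_diagnostics(model_fit) # divergences, treedepth, E-BFMI
max(summary(model_fit)$summary[, "Rhat"], na.rm = TRUE) # worst-case Rhat (NaNs come from the fixed Gamma rows)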
# Check model fit
print(model_fit, pars = "beta")
## Inference for Stan model: mixed_conditional_plus_mnp_logit.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] 0.32 0.01 0.31 -0.28 0.12 0.32 0.53 0.93 861 1.00
## beta[2] -1.59 0.01 0.32 -2.24 -1.80 -1.59 -1.39 -0.98 604 1.01
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:24:58 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(model_fit, pars = "Gamma")
## Inference for Stan model: mixed_conditional_plus_mnp_logit.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## Gamma[1,1] 0.47 0.00 0.20 0.08 0.33 0.46 0.61 0.86 2000 1
## Gamma[1,2] 0.55 0.00 0.21 0.13 0.42 0.55 0.69 0.97 1841 1
## Gamma[1,3] 0.17 0.00 0.20 -0.23 0.02 0.17 0.30 0.56 1848 1
## Gamma[1,4] -0.09 0.01 0.22 -0.51 -0.23 -0.09 0.06 0.32 1712 1
## Gamma[2,1] -0.48 0.01 0.25 -0.98 -0.64 -0.48 -0.33 -0.02 2000 1
## Gamma[2,2] 1.20 0.01 0.25 0.73 1.02 1.20 1.35 1.69 2000 1
## Gamma[2,3] -1.13 0.01 0.26 -1.68 -1.30 -1.13 -0.95 -0.64 2000 1
## Gamma[2,4] -0.63 0.01 0.23 -1.09 -0.78 -0.63 -0.48 -0.18 1584 1
## Gamma[3,1] -0.62 0.01 0.23 -1.06 -0.78 -0.62 -0.46 -0.18 2000 1
## Gamma[3,2] 0.65 0.01 0.23 0.21 0.49 0.65 0.80 1.14 1666 1
## Gamma[3,3] 0.68 0.01 0.21 0.28 0.55 0.68 0.82 1.11 1710 1
## Gamma[3,4] -0.33 0.01 0.22 -0.75 -0.48 -0.33 -0.18 0.08 1555 1
## Gamma[4,1] -0.63 0.01 0.24 -1.11 -0.79 -0.62 -0.47 -0.15 2000 1
## Gamma[4,2] -0.35 0.01 0.24 -0.81 -0.51 -0.35 -0.19 0.10 2000 1
## Gamma[4,3] -1.69 0.01 0.28 -2.25 -1.88 -1.68 -1.50 -1.15 2000 1
## Gamma[4,4] -0.70 0.01 0.23 -1.14 -0.86 -0.71 -0.55 -0.26 1753 1
## Gamma[5,1] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2000 NaN
## Gamma[5,2] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2000 NaN
## Gamma[5,3] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2000 NaN
## Gamma[5,4] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2000 NaN
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:24:58 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Let’s plot the true values (the “known unknowns”) against the estimates:
gamma_means <- get_posterior_mean(model_fit, pars = "Gamma")[,5]
# use byrow = TRUE, not byrow = T: T is the number of tasks here, not a logical
gamma_means <- matrix(gamma_means, K, P2, byrow = TRUE)
plot(gamma, gamma_means, xlab = "True gamma", ylab = "Posterior mean")
abline(0, 1)

beta_individual_means <- get_posterior_mean(model_fit, pars = "beta_individual")[,5]
beta_individual_means <- matrix(beta_individual_means, I, P, byrow = TRUE)
plot(beta_individual, beta_individual_means, xlab = "True part-worths", ylab = "Posterior mean")
abline(0, 1)
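If you would rather have a number than a picture, a quick (if rough) summary of parameter recovery is the correlation between the true values and the posterior means:
# Rough numerical summary of recovery
cor(as.numeric(gamma), as.numeric(gamma_means))
cor(as.numeric(beta_individual), as.numeric(beta_individual_means))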
The model above assumes that each position in the choice set corresponds to the same underlying product in every task. Sometimes we want to give decision-makers sets of choices that differ only in their attributes, so that a given product need not occupy the same slot in the line-up in every task. That doesn’t work with the specification above, where each row of \(\Gamma\) corresponds to a particular choice.
In this model we still think that customer attributes matter, but now only through their influence on individuals’ part-worths. That is, we have
\[ u_{ijt} = X_{jt}\beta_{i} + \epsilon_{ijt} \] and
\[ \beta_{i} \sim \text{Multi normal}(\beta + X2_{i}\Gamma' , \Sigma) \]
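As before, the Stan code implements this with a non-centered parameterization; the only change is the shifted hypermean:
\[ \beta_{i} = \beta + \Gamma X2_{i}' + \text{diag}(\tau)\, L_{\Omega}\, z_{i}, \qquad z_{i} \sim \text{Normal}(0, I_{P}) \]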
# We'll do 10 tasks per individual for 50 individuals comparing 5 choices
I <- 50
T <- I*10
K <- 5
P <- 2 # We'll vary two product attributes for each choice at each task
P2 <- 4
N <- T*K

X <- matrix(rnorm(N*P), N, P)
# Make sure that every Kth row is zero (the outside option has no attributes)
X[(1:nrow(X)) %% K == 0, ] <- 0
X2 <- matrix(rnorm(I*P2), I, P2)

beta <- rnorm(P)
# gamma is now P x P2: it maps individual attributes to part-worths
gamma <- matrix(rnorm(P*P2), P, P2)
# Random correlation matrix to construct a covariance matrix of the individual betas
tau <- rgamma(P, 4, 2)
Sigma <- diag(tau) %*% cor(matrix(rnorm(P*(P+1)), P+1, P)) %*% diag(tau)
# Individual part-worths are now centred on beta + X2 %*% t(gamma)
beta_individual <- MASS::mvrnorm(I, beta, Sigma) + X2 %*% t(gamma)

indexes <- data_frame(individual = rep(1:I, each = K*10),
                      task = rep(1:T, each = K),
                      row = 1:(T*K)) %>%
  group_by(task) %>%
  summarise(task_individual = first(individual),
            start = first(row),
            end = last(row))

softmax <- function(x) exp(x)/sum(exp(x))

choice <- rep(0, N)
for(t in 1:T) {
  # Utility now comes only from product attributes and individual part-worths
  utility <- X[indexes$start[t]:indexes$end[t],] %*% beta_individual[indexes$task_individual[t],]
  choice[indexes$start[t]:indexes$end[t]] <- as.numeric(rmultinom(1, 1, softmax(as.numeric(utility))))
}
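Because the simulated part-worths are now a linear function of X2 plus noise, a quick check before going to Stan is to regress them on X2: the intercepts should be close to beta and the slopes close to the rows of gamma.
# Optional check: OLS on the simulated part-worths should roughly recover
# beta (intercepts) and gamma (slopes)
round(t(coef(lm(beta_individual ~ X2))), 2)
cbind(beta, gamma) # true values for comparison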
The Stan code for the problem is here:
// saved as mixed_conditional_individual_effects.stan
data {
  int N; // number of rows
  int T; // number of individual-choice sets/task combinations
  int I; // number of individuals
  int P; // number of covariates that vary by choice
  int P2; // number of covariates that vary by individual
  int K; // number of choices
  vector<lower = 0, upper = 1>[N] choice; // binary indicator for choice
  matrix[N, P] X; // choice attributes
  matrix[I, P2] X2; // individual attributes
  int task[T]; // index for tasks
  int task_individual[T]; // index for individual
  int start[T]; // the starting observation for each task
  int end[T]; // the ending observation for each task
}
parameters {
  vector[P] beta; // hypermeans of the part-worths
  matrix[P, P2] Gamma; // coefficient matrix on individual attributes
  vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
  matrix[I, P] z; // individual random effects (unscaled)
  cholesky_factor_corr[P] L_Omega; // the Cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
  // here we use the reparameterization discussed on slide 30, with the
  // hypermean shifted by X2 * Gamma' (and the same transpose as before)
  matrix[I, P] beta_individual = rep_matrix(beta', I) + X2 * Gamma' + z * diag_pre_multiply(tau, L_Omega)';
}
model {
  // create a temporary holding vector
  vector[N] log_prob;
  // priors on the parameters
  tau ~ normal(0, .5);
  beta ~ normal(0, .5);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(Gamma) ~ normal(0, 1);
  // log probabilities of each choice in the dataset
  for(t in 1:T) {
    vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
    // add utility from product attributes with individual part-worths/marginal utilities
    utilities = X[start[t]:end[t]] * beta_individual[task_individual[t]]';
    log_prob[start[t]:end[t]] = log(softmax(utilities));
  }
  // use the likelihood derivation on slide 29
  target += log_prob' * choice;
}
compiled_model <- stan_model("mixed_conditional_individual_effects.stan")
data_list <- list(N = N,
I = I,
P = P,
P2 = P2,
K = K,
T = T,
X = X,
X2 = X2,
choice = choice,
start = indexes$start,
end = indexes$end,
task_individual = indexes$task_individual,
task = indexes$task)
model_fit <- sampling(compiled_model, data = data_list, cores = 4, iter = 1000)
And plot estimates and actuals
# Check model fit
print(model_fit, pars = "beta")
## Inference for Stan model: mixed_conditional_individual_effects.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] 0.51 0.01 0.32 -0.10 0.30 0.50 0.74 1.12 747 1
## beta[2] 0.69 0.01 0.27 0.17 0.51 0.69 0.87 1.21 741 1
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:26:25 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(model_fit, pars = "Gamma")
## Inference for Stan model: mixed_conditional_individual_effects.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## Gamma[1,1] 0.20 0.01 0.30 -0.39 -0.01 0.19 0.40 0.79 808 1.01
## Gamma[1,2] 1.07 0.01 0.29 0.52 0.88 1.07 1.26 1.65 716 1.00
## Gamma[1,3] -0.03 0.01 0.33 -0.71 -0.26 -0.03 0.19 0.63 591 1.00
## Gamma[1,4] -1.34 0.01 0.28 -1.90 -1.53 -1.34 -1.16 -0.79 753 1.00
## Gamma[2,1] 2.12 0.01 0.28 1.57 1.93 2.12 2.30 2.68 799 1.01
## Gamma[2,2] 0.08 0.01 0.23 -0.37 -0.08 0.08 0.23 0.55 625 1.01
## Gamma[2,3] 2.46 0.01 0.31 1.85 2.24 2.45 2.66 3.08 640 1.00
## Gamma[2,4] 1.06 0.01 0.23 0.63 0.90 1.06 1.22 1.53 795 1.00
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:26:25 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# Looks pretty good!
# Let's check that our estimates line up with the "known unknowns"
gamma_means <- get_posterior_mean(model_fit, pars = "Gamma")[,5]
gamma_means <- matrix(gamma_means, P, P2, byrow = TRUE)
plot(gamma, gamma_means, xlab = "True gamma", ylab = "Posterior mean")
abline(0, 1)

beta_individual_means <- get_posterior_mean(model_fit, pars = "beta_individual")[,5]
beta_individual_means <- matrix(beta_individual_means, I, P, byrow = TRUE)
plot(beta_individual, beta_individual_means, xlab = "True part-worths", ylab = "Posterior mean")
abline(0, 1)
Hey presto! Now we’re cooking with gas.