The generative model

Individual \(i\in 1:I\) compares products \(j\in 1:J\) in choice task \(t\in 1:T\). Each product generates utility from a linear combination of choice-specific attributes \(X_{jt}\) and decision-maker-specific attributes \(X2_{i}\). The conditional logit parameters \(\beta\) may be common to all individuals, or may vary across individuals (a set-up known as “random coefficients logit” in economics, “hierarchical Bayes” in marketing, or “mixed logit” in discrete choice). A matrix of choice-specific parameters \(\Gamma\), with row \(\Gamma_{j}\), converts individual attributes into utilities for each choice. Each decision-maker in each choice task also experiences an iid Gumbel-distributed utility shock \(\epsilon_{ijt}\).

In each decision task there is an outside good, good \(J\), with all product attributes equal to 0. The corresponding row of \(\Gamma\) is also set to 0. This restriction identifies the model: adding a constant to every alternative’s utility leaves the choice probabilities unchanged, so one alternative’s utility level has to be pinned down.

\[ u_{ijt} = X_{jt}\beta_{i} + \Gamma_{j} X2_{i} + \epsilon_{ijt} \] We assume that each individual’s \(\beta_{i}\), known in marketing as “individual part-worths”, is drawn from a multivariate normal distribution with hypermean \(\beta\) and covariance matrix \(\Sigma\). Standard logit gives all individuals the same \(\beta\).

\[ \beta_{i} \sim \text{Multi normal}(\beta, \Sigma) \]

Note that each row of \(\Gamma\) corresponds to a product, so it is a little odd to have fixed products whose attributes vary across tasks. You should only use this specification when you are sure the choices are the same in every task but their attributes vary. For example, each choice might be a particular packet of chips, with the \(X\) variables being its price and packet size in a given task.

We’ll fix this problem in a bit.

The decision-maker simply chooses whichever option in each task provides the highest utility. Because of the Gumbel error distribution, the probability of each choice is

\[ p(u_{ijt} = \max(u_{it})) = \frac{\exp(X_{jt}\beta_{i} + \Gamma_{j} X2_{i})}{\sum_{k=1}^{J} \exp(X_{kt}\beta_{i} + \Gamma_{k} X2_{i})} \]
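
If you want to convince yourself that the Gumbel shocks really do imply these softmax probabilities, a quick Monte Carlo check works: simulate the shocks directly and compare how often each option has the maximum utility with the softmax of the deterministic utilities. (The utilities v below are made up purely for illustration.)

# Monte Carlo check: argmax of (v + Gumbel noise) occurs with probability softmax(v)
set.seed(42)
v <- c(1.0, 0.2, -0.5)                       # made-up deterministic utilities
rgumbel <- function(n) -log(-log(runif(n)))  # standard Gumbel draws via inverse CDF
draws <- replicate(1e4, which.max(v + rgumbel(length(v))))
rbind(empirical = as.numeric(table(factor(draws, levels = 1:3)))/1e4,
      softmax   = exp(v)/sum(exp(v)))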

Simulate some fake data from the model

Let’s make some fake data

# We'll do 10 tasks per individual for 50 individuals comparing 5 choices
I <- 50
T <- I*10
K <- 5
P <- 2 # We'll vary two product attributes for each choice at each task
P2 <- 4
N <- T*K

X <- matrix(rnorm(N*P), N, P)
# Need to make sure that every Kth row is zero (the outside option has no attributes)
X[1:nrow(X)%%K==0,] <- 0
X2 <- matrix(rnorm(I*P2), I, P2)

beta <- rnorm(P)
# Draw scales and a random correlation matrix, then combine them into the
# covariance matrix of the individual-level betas
tau <- rgamma(P, 4, 2)
Sigma <- diag(tau) %*% cor(matrix(rnorm(P*(P+1)), P+1, P)) %*% diag(tau)
beta_individual <- MASS::mvrnorm(I, beta, Sigma)

gamma <- matrix(rnorm(P2*K), K, P2)
gamma[K,] <- 0

indexes <- data_frame(individual = rep(1:I, each = K*10),
                      task = rep(1:T, each = K),
                      row = 1:(T*K)) %>%
  group_by(task) %>% 
  summarise(task_individual = first(individual),
            start = first(row),
            end = last(row)) 
  
softmax <- function(x) exp(x)/sum(exp(x))

choice <- rep(0, N)

for(t in 1:T) {
  # Add product attribute utility
  utility <- X[indexes$start[t]:indexes$end[t],] %*% beta_individual[indexes$task_individual[t],]
  utility <- utility + gamma %*% X2[indexes$task_individual[t],]
  choice[indexes$start[t]:indexes$end[t]] <- as.numeric(rmultinom(1, 1, softmax(as.numeric(utility))))
}
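
Before writing the model out, a couple of quick sanity checks on the simulated data: every task should have exactly one chosen alternative, and every outside-option row should have zero attributes.

# Both of these should be TRUE
all(tapply(choice, rep(1:T, each = K), sum) == 1)  # exactly one choice per task
all(X[seq(K, N, by = K), ] == 0)                   # outside option has zero attributes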

Stan code

Here is some Stan code that will estimate the model
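
One note before reading it: rather than sampling each \(\beta_{i}\) directly, the program draws standard normal shocks \(z_{i}\) and builds the part-worths deterministically, a non-centered parameterization that tends to sample far more efficiently in hierarchical models like this one:

\[ \beta_{i} = \beta + \text{diag}(\tau)\, L_{\Omega}\, z_{i}, \quad z_{i} \sim \text{Normal}(0, I), \quad \Sigma = \text{diag}(\tau)\, L_{\Omega} L_{\Omega}'\, \text{diag}(\tau) \]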

// saved as mixed_conditional_plus_mnp_logit.stan
data {
  int N; // number of rows
  int T; // number of individual-choice set/task combinations
  int I; // number of Individuals
  int P; // number of covariates that vary by choice
  int P2; // number of covariates that vary by individual
  int K; // number of choices
  
  vector<lower = 0, upper = 1>[N] choice; // binary indicator for choice
  matrix[N, P] X; // choice attributes
  matrix[I, P2] X2; // individual attributes
  
  int task[T]; // index for tasks
  int task_individual[T]; // index for individual
  int start[T]; // the starting observation for each task
  int end[T]; // the ending observation for each task
}
parameters {
  vector[P] beta; // hypermeans of the part-worths
  matrix[K-1, P2] Gamma_raw; // coefficient matrix on individual attributes
  vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
  matrix[I, P] z; // individual random effects (unscaled)
  cholesky_factor_corr[P] L_Omega; // the cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
  // non-centered parameterization (the reparameterization discussed on slide 30):
  // each individual's part-worths are beta + diag(tau) * L_Omega * z_i
  matrix[I, P] beta_individual = rep_matrix(beta', I) + z * diag_pre_multiply(tau, L_Omega)';
  matrix[K, P2] Gamma; // we set the final row to zero for identification
  Gamma[1:(K-1), :] = Gamma_raw;
  Gamma[K,:] = rep_row_vector(0.0, P2);
}
model {
  // create a temporary holding vector
  vector[N] log_prob;
  
  // priors on the parameters
  tau ~ normal(0, .5);
  beta ~ normal(0, .5);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(Gamma_raw) ~ normal(0, 1);
  
  // log probabilities of each choice in the dataset
  for(t in 1:T) {
    vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
    // add utility from product attributes with individual part-worths/marginal utilities
    utilities = X[start[t]:end[t]]*beta_individual[task_individual[t]]';
    // add the utility from the individual attributes for each choice
    utilities += Gamma * X2[task_individual[t]]';  
    
    log_prob[start[t]:end[t]] = log(softmax(utilities));
  }
  
  // use the likelihood derivation on slide 29
  target += log_prob' * choice;
}
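
The last line computes the whole log likelihood in one go: within each task, choice is a one-hot vector, so the dot product of the log choice probabilities with the 0/1 indicator simply sums the log probability of each chosen alternative (the derivation referred to on slide 29). Here is a tiny R illustration of the same trick, with made-up numbers:

log_p <- log(softmax(c(0.2, -1.0, 0.5, 0.0, 0.0)))  # log choice probabilities for one task
y <- c(0, 0, 1, 0, 0)                                # one-hot choice indicator
all.equal(sum(log_p[y == 1]), sum(log_p * y))        # TRUE: the dot product picks out the chosen option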

Recapturing known parameters

# Fit the model in Stan ---------------------------------------------------

# So you'll want to have data that look like so: 

# choice level data
bind_cols(choice = choice, as_data_frame(X)) %>% 
  head(10)
## # A tibble: 10 x 3
##    choice      V1      V2
##     <dbl>   <dbl>   <dbl>
##  1     0.  0.471   0.535 
##  2     1.  1.10   -1.02  
##  3     0. -0.0746 -0.0411
##  4     0. -1.10    1.05  
##  5     0.  0.      0.    
##  6     0. -2.55   -1.02  
##  7     0.  0.189  -0.933 
##  8     0. -0.780  -0.615 
##  9     1.  1.17   -1.68  
## 10     0.  0.      0.
# task-level data
bind_cols(indexes) %>% 
  head(10)
## # A tibble: 10 x 4
##     task task_individual start   end
##    <int>           <int> <int> <int>
##  1     1               1     1     5
##  2     2               1     6    10
##  3     3               1    11    15
##  4     4               1    16    20
##  5     5               1    21    25
##  6     6               1    26    30
##  7     7               1    31    35
##  8     8               1    36    40
##  9     9               1    41    45
## 10    10               1    46    50
# The start and end refer to the rows in the choice-level data that correspond to the choice task

# Individual-level data
bind_cols(individual = 1:I, as_data_frame(X2)) %>% 
  head(10)
## # A tibble: 10 x 5
##    individual      V1      V2       V3     V4
##         <int>   <dbl>   <dbl>    <dbl>  <dbl>
##  1          1  0.254   0.804  -0.281   -0.785
##  2          2  0.701  -0.0505 -0.276    0.434
##  3          3 -0.662   0.0145 -0.00667 -0.152
##  4          4 -1.45    2.12    0.631   -0.496
##  5          5 -0.0412  1.64    0.221   -0.482
##  6          6  0.0490 -1.18   -0.466    1.84 
##  7          7  2.22    1.34   -0.515    0.851
##  8          8 -0.912   0.736  -1.37     0.421
##  9          9  0.466  -0.515  -0.476    1.37 
## 10         10  1.30   -0.964   2.56     0.432
compiled_model <- stan_model("mixed_conditional_plus_mnp_logit.stan")

data_list <- list(N = N, 
                  I = I, 
                  P = P, 
                  P2 = P2, 
                  K = K, 
                  T = T, 
                  X = X, 
                  X2 = X2, 
                  choice = choice, 
                  start = indexes$start,
                  end = indexes$end, 
                  task_individual = indexes$task_individual,
                  task = indexes$task)


model_fit <- sampling(compiled_model, data = data_list, cores = 4, iter = 1000)
# Check model fit
print(model_fit, pars = "beta")
## Inference for Stan model: mixed_conditional_plus_mnp_logit.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## beta[1]  0.32    0.01 0.31 -0.28  0.12  0.32  0.53  0.93   861 1.00
## beta[2] -1.59    0.01 0.32 -2.24 -1.80 -1.59 -1.39 -0.98   604 1.01
## 
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:24:58 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
print(model_fit, pars = "Gamma")
## Inference for Stan model: mixed_conditional_plus_mnp_logit.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##             mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## Gamma[1,1]  0.47    0.00 0.20  0.08  0.33  0.46  0.61  0.86  2000    1
## Gamma[1,2]  0.55    0.00 0.21  0.13  0.42  0.55  0.69  0.97  1841    1
## Gamma[1,3]  0.17    0.00 0.20 -0.23  0.02  0.17  0.30  0.56  1848    1
## Gamma[1,4] -0.09    0.01 0.22 -0.51 -0.23 -0.09  0.06  0.32  1712    1
## Gamma[2,1] -0.48    0.01 0.25 -0.98 -0.64 -0.48 -0.33 -0.02  2000    1
## Gamma[2,2]  1.20    0.01 0.25  0.73  1.02  1.20  1.35  1.69  2000    1
## Gamma[2,3] -1.13    0.01 0.26 -1.68 -1.30 -1.13 -0.95 -0.64  2000    1
## Gamma[2,4] -0.63    0.01 0.23 -1.09 -0.78 -0.63 -0.48 -0.18  1584    1
## Gamma[3,1] -0.62    0.01 0.23 -1.06 -0.78 -0.62 -0.46 -0.18  2000    1
## Gamma[3,2]  0.65    0.01 0.23  0.21  0.49  0.65  0.80  1.14  1666    1
## Gamma[3,3]  0.68    0.01 0.21  0.28  0.55  0.68  0.82  1.11  1710    1
## Gamma[3,4] -0.33    0.01 0.22 -0.75 -0.48 -0.33 -0.18  0.08  1555    1
## Gamma[4,1] -0.63    0.01 0.24 -1.11 -0.79 -0.62 -0.47 -0.15  2000    1
## Gamma[4,2] -0.35    0.01 0.24 -0.81 -0.51 -0.35 -0.19  0.10  2000    1
## Gamma[4,3] -1.69    0.01 0.28 -2.25 -1.88 -1.68 -1.50 -1.15  2000    1
## Gamma[4,4] -0.70    0.01 0.23 -1.14 -0.86 -0.71 -0.55 -0.26  1753    1
## Gamma[5,1]  0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  2000  NaN
## Gamma[5,2]  0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  2000  NaN
## Gamma[5,3]  0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  2000  NaN
## Gamma[5,4]  0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  2000  NaN
## 
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:24:58 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Let’s plot the known unknowns against their estimated values

gamma_means <- get_posterior_mean(model_fit, pars = "Gamma")[,5]
gamma_means <- matrix(gamma_means, K, P2, byrow = TRUE) # spell out TRUE: T now holds the number of tasks
plot(gamma_means, gamma)
abline(0, 1)

beta_individual_means <- get_posterior_mean(model_fit, pars = "beta_individual")[,5]
beta_individual_means <- matrix(beta_individual_means, I, P, byrow = TRUE)

plot(beta_individual, beta_individual_means)
abline(0, 1)
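
We can also peek at the hyperparameters that govern the spread of the part-worths, comparing the posterior for tau with the scales used in the simulation. With only 50 individuals this will be much less precise than the part-worths themselves, but it should be in the right ballpark (output not shown).

print(model_fit, pars = "tau")
tau  # the scales used to simulate the data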

Another approach

The model above assumes that each choice slot corresponds to the same underlying product in every task. Sometimes we want to give decision-makers different sets of choices that vary only in their attributes, so that a given product need not occupy the same position in the line-up in every task. That doesn’t work with the specification above, where each row of \(\Gamma\) corresponds to a fixed choice.

In this model we still think customer attributes matter, but now only through their influence on individuals’ part-worths. That is, we have

\[ u_{ijt} = X_{jt}\beta_{i}+ \epsilon_{ijt} \] And

\[ \beta_{i} \sim \text{Multi normal}(\beta + \Gamma X2_{i}, \Sigma) \]

Simulated data

# We'll do 10 tasks per individual for 50 individuals comparing 5 choices
I <- 50
T <- I*10
K <- 5
P <- 2 # We'll vary two product attributes for each choice at each task
P2 <- 4
N <- T*K

X <- matrix(rnorm(N*P), N, P)
# Need to make sure that every Kth row is zero (the outside option has no attributes)
X[1:nrow(X)%%K==0,] <- 0
X2 <- matrix(rnorm(I*P2), I, P2)

beta <- rnorm(P)
# Gamma now maps individual attributes into shifts of the part-worth means
gamma <- matrix(rnorm(P*P2), P, P2)
# Draw scales and a random correlation matrix, then combine them into the
# covariance matrix of the individual-level betas
tau <- rgamma(P, 4, 2)
Sigma <- diag(tau) %*% cor(matrix(rnorm(P*(P+1)), P+1, P)) %*% diag(tau)

beta_individual <- MASS::mvrnorm(I, beta, Sigma) + X2 %*% t(gamma)


indexes <- data_frame(individual = rep(1:I, each = K*10),
                      task = rep(1:T, each = K),
                      row = 1:(T*K)) %>%
  group_by(task) %>% 
  summarise(task_individual = first(individual),
            start = first(row),
            end = last(row)) 
  
softmax <- function(x) exp(x)/sum(exp(x))

choice <- rep(0, N)

for(t in 1:T) {
  # Add product attribute utility
  utility <- X[indexes$start[t]:indexes$end[t],] %*% beta_individual[indexes$task_individual[t],]
  choice[indexes$start[t]:indexes$end[t]] <- as.numeric(rmultinom(1, 1, softmax(as.numeric(utility))))
}

Stan code

The Stan code for this version is below. Relative to the first program, \(\Gamma\) is now a \(P \times P2\) matrix that shifts the mean of each individual’s part-worths (so the zero-row restriction is no longer needed), and the utilities no longer include a separate \(\Gamma_{j} X2_{i}\) term:

// saved as mixed_conditional_individual_effects.stan
data {
  int N; // number of rows
  int T; // number of individual-choice set/task combinations
  int I; // number of Individuals
  int P; // number of covariates that vary by choice
  int P2; // number of covariates that vary by individual
  int K; // number of choices
  
  vector<lower = 0, upper = 1>[N] choice; // binary indicator for choice
  matrix[N, P] X; // choice attributes
  matrix[I, P2] X2; // individual attributes
  
  int task[T]; // index for tasks
  int task_individual[T]; // index for individual
  int start[T]; // the starting observation for each task
  int end[T]; // the ending observation for each task
}
parameters {
  vector[P] beta; // hypermeans of the part-worths
  matrix[P, P2] Gamma; // coefficient matrix on individual attributes
  vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
  matrix[I, P] z; // individual random effects (unscaled)
  cholesky_factor_corr[P] L_Omega; // the cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
  // non-centered parameterization as before (slide 30), with the part-worth means shifted by X2 * Gamma'
  matrix[I, P] beta_individual = rep_matrix(beta', I) + X2 * Gamma' + z * diag_pre_multiply(tau, L_Omega)';
}
model {
  // create a temporary holding vector
  vector[N] log_prob;
  
  // priors on the parameters
  tau ~ normal(0, .5);
  beta ~ normal(0, .5);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(Gamma) ~ normal(0, 1);
  
  // log probabilities of each choice in the dataset
  for(t in 1:T) {
    vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
    // add utility from product attributes with individual part-worths/marginal utilities
    utilities = X[start[t]:end[t]]*beta_individual[task_individual[t]]';
    
    log_prob[start[t]:end[t]] = log(softmax(utilities));
  }
  
  // use the likelihood derivation on slide 29
  target += log_prob' * choice;
}

Recapturing known parameters

compiled_model <- stan_model("mixed_conditional_individual_effects.stan")

data_list <- list(N = N, 
                  I = I, 
                  P = P, 
                  P2 = P2, 
                  K = K, 
                  T = T, 
                  X = X, 
                  X2 = X2, 
                  choice = choice, 
                  start = indexes$start,
                  end = indexes$end, 
                  task_individual = indexes$task_individual,
                  task = indexes$task)


model_fit <- sampling(compiled_model, data = data_list, cores = 4, iter = 1000)

And plot estimates and actuals

# Check model fit
print(model_fit, pars = "beta")
## Inference for Stan model: mixed_conditional_individual_effects.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##         mean se_mean   sd  2.5%  25%  50%  75% 97.5% n_eff Rhat
## beta[1] 0.51    0.01 0.32 -0.10 0.30 0.50 0.74  1.12   747    1
## beta[2] 0.69    0.01 0.27  0.17 0.51 0.69 0.87  1.21   741    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:26:25 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
print(model_fit, pars = "Gamma")
## Inference for Stan model: mixed_conditional_individual_effects.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##             mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## Gamma[1,1]  0.20    0.01 0.30 -0.39 -0.01  0.19  0.40  0.79   808 1.01
## Gamma[1,2]  1.07    0.01 0.29  0.52  0.88  1.07  1.26  1.65   716 1.00
## Gamma[1,3] -0.03    0.01 0.33 -0.71 -0.26 -0.03  0.19  0.63   591 1.00
## Gamma[1,4] -1.34    0.01 0.28 -1.90 -1.53 -1.34 -1.16 -0.79   753 1.00
## Gamma[2,1]  2.12    0.01 0.28  1.57  1.93  2.12  2.30  2.68   799 1.01
## Gamma[2,2]  0.08    0.01 0.23 -0.37 -0.08  0.08  0.23  0.55   625 1.01
## Gamma[2,3]  2.46    0.01 0.31  1.85  2.24  2.45  2.66  3.08   640 1.00
## Gamma[2,4]  1.06    0.01 0.23  0.63  0.90  1.06  1.22  1.53   795 1.00
## 
## Samples were drawn using NUTS(diag_e) at Wed Mar 14 12:26:25 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
# Looks pretty good!

# Let's check that our estimates line up with the "known unknowns" 

gamma_means <- get_posterior_mean(model_fit, pars = "Gamma")[,5]
gamma_means <- matrix(gamma_means, P, P2, byrow = TRUE) # again, spell out TRUE rather than T
plot(gamma_means, gamma)
abline(0, 1)

beta_individual_means <- get_posterior_mean(model_fit, pars = "beta_individual")[,5]
beta_individual_means <- matrix(beta_individual_means, I, P, byrow = TRUE)

plot(beta_individual, beta_individual_means)
abline(0, 1)

Hey presto! Now we’re cooking with gas.