Ranked choice random coefficients logit in Stan

Note: this has been updated with some excellent suggestions from Ben Goodrich.

This post describes how to fit a random coefficients logit model to ranked data. Such data often arise in conjoint studies, where participants are presented with a number of choices and asked to rank them. In theory at least, ranked choice responses should be able to provide more precise estimates of parameters than best-choice responses, as they contain more information. In practice ranked choices might contain little information in middle rankings—as anyone who has voted below the line in an Australian Senate election (an exercise that involves ranking dozens of candidates) can tell you. And so it is common for ranked choice exercises to be evaluated only for the top few choices—this is a very simple extension to the method described here.

This post describes the generative model which leads to both sets of data, explains the likelihood, simulates some fake data in R, and fits models to the simple best-choice model, and the ranked choice model. This illustrates the increase in precision that comes from using more information.

The generative models

Each individual \(i\in 1:I\) completes \(t\in1:T\) tasks. These tasks involve making choices from \(j\in1:J\) possible options under three setups. The options are described by a vector of choice attributes \(X_{ijt}\), over which each individual has “part-worth” preferences \(\beta_{i}\). This gives rise to a utility from each possible choice

\[ u_{ijt} = X_{ijt}\beta_{i} + \epsilon_{ijt} \]

where \(\epsilon_{ijt}\) is IID Gumbel distributed.

In each task, individual \(i\) is asked to produce some data under the two setups.

In the best-choice setup, they indicate the choice \(j\) that provides them the highest utility.
In the ranked-choice setup, they indicate a rank ordering over all choices, with 1 going to the choice that provides highest utility, and \(J\) going to the choice with the lowest utility.

Deriving the log likelihood contributions

Best-choice logit

Given the Gumbel distribution of \(\epsilon\), the probability that a choice \(j\) has the highest utility is

\[ \frac{\exp(X_{ijt}\beta_{i})}{\sum_{k= 1}^{J}\exp(X_{ikt}\beta_{i})} \]

And so if individual \(i\) chooses choice \(j'\) as their best-choice in task \(t\), then log likelihood contribution is

\[ \log{L}_{it} = \log\left(\frac{\exp(X_{ijt}\beta_{i})}{\sum_{k= 1}^{J}\exp(X_{ikt}\beta_{i})}\right) \]

In Stan, we can implement this as the inner product of a vector \(y\) which is all 0 bar a 1 in the \(j'\)th position, and the softmax of \(X_{it}\beta_{i}\)

target += y' * log(softmax(X*beta));

Ranked logit

The ranked logit likelihood is the probability of observing a given ranking under the utility model described above. The way to think about this is as a sequence of best-choice problems. In the first round, the best choice is chosen, and then removed from the choice set. Next the best of the choices remaining is chosen; conveniently, this is the same formulation as above, but calculated only over the remaining choices.

If we order the choices by their ranking, such that \(j=1\) implies the first choice and \(j = J\) implies the least favoured choice, then the likelihood contribution of an individual-task is

\[ \prod_{l=1}^{J-1}\frac{\exp(X_{ilt}\beta_{i})}{\sum_{k= l}^{J}\exp(X_{ikt}\beta_{i})} \]

Note that the product only includes the first \(J-1\) entries, as the probability of the last choice is 1. Taking logs, we get the log likelihood contribution from a task

\[ \log(L_{it}) = \sum_{l=1}^{J-1}\log\left(\frac{\exp(X_{ilt}\beta_{i})}{\sum_{k= l}^{J}\exp(X_{ikt}\beta_{i})}\right) \]

In Stan, we implement this by ordering the choices (within a task) by the ranks, and then updating the log likelihood for each ranking in a loop. We write a user-defined likelihood function as

real rank_logit_lpmf(int[] rank_order, vector delta) {
  // We reorder the raw utilities so that the first rank is first, second rank second... 
  vector[rows(delta)] tmp = delta[rank_order];
  real out;
  // ... and sequentially take the log of the first element of the softmax applied to the remaining
  // unranked elements.
  for(i in 1:(rows(tmp) - 1)) {
    if(i == 1) {
      out = tmp[1] - log_sum_exp(tmp);
    } else {
      out += tmp[i] - log_sum_exp(tmp[i:]);
    }
  }
  // And return the log likelihood of observing that ranking
  return(out);
}

Note that we use tmp[1] - log_sum_exp(tmp) rather than log(softmax(tmp)[1]) as this is more numerically stable (it is just the log of the softmax)—thanks to Ben for this tip.

Writing out our likelihood function like this allows us to use the syntax

order ~ rank_logit(X*beta);

in the modeling block.

Note that we don’t use the raw ranks in the model. Instead we use the ordering of the ranks. So if the raw ranks are 4, 5, 1, 3, 2, then the order is 3, 5, 4, 1, 2, ie. the indexing that would result in the ranks being in ascending order (the 3rd entry of the ranks is 1, the 5th entry is 2, …).

The hierarchical prior on \(\beta_{i}\)

If we had a huge number of responses for \(\beta_{i}\), we could just use maximum likelihood to fit these models. But more typically we observe few, if any, observations per individual. Consequently we make use of a hierarchical prior, which allows “partial pooling”, or borrowing information across individuals. We use a Gaussian hierarchical prior, where individual demographics, \(W_{i}\), may provide information about the values of part-worths.

\[ \beta_{i} \sim \text{multi normal}(\beta + \Gamma W_{i}, \Sigma) \] where \(\Gamma\) is a loading matrix that loads individual demographics onto the part-worths, and \(\Sigma\) is a covariance matrix. I decompose \(\Sigma\) as \(\text{diag}(\tau)\,\Omega\,\text{diag}(\tau)\), where \(\tau\) are vectors of scales of \(\beta_{i}\) (in this case, standard deviations of deviations from the conditional means across individuals), and \(\Omega\) is the correlation matrix of the variation across individuals. We further decompose \(\Omega = L_{\Omega}L_{\Omega}'\), where \(L_{\Omega}\) is the lower-triangular Cholesky decomposition of the correlation matrix. This implies that

\[ \beta_{i} = \beta + \Gamma W_{i} + \text{diag}(\tau)\, L_{\Omega}\, z_{i} \text{ for } z_{i} \sim \text{normal}(0, 1) \]

And so priors on \(\tau\), \(L_{\Omega}\), \(\Gamma\) and \(\beta\) imply a hierarchical prior over \(\beta_{i}\).

Simulating some data

The below simulates data for the two response mechanisms described above.

library(tidyverse);

# Number of individuals
I <- 30
# Number of tasks per individual
Tasks <- 15
# Number of choices per task
J <- 5
# Dimension of covariate matrix
P <- 5
# Dimension of demographic matrix
P2 <- 6

# demographic matrix
W <- matrix(rnorm(I*P2), I, P2)
# Loading matrix
Gamma <- matrix(rnorm(P*P2), P, P2)
W %*% t(Gamma)

##               [,1]        [,2]       [,3]        [,4]       [,5]
##  [1,] -0.729685021  2.92387601 -1.8390104 -1.53260198  3.4845376
##  [2,] -0.201224725  0.30608067 -0.4071412  0.02022101  0.8533658
##  [3,]  0.615601606 -4.07950347  0.4209342 -5.04696357 -2.1933635
##  [4,] -1.656498876 -2.48974019 -1.6095018 -3.63565528 -1.4600342
##  [5,]  0.196180999  4.97955388 -1.9247819 -5.34039780  5.6712667
##  [6,]  0.256573043 -0.39819511  1.3331337  1.24296907 -1.4190493
##  [7,] -0.809830950  3.64467828 -2.8346495 -5.23334966  4.5095516
##  [8,]  1.088247565 -0.01285538  2.6460900 -1.99303124 -0.6491558
##  [9,]  0.056423310  0.15004340 -0.9119325  1.00782696  0.7264345
## [10,]  1.916270721 -1.49256493  3.2010977  0.81293956 -1.9029778
## [11,]  2.520926424 -0.30069864  0.8597536 -0.83825431  1.1918057
## [12,]  0.060969314 -0.25003478 -0.6087940 -1.30517534  0.8292028
## [13,] -0.091159366 -1.48918920  0.6505017  6.13024401 -1.9710295
## [14,]  0.986789391 -5.12155979  2.8920573 -0.55141240 -4.9631199
## [15,]  1.428865361 -0.94168283 -0.3321451 -3.61244711  1.0486189
## [16,]  0.992994955 -0.13854780 -1.1307426  1.85581467  1.8537610
## [17,]  1.057609283  0.57366361  4.8993310  6.88698306 -3.1731174
## [18,] -0.174380829  2.62356397  2.5801133  3.44451453 -1.0209550
## [19,] -0.887996390 -1.17633746  1.0871463  0.32486896 -3.6090731
## [20,]  0.830753584  0.74564722 -1.6546886  1.11059549  2.4904740
## [21,] -0.215467142  0.60951175 -2.6260406 -2.91201987  2.5908361
## [22,] -1.042579443  0.84575249  2.9742729  2.00880459 -3.0333374
## [23,] -0.427074151  4.56203366 -1.8120988  1.65426268  4.8560474
## [24,] -0.015599325  2.97800095  1.7200639 -0.34904589  1.2151825
## [25,]  1.009876637  2.92644385 -0.1691742  1.78272531  3.3449461
## [26,] -0.362500082 -3.37245311  1.4102831 -2.52726232 -3.7527214
## [27,] -0.003330021 -0.01425753  0.9881537  2.53833235  0.5971061
## [28,]  0.374936923  0.58373092  0.4522197 -2.30717227  0.6914501
## [29,] -0.127511703  3.14025615  0.7430891  0.85002422  1.0622258
## [30,] -2.693056074 -4.57878724 -8.5365613  1.37913101  1.3381547

# Correlation of decisionmaker random slopes
Omega <- cor(matrix(rnorm(P*(P+2)), P+2, P))

# Scale of decisionmaker random slopes
tau <- abs(rnorm(P, 0, .5))

# Covariance matrix of decisionmaker random slopes
Sigma <- diag(tau) %*% Omega %*% diag(tau)

# Centers of random slopes
beta <- rnorm(P)

# Individual slopes
beta_i <- MASS::mvrnorm(I, beta, Sigma) + W %*% t(Gamma)
plot(as.data.frame(beta_i))

# Create X -- let's make this a dummy matrix
X <- matrix(sample(0:1, Tasks*I*J*P, replace = T), Tasks*I*J, P)
# Each of the rows in this matrix correspond to a choice presented to a given individual
# in a given task

indexes <- crossing(individual = 1:I, task = 1:Tasks, option = 1:J) %>% 
  mutate(row = 1:n())
  
# Write a Gumbel random number generator using inverse CDF trick
rgumbel <- function(n, mu = 0, beta = 1) mu - beta * log(-log(runif(n)))
mean(rgumbel(1e6))

## [1] 0.5804272

# This should be approximately equal to 0.577 -- the Euler–Mascheroni constant 

# Ok, now we need to simulate choices. Each person in each task compares each 
# choice according to X*beta_i + epsilon, where epsilon is gumbel distributed. 
# They return their rankings. 

ranked_options <- indexes %>% 
  group_by(individual, task) %>% 
  mutate(fixed_utility = as.numeric(X[row,] %*% as.numeric(beta_i[first(individual),])),
         plus_gumbel_error = fixed_utility + rgumbel(n()),
         rank = rank(-plus_gumbel_error),
         # We're going to use the order rather than the rank in the Stan part of the model
         order = order(rank),
         # And here we create a dummy vector for the best choice
         best_choice = as.numeric(1:n() == which.max(plus_gumbel_error))
         )


tt <- ranked_options %>% 
  group_by(individual, task) %>%
  summarise(start = min(row), 
            end = max(row)) %>% 
  ungroup %>%
  mutate(task_number = 1:n())

Stan code for the best-choice model

// saved as mixed_conditional_individual_effects.stan
data {
  int N; // number of rows
  int T; // number of inidvidual-choice sets/task combinations
  int I; // number of Individuals
  int P; // number of covariates that vary by choice
  int P2; // number of covariates that vary by individual
  int K; // number of choices
  
  vector<lower = 0, upper = 1>[N] choice; // binary indicator for choice
  matrix[N, P] X; // choice attributes
  matrix[I, P2] X2; // individual attributes
  
  int task[T]; // index for tasks
  int task_individual[T]; // index for individual
  int start[T]; // the starting observation for each task
  int end[T]; // the ending observation for each task
}
parameters {
  vector[P] beta; // hypermeans of the part-worths
  matrix[P, P2] Gamma; // coefficient matrix on individual attributes
  vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
  matrix[I, P] z; // individual random effects (unscaled)
  cholesky_factor_corr[P] L_Omega; // the cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
  // here we use the reparameterization discussed on slide 30
  matrix[I, P] beta_individual = rep_matrix(beta', I) + X2 * Gamma' + z*diag_pre_multiply(tau, L_Omega);
}
model {
  // create a temporary holding vector
  vector[N] log_prob;
  
  // priors on the parameters
  tau ~ normal(0, .5);
  beta ~ normal(0, .5);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(Gamma) ~ normal(0, 1);
  
  // log probabilities of each choice in the dataset
  for(t in 1:T) {
    vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
    // add utility from product attributes with individual part-worths/marginal utilities
    utilities = X[start[t]:end[t]]*beta_individual[task_individual[t]]';
    
    log_prob[start[t]:end[t]] = log_softmax(utilities);
  }
  
  // use the likelihood derivation on slide 29
  target += log_prob' * choice;
}

Stan code for the ranked choice model

// saved as ranked_rcl.stan
functions {
  real rank_logit_lpmf(int[] rank_order, vector delta) {
    // We reorder the raw utilities so that the first rank is first, second rank second... 
    vector[rows(delta)] tmp = delta[rank_order];
    real out;
    // ... and sequentially take the log of the first element of the softmax applied to the remaining
    // unranked elements.
    for(i in 1:(rows(tmp) - 1)) {
      if(i == 1) {
        out = tmp[1] - log_sum_exp(tmp);
      } else {
        out += tmp[i] - log_sum_exp(tmp[i:]);
      }
    }
    // And return the log likelihood of observing that ranking
    return(out);
  }
}
data {
  int N; // number of rows
  int T; // number of inidvidual-choice sets/task combinations
  int I; // number of Individuals
  int P; // number of covariates that vary by choice
  int P2; // number of covariates that vary by individual
  int K; // number of choices
  
  int rank_order[N]; // The vector describing the index (within each task) of the first, second, third, ... choices. 
  // In R, this is order(-utility) within each task
  matrix[N, P] X; // choice attributes
  matrix[I, P2] X2; // individual attributes
  
  int task[T]; // index for tasks
  int task_individual[T]; // index for individual
  int start[T]; // the starting observation for each task
  int end[T]; // the ending observation for each task
}
parameters {
  vector[P] beta; // hypermeans of the part-worths
  matrix[P, P2] Gamma; // coefficient matrix on individual attributes
  vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
  matrix[I, P] z; // individual random effects (unscaled)
  cholesky_factor_corr[P] L_Omega; // the cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
  // here we use the reparameterization discussed on slide 30
  matrix[I, P] beta_individual = rep_matrix(beta', I) + X2 * Gamma' + z * diag_pre_multiply(tau, L_Omega);
}
model {
  // priors on the parameters
  tau ~ normal(0, .5);
  beta ~ normal(0, 1);
  to_vector(z) ~ normal(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);
  to_vector(Gamma) ~ normal(0, 1);
  
  // log probabilities of each choice in the dataset
  for(t in 1:T) {
    vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
    // add utility from product attributes with individual part-worths/marginal utilities
    utilities = X[start[t]:end[t]]*beta_individual[task_individual[t]]';
    rank_order[start[t]:end[t]] ~ rank_logit(utilities);
  }
}

Creating the data lists and running the models

library(rstan)
data_list_best_choice <-list(N = nrow(X),
                             T = nrow(tt),
                             I = I, 
                             P = P, 
                             P2 = P2, 
                             K = J, 
                             # NOTE!! This is the tricky bit -- we use the order of the ranks (within task)
                             # Not the raw rank orderings. This is how we get the likelihood evaluation to be pretty quick
                             choice = ranked_options$best_choice,
                             X = X, 
                             X2 = W, 
                             task = tt$task_number, 
                             task_individual = tt$individual,
                             start = tt$start, 
                             end = tt$end) 

data_list_ranked_rcl <- list(N = nrow(X),
                             T = nrow(tt),
                             I = I, 
                             P = P, 
                             P2 = P2, 
                             K = J, 
                             # NOTE!! This is the tricky bit -- we use the order of the ranks (within task)
                             # Not the raw rank orderings. This is how we get the likelihood evaluation to be pretty quick
                             rank_order = ranked_options$order,
                             X = X, 
                             X2 = W, 
                             task = tt$task_number, 
                             task_individual = tt$individual,
                             start = tt$start, 
                             end = tt$end)

compiled_best_choice_model <- stan_model("mixed_conditional_individual_effects.stan")

compiled_ranked_choice_model <- stan_model("ranked_rcl.stan")

best_choice_fit <- sampling(compiled_best_choice_model, 
                            data = data_list_best_choice, 
                            cores = 4, iter = 600)

ranked_choice_fit <- sampling(compiled_ranked_choice_model,
                              data = data_list_ranked_rcl, 
                              cores = 4, iter = 600)

Let’s check that the two models get similar parameter estimates

plot(get_posterior_mean(best_choice_fit, "beta_individual")[,5], get_posterior_mean(ranked_choice_fit,"beta_individual")[,5])
abline(0, 1)

That’s a good sign! Let’s check out the difference in precision of the estimates across models.

as.data.frame(best_choice_fit, pars = "beta_individual") %>% 
  gather(Parameter, Value) %>% 
  group_by(Parameter) %>% 
  summarise(median = median(Value),
            lower = quantile(Value, .05),
            upper = quantile(Value, .95)) %>%
  mutate(individual = str_extract(Parameter, "[0-9]{1,2},") %>% parse_number,
         column = str_extract(Parameter, ",[0-9]{1,2}") %>% parse_number) %>% 
  arrange(individual, column) %>% 
  mutate(`True value` = as.numeric(t(beta_i))) %>% 
  ggplot(aes(x = `True value`)) +
  geom_linerange(aes(ymin = lower, ymax = upper), colour = "green", alpha = 0.3) +
  geom_point(aes(y = median), colour = "green", alpha = 0.5) +
  ggthemes::theme_pander() +
  geom_abline(aes(intercept = 0, slope = 1)) +
  labs(y = "Estimates",
       title = "Estimates from best choice model",
       subtitle = "With interior 90% credibility intervals")

as.data.frame(ranked_choice_fit, pars = "beta_individual") %>% 
  gather(Parameter, Value) %>% 
  group_by(Parameter) %>% 
  summarise(median = median(Value),
            lower = quantile(Value, .05),
            upper = quantile(Value, .95)) %>%
  mutate(individual = str_extract(Parameter, "[0-9]{1,2},") %>% parse_number,
         column = str_extract(Parameter, ",[0-9]{1,2}") %>% parse_number) %>% 
  arrange(individual, column) %>% 
  mutate(`True value` = as.numeric(t(beta_i))) %>% 
  ggplot(aes(x = `True value`)) +
  geom_linerange(aes(ymin = lower, ymax = upper), colour = "green", alpha = 0.3) +
  geom_point(aes(y = median), colour = "green", alpha = 0.5) +
  ggthemes::theme_pander() +
  geom_abline(aes(intercept = 0, slope = 1)) +
  labs(y = "Estimates",
       title = "Estimates from ranked choice model",
       subtitle = "With interior 90% credibility intervals")

And there we go – we can clearly see that the precision of estimates of the ranked choice model is much greater, which we expect given we’re extracting more information from study participants.