Note: this has been updated with some excellent suggestions from Ben Goodrich.
This post describes how to fit a random coefficients logit model to ranked data. Such data often arise in conjoint studies, where participants are presented with a number of choices and asked to rank them. In theory at least, ranked choice responses should be able to provide more precise estimates of parameters than best-choice responses, as they contain more information. In practice ranked choices might contain little information in middle rankings—as anyone who has voted below the line in an Australian Senate election (an exercise that involves ranking dozens of candidates) can tell you. And so it is common for ranked choice exercises to be evaluated only for the top few choices—this is a very simple extension to the method described here.
This post describes the generative model which leads to both sets of data, explains the likelihood, simulates some fake data in R, and fits models to the simple best-choice model, and the ranked choice model. This illustrates the increase in precision that comes from using more information.
Each individual \(i\in 1:I\) completes \(t\in1:T\) tasks. These tasks involve making choices from \(j\in1:J\) possible options under three setups. The options are described by a vector of choice attributes \(X_{ijt}\), over which each individual has “part-worth” preferences \(\beta_{i}\). This gives rise to a utility from each possible choice
\[ u_{ijt} = X_{ijt}\beta_{i} + \epsilon_{ijt} \]
where \(\epsilon_{ijt}\) is IID Gumbel distributed.
In each task, individual \(i\) is asked to produce some data under the two setups.
Given the Gumbel distribution of \(\epsilon\), the probability that a choice \(j\) has the highest utility is
\[ \frac{\exp(X_{ijt}\beta_{i})}{\sum_{k= 1}^{J}\exp(X_{ikt}\beta_{i})} \]
And so if individual \(i\) chooses choice \(j'\) as their best-choice in task \(t\), then log likelihood contribution is
\[ \log{L}_{it} = \log\left(\frac{\exp(X_{ijt}\beta_{i})}{\sum_{k= 1}^{J}\exp(X_{ikt}\beta_{i})}\right) \]
In Stan, we can implement this as the inner product of a vector \(y\) which is all 0 bar a 1 in the \(j'\)th position, and the softmax of \(X_{it}\beta_{i}\)
target += y' * log(softmax(X*beta));
The ranked logit likelihood is the probability of observing a given ranking under the utility model described above. The way to think about this is as a sequence of best-choice problems. In the first round, the best choice is chosen, and then removed from the choice set. Next the best of the choices remaining is chosen; conveniently, this is the same formulation as above, but calculated only over the remaining choices.
If we order the choices by their ranking, such that \(j=1\) implies the first choice and \(j = J\) implies the least favoured choice, then the likelihood contribution of an individual-task is
\[ \prod_{l=1}^{J-1}\frac{\exp(X_{ilt}\beta_{i})}{\sum_{k= l}^{J}\exp(X_{ikt}\beta_{i})} \]
Note that the product only includes the first \(J-1\) entries, as the probability of the last choice is 1. Taking logs, we get the log likelihood contribution from a task
\[ \log(L_{it}) = \sum_{l=1}^{J-1}\log\left(\frac{\exp(X_{ilt}\beta_{i})}{\sum_{k= l}^{J}\exp(X_{ikt}\beta_{i})}\right) \]
In Stan, we implement this by ordering the choices (within a task) by the ranks, and then updating the log likelihood for each ranking in a loop. We write a user-defined likelihood function as
real rank_logit_lpmf(int[] rank_order, vector delta) {
// We reorder the raw utilities so that the first rank is first, second rank second...
vector[rows(delta)] tmp = delta[rank_order];
real out;
// ... and sequentially take the log of the first element of the softmax applied to the remaining
// unranked elements.
for(i in 1:(rows(tmp) - 1)) {
if(i == 1) {
out = tmp[1] - log_sum_exp(tmp);
} else {
out += tmp[i] - log_sum_exp(tmp[i:]);
}
}
// And return the log likelihood of observing that ranking
return(out);
}
Note that we use tmp[1] - log_sum_exp(tmp)
rather than log(softmax(tmp)[1])
as this is more numerically stable (it is just the log of the softmax)—thanks to Ben for this tip.
Writing out our likelihood function like this allows us to use the syntax
order ~ rank_logit(X*beta);
in the modeling block.
Note that we don’t use the raw ranks in the model. Instead we use the ordering of the ranks. So if the raw ranks are 4, 5, 1, 3, 2
, then the order is 3, 5, 4, 1, 2
, ie. the indexing that would result in the ranks being in ascending order (the 3rd entry of the ranks is 1, the 5th entry is 2, …).
If we had a huge number of responses for \(\beta_{i}\), we could just use maximum likelihood to fit these models. But more typically we observe few, if any, observations per individual. Consequently we make use of a hierarchical prior, which allows “partial pooling”, or borrowing information across individuals. We use a Gaussian hierarchical prior, where individual demographics, \(W_{i}\), may provide information about the values of part-worths.
\[ \beta_{i} \sim \text{multi normal}(\beta + \Gamma W_{i}, \Sigma) \] where \(\Gamma\) is a loading matrix that loads individual demographics onto the part-worths, and \(\Sigma\) is a covariance matrix. I decompose \(\Sigma\) as \(\text{diag}(\tau)\,\Omega\,\text{diag}(\tau)\), where \(\tau\) are vectors of scales of \(\beta_{i}\) (in this case, standard deviations of deviations from the conditional means across individuals), and \(\Omega\) is the correlation matrix of the variation across individuals. We further decompose \(\Omega = L_{\Omega}L_{\Omega}'\), where \(L_{\Omega}\) is the lower-triangular Cholesky decomposition of the correlation matrix. This implies that
\[ \beta_{i} = \beta + \Gamma W_{i} + \text{diag}(\tau)\, L_{\Omega}\, z_{i} \text{ for } z_{i} \sim \text{normal}(0, 1) \]
And so priors on \(\tau\), \(L_{\Omega}\), \(\Gamma\) and \(\beta\) imply a hierarchical prior over \(\beta_{i}\).
The below simulates data for the two response mechanisms described above.
library(tidyverse);
# Number of individuals
I <- 30
# Number of tasks per individual
Tasks <- 15
# Number of choices per task
J <- 5
# Dimension of covariate matrix
P <- 5
# Dimension of demographic matrix
P2 <- 6
# demographic matrix
W <- matrix(rnorm(I*P2), I, P2)
# Loading matrix
Gamma <- matrix(rnorm(P*P2), P, P2)
W %*% t(Gamma)
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.729685021 2.92387601 -1.8390104 -1.53260198 3.4845376
## [2,] -0.201224725 0.30608067 -0.4071412 0.02022101 0.8533658
## [3,] 0.615601606 -4.07950347 0.4209342 -5.04696357 -2.1933635
## [4,] -1.656498876 -2.48974019 -1.6095018 -3.63565528 -1.4600342
## [5,] 0.196180999 4.97955388 -1.9247819 -5.34039780 5.6712667
## [6,] 0.256573043 -0.39819511 1.3331337 1.24296907 -1.4190493
## [7,] -0.809830950 3.64467828 -2.8346495 -5.23334966 4.5095516
## [8,] 1.088247565 -0.01285538 2.6460900 -1.99303124 -0.6491558
## [9,] 0.056423310 0.15004340 -0.9119325 1.00782696 0.7264345
## [10,] 1.916270721 -1.49256493 3.2010977 0.81293956 -1.9029778
## [11,] 2.520926424 -0.30069864 0.8597536 -0.83825431 1.1918057
## [12,] 0.060969314 -0.25003478 -0.6087940 -1.30517534 0.8292028
## [13,] -0.091159366 -1.48918920 0.6505017 6.13024401 -1.9710295
## [14,] 0.986789391 -5.12155979 2.8920573 -0.55141240 -4.9631199
## [15,] 1.428865361 -0.94168283 -0.3321451 -3.61244711 1.0486189
## [16,] 0.992994955 -0.13854780 -1.1307426 1.85581467 1.8537610
## [17,] 1.057609283 0.57366361 4.8993310 6.88698306 -3.1731174
## [18,] -0.174380829 2.62356397 2.5801133 3.44451453 -1.0209550
## [19,] -0.887996390 -1.17633746 1.0871463 0.32486896 -3.6090731
## [20,] 0.830753584 0.74564722 -1.6546886 1.11059549 2.4904740
## [21,] -0.215467142 0.60951175 -2.6260406 -2.91201987 2.5908361
## [22,] -1.042579443 0.84575249 2.9742729 2.00880459 -3.0333374
## [23,] -0.427074151 4.56203366 -1.8120988 1.65426268 4.8560474
## [24,] -0.015599325 2.97800095 1.7200639 -0.34904589 1.2151825
## [25,] 1.009876637 2.92644385 -0.1691742 1.78272531 3.3449461
## [26,] -0.362500082 -3.37245311 1.4102831 -2.52726232 -3.7527214
## [27,] -0.003330021 -0.01425753 0.9881537 2.53833235 0.5971061
## [28,] 0.374936923 0.58373092 0.4522197 -2.30717227 0.6914501
## [29,] -0.127511703 3.14025615 0.7430891 0.85002422 1.0622258
## [30,] -2.693056074 -4.57878724 -8.5365613 1.37913101 1.3381547
# Correlation of decisionmaker random slopes
Omega <- cor(matrix(rnorm(P*(P+2)), P+2, P))
# Scale of decisionmaker random slopes
tau <- abs(rnorm(P, 0, .5))
# Covariance matrix of decisionmaker random slopes
Sigma <- diag(tau) %*% Omega %*% diag(tau)
# Centers of random slopes
beta <- rnorm(P)
# Individual slopes
beta_i <- MASS::mvrnorm(I, beta, Sigma) + W %*% t(Gamma)
plot(as.data.frame(beta_i))
# Create X -- let's make this a dummy matrix
X <- matrix(sample(0:1, Tasks*I*J*P, replace = T), Tasks*I*J, P)
# Each of the rows in this matrix correspond to a choice presented to a given individual
# in a given task
indexes <- crossing(individual = 1:I, task = 1:Tasks, option = 1:J) %>%
mutate(row = 1:n())
# Write a Gumbel random number generator using inverse CDF trick
rgumbel <- function(n, mu = 0, beta = 1) mu - beta * log(-log(runif(n)))
mean(rgumbel(1e6))
## [1] 0.5804272
# This should be approximately equal to 0.577 -- the Euler–Mascheroni constant
# Ok, now we need to simulate choices. Each person in each task compares each
# choice according to X*beta_i + epsilon, where epsilon is gumbel distributed.
# They return their rankings.
ranked_options <- indexes %>%
group_by(individual, task) %>%
mutate(fixed_utility = as.numeric(X[row,] %*% as.numeric(beta_i[first(individual),])),
plus_gumbel_error = fixed_utility + rgumbel(n()),
rank = rank(-plus_gumbel_error),
# We're going to use the order rather than the rank in the Stan part of the model
order = order(rank),
# And here we create a dummy vector for the best choice
best_choice = as.numeric(1:n() == which.max(plus_gumbel_error))
)
tt <- ranked_options %>%
group_by(individual, task) %>%
summarise(start = min(row),
end = max(row)) %>%
ungroup %>%
mutate(task_number = 1:n())
// saved as mixed_conditional_individual_effects.stan
data {
int N; // number of rows
int T; // number of inidvidual-choice sets/task combinations
int I; // number of Individuals
int P; // number of covariates that vary by choice
int P2; // number of covariates that vary by individual
int K; // number of choices
vector<lower = 0, upper = 1>[N] choice; // binary indicator for choice
matrix[N, P] X; // choice attributes
matrix[I, P2] X2; // individual attributes
int task[T]; // index for tasks
int task_individual[T]; // index for individual
int start[T]; // the starting observation for each task
int end[T]; // the ending observation for each task
}
parameters {
vector[P] beta; // hypermeans of the part-worths
matrix[P, P2] Gamma; // coefficient matrix on individual attributes
vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
matrix[I, P] z; // individual random effects (unscaled)
cholesky_factor_corr[P] L_Omega; // the cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
// here we use the reparameterization discussed on slide 30
matrix[I, P] beta_individual = rep_matrix(beta', I) + X2 * Gamma' + z*diag_pre_multiply(tau, L_Omega);
}
model {
// create a temporary holding vector
vector[N] log_prob;
// priors on the parameters
tau ~ normal(0, .5);
beta ~ normal(0, .5);
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(4);
to_vector(Gamma) ~ normal(0, 1);
// log probabilities of each choice in the dataset
for(t in 1:T) {
vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
// add utility from product attributes with individual part-worths/marginal utilities
utilities = X[start[t]:end[t]]*beta_individual[task_individual[t]]';
log_prob[start[t]:end[t]] = log_softmax(utilities);
}
// use the likelihood derivation on slide 29
target += log_prob' * choice;
}
// saved as ranked_rcl.stan
functions {
real rank_logit_lpmf(int[] rank_order, vector delta) {
// We reorder the raw utilities so that the first rank is first, second rank second...
vector[rows(delta)] tmp = delta[rank_order];
real out;
// ... and sequentially take the log of the first element of the softmax applied to the remaining
// unranked elements.
for(i in 1:(rows(tmp) - 1)) {
if(i == 1) {
out = tmp[1] - log_sum_exp(tmp);
} else {
out += tmp[i] - log_sum_exp(tmp[i:]);
}
}
// And return the log likelihood of observing that ranking
return(out);
}
}
data {
int N; // number of rows
int T; // number of inidvidual-choice sets/task combinations
int I; // number of Individuals
int P; // number of covariates that vary by choice
int P2; // number of covariates that vary by individual
int K; // number of choices
int rank_order[N]; // The vector describing the index (within each task) of the first, second, third, ... choices.
// In R, this is order(-utility) within each task
matrix[N, P] X; // choice attributes
matrix[I, P2] X2; // individual attributes
int task[T]; // index for tasks
int task_individual[T]; // index for individual
int start[T]; // the starting observation for each task
int end[T]; // the ending observation for each task
}
parameters {
vector[P] beta; // hypermeans of the part-worths
matrix[P, P2] Gamma; // coefficient matrix on individual attributes
vector<lower = 0>[P] tau; // diagonal of the part-worth covariance matrix
matrix[I, P] z; // individual random effects (unscaled)
cholesky_factor_corr[P] L_Omega; // the cholesky factor of the correlation matrix of tastes/part-worths
}
transformed parameters {
// here we use the reparameterization discussed on slide 30
matrix[I, P] beta_individual = rep_matrix(beta', I) + X2 * Gamma' + z * diag_pre_multiply(tau, L_Omega);
}
model {
// priors on the parameters
tau ~ normal(0, .5);
beta ~ normal(0, 1);
to_vector(z) ~ normal(0, 1);
L_Omega ~ lkj_corr_cholesky(4);
to_vector(Gamma) ~ normal(0, 1);
// log probabilities of each choice in the dataset
for(t in 1:T) {
vector[K] utilities; // tmp vector holding the utilities for the task/individual combination
// add utility from product attributes with individual part-worths/marginal utilities
utilities = X[start[t]:end[t]]*beta_individual[task_individual[t]]';
rank_order[start[t]:end[t]] ~ rank_logit(utilities);
}
}
library(rstan)
data_list_best_choice <-list(N = nrow(X),
T = nrow(tt),
I = I,
P = P,
P2 = P2,
K = J,
# NOTE!! This is the tricky bit -- we use the order of the ranks (within task)
# Not the raw rank orderings. This is how we get the likelihood evaluation to be pretty quick
choice = ranked_options$best_choice,
X = X,
X2 = W,
task = tt$task_number,
task_individual = tt$individual,
start = tt$start,
end = tt$end)
data_list_ranked_rcl <- list(N = nrow(X),
T = nrow(tt),
I = I,
P = P,
P2 = P2,
K = J,
# NOTE!! This is the tricky bit -- we use the order of the ranks (within task)
# Not the raw rank orderings. This is how we get the likelihood evaluation to be pretty quick
rank_order = ranked_options$order,
X = X,
X2 = W,
task = tt$task_number,
task_individual = tt$individual,
start = tt$start,
end = tt$end)
compiled_best_choice_model <- stan_model("mixed_conditional_individual_effects.stan")
compiled_ranked_choice_model <- stan_model("ranked_rcl.stan")
best_choice_fit <- sampling(compiled_best_choice_model,
data = data_list_best_choice,
cores = 4, iter = 600)
ranked_choice_fit <- sampling(compiled_ranked_choice_model,
data = data_list_ranked_rcl,
cores = 4, iter = 600)
Let’s check that the two models get similar parameter estimates
plot(get_posterior_mean(best_choice_fit, "beta_individual")[,5], get_posterior_mean(ranked_choice_fit,"beta_individual")[,5])
abline(0, 1)
That’s a good sign! Let’s check out the difference in precision of the estimates across models.
as.data.frame(best_choice_fit, pars = "beta_individual") %>%
gather(Parameter, Value) %>%
group_by(Parameter) %>%
summarise(median = median(Value),
lower = quantile(Value, .05),
upper = quantile(Value, .95)) %>%
mutate(individual = str_extract(Parameter, "[0-9]{1,2},") %>% parse_number,
column = str_extract(Parameter, ",[0-9]{1,2}") %>% parse_number) %>%
arrange(individual, column) %>%
mutate(`True value` = as.numeric(t(beta_i))) %>%
ggplot(aes(x = `True value`)) +
geom_linerange(aes(ymin = lower, ymax = upper), colour = "green", alpha = 0.3) +
geom_point(aes(y = median), colour = "green", alpha = 0.5) +
ggthemes::theme_pander() +
geom_abline(aes(intercept = 0, slope = 1)) +
labs(y = "Estimates",
title = "Estimates from best choice model",
subtitle = "With interior 90% credibility intervals")
as.data.frame(ranked_choice_fit, pars = "beta_individual") %>%
gather(Parameter, Value) %>%
group_by(Parameter) %>%
summarise(median = median(Value),
lower = quantile(Value, .05),
upper = quantile(Value, .95)) %>%
mutate(individual = str_extract(Parameter, "[0-9]{1,2},") %>% parse_number,
column = str_extract(Parameter, ",[0-9]{1,2}") %>% parse_number) %>%
arrange(individual, column) %>%
mutate(`True value` = as.numeric(t(beta_i))) %>%
ggplot(aes(x = `True value`)) +
geom_linerange(aes(ymin = lower, ymax = upper), colour = "green", alpha = 0.3) +
geom_point(aes(y = median), colour = "green", alpha = 0.5) +
ggthemes::theme_pander() +
geom_abline(aes(intercept = 0, slope = 1)) +
labs(y = "Estimates",
title = "Estimates from ranked choice model",
subtitle = "With interior 90% credibility intervals")
And there we go – we can clearly see that the precision of estimates of the ranked choice model is much greater, which we expect given we’re extracting more information from study participants.