This week at work I had an excellent problem come across my desk. A client, for whom we had done some modeling work, sent across a plot that looked a little like the one below (this is all fake data).
The client wanted to know if they should be concerned that the distribution of actual outcomes (the top panel) looks nothing like the distribution of predictions (the bottom panel).
If you’ve never played with models before and you saw this chart, your natural response might be to think that the model was terrible. And a terrible model might well produce predictions that look nothing like the actual outcomes. Yet in this case, writing off the model because the two plots differ would be a mistake. The predictions in the chart above actually come from the exact same model that I used to generate the “Actual outcome” data—it is impossible to get a better set of predictions than those summarized in the bottom panel. So what’s going on? Let’s step through it.
When modeling proportions, we often get weird-looking distributions like the one in the top panel above: spikes at 0 and 1, with some distribution of continuous outcomes in between. In this particular example, we have a spike at 1 and a set of proportions roughly between 0.3 and 0.9, skewed towards the higher values in that range.
A popular way of modeling such distributions is to break the problem up using a generative model. A generative model asks: what is a plausible mathematical structure that could give rise to the data we observe? One way of breaking up this problem is to posit two processes: one that determines whether the outcome is exactly 1, and another that generates the continuous values below 1.
Let’s get specific. If \(\beta_1\), \(\beta_2\), and \(\beta_3\) are coefficient vectors, \(\mu_1\), \(\mu_2\), and \(\mu_3\) are intercept terms, and \(X\) is a design matrix containing our predictive features, then we might say that the data generating process for an outcome \(Y_{i}\) is:
Step 1:
\[ I(Y_{i}=1) \sim \mbox{Bernoulli}(\mbox{Logit}^{-1}(\mu_{1} + X_{i}\beta_{1})) \]
where \(I()\) is the indicator function, and Step 2, in the case that \(Y_{i} < 1\):
\[ Y_{i} \sim \mbox{Beta}(\mu_{2} + X_{i}\beta_{2},\, \mu_{3} + X_{i}\beta_{3}) \]
Note that both Beta parameters must be positive; in the simulation below the parameter values are chosen so that the linear predictors effectively stay positive, while in the Stan model we will exponentiate them to enforce the constraint.
For those who’ve not used it before, the \(\mbox{Beta}(\alpha, \beta)\) distribution is highly flexible: it takes two positive parameters and assigns weight to outcomes between 0 and 1, with no probability of outcomes outside that interval. The Wikipedia entry is helpful.
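To get a feel for that flexibility, here is a quick sketch (my addition, not part of the original analysis) plotting the Beta density for a few parameter pairs:
# Illustration only: the Beta density for a few (alpha, beta) pairs
curve(dbeta(x, 2, 8), from = 0, to = 1, ylab = "density")  # skewed towards 0
curve(dbeta(x, 8, 2), add = TRUE, lty = 2)                 # skewed towards 1
curve(dbeta(x, 0.5, 0.5), add = TRUE, lty = 3)             # U-shaped
curve(dbeta(x, 1, 1), add = TRUE, lty = 4)                 # flat (uniform)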
A nice thing about using a model like this for proportions is that it assigns some probability to the outcome being exactly 1 for every value of \(\mu_{1} + X_{i}\beta_{1}\), and spreads the remaining probability over outcomes strictly between 0 and 1. If you also had a spike at 0, you could simply replace the first model with a multinomial logit model, and it would have the same benefits, as sketched below.
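To make that extension concrete, here is a minimal sketch (my own illustration; every number in it is made up): step 1 becomes a three-way categorical draw over {exactly 0, exactly 1, continuous}, with softmax probabilities in place of the inverse logit.
# Hypothetical sketch: step 1 with spikes at both 0 and 1
x_i  <- 0.3                              # a single covariate value
eta  <- c(zero = -2   + 0.1 * x_i,       # linear predictor for Y = 0
          one  = -1.5 + 0.5 * x_i,       # linear predictor for Y = 1
          cont = 0)                      # baseline: continuous outcome
p    <- exp(eta) / sum(exp(eta))         # softmax probabilities
draw <- sample(names(eta), size = 1, prob = p)
# If draw == "cont", step 2 proceeds with the Beta draw as before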
When we build generative models, we treat the world as observed as just a single draw from a universe of potential draws; our generative model is a probability density over those potential draws. So we can generate one draw from the model like so (in R):
# Load the relevant libraries
library(reshape2); library(arm); library(ggplot2); library(dplyr)
# Set some known parameter values
# We'll just assume a single covariate x
# parameters for the logit
beta11 <- -1.5
beta12 <- 0.5
# parameters for the beta
beta21 <- 5
beta22 <- 0.5
beta31 <- 2
beta32 <- -0.3
# Number of observations
N <- 1000
# Generate our predictor x
x <- rnorm(N)
# Generate some draws from the beta distribution for the continuous outcome
continuous <- rbeta(N, beta21 + beta22*x, beta31 + beta32*x)
# Generate the probabilities for the binary outcome
prob_binary <- invlogit(beta11 + beta12*x)
# Generate the draws for the binary outcome
binary <- rbinom(N, size = 1, prob = prob_binary)
# The outcome as defined
outcome <- ifelse(binary==1, binary, continuous)
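As a quick sanity check (my addition), the share of exact 1s in the fake data should be close to the average of the Bernoulli probabilities:
# These two quantities should be close
mean(outcome == 1)
mean(prob_binary)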
Now we can plot the fake outcome data that we’ve created.
data_frame(outcome) %>%
  ggplot(aes(x = outcome)) +
  geom_histogram(alpha = 0.3) +
  ggthemes::theme_economist() +
  ggtitle("One draw from our generative model")
So now let’s generate predictions of \(y\) for each value of \(x\). How might we do this? Two methods could work: we could calculate the expected value of \(y\) given \(x\) under the model, or we could simulate whole plausible outcomes from the model.
In this toy example, we will do the first. It happens that the expected value of a Beta distribution with parameters \(\alpha,\, \beta>0\) is \(\frac{\alpha}{\alpha + \beta}\). The expected value from a Bernoulli logit model is simply the probability, here \(\mbox{Logit}^{-1}\left(\mu_{1} + x_{i}\beta_{1}\right)\). So for a given value of \(x\), our prediction is just the average of the two models’ expected values, weighted by the probability of each:
\[ E[y_{i} \mid x_{i}] = \left(1 - \mbox{Logit}^{-1}\left(\mu_{1} + x_{i}\beta_{1}\right)\right) \times \frac{\mu_{2} + x_{i}\beta_{2}}{\mu_{2} + x_{i}\beta_{2} + \mu_{3} + x_{i}\beta_{3}} + \mbox{Logit}^{-1}\left(\mu_{1} + x_{i}\beta_{1}\right) \]
Now if we plot the density of these expected values, we get something that looks nothing like the density of our data. Why? The expected value averages across the two outcomes in proportion to their probability, so only those observations with an extremely high probability of having a proportion equal to 1 will have an expected value anywhere near 1.
expected_value <- (1 - prob_binary) * ((beta21 + beta22*x) / (beta21 + beta22*x + beta31 + beta32*x)) + prob_binary
data_frame(expected_value) %>%
  ggplot(aes(x = expected_value)) +
  geom_histogram(alpha = 0.3) +
  ggthemes::theme_economist() +
  ggtitle("Predicted values given x") +
  xlim(0, 1)
So what should we do instead? The second method—in three words: posterior predictive checking. This involves simulating outcomes from your model, incorporating any uncertainty you have about the model’s parameters after estimating it. The steps are to estimate the model, then, for each posterior draw of the parameters, simulate a full set of outcomes, and finally compare those simulated outcomes with the actual outcomes.
Let’s use the realized \(x\) and \(y\) to estimate this model in Stan and generate posterior predictive draws.
# A Stan program to estimate the model described above
beta_logit_model <- "
data {
  int N; // number of observations
  int P; // number of explanatory variables
  int N2; // number of observations to predict
  vector[N] Y; // the outcome: proportions, some exactly equal to 1
  matrix[N, P] X; // the explanatory variables (make sure none are known only in the future!)
  matrix[N2, P] X_new; // explanatory variables for out-of-sample predictions
}
transformed data {
  // we need an integer version and a real version of Y
  // (Y_adj is only here for flexibility of extension)
  vector[N] Y_adj;
  int Y_int[N];
  for(i in 1:N) {
    if(Y[i] == 1.0) {
      Y_adj[i] = 1.0;
      Y_int[i] = 1;
    } else {
      Y_adj[i] = Y[i];
      Y_int[i] = 0;
    }
  }
}
parameters {
  matrix[3, P] beta;
  vector[3] mu;
}
model {
  // priors
  to_vector(beta) ~ normal(0, .2);
  mu ~ normal(0, 1);
  // likelihood
  Y_int ~ bernoulli_logit(mu[1] + X*beta[1]');
  for(i in 1:N) {
    if(Y_int[i] == 0) {
      // we constrain alpha and beta to be positive by taking exp()
      Y_adj[i] ~ beta(exp(mu[2] + X[i]*beta[2]'), exp(mu[3] + X[i]*beta[3]'));
    }
  }
}
generated quantities {
  vector[N2] Y_predict;
  int tmp;
  for(i in 1:N2) {
    tmp = bernoulli_rng(inv_logit(mu[1] + X_new[i]*beta[1]'));
    if(tmp == 1) {
      Y_predict[i] = 1.0;
    } else {
      Y_predict[i] = beta_rng(exp(mu[2] + X_new[i]*beta[2]'), exp(mu[3] + X_new[i]*beta[3]'));
    }
    // flag any numerical failures with a value outside [0, 1]
    if(is_nan(Y_predict[i])) {
      Y_predict[i] = 1.1;
    }
  }
}
"
Now we estimate the model:
library(rstan)
options(mc.cores = parallel::detectCores())
beta_logit_estimation <- stan(model_code = beta_logit_model,
                              data = list(N = N,
                                          P = 1,
                                          N2 = N,
                                          Y = outcome,
                                          X = matrix(x, N, 1),
                                          X_new = matrix(x, N, 1)),
                              iter = 600)
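Before plotting, it is worth a quick convergence check (my addition, not in the original walkthrough):
# Rhat values close to 1 suggest the chains have mixed well
print(beta_logit_estimation, pars = c("mu", "beta"))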
And finally we plot the posterior replications.
predictions <- as.data.frame(beta_logit_estimation) %>%
  select(contains("Y_predict")) %>%
  melt()
predictions %>%
  ggplot() +
  geom_line(aes(x = value, group = variable), colour = "orange", stat = "density", alpha = 0.1, adjust = .8) +
  geom_density(data = data_frame(outcome), aes(x = outcome), colour = "black", adjust = 0.8) +
  ggthemes::theme_economist() +
  ggtitle("Actual outcomes and posterior predictive replications") +
  annotate("text", x = 0.2, y = 5, label = "Density of actual outcomes", hjust = 0) +
  annotate("text", x = 0.2, y = 3.5, label = "Posterior replications", colour = "orange", hjust = 0)
And there we have it. For each value of \(x\), we have generated many plausible outcomes, and it is these we should be comparing against the actual outcomes, not the average of the many draws.
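As one final numeric check (a sketch of my own, reusing the objects created above), we can compare the share of exact 1s in the posterior replications with the share in the actual data:
# Share of exact 1s: replications vs. actual outcomes
mean(predictions$value == 1)
mean(outcome == 1)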