Many researchers appear to want to use Bayesian analysis when analysing their experiments, but many never actually do, possibly because of a perception that it is difficult. There is nothing wrong with that. It is, after all, possible to have a Bayesian outlook on the world (updating your priors about how the world works when new evidence comes to light) without carrying out explicit Bayesian modeling. But it would be better if analysts used tools consistent with their beliefs.
In this post I’ll show you how to estimate an experimental treatment effect using linear regression, while incorporating prior information from previous studies. Rather than doing this in stages (estimating the treatment effect and then doing some meta-analysis), we’ll do everything in one pass. This has the advantage of helping us to get more precise estimates of all our model parameters.
Let’s say that we run the \(J\)’th experiment estimating the effect of some treatment \(x\) on an outcome \(Y\). It’s an expensive and ethically challenging experiment to run, so unfortunately we’re only able to get a sample size of 20. For simplicity, we assume that the treatment has the same effect, \(\theta\), for all people (an assumption that is easily dropped in more elaborate analyses). There have been \(J-1\) similar experiments run in the past. In this example our outcome data \(Y\) are conditionally normally distributed with (untreated) mean \(\mu\) and standard deviation \(\sigma\), though there is nothing to stop us from using a far more complex model for the data. So the outcome model looks like this:
\[ y_{i, J} \sim \mbox{Normal}(\mu + \theta_{J} x_{i,J}, \sigma) \]
The question is: how can we estimate the parameters of this model while taking account of the information from the \(J-1\) previous studies? The answer is to use the so-called hierarchical prior.
Let’s say that each of the \(J-1\) previous studies has an estimated treatment effect \(\beta_{j}\), reported with some standard error \(se_{j}\). Taken together, should we treat these estimates \(\beta_{j}\) as the ground truth for the true treatment effect in their respective studies? One way of answering this is to ask ourselves: if the researchers of each of those studies replicated their study in precisely the same way, but after seeing the estimates obtained by the other researchers, would they expect to find the same estimate they found before, \(\beta_{j}\)? Or would they perhaps expect some other treatment effect estimate, \(\theta_{j}\), that balances the information from their own study with the information from the other studies?
The answer to this question gives rise to the hierarchical prior. In this prior, we say that each estimated treatment effect \(\beta_{j}\) is a noisy measure of the underlying treatment effect \(\theta_{j}\) for study \(j\). These underlying effects are in turn noisy estimates of the true average treatment effect \(\hat{\theta}\), noisy because of uncontrolled-for variation across experiments. That is, if we make assumptions of normality:
\[ \beta_{j} \sim \mbox{Normal}(\theta_{j}, se_{j}) \]
and
\[ \theta_{j} \sim \mbox{Normal}\left(\hat{\theta}, \tau\right) \]
where \(\tau\) is the standard deviation of the distribution of plausible experimental estimates.
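To make this two-level structure concrete, here is a minimal simulation sketch of the generative process the prior describes (the number of studies, hyperparameters, and standard errors below are made up purely for illustration):

# A small sketch of the generative process implied by the hierarchical prior
# (all values here are arbitrary and only for illustration)
set.seed(123)
J_sim <- 8                                          # made-up number of previous studies
theta_hat_sim <- 8; tau_sim <- 5                    # made-up hyperparameters
se_sim <- runif(J_sim, 5, 15)                       # made-up standard errors
theta_sim <- rnorm(J_sim, theta_hat_sim, tau_sim)   # study-level effects theta_j
beta_sim <- rnorm(J_sim, theta_sim, se_sim)         # noisy estimates beta_j each study reports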
The analysis therefore has the following steps:

1. Model each previous study's estimate \(\beta_{j}\) as a noisy measure of that study's underlying effect \(\theta_{j}\).
2. Model the study-level effects \(\theta_{j}\) (including \(\theta_{J}\) for our new experiment) as draws from a common distribution with mean \(\hat{\theta}\) and standard deviation \(\tau\).
3. Model our raw outcome data \(y\) with the regression above, which uses \(\theta_{J}\) as its treatment effect.
4. Estimate all of the parameters jointly in a single model.
As an example, we’ll take the original 8-schools data, with some fake data for the experiment we want to estimate. The 8-schools example comes from an education intervention modeled by Rubin, in which a similar experiment was conducted in 8 schools, with only treatment effects and their standard errors reported. The task is to generate an estimate of the possible treatment effect that we might expect if we were to roll out the program across all schools.
library(rstan); library(dplyr); library(ggplot2); library(reshape2)
# The original 8 schools data
schools_dat <- data_frame(beta = c(28, 8, -3, 7, -1, 1, 18, 12),
                          se = c(15, 10, 16, 11, 9, 11, 10, 18))
# The known parameters of our data generating process for fake data
mu <- 10
sigma <- 5
N <- 20
# The "true" treatment effect in our fake study, drawn from the posterior of the 8 schools example
theta_J <- rnorm(1, 8, 6.45)
# Create some fake data
treatment <- sample(0:1, N, replace = T)
y <- rnorm(N, mu + theta_J*treatment, sigma)
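Before fitting the model, a quick sanity check of the simulated data never hurts; for instance, group sizes and raw means by treatment status, using the dplyr functions already loaded (this is just a check, not part of the model):

# Sanity check on the fake data: group sizes and raw means by treatment status
data_frame(y = y, treatment = treatment) %>%
  group_by(treatment) %>%
  summarise(n = n(), mean_y = mean(y))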
The Stan program we use to estimate the model is below. Note that these models can be difficult to fit, and so we employ a “reparameterization” below for the thetas. This is achieved by noticing that if
\[ \theta_{j} \sim \mbox{Normal}\left(\hat{\theta}, \tau\right) \] then \[ \theta_{j} = \hat{\theta} + \tau z_{j} \]
where \(z_{j}\sim\mbox{Normal}(0,1)\). A standard normal has an easier geometry for Stan to work with, so this parameterization of the model is typically preferred. Here is the Stan model:
// We save this as 8_schools_w_regression.stan
data {
  int<lower=0> J;          // number of schools
  int N;                   // number of observations in the regression problem
  real beta[J];            // estimated treatment effects from previous studies
  real<lower=0> se[J];     // s.e. of those effect estimates
  vector[N] y;             // the outcomes for students in our fake study data
  vector[N] treatment;     // the treatment indicator in our fake study data
}
parameters {
  real mu;
  real<lower=0> tau;
  real eta[J+1];
  real<lower=0> sigma;
  real theta_hat;
}
transformed parameters {
  real theta[J+1];
  for (j in 1:(J+1)) {
    theta[j] = theta_hat + tau * eta[j];
  }
}
model {
  // priors
  tau ~ cauchy(5, 2);
  mu ~ normal(10, 2);
  eta ~ normal(0, 1);
  sigma ~ cauchy(3, 3);
  theta_hat ~ normal(8, 5);

  // parameter model for previous studies
  for (j in 1:J) {
    beta[j] ~ normal(theta[j], se[j]);
  }

  // our regression
  y ~ normal(mu + theta[J+1] * treatment, sigma);
}
Now we estimate the model from R. Because of the geometry issues mentioned above, we use control = list(adapt_delta = 0.99) to prompt Stan to take smaller step sizes, improving sampling performance at the cost of slower estimation (this isn’t a problem here; the model fits in a couple of seconds).
eight_schools_plus_regression <- stan("8_schools_w_regression.stan",
                                      data = list(beta = schools_dat$beta,
                                                  se = schools_dat$se,
                                                  J = 8,
                                                  y = y,
                                                  N = N,
                                                  treatment = treatment),
                                      control = list(adapt_delta = 0.99))
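Before interpreting the output, it is worth a quick check that the chains have mixed well. One of several diagnostics you might run is a trace plot of the parameters of interest:

# Visual check of chain mixing for the parameters we care about
traceplot(eight_schools_plus_regression, pars = c("theta_hat", "tau", "theta[9]"))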
Let’s compare the estimates we get from a simple linear regression to those we get from the Bayesian model. The simple linear regression gives us the following confidence intervals for the parameter estimates:
| coef | estimate | 2.5 % | 97.5 % |
|---|---|---|---|
| mu | 10.114894 | 6.877282 | 13.35251 |
| theta[9] | 9.201593 | 4.835998 | 13.56719 |
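For reference, a table like the one above can be produced with ordinary least squares; in that output the intercept plays the role of mu and the treatment coefficient the role of theta[9] (the object name below is my own):

# Frequentist comparison: OLS point estimates and 95% confidence intervals
ols_fit <- lm(y ~ treatment)
cbind(estimate = coef(ols_fit), confint(ols_fit))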
Our Bayesian model gives us more precise estimates for the treatment effect, with a considerably narrower 95% credible interval. This is because we have “borrowed” information from the previous studies when estimating the treatment effect in the latest study. The estimates are also closer to the group-level mean.
print(eight_schools_plus_regression, pars = c("mu", "theta[9]", "theta_hat"), probs = c(0.025, 0.5, 0.975))
## Inference for Stan model: 8_schools_w_regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## mu 10.14 0.02 1.20 7.77 10.15 12.48 4000 1
## theta[9] 9.07 0.03 1.76 5.66 9.10 12.65 4000 1
## theta_hat 8.49 0.05 2.69 3.11 8.55 13.81 2407 1
##
## Samples were drawn using NUTS(diag_e) at Sat Sep 3 00:36:59 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
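Finally, since ggplot2 and reshape2 were loaded at the start, one way to visualise the comparison is to plot the posterior for our study's effect, theta[9], against the group-level mean, theta_hat. This is only a sketch; the object names are my own:

# Compare the posterior for our study's effect with the group-level mean
draws <- extract(eight_schools_plus_regression, pars = c("theta", "theta_hat"))
comparison <- data.frame(`theta[9]` = draws$theta[, 9],
                         theta_hat = draws$theta_hat,
                         check.names = FALSE)
comparison_long <- melt(comparison)
ggplot(comparison_long, aes(x = value, fill = variable)) +
  geom_density(alpha = 0.5) +
  labs(x = "Treatment effect", fill = "Parameter")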
I hope this helps you implement hierarchical priors in your next research project. If you’ve any questions, or you’ve spotted errors in my reasoning or code, please comment!