The standard implementations of instrumental variables regression, using either linear algebra (2SLS/IV) or optimization (GMM) are well known, and there is no shortage of good resources on their use. Yet IV regression does have a few notable issues, especially in the presence of weak instruments, small samples, or outliers. If you have access to good quality prior information about effect sizes, then using Bayesian instrumental variables regression can help you with these issues. This post walks you through how to implement Bayesian IV in Stan. The particular implementation I’ll walk you through shows you how to use meta-analytic hierarchical priors. That is, priors for your treatment effect that come from previous studies of the same effect. These can be useful for when you want to both borrow power from previous research, and evaluate how much your new observations update a resonable observer’s posterior of the effect size given previous research.
The statistical model
The standard instrumental variables linear model has, for each observation \(i\), an outcome \(y_{i}\). We want to know the effect of the shift of some endogenous treatments \(Y_{i}\) on this outcome. We have access to a vector of pre-treatment information \(X_{i}\), and some instruments that exogenously vary \(Y_{i}\), \(Z_{i}\). There must be as many or more instruments in \(Z\) than there are endogenous variables in \(Y\).
The generative model is
\[ Y_{i} = A + \Gamma X_{i} + \Delta Z_{i} + \eta_{i} \]
(the first stage) and
\[ y_{i} = \alpha + Y_{i}\beta + X_{i}\delta + \epsilon_{i} \]
(the second stage). \(A\) and \(\alpha\) are intercepts, and \(\Gamma\), \(\Delta\), \(\delta\), and \(\beta\) are regression coefficients. \(\beta\) is the what we are interested in.
The important thing to note is that the errors \(\eta\) and \(\epsilon\) are correlated, which is why you can’t just estimate the second stage alone. To abuse the language a little, the unobserved things that cause your treatment (\(\eta\)) are related to the unobserved things that cause your outcome (\(\epsilon\)). But under the Gauss-Markov theorem, we require strict exogeneity (errors of a linear regression must be uncorrelated with the covariates). That’s where the bias we’re trying to correct comes from.
The Bayesian approach to this model is to specify the joint likelihood of the endogenous regressors and the outcome, then put priors on the unknowns.
Let’s call \(\hat{Y}_{i} = (y_{i}, Y_{i}')'\), \(\mu_{i} = (\mu_{1i}, \mu_{2i}')'\) with \(\mu_{1i} = Y_{i}\beta + X_{i}\delta\) and \(\mu_{2i} = \Gamma X_{i} + \Delta Z_{i}\).
Now we assume that the residuals are either jointly normal or jointly Student-t (which will be more robust to outliers).
\[ (\epsilon_{i}, \eta_{i}')' \sim \text{Multi Normal}(0, \text{diag}(\tau)\, \Omega\, \text{diag}(\tau)) \] Where \(\tau\) is the scale of the residuals and \(\Omega\) is their correlation matrix. We like this parameterization (rather than specifying the covariance matrix of the residuals directly) as it’s easier to give informed priors to the scale and correlation of the errors. If we want to use a multivariate Student’s t distribution, we just specify
\[ (\epsilon_{i}, \eta_{i}')' \sim \text{Multi Student t}(\nu, 0, \text{diag}(\tau)\, \Omega\, \text{diag}(\tau)) \] where \(\nu\) is the degrees of freedom parameter to be estimated; small values of this indicate fat tails.
This gives us the generative model for the observed data, from which we can derive the likelihood:
\[ \hat{Y}_{i} \sim \text{Multi normal}(\mu_{i}, \text{diag}(\tau)\, \Omega\, \text{diag}(\tau)) \]
Note that for this to be a truly generative model we ought to be able to actually generate data with it. That means having priors for the unknowns. We typically start with fairly non-informative priors on the scale of the residuals; something like
\[ \tau \sim \text{Cauchy}_{+}(0, 2) \] for data of order-of-magnitude around 1. Then we give a prior to the correlation matrix, normally an LKJ prior with degrees of freedom inversely proportional to the degree of endogeneity we think are in the data. If we think there’s a fair bit of endogeneity, then a prior degrees of freedom around 2 might be reasonable.
\[ \Omega \sim \text{LKJ}(2) \]
For covariates that have been scaled, I’ll normally use a prior like \(\text{Normal}(0, 1)\) for \(\delta\), \(\Delta\) and \(\Gamma\)
\[ \text{vec}(\Delta),\, \text{vec}(\Gamma),\, \delta \sim \text{Normal}(0, 1) \]
And finally, the prior on \(\beta\)–the whole reason we’re going through this pain!
Imagine you’re not the first researcher to investigate some causal effect and there exists a literature of other estimates of the same thing. Each of these studies (there are \(J\) of these, indexed by \(j\)) is trying to measure \(\bar{\beta}\), but due to study idiosyncracies, the true effect in each study is \(\beta_{j} \sim \text{Normal}(\bar{\beta}, \sigma_{beta})\). Each study generates a noisy measure of this estimate, \(b_{j}\), which has its own standard error, \(se_{j}\). I go over this in a bit more detail here.
The generative model here is
\[ b_{j} \sim \text{Normal}(\beta_{j}, se_{j}) \] and
\[ \beta_{j} \sim \text{Normal}(\bar{\beta}, \sigma_{beta}) \]
Now if we want to use this hierarchical prior in our instrumental variables model, all we need to do is say that our \(\beta\), let’s call it \(\beta_{J+1}\), comes from the same parent distribution, that is
\[ \beta_{J+1} \sim \text{Normal}(\bar{\beta}, \sigma_{beta}) \]
And that is our hierarchical prior.
Below I’ll give you code for two versions of the model. The first one has a boring prior on \(\beta\) but assumes multiple endogenous regressors. The second assumes a single endogenous regressor and uses the hierarchical prior. You could of course do both, but the sorts of situations I’m thinking about you having good prior information is more when you have a single effect you care about, not many.
Note that in this implementation we allow you to provide a flag for whether or not you’d like it to estimate; if you don’t estimate, it’ll generate fake data, This is an excellent way to check that the model code is correct. I’ll show you how to do that below.
No hierarchical priors, multiple endogenous regressors
data {
int N;
int PX; // dimension of exogenous covariates
int PN; // dimension of endogenous covariates
int PZ; // dimension of instruments
matrix[N, PX] X_exog; // exogenous covariates
matrix[N, PN] X_endog; // engogenous covariates
matrix[N, PZ] Z; // instruments
vector[N] Y_outcome; // outcome variable
int<lower=0,upper=1> run_estimation; // simulate (0) or estimate (1)
}
transformed data {
matrix[N, 1 + PN] Y;
Y[,1] = Y_outcome;
Y[,2:] = X_endog;
}
parameters {
vector[PX + PN] gamma1;
matrix[PX + PZ, PN] gamma2;
vector[PN + 1] alpha;
vector<lower = 0>[1 + PN] scale;
cholesky_factor_corr[1 + PN] L_Omega;
}
transformed parameters {
matrix[N, 1 + PN] mu; // the conditional means of the process
mu[:,1] = rep_vector(alpha[1], N) + append_col(X_endog,X_exog)*gamma1;
mu[:,2:] = rep_matrix(alpha[2:]', N) + append_col(X_exog, Z)*gamma2;
}
model {
// priors
to_vector(gamma1) ~ normal(0, 1);
to_vector(gamma2) ~ normal(0, 1);
to_vector(alpha) ~ normal(0, 1);
scale ~ cauchy(0, 1);
L_Omega ~ lkj_corr_cholesky(4);
// likelihood
if(run_estimation ==1){
for(n in 1:N) {
Y[n] ~ multi_normal_cholesky(mu[n], diag_pre_multiply(scale, L_Omega));
}
}
}
generated quantities {
matrix[N, 1 + PN] Y_simulated;
for(n in 1:N) {
Y_simulated[n, 1:(1+PN)] = multi_normal_cholesky_rng(mu[n]', diag_pre_multiply(scale, L_Omega))';
}
}
One endogenous regressor, hierarchical prior
// saved as classic_iv_hierarchical_prior.stan
data {
int N;
int PX; // dimension of exogenous covariates
int PZ; // dimension of instruments
int J; // number of previous studies
vector[J] b_j; // previous estimates
vector<lower = 0>[J] se_j; // standard error of previous estimates
matrix[N, PX] X_exog; // exogenous covariates
vector[N] X_endog; // engogenous covariates
matrix[N, PZ] Z; // instruments
vector[N] Y_outcome; // outcome variable
int<lower=0,upper=1> run_estimation; // simulate (0) or estimate (1)
}
transformed data {
matrix[N, 2] Y;
Y[,1] = X_endog;
Y[,2] = Y_outcome;
}
parameters {
// intercepts
vector[2] alpha;
// hierarchical prior parameters
vector[J] z;
real beta_hat;
real<lower = 0> sigma_beta;
// first stage coefficients
vector[PZ] Delta;
vector[PX] Gamma;
// second stage coefficients
vector[PX] delta;
real beta_ours;
// covariance matrix parameters
cholesky_factor_corr[2] L_Omega;
vector<lower = 0>[2] tau;
}
transformed parameters {
matrix[N, 2] mu; // the conditional means of the process
vector[J] beta_j;
// first stage
mu[:,1] = rep_vector(alpha[1], N) + append_col(X_exog, Z)*append_row(Delta, Gamma);
// second stage
mu[:,2] = rep_vector(alpha[2], N) + append_col(X_endog,X_exog) * append_row(beta_ours, delta);
// this the non-centered parameterization (if beta_j ~ normal(beta_hat, sigma_beta) then
// beta_j = beta_hat + sigma_beta *z with z ~ normal(0, 1))
for(j in 1:J) {
beta_j[j] = beta_hat + sigma_beta * z[j];
}
}
model {
// intercept prior
alpha ~ normal(0, 1);
// first stage priors
Delta ~ normal(0, 1);
Gamma ~ normal(0, 1);
// second stage priors
delta ~ normal(0, 1);
// hierarchical prior
z ~ normal(0, 1);
for(j in 1:J) {
b_j[j] ~ normal(beta_j[j], se_j[j]);
}
beta_ours ~ normal(beta_hat, sigma_beta);
// Covarince matrix prior
tau ~ cauchy(0, 2);
L_Omega ~ lkj_corr_cholesky(2);
// likelihood
if(run_estimation ==1){
for(n in 1:N) {
Y[n] ~ multi_normal_cholesky(mu[n], diag_pre_multiply(tau, L_Omega));
}
}
}
generated quantities {
vector[N] y_sim;
vector[N] x_endog;
for(n in 1:N) {
vector[2] error;
error = multi_normal_cholesky_rng(rep_vector(0.0, 2), diag_pre_multiply(tau, L_Omega));
x_endog[n] = mu[n, 1] + error[1];
y_sim[n] = alpha[2] + x_endog[n] * beta_ours + X_exog[n]* delta + error[2];
}
}
Fake data example
Now I’ve shown you how to write out the model in Stan, let’s use it to simulate some fake data given some instruments and covariates, then estimate the model both using classic IV and Bayesian IV.
First we’re going to make some fake data to feed to the model.
set.seed(42)
# Load the stan library
library(rstan)
options(mc.cores = parallel::detectCores())
# Compile the model
compiled_model <- stan_model("classic_iv_hierarchical_prior.stan")
# Let's make some fake data
N <- 1000 # Number of observations
PX <- 5 # Number of exogenous variables
PZ <- 2 # Number of instruments
J <- 10 # Number of previous studies
# Previous study parameters, drawn from the prior
beta_hat <- rnorm(1, 0, 1)
sigma_beta <- truncnorm::rtruncnorm(1, a = 0)
beta_j <- rnorm(J, beta_hat, sigma_beta)
se_j <- truncnorm::rtruncnorm(J, a = 0)
b_j <- rnorm(J, beta_j, se_j)
# Exogenous variables (make them correlated with a random correlation matrix)
X_exog <- MASS::mvrnorm(N, rep(0, PX), cor(matrix(rnorm(PX*(PX+5)), PX+5, PX)))
# We have to feed it some dummy endogenous variables but these won't make a difference
X_endog <- rnorm(N)
# Some fake instruments
Z <- MASS::mvrnorm(N, rep(0, PZ), cor(matrix(rnorm(PZ*(PZ+5)), PZ+5, PZ)))
Y_outcome <- rnorm(N)
data_list <- list(N = N, PX = PX, PZ = PZ, J = J,
b_j = b_j, se_j = se_j, X_exog = X_exog, X_endog = X_endog,
Z = Z, Y_outcome = Y_outcome,
run_estimation = 0)
And get some fake data draws from the model.
# Now let's get some fake draws
draws_from_model <- sampling(compiled_model, data = data_list, iter = 50, chains = 1)
What we’ve done above is to first compiled the Stan code into machine code. Then we made some fake data for the “meta-analysis” we have conducted. Then we made fake values for \(X\) and \(Z\) (we had to make fake values for \(y\) and the endogenous regressor also because Stan needs us to give it a data list with values for everything declared in the data block). Finally we asked Stan to “sample” from the model. Because we weren’t evaluating the likelihood, this is equivalent to drawing from the prior, then drawing from the generative model, for 50 iterations.
What we’ll do now is get the fake data out of the fitted Stan object and run Stan on the fake data.
# Let's use the next to last non-warm-up draw as our fake data (draw 24)
y_sim <- extract(draws_from_model, pars = "y_sim")[[1]][24,]
x_endog <- extract(draws_from_model, pars = "x_endog")[[1]][24,]
true_beta <- extract(draws_from_model, pars = "beta_ours")[[1]][24]
# Now let's make a new data list
data_list_2 <- data_list
data_list_2$X_endog <- x_endog
data_list_2$Y_outcome <- y_sim
data_list_2$run_estimation <- 1
What we’ve done is extracted the fake data from the “model fit” and replaced the old entries in the data list with the “real” values of the endogenous regressor and outcome. We’ve also switched run_estimation
on. So this time she’ll estimate.
# Fit the model
fitted_model <- sampling(compiled_model, data = data_list_2, cores = 4, chains = 4, iter = 1000)
Great! That estimated fairly cleanly in about three minutes on my laptop. Let’s check the estimates.
print(fitted_model, pars = c("beta_ours"))
## Inference for Stan model: classic_iv_hierarchical_prior.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta_ours 1.65 0 0.08 1.49 1.59 1.65 1.7 1.81 582 1.01
##
## Samples were drawn using NUTS(diag_e) at Sun Nov 26 22:56:15 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
print(paste0("And the true value of beta_ours is ", round(true_beta, 3)))
## [1] "And the true value of beta_ours is 1.703"
So our estimate is within the 95% band. Now let’s stick the data together and estimate the same model using vanilla IV.
library(AER)
iv_data <- data.frame(Y_outcome = y_sim, X_endog = x_endog, X_exog, Z)
iv_fit <- ivreg(Y_outcome ~ X_endog + X1+X2+X3+X4+X5| X1.1 + X2.1+ X1+X2+X3+X4+X5, data = iv_data)
# That fit instantly!
# ...but
summary(iv_fit)
##
## Call:
## ivreg(formula = Y_outcome ~ X_endog + X1 + X2 + X3 + X4 + X5 |
## X1.1 + X2.1 + X1 + X2 + X3 + X4 + X5, data = iv_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.47353 -2.18462 -0.05983 2.11454 10.12866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.40662 0.65526 -2.147 0.0321 *
## X_endog 1.95165 0.45899 4.252 2.32e-05 ***
## X1 0.06253 0.12955 0.483 0.6294
## X2 -0.08761 0.22900 -0.383 0.7021
## X3 -0.03708 0.25834 -0.144 0.8859
## X4 0.11932 0.20094 0.594 0.5528
## X5 0.19293 0.18676 1.033 0.3018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.122 on 993 degrees of freedom
## Multiple R-Squared: 0.8395, Adjusted R-squared: 0.8385
## Wald test: 7.656 on 6 and 993 DF, p-value: 4.691e-08
Which is also close—but because it doesn’t make use of the hierarchical prior, does not do as well. Let’s plot the estimates against the true value (using draws from the pseudo-posterior for IV).
And hey presto! We’ve fit the fake data with Bayes and with vanilla IV. In this example, you can see that both the vanilla IV and the Bayesian estimate both get pretty close to the true effect. Yet the Bayesian estimate is way more precise, because it’s using that information in the hierarchical prior.