I’ve had some feedback on two points of my zen that I’d like to clear up: first, that the point of big data is to estimate rich models; second, that you should take advantage of partial pooling. The feedback has basically been “how?”

So here’s how.

Hierarchical data helps you do better causal analysis AND prediction …if you do it right

One of the nice things about working in industry is that we get to play with extremely rich/big data. We might observe customers, products and staff over time and across geographies, allowing a huge number of potential models. Such data might look like this:

individual time value
1 1 -0.45255059
1 2 0.81522520
1 3 0.93875991
2 1 -0.39965080
2 2 0.02931142
2 3 1.49530643
3 1 0.58661273
3 2 0.61238152
3 3 -1.47199902

or like this:

region town value
South Littleton 0.7203744
South Littleton -0.3914667
South Tinyton -0.4180982
South Tinyton -1.5864343
West Humangaville -0.3516165
West Smallville -0.1462636
West Humangaville 1.0616746
West Smallville -0.3424441

They’re toy examples. But the general gist is that you have data that exists at multiple levels of a hierarchy. Sometimes the groupings are nested, as in the second example, while other times they are crossed, as in the first (every individual appears in every time period). Or sometimes both.
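
If it helps to have something concrete to play with, here’s a tiny R sketch (names and numbers invented) that builds toy data shaped like the two examples above:

# Toy sketches only (names and numbers invented) of the two data shapes above.
set.seed(42)

# Crossed: every individual observed in every time period.
crossed_data <- expand.grid(individual = 1:3, time = 1:3)
crossed_data$value <- rnorm(nrow(crossed_data))

# Nested: towns sit inside regions.
nested_data <- data.frame(
  region = rep(c("South", "West"), each = 4),
  town   = c("Littleton", "Littleton", "Tinyton", "Tinyton",
             "Humangaville", "Smallville", "Humangaville", "Smallville"),
  value  = rnorm(8)
)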

There is an enormous amount of information in this hierarchical data. Yet sadly, analysts too commonly destroy this information by aggregating across hierarchies, or simply don’t take advantage of the nature of the data. Yes, these expensive “data scientists” are leaving $20 bills on the sidewalk.

What am I talking about? See if these sound familiar: running one big regression that ignores the groupings entirely; aggregating the data up to group averages and regressing on those; or splitting the data and estimating a completely separate model for each group.

I’m not bagging these methods. They’ve been extremely useful for a long time. And good analysis with simple techniques can be far more useful than bad analysis with fancy techniques. But technology has progressed, and we can now do good analysis better.

A basic model

Throughout this post I’ll refer to a “hierarchical model”. To give this some flavour, let’s say I’m talking about a normal mixed linear model, which I’ll describe right here. This is not because I love the normal mixed linear model; it’s just an example, but one flexible enough to be useful for many problems. You could adapt everything that follows to help you with many different models, from big neural networks to Gaussian processes to financial time series to structural IO models. Bear with me.

In the normal mixed linear model, we say that some individual \(i\) in group \(j\) has an outcome \(y_{i,j}\) and a vector of covariates \(X_{i,j}\). If it helps you, think of \(y_{i,j}\) as the dollars spent on coffee by customer \(i\) (who lives in town \(j\)); \(X_{i,j}\) contains information about the customer—their gender, age, income and so on. If you’re more used to repeated observations, then \(j\) could equally be a time grouping. The model is just

\[ y_{i,j} = \alpha_{j} + X_{i,j}\beta_{j} + \epsilon_{i,j} \]

where \(\alpha_{j}\) is group \(j\)’s intercept, \(\beta_{j}\) is a vector of slope terms, and \(\epsilon_{i,j}\) is the regression residual, which in this simple model we assume is normally distributed with a mean of 0 and standard deviation of \(\sigma\).
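
To make the notation concrete, here’s a small R simulation of this model. It’s a sketch only: I use a single covariate x (so \(\beta_{j}\) is a scalar), and the group count, group size and parameter values are all invented.

# A sketch of the model with a single covariate x and invented parameter values.
set.seed(123)
J     <- 10                              # number of groups
n_j   <- 50                              # observations per group
sigma <- 1                               # residual standard deviation
alpha <- rnorm(J, mean = 1, sd = 0.5)    # group-varying intercepts
beta  <- rnorm(J, mean = 2, sd = 0.3)    # group-varying slopes

df <- do.call(rbind, lapply(1:J, function(j) {
  x <- rnorm(n_j)
  data.frame(group = j, x = x,
             y = alpha[j] + x * beta[j] + rnorm(n_j, 0, sigma))
}))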

As you can see, in this general model we let \(\alpha_{j}\) and \(\beta_{j}\) vary by group. This might be new to you, and it’s a part of the secret sauce discussed below. Allowing parameters to vary by groups and subgroups is what allows us to build amazingly rich models and make the most out of our data. But it’s not trivial, and we’ll get to that soon; first, how does having varying parameters at a group level help us?

How does having varying parameters at the group level help us?

Two words: unobserved information. In particular, unobserved information that is fairly fixed within groups. This can take two forms.

The first is information that affects the value of one of the covariates \(X_{i,j}\) and also affects \(y_{i,j}\). Such information is confounding in that it generates a non-causal correlation between covariates and outcomes (biasing our estimates of \(\beta\) away from their causal value). If this information is fairly fixed within a group, then group-varying intercepts can help to control for it. This is the same argument for fixed effects regression with panel data.

Taking our coffee example, perhaps some towns are more stressful to live in than others because they have high-paced work. The stress and pace of work are never observed, but they cause both higher incomes and more coffee sales. So if you observe that towns with higher incomes tend to drink more coffee, what would you make of an economic boom? Would you expect it to cause higher coffee sales? Not unless it affects work-pace and stress. Soaking up the unobserved information fixed within groups over time goes some way to fixing this problem (more on this below). To do this, we have group-varying intercepts.
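
Here’s a small simulation of that story, with everything invented: an unobserved town-level “stress” variable raises both incomes and coffee spending. A pooled regression badly overstates the effect of income, while adding town intercepts (done here with plain fixed effects for simplicity) gets back near the true value of 0.2.

# A sketch with invented numbers: unobserved town-level "stress" raises both
# income and coffee spending. The true causal effect of income here is 0.2.
set.seed(1)
J <- 30; n_j <- 40
town   <- rep(1:J, each = n_j)
stress <- rnorm(J)                                   # unobserved, fixed within towns
income <- stress[town] + rnorm(J * n_j)              # stress pushes incomes up
coffee <- 0.2 * income + 2 * stress[town] + rnorm(J * n_j)

coef(lm(coffee ~ income))["income"]                  # pooled: badly biased upwards
coef(lm(coffee ~ income + factor(town)))["income"]   # town intercepts: close to 0.2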

The second is unobserved information that affects the relationship between a covariate and the outcome. More than anything, this is a relaxation of the assumption that slope parameters be the same for all groups. Perhaps the rich people in one town happen to be Mormon bankers who don’t drink coffee? Allowing slopes to vary across groups can help capture this additional information.

So how do we estimate these models?

Imagine a spectrum. On one end, we assume that all groups have the same set of parameters—no group-varying parameters. This is known as a “pooled” regression. In this case, \(\alpha_{j} = \alpha\) and \(\beta_{j} = \beta\), which just gives the normal linear model

\[ y_{i,j} = \alpha + X_{i,j}\beta + \epsilon_{i,j} \]

At the other end of the spectrum, we estimate \(\alpha_{j},\, \beta_{j}\) for each of the groups by running a linear regression on each of the groups separately—so-called “un-pooled” regression. No information from any group other than \(j\) influences the parameter estimate for group \(j\). Conceptually, that’s similar to what people are doing with Hadoop and Spark today.
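
Sticking with the simulated df from the sketch above (columns y, x and group), the two ends of the spectrum look roughly like this:

# Assumes the simulated df (columns y, x, group) from the sketch above.
pooled <- lm(y ~ x, data = df)                  # one alpha and beta for everyone

by_group <- split(df, df$group)                 # a separate regression per group
unpooled <- lapply(by_group, function(d) lm(y ~ x, data = d))
sapply(unpooled, coef)                          # alpha_j and beta_j, group by group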

The approach we Bayesians advocate is a so-called “partial pooling” approach, which results in parameter estimates somewhere in the middle of the spectrum. The intuition for the approach comes from a few observations: un-pooled estimates from small groups are noisy and can over-fit badly; the pooled estimate throws away real differences between groups; and because groups tend to resemble one another, every group’s data tells us something about the parameters of the other groups.

The take-away from those points is that we probably want to be somewhere between the pooled and unpooled estimates. In practice this means being a good Bayesian and specifying a probability model for the data and parameters.
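
To see why, consider the simplest case: a varying-intercept model with no covariates. Write \(\bar{y}_{j}\) for the mean outcome in group \(j\), \(\bar{y}\) for the mean across all observations, \(n_{j}\) for the size of group \(j\), and \(\sigma_{\alpha}\) for the standard deviation of the \(\alpha_{j}\) across groups. The standard result is that the partially pooled estimate of group \(j\)’s intercept is roughly a precision-weighted average of the un-pooled and pooled estimates:

\[ \hat{\alpha}_{j} \approx \frac{\frac{n_{j}}{\sigma^{2}}\bar{y}_{j} + \frac{1}{\sigma_{\alpha}^{2}}\bar{y}}{\frac{n_{j}}{\sigma^{2}} + \frac{1}{\sigma_{\alpha}^{2}}} \]

Small groups get shrunk a long way towards the overall mean; big groups barely move from their own mean. The data, through \(\sigma\) and \(\sigma_{\alpha}\), decide how much pooling to do.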

So we still have our model

\[ y_{i,j} = \alpha_{j} + X_{i,j}\beta_{j} + \epsilon_{i,j} \]

And we have our parameters \(\alpha_{j},\, \beta_{j},\, \sigma\). Let’s let \(\theta_{j} = (\alpha_{j}, \beta_{j}')'\). Our prior distribution for \(\theta_{j}\) is

\[ \theta_{j} \sim \mbox{Multi Normal}(\mu, \Sigma) \]

where \(\mu\) is a vector of expected values for \(\alpha_{j},\, \beta_{j}\), and \(\Sigma\) is their covariance matrix. We can decompose the covariance into a scale vector \(\tau\) and a correlation matrix \(\Omega\): \(\Sigma = \mbox{diag}(\tau)\,\Omega\,\mbox{diag}(\tau)\). It is easier to give informative priors for this parameterization than for \(\Sigma\) directly. See here.
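
To make the decomposition concrete, here’s a tiny R sketch with invented numbers for the two-parameter case (\(\alpha_{j}\) and a single \(\beta_{j}\)):

# Rebuilding Sigma from the scales tau and correlation matrix Omega (numbers invented).
tau   <- c(0.5, 0.3)                       # standard deviations of alpha_j and beta_j
Omega <- matrix(c( 1.0, -0.8,
                  -0.8,  1.0), nrow = 2)   # correlation between alpha_j and beta_j
Sigma <- diag(tau) %*% Omega %*% diag(tau) # the covariance matrix of theta_j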

I used to gloss over the whole covariance of parameters thing. But it’s very important! Imagine we have many groups, each with only two parameters, \(\alpha_{j}\) and \(\beta_{j}\). We run an un-pooled regression for every group and find that in one group we have a very precise estimate for its \(\alpha\) but not its \(\beta\) (perhaps the group was small). Now imagine we learn that across groups the correlation between \(\alpha_{j}\) and \(\beta_{j}\) is -0.8. This new information should help us get a better estimate for \(\beta_{j}\) in the troublesome group. Being explicit about the covariance between group-level parameters can help you benefit from this.

To implement a mixed (generalized) linear model like the one above, you might want to look at the rstanarm package in R. If you want to implement more adventurous models, then you should do it in Stan. Mike’s talk here will convince you that you can’t do this using Maximum Likelihood.
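
As a concrete (and entirely invented) illustration of the coffee example, an rstanarm call might look something like this, assuming a data frame customers with columns coffee_spend, income and town (all hypothetical names):

# A sketch only: the data frame customers and its column names are invented.
library(rstanarm)
fit <- stan_lmer(coffee_spend ~ income + (1 + income | town),
                 data = customers)
summary(fit)

The (1 + income | town) term is what gives each town its own partially pooled intercept and income slope, with the covariance between them estimated from the data.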

A warning: Exchangeability matters (a common argument against random effects models)

Astute readers familiar with fixed effects models will have noted a problem with one of my arguments above. I said we could soak up unobserved information that affects both \(X\) and \(y\) by including group-varying intercepts \(\alpha_{j}\). But this implies that the unobserved information fixed within a group, \(\alpha_{j}\), is correlated with \(X\). That correlation violates a very important rule in varying-intercept, varying-slope models: exchangeability.

Exchangeability says that there should be no information, other than the outcome \(y\), that allows us to distinguish the group to which a group-level parameter belongs.

In this example, we can clearly use the values of \(X\) to predict \(j\), violating exchangeability. But all is not lost: it is only the random portion of the group-varying parameter that needs to be uncorrelated with \(X\), not the parameter as a whole.

The Bafumi-Gelman correction

Imagine we have an exchangeability problem in a very simple model with only a group-varying intercept: the unobserved information \(\alpha_{j}\) is correlated with \(X_{i,j}\) across groups. Let’s break \(\alpha_{j}\) down into its fixed and random portions.

\[ \alpha_{j} = \mu_{1} + \eta_{j} \]

where \[ \eta_{j} \sim \mbox{normal}(0, \sigma_{\eta}) \]

So that now the regression model can be written as

\[ y_{i,j} = \mu_{1} + X_{i,j}\beta + e_{i,j} \mbox{ where } e_{i,j} = \epsilon_{i,j} + \eta_{j} \]

If \(\alpha_{j}\) is correlated with \(X_{i,j}\), then (since \(\mu_{1}\) is constant) \(\eta_{j}\) must be correlated with \(X_{i,j}\). But \(\eta_{j}\) now sits inside the regression error \(e_{i,j}\), so the error is correlated with \(X\), violating the exogeneity assumption behind the Gauss-Markov theorem and giving us biased estimates.

In a nice little paper, Bafumi and Gelman suggest an elegant fix: simply control for group-level averages in the model for \(\alpha_{j}\). This is a Bayesian take on what econometricians might know as a Mundlak/Chamberlain approach. If \(\bar{X}_{j}\) is the mean of \(X_{i,j}\) in group \(j\), then we could use the model

\[ \alpha_{j} = \hat{\alpha} + \gamma \bar{X}_{j} + \nu_{j} \]

which results in the correlation between \(\nu_{j}\) and \(X_{i,j}\) across groups being 0. It’s straightforward to implement, and gets you to conditional exchangeability, a condition under which mixed models like this one are valid.
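
In code, the correction is just “add the group means of the covariates as extra predictors”. Continuing the invented customers example from the rstanarm sketch above:

# Continues the invented customers example; assumes rstanarm is loaded as above.
# Add the town-level mean of income as an extra group-level predictor.
customers$income_town_mean <- ave(customers$income, customers$town)

fit_bg <- stan_lmer(coffee_spend ~ income + income_town_mean + (1 | town),
                    data = customers)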

Happy modeling!