Building good statistical models is hard. Unfortunately, most statistics/data science/econometrics training focuses on statistical models and the algorithms we use to estimate them, at the expense of other important topics. One such topic is workflow: how to structure the process of your analysis to maximise the odds that you build useful models.
The following is the workflow that I try to force on myself. Often I take shortcuts, only to get stuck and come back to the workflow. When friends email me because their models aren’t doing what they’re meant to, I tell them to stick to this workflow. Put simply, this workflow is the best way to learn a wide range of modeling techniques and build better models.
A more elaborate explanation I wrote earlier in the year is here.
Plotting should always be your starting point, even if you are a whiz modeler. There are a few good reasons for this. You are either trying to discover relationships in the data or to explain them away, and your eyes will detect these very easily. Your data might have problems, like missing values or incorrectly coded observations. Your variables might be on very different orders of magnitude. Plotting your data first is the easiest way of avoiding these pitfalls.
You should always produce density plots of your outcome variables. Doing this will provide guidance for the sort of model you should build.
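For example, a minimal sketch in R, assuming (hypothetically) that your data live in a data frame called sales_data with an outcome column called sales:

```r
# Quick look at the outcome before any modeling: a histogram with a density
# overlay will reveal skew, multimodality, outliers and impossible values.
hist(sales_data$sales, breaks = 50, freq = FALSE,
     main = "Distribution of sales", xlab = "Sales")
lines(density(sales_data$sales, na.rm = TRUE), lwd = 2)
```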
As I wrote in my zen, you need to have a very clear idea of what the random variables are in your model. Random variables are simply those variables whose values are in part due to chance. For most modeling purposes, our random variables are the things we don’t know for sure out of sample. These might include model parameters, latent variables, predictions, etc.
A common problem occurs when the modeler uses a random variable as a predictive feature in a model but does not explicitly model it. For instance, they might build a model that looks a bit like
\[ \mbox{Sales}_{t} = f(\mbox{Weather}_{t}, \mbox{Day of the week}_{t}) + \mbox{error}_{t} \]
Next, when called on to make a prediction, the modeler uses forecasts for weather to generate predictions for sales, but without taking into account the uncertainty around weather (a random variable). As for the day of the week, this is not a random variable and so we don’t have to model it. Any forecasts conditioned on random variables without taking into account their uncertainty will be far too precise.
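One simple remedy is to propagate that uncertainty by simulation: draw many plausible weather outcomes and push each of them through the model. A rough sketch in R, where sales_model is a hypothetical fitted model and the weather forecast is summarised (again hypothetically) by a mean and standard deviation:

```r
# Draw many plausible weather outcomes from the forecast distribution,
# rather than plugging in a single point forecast.
n_sims <- 5000
weather_draws <- rnorm(n_sims, mean = 22, sd = 3)  # hypothetical forecast for tomorrow

# Push each draw through the (hypothetical) fitted model; the day of the
# week is known, so it enters as a fixed value.
sales_draws <- predict(sales_model,
                       newdata = data.frame(weather = weather_draws,
                                            day_of_week = "Friday"))

# Summarise the whole distribution rather than reporting a single number.
# (Parameter and residual uncertainty would widen these intervals further.)
quantile(sales_draws, probs = c(0.1, 0.5, 0.9))
```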
The reason we should write down the random variables in our model is that they are precisely what we are going to model.
In this step, we ask ourselves: what is a plausible process that could generate the outcomes that we observe? For instance, if we think that a normal linear regression model with coefficients \(\beta\) and covariates \(X\) and residual standard deviation \(\sigma\) is suitable, our generative model would be
\[ y_{i} \sim \mathcal{N}(X_{i}\beta, \sigma) \]
Or we might consider a normality assumption to be too strong, and use a “fat-tailed distribution” instead
\[ y_{i} \sim \mbox{Student's t}(\nu, \, X_{i}\beta,\, \sigma) \]
Or perhaps our outcome \(y_{i}\) comes from two distributions, each with a different probability (as in this post). Or it could be binary, or count data, or strictly positive data, or multimodal data, etc., in which case we would choose different distributions still.
Note that the examples above are extremely simple models—you should almost always start with simple models and build up in complexity. As your model grows in complexity, the value of performing the fake-data exercise in steps 4 and 5 grows.
After defining the generative model, you should assign priors to all the unknowns—in this case, the parameters \(\nu\), \(\beta\), and \(\sigma\). These priors should give weight to plausible values of the parameters and no weight to impossible values. For instance, \(\nu\) is restricted to be \(>1\) and \(\sigma\) has to be positive, so priors for those parameters should not put weight on values outside those ranges.
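One easy way to check that your priors respect these constraints is to draw from them. The specific distributions below are illustrative assumptions, not recommendations:

```r
set.seed(42)
K <- 3  # number of covariates (illustrative)

# One draw from each (assumed) prior, respecting the constraints nu > 1 and sigma > 0
true_nu    <- 1 + rexp(1, rate = 0.2)          # shifted exponential, so nu > 1
true_beta  <- rnorm(K, mean = 0, sd = 5)       # weakly informative normal
true_sigma <- abs(rnorm(1, mean = 0, sd = 2))  # half-normal, so sigma > 0
```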
We have a generative model for our data—a way to simulate plausible values for \(y\) given \(X\)—and priors for the parameters.
The next step is to draw some values from the priors, which we treat as being “known” values of the parameters. After doing this we have values for \(X\), and “known” values for \(\nu\), \(\beta\) and \(\sigma\), so we can simulate some fake data by drawing observations from the generative model in step 3.
Why should we simulate fake data? First, it gives us an idea of whether our model puts weight on impossible outcomes—we don’t want to use a model that does that! But more importantly, this (often skipped) step forces us to be very explicit about all the assumptions in the model, and guides the estimation in the next step.
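Continuing the sketch above (reusing K, true_nu, true_beta and true_sigma as the “known” parameter values), simulating fake data from the Student-t generative model might look like this:

```r
N <- 500  # number of fake observations (illustrative)

# Covariates: standard normals here for simplicity; in practice you could
# reuse your real X.
X <- matrix(rnorm(N * K), nrow = N, ncol = K)

# Draw outcomes from the generative model in step 3:
# y_i ~ Student's t(nu, X_i * beta, sigma)
y <- as.vector(X %*% true_beta) + true_sigma * rt(N, df = true_nu)

# Worth a quick density plot: does the fake outcome look at all plausible?
plot(density(y), main = "Fake outcome data")
```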
Before taking your estimation model to real data, you should always try estimating the model on the fake data you simulated in step 4. Why do this? Because you know the values of the parameters that generated those data, you can check that estimating the model on the fake data recaptures the known values. If your model cannot recapture the known parameters from fake data, it has no hope of estimating the right values from real data.
If your model is able to recapture known parameter values, it’s time to estimate the model on real data.
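Continuing the running example, a sketch of how this might look with rstan, writing the Student-t regression as a small Stan program (the priors are illustrative and match the ones assumed when simulating the fake data):

```r
library(rstan)

# The Student-t regression from step 3, written as a Stan program.
student_t_code <- "
data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  real<lower=1> nu;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  target += exponential_lpdf(nu - 1 | 0.2);  // shifted exponential, so nu > 1
  beta ~ normal(0, 5);
  sigma ~ normal(0, 2);                      // half-normal, given the lower bound
  y ~ student_t(nu, X * beta, sigma);
}
"

fit <- stan(model_code = student_t_code,
            data = list(N = N, K = K, X = X, y = y),
            chains = 4, iter = 2000)

# Do the posterior summaries cover the 'known' values drawn from the priors?
print(fit, pars = c("nu", "beta", "sigma"))
c(true_nu, true_beta, true_sigma)
```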
Often people jump to this step without performing steps 4 and 5 first, and get funny results. (I know this because some of these people are my friends, and I receive a few emails a week about precisely this problem.) Gelman’s folk theorem is that “it’s probably not the computer, it’s probably your model”, and this is almost always the case here.
Sometimes, especially for big, loosely-identified models, you might not be getting very good samples from your posterior. One great tool in R for exploring pathologies in sampling is shinystan (available here), which provides a web interface to your MCMC fits.
If you have poor convergence or pathologies in sampling, this can often be fixed by reparameterizing your model. Reparameterizing is simply expressing the same model in a form whose posterior has a more regular shape, and so is easier to sample from.
Now we know that the model has been built well and is estimating fine. But was it the right model in the first place? Posterior predictive checking is a very useful method for answering this question. The aim is to check whether the model, once we account for the uncertainty in its parameters, generates predictions whose distribution is similar to that of the observed data.
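A bare-bones sketch of this check, continuing from the fit above (packages like bayesplot and shinystan provide more polished versions):

```r
# Simulate replicated datasets from a sample of posterior draws and compare
# their distribution with the observed (here, fake) outcome.
draws <- rstan::extract(fit)

plot(density(y), lwd = 2, main = "Posterior predictive check", xlab = "y")
for (s in sample(length(draws$sigma), 50)) {
  mu_s  <- as.vector(X %*% draws$beta[s, ])
  y_rep <- mu_s + draws$sigma[s] * rt(length(y), df = draws$nu[s])
  lines(density(y_rep), col = adjustcolor("grey", alpha.f = 0.4))
}
# If the replicated (grey) densities look nothing like the observed (black)
# density, the model is missing something important.
```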
An example of this is below, from some recent work of mine modeling micro-loan repayments in Sub-Saharan Africa.
We have just built a fairly simple model, but now we have it working, and we probably have a good idea of what’s wrong with it. At this point we can afford to go back to step 3 and build up a more complex model, knowing that if anything breaks (or a deadline approaches), we have a well-built, well-checked model ready to go.