In order to fit Bayesian models we need to construct a function that tells us whether certain values of model unknowns are good or bad. For example, in Figure 2 we plot a histogram of values of some random variable X. We want to fit a density function to this histogram so as to be able to make probabilistic statements about the likely distribution of yet-unseen observations. To gauge whether a proposed density is a good one, we would like a function that gives a higher value for proposed model unknowns that lead to the distribution in the bottom panel and lower values for proposed model unknowns that lead to the distribution in the top panel.
There are many functions that we could use to determine whether some proposed model unknowns result in a “better” fit than some other unknowns. Likelihood functions are one approach, based on the assumption that the proposed generative distribution gave rise to the observed data. This is most easily motivated with an example.
Example: the binomial likelihood of loan defaults
Let’s say we wanted to estimate the probability of loan defaults. We have a small sample of only five loans, and one has defaulted. If we encode defaulted = 1 and not defaulted = 0, then our data look like
## [1] 0 0 1 0 0
The likelihood function answers the question “what is the probability of observing this sequence if the probability of defaulting is \(\theta\)?” If the outcomes are independent of each other, this is the product of the likelihood contributions, that is, the probability of observing each individual observation. The probability of observing a default is \(\theta\), and so the probability of observing a non-default is \((1 - \theta)\). For example, let’s say we think the probability of defaulting is 2%. Then our likelihood (the product of the likelihood contributions) would be
\[ p(y = (0, 0, 1, 0, 0)'|\, \theta = 0.02) = 0.98 \times 0.98 \times 0.02 \times 0.98 \times 0.98 = 0.0184 \]
The above is specific to this example. In general, the Bernoulli likelihood, the likelihood function we use for binary outcomes, is
\[ p(y |\, \theta) = \prod_{i=1}^{N} \theta^{y_{i}}\times(1 - \theta)^{1 - y_i} \]
Note that when \(y_{i} = 0\) (no default), \(\theta^{y_{i}} = 1\). Now, is the proposed probability of defaulting a good one? That is what the likelihood function can help us determine. Below we plot the values of the likelihood function for this data evaluated over the possible values of \(\theta\).
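To see this concretely, here is a minimal sketch in base R of how such a plot could be produced (the grid resolution is arbitrary):

# The five observed loans
y <- c(0, 0, 1, 0, 0)
# Evaluate the likelihood over a grid of candidate values of theta
theta_grid <- seq(0.001, 0.999, length.out = 200)
likelihood <- sapply(theta_grid, function(theta) prod(theta^y * (1 - theta)^(1 - y)))
plot(theta_grid, likelihood, type = "l", xlab = "theta", ylab = "likelihood")
theta_grid[which.max(likelihood)] # close to 0.2, the sample default rate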
What do we notice about this likelihood? First, it’s not a probability distribution in \(\theta\): it does not integrate to 1 over the possible values of \(\theta\), although it does integrate to 1 over the possible values of \(y\). This is a common source of confusion, partly because it looks like many of the distributions we’re used to working with, and partly because we write it as \(\text{Likelihood} = p(y|\, \theta)\). Second, we notice that in this case it is maximized at 0.2, which happens to be the proportion of loans in our very small sample that have defaulted. The maximizing point of the likelihood function is known as the maximum likelihood estimate (or MLE) of our parameter \(\theta\) given our data. Formally,
\[ \hat\theta_{MLE} = \text{argmax}_{\theta}(p(y |\, \theta)) \] The plot illustrates something potentially dangerous about using the maximum likelihood estimate with small samples. In this case, before we did the analysis, we had reason to believe that the default rate was 2%. This is what we might call prior information: information that exists from sources other than the sample of data we’re analyzing. Unfortunately our maximum likelihood estimate does not reflect this prior knowledge at all, and we’re left with a probability of default simply equal to the sample frequency. Later in this chapter we’ll discuss how to encode this outside information in our estimate.
The Bernoulli log likelihood
In this simple example, in which we have only one unknown and very few data points, we have no problem evaluating the likelihood function directly. Yet you might have noticed that as we provide more observations of \(y\) to the function, its value gets closer to 0: in the discrete case, the likelihood contribution of each observation is between 0 and 1, so multiplying more and more of them together results in a rapidly shrinking likelihood value. With a large number of observations, your computer simply cannot represent that number of leading 0s, and the computed likelihood underflows to exactly 0.
# draw 100k 0-1 coin flips
y <- sample(0:1, 1e5, replace = TRUE)

# A Bernoulli likelihood function
bernoulli_l <- function(theta, y) {
  prod(theta^y * (1 - theta)^(1 - y))
}

# evaluate the likelihood at the correct value
bernoulli_l(0.5, y)
## [1] 0
In order to handle these issues with numerical precision, we use the log likelihood function. This is, as the name suggests, the log of the likelihood function. In the Bernoulli case, it’s
\[ \log(p(y |\, \theta)) = \log\left(\prod_{i=1}^{N} \theta^{y_{i}}\times(1 - \theta)^{1 - y_i}\right) = \sum_{i = 1}^{N} \left(y_{i}\times \log(\theta) + (1 - y_{i})\times \log(1 - \theta)\right) \]
This makes use of the facts that \(\log(ab) = \log(a) + \log(b)\) and \(\log(a^{x}) = x\log(a)\). This representation of the likelihood is far easier for us to work with than the raw likelihood. For one, it is order preserving: the values of the unknowns that maximize the log likelihood are the same as those that maximize the likelihood, and yet because we sum the log likelihood contributions, small probabilities don’t send the value towards 0. Second, in some cases (for instance, with a Bernoulli or categorical likelihood), we can evaluate it with linear algebra. If we want the likelihood of a vector \(y = (0, 0, 1, 0, 0, \dots)'\) evaluated with the probability vector \(\theta = (\theta_{1},\, \theta_{2},\dots)'\) denoting the probability of each observation, then the log likelihood is just
\[ \log p(y |\,\theta) = y'\log(\theta) + (1 - y)'\log(1 - \theta) \]
We use this formulation extensively in Chapter 4.
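To connect this back to the coin-flip example, here is a sketch of the Bernoulli log likelihood written this way (the helper name bernoulli_ll is ours; R’s recycling handles the scalar-\(\theta\) case). Unlike bernoulli_l above, it has no trouble with 100,000 observations:

# Bernoulli log likelihood: a sum of logs instead of a product of probabilities
bernoulli_ll <- function(theta, y) {
  sum(y * log(theta) + (1 - y) * log(1 - theta))
}
# No underflow: at theta = 0.5 this is exactly 1e5 * log(0.5), about -69315
bernoulli_ll(0.5, y)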
Now that we have a basic understanding of the log likelihood function, let’s take a look at the log likelihood function for a model with a continuous outcome and three unknowns.
Log likelihood of a simple normal linear model
Let’s say we have a very simple linear model of \(y\) with a single covariate \(x\). \(\alpha\), \(\beta\) and \(\sigma\) are the model’s parameters.
\[ y_{i} = \alpha + \beta x_{i} + \epsilon_{i} \]
with \(\epsilon_{i} \sim \mbox{N}(0, \sigma)\). This is the same as saying that
\[ y_{i} \sim \mbox{N}(\alpha + \beta x_{i}, \sigma) \]
This is the model we are saying generates the data—\(\mbox{N}(\alpha + \beta x_{i}, \sigma)\) is the generative distribution. It has three unknowns, \(\alpha\), \(\beta\) and \(\sigma\). If we knew what these values were for sure, we could make probabilistic statements about what values \(y_{i}\) might take if we knew the value of \(x_{i}\).
But that’s not how data analysis normally works. Normally we’re given values of \(x\) and \(y\) (or many \(x\)s and many \(y\)s) and have to infer the relationship between them, often with the help of some model. Our task is to estimate the values of the unknowns—in this case, \(\alpha\), \(\beta\) and \(\sigma\).
So how do we do this by maximum likelihood? One thing that computers are relatively good at is choosing values of some unknowns in order to optimize (normally minimize) the value of some function. Here we use the normal log likelihood function as a way of scoring various combinations of \(\alpha\), \(\beta\) and \(\sigma\) so that the score is high when the model describes the data well, and low when it does not.
Let’s say we propose a set of parameters, \(\alpha = 2\), \(\beta = .5\) and \(\sigma = 1\), and we have an observation \(y_{i} = 1,\, x_{i} = 1\). Given the parameters and \(x_{i}\), we know that the outcome \(y_{i}\) should be distributed as
\[ y_{i} \sim \mbox{N}(2 + .5 \times 1, 1) = \mbox{N}(2.5, 1) \]
which might look like this:
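A quick way to draw this density in base R (the plotting range is chosen arbitrarily):

# The proposed generative density N(2.5, 1), with the observed outcome marked
curve(dnorm(x, mean = 2.5, sd = 1), from = -1, to = 6,
      xlab = "y", ylab = "density")
abline(v = 1, lty = 2) # the observed outcome y_i = 1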
Now we ask: what was the density at the actual outcome \(y_{i}\)?
In R, we could go ahead and write a function that returns the density for a given observation. Because we’ll be optimizing this function, we want to give it the unknowns as a single vector.
density_1 <- function(theta, # the unknown parameters
                      x, y) {
  # This function returns the density of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3]),
  # evaluated at y.
  dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))
}
Now let’s evaluate the density at \(x=1,\, y=1\) for the proposed parameter values. Note that we’ve encoded the scale, which must be positive, as \(\sigma = \exp(\theta_{3})\); the exponential guarantees positivity. If we know \(\theta_{3}\), we can get back to \(\sigma\) by exponentiating it.
density_1(theta = c(2, .5, log(1)),
          x = 1, y = 1)
## [1] 0.1295176
Just as in the binary outcome case, the value of the density at an observed outcome is the likelihood contribution \(f(y_{i} | x_{i}, \alpha, \beta, \sigma)\) of a single datapoint. Unlike the binary case, this has no direct probabilistic interpretation except in relative terms: if the continuous density at a given value is twice the density at another, then the relative probability of observing the first value is twice that of the second. Also, the density needn’t be less than one; when the likelihood is very narrow, we routinely see values of the likelihood greater than 1. You’ll be able to see that this likelihood contribution is maximized (at infinity) when it imposes a spike on the single observation we observe, i.e., \(\alpha + \beta \times 1 = 1\) and \(\sigma = 0\).
# log(0) -> -Inf
density_1(c(0, 1, -Inf), 1, 1)
## [1] Inf
But we have more than one observation
We have one observation, three unknowns—of course this is an unidentified model. In reality we have many observations. So how do we use this principle of likelihood to get estimates of the many model unknowns?
If we assume (conditional) independence of the draws, that is, that the value of one observation’s \(\epsilon_{i}\) has no influence on another’s \(\epsilon_{j}\), then the likelihood of the data vector \(y\) is the product of all the individual likelihood contributions, \(\prod_{i = 1}^{N}f(y_{i} | x_{i}, \alpha, \beta, \sigma)\). This is exactly as in the binary outcome likelihood above.
You can probably see the problem here—for reasonably large spreads of the data, the value of the density for a datapoint is typically less than 1, and the product of many such datapoints will be an extremely small number. And so just as in the binary case, we take the log likelihood of the data.
\[ \log\left( \prod_{i = 1}^{N}f(y_{i} | x_{i}, \alpha, \beta, \sigma)\right) = \sum_{i = 1}^{N}\log(f(y_{i} | x_{i}, \alpha, \beta, \sigma)) \]
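To see the underflow problem concretely, here is a quick sketch (y_sim is a stand-in simulated vector, not the data we use below):

# The product of 1000 density values underflows to 0...
y_sim <- rnorm(1000, mean = 0, sd = 3)
prod(dnorm(y_sim, mean = 0, sd = 3))
# ...while the sum of log densities remains a perfectly manageable number
sum(dnorm(y_sim, mean = 0, sd = 3, log = TRUE))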
Log likelihood is a confusing concept for beginners, as the values of the numbers aren’t easily interpretable. You might run the analysis, get a number like -3477, and ask “is that good? Bad? What does it mean?” These numbers are more meaningful when conducting model comparison: we obtain the out-of-sample log likelihood for two different models and then compare them. The usefulness of this approach is best illustrated with an example.
Imagine that we have an outlier: rather than \(y_{j} = 1\) as in the image above, it is -10. Under the proposed model, we’re essentially saying that such an outcome is all but impossible. The probability of observing a value that low or lower is \(3.7325643 \times 10^{-36}\). The density at such a point would be \(4.6951954 \times 10^{-35}\), a very small number, but the log of its density is large in absolute value: -79.0439385. Compare that with the log likelihood of the original \(y_{j} = 1\): -2.0439385. An outlier penalizes the log likelihood far more than a point near the mode, and consequently our maximum likelihood estimate will try to make sense of the outlier by dragging the entire distribution that way.
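These numbers are easy to reproduce in R under the proposed \(\mbox{N}(2.5, 1)\) generative distribution:

# Probability of observing a value of -10 or lower
pnorm(-10, mean = 2.5, sd = 1)
# Density and log density at the outlier
dnorm(-10, mean = 2.5, sd = 1)
dnorm(-10, mean = 2.5, sd = 1, log = TRUE)
# Log density at the original y_j = 1, for comparison
dnorm(1, mean = 2.5, sd = 1, log = TRUE)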
This idea helps us do good modeling: we want to give positive weight to outcomes that happen, and no weight to impossible outcomes. If in the data visualization step we notice some of these outliers (and they are real data rather than miscoded entries—which are common) then we will want to use distributions that make these possible. Use of these fat-tailed distributions is covered in Chapter 3.
Maximum likelihood
Maximum likelihood estimators are simple: if the (log) likelihood is a unidimensional score of how well the data fit the model for a (potentially large) number of parameters, then we can simply run an optimizer that attempts to maximize this score by varying the values of the parameters. If the optimizer is able to find us the mode of the likelihood, it gives us the maximum likelihood estimator (MLE). This mode needn’t be a good estimate of what we want to know. If we have good prior information, the MLE won’t capture any of that. And in some cases (as with the hierarchical models illustrated in the second worked example in this chapter) the MLE returns an objectively terrible estimate.
Let’s get some practice at writing this out in R. Going through this exercise will help you understand what’s going on under the hood in Stan.
Suppose we have a thousand observations. \(x\) is drawn from a standard normal, \(\alpha = 1\), \(\beta = 2\) and \(\sigma = 3\). \(y = \alpha + \beta x + \epsilon\), where \(\epsilon\) has a zero-centered normal distribution with standard deviation (scale) \(\sigma\).
alpha <- 1; beta <- 2; sigma <- 3
x <- rnorm(1000)
y <- rnorm(1000, alpha + beta * x, sigma)
Now that we have some fake data, let’s write out the log likelihood function, as before. Note that because the optimization function we use minimizes the function, we need to return the negative of the log likelihood (minimizing the negative is the same as maximizing the likelihood).
negative_log_likelihood <- function(theta, # the unknown parameters
                                    x, y) {
  # This function returns the negative log likelihood of a normal
  # with mean theta[1] + theta[2]*x and standard deviation exp(theta[3]),
  # evaluated at a vector y.
  -sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
}
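As an aside, dnorm can return the log density directly via its log argument, which sidesteps taking the log of a density that may itself have underflowed. An equivalent, slightly more stable function body would be:

# Equivalent, computed on the log scale throughout
-sum(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]), log = TRUE))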
Now we optimize
estimates <- optim(par = c(1, 1, 1), negative_log_likelihood, x = x, y = y)
estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])
paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)
## [1] "estimate of alpha is 0.95 true value is 1"
paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)
## [1] "estimate of beta is 2.03 true value is 2"
paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)
## [1] "estimate of sigma is 3.16 true value is 3"
Hey presto! We just estimated a model by maximum likelihood: we varied the parameters of the model, keeping the data fixed, so as to maximize the likelihood. You should note that this method can be applied more generally, in much larger and more interesting models.
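As a sanity check (not part of the original workflow), the maximum likelihood estimates of \(\alpha\) and \(\beta\) in a normal linear model coincide with ordinary least squares, so they should agree closely with what lm() reports:

# OLS estimates should match the MLE above, up to optimizer tolerance
coef(lm(y ~ x))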
Bayes’ rule gives us a method for combining the information from our data (the likelihood) with our priors, discussed below. It says that the (joint) probability density of our parameter vector \(\theta\), given the data \(y\), is
\[ \mbox{p}(\theta|\, y) = \frac{\mbox{p}(y|\, \theta)\, \mbox{p}(\theta)}{\mbox{p}(y)} \]
where \(\mbox{p}(y|\, \theta)\) is the likelihood, and \(\mbox{p}(\theta)\) is the prior. We call \(\mbox{p}(\theta|\, y)\) the posterior. Because \(\mbox{p}(y)\) doesn’t depend on the vector of unknowns, \(\theta\), we often express the posterior up to a constant of proportionality
\[ \mbox{p}(\theta|\, y) \propto \mbox{p}(y|\, \theta)\, \mbox{p}(\theta) \]
What can you take away from this? A few things:
Bayes’ rule tells us that in order to obtain a posterior, all we need is a prior and a likelihood (and some strategy for combining them).
Prior distributions summarize our information about the values of unknowns in our model before seeing the data. For the uninitiated, this is the scary bit of performing Bayesian analysis, often because of fears of introducing bias, of having a poor understanding of how exactly a parameter influences the model, or of not knowing how to summarize prior information in a way that makes sense for computation (i.e., which prior distribution to choose). Such fears are often revealed by the choice of extremely diffuse priors, for instance regression coefficients like \(\beta \sim \mbox{Normal}(0, 1000)\).
Don’t be afraid of using priors: good prior information radically improves the performance of models, as well as computation. You almost always have high-quality information about the values parameters might take on.
Prior distributions should put positive probabilistic weight on possible values of the model unknowns, and no weight on impossible values. In the simple linear model, the standard deviation of the residuals, \(\sigma\), must be positive, and so its prior should put no probabilistic weight below 0. That might guide the choice to a distribution like a half-Normal, half-Cauchy, half-Student’s t, inverse Gamma, or lognormal.
In a properly-posed model to be estimated with Bayesian techniques, all model unknowns must have priors. We almost always choose these priors from a set of parametric distributions, which themselves might have models for their parameters (so-called “hierarchical priors”). We look at hierarchical priors later in this chapter, and again in chapters 2 and 3. The choice of prior depends on the type of parameter being modeled. Some common parameter types and choices include:
Parameter type | Common prior distributions
---|---
Regression coefficient | (Unbounded or bounded) Normal, Student’s t, Cauchy distributions
High-dimensional coefficients | Double-exponential, Horseshoe, Finnish Horseshoe priors
Scale parameter (must be positive) | “Half” Normal, Student’s t, Cauchy distributions; Gamma, Exponential, etc.
Correlation matrix | “LKJ” distribution
Covariance matrix | Inverse Wishart distribution
Simplex (shares) | Dirichlet distribution
Proportion (0-1 bounded continuous) | Beta distribution
Priors have a “regularizing” effect, meaning that they pull our estimates away from the likelihood and towards the prior. The “tighter” the prior, the greater this regularizing effect will be. If the prior puts very little weight on the range of plausible values suggested by the likelihood function, two things will happen: first, we will very often have computational issues in the model-fitting stage; second, even if we can get our model to fit, the prior will drag our estimates away from the likelihood. In the limit, if the data contain no information about the value of a coefficient (for example a linear regression \(y = \alpha + \beta x + \epsilon\) where every observation of \(x\) is 0), then our estimate will simply be our prior.
This regularizing effect of priors is perhaps the most controversial aspect of Bayesian modeling. A caricatured criticism (correct to some degree) is that a Bayesian model could say just about anything depending on our choice of priors. This criticism is technically true, but does not reflect the practice of Bayesian modeling done well, nor the advantages of using well-chosen priors.
Just as we can estimate a model using maximum likelihood, we can also estimate the mode of the posterior using penalized likelihood estimation. This is a really great way to start thinking about Bayesian estimation.
To estimate a model using penalized likelihood, we simply need to recognize that if
\[ \mbox{p}(\theta|\, y) \propto \mbox{p}(y|\, \theta)\, \mbox{p}(\theta) \]
then
\[ \log(\mbox{p}(\theta|\, y)) = \log(\mbox{p}(y|\, \theta)) + \log(\mbox{p}(\theta)) + \mbox{constant} \] That is, our log posterior density is, up to an additive constant that does not depend on \(\theta\), the log likelihood plus the log prior density evaluated at a given value of \(\theta\). So we can maximize our posterior by choosing the \(\theta\) that maximizes the right hand side.
We do this in the function below. I’ve been verbose in my implementation to make it very clear what’s going on. The target accumulates the log posterior density (up to that additive constant). To maximize the log posterior density, we minimize the negative target. We’ll choose priors
\[ \alpha,\, \beta \sim \mbox{N}(0,1) \mbox{ and } \sigma \sim \text{N}_{+}(0,1) \]
negative_penalized_log_likelihood <- function(theta, # the unknown parameters
                                              x, y) {
  # This function returns the negative of a value proportional to the
  # log posterior density of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3]),
  # evaluated at a vector y.

  # Initialize the target
  target <- 0
  # Add the log densities of the priors to the accumulator:
  # the scale prior (a standard normal truncated below at 0), evaluated at sigma
  target <- target + log(truncnorm::dtruncnorm(exp(theta[3]), a = 0))
  # the intercept prior, evaluated at the parameter value
  target <- target + log(dnorm(theta[1]))
  # the slope prior, evaluated at the parameter value
  target <- target + log(dnorm(theta[2]))
  # Add the log likelihood
  target <- target + sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
  # Return the negative target
  -target
}
Now we optimize
estimates <- optim(par = c(1, 1, 1), negative_penalized_log_likelihood, x = x, y = y)
estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])
paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)
## [1] "estimate of alpha is 0.94 true value is 1"
paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)
## [1] "estimate of beta is 2.01 true value is 2"
paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)
## [1] "estimate of sigma is 3.15 true value is 3"
And there you go! You’ve estimated your very first (very simple) model using penalized likelihood.
Before you go on and get too enthusiastic about estimating everything with penalized likelihood, be warned: the modal estimator will do fine in fairly low dimensions and with many fairly simple models. But as your models become more complex, you’ll find that it does not do a good job. This is partly because the mode of a high-dimensional distribution can be a long way from its expected value (which is what you really care about), and partly because you often want predictions that take into account the uncertainty in your model’s unknowns. For this, you’ll need to draw samples from your posterior.