One of the things I had difficulty with when learning statistics was getting an intuitive grasp of what a covariance matrix is. I mean, I knew it had to do with correlation and variance: when two variables have a big covariance, they tend to move together; when the covariance is negative, they move in opposite directions. But covariances themselves are hard to interpret. And I was a bad student.

This post is a quick tutorial on covariance matrices, the multivariate normal distribution, the LKJ prior, and Box-Cox transformations.

Some data

First, let’s load the data from yesterday’s post; it contains quarterly growth rates of jobs in Australia and real GDP. We’ll only keep the quarterly observations for jobs growth.

library(Quandl); library(dplyr); library(rstan); library(ggplot2); library(reshape2)
options(mc.cores = parallel::detectCores())

# Real GDP, millions of chained dollars, seasonally adjusted
GDP <- Quandl("AUSBS/5206002_EXPENDITURE_VOLUME_MEASURES_A2304402X") %>% 
  mutate(Date = as.Date(Date)) %>% 
  arrange(Date) %>% 
  mutate(GDP_growth = Value/lag(Value, 1)) %>% # quarterly growth, as a gross rate
  dplyr::select(Date, GDP_growth)

# Employed persons, monthly, seasonally adjusted
EMP <- Quandl("AUSBS/6202001_A84423043C") %>% 
  mutate(Date = as.Date(Date)) %>% 
  arrange(Date) %>% 
  mutate(Emp_growth = Value/lag(Value, 3)) %>% # quarterly growth from monthly data
  dplyr::select(Date, Emp_growth)

# Join series together
full_data <- left_join(EMP, GDP, by = "Date") %>% 
  filter(!is.na(GDP_growth), !is.na(Emp_growth))

Now that we have the series, let’s plot them both with the excellent ggpairs function in the GGally package. Along the diagonal, we have density plots of each variable; on the off-diagonals, we have a bivariate plot showing how the two variables are related.

GGally::ggpairs(full_data[,-1], title = "Pairs plot, GDP and employment growth")

Two things you should notice here:

  1. The two variables have bell-curve shapes with a slightly fat tail on the left-hand side, corresponding to recessions.
  2. The two variables are positively correlated.

What we want to do is come up with a joint distribution of these two variables. Ideally that distribution should be one that the data might plausibly have been drawn from. Although the two marginal densities (down the diagonal) don’t look entirely normal, let’s start by modeling the two series jointly using a multivariate normal density. This distribution takes two parameters: a mean vector \(\mu\) and a covariance matrix \(\Sigma\).

What do these parameters correspond to? The mean vector \(\mu\) contains the means of each series. In this case, something like

Emp_growth GDP_growth
     1.005      1.008
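
These are just the sample means of the two series; a quick check in R:

# Sample means of the two growth series
colMeans(full_data[,-1])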

How about the covariance matrix \(\Sigma\)? Well, this parameter has to do two things at once. It should capture the variation in each variable individually, and it should capture the correlation between the two.

Rather than think of a covariance matrix as serving both purposes, it’s easier to think about the two purposes separately, by splitting the covariance matrix into scales and correlations. The scale vector, \(\tau = (\tau_{1},\, \tau_{2})'\), contains the standard deviations of the two series. It might contain values that look like

Emp_growth GDP_growth
  0.005082    0.00779

The correlations between the two series are captured in a correlation matrix \(\Omega\). They might look a bit like this:

           Emp_growth GDP_growth
Emp_growth     1.0000     0.3412
GDP_growth     0.3412     1.0000
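
The empirical versions of \(\tau\) and \(\Omega\) are easy to compute directly from the data; tau_hat and Omega_hat below are just names for these sample quantities:

# Empirical scales (standard deviations) and correlation matrix
tau_hat   <- apply(full_data[,-1], 2, sd)
Omega_hat <- cor(full_data[,-1])
tau_hat
Omega_hat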

The covariance matrix \(\Sigma\) is then simply defined as the quadratic form of the correlation matrix with the scale vector:

\[ \Sigma = \mbox{diag}(\tau)\Omega\mbox{diag}(\tau) \]

Where \(\mbox{diag}(\tau)\) creates a square matrix with the values of \(\tau\) down the diagonal and zeros everywhere else.
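
We can check this decomposition on the data itself: rebuilding the covariance matrix from the empirical scales and correlations (the tau_hat and Omega_hat objects computed above) gives back exactly what cov() reports.

# Rebuild the covariance matrix from the scales and correlations
Sigma_hat <- diag(tau_hat) %*% Omega_hat %*% diag(tau_hat)
Sigma_hat

# ... which matches the covariance matrix computed directly
cov(full_data[,-1])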

Estimating the parameters of the density

Now, the parameter values printed above (the mean, standard deviations, and correlations) are all summaries of the data. But we really want to draw inferences about the parameters of the model that we are saying might have given rise to the data. We need to estimate the parameters of the model.

For this, we need our generative distribution, which we said before was multivariate normal. Where \(y_{t}\) is the vector containing quarterly employment growth and quarterly GDP growth in period \(t\),

\[ y_{t} \sim \mbox{Multi Normal}(\mu,\, \Sigma) \]

Where

\[ \Sigma = \mbox{diag}(\tau)\Omega\mbox{diag}(\tau) \]
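
To make the generative story concrete, here is a minimal sketch that simulates fake data from this model, plugging in rough values like the ones printed earlier (mu_sim, tau_sim and Omega_sim below are illustrative values, not estimates):

# Simulate fake data from the generative model using illustrative parameter values
mu_sim    <- c(1.005, 1.008)                     # means
tau_sim   <- c(0.005, 0.008)                     # scales (standard deviations)
Omega_sim <- matrix(c(1, 0.34, 0.34, 1), 2, 2)   # correlation matrix
Sigma_sim <- diag(tau_sim) %*% Omega_sim %*% diag(tau_sim)

fake_data <- MASS::mvrnorm(n = 200, mu = mu_sim, Sigma = Sigma_sim)
GGally::ggpairs(as.data.frame(fake_data), title = "Pairs plot, simulated data")

If the model is sensible, the pairs plot of the fake data should look broadly like the pairs plot of the real data above.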

We also need to provide priors. \(\tau\) is a vector of standard deviations, which must be positive. We might give it a half-Cauchy prior, whose heavy tails still allow for potentially large values of each standard deviation:

\[ \tau \sim \mbox{Cauchy}_{+}(0, 2) \]

(In the Stan code below we use a tighter scale, cauchy(0, 0.1), because quarterly growth rates sit very close to 1 and so their standard deviations are tiny.)
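
To get a feel for how heavy those tails are, we can look at a few quantiles of draws from a half-Cauchy (just an illustration):

# Most half-Cauchy(0, 2) draws are small, but the upper tail is very heavy
set.seed(42)
quantile(abs(rcauchy(1e5, location = 0, scale = 2)),
         probs = c(0.5, 0.9, 0.99))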

What distribution could we give to the correlation matrix \(\Omega\)? Such a distribution needs to take into account a few facts about correlation matrices:

  1. Every diagonal element is exactly 1.
  2. Every off-diagonal element is a correlation, and so must lie between -1 and 1.
  3. The matrix must be symmetric and positive (semi-)definite.

One such density that does all of the above is the LKJ density. It takes a single parameter, \(\nu\), which must be positive. When \(\nu=1\), the prior is jointly uniform over all valid correlation matrices. As \(\nu\) grows very large, the prior concentrates around the identity matrix, that is, around zero correlations. For the purposes of the model, we’ll use

\[ \Omega \sim \mbox{LKJ}(4) \]
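
In the two-variable case we have here, the LKJ(\(\nu\)) prior has a handy closed form: the implied prior on the single correlation \(r\) is \((r+1)/2 \sim \mbox{Beta}(\nu,\, \nu)\). That makes it easy to see what different values of \(\nu\) do:

# For P = 2, LKJ(nu) on the correlation r is equivalent to (r + 1)/2 ~ Beta(nu, nu)
lkj_draws <- data.frame(
  nu_1 = 2 * rbeta(1e4, 1, 1) - 1,  # nu = 1: uniform over (-1, 1)
  nu_4 = 2 * rbeta(1e4, 4, 4) - 1   # nu = 4: shrunk towards zero correlation
)
summary(lkj_draws)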

This is implementable using the following Stan code:

// saved as multi_normal_model.stan
data {
  int T; // number of time periods
  int P; // number of variables
  matrix[T, P] Y; // observations
}
parameters {
  vector[P] mu;
  vector<lower = 0>[P] tau;
  corr_matrix[P] Omega;
}
model {
  // priors
  mu ~ normal(1, 1); // growth is expressed as a gross rate, so centred near 1
  tau ~ cauchy(0, .1);
  Omega ~ lkj_corr(4);
  
  // likelihood
  // quad_form_diag(Omega, tau) builds diag(tau) * Omega * diag(tau) = Sigma
  for(t in 1:T) {
    Y[t] ~ multi_normal(mu, quad_form_diag(Omega, tau));
  }
}
generated quantities {
  vector[T] y_lpd;
  
  // pointwise log likelihood of each observation (for model comparison with loo)
  for (t in 1:T) {
    y_lpd[t] = multi_normal_lpdf(Y[t] | mu, quad_form_diag(Omega, tau));
  }
}

which we call from R:

multi_normal_model <- stan("multi_normal_model.stan",
                           data = list(T = nrow(full_data),
                                       P = 2,
                                       # pass a plain numeric matrix to Stan
                                       Y = as.matrix(full_data[,-1])))

And view the parameter estimates like so:

print(multi_normal_model, pars = c("mu", "Omega", "tau"))
## Inference for Stan model: multi_normal_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##            mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## mu[1]      1.00       0 0.00 1.00 1.00 1.00 1.00  1.01  4000    1
## mu[2]      1.01       0 0.00 1.01 1.01 1.01 1.01  1.01  4000    1
## Omega[1,1] 1.00       0 0.00 1.00 1.00 1.00 1.00  1.00  4000  NaN
## Omega[1,2] 0.32       0 0.07 0.18 0.28 0.33 0.37  0.45  1387    1
## Omega[2,1] 0.32       0 0.07 0.18 0.28 0.33 0.37  0.45  1387    1
## Omega[2,2] 1.00       0 0.00 1.00 1.00 1.00 1.00  1.00  3870    1
## tau[1]     0.01       0 0.00 0.00 0.00 0.01 0.01  0.01  1931    1
## tau[2]     0.01       0 0.00 0.01 0.01 0.01 0.01  0.01  1710    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Sep  5 16:02:29 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
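
If we want the implied covariance matrix itself, we can rebuild it from the posterior draws of \(\tau\) and \(\Omega\), just as in the quadratic form above. A quick sketch:

# Rebuild Sigma = diag(tau) Omega diag(tau) at each posterior draw
draws <- rstan::extract(multi_normal_model, pars = c("tau", "Omega"))

Sigma_draws <- sapply(1:nrow(draws$tau), function(i) {
  diag(draws$tau[i, ]) %*% draws$Omega[i, , ] %*% diag(draws$tau[i, ])
})

# Posterior mean of the covariance matrix (reshaped back to 2 x 2)
matrix(rowMeans(Sigma_draws), 2, 2)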

But wait! Wasn’t our data a bit non-normal?

Alert readers will have noticed that the data we’ve just fit a multivariate normal distribution to are a little skewed. In such a case, we might want to model a transformation of the data that looks more normal. In this particular case the transformation doesn’t actually make much of a difference, but the principle stands. One such transformation is the Box-Cox transformation, in which we transform a (strictly positive) series \(y\) with a transformation parameter \(\lambda\) such that

\[ y_{i}^{(\lambda)} = \frac{y_{i}^{\lambda}-1}{\lambda} \]

if \(\lambda\) is not 0, and \(y_{i}^{(\lambda)} = \ln(y_{i})\) otherwise (which is the limit of the expression above as \(\lambda \to 0\)).
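
In R the transformation looks something like this; box_cox below is just an illustrative helper, not part of the model:

# The Box-Cox transformation; as lambda approaches 0 it approaches log(y)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# A couple of trial values of lambda applied to employment growth
head(box_cox(full_data$Emp_growth, lambda = -40))
head(box_cox(full_data$Emp_growth, lambda = 0.5))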

In this case, we will estimate the value of lambda for each series. The transformation is implemented in Stan below:

// saved as multi_normal_model_boxcox.stan
data {
  int T; // number of time periods
  int P; // number of variables
  matrix[T, P] Y; // observations
}
parameters {
  vector[P] mu;
  vector[P] lambda;
  vector<lower = 0>[P] tau;
  corr_matrix[P] Omega;
}
transformed parameters {
  matrix[T, P] Y_transformed;
  
  // This is where we make the Box-Cox transformation.
  // (lambda is a continuous parameter, so the lambda == 0 branch essentially
  // never fires; it is only there to match the definition above.)
  for(p in 1:P) {
    for(t in 1:T) {
      if(lambda[p]==0) {
        Y_transformed[t,p] = log(Y[t,p]);
      } else {
        Y_transformed[t,p] = (Y[t,p]^lambda[p]-1)/lambda[p];
      }
    }
  }
}
model {
  // priors
  mu ~ normal(1, 2);
  lambda ~ normal(0, 15);
  tau ~ cauchy(0, 2);
  Omega ~ lkj_corr(4);
  
  // likelihood, on the transformed data
  // (note: because lambda is estimated, a fully correct treatment would also
  // add the log Jacobian of the Box-Cox transform to the target; we skip that here)
  for(t in 1:T) {
    Y_transformed[t] ~ multi_normal(mu, quad_form_diag(Omega, tau));
  }
}
generated quantities {
  vector[T] y_lpd;
  
  // pointwise log likelihood of each observation (for model comparison with loo)
  for (t in 1:T) {
    y_lpd[t] = multi_normal_lpdf(Y_transformed[t] | mu, quad_form_diag(Omega, tau));
  }
}

which we call from R like so:

boxcox_multinormal <- stan("multi_normal_model_boxcox.stan",
                           data = list(T = nrow(full_data),
                                       P = 2,
                                       # Growth is expressed as a gross rate, so the series
                                       # are already positive, as Box-Cox requires
                                       Y = as.matrix(full_data[,-1])))

Now that we have estimated both the transformed and untransformed models on the same data, we can check which appears to fit the data better. Ideally we’d do full cross-validation, which would involve fitting the model many times and checking the log predictive density on each holdout set. For models like this one, that is too time-consuming. Instead, we use an approximation to leave-one-out cross-validation, available in the loo package.

To implement this model comparison, we first need to extract the pointwise log likelihood from each model: the (log) height of the data density at each data point, evaluated at the parameters of a given posterior draw. This is precisely the object we created in the generated quantities block of the Stan code. Next, we run the loo() function, which performs the approximation to leave-one-out cross-validation. Finally, we compare the two models with compare(), which reports the difference in expected log predictive density between them. If compare() returns a positive value, the expected out-of-sample log predictive density of the second model is higher than that of the first, suggesting it should predict better out of sample.

library(loo)

# Extract the pointwise log likelihoods from both models
mod_1_ll <- extract_log_lik(multi_normal_model, "y_lpd")
mod_2_ll <- extract_log_lik(boxcox_multinormal, "y_lpd")

# Estimate leave-one-out cross validation
loo_1 <- loo(mod_1_ll)
loo_2 <- loo(mod_2_ll)

# Now compare

compare(loo_1, loo_2)
## elpd_diff        se 
##      41.0      19.8

As we can see, the expected log predictive density for the second model (our model with the Box-Cox transformation) is higher than for the simple multivariate normal model.

Edit - what happens in the transformation?

The Box-Cox transformation above rescales the data using the parameters lambda, summarised below.

print(boxcox_multinormal, pars = "lambda")
## Inference for Stan model: multi_normal_model_boxcox.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##             mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff
## lambda[1] -37.96    0.16 8.24 -54.37 -43.46 -37.75 -32.37 -22.07  2747
## lambda[2] -49.02    0.11 6.12 -61.25 -53.15 -49.09 -44.77 -37.13  2901
##           Rhat
## lambda[1]    1
## lambda[2]    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Sep  5 16:04:09 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

What does this imply for the normality of the rescaled data?

# Take our data, which we'll rescale
positive_data <- full_data[,-1]

# Grab the lambdas and take the posterior median of each
lambdas_mean <- as.data.frame(boxcox_multinormal) %>% 
  dplyr::select(contains("lambda")) %>% 
  summarise_each(funs(median)) %>% 
  melt
## No id variables; using all as measure variables
# Transform the data according to the Box-Cox transformation
positive_data[,1] <- (positive_data[,1]^lambdas_mean$value[1] - 1)/-lambdas_mean$value[1]
positive_data[,2] <- (positive_data[,2]^lambdas_mean$value[2] - 1)/-lambdas_mean$value[2]

# Plot the transformed data
GGally::ggpairs(positive_data, title = "Pairs plot, transformed data")

As we can see here, we still have some skewness, though this skewness is towards lower values rather than higher ones. Any gains from this transformation are likely to be small.