Many time-series exhibit “regimes”—periods for which a set of rules seem to apply. When we jump to a different regime, a different set of rules apply. Yet we never really know which regime we’re in—they aren’t directly observable. This post illustrates the intuition behind building models that have these latent regimes in Stan. They are close in concept to finite mixture models as described here, with the main difference being that we model the changes in state probabilities as being a discrete (hidden) Markov process, rather than a continuous autoregressive process. The following explanation uses James Hamilton’s fantastic notes.
As in the linked post above, we’ll use 100 times the daily difference in logged prices of a stock as the example time-series to model. Call these percentage returns \(r_{t}\). Just as before, we have two states: a random walk state, and a momentum state, (in which returns persist). I’ll assume normal distributions for these daily returns—you shouldn’t do this in real life. Just want to keep things simple. The generative models under the two states are:
State 1: \(r_{t} \sim \mbox{normal}(\alpha_{1}, \sigma_{1})\) (prices take a random walk)
State 2: \(r_{t} \sim \mbox{normal}(\alpha_{2} + \rho r_{t-1}, \sigma_{2})\) (some momentum in returns)
We’ll call the true state in period \(t\) \(s_{t}\). Of course, we don’t actually know what state we’re in on a given day, but we can infer probabilistically if we have a model for it. The particular way we’ll model this discrete state is as a Markov chain. That is, \(p(s_{t}=j\, |\, s_{t-1} = i, s_{t-2} = k, \dots, r_{0}, \dots,r_{t-1}) = p(s_{t}=j\, |\, s_{t-1} = i) = p_{ij}\) —all the information relevant to making a forecast for what the state is in the next period is contained in what the state was the previous period.
It might help to illustrate the Markov matrix explicitly:
\[ p(s_{t} = j) = \begin{bmatrix} p_{11} & p_{12} = 1 - p_{11} \\ p_{21} = 1 - p_{22} & p_{22} \end{bmatrix}_{s_{t-1},j} \]
The problem with fitting this model is that we don’t know what state we were in in the previous period for sure, which necessitates an iterative algorithm.
First, let’s define the likelihood contribution of an observation \(y_{t}\) under the two states
\[ \eta_{1t} = \mbox{Normal}(y_{t} |\, \alpha_{1}, \sigma_{1}) \] \[ \eta_{2t} = \mbox{Normal}(y_{t} |\, \alpha_{2}+ \rho r_{t-1}, \sigma_{2}) \]
In Stan notation, this is just
// inside a for loop over t
eta1[t] = exp(normal_lpdf(y[t] | alpha_1, sigma_1));
eta2[t] = exp(normal_lpdf(y[t] | alpha_2 + rho * y[t-1], sigma_2));
If we knew the state in the previous period for sure, we could evaluate the log likelihood contribution of \(y_{t}\) as
\[ f(y_{t} |\, \theta, s_{t-1} = j) = p_{s_{t-1}1}\eta_{1t} + p_{s_{t-1}2}\eta_{2t} \]
That is, the likelihood contribution is simply the weighted average of likelihood contributions under the two proposed models, where the weights are the probability of being in each state.
Of course, we don’t know the probability of being in each state. Let \(\xi_{jt}\) be the probability of being in state \(j\) in period \(t\), so \(\xi_{jt-1}\) is the probability of being in that state the previous period. This gives us the likelihood contribution
\[ f(y_{t} |\, \theta) = p_{11}\xi_{1t-1}\eta_{1t} + p_{12}\xi_{1t-1}\eta_{2t} + p_{21}\xi_{2t-1}\eta_{2t}+ p_{22}\xi_{2t-1}\eta_{2t} \]
Again, this is just a weighted average of likelihood contributions, weighted now by the probability of being in each state, which itself is uncertain and therefore we have to average across the probability of being in each of those states last period.
But where does \(\xi_{jt}\)s come from? We can back out the probability of being in each state in \(t\) by noticing that the probability of being in each state is
\[ \xi_{jt} = \frac{\sum_{i = 1}^{2}p_{ij}\xi_{jt-1}\eta_{jt}}{f(y_{t} |\, \theta)} \]
Once we have the likelihood contributions, we can calculate the log likelihood
\[ \mathcal{L}(y | \theta) = \sum_{t=1}^{T} \log(f(y_{t} |\, \theta)) \]
The big problem with modeling any mixture-type model is that if the two generative models are fairly similar, the relative likelihoods will be fairly similar and you can wind up with each MCMC chain mis-labeling the different models. Mike Betancourt has a wonderful primer on this here. In order to identify the model, we need prior information that makes each of the potential data generating processes distinguishable from each other. In the case of the model above, the case where \(\rho = 0\) makes the data generating processes the same, so we need a strong prior on this parameter in order to identify the model. You’ll know that your regime-switching model is poorly identified when your rhats are larger than 1 but each chain is mixing well when considered individually.
We want the autoregressive model to be distinguishable from the white-noise model. So let’s go with a fairly tight prior on \(\rho \sim \text{Normal}(1, .1)\). If we relax this prior too much, we expect that \(\rho\) will get dragged towards 0 and the the model will get less identifiable.
\(\alpha_{1,2}\) is the daily expected percentage point return, which should be unbounded. With 252 trading days a year, a \(\mbox{Normal}(0,.1)\) would imply one-standard-deviation annual returns of 25%, which seems reasonable.
The innovation scales must be positive—we’ll use half cauchy (\(\sigma_{1,2} \sim \text{Cauchy}_{+}(0, 1)\)).
\(p_{11},\, p_{22}\) must be in (0, 1), and we expect considerable stickiness of states so we’ll use a \(\beta(10, 2)\).
Finally the initial values of \(\xi_{1t=0}\) we’ll use \(\beta(2, 2)\).
First, let’s get the same data we used for the last example (note that the Quandl code has changed now that they want to make money).
library(tidyverse); library(rstan); library(Quandl)
# Now with real data!
googl <- Quandl("WIKI/GOOGL", collapse = "weekly")
googl <- googl %>%
mutate(Date = as.Date(Date)) %>%
arrange(Date) %>%
mutate(l_ac = log(`Adj. Close`),
dl_ac = c(NA, diff(l_ac))) %>%
filter(Date > "2010-01-01")
plot.ts(googl$dl_ac, main = "Weekly changes in Google's adjusted close price")
The Stan code for the model is below. Note that I’ve not used the log_sum_exp
operator, but you could do this. Also you could use some matrix algebra to simplify the creation of f
and xi
in the transformed parameters
block—you’ll probably want to do this if you have more than one state.
// saved as regime_switching_model.stan
data {
int T;
vector[T] y;
}
parameters {
vector<lower = 0, upper = 1>[2] p;
real<lower = 0> rho;
vector[2] alpha;
vector<lower = 0>[2] sigma;
real<lower = 0, upper = 1> xi1_init;
real y_tm1_init;
}
transformed parameters {
matrix[T, 2] eta;
matrix[T, 2] xi;
vector[T] f;
// fill in etas
for(t in 1:T) {
eta[t,1] = exp(normal_lpdf(y[t]| alpha[1], sigma[1]));
if(t==1) {
eta[t,2] = exp(normal_lpdf(y[t]| alpha[2] + rho * y_tm1_init, sigma[2]));
} else {
eta[t,2] = exp(normal_lpdf(y[t]| alpha[2] + rho * y[t-1], sigma[2]));
}
}
// work out likelihood contributions
for(t in 1:T) {
// for the first observation
if(t==1) {
f[t] = p[1]*xi1_init*eta[t,1] + // stay in state 1
(1 - p[1])*xi1_init*eta[t,2] + // transition from 1 to 2
p[2]*(1 - xi1_init)*eta[t,2] + // stay in state 2
(1 - p[2])*(1 - xi1_init)*eta[t,1]; // transition from 2 to 1
xi[t,1] = (p[1]*xi1_init*eta[t,1] +(1 - p[2])*(1 - xi1_init)*eta[t,1])/f[t];
xi[t,2] = 1.0 - xi[t,1];
} else {
// and for the rest
f[t] = p[1]*xi[t-1,1]*eta[t,1] + // stay in state 1
(1 - p[1])*xi[t-1,1]*eta[t,2] + // transition from 1 to 2
p[2]*xi[t-1,2]*eta[t,2] + // stay in state 2
(1 - p[2])*xi[t-1,2]*eta[t,1]; // transition from 2 to 1
// work out xi
xi[t,1] = (p[1]*xi[t-1,1]*eta[t,1] +(1 - p[2])*xi[t-1,2]*eta[t,1])/f[t];
// there are only two states so the probability of the other state is 1 - prob of the first
xi[t,2] = 1.0 - xi[t,1];
}
}
}
model {
// priors
p ~ beta(10, 2);
rho ~ normal(1, .1);
alpha ~ normal(0, .1);
sigma ~ cauchy(0, 1);
xi1_init ~ beta(2, 2);
y_tm1_init ~ normal(0, .1);
// likelihood is really easy here!
target += sum(log(f));
}
Next, we compile the model
options(mc.cores = parallel::detectCores())
compiled_model <- stan_model("regime_switching_model.stan")
googl_mod <- sampling(compiled_model, data= list(T = nrow(googl), y = googl$dl_ac*100), iter = 1000, chains = 4)
We can check that the model has converged (our Rhats should be around 1).
print(googl_mod, pars = c("alpha", "rho", "p", "sigma"))
## Inference for Stan model: regime_switching_model.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha[1] 0.18 0.00 0.15 -0.11 0.07 0.18 0.28 0.47 2000 1
## alpha[2] 0.27 0.01 0.66 -0.96 -0.20 0.28 0.72 1.55 2000 1
## rho 0.95 0.00 0.11 0.75 0.88 0.95 1.02 1.17 2000 1
## p[1] 0.95 0.00 0.02 0.91 0.94 0.96 0.97 0.98 1762 1
## p[2] 0.53 0.00 0.13 0.29 0.44 0.53 0.62 0.78 2000 1
## sigma[1] 2.86 0.00 0.14 2.60 2.77 2.86 2.95 3.13 2000 1
## sigma[2] 9.21 0.04 1.89 6.48 7.90 8.93 10.14 13.83 2000 1
##
## Samples were drawn using NUTS(diag_e) at Thu Mar 29 11:32:39 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
So what do the state probabilities look like? Let’s extract the first column of xi
and plot it over time.
goog <- as.data.frame(googl_mod, pars = "xi") %>%
gather(par, value) %>%
filter(grepl(",1", par)) %>%
mutate(time = readr::parse_number(stringr::str_extract(par, "[0-9]{1,3}[,]"))) %>%
group_by(par) %>%
summarise(time = first(time),
mean = mean(value),
lower = quantile(value, .025),
upper = quantile(value, .975)) %>%
mutate(date = readr::parse_date(googl$Date)) %>%
ggplot(aes(x = date, y = mean)) +
#geom_point(alpha = 0.3) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "green") +
labs(title ="Probability of being in random walk state\n vsv momentum state",
y = "Probability",
subtitle = "GOOGL")
close <- ggplot(data = googl,aes(x = Date, y = `Adj. Close`)) +
geom_line(colour = "green")
gridExtra::grid.arrange(goog, close)
As we can see, there are relatively few episodes of genuine momentum activity, and the week-to-week persistence of momentum is only slightly above .5. So I’d not be too gung-ho about trading on this model just yet.
As it is, we’ve specified the Markov transition probabilities \(p_{11}\) and \(p_{22}\) as parameters. Yet we have a wealth of information that might provide information about how those probabilities might change. In Stan, it is extremely easy to model these parameters directly, for instance using
\[ p_{jjt} = \text{Logit}^{-1}(f(\mathcal{I}_{t})) \]
where \(f(\mathcal{I}_{t})\) is some unbounded function of the information available at the beginning of the week. But that’s for another day!