This post describes a generative model for classification when then number of potential classes is very large, both relative to the number of ovservations and the number of features. I set up some fake data with 1000 labels, 4500 observations and 30 features. On 500 hold-out cases, the model generates predictions with around 45% accuracy. By contrast, a stock random forest has 32% accuracy, and an evening of fart-arsing around in Keras struggles to beat that.

The reason for the better classification is because we explicitly model the features as emerging from the label, in such

Last week I was at Sawtooth Conference in Orlando where I finally met the wonderful Elea McDonnell Feit. Elea had an interesting problem. Occasionally, we have data that look something like this:

User id | Activity 1 | Activity 2 | Activity 3 | Activity 4 | Activity 5 |
---|---|---|---|---|---|

49 | 1 | 0 | 1 | 0 | 1 |

277 | 1 | 0 | 0 | 0 | 0 |

NA | 0 | 1 | 0 | 0 | 0 |

NA | 0 | 0 | 1 | 0 | 0 |

655 | 1 | 0 | 0 | 0 | 0 |

… | … | … | … | … | … |

That is, often we have activity data but only sometimes observe the User IDs, but not always. For instance, you might run news website, and users sometimes log onto your site in incognito mode to avoid the paywall. While on your site, they engage in “activities”, which would be any binary variables that describe user behaviour, such as operating system dummies, browser, parts of the site etc. Given a (potentially incomplete) observation of user behaviour, we want to generate a probabilistic classification for what User ID they have.

The model I have in mind is something like the following:

We have users indexed by \(i \in 1:I\) engaging in activities \(a \in 1:A\). The probability that user \(i\) engages in activity \(a\) during a visit is \(\text{Logit}^{-1}(\alpha_{ia})\). We observe \(N\) site visits, with \(N>>I\). We’ll simulate 5000 visits data for 30 activities and 1000 users.

```
set.seed(47)
N <- 5000
I <- 1000
A <- 30
user <- rep(NA, I)
activities <- matrix(NA, N, A)
# Draw some generative values of alpha
alpha <- matrix(rnorm(I*A, 0, 2), I, A)
# Draw some 0-1 activities with probabilities given by logit^-1(alpha)
for(n in 1:N) {
# Draw the user
user[n] <- sample(1:I, 1)
# Given her alphas, draw an activity profile
activities[n,] <- as.numeric(rbinom(A, 1, arm::invlogit(alpha[user[n],])))
}
# Now let's make some test data
user_2 <- user
user_2[(N-500):N] <- -1 # We use -1 in place of NA so that Stan doesn't mind
```

which gives us data that look like this (the crossover between labelled and unlabelled data shown):

User ID | V1 | V2 | V3 | V4 | V5 | V6 |
---|---|---|---|---|---|---|

809 | 0 | 0 | 0 | 1 | 0 | 1 |

994 | 1 | 1 | 1 | 1 | 1 | 0 |

909 | 0 | 1 | 1 | 1 | 0 | 1 |

-1 | 0 | 1 | 0 | 1 | 1 | 0 |

-1 | 0 | 1 | 0 | 0 | 0 | 1 |

-1 | 1 | 0 | 1 | 0 | 0 | 1 |

… | … | … | … | … | … | … |

If you’re told to generate predictions for data that look like this, it’d be pretty natural to go to a workhorse model from machine learning, like a Random Forest. Let’s see how this goes

```
library(randomForest)
# Fit the model
if("model_fits.RData" %in% dir()) {
load("model_fits.RData")
} else {
rf_fit <- randomForest(factor(user_2[user_2>0]) ~ ., data = as.data.frame(activities[user_2>0,]))
}
# Make some predictions for the hold-out set
predictions <- data.frame(predictions = predict(rf_fit, newdata = as.data.frame(activities)),
actuals = user,
i2 = user_2) %>%
filter(i2<0)
# What proportion of predictions were correct?
acc <- mean(predictions$predictions== predictions$actuals)
```

So there we go: predictive accuracy of 31.34% on hold-out data–pretty good! Especially given the huge number of potential labels, small number of observations, and small number of features, I’m extremely impressed. Guessing at random will give us a .1% accuracy. Can we do better?

Another way of classifying observations is to build a generative model of our *activities*, not the labels. Under this paradigm, each user \(i\) has their own set of parameters \(\alpha_{i} = (\alpha_{i1}, \, \dots, \, \alpha_{iA})\). If we apply the inverse Logit function, element-wise, to this vector we get the probabilities of each action for person \(i\) (the inverse logit function simply converts each alpha into a probability). Our assumption is that a vector of actions during some visit \(v\) (we assume that each user might visit multiple times), \(y_{i}\) is distribued Bernoulli with these probabilities

\[ y_{iv} \sim \text{Bernoulli}((\text{Logit}^{-1}(\alpha_{i1}), \dots, \text{Logit}^{-1}(\alpha_{iA}))') \]

Once we make this modeling assumption, we can do three things:

- For a given set of parameters, compute the likelihood of a labelled observation \(y_{iv}\)
- For an unlabelled observation, compute what its likelihood contribution would have been had it come from the model associated with every other users’ parameters. From this we can calculate the probability of it being an observation from that user.
- Calculate the log likelihood of the model and estimate the parameters by Bayesian methods or, as I do below, using Penalized likelihood.

Let’s say we have a labelled observation \(y_{iv}\), that is, one for which we know the user. There are an associated set of parameters \(\alpha_{i}\) with this user. With the model above, we can calculate the likelihood contribution of this observation.

\[ \mathcal{L}(y_{iv}| \alpha_{i}) = \prod_{a = 1}^{A}\text{Logit}^{-1}(\alpha_{ia})^{y_{iva}}\times(1 - \text{Logit}^{-1}(\alpha_{ia}))^{(1 - y_{iva})}) \]

Where \(y_{iva}\) is the \(a\)’th element of \(y_{iv}\), and is either 0 or 1. Taking the log gives us the log likelihood contribution of the observation.

\[ \log(\mathcal{L}(y_{iv}| \alpha_{i})) = \sum_{a = 1}^{A}y_{iva}\log(\text{Logit}^{-1}(\alpha_{ia})) + (1 - y_{iva})\log(1 - \text{Logit}^{-1}(\alpha_{ia}))) \]

In order to construct the log likelihood of the data, we must work out the likelihood contribution of the labelled *and unlabelled* observations.

The likelihood contribution of an unlabelled observation is the weighted average of likelihood contributions under the proposed models, where the weights are the probability that the observation came from those models. But where do we get this probability?

Recall above that the likelihood of an observation belonging to some unlabelled person \(i'\) given a set of parameters belonging to \(i\) is

\[ \mathcal{L}(y_{i'v}| \alpha_{i}) = \prod_{a = 1}^{A}\text{Logit}^{-1}(\alpha_{ia})^{y_{i'va}}\times(1 - \text{Logit}^{-1}(\alpha_{ia}))^{(1 - y_{i'va})}) \]

Well the probability that it belongs to person \(i\) is simply the relative likelihood

\[ \text{prob} (i'= i) = \frac{\mathcal{L}(y_{i'v}| \alpha_{i})}{\sum_{i=1}^{I}\mathcal{L}(y_{i'v}| \alpha_{i})} \]

This might look familiar! It’s the softmax of the possible log likelihoods of \(y_{i'v}\) across all possible \(i\)s.

\[ \text{prob} (i'= i) = \frac{\exp(\log(\mathcal{L}(y_{i'v}| \alpha_{i})))}{\sum_{i=1}^{I}\exp(\log(\mathcal{L}(y_{i'v}| \alpha_{i})))} \]

So all we need to do is for every unlabelled observation, calculate the log likelihood contribution across all possible \(i\)s, and take the softmax. This is precisely what happens in the `transformed parameters`

block in the code below.

The full log likelihood is simply the sum of the log likelihood contributions of each observations. We know how to do this already for the labelled observations, but how about the unlabelled ones?

The *likelihood* (not log likelhood) contributions of an unlabelled observation is the weighted average likelihood for each possible users’ model, where the weights are calculated as above. That is

\[ \mathcal{L}(y_{i'v}|\alpha) = \text{prob}(i'=1)\times \mathcal{L}(y_{i'v}|\alpha_{1}) + \dots + \text{prob}(i'=A)\times \mathcal{L}(y_{i'v}|\alpha_{A}) \]

Therefore the log likelhood is simply the sum of this

\[ \log(\mathcal{L}(y_{i'v}|\alpha)) = \log(\text{prob}(i'=1)\times \mathcal{L}(y_{i'v}|\alpha_{1}) + \dots + \text{prob}(i'=A)\times \mathcal{L}(y_{i'v}|\alpha_{A})) \]

This is *precisely* the situation in which we want to wheel out the `log_sum_exp()`

function, which is defined as

`log_sum_exp(x) = log(exp(x[1]) + exp(x[2]) + ...)`

Except here, the `x[1]`

is our log likelihood for \(i=1\) and so on. This is implemented in Stan below.

```
// saved as generative_classifier.stan
data {
int N; // number of observations
int I; // number of users
int A; // number of activities
int user[N]; // user indicator for every observation; -1 if user is missing
matrix[N, A] activities;
}
parameters {
matrix[I, A] alpha;
}
transformed parameters {
matrix[N, I] user_probabilities = rep_matrix(0.0, N, I);
for(n in 1:N) {
// the user is known
if(user[n] > 0) {
user_probabilities[n, user[n]] = 1.0;
} else {
// the user is unknown
vector[I] tmp;
// loop through users
for(i in 1:I) {
tmp[i] = log(activities[n] * inv_logit(alpha[i]') + (1.0 - activities[n]) * (1.0 - inv_logit(alpha[i]')));
}
user_probabilities[n] = softmax(tmp)';
}
}
}
model {
// Give prior to alpha
to_vector(alpha) ~ normal(0, 2);
// calculate the log likelihood
for(n in 1:N) {
if(user[n] > 0) {
// log likelihood in the case that we observe the user
target += activities[n] * log(inv_logit(alpha[user[n]]')) +
(1.0 - activities[n]) * log(1.0 - inv_logit(alpha[user[n]]'));
} else {
// log likelihood in the case we do not observe the user
target += log_sum_exp(log(user_probabilities[n]') + // probability of being each user
(activities[n] * log(inv_logit(alpha')) + // plus the log likelihood in each state
(1.0 - activities[n]) * log((1.0 - inv_logit(alpha'))))');
}
}
}
```

Finally we compile and fit it in R. On my old Macbook pro, it fits in a few minutes for the example, which was far less than my abortive attempts with xgboost and far, far less than my evening spent trying to get a neural network to do anything sensible at all.

```
library(rstan)
compiled_model <- stan_model("generative_classifier.stan")
# Let's put some missing values (encoded as -1) in the individual/type index
# Create a list of data for Stan
data_list <- list(N = N,
I = I,
A = A,
user = user_2,
activities = activities)
# Fit the model
if(!"model_fits.RData" %in% dir()) {
optim_est <- optimizing(compiled_model, data_list)
save(rf_fit, optim_est, file = "model_fits.RData")
}
```

Let’s get the implied probablities and see how often the highest probability went to the true user.

```
# Extract the fitted probabilities
implied_probs <- optim_est$par[grepl("user_probabilities", names(optim_est$par))]
implied_probs <- matrix(implied_probs, N, I) %>% as_data_frame()
implied_probs <- implied_probs[user_2<0,]
# What proportion were predicted correctly?
acc_2 <- mean(apply(implied_probs, 1, which.max) == user[user_2<0])
```

This fairly simple model with a mere 30k parameters manages far better performance than the random forest, classifying 43.51% of observations correctly. That’s pretty phenomenal, given the problem.

The point of this post is that modeling your features can help you massively improve your classification models when you don’t have a huge amount of data.

Of course, we don’t only have set such a problem up with binary activities. You could equally have continuous variables, or ordinal variables, categorical variables etc. The likelihood would be a bit different, but the concept the same.

I’d be especially keen to hear from my friends in machine learning about how they might tackle the same problem.

Jim