Oh my god this happens all the time I swear. What I’m talking about here is the thing where you build a model and think it works, but if you look closer it’s actually doing almost nothing. This post looks at a common yet (in my opinion) not widely discussed trap that you (or better yet, someone else) can fall into.
Even simple models can stuff up
Once upon a time I saw a logit model that was built to predict customer churn. The relevant group in charge of it had performance stats to back up their assertion that it worked well. But a quick glance suggested all was not as it seemed.
What gave this away? It’s pretty simple, in fact. The variables of the model were all pretty straightforward: indicators for products, some customer history variables and some other stuff to handle the frequency of their interactions with the company. So far so good. The problem with all this becomes apparent when you look at the coefficients of the model. Almost all of them were near zero, except for the dummy variable for one product, which was massive.
In other words, the model was just one giant if statement. If you had this product, you would churn, otherwise you wouldn’t. The reason this model seemed ok was that customers with that particular product were indeed churning en masse, which masked performance flaws elsewhere in the sample space.
In the course of a few months’ worth of trying to alert people to this, I realised that to accomplish that aim you need a good explanation for how that could have happened in the first place. This brings us to the villain of this particular story, a thing called quasi-separation.
This page gives as good an explanation as any I’ve seen as to what’s going on here. Long story short, the dataset used to build the model has a severe bias in it, such that it’s possible to make predictions about the label with 100% confidence in some circumstances.
In the case I described above, what had happened was that the model was trained on a dataset that contained no examples of a customer having that one product and also not churning. So maximum likelihood estimation takes that and runs with it and you wind up with a model that’s ridiculously oversensitive to this one variable. The confusion matrix shows this more clearly:
| Doesn’t have product | 20 | 50 |
Any time you see a sample with the HAS_PRODUCT flag, why wouldn’t you predict they churn? It’s a one-way bet.
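If you want to watch maximum likelihood "run with it", here’s a minimal sketch. Big caveats: the counts in `CELLS` are made up for illustration (they’re not the real dataset, and which column is "churned" is my assumption), and `fit_logit` is just a hand-rolled gradient ascent I wrote for this post, not whatever the original team used. The point is only that with an empty "has product, didn’t churn" cell, the product coefficient never settles down. It keeps growing the longer you let the optimiser run.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical counts: (has_product, churned, number_of_customers).
# The last cell is empty: nobody has the product AND stays.
CELLS = [(0, 1, 20), (0, 0, 50), (1, 1, 30), (1, 0, 0)]


def fit_logit(cells, steps, lr=0.5):
    """Gradient ascent on the mean log-likelihood of
    logit(p) = b0 + b1 * has_product. Returns (b0, b1)."""
    n = sum(count for _, _, count in cells)
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y, count in cells:
            resid = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += count * resid
            g1 += count * resid * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


if __name__ == "__main__":
    for steps in (200, 2000, 20000):
        b0, b1 = fit_logit(CELLS, steps)
        p = sigmoid(b0 + b1)
        print(f"steps={steps:>6}  b0={b0:.3f}  b1={b1:.3f}  P(churn | product)={p:.6f}")
```

The intercept settles at the log-odds of churn among customers without the product, which is a perfectly sensible number. The product coefficient just drifts upward forever, because pushing the predicted probability closer to 1 always improves the likelihood a tiny bit. Real solvers stop at some tolerance or iteration cap, so you get a huge but arbitrary coefficient.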
This is a really clear-cut case. Unfortunately, this can still happen in slightly less obvious circumstances. In the link at the start of this section, they talk about a case of exact separation with a continuous rather than binary variable. Similarly, even if the 0 in that confusion matrix was a 2 or a 5 you’d still have the problem. This can also kick in if there’s a combination of variables that exactly splits your labels.
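For the single-variable version of this, the check is simple enough to sketch. `separation_type` below is a hypothetical helper (my name, my labels), assuming a binary 0/1 label and one numeric feature. It asks whether some threshold puts all of one class on one side, either strictly ("complete") or with ties sitting on the boundary ("quasi").

```python
def separation_type(xs, ys):
    """Classify how a single numeric feature splits binary 0/1 labels.

    Returns "complete", "quasi" or "none".
    """
    if len(set(xs)) < 2:
        return "none"  # a constant feature can't separate anything
    neg = [x for x, y in zip(xs, ys) if y == 0]
    pos = [x for x, y in zip(xs, ys) if y == 1]
    if not neg or not pos:
        return "none"  # only one class present, nothing to split
    # Try "low values are class 0" and then the reverse direction.
    for low, high in ((neg, pos), (pos, neg)):
        if max(low) < min(high):
            return "complete"
        if max(low) == min(high):
            return "quasi"
    return "none"


if __name__ == "__main__":
    # Binary flag where nobody with the flag has label 0.
    print(separation_type([0, 0, 0, 1, 1], [0, 1, 0, 1, 1]))
    # Continuous variable with a clean gap between the classes.
    print(separation_type([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))
```

Note this only looks at one variable at a time. It won’t catch the case where a combination of variables does the splitting, and it won’t flag the "2 or a 5 instead of 0" situation either, since that isn’t strictly separation even though it bites you the same way.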
The most obvious knee-jerk reaction is to regularise your model, but that’s a band-aid. Really the data is the problem. And potentially the fact that you’re using a maximum likelihood technique. But that’s a matter for a separate rant some other time.
Now it’s time for a quick FAQ on what we’ve learned!
Q: My neural network doesn’t have that problem
A: Yes it does. If you’ve got separation problems in your data, then you’re still stuffed.
How would I know?[1]
The obvious thing is to run that confusion matrix for every categorical variable you’ve got and look for any that have a cell with barely any observations as a percentage of the total. You might also consider looking at a covariance matrix or two. The other thing you can do is just run your model and look for the signs that it has gone wrong. Remember, there are a few giveaways.
- One of the coefficients is orders of magnitude bigger than the others and the standard errors are massive because the model hasn’t really converged
- Some of your predictions are almost exactly zero or one - there should almost never be a reason for a model to be THAT confident
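Both checks are easy to automate. Here’s a rough sketch, assuming your data is a list of dicts where every non-label column is categorical. `sparse_cells`, `overconfident`, the 1% threshold and the `eps` cut-off are all my own inventions for illustration, so tune them to taste.

```python
from collections import Counter


def sparse_cells(rows, label_key, min_share=0.01):
    """Cross-tabulate every variable against the label and return
    (variable, value, label, count) for cells holding less than
    min_share of all rows, including completely empty cells."""
    n = len(rows)
    labels = sorted({row[label_key] for row in rows})
    findings = []
    for var in rows[0]:
        if var == label_key:
            continue
        counts = Counter((row[var], row[label_key]) for row in rows)
        for value in sorted({row[var] for row in rows}):
            for label in labels:
                count = counts[(value, label)]
                if count / n < min_share:
                    findings.append((var, value, label, count))
    return findings


def overconfident(probabilities, eps=1e-6):
    """Return predicted probabilities suspiciously close to 0 or 1."""
    return [p for p in probabilities if p < eps or p > 1 - eps]


def make_rows(has_product, churned, count):
    # Hypothetical toy data: "plan" is an unrelated, balanced variable.
    return [
        {"has_product": has_product, "plan": "a" if i % 2 else "b", "churned": churned}
        for i in range(count)
    ]


rows = make_rows(0, 1, 20) + make_rows(0, 0, 50) + make_rows(1, 1, 30)

if __name__ == "__main__":
    for finding in sparse_cells(rows, "churned"):
        print("sparse cell:", finding)
    print("overconfident:", overconfident([0.2, 0.9999999, 0.5, 1e-9]))
```

On the toy data this flags only the empty "has product, didn’t churn" cell and leaves the boring `plan` variable alone. It’s a one-variable-at-a-time check, so combinations of variables that split your labels will still sneak past it.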
This is almost surely going to happen to you at some point, if it hasn’t already. Good luck with that haha.
I bet at least a few start-ups have got models that are really doing this under the hood.
[1] Ok fine, this is a Soundgarden joke.