Model selection frameworks

What is model selection?

Suppose we have \(p\) features about \(n\) things (e.g. 11 distinct features of 32 cars), and we want to use that information to try to predict an additional feature of each thing. How do we choose which of the \(p\) predictors should be included in the model, and which should not? Some are likely irrelevant because they are truly statistically independent of the thing we're predicting, and including them in the model is overfitting by definition (fitting to noise, because there is no signal). But which are the ones we have license to ignore?
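To make the setup concrete, here's a minimal sketch in Python, with synthetic data standing in for something like the 32-car example (the "true" coefficients are made up purely for illustration):

```python
import numpy as np

# Synthetic stand-in for the setup above: n = 32 things, p = 11 candidate
# features, and a response y we want to predict from them.
rng = np.random.default_rng(0)
n, p = 32, 11
X = rng.normal(size=(n, p))             # n x p matrix of candidate predictors
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]             # only 2 predictors carry real signal
y = X @ beta_true + rng.normal(size=n)  # the rest of y is noise
```

The model selection question is then: which of the 11 columns of \(X\) deserve nonzero coefficients?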

What is an optimal model?

I don’t know why it took me so long to realize this, but almost every model selection technique I’m aware of boils down to:

\[\mbox{argmin}_{\beta}~~|X\beta - y|^2 + \lambda|\beta|_i\]

Here \(X\) is the \(n \times p\) matrix of predictors, \(y\) is a column vector containing the dependent variable, and \(\beta\) is the coefficient vector we’re trying to solve for. If an element of \(\beta\) is 0, the corresponding variable can be ignored.

The first term, \(|X\beta - y|^2\), is the prediction error. \(X\beta\) is the prediction, \(y\) is the actual value, and the difference is how far off we are. If we find a \(\beta\) that minimizes this term, we’ve found a model that makes very accurate predictions on the training data. (An aside: more generally, this term might be replaced by a deviance (negative log-likelihood) in the case of maximum likelihood estimation.)
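As a sketch (reusing the synthetic \(X\) and \(y\) from above), this term is just the residual sum of squares, and ordinary least squares is what you get if you minimize it with no penalty at all:

```python
import numpy as np

def prediction_error(X, y, beta):
    """Residual sum of squares |X beta - y|^2."""
    resid = X @ beta - y
    return float(resid @ resid)

# Ordinary least squares minimizes this term alone, with no penalty.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```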

However, if our only objective were to minimize the prediction error, we would always include all predictors, since prediction error on the training dataset will monotonically decrease as we add more predictors, regardless of whether the predictors are relevant or pure noise.
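A quick sketch of this effect: fit ordinary least squares to a pure-noise response, adding more pure-noise predictors each time, and watch the training error fall anyway:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32
y = rng.normal(size=n)              # response is pure noise: no signal at all
noise = rng.normal(size=(n, 20))    # 20 pure-noise "predictors"

for k in [1, 5, 10, 20]:
    Xk = noise[:, :k]
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    sse = np.sum((Xk @ beta - y) ** 2)
    print(f"{k:2d} noise predictors -> training SSE = {sse:.2f}")
# The training SSE keeps shrinking as k grows, even though nothing is real.
```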

Therefore, model selection methods include a second term, \(\lambda|\beta|_i\), to penalize the number of predictors or the magnitudes of their effects. \(|\beta|_i\) is the \(L^i\) norm of the coefficient vector, and \(\lambda\) sets how strongly to penalize the magnitude. If we choose \(i=0\), then \(|\beta|_0\) is just the number of non-zero coefficients - exactly the number of parameters in the model. AIC, BIC, Cp, and a number of other model selection criteria use this approach, penalizing by the number of parameters in the model, and differ only in their choice of \(\lambda\). If we choose \(i=1\), we penalize the sum of absolute values of the parameters - this is known as lasso regression. And if we choose \(i=2\), we penalize large parameters far more than small ones (strictly speaking, ridge penalizes the squared \(L^2\) norm). Choosing \(i=2\) is called ridge regression, and isn’t actually a great choice for deciding which predictors are relevant, as it doesn’t force irrelevant parameters to zero - it just makes them small.
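Putting the pieces together, here's a minimal sketch of the generic objective for the three norms (for \(i=2\) it uses the squared \(L^2\) norm, as ridge regression does):

```python
import numpy as np

def penalty(beta, i):
    """The |beta|_i term for i = 0, 1, 2."""
    beta = np.asarray(beta, dtype=float)
    if i == 0:
        return np.count_nonzero(beta)   # number of parameters (AIC/BIC/Cp)
    if i == 1:
        return np.sum(np.abs(beta))     # lasso
    if i == 2:
        return np.sum(beta ** 2)        # ridge (squared L2 norm)
    raise ValueError("i must be 0, 1, or 2")

def objective(X, y, beta, lam, i):
    """|X beta - y|^2 + lambda * |beta|_i, the quantity being minimized."""
    resid = X @ beta - y
    return float(resid @ resid) + lam * penalty(beta, i)
```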

Comparing model selection approaches

| Model selection method/criteria | Regularization norm | \(\lambda\) |
|---|---|---|
| Cp | \(L^0\) | 2 |
| AIC | \(L^0\) | 2 |
| BIC | \(L^0\) | \(\log(n)\) |
| Best Subset Regression | \(L^0\) | vary it |
| Lasso | \(L^1\) | vary it, choose by cross-validation |
| Elastic Net | \(L^1 + L^2\) | vary it, choose by cross-validation |
| Ridge | \(L^2\) | vary it, choose by cross-validation |
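For the "vary it, choose by cross-validation" rows, a sketch using scikit-learn (which calls \(\lambda\) `alpha`) shows the qualitative difference between lasso and ridge on synthetic data like the earlier example:

```python
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(2)
n, p = 32, 11
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]             # only 2 predictors carry real signal
y = X @ beta_true + rng.normal(size=n)

# Lasso: lambda (alpha) chosen by cross-validation; irrelevant coefficients
# are typically driven exactly to zero.
lasso = LassoCV(cv=5).fit(X, y)
print("lasso alpha:", round(lasso.alpha_, 3))
print("lasso coefs:", np.round(lasso.coef_, 2))

# Ridge at the same penalty strength: coefficients shrink but rarely hit zero.
ridge = Ridge(alpha=lasso.alpha_).fit(X, y)
print("ridge coefs:", np.round(ridge.coef_, 2))
```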

A couple of other approaches worth noting that don’t quite fit in the table:

  • Stepwise regression: typically tries to optimize AIC, BIC, Cp, or the number of statistically significant parameters (based on t-tests). This is actually a greedy search strategy rather than an objective function; a minimal sketch follows this list.
  • Bayesian Model Averaging (BMA): In this framework, the most probable model is the one with the smallest BIC; however, to determine whether or not a predictor is in the ‘true’ model, one calculates the total probability of all models containing that predictor (see the second sketch after this list). BMA has clear ties to best subset regression (as it needs to evaluate all possible combinations of predictors, or at least all probable ones), as well as to BIC-based model selection.
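As promised above, here's a minimal sketch of one common flavor of stepwise regression: greedy forward selection by AIC (the AIC here drops additive constants and, for brevity, ignores the intercept):

```python
import numpy as np

def aic(X, y):
    """AIC, up to an additive constant, for an OLS fit of y on the columns of X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((X @ beta - y) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    """Greedily add whichever predictor most improves AIC; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best = np.inf
    while remaining:
        score, j = min((aic(X[:, selected + [j]], y), j) for j in remaining)
        if score >= best:
            break                       # no remaining predictor improves AIC
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```

Backward elimination and bidirectional variants work the same way, just starting from the full model or also allowing removals. And here's a toy enumeration-based sketch of the BMA inclusion-probability idea, using the standard approximation that a model's posterior probability is proportional to \(e^{-\mathrm{BIC}/2}\) under a uniform prior over models (only feasible for small \(p\), which is exactly the tie to best subset regression):

```python
import numpy as np
from itertools import combinations

def bic(X, y):
    """BIC, up to an additive constant, for an OLS fit of y on the columns of X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((X @ beta - y) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def inclusion_probabilities(X, y):
    """Approximate P(predictor j is in the true model) by BIC-weighted
    averaging over every non-empty subset of predictors."""
    p = X.shape[1]
    subsets, scores = [], []
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            subsets.append(subset)
            scores.append(bic(X[:, list(subset)], y))
    scores = np.array(scores)
    weights = np.exp(-(scores - scores.min()) / 2)  # exp(-BIC/2), stabilized
    weights /= weights.sum()
    return np.array([weights[[j in s for s in subsets]].sum() for j in range(p)])
```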