Concept

Overfitting

Definition

Overfitting is the failure mode in which a statistical or machine-learning model captures the idiosyncratic noise of its training data instead of the underlying signal — and so performs brilliantly on data it has already seen and poorly on anything new. The classic diagnostic is a widening gap between training-set performance and validation-set performance as model complexity grows. A linear regression with three features may have a modest in-sample R-squared and a similar out-of-sample R-squared; the same regression with three hundred features may explain 95% of the training variance and 0% of the test variance.

The mechanism is straightforward: every additional parameter gives the model another degree of freedom to bend itself around training points. Enough degrees of freedom and any finite dataset can be fit exactly. What is being fit, at that point, is not the systematic relationship but the unrepeatable noise.

Why it matters

How it works

Overfitting arises whenever a model has more capacity than the data can support. Capacity comes from many sources: the number of free parameters, the flexibility of the functional form, the depth of a decision tree, the number of features included, the number of times the modeller iterates on the same training set. Each source increases the model's ability to express any pattern — including the noise pattern that happens to be in this particular sample.

The defences fall into three families. Statistical defences add penalties for complexity: L1 and L2 regularisation, information criteria, early stopping on a validation loss. Procedural defences split the data: train on one slice, validate on another, hold a third in reserve that nobody is permitted to look at until a single final evaluation. Conceptual defences impose structural priors: rather than letting the model find any pattern, restrict it to patterns that make economic or physical sense. The most important defence is the hardest to systematise — limit how many distinct hypotheses are tested against the same dataset, because the probability of one of them passing by chance grows with the number of tries.

Where it goes next

Continue exploring

Tags