AIWiki
Overfitting

Overfitting is a modelling error in machine learning where a model learns the training data too closely, including its noise, and consequently performs poorly on new, unseen data.

5 min read · Last updated June 2026 · Foundations

Overfitting occurs when a machine learning model captures not only the underlying pattern in its training data but also the random noise and idiosyncrasies particular to that sample. Such a model achieves very low error on the data it was trained on yet fails to generalise, producing markedly worse predictions when presented with new examples. Overfitting is one of the central obstacles in applied machine learning, and managing it is a routine part of model development.

How overfitting arises

A model has a certain capacity, loosely the richness of the relationships it can represent. When capacity is large relative to the amount and quality of training data, the model has enough flexibility to memorise individual data points rather than infer the general rule that produced them. A high-degree polynomial fitted to a handful of points, for example, can pass exactly through every point while oscillating wildly between them. The fit looks perfect on the training set and is useless for prediction.
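
The polynomial example can be reproduced with a short sketch in plain Python. It is only an illustration under stated assumptions: the target function (the Runge function, noise-free), the 11 equally spaced training points, and the helper name `lagrange_interpolate` are choices made here, not part of the article.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs)-1 through (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def runge(x):
    # Assumed target function; smooth, but hard for high-degree polynomials.
    return 1.0 / (1.0 + 25.0 * x * x)


# A handful of training points: 11 equally spaced values on [-1, 1].
train_x = [-1.0 + 0.2 * k for k in range(11)]
train_y = [runge(x) for x in train_x]

# A dense grid of unseen points on the same interval.
test_x = [-1.0 + 0.01 * k for k in range(201)]

# Largest absolute error on the training points and on the unseen points.
train_error = max(
    abs(lagrange_interpolate(train_x, train_y, x) - y)
    for x, y in zip(train_x, train_y)
)
test_error = max(
    abs(lagrange_interpolate(train_x, train_y, x) - runge(x)) for x in test_x
)

print(f"max training error: {train_error:.3g}")
print(f"max error on unseen points: {test_error:.3g}")
```

The degree-10 polynomial reproduces every training point, yet its error between the points, especially near the ends of the interval, is far larger. This is the same pattern of a perfect training fit with poor prediction that the paragraph describes.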

Several conditions encourage overfitting: an overly complex model architecture, too few training examples, noisy or mislabelled data, training for too many iterations, and the presence of features that are only spuriously correlated with the target. Deep neural networks, with millions or billions of parameters, are especially prone to it unless deliberately constrained.

Detecting overfitting

The standard diagnostic is to compare performance on the data the model was trained on against performance on a held-out validation or test set. A widening gap, where training accuracy keeps improving while validation accuracy stalls or declines, is the signature of overfitting. Plotting both metrics against training time produces learning curves, and the point at which the validation curve flattens or turns is commonly used to decide when to stop training.
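
As a rough sketch of how this comparison might be automated, the snippet below scans a validation curve for the point where it stops improving. It assumes per-epoch losses (lower is better) rather than accuracies. The function name `best_stopping_epoch`, the `patience` parameter, and the loss values are all made up for illustration.

```python
def best_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the lowest validation loss, scanning until
    `patience` consecutive epochs have passed without a new best."""
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch


# Hypothetical curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
val_losses = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.66, 0.72]

stop = best_stopping_epoch(val_losses)
final_gap = val_losses[-1] - train_losses[-1]

print(f"best epoch: {stop}")
print(f"final train/validation gap: {final_gap:.2f}")
```

With these illustrative numbers the validation loss is lowest at epoch 3, after which the two curves diverge. A growing `final_gap` alongside a falling training loss is the pattern to look for.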

Cross-validation, in which the data is repeatedly partitioned into training and validation folds, gives a more robust estimate of how well a model will generalise and reduces the chance of being misled by a single fortunate or unfortunate split.
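
The partitioning step can be sketched in plain Python as follows. The helper names `k_fold_indices` and `cross_validate`, the mean-predictor model, and the toy data are illustrative assumptions. The folds here are contiguous for simplicity, whereas in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs so that every sample appears in
    exactly one validation fold."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean squared validation error for each of the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return scores


# Hypothetical model: always predict the mean of the training targets.
def fit_mean(train_xs, train_ys):
    return sum(train_ys) / len(train_ys)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [float(v) for v in xs]
scores = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
print(scores)
```

Averaging the per-fold scores gives a single estimate of generalisation error, and their spread indicates how much that estimate depends on the particular split.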

The bias-variance tradeoff

Overfitting is best understood through the bias-variance decomposition of prediction error. A model that overfits has low bias but high variance: it is highly sensitive to the particular training sample, so small changes in the data produce large changes in the fitted model. Underfitting is the mirror image, with high bias and low variance. The practitioner seeks the middle ground where total error is minimised. This tradeoff frames most of the techniques used to control overfitting.
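
For squared-error loss the decomposition can be written out explicitly. The notation is an assumption made here rather than taken from the article: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², the fitted model is f̂, and the expectation is taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An overfitted model keeps the first term small while inflating the second. The third term is a floor that no choice of model can remove.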

Preventing and reducing overfitting

A range of methods address overfitting. Gathering more representative training data is the most direct, since a richer sample makes memorisation harder and the true pattern more evident. Regularisation techniques such as L1 and L2 penalties discourage large parameter values and thereby simplify the learned function. In neural networks, dropout randomly disables units during training, while early stopping halts optimisation once validation performance ceases to improve. Reducing model size, pruning features, and using data augmentation to synthetically expand the training set are also widely applied. Ensemble methods such as bagging reduce variance by averaging many models.
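
The effect of an L2 penalty can be seen in the simplest possible case, a straight line through the origin with a single weight. The sketch below is a toy illustration: the function name `ridge_slope`, the penalty strength `lam`, and the data are assumptions chosen for the example.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimiser of sum((y - w*x)**2) + lam * w**2 over w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_unpenalised = ridge_slope(xs, ys, lam=0.0)
w_penalised = ridge_slope(xs, ys, lam=10.0)

print(f"slope without penalty: {w_unpenalised:.4f}")
print(f"slope with lam=10:     {w_penalised:.4f}")
```

Increasing `lam` pulls the weight towards zero. In models with many parameters the same shrinkage limits how sharply the fitted function can bend to follow individual points.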

The table below summarises common remedies and the mechanism by which each helps.

| Technique | Mechanism |
| --- | --- |
| More training data | Makes memorisation harder, clarifies signal |
| L1 / L2 regularisation | Penalises complex parameter configurations |
| Dropout | Prevents co-adaptation of neurons |
| Early stopping | Stops before noise is fitted |
| Data augmentation | Expands effective dataset size |
| Cross-validation | Detects poor generalisation early |
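
As an illustration of the dropout entry, the following sketch applies a random mask to a list of activations. It assumes the common "inverted dropout" convention, in which surviving units are rescaled during training so that nothing changes at inference time. That convention, the function name `dropout`, and its parameters are assumptions made here, not details given in the article.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training and
    rescale the survivors by 1 / (1 - rate); pass through unchanged otherwise."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded so the mask is reproducible
layer_output = [1.0] * 1000

dropped = dropout(layer_output, rate=0.5, rng=rng)
unchanged = dropout(layer_output, rate=0.5, rng=rng, training=False)

print(f"units disabled: {dropped.count(0.0)} of {len(dropped)}")
```

Because a different subset of units is disabled on every training step, no unit can rely on the presence of specific others, which is the co-adaptation the table refers to.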

References

  1. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
  2. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
  3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.