Regularisation (Machine Learning)

Regularisation is a collection of techniques in machine learning that constrain models during training to reduce overfitting and improve generalisation to unseen data.

5 min read · Last updated May 2026 · Foundations

Regularisation in machine learning is the umbrella term for techniques that modify the training procedure to discourage models from fitting noise in the training data and instead encourage them to learn patterns that generalise to unseen inputs. Without regularisation, sufficiently expressive models — especially deep neural networks — tend to memorise their training set, achieving low training error while performing poorly on validation and production data. Regularisation is therefore one of the most important practical levers in supervised learning.

The bias–variance perspective

A common way to think about regularisation is through the bias–variance decomposition of generalisation error. Underfitting reflects high bias: the model is too inflexible to capture the underlying signal. Overfitting reflects high variance: the model fits quirks of its particular training sample, so its predictions change substantially when it is retrained on a different sample from the same distribution. Regularisation deliberately trades a little extra bias for lower variance, accepting a slightly worse training fit in exchange for a larger reduction in error on unseen data.
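For squared-error loss the decomposition can be written out explicitly. The notation here is introduced for illustration and does not come from the text above: the target is assumed to be y = f(x) + ε with zero-mean noise of variance σ², and \hat f_D is the model fitted on a training set D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  = \underbrace{\bigl(f(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

In this form, a regulariser is worthwhile when it reduces the variance term by more than it increases the squared bias.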

Classical explicit regularisation

Several methods modify the loss function directly.

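L2 regularisation adds a penalty proportional to the sum of squared weights; applied to linear regression it is known as ridge regression. It discourages large weights, producing smoother decision boundaries and more numerically stable optimisation. In deep learning it is often called weight decay, although the two coincide only for plain stochastic gradient descent. With adaptive optimisers such as Adam, decoupled weight decay (AdamW) shrinks the weights directly instead of adding the penalty to the gradient, and it is a default in most deep learning training recipes.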
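The relationship between an L2 penalty and weight decay can be sketched for a single plain SGD step. The function names and toy numbers below are hypothetical, and the sketch assumes the penalty is written as (lam / 2) times the sum of squared weights. Under plain SGD the two updates are algebraically identical; with an adaptive optimiser such as Adam they generally are not, which is the motivation for the decoupled form.

```python
def sgd_step_l2(w, grad, lr, lam):
    """One SGD step on loss + (lam / 2) * sum(w_i ** 2)."""
    # The penalty adds lam * w_i to each component of the gradient.
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]


def sgd_step_decoupled(w, grad, lr, decay):
    """One SGD step with the decay applied directly to the weights."""
    # Shrink each weight, then apply the unpenalised gradient.
    return [wi * (1.0 - lr * decay) - lr * gi for wi, gi in zip(w, grad)]


w = [1.0, -2.0, 0.5]        # toy weights
grad = [0.1, 0.0, -0.3]     # toy gradient of the data loss

step_l2 = sgd_step_l2(w, grad, lr=0.1, lam=0.01)
step_decoupled = sgd_step_decoupled(w, grad, lr=0.1, decay=0.01)
```

Both calls return the same weights up to floating-point rounding, and every weight is pulled slightly toward zero relative to an unregularised step.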

L1 regularisation (lasso) adds a penalty proportional to the sum of absolute weight values. Because the L1 penalty has a non-smooth point at zero, it tends to drive some weights exactly to zero, producing sparse models that are easier to interpret and cheaper to deploy.
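One way to see why L1 yields exact zeros is the soft-thresholding step used by proximal-gradient lasso solvers. This is a minimal sketch with hypothetical names and toy values, not the method of any particular library; the L2 line is the corresponding proximal step for a penalty of (t / 2) times the squared weight, shown for contrast.

```python
def soft_threshold(w, t):
    """Proximal step for the penalty t * |w|: move w toward zero by t, stopping at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


weights = [0.8, -0.05, 0.02, -1.5]   # toy weights
t = 0.1                              # penalty strength times step size (assumed)

shrunk_l1 = [soft_threshold(w, t) for w in weights]
shrunk_l2 = [w / (1.0 + t) for w in weights]
```

Weights whose magnitude is below the threshold become exactly zero under the L1 step, while the L2 step only rescales every weight and leaves all of them non-zero.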

Elastic net combines L1 and L2 penalties, balancing sparsity and stability. It is widely used in tabular machine learning and feature selection.

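A unified way to think about these methods is that they shrink parameters toward a simple default: zero for L1 and L2, or the mean of a prior in Bayesian formulations, where an L2 penalty corresponds to a Gaussian prior and an L1 penalty to a Laplace prior under maximum a posteriori estimation.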
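The three penalties can be written as instances of one penalised objective. The symbols are assumptions introduced here: L(w) is the unregularised training loss, λ ≥ 0 the regularisation strength, and α ∈ [0, 1] the elastic-net mixing weight. The scaling of each term varies between libraries, so this is one common parametrisation and not a canonical one.

```latex
\min_{w}\; L(w) + \lambda\,\Omega(w),
\qquad
\Omega_{\mathrm{L2}}(w) = \tfrac{1}{2}\lVert w \rVert_2^2,
\quad
\Omega_{\mathrm{L1}}(w) = \lVert w \rVert_1,
\quad
\Omega_{\mathrm{EN}}(w) = \alpha\,\lVert w \rVert_1 + \tfrac{1-\alpha}{2}\,\lVert w \rVert_2^2
```

Setting α = 1 recovers the L1 penalty and α = 0 recovers the L2 penalty.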

Stochastic and architectural regularisation

Other techniques operate on the model or the training procedure rather than on the loss.

Dropout randomly zeroes a fraction of activations during training, forcing redundant representations and approximating an ensemble of subnetworks. It remains a standard component of fully connected and recurrent layers.
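A minimal sketch of dropout on a list of activations follows. It uses the common "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; that rescaling is an assumption beyond the description above, and the function and variable names are hypothetical.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p while training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # At inference time the layer is the identity.
        return list(activations)
    keep = 1.0 - p
    # Survivors are scaled by 1 / keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)          # seeded for reproducibility
hidden = [1.0] * 10000          # toy activations
dropped = dropout(hidden, p=0.5, rng=rng)
unchanged = dropout(hidden, p=0.5, training=False)
```

With p = 0.5 roughly half of the entries are zeroed and the rest are doubled, so the mean activation stays close to its original value.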

Batch normalisation and layer normalisation stabilise activations and have a documented regularising side-effect, partly because batch statistics inject noise during training.

Early stopping halts training when validation loss stops improving, preventing the model from continuing to memorise the training data.
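Early stopping is usually implemented with a patience counter. The sketch below assumes the per-epoch validation losses are already available as a list; the function name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical. A real training loop would also checkpoint the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience window.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition.
    return len(val_losses) - 1, best_epoch


losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.65]   # toy validation curve
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

On this toy curve the best loss occurs at epoch 2 and training stops at epoch 5, after three epochs without improvement.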

Data augmentation synthesises additional training examples by applying label-preserving transformations: cropping, flipping, colour jitter, and rotation for vision; synonym substitution, back-translation, and span masking for text; pitch shifting and noise injection for audio. Augmentation is one of the most effective regularisers because it directly broadens the distribution of inputs the model sees during training.
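Two of the vision transformations can be sketched on an image stored as a list of pixel rows. This is illustrative only: the function names and the 4×4 toy image are hypothetical, and a real pipeline would normally use an image library instead of nested lists.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror every row with probability p; the label is unaffected."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def random_crop(image, size, rng):
    """Cut a size x size window at a uniformly random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)     # randint is inclusive at both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 "image"
augmented = random_crop(random_horizontal_flip(image, rng), 3, rng)
```

Each call produces a slightly different 3×3 view of the same image, so over many epochs the model rarely sees exactly the same input twice.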

Label smoothing replaces hard one-hot training targets with slightly softened distributions, discouraging the model from becoming overconfident.
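A minimal sketch of the usual formulation, which mixes the one-hot target with a uniform distribution over the classes. The function name and the `epsilon` parameter name are hypothetical.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon * uniform as a list of probabilities."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == target_index else uniform
        for k in range(num_classes)
    ]


soft = smooth_labels(target_index=2, num_classes=4, epsilon=0.1)
```

For four classes and epsilon = 0.1 the true class receives a probability of about 0.925 and each other class about 0.025, so the target still sums to one but never demands full certainty.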

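Mixup trains on convex combinations of pairs of inputs and their labels; cutmix pastes a patch from one image into another and mixes the two labels in proportion to the patch area; stochastic depth randomly skips whole residual layers during training. All three produce strong empirical gains on image classification benchmarks.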
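Mixup is the simplest of these to sketch: two examples and their one-hot labels are blended with a weight drawn from a Beta(alpha, alpha) distribution, as in Zhang et al. (2018). The function name, the default alpha = 0.2, and the toy vectors are assumptions made for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)
x_mixed, y_mixed, lam = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],    # toy example A with one-hot label for class 0
    [0.0, 4.0, 2.0], [0.0, 1.0],    # toy example B with one-hot label for class 1
    rng=rng,
)
```

The mixed label remains a valid probability distribution, with weight lam on class 0 and 1 - lam on class 1.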

Implicit regularisation

A line of research has highlighted that even without explicit regularisation, choices such as stochastic gradient descent, learning-rate schedules, weight initialisation, and architecture all bias optimisation toward solutions with desirable generalisation properties. The fact that overparameterised neural networks generalise at all is partly explained by this implicit regularisation.

Practical guidance

Production deep learning recipes typically combine several regularisers. A representative image classification recipe might use AdamW with weight decay 0.05, dropout 0.1 in the classification head, label smoothing 0.1, mixup and cutmix, random erasing, and a cosine learning-rate schedule with early stopping. Tabular boosted-tree models rely on limits on tree depth, row and column subsampling, and L2 regularisation of leaf weights. Large language model pre-training relies primarily on weight decay and the implicit regularisation of large, diverse data, with dropout used in some recipes.
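The representative image classification recipe could be captured in a configuration fragment along the following lines. The key names are hypothetical and do not correspond to any particular framework; the numeric values are the ones quoted above, and techniques for which no value is given are recorded only as enabled.

```python
# Hypothetical key names; values taken from the recipe described in the text.
image_recipe = {
    "optimizer": "AdamW",
    "weight_decay": 0.05,
    "head_dropout": 0.1,
    "label_smoothing": 0.1,
    "mixup": True,            # strength not specified in the text
    "cutmix": True,           # strength not specified in the text
    "random_erasing": True,   # probability not specified in the text
    "lr_schedule": "cosine",
    "early_stopping": True,
}
```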

| Technique | Modifies | Cost | Strength |
|---|---|---|---|
| L2 / weight decay | Loss | Trivial | Mild and broadly applicable |
| L1 / lasso | Loss | Trivial | Sparsity, interpretation |
| Dropout | Activations | Small | Strong on dense layers |
| Early stopping | Schedule | None | Always recommended |
| Data augmentation | Inputs | Moderate | Very strong on vision/audio |
| Label smoothing | Targets | Trivial | Calibration |
| Mixup / cutmix | Inputs | Small | Strong on classification |

References

  1. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B.
  2. Srivastava, N. et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR.
  3. Loshchilov, I. and Hutter, F. (2019). Decoupled Weight Decay Regularization (AdamW). ICLR.
  4. Zhang, H. et al. (2018). mixup: Beyond Empirical Risk Minimization. ICLR.
  5. Müller, R., Kornblith, S. and Hinton, G. (2019). When Does Label Smoothing Help? NeurIPS.