Dropout

A regularisation technique in deep learning that randomly deactivates neurons during training, preventing co-adaptation and improving generalisation. Introduced by Hinton and colleagues in 2012 and formalised in 2014.

Last updated May 2026 | Foundations

Dropout is a regularisation technique for deep neural networks in which, during training, each neuron is independently set to zero with probability p (the dropout rate) on every forward pass. The technique was introduced by Geoffrey Hinton and colleagues in 2012 and formalised in the 2014 paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov. By preventing neurons from co-adapting to specific patterns in the training data, dropout reduces overfitting, and it has become a standard tool in deep learning practice.

Mechanism

During training, dropout multiplies the activations of a layer by a random binary mask drawn from a Bernoulli distribution with parameter 1 minus p, where p is the dropout rate. Neurons whose mask value is zero are effectively removed from the network for that forward and backward pass, along with all of their incoming and outgoing connections. To preserve the expected magnitude of activations, the surviving activations are typically scaled by 1 divided by (1 minus p), a convention known as "inverted dropout". With this convention the full network is used at inference time without masking and without rescaling. The original formulation instead left the training activations unscaled and multiplied the weights by 1 minus p at test time; the two conventions yield the same expected activations.
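
The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any framework's implementation; the function name `inverted_dropout` and its parameters are assumptions made for this example.

```python
import random

def inverted_dropout(activations, p, training=True, rng=random):
    """Apply inverted dropout to a list of activations.

    p is the dropout rate, i.e. the probability of zeroing a unit.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("dropout rate p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        # Inference: no masking and no rescaling.
        return list(activations)
    keep_prob = 1.0 - p
    out = []
    for a in activations:
        # Bernoulli(1 - p) mask: keep the unit with probability keep_prob.
        keep = rng.random() < keep_prob
        # Survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
        out.append(a / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)
dropped = inverted_dropout([1.0] * 100_000, p=0.3, rng=rng)
mean_train = sum(dropped) / len(dropped)  # stays close to 1.0 on average
same_at_inference = inverted_dropout([0.5, -2.0], p=0.3, training=False)
```

Because each surviving value is divided by 1 minus p, the average over many units stays near the unmasked value, which is why no correction is needed at inference time.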

An equivalent view, articulated by the original authors, is that dropout trains an ensemble of exponentially many thinned sub-networks that share weights. At inference, the deterministic full network approximates the geometric mean of the ensemble's predictions.
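
For a single softmax layer with dropout on its inputs, the geometric-mean relationship is exact rather than approximate, and this can be checked numerically. The sketch below uses hypothetical toy weights, a dropout rate of 0.5 (so all masks are equally likely) and the original test-time scaling convention; in deeper networks the relationship holds only approximately.

```python
import itertools
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, b, x):
    return [sum(w * x_i for w, x_i in zip(row, x)) + b_k for row, b_k in zip(W, b)]

# Hypothetical toy values: 3 classes, 4 inputs.
W = [[0.5, -1.0, 0.3, 0.8],
     [-0.2, 0.4, 1.1, -0.6],
     [0.9, 0.1, -0.7, 0.2]]
b = [0.1, -0.3, 0.2]
x = [1.0, 2.0, -1.5, 0.5]

# Enumerate all 2^4 thinned networks (equally likely when p = 0.5)
# and average their log-probabilities.
masks = list(itertools.product([0, 1], repeat=len(x)))
log_gm = [0.0] * len(b)
for mask in masks:
    probs = softmax(linear(W, b, [m * x_i for m, x_i in zip(mask, x)]))
    for k, q in enumerate(probs):
        log_gm[k] += math.log(q) / len(masks)

# Normalised geometric mean of the ensemble's predictions.
gm = [math.exp(v) for v in log_gm]
ensemble = [g / sum(gm) for g in gm]

# One deterministic pass with inputs scaled by the keep probability 1 - p.
full_network = softmax(linear(W, b, [0.5 * x_i for x_i in x]))
```

Here `ensemble` and `full_network` agree up to floating-point error, because averaging the logits over all masks is the same as scaling the inputs by the keep probability.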

Motivation

Geoffrey Hinton has recounted the intuition behind dropout in terms of bank fraud prevention: bank tellers were periodically rotated between branches to prevent them from forming collusive relationships. Analogously, randomly removing different neurons on each training example prevents complex co-adaptations among hidden units, forcing each neuron to learn features that are useful in many different contexts.

Empirical impact

Dropout produced significant improvements on supervised learning tasks in vision, speech recognition and document classification in the early 2010s, and was a key ingredient in AlexNet's 2012 ImageNet result. The technique became standard in feed-forward networks and convolutional neural networks. In recurrent neural networks, naive dropout on hidden-to-hidden connections can harm sequence modelling, which motivated recurrent-specific schemes such as variational dropout (one mask shared across all time steps), recurrent dropout and DropConnect applied to the recurrent weights.

Dropout in transformers and modern architectures

Dropout remains common in transformer models, where it is typically applied inside the feed-forward sub-layers, to the attention weights, and to each sub-layer's output before it is added back to the residual stream. Very large transformers trained on internet-scale corpora often use lower dropout rates than smaller models, or none at all, because with so much training data overfitting is less of a concern. Related techniques operate at other granularities: LayerDrop and stochastic depth randomly skip entire layers during training, whereas attention dropout zeroes individual attention weights after the softmax rather than whole heads.
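
The difference between element-wise dropout on a residual branch and skipping a whole layer can be sketched as follows. The function names are hypothetical, and the training-time rescaling in the layer-skip variant is an assumption chosen for symmetry with inverted dropout; published stochastic-depth methods may instead rescale at inference time.

```python
import random

def residual_with_dropout(x, sublayer, p, rng, training=True):
    """x + Dropout(sublayer(x)): individual output elements are zeroed."""
    y = sublayer(x)
    if training and p > 0.0:
        keep = 1.0 - p
        y = [v / keep if rng.random() < keep else 0.0 for v in y]
    return [a + b for a, b in zip(x, y)]

def residual_with_layer_skip(x, sublayer, drop_prob, rng, training=True):
    """Stochastic-depth style: the whole sub-layer is skipped at random."""
    if training and rng.random() < drop_prob:
        return list(x)  # only the identity path survives
    y = sublayer(x)
    # Assumption: inverted scaling during training, so inference is unchanged.
    scale = 1.0 / (1.0 - drop_prob) if training else 1.0
    return [a + scale * b for a, b in zip(x, y)]

rng = random.Random(0)
double = lambda v: [2.0 * t for t in v]  # stand-in for a real sub-layer
elementwise = residual_with_dropout([1.0, 2.0], double, p=0.1, rng=rng)
whole_layer = residual_with_layer_skip([1.0, 2.0], double, drop_prob=0.5, rng=rng)
```

With `training=False`, both functions reduce to the plain residual update `x + sublayer(x)`.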

Monte Carlo dropout

Yarin Gal and Zoubin Ghahramani showed in 2016 that running multiple forward passes with dropout enabled at inference time and averaging the predictions can be interpreted as approximate Bayesian inference over the network's weights. This technique, known as Monte Carlo dropout, uses the spread of the sampled predictions as an estimate of predictive uncertainty. It is used in medical imaging, autonomous driving and other safety-critical applications because it avoids the cost of training a full Bayesian neural network, although the resulting uncertainty estimates are not guaranteed to be well calibrated.
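
A minimal sketch of the procedure, assuming a tiny one-hidden-layer ReLU network with fixed, hypothetical weights standing in for a trained model. The function names and the choice of 200 samples are illustrative only.

```python
import random
import statistics

def stochastic_forward(x, W1, w2, p, rng):
    """One forward pass of a tiny ReLU network with dropout left ON."""
    keep_prob = 1.0 - p
    out = 0.0
    for row, v in zip(W1, w2):
        h = max(0.0, sum(w * x_i for w, x_i in zip(row, x)))  # ReLU hidden unit
        if rng.random() < keep_prob:      # Bernoulli(1 - p) mask
            out += v * h / keep_prob      # inverted-dropout scaling
    return out

def mc_dropout_predict(x, W1, w2, p=0.5, n_samples=200, seed=0):
    """Return the mean and standard deviation over stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, W1, w2, p, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)

# Hypothetical fixed weights standing in for a trained model.
W1 = [[0.6, -0.4], [0.3, 0.8], [-0.5, 0.9], [1.0, 0.2]]
w2 = [0.7, -1.2, 0.5, 0.9]
mean, spread = mc_dropout_predict([1.0, 2.0], W1, w2)
```

The mean serves as the prediction and the standard deviation as a rough uncertainty signal; setting `p=0.0` recovers a single deterministic prediction with zero spread.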

Variants

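DropConnect generalises dropout by masking individual weights rather than entire neurons. Spatial dropout removes entire feature maps in convolutional layers, since neighbouring activations within a map are strongly correlated. Variational dropout, in its Bayesian formulation (distinct from the recurrent scheme of the same name), learns the dropout rate through a variational objective, potentially per parameter. Concrete dropout uses a continuous relaxation of the Bernoulli distribution so that the dropout rate can be optimised by gradient descent.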
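
The first two variants differ from standard dropout mainly in the granularity of the mask, which the following sketch tries to make concrete. All function names are hypothetical, and the uniform 1 / (1 minus p) rescaling is a simplifying assumption; the published DropConnect method, for instance, handles inference differently.

```python
import random

def bernoulli(keep_prob, rng):
    return 1.0 if rng.random() < keep_prob else 0.0

def dropout_layer(x, W, p, rng):
    """Standard dropout: one mask entry per output unit."""
    keep = 1.0 - p
    return [bernoulli(keep, rng) * sum(w * x_i for w, x_i in zip(row, x)) / keep
            for row in W]

def dropconnect_layer(x, W, p, rng):
    """DropConnect: one mask entry per individual weight."""
    keep = 1.0 - p
    return [sum(bernoulli(keep, rng) * w * x_i for w, x_i in zip(row, x)) / keep
            for row in W]

def spatial_dropout(feature_maps, p, rng):
    """Spatial dropout: one mask entry per channel, so whole maps vanish."""
    keep = 1.0 - p
    out = []
    for fmap in feature_maps:
        m = bernoulli(keep, rng) / keep
        out.append([[m * v for v in row] for row in fmap])
    return out

rng = random.Random(1)
W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
units_masked = dropout_layer(x, W, 0.5, rng)
weights_masked = dropconnect_layer(x, W, 0.5, rng)
channels_masked = spatial_dropout([[[1.0, 2.0], [3.0, 4.0]]] * 4, 0.5, rng)
```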

Limitations and complements

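Dropout typically slows convergence, because each gradient step updates a different randomly thinned sub-network and the resulting gradient estimates are noisier; the original authors report that a dropout network takes roughly two to three times longer to train than a standard network of the same architecture. Dropout also interacts non-trivially with batch normalisation: it changes the variance of activations between training and inference, which can conflict with the statistics that batch normalisation accumulates during training. Modern practice often favours combining a lower dropout rate with batch or layer normalisation, weight decay, label smoothing, mixup and strong data augmentation rather than relying on dropout alone.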

References

  1. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.
  2. Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2012). Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv:1207.0580.
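  3. Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML).
  4. Wan, L., Zeiler, M., Zhang, S., LeCun, Y. and Fergus, R. (2013). Regularization of Neural Networks using DropConnect. Proceedings of the 30th International Conference on Machine Learning (ICML).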