
Batch Normalisation

Batch normalisation is a deep learning technique that normalises the activations of each layer within a mini-batch to accelerate training and improve model stability.

5 min read | Last updated May 2026 | Foundations

Batch normalisation (BatchNorm or BN) is a layer-level technique that rescales the activations within a neural network so that, at each training step, the inputs to subsequent layers have approximately zero mean and unit variance across the mini-batch. Introduced by Sergey Ioffe and Christian Szegedy in 2015 while at Google, it became one of the most widely deployed innovations in deep learning and remains a default choice in modern convolutional networks.

Motivation

The original paper attributed slow training in deep networks to a phenomenon it called internal covariate shift: as the parameters of earlier layers update, the distribution of inputs feeding into later layers keeps changing. On this account, later layers must track a moving target, which slows convergence and pushes learning-rate selection toward small, conservative values. Batch normalisation was designed to reduce this shift by re-centring and re-scaling activations at every step.

The "internal covariate shift" framing has since been challenged by later analyses: Santurkar et al. (2018) argued that BatchNorm helps primarily by smoothing the optimisation landscape. The practical benefits are nonetheless robust: faster convergence, less sensitivity to weight initialisation, larger usable learning rates, and a mild regularising effect.

Computation

For a mini-batch of activations entering a BatchNorm layer, the operation proceeds in four steps. First, the per-channel mean is computed across the batch (and, for convolutional feature maps, across spatial positions as well). Second, the per-channel variance is computed over the same elements. Third, each activation is normalised by subtracting the mean and dividing by the square root of the variance plus a small constant epsilon for numerical stability. Fourth, the normalised activations are rescaled by a learned scale parameter gamma and shifted by a learned shift parameter beta, which restores the layer's representational capacity: with suitable gamma and beta the layer can undo the normalisation if that is what training favours.

A compact way of writing this transformation is: y = gamma * (x - mu) / sqrt(var + eps) + beta, where mu and var are the per-channel batch statistics and gamma and beta are learnable.
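The four steps and the compact formula can be sketched in plain Python. This is a minimal illustration rather than a framework implementation: the function name `batch_norm_train`, the list-of-rows layout (one row per example, one column per channel), and the default `eps=1e-5` are assumptions made for the example, and the variance is the biased (divide-by-n) estimate.

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalise a mini-batch given as a list of rows, one row per example.

    batch[i][c] is the activation of example i in channel c.
    gamma and beta are per-channel lists (the learned scale and shift).
    """
    n = len(batch)
    channels = len(batch[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        column = [row[c] for row in batch]
        # Step 1: per-channel mean across the batch.
        mu = sum(column) / n
        # Step 2: per-channel (biased) variance across the batch.
        var = sum((x - mu) ** 2 for x in column) / n
        # Step 3: normalise; eps guards against division by zero.
        inv_std = 1.0 / math.sqrt(var + eps)
        for i, x in enumerate(column):
            # Step 4: rescale by gamma and shift by beta.
            out[i][c] = gamma[c] * (x - mu) * inv_std + beta[c]
    return out

# Two channels on very different scales end up on the same scale.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With gamma set to 1 and beta set to 0, each output channel has mean close to zero and variance close to one; other values of gamma and beta move the channel to standard deviation gamma and mean beta.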

During inference the per-batch statistics are replaced by running averages of the mean and variance accumulated during training (typically exponential moving averages), so that predictions are deterministic and independent of the rest of the batch.

Behaviour in training and inference

A subtlety of BatchNorm is that it behaves differently in training and inference. During training, normalisation depends on the composition of the batch and acts as a stochastic regulariser. During inference, the layer applies fixed statistics and parameters, so each example is transformed independently of the others. This duality can cause subtle bugs when batches are unusually small, when distributed training synchronises statistics incorrectly, or when fine-tuning on a small downstream dataset.
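The difference between the two modes can be made concrete with a small single-channel sketch. The class `SimpleBatchNorm`, its `training` flag, and the `momentum=0.1` update rule are hypothetical choices for illustration; real frameworks differ in details such as the momentum convention and whether the running variance uses the unbiased estimate.

```python
import math

class SimpleBatchNorm:
    """Hypothetical single-channel BatchNorm with running statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0        # learnable in a real layer
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            # Training: use the statistics of the current batch.
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs)
            # Update exponential moving averages for later use at inference.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the fixed running statistics.
            mu, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mu) + self.beta for x in xs]

bn = SimpleBatchNorm()

# Training: the output for x = 1.0 depends on its batch companions.
train_a = bn([1.0, 2.0, 3.0])[0]
train_b = bn([1.0, 2.0, 9.0])[0]

# Inference: the output for x = 1.0 no longer depends on the batch.
bn.training = False
eval_a = bn([1.0, 2.0, 3.0])[0]
eval_b = bn([1.0, 2.0, 9.0])[0]
```

Here `train_a` and `train_b` differ even though the first input is the same, while `eval_a` and `eval_b` are identical. Leaving a layer like this in training mode at prediction time is one way the behavioural difference can surface as a bug.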

Variants

Several normalisation strategies have emerged for settings where BatchNorm performs poorly.

Layer normalisation (Ba, Kiros, Hinton, 2016) normalises across the feature dimension within each example, removing dependence on batch size. It is the standard choice in transformers and recurrent networks.

Group normalisation (Wu and He, 2018) splits channels into groups and normalises within each group, performing well at small batch sizes used in object detection and segmentation.

Instance normalisation (Ulyanov et al., 2017) normalises per example and per channel, popular in style-transfer networks.

Weight normalisation and spectral normalisation instead constrain weights rather than activations, with applications in generative adversarial networks.

A comparison of the most common variants is shown below.

| Variant | Normalises across | Depends on batch size | Common use |
|---|---|---|---|
| Batch norm | Batch + spatial | Yes | CNNs |
| Layer norm | Features within example | No | Transformers, RNNs |
| Group norm | Feature groups within example | No | Detection, segmentation |
| Instance norm | Per channel within example | No | Style transfer |
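The variants compared above differ mainly in which activations share a mean and variance. One way to see this is a single helper that normalises every set of activations sharing a grouping key, so that each variant becomes a different key. The helper `normalise_by`, the dictionary layout indexed by `(n, c, h, w)`, and the tensor sizes are assumptions for this sketch; the learned scale and shift are omitted.

```python
import math
from collections import defaultdict

def normalise_by(x, key, eps=1e-5):
    """Normalise entries of x that share the same key(n, c, h, w).

    x maps an index tuple (n, c, h, w) to an activation value.
    """
    groups = defaultdict(list)
    for idx in x:
        groups[key(*idx)].append(idx)
    out = {}
    for members in groups.values():
        vals = [x[i] for i in members]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        for i in members:
            out[i] = (x[i] - mu) / math.sqrt(var + eps)
    return out

# Toy activations: N examples, C channels, H x W spatial positions.
N, C, H, W = 2, 4, 2, 2
G = 2  # number of channel groups for group norm (assumed)
x = {(n, c, h, w): float(n * 7 + c * 3 + h * 2 + w) ** 1.5
     for n in range(N) for c in range(C) for h in range(H) for w in range(W)}

# Batch norm: one set of statistics per channel, shared across the batch.
batch_norm = normalise_by(x, lambda n, c, h, w: c)
# Layer norm: one set of statistics per example, over all its features.
layer_norm = normalise_by(x, lambda n, c, h, w: n)
# Instance norm: one set of statistics per example and channel.
instance_norm = normalise_by(x, lambda n, c, h, w: (n, c))
# Group norm: one set of statistics per example and channel group.
group_norm = normalise_by(x, lambda n, c, h, w: (n, c // (C // G)))
```

Only the batch norm key omits the example index `n`, which is why it is the only variant whose output for one example changes when the other examples in the batch change.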

When BatchNorm helps and when it does not

BatchNorm is most effective with moderate-to-large batch sizes (typically 16 or more per device) and image-like inputs. It is less suitable for sequence models, small-batch training, online learning, or scenarios where the batch composition changes adversarially. Modern architectures such as transformer-based language models, vision transformers, and diffusion models therefore tend to favour layer or group normalisation.
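The sensitivity to batch size can be illustrated with a small simulation of how noisy the batch mean is at different batch sizes. The function `batch_mean_spread`, the standard normal activations, the trial count, and the batch sizes shown are all assumptions chosen to keep the sketch simple; it measures only the noise in the estimated mean, not model accuracy.

```python
import random
import statistics

def batch_mean_spread(batch_size, trials=2000, seed=0):
    """Standard deviation of the batch mean over many random batches.

    Activations are drawn from a standard normal distribution, an
    assumption made only to keep the illustration simple.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

# The spread of the batch mean shrinks as the batch grows.
for b in (2, 16, 64):
    print(b, round(batch_mean_spread(b), 3))
```

For independent samples the spread falls roughly as one over the square root of the batch size, so the statistics used for normalisation at a batch size of 2 are far noisier than at 16 or 64. The same effect applies to the batch variance.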

Regularising effect

Because batch statistics inject noise into each forward pass, BatchNorm provides a regularising effect comparable in some settings to dropout. Many production CNNs use both; others rely on BatchNorm alone. The interaction is empirical and depends on architecture, dataset, and training regime.

References

  1. Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML.
  2. Santurkar, S. et al. (2018). How Does Batch Normalization Help Optimization? NeurIPS.
  3. Ba, J., Kiros, J., Hinton, G. (2016). Layer Normalization. arXiv:1607.06450.
  4. Wu, Y. and He, K. (2018). Group Normalization. ECCV.
  5. Ulyanov, D. et al. (2017). Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022.