Layer Normalisation

Layer normalisation is a technique that normalises the inputs across the features of a single training example, stabilising and accelerating the training of deep neural networks, especially transformers.

4 min read · Last updated June 2026 · Foundations

Layer normalisation is a method for stabilising the training of deep neural networks by normalising the summed inputs to the neurons within a layer for each individual training example. Introduced in 2016 by Jimmy Ba, Jamie Ryan Kiros and Geoffrey Hinton, it rescales activations so that, across the features of one example, they have zero mean and unit variance before a learned scale and shift are applied. The normalised value for an input x is computed as (x - mu) / sqrt(sigma^2 + eps), where mu and sigma are the mean and standard deviation taken over the layer's features (so sigma^2 is their variance) and eps is a small constant for numerical stability.
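The formula above can be illustrated with a minimal sketch in plain Python. The function name `layer_norm`, the `gain` and `bias` parameters (standing in for the learned scale and shift) and the default `eps` of 1e-5 are assumptions made for this example, not part of any particular library.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalise one example's features, then apply a scale and shift."""
    n = len(x)
    mu = sum(x) / n                              # mean over the features
    var = sum((v - mu) ** 2 for v in x) / n      # variance (sigma^2) over the features
    inv_std = 1.0 / math.sqrt(var + eps)         # eps guards against division by zero
    # Assumed defaults: identity scale and zero shift when none are supplied.
    gain = gain if gain is not None else [1.0] * n
    bias = bias if bias is not None else [0.0] * n
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gain, bias)]

features = [1.0, 2.0, 3.0, 4.0]
normalised = layer_norm(features)
```

With the default gain and bias, `normalised` has a mean of approximately zero and a variance just under one (slightly less than one because of `eps`).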

Motivation

As signals pass through the many layers of a deep network, the distribution of activations can shift and vary widely during training, a phenomenon that slows convergence and can destabilise optimisation. Normalising activations keeps them in a well-behaved range, which permits higher learning rates, smooths the optimisation landscape and reduces sensitivity to initialisation. After normalising, a learned gain and bias restore the network's ability to represent any required scale, so no representational power is lost.

Difference from batch normalisation

Layer normalisation is closely related to the earlier batch normalisation but differs in the axis over which statistics are computed. Batch normalisation normalises each feature across the examples in a mini-batch, making its behaviour dependent on batch size and on the distinction between training and inference. Layer normalisation instead normalises across the features within a single example and is therefore independent of batch size and identical at training and inference time.

This independence is decisive for sequence models and for settings with small or variable batch sizes. Recurrent networks and transformers process inputs of differing lengths and often train with modest batches, conditions under which batch normalisation performs poorly. Layer normalisation sidesteps these problems, which is why it became the standard choice for these architectures.

| Property | Layer normalisation | Batch normalisation |
| --- | --- | --- |
| Normalises across | Features of one example | A feature across the batch |
| Depends on batch size | No | Yes |
| Train vs inference | Identical | Differs |
| Typical use | Transformers, RNNs | Convolutional networks |
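The difference in normalisation axis can be sketched with a small batch stored as a list of rows, one row per example. This is a simplified illustration: the helper names are hypothetical, the batch-normalisation version uses training-mode batch statistics only, and the learned scale and shift and the running averages used at inference are left out.

```python
import math

def _standardise(values, eps=1e-5):
    """Shift and scale a list of numbers to roughly zero mean and unit variance."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def layer_norm_rows(batch):
    """Layer normalisation: each example (row) is normalised over its own features."""
    return [_standardise(row) for row in batch]

def batch_norm_columns(batch):
    """Batch normalisation (training statistics only): each feature (column)
    is normalised over the examples in the batch."""
    columns = [_standardise(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [[1.0, 2.0, 3.0],
         [2.0, 4.0, 6.0],
         [0.0, 5.0, 10.0]]

# Layer norm of the first example does not depend on the rest of the batch.
ln_alone = layer_norm_rows([batch[0]])[0]
ln_in_batch = layer_norm_rows(batch)[0]

# Batch norm of the first example changes when the batch contents change.
bn_full = batch_norm_columns(batch)[0]
bn_pair = batch_norm_columns(batch[:2])[0]
```

Here `ln_alone` and `ln_in_batch` are the same, whereas `bn_full` and `bn_pair` differ because the per-feature statistics depend on which examples share the batch.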

Role in transformers

Layer normalisation is an integral part of the transformer architecture that underlies modern large language models. Each transformer block applies it around the attention and feed-forward sub-layers, combined with residual connections, to keep training stable as networks scale to many layers and billions of parameters. Two arrangements are common: post-norm, where normalisation follows the sub-layer, and pre-norm, where it precedes the sub-layer, with pre-norm generally giving more stable training of very deep models. Variants such as RMSNorm, which normalises using only the root mean square and omits the mean subtraction, are now widely used in large models for efficiency. The continued reliance on layer normalisation and its descendants reflects how essential normalisation is to training the largest contemporary neural networks.
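The two arrangements and the RMSNorm variant can be sketched as follows. This is a toy illustration under stated assumptions: `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward network, the learned gains are omitted, and placing `eps` inside the square root in `rms_norm` is one common choice rather than the only one.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; the mean is not subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def post_norm_step(x, sublayer, norm=layer_norm):
    """Post-norm: add the residual first, then normalise the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer, norm=layer_norm):
    """Pre-norm: normalise the sub-layer input; the residual path stays unnormalised."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sub-layer.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_norm_step(x, toy_sublayer)
pre_out = pre_norm_step(x, toy_sublayer)
pre_out_rms = pre_norm_step(x, toy_sublayer, norm=rms_norm)
```

In the pre-norm step the input `x` reaches the output unchanged through the residual path, which is one commonly cited reason for its stability in very deep stacks. Swapping `norm=rms_norm` into either step shows how RMSNorm acts as a drop-in replacement in this sketch.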

References

  1. Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.
  2. Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
  3. Zhang, B. and Sennrich, R. (2019). Root Mean Square Layer Normalization. Advances in Neural Information Processing Systems.