Backpropagation

Backpropagation is the primary algorithm for training neural networks, computing gradients of a loss function with respect to each weight by applying the chain rule of calculus in reverse through the network layers.

6 min read · Last updated May 2026 · Foundations

Backpropagation — formally, the backpropagation of errors algorithm — is the foundational method by which artificial neural networks learn from data. By efficiently computing how each parameter in a network contributes to prediction error, it enables optimisation algorithms such as gradient descent to systematically reduce that error through iterative weight updates. Virtually every deep learning system in production today, from image classifiers to large language models, is trained using some variant of this algorithm.

Historical Background

The algorithm was popularised in 1986 when David Rumelhart, Geoffrey Hinton, and Ronald Williams published their landmark paper in Nature, demonstrating that backpropagation could train multi-layer networks on tasks previously considered intractable for single-layer perceptrons.[^1] Although the mathematical foundations — the chain rule of calculus and automatic differentiation — had been understood for decades, the 1986 paper established a clear, computationally practical procedure that drove the first wave of neural network research.

How Backpropagation Works

Training a neural network requires minimising a loss function that measures the discrepancy between the network's predictions and the true labels. Backpropagation supplies the gradients for this minimisation in two sequential phases, a forward pass and a backward pass, executed for each batch of training examples and followed by a weight update.

Forward Pass

During the forward pass, input data flows through the network layer by layer. Each neuron computes a weighted sum of its inputs, adds a bias term, and applies a non-linear activation function (such as ReLU, sigmoid, or tanh) to produce an output. This process continues until the final layer produces a prediction, at which point the loss function evaluates the difference between the prediction and the ground truth.
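The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a tiny fully connected network; the helper names (`dense`, `forward`), the weights, and the squared-error loss are hypothetical choices made for the example, not taken from any library.

```python
import math

def relu(z):
    """ReLU activation: passes positive values, zeroes out negatives."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of the inputs plus a bias, then the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def forward(x, layers):
    """Feed the input through each (weights, biases, activation) layer in order."""
    activations = x
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], relu),
    ([[0.3, -0.6]], [0.05], sigmoid),
]

prediction = forward([1.0, 2.0], layers)[0]
target = 1.0
loss = (prediction - target) ** 2  # squared-error loss for a single example
```

The final line is the point at which the forward pass ends: the loss is a single number summarising how far the prediction is from the target, and it is the quantity the backward pass differentiates.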

Backward Pass

The backward pass computes the gradient of the loss with respect to every trainable parameter in the network. Starting from the loss at the output layer, the algorithm applies the chain rule to propagate error signals backwards through successive layers. At each layer it calculates two quantities: (1) how much the layer's output contributed to the overall loss, and (2) how the layer's weights influenced that output. Multiplying the two, as the chain rule prescribes, gives the gradient of the loss with respect to each weight.
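Written out for a fully connected network, the backward pass takes the following form. The notation is an assumption made for this sketch: the network has N layers, layer l computes a pre-activation z^(l) and an activation a^(l) using an elementwise activation function σ, 𝓛 is the loss, and δ^(l) denotes the derivative of the loss with respect to z^(l).

```latex
\begin{aligned}
z^{(l)} &= W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\bigl(z^{(l)}\bigr) \\
\delta^{(N)} &= \nabla_{a^{(N)}} \mathcal{L} \odot \sigma'\bigl(z^{(N)}\bigr) \\
\delta^{(l)} &= \bigl( (W^{(l+1)})^{\top} \delta^{(l+1)} \bigr) \odot \sigma'\bigl(z^{(l)}\bigr) \\
\frac{\partial \mathcal{L}}{\partial W^{(l)}} &= \delta^{(l)} \bigl(a^{(l-1)}\bigr)^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}
\end{aligned}
```

The second line is the error signal at the output layer, the third propagates it one layer backwards, and the fourth converts each layer's error signal into gradients for that layer's weights and biases.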

The chain rule makes this computation tractable. For a composed function such as a deep network, where the output of one layer is the input to the next, the overall gradient is the product of the local gradients (Jacobians) of the individual layers. Because each layer's error signal is computed once and then reused by the layer before it, the backward pass costs roughly the same order of work as the forward pass. Modern deep learning frameworks such as PyTorch and TensorFlow implement this through automatic differentiation engines that construct a computational graph during the forward pass and traverse it in reverse during backpropagation.
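To make the chain-rule bookkeeping concrete, the sketch below hand-codes the backward pass for a deliberately tiny network and checks it against finite differences. The architecture (one scalar tanh hidden unit, a linear output, squared-error loss) and all function names are assumptions chosen for illustration; real frameworks derive the same gradients automatically.

```python
import math

def forward(params, x, target):
    """Forward pass; returns the loss plus the intermediates the backward pass reuses."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y = w2 * h + b2
    loss = (y - target) ** 2
    return loss, (h, y)

def backward(params, x, target):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (h, y) = forward(params, x, target)
    dloss_dy = 2.0 * (y - target)        # derivative of the squared error
    dloss_dw2 = dloss_dy * h             # because y = w2 * h + b2
    dloss_db2 = dloss_dy
    dloss_dh = dloss_dy * w2             # error signal passed to the hidden layer
    dloss_dz = dloss_dh * (1.0 - h * h)  # tanh'(z) = 1 - tanh(z)^2
    dloss_dw1 = dloss_dz * x             # because z = w1 * x + b1
    dloss_db1 = dloss_dz
    return [dloss_dw1, dloss_db1, dloss_dw2, dloss_db2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        diff = forward(plus, x, target)[0] - forward(minus, x, target)[0]
        grads.append(diff / (2.0 * eps))
    return grads

params = [0.7, -0.1, 1.3, 0.2]  # hypothetical values for w1, b1, w2, b2
analytic = backward(params, x=0.5, target=1.0)
numeric = numerical_gradient(params, x=0.5, target=1.0)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Note that `dloss_dy` is computed once and reused for every downstream gradient. The finite-difference check needs two extra forward passes per parameter, which is why it is useful for testing but impractical for training.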

Weight Update

Once gradients are computed, an optimisation algorithm (typically a variant of gradient descent) updates each weight in the direction that reduces the loss. The learning rate hyperparameter controls the step size of each update.
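As a minimal sketch of the update rule, assuming plain gradient descent with no momentum or adaptive scaling: the toy loss, the function names, and the learning rate of 0.1 are illustrative choices rather than recommendations.

```python
def sgd_step(params, grads, learning_rate):
    """Plain gradient descent: move each parameter against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

def loss_and_grad(params):
    """Toy loss (w - 3)^2 + (b + 1)^2 together with its exact gradient."""
    w, b = params
    loss = (w - 3.0) ** 2 + (b + 1.0) ** 2
    return loss, [2.0 * (w - 3.0), 2.0 * (b + 1.0)]

params = [0.0, 0.0]
initial_loss, _ = loss_and_grad(params)

for _ in range(50):
    _, grads = loss_and_grad(params)
    params = sgd_step(params, grads, learning_rate=0.1)

final_loss, _ = loss_and_grad(params)
```

On this loss each step shrinks the distance to the minimum at (3, -1) by a constant factor, so `final_loss` ends up far below `initial_loss`. A learning rate that is too large would overshoot the minimum instead of approaching it.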

Vanishing and Exploding Gradients

A well-known challenge in deep networks is gradient instability. When gradients are repeatedly multiplied by factors smaller than one through many layers, they shrink exponentially (the vanishing gradient problem), causing early layers to train extremely slowly. Conversely, repeated multiplication by factors larger than one causes exploding gradients, leading to numerical instability.
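The exponential effect is easy to reproduce numerically. The sketch below assumes a simplified chain of 20 layers in which every layer contributes the same local gradient; the depth and the factor 1.5 are arbitrary values picked for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at 0.25 when z == 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def chained_gradient(local_gradients):
    """Product of per-layer local gradients, as the chain rule prescribes."""
    product = 1.0
    for g in local_gradients:
        product *= g
    return product

depth = 20

# Best case for sigmoid: every pre-activation is 0 and every weight is 1,
# so each layer contributes its maximum possible local gradient of 0.25.
vanishing = chained_gradient([sigmoid_grad(0.0)] * depth)

# Local gradients only slightly above 1 grow just as quickly in the other direction.
exploding = chained_gradient([1.5] * depth)
```

Even in the sigmoid's best case the signal reaching the first layer is smaller than one part in a trillion of the signal at the output, while a per-layer factor of 1.5 amplifies it more than a thousandfold.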

Practitioners address these problems through several techniques. Activation functions such as ReLU largely mitigate vanishing gradients compared to the sigmoid function. Careful weight initialisation schemes (e.g., Xavier or He initialisation) control the variance of activations across layers. Gradient clipping caps the magnitude of gradients to prevent explosion. Architectural innovations such as residual connections (used in ResNets) and layer normalisation also help stabilise training in very deep networks.[^2]
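Of these techniques, gradient clipping is the simplest to show in isolation. The sketch below assumes clipping by global L2 norm, which is one common variant (clipping each value independently is another); the function name and the threshold of 5.0 are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough: leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads]

# Norm is 50, so every component is scaled by 0.1.
clipped = clip_by_global_norm([30.0, -40.0], max_norm=5.0)

# Norm is 0.5, below the threshold, so nothing changes.
unchanged = clip_by_global_norm([0.3, -0.4], max_norm=5.0)
```

Scaling all components by the same factor limits the step size while preserving the direction of the gradient, which is why norm-based clipping is often preferred when the direction is trusted but its magnitude is not.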

Computational Efficiency

Modern implementations compute gradients for an entire mini-batch of samples simultaneously using vectorised matrix operations on GPUs, making backpropagation highly parallelisable. Frameworks like PyTorch and JAX extend the approach with higher-order differentiation and just-in-time compilation, enabling research into second-order optimisation methods that use curvature information beyond the simple gradient.

Role in Modern AI

Backpropagation is not merely a training trick; it is the mechanism by which gradient-based learning systems (convolutional neural networks, recurrent neural networks, transformers, diffusion models) acquire their capabilities. The algorithm's scalability has proved remarkable: networks with billions of parameters, trained on trillions of tokens, still rely on the same mathematical principle popularised in 1986, augmented by engineering advances in hardware and software.

| Technique | Problem Addressed |
|-----------|-------------------|
| ReLU activation | Vanishing gradients in deep networks |
| Residual connections | Gradient flow in very deep networks |
| Gradient clipping | Exploding gradients in RNNs |
| Batch normalisation | Internal covariate shift |
| He / Xavier init | Activation variance instability |

References

  1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
  2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
  3. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.