What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Backpropagation

Backpropagation is the primary algorithm for training neural networks, computing gradients of a loss function with respect to each weight by applying the chain rule of calculus in reverse through the network layers.

6 min read · Last updated May 2026 · Foundations

Backpropagation — formally, the backpropagation of errors algorithm — is the foundational method by which artificial neural networks learn from data. By efficiently computing how each parameter in a network contributes to prediction error, it enables optimisation algorithms such as gradient descent to systematically reduce that error through iterative weight updates. Virtually every deep learning system in production today, from image classifiers to large language models, is trained using some variant of this algorithm.

Historical Background

The algorithm was popularised in 1986 when David Rumelhart, Geoffrey Hinton, and Ronald Williams published their landmark paper in Nature, demonstrating that backpropagation could train multi-layer networks on tasks previously considered intractable for single-layer perceptrons.[^1] Although the mathematical foundations — the chain rule of calculus and automatic differentiation — had been understood for decades, the 1986 paper established a clear, computationally practical procedure that drove the first wave of neural network research.

How Backpropagation Works

Training a neural network requires minimising a loss function that measures the discrepancy between the network's predictions and the true labels. Backpropagation breaks this minimisation into two sequential phases executed for each batch of training examples.

Forward Pass

During the forward pass, input data flows through the network layer by layer. Each neuron computes a weighted sum of its inputs, adds a bias term, and applies a non-linear activation function (such as ReLU, sigmoid, or tanh) to produce an output. This process continues until the final layer produces a prediction, at which point the loss function evaluates the difference between the prediction and the ground truth.
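As a rough illustration of the forward pass described above, the sketch below pushes one input through a tiny two-layer network in plain Python. The layer sizes, the weight values, the helper names (`dense`, `forward`, `squared_error`) and the choice of a squared-error loss are all assumptions made for this example, not part of any particular framework.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for every neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

# Hypothetical 2-input -> 2-hidden -> 1-output network with made-up weights.
W1 = [[0.5, -0.2], [0.3, 0.8]]
b1 = [0.1, -0.1]
W2 = [[1.0, -1.5]]
b2 = [0.2]

def forward(x):
    hidden = dense(x, W1, b1, relu)           # hidden layer with ReLU
    return dense(hidden, W2, b2, sigmoid)[0]  # single sigmoid output

def squared_error(prediction, target):
    return 0.5 * (prediction - target) ** 2

x, y_true = [1.0, 2.0], 1.0
y_pred = forward(x)
loss = squared_error(y_pred, y_true)
print(f"prediction={y_pred:.4f} loss={loss:.4f}")
```

Real networks replace the Python loops with matrix multiplications, but the order of operations (weighted sum, bias, activation, loss) is the same.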

Backward Pass

The backward pass computes the gradient of the loss with respect to every trainable parameter in the network. Starting from the loss at the output layer, the algorithm applies the chain rule to propagate error signals backwards through successive layers. At each layer, it calculates (1) how much the layer's output contributed to the overall loss and (2) how the layer's weights influenced that output. Multiplying these two quantities gives the gradient of the loss with respect to each weight in that layer.
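One common way to write this down is sketched here. The notation is an assumption of this example rather than something defined in the article: layer $l$ has weights $W^{(l)}$, biases $b^{(l)}$, pre-activations $z^{(l)}$, activations $a^{(l)}$ produced by an element-wise activation $\sigma$, and an error signal $\delta^{(l)} = \partial \mathcal{L} / \partial z^{(l)}$; $L$ denotes the output layer and $\odot$ the element-wise product.

```latex
\begin{aligned}
z^{(l)} &= W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right) \\
\delta^{(L)} &= \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'\!\left(z^{(L)}\right) \\
\delta^{(l)} &= \left( \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \right) \odot \sigma'\!\left(z^{(l)}\right) \\
\frac{\partial \mathcal{L}}{\partial W^{(l)}} &= \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}
\end{aligned}
```

The third line is the "propagation" step: the error signal of layer $l$ is obtained from that of layer $l+1$, so each layer's gradient reuses work already done for the layers after it.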

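The chain rule makes this computation tractable. For a composed function such as a deep network, where the output of one layer is the input to the next, the gradient with respect to an early layer is the product of the local derivatives (Jacobians) of every layer between it and the loss. Because the backward pass reuses the intermediate products already computed for later layers, all gradients are obtained at a cost of the same order as a forward pass. Modern deep learning frameworks such as PyTorch and TensorFlow implement this through automatic differentiation engines that construct a computational graph during the forward pass and traverse it in reverse during backpropagation.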
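The chain rule can be traced by hand on a very small network. The sketch below assumes a scalar model (one tanh hidden unit followed by a linear output) with a squared-error loss, and compares the hand-derived gradients against a finite-difference estimate. All names and parameter values are hypothetical and chosen only for illustration; frameworks automate exactly this bookkeeping.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1        # hidden pre-activation
    h = math.tanh(z1)       # hidden activation
    y = w2 * h + b2         # linear output
    return z1, h, y

def loss_fn(params, x, target):
    _, _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backward(params, x, target):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    z1, h, y = forward(params, x)
    dL_dy = y - target                 # d(loss)/d(output)
    dL_dw2 = dL_dy * h                 # output-layer weight
    dL_db2 = dL_dy                     # output-layer bias
    dL_dh = dL_dy * w2                 # error signal sent to the hidden unit
    dL_dz1 = dL_dh * (1.0 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dL_dw1 = dL_dz1 * x                # hidden-layer weight
    dL_db1 = dL_dz1                    # hidden-layer bias
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss_fn(up, x, target) - loss_fn(down, x, target)) / (2 * eps))
    return grads

params = [0.7, -0.3, 1.2, 0.1]  # made-up values for w1, b1, w2, b2
x, target = 0.5, 1.0
analytic = backward(params, x, target)
numeric = numerical_gradient(params, x, target)
for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f} numeric={n:+.6f}")
```

Comparing analytic and numerical gradients in this way is a common debugging check when implementing backpropagation from scratch.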

Weight Update

Once gradients are computed, an optimisation algorithm (typically a variant of gradient descent) updates each weight in the direction that reduces the loss. The learning rate hyperparameter controls the step size of each update.
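A minimal sketch of the update rule, assuming plain (full-batch) gradient descent on a one-weight, one-bias linear model with a mean squared-error loss. The function name `train`, the toy data, and the default learning rate and step count are assumptions made for the example.

```python
def train(data, learning_rate=0.1, steps=500):
    """Fit prediction = w * x + b by repeatedly stepping against the gradient."""
    w, b = 0.0, 0.0
    history = []
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = total_loss = 0.0
        for x, target in data:
            error = (w * x + b) - target
            total_loss += 0.5 * error ** 2
            grad_w += error * x   # d(loss)/dw for this sample
            grad_b += error       # d(loss)/db for this sample
        history.append(total_loss / n)
        # The learning rate scales the size of every step.
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b, history

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # toy samples of target = 2x + 1
w, b, history = train(data)
print(f"w={w:.3f} b={b:.3f} first_loss={history[0]:.4f} last_loss={history[-1]:.2e}")
```

A learning rate that is too large makes the updates overshoot and the loss diverge, while one that is too small makes convergence needlessly slow; optimisers such as momentum-based or adaptive methods modify this basic step but consume the same gradients.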

Vanishing and Exploding Gradients

A well-known challenge in deep networks is gradient instability. When gradients are repeatedly multiplied by factors smaller than one in magnitude across many layers, they shrink exponentially with depth (the vanishing gradient problem), causing early layers to train extremely slowly. Conversely, repeated multiplication by factors larger than one causes exploding gradients, leading to numerical instability.
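The exponential effect is easy to see numerically. The sketch below assumes a deliberately simplified chain of one-neuron sigmoid layers in which every layer has the same weight and the same pre-activation, so the end-to-end gradient is just one local factor raised to the power of the depth. The function name `chain_gradient` and the weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never exceeds 0.25

def chain_gradient(depth, weight, pre_activation=0.0):
    """Magnitude of d(output)/d(input) through `depth` identical sigmoid layers.

    Simplification: each layer contributes the same local factor
    weight * sigmoid'(pre_activation), so the product is that factor ** depth.
    """
    local = weight * sigmoid_grad(pre_activation)
    return abs(local) ** depth

for depth in (1, 5, 10, 20):
    small = chain_gradient(depth, weight=1.0)  # local factor 0.25: vanishes
    large = chain_gradient(depth, weight=8.0)  # local factor 2.0: explodes
    print(f"depth={depth:2d} vanishing={small:.3e} exploding={large:.3e}")
```

In a real network the local factors differ from layer to layer, but the same multiplicative structure is what drives gradients towards zero or towards very large values as depth grows.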

Practitioners address these problems through several techniques. Activation functions such as ReLU largely mitigate vanishing gradients compared to the sigmoid function. Careful weight initialisation schemes (e.g., Xavier or He initialisation) control the variance of activations across layers. Gradient clipping caps the magnitude of gradients to prevent explosion. Architectural innovations such as residual connections (used in ResNets) and layer normalisation also help stabilise training in very deep networks.[^2]
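Of the techniques listed above, gradient clipping is the simplest to show in isolation. The sketch below assumes clipping by global L2 norm over a flat list of gradient values; the function name `clip_by_global_norm` and the example numbers are made up for illustration, and deep learning frameworks ship their own utilities for this.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so their L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the original norm.
    The direction of the gradient vector is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

grads = [3.0, -4.0, 12.0]  # L2 norm is 13
clipped, original_norm = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(f"original norm={original_norm:.1f} clipped norm={clipped_norm:.1f}")
```

Gradients whose norm is already below the threshold pass through unchanged, so clipping only intervenes on the occasional oversized step.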

Computational Efficiency

Modern implementations compute gradients for an entire mini-batch of samples simultaneously using vectorised matrix operations on GPUs, making backpropagation highly parallelisable. Frameworks like PyTorch and JAX extend the approach with higher-order differentiation and just-in-time compilation, enabling research into second-order optimisation methods that use curvature information beyond the simple gradient.

Role in Modern AI

Backpropagation is not merely a training trick; it is the mechanism by which every gradient-based learning system (convolutional neural networks, recurrent neural networks, transformers, diffusion models) acquires its capabilities. The algorithm's scalability has proved remarkable: networks with billions of parameters, trained on trillions of tokens, still rely on the same mathematical principle popularised in 1986, augmented by engineering advances in hardware and software.

| Technique | Problem Addressed |
|-----------|-------------------|
| ReLU activation | Vanishing gradients in deep networks |
| Residual connections | Gradient flow in very deep networks |
| Gradient clipping | Exploding gradients in RNNs |
| Batch normalisation | Internal covariate shift |
| He / Xavier init | Activation variance instability |

Malaysian Context — AI Education and Industry Adoption

Backpropagation and its underlying mathematics form the foundation of AI education programmes across Malaysian universities. Universiti Malaya (UM), Universiti Teknologi Malaysia (UTM), and Universiti Putra Malaysia (UPM) offer machine learning courses where backpropagation is taught as a core module within deep learning curricula. Private institutions such as Asia Pacific University (APU) and Multimedia University (MMU) have expanded their AI degree offerings, incorporating hands-on deep learning labs that train students to implement backpropagation from scratch before moving to framework-level abstractions.

Malaysia's National AI Roadmap, administered by MDEC and the Ministry of Digital, identifies talent development as a critical pillar. HRD Corp (Human Resources Development Corporation) funds upskilling programmes for industry professionals that frequently include neural network training fundamentals. Companies such as Maxis, TM (Telekom Malaysia), and Celcom Axiata have sponsored internal data science academies where engineers learn to build and fine-tune neural networks using PyTorch and TensorFlow, directly applying backpropagation-based optimisation to business problems such as network traffic prediction and customer churn modelling.

In the fintech sector, Maybank and CIMB have deployed deep learning models for fraud detection and credit scoring, both of which depend on backpropagation for training. Malaysia Digital Economy Corporation (MDEC) has issued grants under the Malaysia Digital Acceleration Grant (MDAG) programme to AI startups building custom neural network solutions for local industries, including agriculture (oil palm yield forecasting) and manufacturing quality inspection — all relying on backpropagation as the training engine.

The broader ASEAN context is also relevant: as regional AI hubs like Singapore's AI Singapore (AISG) publish open-source learning resources, Malaysian practitioners increasingly leverage these to accelerate their understanding of optimisation algorithms including backpropagation. Cross-border research collaboration between Malaysian universities and institutions in Singapore, Japan, and South Korea continues to advance applied deep learning research in the region.

References

1. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
3. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Tags: backpropagation, neural networks, gradient descent, training, deep learning

Type	Optimisation algorithm
Field	Machine learning, neural networks
Introduced	1986 (Rumelhart, Hinton, Williams)
Key use	Training multi-layer neural networks
Related	Gradient descent, chain rule, loss function