What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

Gradient Descent

Gradient descent is an iterative optimisation algorithm that minimises a loss function by repeatedly updating model parameters in the direction of the steepest descent, as defined by the negative gradient.

6 min read · Last updated May 2026 · Foundations

Gradient descent is the workhorse optimisation algorithm underlying nearly all machine learning model training. Given a differentiable loss function that quantifies how poorly a model performs on training data, gradient descent iteratively adjusts the model's parameters to find values that minimise the loss. The algorithm is simple in principle — move parameters in the direction of steepest decrease — but its practical effectiveness depends on careful engineering choices around batch size, learning rate scheduling, and algorithmic variants.

Mathematical Foundations

The loss function L(θ) maps model parameters θ (weights and biases) to a scalar value representing prediction error. The gradient ∇L(θ) is a vector of partial derivatives indicating the direction and rate of steepest increase of L. Gradient descent subtracts a fraction of this gradient from the current parameters at each step:

θ ← θ − η · ∇L(θ)

where η (eta) is the learning rate. A small learning rate leads to slow convergence; too large a rate causes the parameters to oscillate or diverge. Choosing an appropriate learning rate — or scheduling it to decay over training — is one of the central hyperparameter decisions in practice.
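The effect of the learning rate can be seen on a toy problem. The following is a minimal sketch, assuming an illustrative one-parameter loss L(θ) = (θ − 3)² that does not come from the article; the function names and the three learning rates are likewise chosen only for illustration.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Apply the update theta <- theta - lr * grad(theta) a fixed number of times."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


def grad_quadratic(theta):
    # Gradient of the illustrative loss L(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)


# Same starting point and step count, three different learning rates.
theta_small = gradient_descent(grad_quadratic, theta0=0.0, lr=0.01, steps=50)
theta_good = gradient_descent(grad_quadratic, theta0=0.0, lr=0.1, steps=50)
theta_large = gradient_descent(grad_quadratic, theta0=0.0, lr=1.1, steps=50)

print(theta_small, theta_good, theta_large)
```

On this loss, each step multiplies the distance to the minimiser θ = 3 by (1 − 2η). A rate of 0.01 shrinks it slowly, 0.1 shrinks it quickly, and 1.1 makes the factor −1.2, so the iterates oscillate with growing amplitude.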

Variants by Batch Size

Batch Gradient Descent

The classical form computes the gradient over the entire training dataset before each update. This produces an accurate gradient estimate but is computationally expensive for large datasets, as each iteration requires processing all examples.
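As a rough sketch of the full-batch form, the code below fits a one-variable linear model y ≈ w·x + b with mean squared error. The dataset, the function name, and the hyperparameter values are assumptions made for this example only.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w * x + b by minimising mean squared error with full-batch updates."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Every update averages the gradient over the entire dataset.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Illustrative noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_batch, b_batch = batch_gradient_descent(xs, ys)
print(w_batch, b_batch)
```

Each epoch touches all five examples before making a single update, which is the cost that becomes prohibitive on large datasets.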

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) updates parameters using the gradient computed from a single randomly selected training example. Updates are noisy, but the algorithm makes rapid progress and can escape shallow local minima due to that noise. SGD was the dominant training algorithm before the rise of adaptive methods and remains widely used, particularly with momentum.
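A single-example variant of the same idea might look like the sketch below. It assumes an illustrative noise-free dataset, which is why a constant learning rate is enough here; with noisy labels, a decaying rate would usually be needed for the parameters to settle. The function name, seed, and hyperparameters are hypothetical.

```python
import random


def sgd(xs, ys, lr=0.02, steps=5000, seed=0):
    """Fit y ~ w * x + b using one randomly chosen example per update."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        err = w * xs[i] + b - ys[i]
        # Gradient of the squared error on example i only.
        w -= lr * 2.0 * err * xs[i]
        b -= lr * 2.0 * err
    return w, b


# Illustrative noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_sgd, b_sgd = sgd(xs, ys)
print(w_sgd, b_sgd)
```

Each update is cheap because it reads one example, but the direction of any individual step depends on which example was drawn.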

Mini-Batch Gradient Descent

The standard practice in deep learning uses mini-batches: small subsets of the training data (typically 32 to 512 examples) from which each gradient estimate is computed. Mini-batches keep the computational efficiency of batched matrix operations on GPUs while producing less noisy gradient estimates than single-sample SGD. Most modern uses of "SGD" in frameworks refer implicitly to mini-batch SGD.
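The mini-batch loop can be sketched in plain Python as follows. This is an illustration under stated assumptions: the dataset, the batch size of 4, and the function name are chosen for the example, and a real framework would vectorise the inner sums.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, lr=0.02, epochs=1000, seed=0):
    """Fit y ~ w * x + b, averaging the gradient over small shuffled batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Illustrative noise-free data generated from y = 2x + 1.
xs = [0.25 * i for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w_mb, b_mb = minibatch_sgd(xs, ys)
print(w_mb, b_mb)
```

Setting `batch_size` to 1 recovers single-sample SGD, and setting it to `len(xs)` recovers batch gradient descent.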

Momentum and Adaptive Methods

Vanilla gradient descent treats all parameters equally and uses the same learning rate throughout training. A family of improved optimisers addresses its limitations.

Momentum accumulates a velocity vector in the direction of persistent gradients, dampening oscillations in narrow valleys of the loss surface and accelerating convergence in directions of consistent gradient. The update rule incorporates an exponential moving average of past gradients.
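One common way to write the momentum update is sketched below, using the classical heavy-ball form v ← βv + g, θ ← θ − ηv. Some formulations scale the gradient by (1 − β) instead, so treat this as one variant rather than the definitive rule. The elongated quadratic loss and all numeric values are assumptions chosen to mimic a narrow valley.

```python
def loss(x, y):
    # Illustrative narrow valley: much steeper along y than along x.
    return 0.5 * (x * x + 25.0 * y * y)


def grad(x, y):
    return x, 25.0 * y


def run(lr, beta, steps):
    """Heavy-ball momentum; beta = 0.0 reduces to plain gradient descent."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity is an exponentially decaying sum of past gradients.
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)


loss_plain = run(lr=0.03, beta=0.0, steps=200)
loss_momentum = run(lr=0.03, beta=0.9, steps=200)
print(loss_plain, loss_momentum)
```

With the same learning rate and step count, the momentum run ends at a lower loss on this surface because the velocity keeps building along the shallow x direction, where plain gradient descent crawls.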

AdaGrad adapts learning rates per parameter based on historical gradient magnitudes, making larger updates for infrequently updated parameters. However, its learning rates monotonically decrease, which can halt learning prematurely.
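The per-parameter scaling can be illustrated with a short sketch. The two-parameter loss, the base rate of 0.5, and the helper name `adagrad_step` are assumptions for this example; the update itself divides each gradient component by the square root of that component's accumulated squared gradients.

```python
import math


def adagrad_step(theta, grad, accum, lr=0.5, eps=1e-8):
    """One AdaGrad update; returns new parameters, accumulators, and effective rates."""
    accum = [a + g * g for a, g in zip(accum, grad)]  # cumulative sum, never shrinks
    rates = [lr / (math.sqrt(a) + eps) for a in accum]
    theta = [t - r * g for t, r, g in zip(theta, rates, grad)]
    return theta, accum, rates


# Illustrative loss 0.5 * (t0^2 + 25 * t1^2): the second parameter sees larger gradients.
theta = [1.0, 1.0]
accum = [0.0, 0.0]
rate_history = []
for _ in range(20):
    grad = [theta[0], 25.0 * theta[1]]
    theta, accum, rates = adagrad_step(theta, grad, accum)
    rate_history.append(rates)

print(rate_history[0], rate_history[-1])
```

After the first step, the parameter with the smaller gradient has the larger effective rate, and because the accumulator only grows, neither rate can ever increase again.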

RMSprop addresses AdaGrad's issue by using an exponential moving average of squared gradients instead of a cumulative sum, keeping the effective learning rate from collapsing.
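The difference between the two accumulators shows up clearly when the gradient magnitude stays constant. The sketch below is illustrative only: the constant gradient stream, the decay factor of 0.9, and the function name are assumptions.

```python
import math


def effective_rates(grads, lr=0.01, rho=0.9, eps=1e-8):
    """Return the final effective learning rates under AdaGrad and RMSprop scaling."""
    accum = 0.0  # AdaGrad: cumulative sum of squared gradients
    avg = 0.0    # RMSprop: exponential moving average of squared gradients
    for g in grads:
        accum += g * g
        avg = rho * avg + (1.0 - rho) * g * g
    return lr / (math.sqrt(accum) + eps), lr / (math.sqrt(avg) + eps)


# 1000 steps with a gradient of constant magnitude 1.
adagrad_lr, rmsprop_lr = effective_rates([1.0] * 1000)
print(adagrad_lr, rmsprop_lr)
```

AdaGrad's rate keeps shrinking in proportion to 1/√t, while RMSprop's settles near the base rate because old squared gradients are forgotten.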

Adam (Adaptive Moment Estimation), introduced by Diederik Kingma and Jimmy Ba in 2014, combines momentum (first moment) with RMSprop-style adaptive learning rates (second moment).[^1] Adam has become the default optimiser in large-scale deep learning tasks due to its fast, stable convergence across a wide range of architectures. Variants including AdamW (which decouples weight decay from the gradient update), Adan, and Lion have been proposed as further improvements.
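A compact sketch of the Adam update with bias correction follows. The defaults β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 follow the values suggested in the original paper; the learning rate of 0.05, the step count, the test loss, and the function names are assumptions for this illustration.

```python
import math


def adam_minimise(grad_fn, theta, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a function given its gradient, using Adam with bias correction."""
    m = [0.0] * len(theta)  # first moment: moving average of gradients
    v = [0.0] * len(theta)  # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialisation of m and v.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [ti - lr * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def grad_fn(theta):
    # Gradient of the illustrative loss 0.5 * (x^2 + 25 * y^2).
    x, y = theta
    return [x, 25.0 * y]


theta_final = adam_minimise(grad_fn, [1.0, 1.0])
print(theta_final)
```

Because each component is divided by the root of its own second-moment estimate, both parameters initially move at a similar pace even though their raw gradients differ by a factor of 25.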

| Optimiser | Adaptive LR | Momentum | Notes |
|-----------|-------------|----------|-------|
| SGD | No | Optional | Simple, widely used with scheduling |
| SGD + Momentum | No | Yes | Faster convergence in practice |
| AdaGrad | Yes (cumulative) | No | Suited to sparse data |
| RMSprop | Yes (moving avg) | No | Fixes AdaGrad's decay issue |
| Adam | Yes | Yes | Default for most deep learning |
| AdamW | Yes | Yes | Corrects weight decay coupling |

Learning Rate Scheduling

Fixed learning rates rarely achieve optimal results. Common schedules include step decay (reduce by a factor every N epochs), cosine annealing (smooth decay following a cosine curve), and warmup strategies (start from a low rate and ramp up before decaying). The transformer training recipe introduced in "Attention Is All You Need" (2017) uses a specific warmup followed by inverse square-root decay, which became widely adopted for language model training.[^2]
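The schedules named above can each be written as a small function of the step or epoch. In the sketch below, the function names and the defaults for `drop`, `every`, and `min_lr` are assumptions; the last function follows the formula given in "Attention Is All You Need", with that paper's d_model = 512 and 4000 warmup steps as defaults.

```python
import math


def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the rate by `drop` once every `every` epochs."""
    return base_lr * drop ** (epoch // every)


def cosine_annealing(base_lr, step, total_steps, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine wave."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def transformer_schedule(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse square-root decay."""
    step = max(step, 1)  # avoid raising zero to a negative power
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


print(step_decay(0.1, 25))
print(cosine_annealing(0.1, 50, 100))
print(transformer_schedule(4000))
```

In the transformer schedule the two arguments of `min` are equal exactly at `warmup_steps`, which is where the rate peaks before it starts to decay.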

Challenges: Local Minima and Saddle Points

For non-convex loss surfaces (as found in deep networks), gradient descent can in theory get trapped in local minima. Empirically, however, large over-parameterised networks seem to encounter relatively few problematic local minima. Saddle points, where the gradient is zero but the point is neither a minimum nor a maximum, are more common, and the noise in SGD helps the optimiser escape them. Research into the loss landscape geometry of deep networks (by Goodfellow, Vinyals, Saxe, and others) has informed better initialisation and optimisation strategies.
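The role of noise at a saddle point can be seen on a toy surface. The sketch below assumes the illustrative function f(x, y) = x² − y², which has a saddle at the origin and is unbounded below, so "escaping" here simply means moving away along the y axis. The Gaussian noise added to the gradient is a stand-in for mini-batch noise, and every name and value is hypothetical.

```python
import random


def descend(noise_std, lr=0.1, steps=50, seed=0):
    """Gradient descent on f(x, y) = x^2 - y^2 with optional Gaussian gradient noise."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # start exactly on the ridge y = 0 that leads into the saddle
    for _ in range(steps):
        gx = 2.0 * x + rng.gauss(0.0, noise_std)
        gy = -2.0 * y + rng.gauss(0.0, noise_std)
        x -= lr * gx
        y -= lr * gy
    return x, y


x_plain, y_plain = descend(noise_std=0.0)
x_noisy, y_noisy = descend(noise_std=0.01)
print(x_plain, y_plain)
print(x_noisy, y_noisy)
```

Without noise, y stays at zero and the iterate slides into the saddle. With even a small perturbation, the y component is amplified at every step and the iterate leaves the saddle region.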

Malaysian Context — Optimisation in Malaysian AI Development

Gradient descent and its variants form a core component of AI engineering curricula across Malaysia. The National AI Roadmap, developed under the Ministry of Science, Technology and Innovation (MOSTI), emphasises foundational AI competency, which includes optimisation theory. Universities such as Universiti Teknologi Malaysia (UTM), Multimedia University (MMU), and Taylor's University offer data science and AI programmes where students implement gradient descent from first principles before applying it via frameworks such as PyTorch and TensorFlow.

In industry, Malaysian organisations deploying deep learning models — including Maybank's fraud detection systems, Petronas's predictive maintenance models for upstream operations, and AirAsia's demand forecasting systems — all rely on Adam or SGD variants to train their models. HRD Corp-accredited training providers like Tertiary Infotech, Arkmind, and AI Centres of Excellence at public universities deliver professional upskilling programmes covering optimisation algorithms for working engineers.

MDEC's Malaysia Digital Acceleration Grant (MDAG-AI) supports AI startups building production-grade models, where gradient-based training underlies every solution. The grant programme, which in 2025 was expanded to RM2.9 million, explicitly includes model development and training infrastructure as eligible activities. Malaysian AI companies competing in the ASEAN market — including those building manufacturing inspection systems for the Penang and Johor electronics corridors — require engineers who can diagnose and fix training instabilities, a skill that depends directly on understanding gradient descent dynamics.

As Malaysia positions itself as a regional AI hub, with data centre investment surging to 690 MW of capacity in early 2025, the demand for engineers who understand the full stack from loss function design to optimiser selection has grown considerably. Institutions like MDEC's Malaysia Digital Academy (MIDA) and private academies affiliated with Amazon Web Services and Microsoft Azure have incorporated optimisation workshops into their cloud AI training tracks.

References

1. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
2. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
3. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
4. Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. ICLR 2019.

Tags: gradient descent, optimisation, machine learning, SGD, Adam, training

Type	First-order iterative optimisation algorithm
Field	Machine learning, mathematical optimisation
Key variants	Batch GD, SGD, Mini-batch GD, Adam, RMSprop
Key use	Training neural networks and linear models
Related	Backpropagation, learning rate, loss function