Gradient Descent

Gradient descent is an iterative optimisation algorithm that minimises a loss function by repeatedly updating model parameters in the direction of the steepest descent, as defined by the negative gradient.

6 min read · Last updated May 2026 · Foundations

Gradient descent is the workhorse optimisation algorithm underlying nearly all machine learning model training. Given a differentiable loss function that quantifies how poorly a model performs on training data, gradient descent iteratively adjusts the model's parameters to find values that minimise the loss. The algorithm is simple in principle — move parameters in the direction of steepest decrease — but its practical effectiveness depends on careful engineering choices around batch size, learning rate scheduling, and algorithmic variants.

Mathematical Foundations

The loss function L(θ) maps model parameters θ (weights and biases) to a scalar value representing prediction error. The gradient ∇L(θ) is a vector of partial derivatives indicating the direction and rate of steepest increase of L. Gradient descent subtracts a fraction of this gradient from the current parameters at each step:

θ ← θ − η · ∇L(θ)

where η (eta) is the learning rate. Too small a learning rate leads to slow convergence, while too large a rate causes the parameters to oscillate or diverge. Choosing an appropriate learning rate, or scheduling it to decay over training, is one of the central hyperparameter decisions in practice.
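As a minimal sketch of the update rule and of how the learning rate affects it, the snippet below runs gradient descent on a hypothetical one-dimensional loss L(θ) = (θ − 3)². The loss, the function names, and the three learning rates are illustrative assumptions, not part of any particular library.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Apply theta <- theta - lr * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Hypothetical toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def grad_quadratic(theta):
    return 2.0 * (theta - 3.0)


# Same start and step budget, three different learning rates.
slow = gradient_descent(grad_quadratic, theta0=0.0, lr=0.01, steps=50)
good = gradient_descent(grad_quadratic, theta0=0.0, lr=0.1, steps=50)
diverged = gradient_descent(grad_quadratic, theta0=0.0, lr=1.1, steps=50)

print(slow, good, diverged)
```

On this loss each step multiplies the distance to the minimum by (1 − 2η), so η = 0.01 shrinks it slowly, η = 0.1 shrinks it quickly, and η = 1.1 makes it grow with alternating sign.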

Variants by Batch Size

Batch Gradient Descent

The classical form computes the gradient over the entire training dataset before each update. This yields the exact gradient of the training loss rather than a noisy estimate, but it is computationally expensive for large datasets, because every iteration must process all examples.

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) updates parameters using the gradient computed from a single randomly selected training example. Updates are noisy, but the algorithm makes rapid progress and can escape shallow local minima due to that noise. SGD was the dominant training algorithm before the rise of adaptive methods and remains widely used, particularly with momentum.

Mini-Batch Gradient Descent

Standard practice in deep learning is to compute each gradient estimate on a mini-batch, a small subset of the training data (typically 32 to 512 examples). Mini-batches are large enough to exploit efficient matrix operations on GPUs and to reduce gradient noise relative to single-sample SGD, yet each update remains far cheaper than a full pass over the dataset. In most frameworks, "SGD" implicitly means mini-batch SGD.
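The three variants differ only in how many examples feed each gradient estimate, which the following sketch makes explicit with a single `batch_size` argument. The linear model, the noise-free data, and the function name `minibatch_gd` are assumptions chosen for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w * x + b by minimising mean squared error.

    batch_size == len(xs) gives batch gradient descent,
    batch_size == 1 gives single-example SGD,
    anything in between is mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]

w_full, b_full = minibatch_gd(xs, ys, batch_size=len(xs))
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=4)
w_sgd, b_sgd = minibatch_gd(xs, ys, batch_size=1)
```

Because this toy data contains no noise, all three settings recover roughly the same line. On real data the smaller batch sizes would follow a noisier path and perform many more updates per epoch.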

Momentum and Adaptive Methods

Vanilla gradient descent applies a single learning rate to every parameter and, unless a schedule is added, keeps that rate fixed throughout training. A family of improved optimisers addresses these limitations.

Momentum accumulates a velocity vector in the direction of persistent gradients, dampening oscillations in narrow valleys of the loss surface and accelerating convergence in directions of consistent gradient. The update rule incorporates an exponential moving average of past gradients.
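A minimal sketch of one common momentum convention (often called heavy-ball) is shown below: the velocity is an exponentially decaying sum of past gradients, and the moving-average form would additionally scale each new gradient by (1 − β). The narrow-valley loss and all names here are illustrative assumptions.

```python
def sgd_momentum(grad, theta, lr, beta, steps):
    """Heavy-ball momentum: v <- beta * v + g, theta <- theta - lr * v."""
    velocity = [0.0] * len(theta)
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta)
        velocity = [beta * v + gi for v, gi in zip(velocity, g)]
        theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta


# Hypothetical narrow-valley loss L(x, y) = 0.5 * (x^2 + 50 * y^2).
def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 50.0 * y * y)


def grad(theta):
    x, y = theta
    return [x, 50.0 * y]


# beta = 0.0 reduces the update to plain gradient descent.
plain = sgd_momentum(grad, [1.0, 1.0], lr=0.03, beta=0.0, steps=200)
with_momentum = sgd_momentum(grad, [1.0, 1.0], lr=0.03, beta=0.9, steps=200)

print(loss(plain), loss(with_momentum))
```

With the same learning rate and step budget, the momentum run ends at a lower loss on this example because it makes faster progress along the shallow x direction.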

AdaGrad adapts learning rates per parameter based on historical gradient magnitudes, making larger updates for infrequently updated parameters. However, its learning rates monotonically decrease, which can halt learning prematurely.
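The per-parameter behaviour can be sketched as follows, assuming a hypothetical gradient stream in which one parameter receives a gradient on every step and the other only on every tenth step. The helper name and the hyperparameter values are illustrative.

```python
import math


def adagrad_step(theta, g, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; accum holds the running sum of squared gradients."""
    accum = [a + gi * gi for a, gi in zip(accum, g)]
    theta = [t - lr * gi / (math.sqrt(a) + eps) for t, gi, a in zip(theta, g, accum)]
    return theta, accum


# Hypothetical gradient stream: parameter 0 gets a gradient every step,
# parameter 1 only on every tenth step (a "sparse" feature).
theta, accum = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    g = [1.0, 1.0 if step % 10 == 0 else 0.0]
    theta, accum = adagrad_step(theta, g, accum)

# Step size each parameter would see for a unit gradient on the next update.
effective_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in accum]
print(accum, effective_lr)
```

The rarely updated parameter keeps a larger effective learning rate, while the accumulator for the frequently updated one only ever grows, which is the source of the monotonic decay described above.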

RMSprop addresses AdaGrad's issue by using an exponential moving average of squared gradients instead of a cumulative sum, keeping the effective learning rate from collapsing.
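To illustrate the difference from a cumulative sum, the sketch below applies a hypothetical constant gradient and tracks the moving average of its square. The decay value 0.9 and the helper name are assumptions for illustration.

```python
import math


def rmsprop_step(theta, g, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update; avg_sq is a moving average of squared gradients."""
    avg_sq = [decay * a + (1.0 - decay) * gi * gi for a, gi in zip(avg_sq, g)]
    theta = [t - lr * gi / (math.sqrt(a) + eps) for t, gi, a in zip(theta, g, avg_sq)]
    return theta, avg_sq


# Hypothetical constant gradient of 2.0 applied for 200 steps.
theta, avg_sq = [0.0], [0.0]
for _ in range(200):
    theta, avg_sq = rmsprop_step(theta, [2.0], avg_sq)

# Size of the next update for the same gradient.
step_size = 0.01 * 2.0 / (math.sqrt(avg_sq[0]) + 1e-8)
print(avg_sq[0], step_size)
```

The average settles near the squared gradient (4.0 here) instead of growing without bound, so the step size levels off near the base learning rate rather than shrinking toward zero.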

Adam (Adaptive Moment Estimation), introduced by Diederik Kingma and Jimmy Ba in 2014, combines momentum (first moment) with RMSprop-style adaptive learning rates (second moment).[^1] Adam has become the default optimiser in large-scale deep learning tasks due to its fast, stable convergence across a wide range of architectures. Variants including AdamW (which decouples weight decay from the gradient update), Adan, and Lion have been proposed as further improvements.
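A compact sketch of the Adam update with bias-corrected moment estimates follows. The toy quadratic loss, the learning rate of 0.1, and the step count are assumptions chosen so the example converges quickly, and the code is an illustration rather than a reference implementation.

```python
import math


def adam(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam with bias-corrected first and second moment estimates."""
    m = [0.0] * len(theta)  # first moment: moving average of gradients
    v = [0.0] * len(theta)  # second moment: moving average of squared gradients
    theta = list(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Correct the bias toward zero caused by initialising m and v at 0.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [p - lr * mh / (math.sqrt(vh) + eps)
                 for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


# Hypothetical toy loss L(x, y) = (x - 1)^2 + 10 * (y + 2)^2.
def grad(theta):
    x, y = theta
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]


solution = adam(grad, [0.0, 0.0])
print(solution)
```

Dividing by the square root of the second moment makes early steps roughly the same size in both coordinates, even though the raw gradients differ by an order of magnitude.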

| Optimiser | Adaptive LR | Momentum | Notes |
|-----------|-------------|----------|-------|
| SGD | No | Optional | Simple, widely used with scheduling |
| SGD + Momentum | No | Yes | Faster convergence in practice |
| AdaGrad | Yes (cumulative) | No | Suited to sparse data |
| RMSprop | Yes (moving avg) | No | Fixes AdaGrad's decay issue |
| Adam | Yes | Yes | Default for most deep learning |
| AdamW | Yes | Yes | Corrects weight decay coupling |

Learning Rate Scheduling

Fixed learning rates rarely achieve optimal results. Common schedules include step decay (reduce by a factor every N epochs), cosine annealing (smooth decay following a cosine curve), and warmup strategies (start from a low rate and ramp up before decaying). The transformer training recipe introduced in "Attention Is All You Need" (2017) uses a specific warmup followed by inverse square-root decay, which became widely adopted for language model training.[^2]
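The schedules named above can be sketched as small functions of the epoch or step. The default values are illustrative assumptions, except that `d_model=512` and `warmup_steps=4000` follow the base configuration reported in the transformer paper.

```python
import math


def step_decay(epoch, base_lr=0.1, factor=0.5, every=10):
    """Multiply the rate by `factor` once every `every` epochs."""
    return base_lr * factor ** (epoch // every)


def cosine_annealing(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine period."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def warmup_inverse_sqrt(step, d_model=512, warmup_steps=4000):
    """Linear warmup, then decay proportional to 1 / sqrt(step); step starts at 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for s in (1, 2000, 4000, 8000, 16000):
    print(s, warmup_inverse_sqrt(s))
```

In the warmup schedule the two arguments of `min` cross at `step == warmup_steps`, so the rate rises linearly up to that point and decays as the inverse square root of the step afterwards.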

Challenges: Local Minima and Saddle Points

On non-convex loss surfaces, such as those of deep networks, gradient descent can in theory become trapped in local minima. Empirically, however, large over-parameterised networks seem to encounter relatively few problematic local minima. Saddle points, where the gradient is zero but the point is neither a minimum nor a maximum, are more common, but the noise in SGD helps the optimiser escape them. Research into the loss-landscape geometry of deep networks (by Goodfellow, Vinyals, Saxe, and others) has informed better initialisation and optimisation strategies.
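The role of noise at a saddle point can be illustrated on the hypothetical loss L(x, y) = x² − y², which has a saddle at the origin. In this sketch, Gaussian noise added to the gradient is an assumed stand-in for mini-batch noise, and the loss is unbounded below, so it only demonstrates leaving the saddle, not reaching a minimum.

```python
import random


def descend(noise_std, steps=100, lr=0.1, seed=0):
    """Gradient descent on L(x, y) = x^2 - y^2, which has a saddle at (0, 0)."""
    rng = random.Random(seed)
    # Start exactly on the line y = 0, along which plain descent slides into the saddle.
    x, y = 1.0, 0.0
    for _ in range(steps):
        gx = 2.0 * x + rng.gauss(0.0, noise_std)
        gy = -2.0 * y + rng.gauss(0.0, noise_std)
        x, y = x - lr * gx, y - lr * gy
    return x, y


x_plain, y_plain = descend(noise_std=0.0)
x_noisy, y_noisy = descend(noise_std=1e-3)
print((x_plain, y_plain), (x_noisy, y_noisy))
```

Without noise the iterate converges to the saddle and stays there, because the y component of the gradient is exactly zero along its path. A small perturbation in y is amplified at every step, so the noisy run moves away along the descending direction.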

References

  1. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  2. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  3. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  4. Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. ICLR 2019.