Adam Optimizer
Adam is an adaptive gradient-based optimization algorithm for training neural networks that combines momentum with per-parameter adaptive learning rates derived from estimates of the first and second moments of the gradients.
Adam, short for Adaptive Moment Estimation, is an optimization algorithm used to train machine learning models, particularly deep neural networks. It was introduced in 2014 by Diederik Kingma and Jimmy Ba and has since become one of the most widely used optimizers, available as a built-in optimizer in frameworks such as TensorFlow and PyTorch. Adam computes an individual adaptive learning rate for each parameter by maintaining running estimates of both the first moment (the mean) and the second moment (the uncentred variance) of the gradients.
Background
Training a neural network involves minimising a loss function by adjusting parameters in the direction indicated by gradients computed through backpropagation. Plain stochastic gradient descent applies one global learning rate to every parameter, which can converge slowly and makes training sensitive to the choice of that rate. Two earlier ideas addressed different weaknesses. Momentum accumulates an exponentially decaying average of past gradients, smoothing the trajectory and accelerating progress along consistent directions. RMSprop scales each parameter's step by a decaying average of recent squared gradients, adapting the effective learning rate to how steep or noisy each dimension is. Adam unifies these two ideas.
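The two building blocks can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `momentum_step` and `rmsprop_step`, the default values, and the small constant `eps` are assumptions made for the example, and the momentum form shown is the exponentially-averaged one described above rather than the only formulation in use.

```python
import numpy as np

def momentum_step(theta, grad, avg_grad, lr=0.01, beta=0.9):
    """One momentum update: step along a decaying average of past gradients."""
    avg_grad = beta * avg_grad + (1 - beta) * grad
    return theta - lr * avg_grad, avg_grad

def rmsprop_step(theta, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update: divide the gradient by the root of a decaying
    average of its recent squared values (eps avoids division by zero)."""
    avg_sq = rho * avg_sq + (1 - rho) * grad**2
    return theta - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq
```

In this sketch, multiplying the gradient by 100 makes the momentum step 100 times larger but leaves the RMSprop step almost unchanged, which is the per-parameter rescaling that Adam inherits.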
How Adam works
Adam keeps two exponentially decaying averages for every parameter. The first moment estimate, often written m_t, tracks the average direction of recent gradients, playing the role of momentum. The second moment estimate, written v_t, tracks the average of recent squared gradients, playing the role of RMSprop's scaling term. At each step the parameter moves by the learning rate times m_t divided by the square root of v_t, so that parameters whose gradients are large or noisy take smaller steps and parameters whose gradients are small but consistent take relatively larger ones.
Because both averages are initialised at zero, they are biased toward zero during the early iterations. Adam corrects this with a bias-correction step that divides each average by one minus its decay rate raised to the power of the step count before it is used, which improves stability at the start of training. The algorithm exposes a small number of hyperparameters: a base learning rate, two decay rates for the moment estimates (commonly 0.9 and 0.999), and a small constant (commonly 1e-8) added to the denominator for numerical stability.
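Written out, the update described above takes the following form. The symbols are assumptions borrowed from common notation rather than from the text above: g_t is the gradient at step t, θ the parameter, α the base learning rate, β_1 and β_2 the two decay rates, and ε the stability constant, with m_0 = v_0 = 0.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

The effect of the correction is easiest to see at the first step. With β_1 = 0.9, the raw average is m_1 = 0.1 g_1, only a tenth of the observed gradient, while the corrected value is m̂_1 = 0.1 g_1 / (1 − 0.9) = g_1.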
| Component | Borrowed from | Role |
| --- | --- | --- |
| First moment estimate | Momentum | Direction of travel |
| Second moment estimate | RMSprop | Per-parameter scaling |
| Bias correction | Adam | Stability in early steps |
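The three components can be combined into a single update function. The following is a minimal NumPy sketch, not a reference implementation: `adam_step` and `minimise_quadratic` are hypothetical names, the toy quadratic loss is invented for illustration, and the defaults follow the commonly quoted values (decay rates 0.9 and 0.999, stability constant 1e-8).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1**t)               # undo the bias from initialising m at zero
    v_hat = v / (1 - beta2**t)               # undo the bias from initialising v at zero
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def minimise_quadratic(steps=1000, lr=0.01):
    """Toy example (an assumption for illustration): minimise sum(c * theta**2),
    where the two coordinates have gradients that differ in scale by 100x."""
    c = np.array([1.0, 100.0])
    theta = np.array([1.0, 1.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        grad = 2 * c * theta                 # exact gradient of the toy loss
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly `lr` even though one gradient is 100 times larger than the other.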
Strengths, variants, and limitations
Adam is valued for converging quickly, requiring little manual tuning, and performing robustly on noisy or sparse gradients, which makes it a strong default for many deep learning tasks. It is not universally optimal, however. On some problems, well-tuned stochastic gradient descent with momentum generalises better, and Adam has been reported to converge to sharper minima. Concerns of this kind motivated AdamW, which decouples weight decay from the gradient-based update and is now standard for training large language models. Other variants include AdaMax, proposed alongside Adam in the original paper, and Nadam, which incorporates Nesterov momentum. Despite these alternatives, Adam and AdamW remain foundational tools in modern model training.
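The difference between folding weight decay into the gradient and decoupling it can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `_adam_direction`, `adam_l2_step`, and `adamw_step` are hypothetical, and the decoupled form shown (decay scaled by the learning rate) is one common convention rather than the only one.

```python
import numpy as np

def _adam_direction(grad, m, v, t, beta1, beta2, eps):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps), plus updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, weight_decay=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularisation: the decay term is added to the gradient,
    so it is rescaled by the adaptive denominator like any other gradient."""
    direction, m, v = _adam_direction(grad + weight_decay * theta, m, v, t,
                                      beta1, beta2, eps)
    return theta - lr * direction, m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled weight decay: the moments see only the raw gradient, and the
    decay shrinks the parameters directly, outside the adaptive scaling."""
    direction, m, v = _adam_direction(grad, m, v, t, beta1, beta2, eps)
    return theta - lr * (direction + weight_decay * theta), m, v
```

With a zero gradient, the decoupled version shrinks a parameter in proportion to `lr * weight_decay`, whereas in the L2 version the decay term passes through the adaptive normalisation and its size no longer reflects `weight_decay` directly.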
References
- Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015.
- Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
- Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms.
- DigitalOcean. Intro to Optimization in Deep Learning: Momentum, RMSProp and Adam.