Adam Optimizer

Adam is an adaptive gradient-based optimization algorithm for training neural networks that combines momentum with per-parameter adaptive learning rates derived from estimates of the first and second moments of the gradients.

4 min read | Last updated June 2026 | Foundations

Adam, short for Adaptive Moment Estimation, is an optimization algorithm used to train machine learning models, particularly deep neural networks. It was introduced in 2014 by Diederik Kingma and Jimmy Ba and has since become one of the most widely used optimizers, available as a built-in optimizer in frameworks such as TensorFlow and PyTorch. Adam computes an individual adaptive learning rate for each parameter by maintaining running estimates of both the first moment (the mean) and the second moment (the uncentred variance) of the gradients.

Background

Training a neural network involves minimising a loss function by adjusting parameters in the direction indicated by gradients computed through backpropagation. Plain stochastic gradient descent uses a single fixed learning rate for all parameters, which can converge slowly and is sensitive to the choice of that rate. Two earlier ideas addressed different weaknesses. Momentum accumulates an exponentially decaying average of past gradients, smoothing the trajectory and accelerating progress along consistent directions. RMSprop scales each parameter's step by a decaying average of recent squared gradients, adapting the effective learning rate to how steep or noisy each dimension is. Adam unifies these two ideas.
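The two ingredients described above can be sketched in a few lines of plain Python. This is a minimal illustration on a single scalar parameter, not a reference implementation: the function names, the default hyperparameter values, and the toy loss f(x) = x^2 are all assumptions chosen for the example, and the momentum update uses the exponentially-decaying-average form described in the text.

```python
import math


def momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    """One update using an exponentially decaying average of past gradients."""
    velocity = beta * velocity + (1.0 - beta) * grad
    return param - lr * velocity, velocity


def rmsprop_step(param, grad, sq_avg, lr=0.1, rho=0.9, eps=1e-8):
    """One update that scales the step by a decaying average of squared gradients."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad * grad
    return param - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


def minimise(step_fn, start=5.0, steps=200):
    """Run a step function on the toy loss f(x) = x**2, whose gradient is 2*x."""
    x, state = start, 0.0
    for _ in range(steps):
        x, state = step_fn(x, 2.0 * x, state)
    return x


if __name__ == "__main__":
    print("momentum:", minimise(momentum_step))
    print("rmsprop: ", minimise(rmsprop_step))
```

Each function keeps exactly one piece of running state per parameter. Adam keeps both at once.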

How Adam works

Adam keeps two exponentially decaying averages for every parameter. The first moment estimate, often written m_t, tracks the average direction of recent gradients, playing the role of momentum. The second moment estimate, written v_t, tracks the average magnitude of recent squared gradients, playing the role of RMSprop. At each step the parameter is moved by the base learning rate times m_t divided by the square root of v_t (plus a small stability constant), so that parameters with large or noisy gradients take smaller steps and parameters with small, consistent gradients take larger ones.

Because both averages are initialised at zero, they are biased toward zero during the early iterations. Adam corrects this with a bias-correction step that rescales m_t and v_t before they are used, which improves stability at the start of training. The algorithm exposes a small number of hyperparameters: a base learning rate, two decay rates for the moment estimates (commonly 0.9 and 0.999), and a tiny constant added for numerical stability.
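Written out, the update described above takes the following form. The symbols other than m_t and v_t are notation introduced here for illustration, following the original paper: g_t is the gradient at step t, \theta_t the parameter, \alpha the base learning rate, \beta_1 and \beta_2 the two decay rates, and \epsilon the stability constant.

```latex
\begin{align}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}
```

As a quick check of why the correction matters, take t = 1 with m_0 = v_0 = 0, \beta_1 = 0.9 and \beta_2 = 0.999. The raw estimates are m_1 = 0.1 g_1 and v_1 = 0.001 g_1^2, both far smaller than the gradient they summarise. Dividing by 1 - 0.9 = 0.1 and 1 - 0.999 = 0.001 restores \hat{m}_1 = g_1 and \hat{v}_1 = g_1^2, so the first step has magnitude close to \alpha.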

| Component | Borrowed from | Role |
| --- | --- | --- |
| First moment estimate | Momentum | Direction of travel |
| Second moment estimate | RMSprop | Per-parameter scaling |
| Bias correction | Adam | Stability in early steps |
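The three components in the table can be seen together in a short scalar sketch. It is a minimal illustration under stated assumptions rather than a framework implementation: the function names, the toy loss f(x) = x^2, and the learning rate used in the demo loop are hypothetical choices, while the decay rates 0.9 and 0.999 are the commonly used values mentioned above.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    # First moment: decaying average of gradients (the momentum component).
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: decaying average of squared gradients (the RMSprop component).
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction: undo the pull toward the zero initialisation.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def minimise_quadratic(start=5.0, steps=1000, lr=0.1):
    """Minimise the toy loss f(x) = x**2 (gradient 2*x) with Adam."""
    x, m, v = start, 0.0, 0.0
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=lr)
    return x


if __name__ == "__main__":
    print("final x:", minimise_quadratic())
```

In practice the same arithmetic is applied element-wise to every weight tensor, with m and v stored alongside the parameters. This is why Adam needs roughly twice the optimizer memory of plain momentum.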

Strengths, variants, and limitations

Adam is valued for converging quickly, requiring little manual tuning, and performing robustly on noisy or sparse gradients, which makes it a strong default for many deep learning tasks. It is not universally optimal, however. On some problems, well-tuned stochastic gradient descent with momentum generalises better, and Adam can converge to sharper minima. These observations motivated variants such as AdamW, which decouples weight decay from the gradient update and is now standard for training large language models, as well as AdaMax, Nadam, and other refinements. Despite these alternatives, Adam and AdamW remain foundational tools in modern model training.
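The difference between Adam with an L2 penalty and AdamW can be illustrated with a scalar sketch. The helper and function names are hypothetical, and the decoupled decay is written in the common form where it is scaled by the learning rate; real implementations may differ in details such as scheduling.

```python
import math


def _adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam step direction and the updated moments."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(param, grad, m, v, t, lr=0.001, weight_decay=0.01):
    """Adam with L2 regularisation: the penalty is folded into the gradient,
    so it is rescaled by the adaptive denominator like any other gradient term."""
    direction, m, v = _adam_direction(grad + weight_decay * param, m, v, t)
    return param - lr * direction, m, v


def adamw_step(param, grad, m, v, t, lr=0.001, weight_decay=0.01):
    """AdamW-style update: the decay shrinks the parameter directly and
    never passes through the moment estimates."""
    param = param - lr * weight_decay * param
    direction, m, v = _adam_direction(grad, m, v, t)
    return param - lr * direction, m, v


if __name__ == "__main__":
    # With a zero loss gradient, only the regularisation term acts.
    print("Adam + L2:", adam_l2_step(1.0, 0.0, 0.0, 0.0, 1)[0])
    print("AdamW:    ", adamw_step(1.0, 0.0, 0.0, 0.0, 1)[0])
```

With a zero loss gradient, the L2 variant moves the parameter by roughly the full learning rate regardless of how small the decay coefficient is, because the adaptive scaling normalises the penalty away. The decoupled variant shrinks the parameter by the factor 1 - lr * weight_decay.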

Malaysian Context — Training Infrastructure and Skills

The Adam optimizer is part of the standard toolkit for anyone training neural networks in Malaysia, from university research groups to industry data science teams. It is taught in machine learning and deep learning courses at institutions such as Universiti Malaya, Universiti Sains Malaysia, Universiti Teknologi Malaysia, and Universiti Kebangsaan Malaysia, and it underpins applied projects at MIMOS, Malaysia's national applied research institute.

Practical use of optimizers like Adam depends on access to accelerated computing. The growth of data centre capacity in Malaysia, including major facilities in Johor and Cyberjaya, gives local researchers and companies access to the GPU resources required to train sizeable models. National efforts such as the development of Malaysian large language models, including projects associated with MIMOS and YTL, rely on these optimization methods during training and fine-tuning.

Skills development is supported by the Human Resources Development Corporation (HRD Corp), which funds machine learning and data science training, and by MDEC programmes aimed at building AI talent among local enterprises and graduates. For most Malaysian practitioners, the relevant knowledge is not deriving optimization algorithms from scratch but understanding how to configure learning rates, decay parameters, and weight decay to train models reliably and efficiently.

As Malaysian organisations in banking, telecommunications, and manufacturing build their own models for fraud detection, demand forecasting, and quality inspection, familiarity with optimizers such as Adam and AdamW is a core competency for the country's expanding pool of AI engineers.

References

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015.
Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019.
Ruder, S. (2016). An Overview of Gradient Descent Optimization Algorithms.
DigitalOcean. Intro to Optimization in Deep Learning: Momentum, RMSProp and Adam.

Tags: Adam, optimization, gradient descent, deep learning, training

Type	Adaptive optimization algorithm
Full name	Adaptive Moment Estimation
Proposed by	Diederik Kingma and Jimmy Ba
Year	2014
Combines	Momentum and RMSprop
Related	Gradient descent, backpropagation