Layer Normalisation

Layer normalisation is a technique that normalises the inputs across the features of a single training example, stabilising and accelerating the training of deep neural networks, especially transformers.

4 min read | Last updated June 2026 | Foundations

Layer normalisation is a method for stabilising the training of deep neural networks by normalising the summed inputs to the neurons within a layer for each individual training example. Introduced in 2016 by Jimmy Ba, Jamie Ryan Kiros and Geoffrey Hinton, it rescales activations so that, across the features of one example, they have zero mean and unit variance before a learned scale and shift are applied. The normalised value for an input x is computed as x_hat = (x - mu) / sqrt(sigma^2 + eps), where mu and sigma are the mean and standard deviation taken over the layer's features and eps is a small constant for numerical stability. The layer's output is then gamma * x_hat + beta, where gamma and beta are the learned scale (gain) and shift (bias).
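The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `layer_norm`, the default `eps` of 1e-5, and the example feature values are all assumptions chosen for the sketch, and real frameworks operate on tensors rather than lists.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalise one example's features, then apply a scale and shift.

    x     : list of floats, the features of a single example
    gamma : per-feature scale (defaults to 1.0, an assumption for this sketch)
    beta  : per-feature shift (defaults to 0.0, an assumption for this sketch)
    """
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n

    mu = sum(x) / n                              # mean over the features
    var = sum((v - mu) ** 2 for v in x) / n      # (biased) variance over the features
    inv_std = 1.0 / math.sqrt(var + eps)         # eps guards against division by zero

    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]


features = [2.0, 4.0, 6.0, 8.0]                  # illustrative values only
normalised = layer_norm(features)

out_mean = sum(normalised) / len(normalised)
out_var = sum((v - out_mean) ** 2 for v in normalised) / len(normalised)
print(normalised, out_mean, out_var)
```

With the default scale and shift, the output has a mean of approximately zero and a variance very close to one; the small gap from exactly one comes from `eps`.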

Motivation

As signals pass through the many layers of a deep network, the distribution of activations can shift and vary widely during training, a phenomenon that slows convergence and can destabilise optimisation. Normalising activations keeps them in a well-behaved range, which permits higher learning rates, smooths the optimisation landscape and reduces sensitivity to initialisation. After normalising, a learned gain and bias restore the network's ability to represent any required scale, so no representational power is lost.
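The claim that the gain and bias preserve representational power can be checked on a toy input. The sketch below is an assumption-laden illustration: it sets the gain and bias by hand from the statistics of one example, whereas in a real network they are per-feature parameters learned by gradient descent. It only shows that the original scale remains expressible after normalisation.

```python
import math


def layer_norm(x, gain, bias, eps=1e-5):
    """Layer normalisation with an explicit per-feature gain and bias."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


x = [2.0, 4.0, 6.0, 8.0]                         # illustrative values only
eps = 1e-5
mu = sum(x) / len(x)
std = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + eps)

# Unit gain and zero bias give the plain standardised activations.
standardised = layer_norm(x, gain=[1.0] * len(x), bias=[0.0] * len(x), eps=eps)

# A gain equal to the original spread and a bias equal to the original mean
# (hand-picked here, not learned) map the activations back to the input.
recovered = layer_norm(x, gain=[std] * len(x), bias=[mu] * len(x), eps=eps)

print(standardised)
print(recovered)
```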

Difference from batch normalisation

Layer normalisation is closely related to the earlier batch normalisation but differs in the axis over which statistics are computed. Batch normalisation normalises each feature across the examples in a mini-batch, making its behaviour dependent on batch size and on the distinction between training and inference. Layer normalisation instead normalises across the features within a single example and is therefore independent of batch size and identical at training and inference time.

This independence is decisive for sequence models and for settings with small or variable batch sizes. Recurrent networks and transformers process inputs of differing lengths and often train with modest batches, conditions under which batch normalisation performs poorly. Layer normalisation sidesteps these problems, which is why it became the standard choice for these architectures.

| Property | Layer normalisation | Batch normalisation |
| --- | --- | --- |
| Normalises across | Features of one example | A feature across the batch |
| Depends on batch size | No | Yes |
| Train vs inference | Identical | Differs |
| Typical use | Transformers, RNNs | Convolutional networks |
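The difference in normalisation axis can be made concrete with a small experiment. The sketch below is a simplified illustration under stated assumptions: the helper names are hypothetical, the batch-normalisation function uses training-mode batch statistics only (no running averages, no learned scale or shift), and the data are made up. It places the same example in two different mini-batches and compares the results.

```python
import math

EPS = 1e-5  # assumed stability constant for this sketch


def standardise(values):
    """Subtract the mean and divide by the standard deviation of `values`."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + EPS) for v in values]


def layer_norm_batch(batch):
    """Statistics per row: each example is normalised over its own features."""
    return [standardise(row) for row in batch]


def batch_norm_batch(batch):
    """Statistics per column: each feature is normalised across the examples."""
    columns = [standardise(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


example = [1.0, 2.0, 3.0]
batch_a = [example, [4.0, 6.0, 8.0]]     # same example, different companions
batch_b = [example, [10.0, 0.0, -5.0]]

ln_a = layer_norm_batch(batch_a)[0]
ln_b = layer_norm_batch(batch_b)[0]
bn_a = batch_norm_batch(batch_a)[0]
bn_b = batch_norm_batch(batch_b)[0]

print("layer norm unchanged:", ln_a == ln_b)
print("batch norm, batch A:", bn_a)
print("batch norm, batch B:", bn_b)
```

The layer-normalised output for the example is the same in both batches, while the batch-normalised output changes with the other examples present, which is the batch dependence summarised in the table.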

Role in transformers

Layer normalisation is an integral part of the transformer architecture that underlies modern large language models. Each transformer block applies it around the attention and feed-forward sub-layers, combined with residual connections, to keep training stable as networks scale to many layers and billions of parameters. Two arrangements are common. In post-norm, used in the original transformer, normalisation is applied after the sub-layer's output has been added to the residual; in pre-norm, it is applied to the sub-layer's input, leaving the residual path itself unnormalised. Pre-norm generally gives more stable training of very deep models. Variants such as RMSNorm, which normalises using only the root mean square and omits the mean subtraction, are now widely used in large models for efficiency. The continued reliance on layer normalisation and its descendants reflects how essential normalisation is to training the largest contemporary neural networks.
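The two placements and the RMSNorm variant can be sketched as follows. This is an illustrative sketch, not any particular model's implementation: `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sub-layer, the learned scale and shift are omitted from `layer_norm` for brevity, and the `eps` values and their placement inside the square root are assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalisation without learned parameters (for brevity)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: divide by the root mean square, with no mean subtraction."""
    gain = gain if gain is not None else [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def post_norm(x, sublayer, norm):
    """Post-norm: norm(x + sublayer(x))."""
    return norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm(x, sublayer, norm):
    """Pre-norm: x + sublayer(norm(x)); the residual path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward network."""
    return [0.5 * v for v in x]


x = [3.0, -1.0, 2.0, 0.0]                        # illustrative values only

post_out = post_norm(x, toy_sublayer, layer_norm)
pre_out = pre_norm(x, toy_sublayer, rms_norm)
rms_out = rms_norm(x)

print(post_out)
print(pre_out)
print(rms_out)
```

In the post-norm form the block's output is itself normalised, whereas in the pre-norm form the input `x` passes through the residual connection unchanged and only the sub-layer sees normalised values. The RMSNorm output has a mean square close to one but, unlike layer normalisation, its mean is not forced to zero.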

Malaysian Context: Inside the Models Trained Locally

Layer normalisation is a structural component of the transformer-based models being developed and deployed in Malaysia, even though it operates deep inside the network. Homegrown large language models such as ILMU from YTL AI Labs and MaLLaM are built on transformer architectures whose every block depends on layer normalisation or a variant such as RMSNorm to train stably at scale.

Training such models requires substantial compute, and Malaysia's expanding AI-cloud and data-centre footprint, including YTL's GPU infrastructure developed with Nvidia and other investments supported under the MyDigital Blueprint, provides the platform on which these normalisation-dependent networks are trained. The stability that layer normalisation provides is part of what makes it feasible to train large sovereign models domestically.

For the Malaysian AI talent pipeline, understanding normalisation techniques is part of the deep-learning competency taught at universities such as Universiti Malaya and through industry programmes funded by HRD Corp and coordinated with MDEC. As local research groups and companies fine-tune and build transformer models for Bahasa Melayu and regional applications, familiarity with layer normalisation and its modern variants is a practical necessity for engineers working at the model level.

References

Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.
Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
Zhang, B. and Sennrich, R. (2019). Root Mean Square Layer Normalization. Advances in Neural Information Processing Systems.

Tags: normalisation, transformers, deep learning, training stability

Type	Normalisation technique
Introduced	2016, by Ba, Kiros and Hinton
Normalises over	Features of a single example
Key use	Transformer architectures
Benefit	Stable, faster training
Related	Batch normalisation