Batch Normalisation

Batch normalisation is a deep learning technique that normalises the activations of each layer within a mini-batch to accelerate training and improve model stability.

5 min read | Last updated May 2026 | Foundations

Batch normalisation (BatchNorm or BN) is a layer-level technique that rescales the activations within a neural network so that, at each training step, the inputs to subsequent layers have approximately zero mean and unit variance across the mini-batch. Introduced by Sergey Ioffe and Christian Szegedy in 2015 while at Google, it became one of the most widely deployed innovations in deep learning and remains a default choice in modern convolutional networks.

Motivation

The original paper motivated the technique with a phenomenon it called internal covariate shift: as the parameters of earlier layers update, the distribution of inputs feeding into later layers changes continuously. Later layers must then track a moving target, which slows convergence and pushes learning-rate selection toward small, conservative values. Batch normalisation was designed to reduce this shift by re-centring and re-scaling activations at every step.

Although the original "internal covariate shift" framing has been challenged by later analyses — Santurkar et al. (2018) showed that BatchNorm primarily smooths the optimisation landscape — the practical benefits are robust: faster convergence, less sensitivity to weight initialisation, larger usable learning rates, and a mild regularising effect.

Computation

For a mini-batch of activations entering a BatchNorm layer, the operation proceeds in four steps. First, the per-channel mean is computed across the batch (and, for convolutional feature maps, across spatial positions). Second, the per-channel variance is computed over the same elements. Third, each activation is normalised by subtracting the mean and dividing by the square root of the variance plus a small constant epsilon for numerical stability. Fourth, the normalised activations are rescaled by a learned scale parameter gamma and shifted by a learned shift parameter beta, restoring the layer's representational capacity.

A compact way of writing this transformation is: y = gamma * (x - mu) / sqrt(var + eps) + beta, where mu and var are the per-channel batch statistics and gamma and beta are learnable.
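The formula above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `batch_norm_train`, the `(N, C, H, W)` layout, and the default `eps=1e-5` are assumptions made for the example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for activations shaped (N, C, H, W)."""
    # Statistics are per channel, pooled over the batch and spatial axes.
    axes = (0, 2, 3)
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)  # biased (population) variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have shape (C,); reshape so they broadcast per channel.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))  # illustrative input
gamma = np.ones(4)   # identity scale
beta = np.zeros(4)   # zero shift
y = batch_norm_train(x, gamma, beta)

# With gamma = 1 and beta = 0, each channel of y has mean ~0 and variance ~1.
channel_mean = y.mean(axis=(0, 2, 3))
channel_var = y.var(axis=(0, 2, 3))
```

With non-trivial `gamma` and `beta`, each channel of the output would instead have mean `beta` and standard deviation close to `|gamma|`.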

During inference the per-batch statistics are replaced by running averages of mean and variance accumulated during training, so that predictions are deterministic and independent of the rest of the batch.
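One way to picture the running averages is an exponential moving average that is updated on every training step and read back at inference. The sketch below assumes `(N, C)` activations and a momentum of 0.1; the class name `BatchNorm1D` and both defaults are illustrative choices, and real frameworks differ in details such as how the running variance is estimated.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch norm over (N, C) activations with running statistics."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum  # assumed update rate, not from the article
        self.eps = eps

    def forward(self, x, training):
        if training:
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # Exponential moving average of the batch statistics.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: use the accumulated statistics, ignore the batch.
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
for _ in range(200):
    bn.forward(rng.normal(loc=5.0, scale=2.0, size=(32, 3)), training=True)

# At inference, an example gets the same output alone or inside a batch.
sample = rng.normal(loc=5.0, scale=2.0, size=(1, 3))
others = rng.normal(loc=5.0, scale=2.0, size=(7, 3))
alone = bn.forward(sample, training=False)
in_batch = bn.forward(np.concatenate([sample, others]), training=False)[:1]
```

Here `alone` and `in_batch` are equal, which is the determinism property described above.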

Behaviour in training and inference

A subtlety of BatchNorm is that it introduces a behavioural difference between training and inference. During training, normalisation depends on the composition of the batch and acts as a stochastic regulariser. During inference, the layer applies the fixed running statistics and learned parameters, so it reduces to a deterministic per-channel affine transform. This duality can cause subtle bugs when batches are unusually small, when distributed training synchronises statistics incorrectly, or when fine-tuning on a small downstream dataset.
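The batch dependence in training mode is easy to reproduce. In the hedged sketch below (the helper `normalise_with_batch_stats` and all values are made up for illustration, with gamma fixed at 1 and beta at 0), the same example is normalised alongside two different sets of batch mates, and then on its own as a batch of one.

```python
import numpy as np

def normalise_with_batch_stats(batch, eps=1e-5):
    """Training-mode normalisation (gamma=1, beta=0) over (N, C) activations."""
    return (batch - batch.mean(axis=0)) / np.sqrt(batch.var(axis=0) + eps)

sample = np.array([[1.0, 2.0]])
mates_a = np.array([[0.0, 0.0], [2.0, 4.0], [3.0, 1.0]])
mates_b = np.array([[10.0, -5.0], [8.0, 7.0], [12.0, 3.0]])

# Same example, different batch mates: the normalised output differs.
out_a = normalise_with_batch_stats(np.concatenate([sample, mates_a]))[0]
out_b = normalise_with_batch_stats(np.concatenate([sample, mates_b]))[0]

# Degenerate case: with a batch of one, the mean equals the example and the
# variance is zero, so every activation is normalised to zero.
single = normalise_with_batch_stats(sample)
```

The batch-of-one case shows why running a BatchNorm layer in training mode on very small batches can erase the signal entirely.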

Variants

Several normalisation strategies have emerged for settings where BatchNorm performs poorly.

Layer normalisation (Ba, Kiros and Hinton, 2016) normalises across the feature dimension within each example, removing dependence on batch size. It is the standard choice in transformers and recurrent networks.

Group normalisation (Wu and He, 2018) splits channels into groups and normalises within each group, and it performs well at the small batch sizes typical of object detection and segmentation.

Instance normalisation (Ulyanov et al., 2017) normalises each channel of each example separately and is popular in style-transfer networks.

Weight normalisation and spectral normalisation constrain the weights rather than the activations, with applications in generative adversarial networks.

A comparison of the most common variants is shown below.

| Variant | Normalises across | Depends on batch size | Common use |
|---|---|---|---|
| Batch norm | Batch + spatial | Yes | CNNs |
| Layer norm | Features within example | No | Transformers, RNNs |
| Group norm | Feature groups within example | No | Detection, segmentation |
| Instance norm | Per channel within example | No | Style transfer |
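For `(N, C, H, W)` activations, the variants in the table differ mainly in which axes the mean and variance are taken over. The sketch below makes that concrete under simplifying assumptions: the learnable scale and shift are omitted, the function names are illustrative, and layer norm is taken over all of `(C, H, W)`, whereas transformer implementations normally normalise over the last feature dimension only.

```python
import numpy as np

def normalise(x, axes, eps=1e-5):
    """Zero-mean, unit-variance normalisation over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x):
    return normalise(x, axes=(0, 2, 3))   # per channel, over batch + spatial

def layer_norm(x):
    return normalise(x, axes=(1, 2, 3))   # per example, over all features

def instance_norm(x):
    return normalise(x, axes=(2, 3))      # per example and per channel

def group_norm(x, num_groups):
    n, c, h, w = x.shape                  # c must be divisible by num_groups
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalise(grouped, axes=(2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 3, 3))
x_other = x.copy()
x_other[1:] = rng.normal(loc=10.0, size=(3, 6, 3, 3))  # change batch mates only

# Only batch norm changes the first example's output when its mates change.
bn_changed = not np.allclose(batch_norm(x)[0], batch_norm(x_other)[0])
ln_same = np.allclose(layer_norm(x)[0], layer_norm(x_other)[0])
gn_same = np.allclose(group_norm(x, 2)[0], group_norm(x_other, 2)[0])
in_same = np.allclose(instance_norm(x)[0], instance_norm(x_other)[0])
```

All four flags evaluate to `True`, matching the "Depends on batch size" column.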

When BatchNorm helps and when it does not

BatchNorm is most effective with moderate-to-large batch sizes (typically 16 or more per device) and image-like inputs. It is less suitable for sequence models, small-batch training, online learning, or scenarios where the batch composition changes adversarially. Modern architectures such as transformer-based language models, vision transformers, and diffusion models therefore tend to favour layer or group normalisation.
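The sensitivity to batch size comes from how noisy the batch statistics are. As a rough illustration on synthetic unit-variance data (the helper `batch_mean_spread` and the chosen batch sizes are assumptions for the example), the spread of the batch mean shrinks roughly as one over the square root of the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_spread(batch_size, trials=2_000):
    """Standard deviation of the batch mean across random mini-batches."""
    batches = rng.choice(population, size=(trials, batch_size))
    return batches.mean(axis=1).std()

# Smaller batches give noisier estimates of the activation mean.
spread = {n: batch_mean_spread(n) for n in (2, 16, 128)}
```

At a batch size of 2, the estimated mean fluctuates by a large fraction of the activations' own standard deviation, which is one reason batch-independent variants are preferred for small-batch training.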

Regularising effect

Because batch statistics inject noise into each forward pass, BatchNorm provides a regularising effect comparable in some settings to dropout. Many production CNNs use both; others rely on BatchNorm alone. The interaction is empirical and depends on architecture, dataset, and training regime.

Malaysian Context — BatchNorm in Malaysian deep learning practice

Batch normalisation appears in nearly every production computer-vision system deployed in Malaysia, since the dominant architectures — ResNet, EfficientNet, YOLO families, and U-Net variants — all rely on it. Malaysian system integrators including ViTrox, Pentamaster, Greatech, and Aemulus ship vision-QC equipment whose convolutional backbones depend on BatchNorm for stable training across the manufacturing data captured at semiconductor and E&E sites in Penang, Kulim, and Shah Alam.

Academic groups at Universiti Sains Malaysia (USM), Universiti Putra Malaysia (UPM), Universiti Teknologi Malaysia (UTM), Universiti Malaya (UM), and Multimedia University (MMU) include BatchNorm and its variants in standard postgraduate deep-learning curricula. HRD Corp-funded courses on computer vision and deep learning, often delivered through Microsoft, NVIDIA, AWS, and local providers, treat normalisation layers as required foundational material.

Edge deployments — for example, vision systems on Jetson devices used in palm-oil grading or wafer inspection — pay particular attention to BatchNorm's inference-time behaviour, since improperly frozen running statistics are a common source of accuracy regressions when models are converted to TensorRT or ONNX. Local MLOps teams at Petronas Digital, Maybank, CIMB, AirAsia, and Grab Malaysia include normalisation-layer checks in their model-validation pipelines.
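A typical normalisation-layer check exploits the fact that, at inference, BatchNorm is a fixed affine transform and can be folded into the preceding layer. Inference toolchains commonly perform this fusion, so comparing folded and unfolded outputs is a cheap way to catch wrong running statistics. The sketch below is a hypothetical check for a linear layer, not taken from any of the pipelines named above; the function name and all values are illustrative.

```python
import numpy as np

def fold_batch_norm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold inference-mode batch norm into a linear layer y = x @ W.T + b."""
    scale = gamma / np.sqrt(running_var + eps)   # shape (out_features,)
    folded_weight = weight * scale[:, None]      # scale each output row
    folded_bias = (bias - running_mean) * scale + beta
    return folded_weight, folded_bias

rng = np.random.default_rng(0)
weight = rng.normal(size=(5, 3))                 # (out_features, in_features)
bias = rng.normal(size=5)
gamma = rng.normal(size=5)
beta = rng.normal(size=5)
running_mean = rng.normal(size=5)
running_var = rng.uniform(0.5, 2.0, size=5)
x = rng.normal(size=(4, 3))

# Reference path: linear layer followed by inference-mode batch norm.
z = x @ weight.T + bias
reference = gamma * (z - running_mean) / np.sqrt(running_var + 1e-5) + beta

# Folded path: a single linear layer with adjusted weights and bias.
folded_weight, folded_bias = fold_batch_norm(
    weight, bias, gamma, beta, running_mean, running_var
)
folded = x @ folded_weight.T + folded_bias
max_abs_error = np.abs(reference - folded).max()
```

If the exported model used stale or re-estimated statistics instead of the trained running averages, the two paths would disagree and `max_abs_error` would be far above floating-point noise.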

References

Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML.
Santurkar, S. et al. (2018). How Does Batch Normalization Help Optimization? NeurIPS.
Ba, J., Kiros, J. and Hinton, G. (2016). Layer Normalization. arXiv:1607.06450.
Wu, Y. and He, K. (2018). Group Normalization. ECCV.
Ulyanov, D. et al. (2017). Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022.

Tags: batch-normalisation, regularisation, training, optimisation

Type	Neural network normalisation layer
Introduced	2015, Ioffe and Szegedy (Google)
Computed over	Mini-batch statistics during training
Common siblings	Layer norm, group norm, instance norm
Related	Dropout, regularisation, gradient descent