Model Pruning

A model compression technique that removes redundant or low-importance parameters from a neural network to reduce size, memory footprint, and inference latency while preserving accuracy.

6 min read | Last updated June 2026 | Infrastructure

Model pruning is a family of techniques for compressing neural networks by removing parameters, neurons, attention heads, or other model components deemed redundant for the target task. Pruning reduces the storage footprint, memory bandwidth requirements, and inference latency of deep learning models, typically at the cost of additional engineering effort and a small accuracy degradation. Pruning sits alongside quantisation, knowledge distillation, and low-rank factorisation as one of the principal techniques used to deploy large models on edge devices, mobile hardware, and cost-constrained inference infrastructure.

Motivation

Deep neural networks, including modern transformer-based foundation models, are typically over-parameterised relative to the demands of any given downstream task. The lottery ticket hypothesis (Frankle and Carbin, 2019) and a long line of compression research (for example Han et al., 2015) have shown that a substantial fraction of weights in trained networks can be removed with negligible impact on accuracy, provided the surviving subnetwork is fine-tuned appropriately. Pruning is motivated by deployment constraints such as on-device inference, limited mobile GPU memory, battery life, and latency budgets, as well as by the desire to reduce cloud inference cost and energy consumption at scale.

Taxonomy

Pruning techniques are most commonly classified along two axes: granularity and criterion.

By granularity, pruning is either unstructured or structured. Unstructured pruning sets individual weights to zero, producing sparse weight matrices. Such sparsity can yield strong theoretical compression but is difficult to translate into wall-clock speedups on dense matrix hardware without specialised kernels or sparse accelerators. Structured pruning removes entire neurons, channels, filters, attention heads, or transformer layers, producing a smaller dense model that maps directly onto standard hardware without specialised support. Semi-structured N:M patterns, in which at most N of every M consecutive weights are non-zero, sit between these extremes. The 2:4 pattern is accelerated in hardware by NVIDIA Ampere and later GPUs, which makes semi-structured sparsity increasingly important for hardware-friendly pruning.
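The three granularities can be illustrated on a tiny weight matrix. The following is a minimal pure-Python sketch, not a production kernel. The function names (`prune_unstructured`, `prune_rows`, `prune_n_m`) and the example matrix are hypothetical, and weight magnitude is assumed as the importance criterion in all three cases.

```python
from typing import List

Matrix = List[List[float]]


def prune_unstructured(w: Matrix, sparsity: float) -> Matrix:
    """Zero the smallest-magnitude individual weights; the shape is unchanged."""
    ranked = sorted((abs(v), i, j) for i, row in enumerate(w) for j, v in enumerate(row))
    out = [row[:] for row in w]
    for _, i, j in ranked[: int(len(ranked) * sparsity)]:
        out[i][j] = 0.0
    return out


def prune_rows(w: Matrix, keep: int) -> Matrix:
    """Structured pruning: drop whole rows (output neurons) with the smallest L2 norm."""
    by_norm = sorted(range(len(w)), key=lambda i: sum(v * v for v in w[i]), reverse=True)
    kept = sorted(by_norm[:keep])  # preserve the original row order
    return [w[i][:] for i in kept]


def prune_n_m(w: Matrix, n: int = 2, m: int = 4) -> Matrix:
    """Semi-structured pruning: keep the n largest magnitudes in every group of m."""
    out = []
    for row in w:
        if len(row) % m != 0:
            raise ValueError("row length must be a multiple of m")
        new_row = row[:]
        for start in range(0, len(row), m):
            group = range(start, start + m)
            # Zero the (m - n) smallest-magnitude weights in this group.
            for j in sorted(group, key=lambda j: abs(row[j]))[: m - n]:
                new_row[j] = 0.0
        out.append(new_row)
    return out


# Hypothetical 3x4 weight matrix (rows are output neurons).
w = [
    [0.9, -0.1, 0.05, -0.7],
    [0.02, 0.03, -0.01, 0.04],
    [-0.6, 0.5, 0.2, -0.3],
]

unstructured = prune_unstructured(w, sparsity=0.5)  # same shape, 6 of 12 weights zeroed
structured = prune_rows(w, keep=2)                  # smaller dense 2x4 matrix
two_four = prune_n_m(w, n=2, m=4)                   # 2 non-zeros in every group of 4
```

Only `prune_rows` returns a smaller matrix that a dense kernel can run faster without further support. The other two keep the original shape and rely on a sparse-aware runtime to turn the zeros into speedups.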

By criterion, pruning methods include magnitude pruning, which removes the smallest-magnitude weights; gradient or movement pruning, which removes weights whose values move toward zero during fine-tuning; second-order methods such as Optimal Brain Surgeon, which use Hessian information to identify low-impact weights; and activation-based methods that target neurons or channels with low average activation. More recent one-shot methods prune large language models with little or no retraining. SparseGPT (Frantar and Alistarh, 2023) uses approximate second-order information to update the remaining weights layer by layer, while Wanda (Sun et al., 2024) scores each weight by its magnitude multiplied by the norm of the corresponding input activation.
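The difference between a pure magnitude criterion and an activation-aware criterion can be sketched in a few lines. The example below assumes the Wanda-style score |W_ij| * ||X_j||_2, compared within each output row. The helper names and the calibration inputs `x` are hypothetical, and this is an illustration of the scoring idea rather than a reimplementation of any published codebase.

```python
import math
from typing import List

Matrix = List[List[float]]


def magnitude_scores(w: Matrix) -> Matrix:
    """Importance = absolute value of each weight."""
    return [[abs(v) for v in row] for row in w]


def wanda_scores(w: Matrix, x: Matrix) -> Matrix:
    """Importance = |W_ij| times the L2 norm of input feature j over calibration samples."""
    norms = [math.sqrt(sum(sample[j] ** 2 for sample in x)) for j in range(len(w[0]))]
    return [[abs(v) * norms[j] for j, v in enumerate(row)] for row in w]


def prune_per_row(w: Matrix, scores: Matrix, sparsity: float) -> Matrix:
    """Zero the lowest-scoring fraction of weights within each output row."""
    out = []
    for row, score_row in zip(w, scores):
        k = int(len(row) * sparsity)
        drop = sorted(range(len(row)), key=lambda j: score_row[j])[:k]
        out.append([0.0 if j in drop else v for j, v in enumerate(row)])
    return out


# Hypothetical layer with one output neuron and four input features.
w = [[0.5, -0.4, 0.1, 0.3]]
# Hypothetical calibration inputs: feature 2 has large activations, features 0 and 3 small.
x = [
    [0.1, 2.0, 5.0, 0.1],
    [0.1, -2.0, 5.0, -0.1],
]

by_magnitude = prune_per_row(w, magnitude_scores(w), sparsity=0.5)
by_wanda = prune_per_row(w, wanda_scores(w, x), sparsity=0.5)
```

With these inputs, magnitude pruning removes the small weight on feature 2, whereas the activation-aware score keeps it because that feature carries large activations and instead removes the larger weights attached to nearly silent features.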

Workflows

The classical pruning workflow is train, prune, fine-tune. A model is trained to convergence, weights are pruned by a chosen criterion to a target sparsity, and the surviving model is fine-tuned to recover accuracy. Iterative magnitude pruning repeats this loop, gradually increasing sparsity. Dynamic sparse training methods such as RigL prune and regrow connections during training, maintaining a fixed sparsity budget throughout. For very large language models, one-shot pruning followed by light fine-tuning has become a common pattern because end-to-end retraining is prohibitively expensive.
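The train, prune, fine-tune loop can be sketched on a toy linear regression problem. This is a minimal pure-Python illustration under stated assumptions: the data, the sparse ground-truth weights, the sparsity schedule (25 percent, then 50 percent), and all function names are hypothetical, and plain gradient descent stands in for whatever optimiser a real pipeline would use.

```python
import random
from typing import List


def predict(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, xs, ys) -> float:
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, mask, xs, ys, steps: int = 500, lr: float = 0.05) -> List[float]:
    """Masked gradient descent: weights whose mask is 0 stay pruned at zero."""
    w = w[:]
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [(wj - lr * gj) * mj for wj, gj, mj in zip(w, grad, mask)]
    return w


def magnitude_mask(w: List[float], sparsity: float) -> List[float]:
    """Mask that drops the smallest-magnitude fraction of weights."""
    k = int(len(w) * sparsity)
    drop = set(sorted(range(len(w)), key=lambda j: abs(w[j]))[:k])
    return [0.0 if j in drop else 1.0 for j in range(len(w))]


rng = random.Random(0)
true_w = [2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # hypothetical sparse ground truth
xs = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(64)]
ys = [predict(true_w, x) for x in xs]

# 1. Train the dense model.
w = train([0.0] * len(true_w), [1.0] * len(true_w), xs, ys)

history = []
for sparsity in (0.25, 0.5):            # iterative schedule: raise sparsity gradually
    mask = magnitude_mask(w, sparsity)  # 2. prune by magnitude
    w = [wj * mj for wj, mj in zip(w, mask)]
    loss_after_prune = mse(w, xs, ys)
    w = train(w, mask, xs, ys)          # 3. fine-tune the surviving weights
    history.append((sparsity, loss_after_prune, mse(w, xs, ys)))
```

Because already-pruned weights have zero magnitude, each later round keeps them pruned and only adds to the mask, which is what makes the schedule cumulative.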

Effects on accuracy and efficiency

The accuracy-efficiency trade-off of pruning depends on the model family, task, sparsity level, and pruning criterion. Heavily over-parameterised convolutional networks for image classification can often be pruned to around 90 percent unstructured sparsity with minimal accuracy loss, although compact architectures tolerate less. Transformer-based language models tolerate moderate unstructured sparsity (50 to 60 percent) well, with steeper accuracy degradation at higher levels. Structured pruning of attention heads and transformer layers is constrained by the network's depth and the redundancy of individual heads. When combined with quantisation and knowledge distillation, pruning often produces models that are several times smaller and faster than the original with limited end-task degradation.
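A back-of-envelope calculation shows how sparsity and quantisation compound. The sketch below counts weight storage only and ignores quantisation scales, activations, and runtime buffers. The 7-billion-parameter model size is hypothetical, and the 2 bits of position metadata per surviving weight is an assumption (the minimum needed to address one of four positions in a 2:4 group), not a statement about any specific storage format.

```python
def stored_bits(n_params: int, bits_per_weight: int,
                sparsity: float = 0.0, index_bits: int = 0) -> float:
    """Bits for the surviving weights plus per-weight position metadata."""
    kept = n_params * (1.0 - sparsity)
    return kept * (bits_per_weight + index_bits)


def compression_ratio(n_params: int, dense_bits: int, bits_per_weight: int,
                      sparsity: float, index_bits: int) -> float:
    """Size of the dense baseline divided by the size of the compressed model."""
    dense = stored_bits(n_params, dense_bits)
    compressed = stored_bits(n_params, bits_per_weight, sparsity, index_bits)
    return dense / compressed


N = 7_000_000_000  # hypothetical 7B-parameter model, dense 16-bit baseline

ratios = {
    "2:4 sparse, 16-bit": compression_ratio(N, 16, 16, 0.5, 2),
    "2:4 sparse, 8-bit": compression_ratio(N, 16, 8, 0.5, 2),
    "2:4 sparse, 4-bit": compression_ratio(N, 16, 4, 0.5, 2),
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller than dense 16-bit")
```

Under these assumptions, 50 percent sparsity alone gives less than a 2x reduction because of the index overhead, and the larger ratios only appear once pruning is stacked with lower-precision weights.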

Hardware and software support

Models pruned with unstructured or semi-structured sparsity require execution stacks that can exploit that sparsity. Hardware acceleration of sparse linear algebra has improved substantially: NVIDIA's Sparse Tensor Cores support 2:4 sparsity, and several inference frameworks expose APIs for structured and semi-structured sparse execution. Software libraries supporting pruning include PyTorch's torch.nn.utils.prune module, the SparseML toolkit from Neural Magic, NVIDIA's TensorRT and Apex libraries, the Hugging Face Optimum suite, and the DeepSparse runtime. Open-source projects have published pruned variants of open-weight models such as Llama, Mistral, Qwen, and Falcon, often combined with quantisation to four or eight bits.

Limitations

Pruning is not a universal optimisation. Unstructured sparsity often does not translate into wall-clock speedups on commodity GPUs without specialised kernels. Aggressive pruning can disproportionately harm minority-class accuracy and degrade behaviour on out-of-distribution inputs, raising fairness and reliability concerns. Pruning interacts with other optimisations: combined pruning and quantisation pipelines require careful sequencing, and the accuracy budget for pruning must be allocated against quantisation and distillation. For language models, evaluation on a narrow benchmark may not reflect downstream task performance, so pruned models should be evaluated on the actual deployment workload before release.
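One practical way to catch the minority-class problem is to compare per-class accuracy before and after pruning instead of relying on a single aggregate number. The sketch below is illustrative only: the labels and predictions are hypothetical stand-ins for the outputs of a dense and a pruned model on the same evaluation set.

```python
from collections import defaultdict
from typing import Dict, List


def per_class_accuracy(labels: List[str], preds: List[str]) -> Dict[str, float]:
    """Accuracy computed separately for each ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def accuracy_drop_by_class(labels, dense_preds, pruned_preds) -> Dict[str, float]:
    """Per-class accuracy of the dense model minus that of the pruned model."""
    dense = per_class_accuracy(labels, dense_preds)
    pruned = per_class_accuracy(labels, pruned_preds)
    return {c: dense[c] - pruned[c] for c in dense}


def overall_accuracy(labels, preds) -> float:
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


# Hypothetical evaluation set: 8 majority-class and 2 minority-class samples.
labels = ["ok"] * 8 + ["defect"] * 2
dense_preds = ["ok"] * 8 + ["defect", "defect"]  # dense model: all correct
pruned_preds = ["ok"] * 8 + ["defect", "ok"]     # pruned model: misses one defect

drops = accuracy_drop_by_class(labels, dense_preds, pruned_preds)
overall_drop = overall_accuracy(labels, dense_preds) - overall_accuracy(labels, pruned_preds)
```

In this toy case the aggregate accuracy falls by ten percentage points while the minority class loses half of its accuracy, which is the kind of gap an aggregate benchmark score would hide.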

Malaysian Context — Pruning, Edge AI, and Sovereign Inference

Model pruning is directly relevant to Malaysia's edge AI, on-device intelligence, and cost-sensitive inference scenarios. Local applications in manufacturing — including electronics assembly in Penang's E&E corridor, contract manufacturing for global brands, and palm oil milling — rely on compact computer vision models running on edge hardware such as NVIDIA Jetson, Intel-based industrial PCs, and ARM-class single-board computers. Pruning, combined with quantisation, is a standard technique to bring foundation-scale vision or language models within edge memory and latency budgets in these deployments.

Malaysian banks and fintechs such as Maybank, CIMB, RHB, Public Bank, Touch 'n Go, GXBank, and AEON Bank operate large transaction inference workloads where pruned and quantised models reduce inference cost in real-time fraud detection, credit decisioning, and customer service automation. Telcos including TM, Maxis, CelcomDigi, and U Mobile use compressed models for on-device speech and network anomaly use cases. Petronas, Tenaga Nasional Berhad, and oil and gas service providers deploy pruned models for predictive maintenance and inspection on remote and offshore equipment with constrained connectivity and power budgets.

Government initiatives including the MyDIGITAL blueprint, the Malaysia AI Roadmap, the National AI Office, and MDEC's Digital Hub programme have emphasised sovereign AI capability and the development of locally trained, locally hosted models. Pruning techniques contribute to the feasibility of running Bahasa Melayu-tuned foundation models on Malaysian data-centre capacity, and HRD Corp-claimable AI training programmes increasingly cover model compression topics. The Personal Data Protection Act (PDPA) and the Bank Negara Malaysia Risk Management in Technology (RMiT) policy document do not address pruning specifically. In regulated sectors, model compression decisions instead fall under the wider model risk and explainability expectations applied to AI.

References

Han, S., Pool, J., Tran, J., and Dally, W. J. (2015). Learning both Weights and Connections for Efficient Neural Networks. NeurIPS.
Frankle, J., and Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR.
Frantar, E., and Alistarh, D. (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. ICML.
Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2024). A Simple and Effective Pruning Approach for Large Language Models (Wanda). ICLR.

Tags: model-pruning, model-compression, efficiency, inference-optimisation

Type: Model compression technique
Variants: Unstructured, structured, magnitude-based, movement-based
Key objectives: Smaller model size, lower latency, lower energy
Common workflow: Train → prune → fine-tune
Related: Quantisation, Knowledge Distillation, LoRA
