What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

Model Compression

Model compression is a set of techniques that reduce the size, memory footprint, and computational cost of machine learning models while preserving predictive accuracy, enabling deployment on resource-constrained hardware.

6 min read · Last updated June 2026 · Infrastructure

Model compression refers to a family of techniques that reduce the computational and storage demands of trained machine learning models, particularly deep neural networks, without substantially degrading their accuracy. As AI models have grown from millions to hundreds of billions of parameters, compression has become a critical discipline enabling deployment in production environments where memory, latency, energy consumption, or hardware cost impose practical constraints.

The need for model compression arises from a fundamental tension in modern AI: the most capable models are trained on large clusters of high-end GPUs, but they must ultimately run on servers with limited memory, mobile devices, embedded systems, or edge hardware lacking cloud connectivity. A language model that requires 80 gigabytes of GPU memory during training cannot be served efficiently on a single inference server, let alone on a smartphone. Compression bridges this gap.

Core Techniques

Quantisation

Quantisation reduces the numerical precision of model weights and activations. Standard training uses 32-bit floating-point (FP32) numbers; quantisation maps these to lower-precision representations such as 16-bit float (FP16), 8-bit integer (INT8), or even 4-bit integer (INT4). The reduction in precision yields proportional decreases in memory usage and, on hardware with native integer arithmetic support (such as NVIDIA Tensor Cores and Google TPUs), significant gains in throughput.
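As a rough illustration of the precision mapping described above, the following sketch applies symmetric per-tensor INT8 quantisation to a random FP32 weight matrix using NumPy. The function names, the matrix shape, and the choice of a single scale per tensor are assumptions made for this example; production toolchains usually offer per-channel scales and zero points as well.

```python
import numpy as np

def quantise_int8(w):
    """Symmetric per-tensor quantisation: map FP32 values onto [-127, 127]."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    """Recover an FP32 approximation of the original values."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one trained layer.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)

q_weights, scale = quantise_int8(weights)
restored = dequantise(q_weights, scale)

max_error = float(np.max(np.abs(weights - restored)))
compression_ratio = weights.nbytes / q_weights.nbytes  # 4 bytes -> 1 byte per weight
```

The rounding error per weight is bounded by half the scale, which is why tensors with a few large outliers (and therefore a large scale) tend to quantise less gracefully.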

Post-training quantisation (PTQ) applies quantisation after training is complete, requiring only a small calibration dataset. Quantisation-aware training (QAT) simulates quantisation noise during training, typically producing higher accuracy at very low bit widths. For large language models, post-training methods such as GPTQ and AWQ, together with the quantised formats distributed as GGUF files, have enabled 4-bit quantisation of billion-parameter models with minimal perplexity degradation.
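The role of the calibration dataset in PTQ can be sketched as follows. This is a minimal NumPy example, not the procedure of any particular toolkit: the layer `W`, the `calibration` batch, and the max-absolute-value range estimate are all assumptions chosen for illustration. Real pipelines often use percentile or entropy-based range estimates instead.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: one trained linear layer and a small calibration set.
W = rng.normal(0.0, 0.1, size=(64, 32)).astype(np.float32)
calibration = rng.normal(0.0, 1.0, size=(512, 64)).astype(np.float32)

def symmetric_scale(x):
    """Scale that maps the largest observed magnitude to 127."""
    return float(np.max(np.abs(x))) / 127.0

def quantise(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Weight range comes from the weights; activation range comes from calibration data.
w_scale = symmetric_scale(W)
x_scale = symmetric_scale(calibration)
W_q = quantise(W, w_scale)

def int8_linear(x):
    """Integer matmul with INT32 accumulation, rescaled back to FP32."""
    x_q = quantise(x, x_scale)
    acc = x_q.astype(np.int32) @ W_q.astype(np.int32)
    return acc.astype(np.float32) * (x_scale * w_scale)

x_new = rng.normal(0.0, 1.0, size=(8, 64)).astype(np.float32)
reference = x_new @ W
approx = int8_linear(x_new)
rel_error = float(np.linalg.norm(reference - approx) / np.linalg.norm(reference))
```

No gradient updates are involved: the calibration data is only used to pick the activation scale, which is what distinguishes PTQ from QAT.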

Pruning

Pruning removes redundant or low-importance parameters from a trained model. Unstructured pruning sets individual weights to zero based on a magnitude threshold, producing sparse weight matrices. Structured pruning removes entire channels, attention heads, or layers, which is more hardware-friendly as it produces dense smaller tensors rather than sparse ones.
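The difference between the two pruning styles can be seen in a short NumPy sketch. The helper names, the 50% sparsity target, and the use of row L2 norm as the importance score are assumptions for this example; other importance criteria are common in practice.

```python
import numpy as np

def unstructured_prune(w, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of individual weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy(), np.ones_like(w, dtype=bool)
    # The k-th smallest magnitude acts as the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask, mask

def structured_prune(w, keep):
    """Keep the `keep` rows (output channels) with the largest L2 norm."""
    norms = np.linalg.norm(w, axis=1)
    kept_rows = np.sort(np.argsort(norms)[-keep:])
    return w[kept_rows], kept_rows

rng = np.random.default_rng(2)
w = rng.normal(size=(8, 16)).astype(np.float32)

sparse_w, mask = unstructured_prune(w, sparsity=0.5)   # same shape, half the entries zero
small_w, kept_rows = structured_prune(w, keep=4)       # smaller dense matrix
```

The unstructured result keeps its original shape and needs sparse kernels to run faster, whereas the structured result is an ordinary dense matrix with half the rows.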

Iterative magnitude pruning, which alternates between pruning a fraction of weights and fine-tuning the remaining network, was demonstrated by Frankle and Carbin (2019) in the Lottery Ticket Hypothesis, which showed that large networks contain smaller subnetworks that can be trained to similar accuracy from their original initialisation.
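The prune-then-fine-tune loop can be sketched on a toy problem. The example below uses a linear regression model trained by gradient descent rather than a neural network, and the data, learning rate, and 30% per-round pruning fraction are all assumptions. It shows the fine-tuning variant only and does not rewind surviving weights to their initial values, as the Lottery Ticket experiments do.

```python
import numpy as np

rng = np.random.default_rng(3)
n_features = 40

# Synthetic data in which only the first five features matter.
true_w = np.zeros(n_features)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
X = rng.normal(size=(400, n_features))
y = X @ true_w + 0.01 * rng.normal(size=400)

def train(w, mask, steps=300, lr=0.05):
    """Gradient descent on mean squared error; pruned weights stay at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

mask = np.ones(n_features)
w = train(np.zeros(n_features), mask)

for _ in range(5):
    alive = np.flatnonzero(mask)
    n_prune = int(0.3 * alive.size)                      # prune 30% of surviving weights
    weakest = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
    mask[weakest] = 0.0
    w = train(w, mask)                                   # fine-tune what remains

remaining = int(mask.sum())
mse = float(np.mean((X @ w - y) ** 2))
```

Pruning gradually gives the remaining weights a chance to compensate after each round, which is generally gentler than removing the same fraction in one step.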

Knowledge Distillation

Knowledge distillation trains a smaller student model to mimic the output distribution of a larger teacher model, rather than training the student from scratch on hard labels. The teacher's soft probability outputs over all classes carry richer information than one-hot labels, enabling the student to generalise better. DistilBERT, a 40% smaller version of BERT with 97% of its language understanding capability, is a prominent example.
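A minimal sketch of a distillation loss in the style of Hinton et al. (2015) is shown below, written in NumPy. The temperature of 2.0, the mixing weight `alpha`, and the example logits are assumptions for illustration; the soft-target term is scaled by the squared temperature, as that paper suggests.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-target KL divergence and hard-label cross-entropy."""
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)
    p_student_t = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    soft = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student_t + eps)), axis=-1)
    # Ordinary cross-entropy against the one-hot labels.
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels] + eps)
    return float(np.mean(alpha * temperature ** 2 * soft + (1.0 - alpha) * hard))

teacher = np.array([[4.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
student = np.array([[2.0, 1.0, 0.5], [0.5, 2.0, 1.0]])
labels = np.array([0, 1])

loss = distillation_loss(student, teacher, labels)
loss_when_matching = distillation_loss(teacher, teacher, labels, alpha=1.0)
```

A higher temperature flattens the teacher distribution, so the student also learns from the relative probabilities the teacher assigns to incorrect classes.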

Distillation can be applied to intermediate layer representations (feature-based distillation) and attention patterns (attention transfer) in addition to output logits, enabling the student to capture the teacher's internal reasoning structure.

Low-Rank Decomposition

Many weight matrices in neural networks exhibit low intrinsic rank, meaning their information content can be approximated with far fewer parameters. Low-rank decomposition factorises a large m × n weight matrix W into the product of two smaller matrices of shapes m × r and r × n, reducing the parameter count from mn to r(m + n) when the rank r is small. The same idea underlies LoRA (Low-Rank Adaptation), which applies low-rank matrices to the weight updates rather than to the weights themselves and has become standard for efficient fine-tuning of large language models.
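One common way to obtain such a factorisation is a truncated singular value decomposition. The sketch below builds a synthetic matrix that is close to rank 8 and factorises it with NumPy; the matrix sizes, noise level, and chosen rank are assumptions for this example, and real weight matrices are rarely this cleanly low-rank.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, true_rank = 256, 128, 8

# Synthetic "weight matrix": a rank-8 product plus a little noise.
W = (rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))
     + 0.01 * rng.normal(size=(m, n)))

def low_rank_factorise(W, rank):
    """Return A (m x rank) and B (rank x n) such that A @ B approximates W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # fold the singular values into the left factor
    B = Vt[:rank]
    return A, B

A, B = low_rank_factorise(W, rank=8)

params_before = W.size             # 256 * 128 = 32768
params_after = A.size + B.size     # 256 * 8 + 8 * 128 = 3072
rel_error = float(np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

At inference time the layer computes `(x @ A) @ B` instead of `x @ W`, so both storage and multiply-accumulate operations shrink with the rank.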

A related technique, weight sharing, groups parameters into clusters and assigns a single shared value to each cluster, reducing the number of distinct values that must be stored. Tensor decomposition methods such as Tucker decomposition and CP decomposition extend low-rank factorisation to higher-dimensional weight tensors in convolutional networks.
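Weight sharing can be sketched as one-dimensional k-means clustering over the weight values, after which each weight is stored as a small index into a codebook. The cluster count of 16, the iteration count, and the function name below are assumptions for illustration; the indices are held in `uint8` for simplicity even though 16 clusters need only 4 bits each.

```python
import numpy as np

def share_weights(w, n_clusters=16, n_iters=20):
    """Cluster weight values with 1-D k-means; return index codes and a codebook."""
    flat = w.ravel()
    # Start with centroids spread evenly over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            members = flat[codes == c]
            if members.size:
                centroids[c] = members.mean()
    # Final assignment against the converged centroids.
    codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return codes.reshape(w.shape).astype(np.uint8), centroids

rng = np.random.default_rng(5)
w = rng.normal(size=(64, 64)).astype(np.float32)

codes, codebook = share_weights(w)
shared = codebook[codes]                      # reconstruct weights from the codebook
mean_abs_error = float(np.mean(np.abs(w - shared)))
```

Only the codebook and the per-weight indices need to be stored, so the saving grows as the number of clusters falls, at the cost of a coarser approximation.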

Comparison of Techniques

| Technique | Accuracy Impact | Inference Speedup | Memory Reduction | Hardware Dependency |
|---|---|---|---|---|
| INT8 quantisation | Low | 2-4x | ~4x | Needs INT8 accelerator |
| INT4 quantisation | Moderate | 3-6x | ~8x | Needs INT4 accelerator |
| Unstructured pruning | Low-moderate | Limited | Variable | None |
| Structured pruning | Moderate | 1.5-3x | 30-70% | None |
| Knowledge distillation | Low | Depends on student | Depends on student | None |
| Low-rank decomposition | Low | 1.2-2x | 20-50% | None |
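The memory-reduction figures for quantisation follow directly from the number of bits stored per weight, taking FP32 as the baseline. The short calculation below works this through for a hypothetical 7-billion-parameter model; the parameter count is an assumption, and the figures cover weights only, not activations, key-value caches, or quantisation metadata such as scales.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Storage needed for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7_000_000_000  # hypothetical model size

footprints = {
    name: weight_memory_gb(n_params, bits)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB

int8_reduction = footprints["FP32"] / footprints["INT8"]
int4_reduction = footprints["FP32"] / footprints["INT4"]
```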

Practical Applications

Model compression enables real-world deployments that would otherwise be infeasible. On-device speech recognition in smartphones relies on compressed acoustic models that must run within a memory budget of tens of megabytes. Industrial predictive maintenance sensors running on microcontrollers use TinyML models derived through aggressive quantisation and pruning. Cloud providers apply model compression to reduce per-inference GPU cost, improving the economics of serving large language models at scale.

Malaysian Context — Model Compression for Edge and Mobile Deployment

Malaysia's push towards Industry 4.0 and smart manufacturing under the MyDigital Blueprint has increased demand for model compression expertise, particularly for deploying AI at the edge in factories, plantations, and remote infrastructure sites where cloud connectivity is unreliable or latency requirements preclude round-trip inference.

Petronas has explored edge AI deployments at offshore platforms and remote pipeline infrastructure where compressed anomaly-detection and predictive-maintenance models run on industrial edge hardware. The models are typically derived from larger cloud-trained networks using post-training quantisation and structured pruning before deployment via frameworks such as ONNX Runtime or OpenVINO. Similarly, FGV Holdings and Sime Darby Plantation have piloted compressed computer vision models for leaf disease detection on tablets and drones operating in palm oil estates with limited connectivity.

In the consumer electronics and telecommunications space, Maxis and Celcom (now CelcomDigi) have deployed compressed natural language models for on-device voice assistants and spam detection in mobile applications, leveraging INT8-quantised models that fit within the memory constraints of mid-range Android devices. These deployments follow compression pipelines using tools such as TensorFlow Lite and ONNX.

Malaysia's university AI research programmes — particularly at Universiti Malaya, Universiti Teknologi Malaysia, and Universiti Sains Malaysia — have published research on compression techniques for medical imaging and remote sensing applications, both domains where model accuracy must be preserved despite aggressive compression. MDEC supports commercialisation of such research through its Malaysia Digital programme, which provides co-funding for AI product development.

HRD Corp (Human Resource Development Corporation) has approved training programmes in edge AI and model optimisation under several certified training providers, reflecting industry demand for engineers capable of deploying compressed models on embedded and mobile platforms.

References

Frankle, J. & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR 2019.
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Deep Learning Workshop.
Frantar, E. et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
Dantas, P. V. et al. (2024). A Comprehensive Review of Model Compression Techniques in Machine Learning. Applied Intelligence, Springer.
Lin, J. et al. (2020). MCUNet: Tiny Deep Learning on IoT Devices. NeurIPS 2020.

Tags: quantisation, pruning, knowledge distillation, edge AI, MLOps

Type	Model optimisation technique
Primary methods	Quantisation, pruning, knowledge distillation, low-rank decomposition
Goal	Smaller size, faster inference, lower memory usage
Target hardware	Edge devices, mobile phones, microcontrollers, IoT sensors
Related	Quantisation, Knowledge distillation, Edge AI, TinyML