Model Compression

Model compression is a set of techniques that reduce the size, memory footprint, and computational cost of machine learning models while preserving predictive accuracy, enabling deployment on resource-constrained hardware.

6 min read | Last updated June 2026 | Infrastructure

Model compression refers to a family of techniques that reduce the computational and storage demands of trained machine learning models, particularly deep neural networks, without substantially degrading their accuracy. As AI models have grown from millions to hundreds of billions of parameters, compression has become a critical discipline enabling deployment in production environments where memory, latency, energy consumption, or hardware cost impose practical constraints.

The need for model compression arises from a fundamental tension in modern AI: the most capable models are trained on large clusters of high-end GPUs, but they must ultimately run on servers with limited memory, mobile devices, embedded systems, or edge hardware lacking cloud connectivity. A language model that needs 80 gigabytes of GPU memory does not fit on an inference server with a smaller accelerator, let alone on a smartphone. Compression bridges this gap.

Core Techniques

Quantisation

Quantisation reduces the numerical precision of model weights and activations. Models are conventionally trained and stored in 32-bit floating-point (FP32) numbers; quantisation maps these to lower-precision representations such as 16-bit float (FP16), 8-bit integer (INT8), or even 4-bit integer (INT4). The reduction in precision yields proportional decreases in memory usage: an INT8 weight takes a quarter of the space of an FP32 weight. On hardware with native low-precision arithmetic support (such as NVIDIA Tensor Cores and Google TPUs), it also yields significant gains in throughput.
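The mapping from floating-point values to integers can be illustrated with a minimal sketch of symmetric per-tensor INT8 quantisation. The function names and the sample weights are illustrative assumptions, not part of any particular library; production toolkits typically add per-channel scales and more careful range selection.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point values."""
    return [v * scale for v in q]


# Hypothetical weights chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.5, -0.331]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # integer codes stored in place of the floats
print(max_error)  # rounding error, bounded by half the scale
```

Each weight is stored as one byte plus a single shared scale, and the rounding error per weight is at most half the scale.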

Post-training quantisation (PTQ) applies quantisation after training is complete, requiring only a small calibration dataset. Quantisation-aware training (QAT) simulates quantisation noise during training, typically producing higher accuracy at very low bit widths. For large language models, methods such as GPTQ and AWQ, along with the quantised weight types distributed in the GGUF file format, have enabled 4-bit quantisation of billion-parameter models with minimal perplexity degradation.
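The role of the calibration dataset in PTQ can be sketched as follows: a few batches of representative activations are used to estimate a value range, from which a scale and zero-point for asymmetric 8-bit quantisation are derived. This is a simplified min/max calibration under assumed names (`calibrate`, `quantize`) and made-up data; real toolchains often use percentile or entropy-based range estimates instead.

```python
def calibrate(calibration_batches, num_bits=8):
    """Derive an affine scale and zero-point from observed min/max values."""
    lo = min(min(batch) for batch in calibration_batches)
    hi = max(max(batch) for batch in calibration_batches)
    # The representable range must contain zero so that 0.0 maps to an exact code.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Quantise to unsigned codes, clamping values outside the calibrated range."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in x]


# Hypothetical activation batches standing in for a calibration dataset.
batches = [[-1.0, 0.5, 2.0], [0.0, 3.0, 1.5], [-0.5, 4.1, 2.2]]
scale, zero_point = calibrate(batches)
codes = quantize([-1.0, 0.0, 4.1, 10.0], scale, zero_point)
print(scale, zero_point, codes)
```

The last input, 10.0, lies outside the calibrated range and is clamped to the largest code, which is why the calibration data should be representative of what the model sees in deployment.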

Pruning

Pruning removes redundant or low-importance parameters from a trained model. Unstructured pruning sets individual weights to zero based on a magnitude threshold, producing sparse weight matrices. Structured pruning removes entire channels, attention heads, or layers, which is more hardware-friendly as it produces dense smaller tensors rather than sparse ones.
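The difference between the two pruning styles can be seen in a small sketch. It treats each row of a weight matrix as one output channel (an assumption made for illustration) and ranks channels by L2 norm, which is one common importance criterion among several.

```python
import math


def prune_unstructured(matrix, threshold):
    """Zero individual weights whose magnitude falls below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]


def prune_structured(matrix, keep):
    """Keep the `keep` rows with the largest L2 norm, preserving their order."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [matrix[i] for i in kept]


# Hypothetical 3x3 weight matrix; each row is treated as one output channel.
W = [
    [0.9, -0.02, 0.4],
    [0.01, 0.03, -0.02],
    [-0.7, 0.05, 0.6],
]

sparse = prune_unstructured(W, threshold=0.1)  # same shape, many zeros
dense_small = prune_structured(W, keep=2)      # smaller, fully dense matrix
print(sparse)
print(dense_small)
```

The unstructured result keeps its 3x3 shape and needs sparse kernels to run faster, whereas the structured result is an ordinary 2x3 matrix that any dense kernel can process.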

Iterative magnitude pruning alternates between pruning a fraction of the weights and retraining the remaining network. Frankle and Carbin (2019) used this procedure to support the Lottery Ticket Hypothesis, showing that large networks contain smaller subnetworks that can be trained to similar accuracy when reset to their original initialisation.
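The prune-then-retrain loop can be sketched as below. The retraining step is a placeholder that only applies the mask (`no_op_fine_tune` is a hypothetical stand-in), and the sketch does not reset surviving weights to their initial values as the lottery ticket experiments do; it shows only the control flow and the way sparsity accumulates across rounds.

```python
def prune_round(weights, mask, fraction):
    """Deactivate the smallest-magnitude `fraction` of the still-active weights."""
    active = [i for i, m in enumerate(mask) if m]
    n_prune = int(len(active) * fraction)
    to_prune = sorted(active, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = False
    return new_mask


def iterative_magnitude_pruning(weights, rounds, fraction, fine_tune):
    """Alternate pruning and fine-tuning for a fixed number of rounds."""
    mask = [True] * len(weights)
    for _ in range(rounds):
        mask = prune_round(weights, mask, fraction)
        weights = fine_tune(weights, mask)
    return weights, mask


def no_op_fine_tune(weights, mask):
    # Placeholder: real fine-tuning would retrain the surviving weights.
    return [w if m else 0.0 for w, m in zip(weights, mask)]


# Hypothetical flattened weight vector.
weights = [0.5, -0.1, 0.8, 0.05, -0.9, 0.3, -0.2, 0.7, 0.01, -0.6]
pruned, mask = iterative_magnitude_pruning(
    weights, rounds=2, fraction=0.2, fine_tune=no_op_fine_tune
)
print(pruned, sum(mask))
```

Because each round prunes a fraction of what remains, the surviving share after n rounds is roughly (1 - fraction) to the power n, subject to rounding on small tensors.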

Knowledge Distillation

Knowledge distillation trains a smaller student model to mimic the output distribution of a larger teacher model, rather than training the student from scratch on hard labels. The teacher's soft probability outputs over all classes carry richer information than one-hot labels, enabling the student to generalise better. DistilBERT, a 40% smaller version of BERT with 97% of its language understanding capability, is a prominent example.
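A minimal sketch of the soft-target loss follows, assuming the temperature-scaled formulation described by Hinton et al. (2015): both sets of logits are softened with a temperature T, and the student is penalised by the KL divergence from the teacher's distribution, scaled by T squared. The logits and function names are illustrative; in practice this term is usually combined with an ordinary cross-entropy loss on the hard labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures flatten the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a three-class problem.
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [4.0, 2.5, 0.5]

hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(student_logits, teacher_logits)
print(hard, soft, loss)
```

At temperature 1 nearly all probability mass sits on the top class; at temperature 4 the non-target classes receive visibly more mass, which is the extra signal the student learns from.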

Distillation can be applied to intermediate layer representations (feature-based distillation) and attention patterns (attention transfer) in addition to output logits, enabling the student to capture the teacher's internal reasoning structure.

Low-Rank Decomposition

Many weight matrices in neural networks exhibit low intrinsic rank, meaning their information content can be approximated with far fewer parameters. Low-rank decomposition factorises a large m x n weight matrix W into the product of an m x r matrix and an r x n matrix, reducing the parameter count from mn to r(m + n) when the rank r is small. A related idea underlies LoRA (Low-Rank Adaptation), which has become standard for efficient fine-tuning of large language models: rather than factorising W itself, LoRA keeps W frozen and learns a low-rank update to it.
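The parameter saving and the equivalence of the factored computation can be checked with a small sketch. The factors here are written out by hand so that the example needs no numerical library; in practice they would typically be obtained from a truncated singular value decomposition of a trained matrix, and the 4096 x 4096 layer size is an assumed example.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def param_count(rows, cols, rank=None):
    """Parameters in a dense matrix, or in its rank-r factorisation."""
    if rank is None:
        return rows * cols
    return rank * (rows + cols)


# A 4x4 matrix built from hand-written rank-2 factors (illustrative values).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # 4 x 2
B = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, -1.0]]       # 2 x 4
W = matmul(A, B)                                        # 4 x 4, rank at most 2

x = [[1.0, 2.0, 3.0, 4.0]]
y_full = matmul(x, W)                   # uses the full matrix
y_factored = matmul(matmul(x, A), B)    # uses only the two small factors

dense = param_count(4096, 4096)
factored = param_count(4096, 4096, rank=64)
print(y_full, y_factored, dense // factored)
```

Applying the two factors in sequence gives the same output as the full matrix, and for the assumed 4096 x 4096 layer at rank 64 the factored form stores 32 times fewer parameters.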

Weight Sharing and Tensor Decomposition

Weight sharing groups parameters into clusters and assigns a single shared value to each cluster, reducing the number of distinct values that must be stored. Tensor decomposition methods such as Tucker decomposition and CP decomposition extend low-rank factorisation to higher-dimensional weight tensors in convolutional networks.
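Weight sharing can be sketched with a one-dimensional k-means clustering: each weight is replaced by the index of its nearest centroid, and only the small codebook of centroids plus the indices are stored. The initial centroids, the weights, and the function names are assumptions for illustration.

```python
def nearest(value, centroids):
    """Index of the centroid closest to the given value."""
    return min(range(len(centroids)), key=lambda k: abs(value - centroids[k]))


def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on scalars; empty clusters keep their old centroid."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            clusters[nearest(v, centroids)].append(v)
        centroids = [
            sum(c) / len(c) if c else centroids[k] for k, c in enumerate(clusters)
        ]
    return centroids


# Hypothetical weights that fall into three natural groups.
weights = [-1.0, -0.9, -1.1, 0.0, 0.1, -0.1, 1.0, 0.9, 1.1]
codebook = kmeans_1d(weights, centroids=[-0.5, 0.0, 0.5])
indices = [nearest(w, codebook) for w in weights]
reconstructed = [codebook[i] for i in indices]
print(codebook, indices)
```

With three clusters each weight needs only a 2-bit index instead of a 32-bit float, at the cost of replacing every weight by its cluster mean.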

Comparison of Techniques

| Technique | Accuracy Impact | Inference Speedup | Memory Reduction | Hardware Dependency |
|---|---|---|---|---|
| INT8 quantisation | Low | 2-4x | ~4x | Needs INT8 accelerator |
| INT4 quantisation | Moderate | 3-6x | ~8x | Needs INT4 accelerator |
| Unstructured pruning | Low-moderate | Limited | Variable | None |
| Structured pruning | Moderate | 1.5-3x | 30-70% | None |
| Knowledge distillation | Low | Depends on student | Depends on student | None |
| Low-rank decomposition | Low | 1.2-2x | 20-50% | None |
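The memory-reduction figures for quantisation follow directly from the bit widths, taking FP32 as the baseline. The short calculation below uses a hypothetical 7-billion-parameter model and counts weight storage only; scales, zero-points, activations, and any runtime cache add overhead that is ignored here.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7_000_000_000  # hypothetical model size, for illustration only
formats = [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
footprint = {name: weight_memory_gb(num_params, bits) for name, bits in formats}

for name, gb in footprint.items():
    print(f"{name}: {gb:.1f} GB ({footprint['FP32'] / gb:.0f}x smaller than FP32)")
```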

Practical Applications

Model compression enables real-world deployments that would otherwise be infeasible. On-device speech recognition in smartphones relies on compressed acoustic models that must run within a memory budget of tens of megabytes. Industrial predictive maintenance sensors running on microcontrollers use TinyML models derived through aggressive quantisation and pruning. Cloud providers apply model compression to reduce per-inference GPU cost, improving the economics of serving large language models at scale.

References

  1. Frankle, J. & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR 2019.
  2. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS Deep Learning Workshop.
  3. Frantar, E. et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
  4. Dantas, P. V. et al. (2024). A Comprehensive Review of Model Compression Techniques in Machine Learning. Applied Intelligence, Springer.
  5. Lin, J. et al. (2020). MCUNet: Tiny Deep Learning on IoT Devices. NeurIPS 2020.