Model Pruning

A model compression technique that removes redundant or low-importance parameters from a neural network to reduce size, memory footprint, and inference latency while preserving accuracy.

6 min read | Last updated June 2026 | Infrastructure

Model pruning is a family of techniques for compressing neural networks by removing parameters, neurons, attention heads, or other model components deemed redundant for the target task. Pruning reduces the storage footprint, memory bandwidth requirements, and inference latency of deep learning models, typically at the cost of additional engineering effort and a small accuracy degradation. Pruning sits alongside quantisation, knowledge distillation, and low-rank factorisation as one of the principal techniques used to deploy large models on edge devices, mobile hardware, and cost-constrained inference infrastructure.

Motivation

Deep neural networks, including modern transformer-based foundation models, are typically over-parameterised relative to the demands of any given downstream task. The lottery ticket hypothesis and a long line of compression research have shown that a substantial fraction of the weights in a trained network can be removed with negligible impact on accuracy, provided the surviving subnetwork is retrained or fine-tuned appropriately. Pruning is motivated by deployment constraints (on-device inference, mobile GPU memory, battery life, and latency budgets) as well as by the desire to reduce cloud inference cost and energy consumption at scale.

Taxonomy

Pruning techniques are most commonly classified along two axes: granularity and criterion.

By granularity, pruning is either unstructured or structured. Unstructured pruning sets individual weights to zero, producing sparse weight matrices. Such sparsity can yield strong theoretical compression but is difficult to translate into wall-clock speedups on dense matrix hardware without specialised kernels or sparse accelerators. Structured pruning removes entire neurons, channels, filters, attention heads, or transformer layers, producing a smaller dense model that maps directly onto standard hardware without specialised support. Semi-structured patterns such as the N:M sparsity supported by NVIDIA Ampere and later GPUs (commonly 2:4 sparsity) sit between these extremes and are increasingly important for hardware-friendly pruning.
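
The three granularities can be illustrated with a minimal NumPy sketch. The function names (`unstructured_mask`, `prune_rows`, `n_m_mask`) are hypothetical helpers written for this illustration, and ranking by weight magnitude or row norm is only one possible criterion. The sketch shows which weights each pattern keeps, not how a sparse kernel would execute them.

```python
import numpy as np

def unstructured_mask(w, sparsity):
    """Unstructured: drop the `sparsity` fraction of weights with the smallest |w|."""
    k = int(round(sparsity * w.size))            # number of weights to drop
    mask = np.ones(w.size, dtype=bool)
    if k > 0:
        drop = np.argpartition(np.abs(w).ravel(), k - 1)[:k]
        mask[drop] = False
    return mask.reshape(w.shape)

def prune_rows(w, keep):
    """Structured: keep the `keep` rows (output neurons) with the largest L2 norm.

    The result is a smaller dense matrix. In a real network the matching
    input columns of the next layer would have to be removed as well.
    """
    norms = np.linalg.norm(w, axis=1)
    kept = np.sort(np.argsort(norms)[-keep:])    # preserve original row order
    return w[kept]

def n_m_mask(w, n=2, m=4):
    """Semi-structured N:M: keep the n largest |w| in every group of m along a row."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be a multiple of m"
    groups = np.abs(w).reshape(rows, cols // m, m)
    order = np.argsort(groups, axis=-1)          # ascending within each group
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, order[..., -n:], True, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))

sparse_w = w * unstructured_mask(w, 0.5)         # same shape, half the entries zero
small_w = prune_rows(w, keep=4)                  # shape (4, 16), fully dense
two_four_w = w * n_m_mask(w, n=2, m=4)           # exactly 2 non-zeros per group of 4
```

Only `prune_rows` changes the tensor shape, which is why structured pruning needs no special runtime support, whereas the two mask-based variants rely on kernels or hardware that can skip the zeros.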

By criterion, pruning methods include magnitude pruning, which removes the smallest-magnitude weights; gradient or movement pruning, which removes weights whose values move toward zero during fine-tuning; second-order methods such as Optimal Brain Surgeon, which use Hessian information to identify low-impact weights; and activation-based methods that target neurons or channels with low average activation. More recent one-shot methods such as SparseGPT and Wanda prune large language models with little or no retraining, using a small calibration set. SparseGPT solves a layer-wise reconstruction problem with approximate second-order information, while Wanda scores each weight by its magnitude multiplied by the norm of the corresponding input activation.
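
To make the difference between criteria concrete, the sketch below compares plain magnitude scores with a Wanda-style score that weights each magnitude by the L2 norm of its input feature over a calibration batch. It is a simplified illustration rather than the published implementation: the helper names, the tiny hand-written weight matrix, and the calibration activations are all assumptions made for this example.

```python
import numpy as np

def magnitude_scores(w):
    """Score each weight by its absolute value."""
    return np.abs(w)

def wanda_style_scores(w, x):
    """Score |w_ij| * ||x_j||_2, where x holds calibration inputs (samples, in_features)."""
    feature_norms = np.linalg.norm(x, axis=0)    # one norm per input feature
    return np.abs(w) * feature_norms             # broadcasts across output rows

def prune_per_row(w, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    k = int(w.shape[1] * sparsity)
    drop = np.argsort(scores, axis=1)[:, :k]
    pruned = w.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

# Hypothetical layer with one output neuron and four inputs.
w = np.array([[0.1, 0.5, -0.3, 0.2]])
# Hypothetical calibration batch: feature 0 is large, feature 1 is tiny.
x = np.array([[10.0, 0.1, 1.0, 1.0],
              [10.0, 0.1, 1.0, 1.0]])

by_magnitude = prune_per_row(w, magnitude_scores(w), 0.5)
by_activation = prune_per_row(w, wanda_style_scores(w, x), 0.5)
print(by_magnitude)
print(by_activation)
```

Here magnitude pruning discards the small weight on the high-activation input, while the activation-aware score keeps it and instead discards the larger weight attached to an input that is almost always near zero.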

Workflows

The classical pruning workflow is train, prune, fine-tune. A model is trained to convergence, weights are pruned by a chosen criterion to a target sparsity, and the surviving model is fine-tuned to recover accuracy. Iterative magnitude pruning repeats this loop, gradually increasing sparsity. Dynamic sparse training methods such as RigL prune and regrow connections during training, maintaining a fixed sparsity budget throughout. For very large language models, one-shot pruning followed by light fine-tuning has become a common pattern because end-to-end retraining is prohibitively expensive.
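
The train, prune, fine-tune loop can be sketched on a toy problem. The example below uses a linear regression model trained by gradient descent in place of a neural network, with synthetic data in which only four of twenty weights matter. The sparsity schedule, learning rate, and helper names are illustrative assumptions, not recommended settings.

```python
import numpy as np

def train(w, mask, x, y, steps=200, lr=0.05):
    """Gradient descent on mean squared error; masked weights are held at zero."""
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(x)
        w = (w - lr * grad) * mask
    return w

def magnitude_prune(w, mask, sparsity):
    """Return a mask that drops the `sparsity` fraction of weights with smallest |w|."""
    k = int(round(sparsity * w.size))
    # Weights pruned in earlier rounds get -inf so they are always dropped first.
    scores = np.where(mask, np.abs(w), -np.inf)
    new_mask = np.ones_like(mask)
    new_mask[np.argsort(scores)[:k]] = False
    return new_mask

# Synthetic data: only the first four weights carry signal.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 20))
true_w = np.zeros(20)
true_w[:4] = [3.0, -2.0, 1.5, 1.0]
y = x @ true_w + 0.01 * rng.normal(size=256)

mask = np.ones(20, dtype=bool)
w = train(np.zeros(20), mask, x, y)              # 1. train the dense model

for sparsity in (0.4, 0.6, 0.8):                 # 2. prune, 3. fine-tune, repeat
    mask = magnitude_prune(w, mask, sparsity)
    w = train(w, mask, x, y)
    mse = np.mean((x @ w - y) ** 2)
    print(f"sparsity={sparsity:.0%} nonzero={int(mask.sum())} mse={mse:.5f}")
```

Raising the sparsity in stages gives the surviving weights a chance to compensate after each round, which is the idea behind iterative magnitude pruning. A one-shot variant would call `magnitude_prune` once at the final target sparsity.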

Effects on accuracy and efficiency

The accuracy-efficiency trade-off of pruning depends on the model family, task, sparsity level, and pruning criterion. Over-parameterised convolutional networks for image classification, such as AlexNet and VGG-16, have been pruned to roughly 90 percent sparsity with minimal accuracy loss (Han et al., 2015), while more compact architectures tolerate less. Transformer-based language models tolerate moderate unstructured sparsity (50 to 60 percent) well, with steeper accuracy degradation at higher levels. Structured pruning of attention heads and transformer layers is constrained by the network's depth and the redundancy of individual heads. When combined with quantisation and knowledge distillation, pruning often produces models that are several times smaller and faster than the original with limited end-task degradation.
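
How much smaller a pruned model actually is depends on the storage format as well as on the sparsity level, because a sparse format must also record where the surviving weights are. The following back-of-the-envelope sketch assumes 16-bit weights and a compressed sparse row (CSR) layout with 32-bit indices. These format choices and the 4096 by 4096 layer size are assumptions for illustration only.

```python
def dense_bytes(rows, cols, value_bytes=2):
    """Bytes needed to store every weight of a dense matrix."""
    return rows * cols * value_bytes

def csr_bytes(rows, cols, sparsity, value_bytes=2, index_bytes=4):
    """Bytes for CSR: one value and one column index per non-zero, plus row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

rows = cols = 4096                               # hypothetical layer size
dense = dense_bytes(rows, cols)
for sparsity in (0.5, 0.9, 0.99):
    sparse = csr_bytes(rows, cols, sparsity)
    print(f"sparsity={sparsity:.0%} dense/CSR size ratio = {dense / sparse:.2f}")
```

Under these assumptions the index overhead makes CSR larger than the dense matrix at 50 percent sparsity, and the format only pays off once roughly two thirds of the weights are removed. More compact index encodings and structured patterns shift this break-even point.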

Hardware and software support

Pruned models require execution stacks that can exploit sparsity. Hardware acceleration of sparse linear algebra has improved substantially: NVIDIA's Sparse Tensor Cores support 2:4 sparsity, and several inference frameworks expose APIs for structured and semi-structured sparse execution. Software libraries supporting pruning include PyTorch's torch.nn.utils.prune module, the SparseML toolkit from Neural Magic, NVIDIA's TensorRT and Apex libraries, the Hugging Face Optimum suite, and the DeepSparse runtime. Open-source frontier-model pruning projects have published pruned variants of Llama, Mistral, Qwen, and Falcon, often combined with quantisation to four or eight bits.
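
As a brief illustration of one of these libraries, the sketch below applies PyTorch's torch.nn.utils.prune utilities to two small linear layers. The layer sizes and pruning amounts are arbitrary choices for the example, and it assumes a reasonably recent PyTorch installation.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Unstructured: mask the 50% of weights with the smallest absolute value.
layer = nn.Linear(16, 8)
prune.l1_unstructured(layer, name="weight", amount=0.5)
# The module now holds `weight_orig` and `weight_mask`; `weight` is their product.
sparsity = (layer.weight == 0).float().mean().item()

# Make the pruning permanent: drops the mask and keeps the zeroed tensor as `weight`.
prune.remove(layer, "weight")

# Structured: mask the 25% of output rows (dim=0) with the smallest L2 norm.
layer2 = nn.Linear(16, 8)
prune.ln_structured(layer2, name="weight", amount=0.25, n=2, dim=0)
zero_rows = int((layer2.weight.abs().sum(dim=1) == 0).sum().item())

print(f"unstructured sparsity: {sparsity:.2f}, zeroed rows: {zero_rows}")
```

These utilities apply masks and leave tensor shapes unchanged, so on their own they do not make a model smaller or faster. Realising the gains requires either physically removing the zeroed rows or exporting to a runtime that exploits the sparsity.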

Limitations

Pruning is not a universal optimisation. Unstructured sparsity often does not translate into wall-clock speedups on commodity GPUs without specialised kernels. Aggressive pruning can disproportionately harm minority-class accuracy and degrade behaviour on out-of-distribution inputs, raising fairness and reliability concerns. Pruning interacts with other optimisations: combined pruning and quantisation pipelines require careful sequencing, and the accuracy budget for pruning must be allocated against quantisation and distillation. For language models, evaluation on a narrow benchmark may not reflect downstream task performance, so pruned models should be evaluated on the actual deployment workload before release.
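
One simple way to check for uneven degradation is to compare per-class accuracy before and after pruning rather than relying on a single aggregate number. The sketch below uses a tiny hand-made label set with an 8:2 class imbalance. The data and the `per_class_accuracy` helper are hypothetical and stand in for real predictions from a dense and a pruned model.

```python
import numpy as np

def per_class_accuracy(labels, preds, num_classes):
    """Fraction of correctly predicted examples for each class (NaN if class absent)."""
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        selected = labels == c
        if selected.any():
            acc[c] = np.mean(preds[selected] == c)
    return acc

# Hypothetical evaluation set: class 0 is the majority, class 1 the minority.
labels = np.array([0] * 8 + [1] * 2)
dense_preds = labels.copy()                      # dense model: all correct
pruned_preds = labels.copy()
pruned_preds[9] = 0                              # pruned model: one minority example wrong

overall_drop = np.mean(dense_preds == labels) - np.mean(pruned_preds == labels)
class_drop = (per_class_accuracy(labels, dense_preds, 2)
              - per_class_accuracy(labels, pruned_preds, 2))
print(f"overall drop: {overall_drop:.2f}, per-class drop: {class_drop}")
```

In this toy case a modest drop in overall accuracy hides a much larger drop on the minority class, which is the pattern the aggregate metric alone would miss.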

References

  1. Han, S., Pool, J., Tran, J., and Dally, W. J. (2015). Learning both Weights and Connections for Efficient Neural Networks. NeurIPS.
  2. Frankle, J., and Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR.
  3. Frantar, E., and Alistarh, D. (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. ICML.
  4. Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2024). A Simple and Effective Pruning Approach for Large Language Models. ICLR.