Quantisation

Quantisation is a model compression technique that reduces the numerical precision of a neural network's weights and activations from high-bit floating-point formats to lower-bit representations, decreasing memory usage and accelerating inference with minimal accuracy loss.

Last updated May 2026 · Infrastructure

Quantisation in the context of artificial intelligence is a model compression technique that reduces the numerical precision used to represent a neural network's parameters and intermediate activations. A standard neural network trained with 32-bit floating-point (FP32) weights can be quantised to 8-bit integers (INT8), 4-bit integers (INT4), or even lower bit-widths, yielding roughly proportional reductions in weight memory and, on hardware with low-precision support, higher inference throughput. As large language models have grown to tens and hundreds of billions of parameters, quantisation has become a practical necessity for deploying capable AI systems on resource-constrained hardware, from edge devices to single-GPU workstations, and for reducing the operating cost of cloud inference at scale.

Motivation

The computational and memory demands of state-of-the-art AI models have grown rapidly, by some estimates roughly tenfold every two years. A 70-billion-parameter language model stored in FP32 (4 bytes per parameter) requires roughly 280 gigabytes of memory for its weights alone, beyond the capacity of all but the most expensive multi-GPU configurations. At FP16 (half precision), it still requires approximately 140 GB. Quantising the same model to INT8 reduces this to approximately 70 GB, and INT4 reduces it to approximately 35 GB, bringing a 70B model within reach of a single high-end GPU, or even a consumer-grade GPU when combined with offloading strategies.
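
The arithmetic behind these figures is parameter count multiplied by bytes per parameter. A minimal Python sketch, assuming decimal gigabytes (10^9 bytes) and counting weights only (no activations, KV cache, or quantisation metadata such as scale factors); the function name `weight_memory_gb` is illustrative:

```python
# Bits used to store one parameter in each format mentioned above.
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params, bits_per_param):
    """Return the weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


NUM_PARAMS = 70_000_000_000  # the 70B model used as the running example

for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Real deployments need somewhat more than these figures, because the runtime also holds activations and, for language models, a key-value cache.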

Beyond memory, quantised operations are faster on hardware that supports low-precision arithmetic. Modern GPUs from NVIDIA (with INT8 and FP8 tensor cores) and custom AI accelerators such as Qualcomm's AI inference chips process lower-precision operations at substantially higher throughput, which translates into lower latency and higher requests-per-second capacity for inference services.

Post-Training Quantisation

Post-training quantisation (PTQ) is applied after a model has been trained in full precision. The most straightforward variant, uniform linear quantisation, maps the continuous range of floating-point values in each weight tensor to a discrete set of integer values, using a scale factor and zero point computed from the observed range of those weights. PTQ requires no additional training and at most a small calibration dataset (needed mainly to estimate activation ranges), making it rapid to apply.
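
The scale-and-zero-point mapping can be sketched in a few lines of plain Python. This is an illustrative per-tensor, unsigned 8-bit version, not the routine of any particular library; the function names and the sample weights are assumptions made for the example.

```python
def compute_qparams(values, num_bits=8):
    """Derive a scale and zero point from the observed min/max of a tensor."""
    qmax = 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero maps to an exact integer code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 for an all-zero tensor
    zero_point = round(-lo / scale)   # integer code that represents 0.0
    return scale, zero_point


def quantise(values, scale, zero_point, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantise(codes, scale, zero_point):
    """Map integer codes back to approximate floats."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]   # made-up example tensor
scale, zero_point = compute_qparams(weights)
codes = quantise(weights, scale, zero_point)
restored = dequantise(codes, scale, zero_point)
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error = {max_error:.5f}")
```

Each restored value differs from the original by about half a quantisation step at most, which is why INT8 with 256 levels is usually benign.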

Simple PTQ at INT8 typically degrades model accuracy by no more than about 1–2% on benchmark tasks. However, at INT4 and below, naive PTQ can cause significant quality degradation because some weight tensors contain outlier values with much larger magnitudes than the majority. A single outlier stretches the quantisation range, so the scale factor becomes too coarse to represent the many small values accurately.
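
The outlier problem is easy to reproduce. The sketch below applies symmetric 4-bit quantisation with one scale per tensor to a made-up set of small weights, first on their own and then with a single large value appended; all numbers are invented for illustration.

```python
def quantise_symmetric(values, num_bits=4):
    """Round values to a symmetric signed grid, then map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0   # one scale per tensor
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


small = [0.02, -0.05, 0.03, 0.04, -0.01, 0.06, -0.03]
with_outlier = small + [8.0]   # one weight far larger than the rest

restored_small = quantise_symmetric(small)
# Keep only the entries that correspond to the small weights.
restored_with_outlier = quantise_symmetric(with_outlier)[:len(small)]

err_without = mean_abs_error(small, restored_small)
err_with = mean_abs_error(small, restored_with_outlier)
print(f"error without outlier: {err_without:.4f}")
print(f"error with outlier:    {err_with:.4f}")
```

With the outlier present, the step size is larger than every small weight, so all of them collapse to zero.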

GPTQ

GPTQ (Generative Pre-trained Transformer Quantisation), introduced in 2022, addresses INT4 quantisation of large language models using a second-order optimisation procedure based on the Optimal Brain Compression framework. It quantises weights layer by layer, using approximate second-order (Hessian) information about each layer's output reconstruction error, estimated from a small calibration set, to adjust the not-yet-quantised weights so that they compensate for the error already introduced. GPTQ achieves INT4 quality close to FP16 baselines on most language modelling benchmarks and is widely used as a deployment format for open-weight models distributed through Hugging Face.
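
The error-compensation idea can be illustrated on a single row of two weights. The following is a toy sketch in the style of Optimal Brain Quantisation, not the GPTQ implementation: it omits the batching and numerical-stability refinements of the published method, assumes an integer grid with step 1.0, and uses an invented matrix `hessian` standing in for X Xᵀ computed from calibration inputs.

```python
def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def output_error(w, q, hessian):
    """Squared output error d^T H d for d = w - q, with H = X X^T."""
    d = [a - b for a, b in zip(w, q)]
    n = len(d)
    return sum(d[i] * hessian[i][j] * d[j] for i in range(n) for j in range(n))


def quantise_with_compensation(weights, hessian, step=1.0):
    """Quantise weights one by one, pushing each error onto the remaining ones."""
    w = list(weights)
    h_inv = invert(hessian)
    n = len(w)
    q = [0.0] * n
    for i in range(n):
        q[i] = round(w[i] / step) * step
        err = (w[i] - q[i]) / h_inv[i][i]
        for j in range(i + 1, n):          # adjust not-yet-quantised weights
            w[j] -= err * h_inv[j][i]
        col = [h_inv[r][i] for r in range(n)]
        diag = h_inv[i][i]
        for r in range(n):                 # remove weight i from the inverse Hessian
            for c in range(n):
                h_inv[r][c] -= col[r] * col[c] / diag
    return q


hessian = [[2.0, 1.8], [1.8, 2.0]]   # strongly correlated inputs (made up)
weights = [0.4, 0.3]

rtn = [float(round(v)) for v in weights]                 # plain round-to-nearest
compensated = quantise_with_compensation(weights, hessian)
print(rtn, output_error(weights, rtn, hessian))
print(compensated, output_error(weights, compensated, hessian))
```

In this example round-to-nearest sends both weights to 0, while the compensated version rounds the second weight up to 1 to make up for the first, giving a lower output error even though the individual weight moved further from its original value.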

AWQ

Activation-aware Weight Quantisation (AWQ) takes a different approach. It uses activation statistics to identify the small fraction of weight channels (typically around 1%) that matter most to the output, namely those multiplied by the largest activations, and protects them from large quantisation errors by scaling them before quantisation while aggressively compressing the rest. The AWQ authors report accuracy that matches or exceeds GPTQ at equivalent bit-widths, particularly at very low precision.
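
A rough sketch of the protection mechanism, under simplifying assumptions: one weight row, symmetric 4-bit quantisation with a single scale, and a per-channel scale chosen by hand (the published method searches for the scales using calibration data). The weight is multiplied by the channel scale before quantisation and the activation is divided by the same factor, so the exact product is unchanged. All values are invented.

```python
def quantise_symmetric(values, num_bits=4):
    """Round values to a symmetric signed grid, then map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.17, 0.8, -0.5, 0.3]
activations = [10.0, 1.0, 1.0, 1.0]     # channel 0 carries large activations
channel_scales = [2.0, 1.0, 1.0, 1.0]   # hand-picked: boost the salient channel

reference = dot(weights, activations)

# Baseline: quantise the weights directly.
plain = dot(quantise_symmetric(weights), activations)

# Activation-aware variant: scale weights up and activations down per channel.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [a / s for a, s in zip(activations, channel_scales)]
protected = dot(quantise_symmetric(scaled_weights), scaled_activations)

plain_error = abs(reference - plain)
protected_error = abs(reference - protected)
print(f"plain error:     {plain_error:.3f}")
print(f"protected error: {protected_error:.3f}")
```

The scaled weight lands closer to a grid point relative to its size, and because the tensor maximum is unchanged here, the other channels are quantised exactly as before.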

Quantisation-Aware Training

Quantisation-aware training (QAT) simulates quantisation during the training or fine-tuning process by inserting "fake quantise" operations into the computational graph. These operations round values to the target grid in the forward pass while the underlying weights stay in floating point. The model learns to be robust to the quantisation error introduced by lower precision, typically achieving higher accuracy than PTQ at the same bit-width, at the cost of additional training or fine-tuning compute and access to training data. QAT is most beneficial when the target precision is INT4 or lower, or when the task demands high accuracy on a narrow domain.
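
A toy illustration of a fake-quantise operation inside a training loop, for a model with a single weight. It assumes the commonly used straight-through estimator, which treats the rounding step as the identity when computing gradients; the source text does not specify this detail, and the target value, grid step, and learning rate are invented.

```python
def fake_quantise(value, step):
    """Quantise then immediately dequantise: the result is a float on the grid."""
    return round(value / step) * step


def train(target=0.37, step=0.25, lr=0.02, epochs=200):
    """Fit y = w * x while the forward pass only ever sees the quantised weight."""
    w = 0.0                          # latent full-precision weight
    inputs = [1.0, 2.0, 3.0]
    for _ in range(epochs):
        for x in inputs:
            pred = fake_quantise(w, step) * x
            # Straight-through estimator: gradient of fake_quantise taken as 1.
            grad = 2.0 * (pred - target * x) * x
            w -= lr * grad
    return w, fake_quantise(w, step)


latent, deployed = train()
print(f"latent weight = {latent:.3f}, deployed weight = {deployed}")
```

The latent weight settles near the boundary between the two grid points that bracket the target, and the deployed weight is one of those two points. In a real network the same mechanism lets the other weights adapt to the rounding error.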

Mixed-Precision and Format Diversity

Modern large model deployments often use mixed precision: a model may store weights in INT4 while performing computations in FP16 or BF16 (bfloat16), with conversion happening on the fly. NormalFloat4 (NF4), introduced with QLoRA, is a quantisation data type designed specifically for the approximately Gaussian distribution of neural network weights, achieving better representation quality than standard INT4 at the same bit-width. FP8 (8-bit floating point), supported natively by NVIDIA H100 GPUs and some other recent accelerators, offers a middle ground between the accuracy of FP16 and the efficiency of INT8.
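
The "store low, compute high" pattern can be sketched as follows. Weights are kept as signed 4-bit codes packed two per byte, together with one floating-point scale for the group, and are expanded back to floats only when a dot product is computed. This is an illustrative layout rather than the storage format of any specific library, and Python floats stand in for FP16 or BF16 arithmetic.

```python
def pack_int4(codes):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((lo & 0xF) | ((hi & 0xF) << 4)
                 for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit integers from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def quantise_group(weights):
    """Return (packed 4-bit codes, scale) for one group of weights."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return pack_int4(codes), scale


def dot_dequantised(packed, scale, activations):
    """Dequantise on the fly and accumulate in floating point."""
    codes = unpack_int4(packed, len(activations))
    return sum(code * scale * a for code, a in zip(codes, activations))


weights = [0.7, -0.32, 0.1, 0.0, -0.2]   # made-up group of weights
packed, scale = quantise_group(weights)
print(len(packed), "bytes for", len(weights), "weights")
print(dot_dequantised(packed, scale, [1.0] * len(weights)))
```

Only the packed bytes and the scale need to live in memory; the floating-point values exist briefly during the computation.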

Trade-offs and Limitations

Quantisation introduces a trade-off between model quality and efficiency. The accuracy impact depends on model size (larger models are generally more resilient to quantisation), task difficulty, and the chosen bit-width. For most conversational and general-purpose tasks, INT8 and often INT4 quantisation of modern large language models produces outputs that are hard to distinguish from FP16 in practice. Tasks that are sensitive to small errors, such as mathematical reasoning, code generation, or scientific computation, need more care, and INT8 is often preferred over INT4. Research continues into 2-bit and 1-bit quantisation schemes, with models such as BitNet exploring whether extreme quantisation can yield functional language models.

References

  1. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.
  2. Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv:2306.00978.
  3. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.
  4. NVIDIA. (2025). Optimizing LLMs for performance and accuracy with post-training quantization. NVIDIA Technical Blog.
  5. Sustainable LLM Inference for Edge AI. (2025). Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency. ACM Transactions on Internet of Things.
  6. Amazon Web Services. (2025). AI adoption surges 35 percent in Malaysia. AWS Research Report.