What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Quantisation

Quantisation is a model compression technique that reduces the numerical precision of a neural network's weights and activations from high-bit floating-point formats to lower-bit representations, decreasing memory usage and accelerating inference with minimal accuracy loss.

7 min read | Last updated May 2026 | Infrastructure

Quantisation in the context of artificial intelligence is a model compression technique that reduces the numerical precision used to represent a neural network's parameters and intermediate activations. A standard neural network trained with 32-bit floating-point (FP32) weights can be quantised to 8-bit integers (INT8), 4-bit integers (INT4), or even lower bit-widths, yielding proportional reductions in memory consumption and improvements in inference throughput. As large language models have grown to tens and hundreds of billions of parameters, quantisation has become a practical necessity for deploying capable AI systems on resource-constrained hardware — from edge devices to single-GPU workstations — and for reducing the operating cost of cloud inference at scale.

Motivation

The computational and memory demands of state-of-the-art AI models have grown rapidly, by some estimates roughly tenfold every two years. A 70-billion-parameter language model stored in FP32 (4 bytes per weight) requires roughly 280 gigabytes of memory for its weights alone, beyond the capacity of all but the most expensive multi-GPU configurations. At FP16 (half precision), it still requires approximately 140 GB. Quantising the same model to INT8 reduces this to approximately 70 GB, and INT4 reduces it to approximately 35 GB, bringing a 70B model within reach of a single high-end GPU, or even a consumer-grade GPU when combined with offloading strategies.
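The figures above follow from simple arithmetic: parameter count multiplied by bytes per weight. The short sketch below reproduces them. The function name `weight_memory_gb` is illustrative, and the calculation assumes decimal gigabytes and counts weights only, ignoring activations, the KV cache, and the small overhead of quantisation scale factors.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


PARAMS_70B = 70e9  # a 70-billion-parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")
```

In practice the real footprint is somewhat higher than these numbers, so they are best read as a lower bound when sizing hardware.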

Beyond memory, quantised operations are faster on hardware that supports low-precision arithmetic. Modern GPUs from NVIDIA (with INT8 and FP8 tensor cores) and custom AI accelerators such as Qualcomm's AI inference chips process lower-precision operations at substantially higher throughput, which translates into lower latency and higher requests-per-second capacity for inference services.

Post-Training Quantisation

Post-training quantisation (PTQ) is applied after a model has been trained in full precision. The most straightforward variant — uniform linear quantisation — maps the continuous range of floating-point values in each weight tensor to a discrete set of integer values, determined by computing a scale factor and zero point from the observed distribution of those weights. PTQ requires only a small calibration dataset and no additional training, making it rapid to apply.
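A minimal sketch of uniform linear quantisation with a scale factor and zero point may help make this concrete. It is written in plain Python for readability; the function names are illustrative, and the min/max range selection is one simple choice among several (production toolkits often use percentile or error-minimising ranges instead).

```python
def quantise_affine(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the observed range to include zero so 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)     # the integer that represents 0.0
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantise_affine(q, scale, zero_point):
    """Recover approximate float values from the stored integers."""
    return [scale * (qi - zero_point) for qi in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantise_affine(weights)
restored = dequantise_affine(q, scale, zero_point)
print(q, scale, zero_point)
print(restored)
```

Each restored value differs from the original by at most about half a quantisation step (`scale / 2`), which is the error that PTQ accepts in exchange for the smaller storage format.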

Simple PTQ at INT8 typically degrades model accuracy by no more than about 1–2% on benchmark tasks. However, at INT4 and below, naive PTQ can cause significant quality degradation because some weight tensors contain outlier values with much larger magnitudes than the majority. A scale factor wide enough to cover the outliers leaves too few quantisation levels for the many small values, while a narrower one clips the outliers.
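The outlier problem can be illustrated with a toy tensor. The sketch below uses symmetric round-to-nearest quantisation with a single scale for the whole tensor; the values are synthetic and chosen only to make the effect visible, not drawn from any real model.

```python
def quant_dequant(values, num_bits):
    """Symmetric round-to-nearest quantisation followed by dequantisation."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


def mean_abs_error(values, num_bits):
    restored = quant_dequant(values, num_bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


typical = [0.02 * i for i in range(-10, 11)]  # values spread over [-0.2, 0.2]
with_outlier = typical + [4.0]                # the same values plus one large weight

for bits in (8, 4):
    print(bits, mean_abs_error(typical, bits), mean_abs_error(with_outlier, bits))
```

With the outlier present, the INT4 scale becomes so coarse that every ordinary value rounds to zero, whereas INT8 still has enough levels to represent them reasonably well. This is the gap that methods such as GPTQ and AWQ aim to close.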

GPTQ

GPTQ, introduced in 2022 as an accurate post-training quantisation method for generative pre-trained transformers, addresses INT4 quantisation of large language models using approximate second-order information, building on the Optimal Brain Compression framework. It quantises weights one layer at a time, using the Hessian of each layer's output reconstruction error (computed from calibration inputs) to adjust the not-yet-quantised weights so that they compensate for the error already introduced. GPTQ achieves INT4 quality close to FP16 baselines on most language modelling benchmarks and is widely used as a deployment format for open-weight models distributed through Hugging Face.

AWQ

Activation-aware Weight Quantisation (AWQ) takes a different approach. It uses activation statistics, rather than the weights alone, to identify the small fraction of weight channels (roughly 1%) that matter most to the output, namely those multiplied by the largest activations. It then protects these salient channels from large quantisation errors through per-channel scaling while aggressively compressing the rest. The AWQ authors report better accuracy than GPTQ at equivalent bit-widths, particularly at very low precision.
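The intuition behind protecting activation-salient weights can be shown with a deliberately small example. The sketch below is not the AWQ algorithm itself: the weights, activations, and the scaling factor `s = 2.0` are hand-picked assumptions, whereas the published method searches for per-channel scales using calibration data and quantises in groups. It only demonstrates that scaling a salient weight up before quantisation, and scaling its activation down by the same factor, leaves the exact result unchanged while shrinking the quantisation error that reaches the output.

```python
def quant_dequant_int4(weights):
    """Symmetric round-to-nearest INT4 with one scale for the whole row."""
    qmax = 7
    scale = max(abs(w) for w in weights) / qmax
    return [scale * max(-qmax, min(qmax, round(w / scale))) for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.70, 0.14, -0.33, 0.52]
activations = [1.0, 8.0, 1.0, 1.0]  # channel 1 carries a much larger activation
exact = dot(weights, activations)

# Baseline: quantise the weights as they are.
err_plain = abs(dot(quant_dequant_int4(weights), activations) - exact)

# Activation-aware variant: pick the channel with the largest activation,
# multiply its weight by s and divide its activation by s (hypothetical s).
s = 2.0
salient = max(range(len(activations)), key=lambda i: abs(activations[i]))
scaled_w = [w * s if i == salient else w for i, w in enumerate(weights)]
scaled_x = [x / s if i == salient else x for i, x in enumerate(activations)]
err_scaled = abs(dot(quant_dequant_int4(scaled_w), scaled_x) - exact)

print(f"plain INT4 error:  {err_plain:.2f}")
print(f"scaled INT4 error: {err_scaled:.2f}")
```

The rounding error on the salient weight is about the same size in both cases, but in the scaled version it is multiplied by a smaller activation, so less of it reaches the output.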

Quantisation-Aware Training

Quantisation-aware training (QAT) simulates quantisation during the training or fine-tuning process by inserting "fake quantise" operations into the computational graph. The model learns to be robust to the quantisation error introduced by lower precision, typically achieving higher accuracy than PTQ at the same bit-width, at the cost of additional training or fine-tuning compute. QAT is most beneficial when the target precision is INT4 or lower, or when the task demands high accuracy on a narrow domain.
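A "fake quantise" operation can be sketched as quantising a value and immediately dequantising it, so the rest of the network still sees a float but one that carries the rounding error. The gradient rule shown here is the straight-through estimator, which is a common choice and an assumption of this sketch rather than something the text above specifies. Function names are illustrative, and a real implementation would operate on tensors inside an autograd framework.

```python
def fake_quantise(x, scale, num_bits=8):
    """Forward pass: quantise, then dequantise, so the output stays a float."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def fake_quantise_grad(x, scale, num_bits=8):
    """Backward pass (straight-through estimator): treat rounding as the
    identity inside the representable range, and pass no gradient where
    the value was clamped."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return 1.0 if qmin * scale <= x <= qmax * scale else 0.0


for x in (0.26, 5.0):
    print(x, fake_quantise(x, scale=0.1, num_bits=4), fake_quantise_grad(x, scale=0.1, num_bits=4))
```

Because rounding has zero gradient almost everywhere, some approximation of this kind is needed for training to proceed; the straight-through estimator is simply the most widely used one.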

Mixed-Precision and Format Diversity

Modern large model deployments use mixed precision: a model may store weights in INT4 while performing computations in FP16 or BF16 (bfloat16), with conversion happening on the fly. NormalFloat4 (NF4), introduced with QLoRA, is a quantisation data type designed specifically for the approximately Gaussian distribution of neural network weights, achieving better representation quality than standard INT4 at the same bit-width. FP8 (8-bit floating point), supported natively on NVIDIA H100 GPUs and a growing number of other accelerators, offers a middle ground between the accuracy of FP16 and the efficiency of INT8.
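The "store in INT4, compute in higher precision" pattern can be sketched as follows. Two signed 4-bit values are packed into each byte for storage and expanded to floats only when needed. The nibble layout and function names are assumptions made for illustration (real formats such as GPTQ or AWQ checkpoints define their own layouts and use per-group scales), and Python floats stand in for FP16 or BF16 here.

```python
def pack_int4(q_values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(q_values) % 2:
        q_values = q_values + [0]  # pad to an even count without mutating the input
    packed = bytearray()
    for lo, hi in zip(q_values[0::2], q_values[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit integers from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:count]


def dequantise_on_the_fly(packed, count, scale):
    """Expand stored 4-bit weights to floats just before they are used."""
    return [q * scale for q in unpack_int4(packed, count)]


q = [-8, 7, 0, -1, 3]
stored = pack_int4(q)
print(len(stored), "bytes for", len(q), "weights")
print(dequantise_on_the_fly(stored, len(q), scale=0.5))
```

Storage shrinks to half a byte per weight, while the arithmetic itself still runs in the higher-precision format, which is why this arrangement saves memory and bandwidth more than it saves raw compute.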

Trade-offs and Limitations

Quantisation introduces a trade-off between model quality and efficiency. The accuracy impact depends on model size (larger models are generally more resilient to quantisation), task difficulty, and the chosen bit-width. For most conversational and general-purpose tasks, INT8 and often INT4 quantisation of modern large language models produces outputs that are difficult to distinguish from FP16 in practice. For tasks that are sensitive to small errors, such as mathematical reasoning, code generation, or scientific computation, more care is needed, and INT8 is often preferred over INT4. Research continues into 2-bit and 1-bit quantisation schemes, with models such as BitNet exploring whether extreme quantisation can yield functional language models.

Malaysian Context — Quantisation for Cost-Efficient AI in Malaysia

Quantisation has become a critical enabler for Malaysian enterprises and developers who wish to run large AI models economically, either on-premises or in the cloud. Given that many Malaysian SMEs and technology startups operate with constrained GPU budgets, quantised models — particularly those distributed via Hugging Face in GPTQ or AWQ format — allow development teams to run inference on a single NVIDIA RTX 4090 or a cloud instance with 24–80 GB of GPU memory rather than requiring multi-GPU nodes.

Axrail's Generative AI Laboratory, established with AWS as Malaysia's first such facility, uses Amazon Bedrock infrastructure, which reportedly applies quantisation and model distillation internally to deliver cost-efficient inference. AWS data in 2025 indicates that distilled models run up to 500% faster and cost up to 75% less than the original models they are distilled from. Although distillation is a distinct technique from quantisation, these figures underscore the commercial importance of model compression for Malaysian organisations using managed AI services.

Malaysia's manufacturing and industrial sector — which relies heavily on Penang's electronics and semiconductor industry — has applied quantised AI models in quality inspection systems that run inference on embedded GPUs and edge accelerators such as NVIDIA Jetson. These edge AI deployments depend on quantisation to fit capable vision models into devices with power and memory constraints typical of factory floor hardware.

Bank Negara Malaysia's (BNM) guidance on AI in financial services implicitly addresses the risk-accuracy trade-off that quantisation introduces: institutions deploying quantised models must validate that model output quality meets the required standards for the specific use case, particularly in high-stakes decisions such as credit scoring or fraud detection where degraded accuracy could have regulatory and customer-harm implications. The Malaysia AI Governance Framework similarly calls for model validation and ongoing monitoring, which includes verifying that optimisation techniques such as quantisation do not introduce unacceptable performance regression.

References

Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.
Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv:2306.00978.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv:2305.14314.
NVIDIA. (2025). Optimizing LLMs for performance and accuracy with post-training quantization. NVIDIA Technical Blog.
Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency. (2025). ACM Transactions on Internet of Things.
Amazon Web Services. (2025). AI adoption surges 35 percent in Malaysia. AWS Research Report.

Tags: quantisation, model compression, inference optimisation, edge AI

Type	Model compression and inference optimisation
Common formats	INT8, INT4, FP8, NF4, FP4
Memory reduction	4× (INT8) to 8× (INT4) vs FP32 baseline
Key techniques	PTQ, QAT, GPTQ, AWQ, QLoRA
Related	LoRA, Knowledge distillation, Edge AI, Pruning