Speculative Decoding

Speculative decoding is an inference acceleration technique that uses a small draft model to propose multiple candidate tokens that a larger target model then verifies in parallel, achieving 2-4x throughput gains without changing output quality.

5 min read | Last updated June 2026 | Infrastructure

Speculative decoding is an inference optimisation method for large language models (LLMs) that breaks the sequential bottleneck of standard autoregressive generation. Rather than having a large target model generate one token at a time — each step requiring a full forward pass — speculative decoding uses a smaller, faster draft model to propose multiple tokens simultaneously, which the target model then verifies in a single parallel pass. The net result is substantially higher throughput with no degradation in output quality.

Background

Standard autoregressive decoding is inherently sequential: the model generates token n before it can generate token n+1. While each individual forward pass through a large model is heavily parallelised on GPU hardware, the decode step itself processes only one token at a time, leaving much of the GPU's computational capacity underutilised. This sequential dependency is the primary bottleneck for LLM inference latency. Speculative decoding was introduced independently by Chen et al. and Leviathan et al. in concurrent 2023 papers as a way to exploit unused GPU parallelism during the generation phase.
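The sequential dependency can be made concrete with a minimal sketch. Here `toy_target_probs` is a hypothetical stand-in for one forward pass of a large model (it is not a real model or library call), and the four-symbol vocabulary is invented for illustration. The point is only that the loop needs one forward pass per generated token.

```python
import random

VOCAB = ["a", "b", "c", "d"]  # toy vocabulary, invented for illustration


def toy_target_probs(context):
    """Hypothetical stand-in for one full forward pass of a large model.

    Returns a next-token distribution that depends only on context length.
    """
    shift = len(context) % len(VOCAB)
    weights = [1.0 + ((i + shift) % len(VOCAB)) for i in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights]


def autoregressive_decode(prompt, max_new_tokens, rng):
    """Standard decoding: token n must exist before token n+1 can be computed."""
    context = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        probs = toy_target_probs(context)  # one forward pass ...
        forward_passes += 1
        token = rng.choices(VOCAB, weights=probs, k=1)[0]  # ... yields one token
        context.append(token)
    return context, forward_passes


tokens, passes = autoregressive_decode(["a"], max_new_tokens=8, rng=random.Random(0))
print(len(tokens) - 1, "new tokens in", passes, "forward passes")
```

Speculative decoding aims to lower the number of target-model passes per generated token below the one-to-one ratio this loop is locked into.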

How It Works

The algorithm operates in an iterative propose-and-verify loop. First, the draft model — a smaller, faster model with the same vocabulary as the target — generates a sequence of K candidate tokens (typically 4-8) given the current context. The target model then processes all K proposed tokens in a single forward pass, computing in parallel the probability distributions it would assign to each position.
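The propose step and the scoring half of the verify step can be sketched as follows. This is a toy under stated assumptions: `draft_probs` and `target_probs` are hypothetical functions standing in for real models, the vocabulary is five integer token ids, and the target's distributions are computed in a Python loop, whereas a real transformer would produce all of them in one batched forward pass.

```python
import random

VOCAB_SIZE = 5  # toy vocabulary of token ids 0..4
K = 4           # drafted tokens per cycle (the text cites 4-8 as typical)


def _toy_probs(context, salt):
    """Hypothetical toy next-token distribution; not a real model."""
    weights = [1.0 + ((sum(context) + salt * (i + 1) + i * i) % 7)
               for i in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def draft_probs(context):
    return _toy_probs(context, salt=1)   # stand-in for the small draft model


def target_probs(context):
    return _toy_probs(context, salt=2)   # stand-in for the large target model


def propose(context, k, rng):
    """Draft model: k cheap sequential steps, keeping each distribution q_i."""
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = rng.choices(range(VOCAB_SIZE), weights=q, k=1)[0]
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    return drafted, q_dists


def verify_scores(context, drafted):
    """Target distributions at all k+1 positions (one batched pass in practice)."""
    return [target_probs(list(context) + drafted[:i])
            for i in range(len(drafted) + 1)]


rng = random.Random(0)
context = [0, 1]
drafted, q_dists = propose(context, K, rng)
p_dists = verify_scores(context, drafted)
print(len(drafted), "drafted tokens,", len(p_dists), "target distributions")
```

The target returns K+1 distributions rather than K because the position after the last drafted token is scored as well.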

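At each position, the algorithm compares the probability the target model assigns to the drafted token, p(x), with the probability the draft model assigned to it, q(x). The token is accepted with probability min(1, p(x)/q(x)): it is always accepted when the target model considers it at least as likely as the draft model did, and otherwise accepted only some of the time. Tokens are accepted from left to right until the first rejection. At the point of rejection, the target model samples its own token from a corrected distribution, proportional to max(0, p(x) - q(x)), and the next draft-and-verify cycle begins. If all K tokens are accepted, the target model's distribution at the following position supplies one additional token, so each cycle produces between 1 and K+1 tokens.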
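A minimal sketch of the token-level acceptance rule, following the speculative sampling papers listed under References: a drafted token x is kept with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution. On rejection, a replacement is drawn from the normalised residual max(0, p - q). The function name `speculative_accept` and the plain-list representation of distributions are assumptions made for illustration.

```python
import random


def speculative_accept(drafted, q_dists, p_dists, rng):
    """Return the tokens emitted by one verify step.

    drafted : K token ids proposed by the draft model
    q_dists : K draft distributions (lists of probabilities)
    p_dists : K+1 target distributions, one per drafted position plus one extra
    """
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the drafted token
            continue
        # Rejected: resample from the residual distribution max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual, k=1)[0])
        return out  # everything after the first rejection is discarded
    # All K drafted tokens accepted: take one bonus token from the target.
    last = p_dists[-1]
    out.append(rng.choices(range(len(last)), weights=last, k=1)[0])
    return out


# When draft and target agree exactly, every drafted token is accepted.
same = [0.5, 0.5]
print(speculative_accept([0, 1], [same, same], [same, same, same], random.Random(0))[:2])
```

A cycle therefore emits between 1 and K+1 tokens, and the target model contributes a token of its own in every cycle.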

The acceptance-rejection procedure is designed to preserve the exact output distribution of the target model: mathematically, the sequence of accepted tokens is guaranteed to be sampled from the same distribution as if the large model had generated each token autoregressively. Speculative decoding therefore produces outputs that are statistically identical to those of the target model alone.
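The exactness claim can be checked for a single position with a short derivation. It assumes the acceptance rule from the cited papers: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution, and on rejection a token is drawn from the normalised residual max(0, p - q). The symbol β for the overall acceptance probability is introduced here only for the derivation.

```latex
% p = target distribution, q = draft distribution, x = candidate token.
% The residual term is only needed when \beta < 1; if \beta = 1 then p = q.
\begin{align*}
\beta &= \sum_{y} q(y)\,\min\!\left(1, \frac{p(y)}{q(y)}\right)
       = \sum_{y} \min\bigl(p(y), q(y)\bigr) \\
\sum_{y} \max\bigl(0,\, p(y) - q(y)\bigr)
      &= \sum_{y} \Bigl(p(y) - \min\bigl(p(y), q(y)\bigr)\Bigr) = 1 - \beta \\
\Pr[\text{output} = x]
      &= \underbrace{\min\bigl(p(x), q(x)\bigr)}_{\text{drafted and accepted}}
       + \underbrace{(1 - \beta)\,
         \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1 - \beta}}_{\text{rejected, then resampled}} \\
      &= \min\bigl(p(x), q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) = p(x)
\end{align*}
```

The last step holds in both cases: if p(x) ≥ q(x) the sum is q(x) + p(x) - q(x), and if p(x) < q(x) it is p(x) + 0.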

Performance

Empirical results show consistent speedups of 2-4x on standard benchmarks, with the magnitude depending on how well the draft model's predictions align with the target model's distribution. Closely matched draft and target models — such as a 7B-parameter draft alongside a 70B-parameter target from the same model family — yield high acceptance rates and larger speedups.
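The dependence on acceptance rate can be estimated with the closed-form expressions from the analysis in Leviathan et al. They hold under a simplifying assumption that each drafted token is accepted independently with the same probability `alpha`. The parameter `cost_ratio`, the time of one draft step relative to one target pass, and the numbers in the usage line are illustrative rather than measured.

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per target forward pass.

    alpha : per-token acceptance probability (assumed i.i.d.)
    k     : number of drafted tokens per cycle
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, cost_ratio):
    """Expected wall-clock improvement over plain autoregressive decoding.

    One cycle costs k draft steps plus one target pass, i.e. k * cost_ratio + 1
    in units of a target pass.
    """
    return expected_tokens_per_cycle(alpha, k) / (k * cost_ratio + 1.0)


# Illustrative numbers only: 80% acceptance, 4 drafted tokens, draft 10x cheaper.
print(round(expected_speedup(alpha=0.8, k=4, cost_ratio=0.1), 2))
```

With a low acceptance rate the numerator approaches 1 while the draft cost remains, so a poorly matched draft model can make decoding slower rather than faster.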

The technique has been integrated into mainstream serving frameworks including vLLM and TensorRT-LLM, and NVIDIA has reported throughput improvements of up to 3.6x on H200 GPU configurations running speculative decoding. Memory overhead is modest: the draft model requires additional GPU memory for its weights and KV cache, but because it is substantially smaller, this cost is typically about 10 percent or less of the memory consumed by the target model.
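A back-of-the-envelope estimate of the extra memory can be sketched as follows. It counts weights only, ignoring KV cache and activations, and assumes 16-bit weights (2 bytes per parameter). The 7B and 70B sizes come from the pairing mentioned earlier in this article; the 1B draft is a hypothetical smaller option added for comparison.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate weight memory in GB (assumes 16-bit weights by default)."""
    return num_params * bytes_per_param / 1e9


target_gb = weight_memory_gb(70e9)      # 70B-parameter target
draft_7b_gb = weight_memory_gb(7e9)     # 7B-parameter draft
draft_1b_gb = weight_memory_gb(1e9)     # hypothetical 1B-parameter draft

ratio_7b = draft_7b_gb / target_gb
ratio_1b = draft_1b_gb / target_gb
print(f"7B draft: {ratio_7b:.1%} of target weights; 1B draft: {ratio_1b:.1%}")
```

By this estimate a 7B draft paired with a 70B target sits at roughly 10 percent of the target's weight memory, while smaller drafts fall well below that.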

Variants and Extensions

Several variants extend the core technique. Self-speculative decoding uses early-exit layers of the target model as the draft, eliminating the need for a separate model. Medusa adds multiple decoding heads directly to the target model to propose several candidate continuations in parallel. More recent approaches apply speculation at the level of reasoning steps rather than individual tokens: a draft model proposes multi-step chain-of-thought segments, which the target model then verifies. Some studies of these approaches report additional speedups of 1.5-2.5x on reasoning-intensive tasks, along with accuracy improvements of up to 9.9 percentage points on benchmarks.

Relation to Other Inference Optimisations

Speculative decoding is complementary to other inference acceleration methods. It can be combined with KV cache reuse to reduce prefill costs, with quantisation to shrink both draft and target model footprints, and with continuous batching to improve server throughput. In production deployments, speculative decoding is typically layered atop these techniques rather than used in isolation.

Malaysian Context — AI Inference Efficiency in Southeast Asia

Malaysia's growing AI deployment ecosystem has made inference efficiency an increasingly practical concern for local organisations. Companies building on Amazon Bedrock, Google Vertex AI, or Azure AI, all of which operate regional infrastructure serving Southeast Asia, can benefit from speculative decoding when it is implemented at the serving layer: lower latency improves user experience, and higher serving throughput can reduce the cost of generating each token.

Malaysian organisations participating in MDEC's Digital Hub programme and AI acceleration initiatives under the MyDigital Blueprint frequently build customer-facing LLM applications such as chatbots, document processing pipelines, and automated customer service tools. Speculative decoding reduces the cost of serving large models at scale, making production-quality LLM deployment more economically viable for smaller organisations and government agencies.

HRD Corp has funded AI infrastructure and MLOps training programmes, and awareness of inference optimisation techniques including speculative decoding is increasingly relevant for Malaysian ML engineers working on production deployments. Research interest in efficient inference has also grown at Malaysian universities including Universiti Malaya, Universiti Teknologi Malaysia, and Universiti Sains Malaysia, which have active AI research groups exploring transformer efficiency.

Regionally, AI Singapore's research agenda includes efficient LLM deployment for Southeast Asian languages, and speculative decoding has been evaluated as part of that infrastructure work. As token generation costs remain a key barrier to LLM adoption across ASEAN, techniques that reduce inference cost without quality loss are of strategic importance to the region's AI ecosystem. Malaysian cloud infrastructure providers and managed AI service operators increasingly list speculative decoding support among their serving capabilities.

References

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318.
Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of ICML 2023.
Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774.
BentoML. (2025). Speculative Decoding. LLM Inference Handbook. https://bentoml.com/llm/inference-optimization/speculative-decoding

Tags: inference optimization llm performance

Type	Inference optimisation
Proposed by	Chen et al. (DeepMind); Leviathan et al. (Google Research), 2023
Key use	Reducing LLM token generation latency
Speedup	2-4x over standard autoregressive decoding
Related	KV Cache, Inference, Quantisation
