Speculative Decoding
Speculative decoding is an inference acceleration technique that uses a small draft model to propose multiple candidate tokens that a larger target model then verifies in parallel, achieving 2-4x throughput gains without changing output quality.
Speculative decoding is an inference optimisation method for large language models (LLMs) that breaks the sequential bottleneck of standard autoregressive generation. Rather than having a large target model generate one token at a time, with each step requiring a full forward pass, speculative decoding uses a smaller, faster draft model to propose several tokens ahead, which the target model then verifies in a single parallel pass. Because more than one token can be accepted per target-model pass, generation is substantially faster with no degradation in output quality.
Background
Standard autoregressive decoding is inherently sequential: the model must generate token n before it can generate token n+1. Although each forward pass through a large model is heavily parallelised on GPU hardware, a decode step processes only one new token per sequence, so at small batch sizes it is limited by the time needed to read the model weights from memory rather than by arithmetic, and much of the GPU's computational capacity sits idle. This sequential dependency is the primary bottleneck for LLM inference latency. Speculative decoding was introduced independently by Leviathan et al. and Chen et al. in concurrent work (the Leviathan et al. preprint appeared in late 2022 and was published at ICML 2023; the Chen et al. preprint appeared in early 2023) as a way to exploit this unused GPU parallelism during the generation phase.
How It Works
The algorithm operates in an iterative propose-and-verify loop. First, the draft model (a smaller, faster model with the same vocabulary as the target) autoregressively generates a sequence of K candidate tokens (typically 4-8) given the current context. The target model then processes all K proposed tokens in a single forward pass, computing in parallel the probability distribution it would assign at each of the K proposed positions and at the position that follows the last proposal.
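The proposal and scoring steps can be sketched in Python as follows. This is only an illustrative sketch: `toy_model`, `draft_model`, `target_model`, `propose` and `score` are hypothetical names, the "models" are toy functions over a four-token vocabulary, and the loop inside `score` stands in for what a real target model would compute in one batched forward pass.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary size, chosen only for illustration


def toy_model(context, skew):
    """Hypothetical stand-in for a language model.

    Returns a probability list over VOCAB_SIZE tokens that depends only on
    the last token of the context and on a 'skew' parameter.
    """
    last = context[-1] if context else 0
    weights = [1.0 + skew * ((last + t) % VOCAB_SIZE) for t in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def draft_model(context):
    return toy_model(context, skew=1.0)


def target_model(context):
    return toy_model(context, skew=1.5)


def propose(context, k, rng):
    """Draft step: sample k tokens one at a time from the draft model.

    Returns the proposed tokens and the draft distribution used at each step.
    """
    tokens, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_model(ctx)
        tok = rng.choices(range(VOCAB_SIZE), weights=q)[0]
        tokens.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    return tokens, q_dists


def score(context, tokens):
    """Verify step: target distributions at the k proposed positions plus one more.

    A real target model would produce all k + 1 distributions in a single
    forward pass; the Python loop here is only for illustration.
    """
    return [target_model(list(context) + tokens[:i]) for i in range(len(tokens) + 1)]


rng = random.Random(0)
context = [1, 2]
tokens, q_dists = propose(context, k=4, rng=rng)
p_dists = score(context, tokens)
print(len(tokens), len(q_dists), len(p_dists))  # 4 4 5
```

The extra (k + 1)-th target distribution is what allows a cycle to emit one more token than the draft proposed when every proposal is accepted.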
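The algorithm then checks the proposed tokens in order, using a token-level acceptance criterion that compares the probability p(x) the target model assigns to a proposed token x with the probability q(x) the draft model assigned to it. Under greedy decoding, a token is accepted only if it matches the target model's own top choice. Under sampling, a token is accepted with probability min(1, p(x)/q(x)): always when the target model assigns it at least as much probability as the draft did, and only some of the time otherwise. (Some variants relax this to a tolerance-based check, trading exactness for a higher acceptance rate.) Tokens are accepted until the first rejection. At that point the remaining proposals are discarded, the target model samples a replacement token from a corrected distribution proportional to max(0, p(x) - q(x)), and the next draft-and-verify cycle begins. If all K proposals are accepted, the target model's distribution at the following position supplies one additional token, so each cycle yields between 1 and K+1 tokens.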
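A minimal Python sketch of the acceptance rule from the original speculative sampling papers follows: accept a proposed token x with probability min(1, p(x)/q(x)), and on the first rejection resample from the normalised residual max(0, p - q). The function name `verify` and the list-of-lists representation of distributions are assumptions made for illustration, not any library's API.

```python
import random


def verify(tokens, q_dists, p_dists, rng):
    """Apply the accept/resample rule to k proposed tokens.

    tokens  : k tokens sampled from the draft model
    q_dists : k draft distributions (one per proposed position)
    p_dists : k + 1 target distributions (one extra for the position after
              the last proposal)
    Returns the tokens emitted this cycle (between 1 and k + 1 of them).
    """
    out = []
    for i, tok in enumerate(tokens):
        p, q = p_dists[i], q_dists[i]
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: sample a replacement from the residual distribution,
        # proportional to max(0, p - q), and end this cycle.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every proposal was accepted: take one bonus token from the target.
    bonus = p_dists[-1]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out


rng = random.Random(0)

# Draft and target agree exactly, so both proposals are always accepted.
agree = verify([0, 1], [[0.5, 0.5]] * 2, [[0.5, 0.5]] * 3, rng)
print(agree[:2])  # [0, 1]

# Target gives the proposed token zero probability, so it is always rejected
# and replaced by the only token with residual mass.
disagree = verify([0], [[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]], rng)
print(disagree)  # [1]
```

A rejection can only happen when p(x) < q(x) for the proposed token, which guarantees that the residual has positive mass somewhere and can be sampled from.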
The acceptance-rejection procedure is designed to preserve the exact output distribution of the target model: mathematically, the sequence of accepted tokens is guaranteed to be sampled from the same distribution as if the large model had generated each token autoregressively. Speculative decoding therefore produces outputs that are statistically identical to those of the target model alone.
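The distribution-preservation claim can be checked numerically for a single position. The sketch below assumes the standard rule (accept with probability min(1, p/q), otherwise resample from the normalised max(0, p - q)) and computes the exact probability of emitting each token; `output_distribution` is a hypothetical helper name and the example distributions are arbitrary.

```python
def output_distribution(p, q):
    """Exact one-position output distribution of the accept/resample rule.

    p : target distribution, q : draft distribution (lists of equal length).
    """
    # Probability of proposing x and accepting it:
    # q(x) * min(1, p(x) / q(x)) = min(p(x), q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)  # probability that the proposal is rejected
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # p and q are identical, so nothing is ever rejected
        return accept
    # On rejection, x is drawn from residual / norm.
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]


p = [0.5, 0.3, 0.2]  # arbitrary example target distribution
q = [0.2, 0.2, 0.6]  # arbitrary example draft distribution
result = output_distribution(p, q)
print(all(abs(a - b) < 1e-12 for a, b in zip(result, p)))  # True
```

The result matches p because the rejected mass, 1 - sum(min(p, q)), equals sum(max(0, p - q)), so each token ends up with probability min(p, q) + max(0, p - q) = p.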
Performance
Reported speedups are commonly in the 2-4x range on standard benchmarks, with the magnitude depending on how well the draft model's predictions align with the target model's distribution and on how cheap the draft model is to run relative to the target. Closely matched draft and target models, such as a 7B-parameter draft alongside a 70B-parameter target from the same model family, yield high acceptance rates and larger speedups.
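The dependence of the speedup on the acceptance rate can be made concrete with the simplified analysis given by Leviathan et al., which assumes each proposed token is accepted independently with the same probability. In the sketch below, `alpha` is that acceptance probability, `k` is the number of proposed tokens per cycle, and `cost_ratio` is the time of one draft step divided by the time of one target pass; the function names and the example values are assumptions for illustration, not measurements.

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per target-model pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1.0  # every proposal accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha, k, cost_ratio):
    """Expected wall-clock speedup over plain autoregressive decoding.

    One cycle costs k draft steps plus one target pass, measured in units of
    a single target pass.
    """
    return expected_tokens_per_cycle(alpha, k) / (k * cost_ratio + 1.0)


# Illustrative values only: 80% acceptance, 4 proposals, draft 10x cheaper.
print(round(expected_speedup(alpha=0.8, k=4, cost_ratio=0.1), 2))  # 2.4
```

Under this model a low acceptance rate or an expensive draft model can push the ratio below 1, which is why the choice of draft model and of K matters in practice.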
The technique has been integrated into mainstream serving frameworks including vLLM and TensorRT-LLM, and NVIDIA has reported throughput improvements of up to 3.6x on H200 GPU configurations running speculative decoding. Memory overhead is modest: the draft model requires additional GPU memory for its weights and its own KV cache, but because it is much smaller than the target this cost is typically around 10 percent of the target model's footprint or less (a 7B-parameter draft holds one tenth as many parameters as a 70B-parameter target at the same precision).
Variants and Extensions
Several variants extend the core technique. Self-speculative decoding uses a subset of the target model's own layers as the draft, for example by exiting early, eliminating the need for a separate model. Medusa adds multiple decoding heads directly to the target model to propose several candidate continuations in parallel. More recent approaches apply speculation at the level of reasoning steps rather than individual tokens, enabling draft models to propose multi-step chain-of-thought segments that the target model verifies. These yield additional speedups on reasoning-intensive tasks, with some studies reporting 1.5-2.5x gains and up to 9.9 percentage points of accuracy improvement on benchmarks. A change in accuracy implies that such step-level schemes, unlike the token-level procedure described above, do not reproduce the target model's output distribution exactly.
Relation to Other Inference Optimisations
Speculative decoding is complementary to other inference acceleration methods. It can be combined with KV cache reuse to reduce prefill costs, with quantisation to shrink both draft and target model footprints, and with continuous batching to improve server throughput. In production deployments, speculative decoding is typically layered atop these techniques rather than used in isolation.
References
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318.
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of ICML 2023.
- Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774.
- BentoML. (2025). Speculative Decoding. LLM Inference Handbook. https://bentoml.com/llm/inference-optimization/speculative-decoding