KV Cache
A KV cache (key-value cache) is an inference optimisation for transformer models that stores previously computed key and value tensors from the attention mechanism, spending memory to eliminate redundant recomputation when generating tokens sequentially.
The KV cache (key-value cache) is a standard optimisation applied during the inference phase of transformer-based language models. It stores the key (K) and value (V) tensors produced by the self-attention layers for every token that has already been processed, so that those computations do not need to be repeated when generating subsequent tokens. Without a KV cache, producing each new token would require re-processing the entire context from scratch, so the work per generated token would grow with the length of the sequence so far, and quadratically so in the attention layers. With caching, each step computes projections only for the new token and attends over the stored keys and values, so the per-step attention cost grows only linearly with context length.
Background in Transformer Attention
The self-attention mechanism at the core of every transformer layer computes three vectors for each input token: a query (Q), a key (K), and a value (V). The output for a token is computed by comparing that token's query vector against the key vectors of the tokens it is allowed to attend to, then forming a weighted sum of the corresponding value vectors. During autoregressive text generation, the model produces one token at a time. After the initial prompt is processed in the prefill phase, every subsequent generation step adds exactly one new token. Because attention in these models is causal (each token attends only to itself and to earlier tokens), the keys and values of all previously seen tokens are unaffected by later tokens and remain unchanged; only the new token contributes new K and V tensors. The KV cache retains the earlier tensors in GPU memory, so the attention computation at each new step compares only the new token's query against the full cached set.
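The equivalence described above can be checked with a small sketch. This is a minimal, single-head, single-layer illustration in pure Python, not an implementation taken from any library: the names `KVCache`, `attend`, and `full_recompute`, the dimension `d = 4`, and the random weights are all assumptions made for the example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over the given keys and values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def full_recompute(xs, Wq, Wk, Wv):
    # No cache: recompute K and V for every token, then attend from the last one.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

class KVCache:
    """Hypothetical per-layer cache: one stored K and V vector per seen token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x, Wq, Wk, Wv):
        # Only the new token is projected; earlier K and V are reused as stored.
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)

random.seed(0)
d = 4  # illustrative model dimension

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for t, x in enumerate(tokens):
    cached_out = cache.step(x, Wq, Wk, Wv)
    full_out = full_recompute(tokens[: t + 1], Wq, Wk, Wv)
    # Both paths should give the same attention output for the newest token.
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached_out, full_out))
```

The cached path performs three projections per step regardless of sequence length, whereas the uncached path repeats the K and V projections for every earlier token at every step.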
Memory Implications
Although the KV cache dramatically reduces compute, it introduces significant memory pressure. Cache size per request scales linearly with four factors: context length (number of tokens in the prompt plus all generated tokens so far), the number of transformer layers, the number of key-value heads, and the dimension of each head. Two further multipliers apply: a factor of two, because both K and V are stored, and the number of bytes per stored element. For a large model serving long contexts, the KV cache may consume multiple gigabytes of GPU memory per active request. In multi-user serving scenarios, this means KV cache capacity directly constrains the number of concurrent requests a server can handle.
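The scaling can be made concrete with a short calculation. The function below is an illustrative sketch, and the configuration used (32 layers, 8 key-value heads, head dimension 128, 16-bit elements, 8,192 tokens) is an assumed example rather than a description of any specific model discussed here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bytes_per_element=2, batch_size=1):
    # Leading factor of 2: one K tensor and one V tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * bytes_per_element * batch_size)

# Assumed example configuration (hypothetical values).
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                           num_tokens=1)
per_request = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             num_tokens=8192)

print(per_token // 1024, "KiB per token")        # 128 KiB per token
print(per_request / 2**30, "GiB per request")    # 1.0 GiB per request
```

Under these assumptions a single 8,192-token request occupies 1 GiB, so a server with, say, 40 GiB left over after loading weights could hold at most about 40 such requests at once.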
Several techniques have been developed to manage this memory cost. Quantisation of KV cache tensors to lower precision (INT8 or INT4) reduces the footprint at the price of modest accuracy trade-offs. Paged KV caching, introduced as PagedAttention by the vLLM system in 2023 (Kwon et al., 2023), allocates cache memory in fixed-size pages rather than one contiguous region per request, reducing fragmentation and enabling efficient memory sharing across requests with identical prefixes. Sliding window attention limits the cache to only the most recent W tokens, sacrificing distant context in exchange for bounded memory usage. Prefix caching (also called prompt caching) reuses the KV cache of shared prompt segments across multiple requests, which is especially valuable when many requests share the same system prompt.
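The paging idea can be sketched with a toy allocator. This is a simplified illustration of the bookkeeping only, not vLLM's actual implementation or API: the class `PagedKVAllocator`, its methods, and the block size of 16 tokens are all hypothetical, and no tensor data is stored.

```python
class PagedKVAllocator:
    """Toy block-table allocator: maps each request to fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        # Returns (physical block id, offset within block) for the new token.
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            # No block yet, or the last block is full: take one from the pool.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, request_id):
        # Return all of a finished request's blocks to the free pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    allocator.append_token("a")   # 20 tokens need 2 blocks
for _ in range(16):
    allocator.append_token("b")   # 16 tokens fit exactly in 1 block

print(len(allocator.block_tables["a"]), len(allocator.free_blocks))  # 2 1
allocator.release("a")
print(len(allocator.free_blocks))                                    # 3
```

Because blocks are allocated only as tokens arrive, a request wastes at most one partially filled block, and the blocks belonging to one request need not be adjacent in memory.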
Prefill and Decode Phases
LLM inference divides into two distinct phases. During the prefill phase, all prompt tokens are processed simultaneously in a single parallel forward pass, which populates the KV cache. During the decode phase, tokens are generated one at a time, each step reading from the cached tensors. The prefill phase is compute-bound and parallelises efficiently on GPU hardware. The decode phase is memory-bandwidth-bound: each step performs little arithmetic per byte loaded, so the bottleneck is the speed at which the GPU can read the model weights and the cached K and V tensors from high-bandwidth memory (HBM) rather than the number of arithmetic operations performed. This asymmetry has motivated specialised batching strategies, hardware features, and architectural choices aimed at improving decode throughput.
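A back-of-envelope estimate illustrates why memory bandwidth dominates decoding. The sketch below assumes that every decode step must read all model weights once and each active request's KV cache once, and it ignores compute time, caching effects, and any overlap; the function name and all numbers (16 GB of weights, 1 GB of KV cache per request, 1 TB/s of bandwidth) are hypothetical.

```python
def decode_step_lower_bound_ms(weight_bytes, kv_bytes_per_request,
                               hbm_bandwidth_bytes_per_s, batch_size=1):
    # Weights are read once per step and shared by the whole batch;
    # each request additionally reads its own cached K and V tensors.
    bytes_read = weight_bytes + batch_size * kv_bytes_per_request
    return bytes_read / hbm_bandwidth_bytes_per_s * 1000.0

# Hypothetical figures, chosen only to make the arithmetic easy to follow.
WEIGHTS = 16e9        # bytes of model weights
KV_PER_REQUEST = 1e9  # bytes of KV cache per active request
BANDWIDTH = 1e12      # bytes per second of memory bandwidth

single = decode_step_lower_bound_ms(WEIGHTS, KV_PER_REQUEST, BANDWIDTH)
batched = decode_step_lower_bound_ms(WEIGHTS, KV_PER_REQUEST, BANDWIDTH,
                                     batch_size=8)

print(round(single, 3), "ms per step for 1 request")         # 17.0
print(round(batched / 8, 3), "ms per token with 8 requests")  # 3.0
```

Under these assumptions, batching amortises the weight reads across requests, but the KV cache reads grow with batch size and context length, which is why cache size limits decode throughput.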
Multi-Query and Grouped-Query Attention
Standard multi-head attention (MHA) maintains separate K and V tensors for every attention head, so KV cache size is multiplied by the number of heads. Multi-query attention (MQA; Shazeer, 2019) reduces this by sharing a single K and V head across all query heads, shrinking the cache by a factor equal to the number of heads at the cost of some model quality. Grouped-query attention (GQA; Ainslie et al., 2023), used in models such as Llama 3 and Mistral, provides a middle ground: the query heads are divided into groups, and each group shares one K and V head, balancing cache size against representational capacity. GQA has become the dominant approach in modern open-weight models precisely because it controls KV cache memory without the full quality penalty of MQA.
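The relationship between the three schemes can be sketched as a mapping from query heads to key-value heads. The function name and the head counts below are illustrative assumptions, and assigning contiguous query heads to the same group is one common convention rather than the only possible layout.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    # Each group of consecutive query heads shares one K/V head.
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

num_query_heads = 8  # hypothetical head count
mappings = {}
for name, num_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mappings[name] = [
        kv_head_for_query_head(h, num_query_heads, num_kv_heads)
        for h in range(num_query_heads)
    ]
    relative_cache = num_kv_heads / num_query_heads
    print(name, mappings[name], "cache size relative to MHA:", relative_cache)

# MHA [0, 1, 2, 3, 4, 5, 6, 7] cache size relative to MHA: 1.0
# GQA [0, 0, 0, 0, 1, 1, 1, 1] cache size relative to MHA: 0.25
# MQA [0, 0, 0, 0, 0, 0, 0, 0] cache size relative to MHA: 0.125
```

MHA and MQA appear here as the two extremes of the same scheme: one K/V head per query head, and one K/V head for all query heads.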
Production Considerations
In practice, KV cache management is one of the most consequential engineering decisions in LLM deployment. Serving systems such as vLLM, TensorRT-LLM, and SGLang have each developed sophisticated cache schedulers that handle dynamic allocation, eviction, and reuse across concurrent requests. KV cache compression and offloading to CPU memory or SSDs are active areas of research for models serving very long contexts, such as those used in legal document analysis, codebase understanding, or long-form dialogue.
References
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of SOSP 2023.
- Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
- Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
- AI21 Labs. (2025). What is a KV Cache? AI21 Glossary. https://www.ai21.com/glossary/foundational-llm/what-is-a-kv-cache/
- Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., & Dean, J. (2023). Efficiently Scaling Transformer Inference. Proceedings of MLSys 2023.