What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

KV Cache

A KV cache (key-value cache) is a memory optimisation used in transformer inference that stores pre-computed key and value tensors from the attention mechanism, eliminating redundant recomputation when generating tokens sequentially.

6 min read | Last updated June 2026 | Infrastructure

The KV cache (key-value cache) is a standard memory optimisation applied during the inference phase of transformer-based language models. It stores the intermediate key (K) and value (V) tensors produced by the self-attention layers for every token that has already been processed, so that those computations do not need to be repeated when generating subsequent tokens. Without a KV cache, producing each new token would require re-processing the entire context from scratch, so the attention cost of each generation step would grow quadratically with sequence length. With caching, only the new token requires fresh K and V projections and a single attention pass over the cached tensors, so the per-step attention cost grows only linearly with sequence length.

Background in Transformer Attention

The self-attention mechanism at the core of every transformer layer computes three sets of vectors for each input token: query (Q), key (K), and value (V). The output for any query token is computed by comparing that token's query vector against the key vectors of all tokens in context, then forming a weighted sum of the corresponding value vectors. During autoregressive text generation, the model produces one token at a time. After the initial prompt is processed in the prefill phase, every subsequent generation step adds exactly one new token. The keys and values for all previously seen tokens remain unchanged; only the new token contributes new K and V tensors. The KV cache retains these pre-computed tensors in GPU memory, so the attention computation at each new step processes only the new token's query against the full cached set.
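The equivalence described above can be checked numerically. The following is a minimal single-head sketch in NumPy, not taken from any particular serving system: the names `causal_attention`, `KVCache`, and the weight matrices `Wq`, `Wk`, `Wv` are illustrative assumptions. It compares full recomputation with a causal mask against step-by-step decoding that appends one K and V row per token.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    # Full recomputation: project every token, then mask out future positions.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

class KVCache:
    """Hypothetical single-head cache holding one K row and one V row per token."""

    def __init__(self, d_head):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token, append its K and V, attend over the cache.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        scores = q @ self.K.T / np.sqrt(self.K.shape[-1])
        return softmax(scores) @ self.V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # 5 tokens, model width 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))

full = causal_attention(X, Wq, Wk, Wv)
cache = KVCache(d_head=4)
stepwise = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
```

Both paths produce the same outputs up to floating-point error, but the cached path performs one projection and one row of attention per step instead of reprocessing all earlier tokens.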

Memory Implications

Although the KV cache dramatically reduces compute, it introduces significant memory pressure. For a single request, cache size scales with four factors: context length (number of tokens in the prompt plus all generated tokens so far), the number of transformer layers, the number of attention heads that keep their own K and V tensors, and the dimension of each head. It also depends on the numeric precision of the stored tensors. For a large model serving long contexts, the KV cache may consume multiple gigabytes of GPU memory per active request. In multi-user serving scenarios, the total grows with the number of active requests, so KV cache capacity directly constrains the number of concurrent requests a server can handle.
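The scaling factors above can be turned into a rough size estimate. The sketch below is a back-of-the-envelope calculation; the function name `kv_cache_bytes` and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit storage) are assumptions chosen for illustration and do not describe a specific model from this article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # The leading factor 2 accounts for storing both a key and a value tensor.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative (assumed) configuration with a 4096-token context in 16-bit precision.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30)  # prints 2.0 (GiB for a single request)
```

Under these assumptions one request occupies 2 GiB, so a batch of eight such requests would need 16 GiB for the cache alone, before counting model weights.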

Several techniques have been developed to manage this memory cost. Quantisation of KV cache tensors to lower precision (INT8 or INT4) reduces the footprint at the cost of a modest accuracy loss. Paged KV caching, introduced by the vLLM system in 2023, allocates cache memory in fixed-size pages rather than one contiguous region per request, reducing fragmentation and enabling efficient memory sharing across requests with identical prefixes. Sliding window attention limits the cache to only the most recent W tokens, sacrificing distant context in exchange for bounded memory usage. Prefix caching (also called prompt caching) reuses the KV cache of shared prompt segments across multiple requests, which is especially valuable when many requests share the same system prompt.
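The paged allocation idea can be sketched with a small block table. This is a simplified illustration and not vLLM's actual implementation or API: the class `PagedKVAllocator` and its methods are hypothetical, it tracks only which physical page holds each token, and it omits prefix sharing and eviction.

```python
class PagedKVAllocator:
    """Hypothetical page allocator: maps each request to a list of fixed-size pages."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical page ids
        self.num_tokens = {}     # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        # Allocate a new page only when every existing page is full.
        if n == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        # Return where this token's K and V would be written: (page id, offset).
        return table[n // self.block_size], n % self.block_size

    def release(self, request_id):
        # A finished request returns all of its pages to the free pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]   # needs 2 pages
slot_b = alloc.append_token("b")                        # needs 1 page
```

Because pages are claimed on demand, a request wastes at most one partially filled page, rather than reserving memory for its maximum possible length up front.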

Prefill and Decode Phases

LLM inference divides into two distinct phases. During the prefill phase, all prompt tokens are processed simultaneously in a single parallel forward pass, which populates the KV cache. During the decode phase, tokens are generated one at a time, each step reading from the cached tensors. The prefill phase is compute-bound and parallelises efficiently on GPU hardware. The decode phase is memory-bandwidth-bound: the bottleneck is the speed at which the GPU can read K and V tensors from high-bandwidth memory (HBM) rather than the number of arithmetic operations performed. This asymmetry has motivated specialised batching strategies, hardware features, and architectural choices aimed at improving decode throughput.
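A rough way to see why decode is memory-bandwidth-bound is to estimate how long it takes just to read the data each step touches. The sketch below is a simplified lower bound for a single sequence; the function name and all numbers (14 GB of weights, a 2 GB cache, 2 TB/s of memory bandwidth) are hypothetical, and it ignores compute time, batching, and kernel overheads.

```python
def decode_step_lower_bound_ms(weight_bytes, kv_bytes, hbm_bytes_per_s):
    # Each decode step must stream the model weights and the cached K/V
    # tensors from memory at least once, so read time bounds step latency.
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s * 1e3

# Hypothetical figures chosen only to illustrate the arithmetic.
t_ms = decode_step_lower_bound_ms(weight_bytes=14e9, kv_bytes=2e9,
                                  hbm_bytes_per_s=2e12)
print(f"{t_ms:.1f} ms per token")
```

Under these assumptions the bound is about 8 ms per token regardless of how fast the arithmetic units are. Batching several sequences shares the weight reads across them, but each sequence still adds its own KV cache reads.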

Multi-Query and Grouped-Query Attention

Standard multi-head attention maintains separate K and V tensors for every attention head, multiplying KV cache size by the number of heads. Multi-query attention (MQA) reduces this by sharing a single K and V head across all query heads, shrinking the cache by a factor equal to the number of heads at the cost of some model quality. Grouped-query attention (GQA), used in models such as Llama 3 and Mistral, provides a middle ground: the query heads are divided into groups, and each group shares a single K and V head, balancing cache size against representational capacity. GQA has been widely adopted in modern open-weight models because it controls KV cache memory without the full quality penalty of MQA.
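The sharing pattern can be illustrated for a single decode step. The following NumPy sketch is an assumption-laden illustration rather than any model's reference code: the function name `grouped_query_attention` and the tensor layouts are hypothetical. Setting the number of KV heads equal to the number of query heads gives standard multi-head attention, and setting it to 1 gives MQA.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    # Q: (n_q_heads, d) queries for the new token.
    # K, V: (n_kv_heads, t, d) cached keys and values for t tokens.
    n_q_heads, d = Q.shape
    n_kv_heads = K.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so that consecutive query heads in a group reuse it.
    K_exp = np.repeat(K, group, axis=0)           # (n_q_heads, t, d)
    V_exp = np.repeat(V, group, axis=0)
    scores = np.einsum("hd,htd->ht", Q, K_exp) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ht,htd->hd", weights, V_exp)

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4))       # 8 query heads
K = rng.normal(size=(2, 5, 4))    # only 2 KV heads are cached
V = rng.normal(size=(2, 5, 4))
out = grouped_query_attention(Q, K, V)
```

Here the cache stores 2 KV heads instead of 8, a fourfold reduction, while every query head still produces its own output.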

Production Considerations

In practice, KV cache management is one of the most consequential engineering decisions in LLM deployment. Serving systems such as vLLM, TensorRT-LLM, and SGLang have each developed sophisticated cache schedulers that handle dynamic allocation, eviction, and reuse across concurrent requests. KV cache compression and offloading to CPU memory or SSDs are active areas of research for models serving very long contexts, such as those used in legal document analysis, codebase understanding, or long-form dialogue.

Malaysian Context — Efficient LLM Serving for Local Deployments

For Malaysian organisations deploying LLM-based services in banking, government, healthcare, or education, GPU memory constraints are a practical operational concern. Institutions such as Maybank, CIMB, and Telekom Malaysia (TM) exploring internal AI deployments on cloud infrastructure must account for how context length and batch size interact with KV cache memory when designing their serving architectures.

MDEC's Digital Economy initiatives have encouraged Malaysian companies to build AI-powered customer service tools and document automation pipelines, many of which serve multiple concurrent users. Efficient KV cache management, including prefix caching for shared system prompts and quantised caches for reduced memory footprint, directly affects the unit economics of these services — particularly for smaller Malaysian technology companies operating on constrained GPU budgets.

Cloud providers with regional infrastructure serving Malaysian customers — including Amazon Web Services (Asia Pacific Singapore region), Microsoft Azure, and Google Cloud — have incorporated paged KV caching and prefix sharing into their managed LLM serving products. Malaysian AI practitioners accessing these platforms through BNM-regulated cloud procurement processes or via MDEC-facilitated programmes benefit from these optimisations at the infrastructure layer.

At the national level, understanding KV cache constraints is important for planning AI infrastructure under the MyDigital Blueprint and the National AI Roadmap. Government agencies considering on-premise LLM deployments — for example, for processing sensitive documents outside public cloud environments — need to account for KV cache memory requirements when sizing GPU hardware. HRD Corp-funded AI engineering programmes increasingly include LLM inference architecture as a curriculum topic, reflecting industry demand for engineers who understand these memory-compute trade-offs.

References

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of SOSP 2023.
Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
AI21 Labs. (2025). What is a KV Cache? AI21 Glossary. https://www.ai21.com/glossary/foundational-llm/what-is-a-kv-cache/
Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., & Dean, J. (2023). Efficiently Scaling Transformer Inference. Proceedings of MLSys 2023.

Tags: inference, memory, transformer, optimization

Full name	Key-Value Cache
Type	Inference memory optimisation
Used in	Transformer-based language models
Key benefit	Reduces per-step attention cost from O(n^2) to O(n) in sequence length n
Trade-off	GPU memory scales with batch size and context length
Related	Speculative Decoding, Attention Mechanism, Inference
