Prompt Caching
Prompt caching is an inference optimisation technique that stores precomputed key-value representations of repeated prompt prefixes, reducing latency and token processing costs for applications with stable system prompts or long shared contexts.
Prompt caching is a technique applied at the inference layer of large language model (LLM) APIs that stores the computed key-value (KV) tensor representations produced by the attention mechanism when processing a prompt prefix. When a subsequent request shares the same prefix, the provider retrieves the stored tensors rather than recomputing them, eliminating redundant forward-pass computation for the cached portion. The result is substantially lower latency and reduced per-token costs for applications whose prompts share a large stable component — such as a long system prompt, a retrieved document corpus, or a multi-turn conversation history.
Background: The KV Cache
To understand prompt caching, it is useful to understand the standard KV cache present in all autoregressive Transformer models. During generation, each transformer layer computes key and value matrices for every input token. Rather than recomputing these for all prior tokens at each generation step, the model stores them in a KV cache in GPU memory and reuses them. This per-sequence cache is why the generation of long sequences is faster than it would otherwise be.
Prompt caching extends this concept across API requests and across users. Instead of discarding the KV cache when a request completes, the provider retains the cache for a fixed time-to-live (TTL) period and makes it available for subsequent requests that share the same prefix. Because the same system prompt might be sent millions of times per day by different instances of the same application, the aggregate compute saving is substantial.
Implementation by Provider
Anthropic introduced prompt caching for Claude in mid-2024. Caching is activated by adding cache_control markers to message content blocks in the API request. Cached tokens are priced at approximately USD 0.30 per million tokens for cache reads, compared to USD 3.00 per million for regular input tokens on Claude 3.5 Sonnet — a tenfold cost reduction. The cache has a 5-minute TTL that refreshes each time the cached segment is accessed. Anthropic reports typical latency reductions of up to 85 percent for time-to-first-token on prompts with large cached prefixes.
OpenAI introduced automatic prompt caching for GPT-4o and GPT-4o-mini in late 2024. OpenAI's implementation requires no explicit developer configuration — caching is applied automatically to the longest matching prefix in the request that meets a minimum length threshold. Cached tokens are charged at 50 percent of the standard input token price. GPT-4.1 and GPT-5.1 series support cache retention up to 24 hours, making the feature particularly valuable for applications with consistent daily system prompts.
Google provides implicit caching and explicit context caching on Gemini 1.5 and later models through Vertex AI and AI Studio. Explicit context caching allows developers to upload a document or long context once and obtain a cache handle that can be referenced in subsequent requests, bypassing the need to resend the full content.
Amazon Bedrock announced general availability of prompt caching in April 2025, supporting select Anthropic Claude and Amazon Nova models. This makes prompt caching accessible to Malaysian and Southeast Asian enterprises using AWS infrastructure without direct contracts with model providers.
Cost and Latency Impact
The economics of prompt caching are most favourable for applications where the system prompt or shared context constitutes a large fraction of total input tokens. A typical enterprise chatbot with a 10,000-token system prompt and 500-token user messages achieves a cache hit rate above 90 percent in steady-state production, reducing effective input token costs by over 80 percent compared to the uncached case. Across an application serving millions of daily users, the monthly cost saving can determine product viability.
Latency benefits are equally significant. Time-to-first-token — the delay between submitting a request and receiving the first generated token — is directly proportional to the number of input tokens processed. With caching, only the uncached portion of the prompt (typically the current user message) requires fresh computation, dramatically reducing this delay even for very long-context applications.
Use Cases
Prompt caching is most valuable in several scenarios: applications with long, stable system prompts such as legal document analysis tools and customer support agents; multi-turn conversational agents that accumulate long context over a session; retrieval-augmented generation (RAG) pipelines where the same retrieved documents are referenced across multiple queries; and batch processing workflows where many queries share a common document or dataset prefix.
See Also
References
- Anthropic. (2024). Prompt Caching Documentation. docs.anthropic.com.
- OpenAI. (2024). Prompt Caching in the OpenAI API. platform.openai.com/docs.
- AWS. (2025). Amazon Bedrock Announces General Availability of Prompt Caching. aws.amazon.com.
- DigitalOcean. (2024). Prompt Caching with OpenAI and Anthropic Models. digitalocean.com/blog.