What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Prompt Caching

Prompt caching is an inference optimisation technique that stores precomputed key-value representations of repeated prompt prefixes, reducing latency and token processing costs for applications with stable system prompts or long shared contexts.

6 min readLast updated June 2026Infrastructure

Prompt caching is a technique applied at the inference layer of large language model (LLM) APIs that stores the computed key-value (KV) tensor representations produced by the attention mechanism when processing a prompt prefix. When a subsequent request shares the same prefix, the provider retrieves the stored tensors rather than recomputing them, eliminating redundant forward-pass computation for the cached portion. The result is substantially lower latency and reduced per-token costs for applications whose prompts share a large stable component — such as a long system prompt, a retrieved document corpus, or a multi-turn conversation history.

Background: The KV Cache

To understand prompt caching, it is useful to understand the standard KV cache present in all autoregressive Transformer models. During generation, each transformer layer computes key and value matrices for every input token. Rather than recomputing these for all prior tokens at each generation step, the model stores them in a KV cache in GPU memory and reuses them. This per-sequence cache is why the generation of long sequences is faster than it would otherwise be.

Prompt caching extends this concept across API requests and across users. Instead of discarding the KV cache when a request completes, the provider retains the cache for a fixed time-to-live (TTL) period and makes it available for subsequent requests that share the same prefix. Because the same system prompt might be sent millions of times per day by different instances of the same application, the aggregate compute saving is substantial.

Implementation by Provider

Anthropic introduced prompt caching for Claude in mid-2024. Caching is activated by adding cache_control markers to message content blocks in the API request. Cached tokens are priced at approximately USD 0.30 per million tokens for cache reads, compared to USD 3.00 per million for regular input tokens on Claude 3.5 Sonnet — a tenfold cost reduction. The cache has a 5-minute TTL that refreshes each time the cached segment is accessed. Anthropic reports typical latency reductions of up to 85 percent for time-to-first-token on prompts with large cached prefixes.

OpenAI introduced automatic prompt caching for GPT-4o and GPT-4o-mini in late 2024. OpenAI's implementation requires no explicit developer configuration — caching is applied automatically to the longest matching prefix in the request that meets a minimum length threshold. Cached tokens are charged at 50 percent of the standard input token price. GPT-4.1 and GPT-5.1 series support cache retention up to 24 hours, making the feature particularly valuable for applications with consistent daily system prompts.

Google provides implicit caching and explicit context caching on Gemini 1.5 and later models through Vertex AI and AI Studio. Explicit context caching allows developers to upload a document or long context once and obtain a cache handle that can be referenced in subsequent requests, bypassing the need to resend the full content.

Amazon Bedrock announced general availability of prompt caching in April 2025, supporting select Anthropic Claude and Amazon Nova models. This makes prompt caching accessible to Malaysian and Southeast Asian enterprises using AWS infrastructure without direct contracts with model providers.

Cost and Latency Impact

The economics of prompt caching are most favourable for applications where the system prompt or shared context constitutes a large fraction of total input tokens. A typical enterprise chatbot with a 10,000-token system prompt and 500-token user messages achieves a cache hit rate above 90 percent in steady-state production, reducing effective input token costs by over 80 percent compared to the uncached case. Across an application serving millions of daily users, the monthly cost saving can determine product viability.

Latency benefits are equally significant. Time-to-first-token — the delay between submitting a request and receiving the first generated token — is directly proportional to the number of input tokens processed. With caching, only the uncached portion of the prompt (typically the current user message) requires fresh computation, dramatically reducing this delay even for very long-context applications.

Use Cases

Prompt caching is most valuable in several scenarios: applications with long, stable system prompts such as legal document analysis tools and customer support agents; multi-turn conversational agents that accumulate long context over a session; retrieval-augmented generation (RAG) pipelines where the same retrieved documents are referenced across multiple queries; and batch processing workflows where many queries share a common document or dataset prefix.

Malaysian Context — Cloud AI Costs and Local Deployment

For Malaysian AI developers and enterprises, prompt caching represents a meaningful cost optimisation that directly affects the business viability of LLM-based products. Malaysia's technology startup ecosystem — including companies in MDEC's Malaysia Digital programme and beneficiaries of the Cradle Fund and Malaysia Venture Capital Management (MAVCAP) — typically operates on tighter infrastructure budgets than counterparts in the United States or Europe, making per-token cost reductions disproportionately impactful.

AWS in Malaysia, operating through VSTECS and direct enterprise accounts, has made Amazon Bedrock's prompt caching feature available to Malaysian customers following the April 2025 general availability release. Malaysian financial institutions including Maybank, CIMB, and Public Bank that use Bedrock-hosted Claude models for internal document search, regulatory compliance automation, and customer-facing chatbots are among the earliest adopters.

Telekomunikasi Malaysia (TM) and Maxis, which provide cloud infrastructure services and AI platform offerings to Malaysian enterprise customers, have incorporated prompt caching into their AI-as-a-service product documentation and technical workshops. The MDEC Digital Hub in Cyberjaya hosts AI developer events where prompt caching and inference optimisation feature in cloud provider workshops by AWS, Azure, and Google.

HRD Corp-approved training programmes covering LLM application development increasingly include prompt caching as part of production-readiness instruction, recognising that cost management is a practical competency for Malaysian AI practitioners building commercially deployed systems. The alignment of prompt caching with sustainable AI deployment — reducing energy consumption proportional to compute savings — also resonates with Malaysia's National Energy Transition Roadmap commitments, as data centre energy usage is a growing policy consideration.

References

Anthropic. (2024). Prompt Caching Documentation. docs.anthropic.com.
OpenAI. (2024). Prompt Caching in the OpenAI API. platform.openai.com/docs.
AWS. (2025). Amazon Bedrock Announces General Availability of Prompt Caching. aws.amazon.com.
DigitalOcean. (2024). Prompt Caching with OpenAI and Anthropic Models. digitalocean.com/blog.

Tags:prompt caching LLM inference API optimisation KV cache cost reduction

Type	Inference optimisation
Mechanism	Reuse of precomputed KV cache segments
Cost reduction	50-90% on cached token processing
Latency reduction	Up to 85% time-to-first-token
Supported by	Anthropic, OpenAI, Google, Amazon Bedrock
Related	KV cache, inference, context window