What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

Context Window

The maximum number of tokens — including the prompt, prior conversation, retrieved documents, and the model's own output — that a large language model can process in a single forward pass.

5 min read | Last updated May 2026 | Foundations

The context window of a large language model is the maximum number of tokens that the model can attend to at once. It acts as the model's working memory and includes every token in the system prompt, the user's prompt, prior conversation turns, retrieved documents passed in for grounding, and the tokens the model has generated so far. When the total would exceed the context window, the model cannot attend to the overflow at all, so the application layer has to truncate, summarise, or drop content.
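As a rough illustration of how an application layer might handle overflow, the sketch below drops the oldest conversation turns until the prompt fits. The names `count_tokens` and `fit_to_window` are hypothetical, and the whitespace-based token count is a crude stand-in for a real tokenizer, which would produce different numbers.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(
    system_prompt: str,
    turns: List[str],
    context_window: int,
    reserved_for_output: int,
) -> List[str]:
    """Keep the most recent turns that fit alongside the system prompt,
    leaving room for the tokens the model still has to generate."""
    budget = context_window - reserved_for_output - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt and output reservation exceed the context window")

    kept: List[str] = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c d", "e f", "g h i"]
print(fit_to_window("You are helpful", history, context_window=12, reserved_for_output=4))
```

Dropping whole turns from the oldest end is only one possible policy; summarising older turns instead is a common alternative.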

Why context windows are bounded

Modern language models are built on the transformer architecture, in which every token in the input attends to every other token through the self-attention mechanism. The compute and memory cost of full self-attention scales quadratically with sequence length, meaning that doubling the context window roughly quadruples the cost of processing a request. This quadratic scaling has historically been the primary technical reason that context windows were kept short, and a great deal of recent research is devoted to circumventing it.
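The quadratic relationship can be checked with a few lines of arithmetic. This is a simplified sketch that counts only the query-key score entries of full self-attention (per head, per layer) and ignores every other cost in the model; the function name is illustrative.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of query-key score entries in full self-attention
    for one head in one layer: every token scores every token."""
    return seq_len * seq_len


baseline = attention_score_entries(8_000)
for n in (8_000, 16_000, 32_000):
    ratio = attention_score_entries(n) / baseline
    print(f"{n:>6} tokens -> {ratio:.0f}x the attention scores of 8,000 tokens")
```

Each doubling of the sequence length multiplies the count by four, which is the behaviour described above.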

Memory pressure is also a significant constraint. During inference, the key-value (KV) cache, which stores the attention keys and values of every token already processed, grows linearly with the number of tokens in the context (prompt plus generated output) and often becomes the dominant consumer of GPU memory for long-context workloads.
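A back-of-the-envelope estimate shows why the KV cache matters. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per key-value head, per token. The model dimensions in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and do not describe any particular model.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Approximate KV-cache size for a single sequence.
    The leading factor of 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model configuration, for illustration only.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (8_000, 128_000):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.3f} GiB of KV cache")
```

Under these assumptions each token costs 128 KiB of cache, so a single 128,000-token sequence needs roughly 15.6 GiB before model weights are counted.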

Evolution of context length

In early 2023, most production language models operated with context windows of 2,000 to 8,000 tokens. By the end of 2025, 128,000-token and 200,000-token windows had become standard across frontier models, and several systems supported context windows of one million tokens or more.

| Model | Context window |
|-------|---------------|
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8K / 32K tokens |
| Claude 2 (2023) | 100K tokens |
| Claude 3 / 4 (2024–2026) | 200K / 1M tokens |
| Gemini 2.5 Pro (2025) | 2M tokens |
| GPT-5 (2025) | 400K tokens |
| Qwen2.5-1M (2025) | 1M tokens |

Techniques for extending context

Several architectural and training innovations enable long context windows. Rotary position embeddings (RoPE), usually combined with scaling methods such as position interpolation or YaRN, and Attention with Linear Biases (ALiBi) help models handle sequence lengths beyond those seen during training. Sparse attention mechanisms, including sliding-window attention, dilated attention, and the Longformer pattern, reduce the quadratic cost by having each token attend to only a subset of positions. State-space models such as Mamba avoid quadratic attention entirely, and hybrid architectures that interleave them with attention layers reduce reliance on it. FlashAttention computes exact attention with a memory-efficient kernel, while PagedAttention manages the KV cache in fixed-size blocks to reduce memory fragmentation during serving.
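To make the sparse-attention idea concrete, the sketch below builds a causal sliding-window mask in plain Python. It is an illustration of the masking pattern only, not the implementation used by any specific model. The function name and the choice of a causal window (each token sees itself and the `window - 1` tokens before it) are assumptions.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.
    Each position sees itself and the (window - 1) positions before it."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

allowed_pairs = sum(sum(row) for row in mask)
print("allowed query-key pairs:", allowed_pairs)
```

With a fixed window the number of allowed pairs grows linearly with sequence length (15 here, against 21 for full causal attention over six tokens), which is where the saving over quadratic attention comes from.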

Training data must also be adapted: hierarchical synthetic data generation and instruction-tuned long-context corpora are used to ensure that models actually learn to use the additional space rather than ignoring it.

The lost-in-the-middle problem

A large advertised context window does not automatically mean strong recall across that window. Empirical studies (Liu et al., 2023) have shown that many models exhibit a U-shaped recall curve: they attend most reliably to tokens at the very beginning and the very end of the input and neglect material in the middle, a phenomenon known as lost-in-the-middle. Newer models such as Claude Sonnet 4 and Gemini 2.5 Pro have largely closed this gap, but the effect remains a useful caution: a model with a one-million-token window may still produce better answers when given a tightly curated 20,000-token retrieval result than when handed the full corpus.
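One mitigation sometimes applied at the application layer is to reorder retrieved passages so that the most relevant ones sit at the start and end of the prompt, where recall tends to be strongest. The sketch below is a minimal version of that idea, assuming the input list is already sorted from most to least relevant. The function name is hypothetical, and whether the reordering helps depends on the model.

```python
from typing import List


def reorder_for_long_context(ranked_docs: List[str]) -> List[str]:
    """Place the highest-ranked documents at the two ends of the list
    and the lowest-ranked ones in the middle.
    Assumes ranked_docs is sorted from most to least relevant."""
    front: List[str] = []
    back: List[str] = []
    for idx, doc in enumerate(ranked_docs):
        # Alternate: ranks 1, 3, 5, ... go to the front; 2, 4, ... to the back.
        (front if idx % 2 == 0 else back).append(doc)
    return front + back[::-1]


print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
```

In the example, the two most relevant passages (`d1` and `d2`) end up first and last, and the least relevant (`d5`) lands in the middle.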

Practical implications

For application developers, context window choice affects cost, latency, and quality. Larger context allows more documents to be passed in directly, simplifying retrieval-augmented generation pipelines, but inference cost typically scales linearly or worse with input length. Most production systems combine moderate context windows with vector retrieval, semantic caching, and prompt compression to balance the trade-offs.
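The retrieval side of that trade-off can be sketched as a packing step: given scored passages from a retriever, keep the best ones that fit a token budget rather than sending the whole corpus. The function name, the `(score, text)` tuple layout, and the whitespace-based token count below are all assumptions made for illustration.

```python
from typing import List, Tuple


def pack_chunks(chunks: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Greedily select the highest-scoring chunks that fit the budget.
    Each chunk is (relevance_score, text); tokens are approximated
    by whitespace-separated words, not a real tokenizer."""
    selected: List[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text.split())
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected


retrieved = [(0.9, "a b c"), (0.2, "d e f g"), (0.7, "h i")]
print(pack_chunks(retrieved, token_budget=5))
```

A greedy pass like this is simple but not optimal; it can skip a large relevant chunk in favour of smaller ones, which is usually acceptable for prompt assembly.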

Malaysian Context — Document-Heavy Workloads and PDPA Considerations

Long-context language models are increasingly relevant in Malaysia for document-heavy workloads in financial services, legal practice, and the public sector. Maybank, CIMB, RHB, and Public Bank handle large volumes of policy documents, regulatory filings, and customer correspondence where the ability to process an entire document in a single inference simplifies compliance workflows. Bank Negara Malaysia's Risk Management in Technology (RMiT) policy and the broader regulatory regime favour deployments where evidence trails can be reconstructed from a single model call rather than a chain of retrieval steps.

Malaysian law firms and audit firms — including the Big Four offices in Kuala Lumpur — have piloted long-context models for due-diligence document review and contract analysis. The Personal Data Protection Act 2010 (PDPA) imposes obligations on the processing of personal data, and Malaysian organisations typically prefer cloud regions in Singapore or local deployments via AWS, Azure, or Google Cloud Malaysia for personal-data-bearing prompts.

The MyDIGITAL Blueprint and the National AI Roadmap published by the Ministry of Science, Technology and Innovation (MOSTI) identify language-model-driven document processing as a target adoption area for public-sector modernisation. Pilot projects within Jabatan Perdana Menteri and several Ministry of Finance agencies have evaluated long-context models for Hansard summarisation, Bahasa Malaysia legal corpus analysis, and budget document review.

Cost remains a constraint for Malaysian users. Frontier long-context APIs are priced in US dollars and metered per token, which makes prompt compression, retrieval-augmented generation, and context caching commercially important for sustainable production deployments.
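The effect of context caching on a per-token bill can be estimated with simple arithmetic. All prices in the sketch below are placeholder values in US dollars per million tokens, chosen only to make the calculation concrete. They are not taken from any provider's price list, and the function name is hypothetical.

```python
def request_cost_usd(
    cached_input_tokens: int,
    fresh_input_tokens: int,
    output_tokens: int,
    input_price: float,         # USD per million uncached input tokens
    cached_input_price: float,  # USD per million cached input tokens
    output_price: float,        # USD per million output tokens
) -> float:
    """Estimate the cost of one request under per-token metering."""
    return (
        cached_input_tokens * cached_input_price
        + fresh_input_tokens * input_price
        + output_tokens * output_price
    ) / 1_000_000


# Placeholder prices, for illustration only.
prices = dict(input_price=3.0, cached_input_price=0.3, output_price=15.0)

without_cache = request_cost_usd(0, 100_000, 1_000, **prices)
with_cache = request_cost_usd(90_000, 10_000, 1_000, **prices)
print(f"no caching:   ${without_cache:.3f} per request")
print(f"with caching: ${with_cache:.3f} per request")
```

Converting the result to ringgit requires the exchange rate at billing time, which is left out here deliberately.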

References

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
Liu, N., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism. arXiv:2307.08691.
Anthropic. (2024). Long Context Engineering Notes. Anthropic Engineering Blog.
Bank Negara Malaysia. (2023). Risk Management in Technology Policy Document. BNM.

Tags: context-window, tokens, llm, attention, inference

Type	Model architecture property
Measured in	Tokens
Typical range (2026)	8K – 2M tokens
Key constraint	Quadratic attention cost
Related concepts	Tokenisation, attention, RAG
