Context Window
The maximum number of tokens — including the prompt, prior conversation, retrieved documents, and the model's own output — that a large language model can process in a single forward pass.
The context window of a large language model is the maximum number of tokens that the model can process in a single forward pass. It acts as the model's working memory and includes every token in the system prompt, the user's prompt, prior conversation turns, retrieved documents passed in for grounding, and the tokens the model is currently generating. The model cannot attend to anything beyond this limit, so when the total would exceed the context window, the application layer must truncate, summarise, or drop older content before the request is sent.
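As a rough illustration of how an application layer might drop older content, the sketch below keeps the system prompt and as many of the most recent turns as fit. The names `count_tokens` and `fit_to_window` are hypothetical, and the whitespace-based token count is only a stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(system_prompt, turns, context_window, reserved_for_output):
    """Return the most recent turns that fit alongside the system prompt.

    The budget is the context window minus the space reserved for the
    model's output and the tokens used by the system prompt.
    """
    budget = context_window - reserved_for_output - count_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt and output reservation exceed the context window")

    kept = []
    # Walk backwards so the newest turns are considered first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost

    kept.reverse()  # restore chronological order
    return kept


history = ["a b c d", "e f", "g h i"]
print(fit_to_window("You are helpful", history, context_window=12, reserved_for_output=4))
```

Real systems often summarise the dropped turns instead of discarding them; this sketch only shows the simplest truncation policy.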
Why context windows are bounded
Modern language models are built on the transformer architecture, in which each token attends to the other tokens in the sequence through the self-attention mechanism (in decoder-only models, to every preceding token). The compute cost of full self-attention scales quadratically with sequence length, as does the memory of a naive implementation, so doubling the sequence length roughly quadruples the attention cost of processing a request. This quadratic scaling has historically been the primary technical reason that context windows were kept short, and a great deal of recent research is devoted to circumventing it.
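The quadratic growth can be seen by counting the query-key pairs that a single attention head scores in one layer. This is a minimal sketch; the function name `attention_score_count` is illustrative, and real cost also depends on head dimension, layer count, and kernel implementation.

```python
def attention_score_count(seq_len: int, causal: bool = False) -> int:
    """Number of query-key pairs scored by one attention head in one layer."""
    if causal:
        # Each position attends to itself and every earlier position.
        return seq_len * (seq_len + 1) // 2
    # Full (bidirectional) attention: every position attends to every position.
    return seq_len * seq_len


for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Causal masking halves the count but leaves the growth quadratic: doubling the length still roughly quadruples the work.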
Memory pressure is also a significant constraint. During inference, the key-value (KV) cache stores the key and value vectors of every token already in the context so that they need not be recomputed at each generation step. The cache grows linearly with the number of tokens in the context (prompt plus generated output) and often becomes the dominant consumer of GPU memory for long-context workloads.
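A back-of-the-envelope estimate shows the linear growth. The sketch below assumes a standard layout with one key and one value tensor per layer; the model dimensions in the example (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are hypothetical and do not describe any particular model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size in bytes for a given context length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(seq_len=1, **config)
full = kv_cache_bytes(seq_len=128_000, **config)

print(per_token // 1024, "KiB per token")        # 128 KiB per token
print(full / 2**30, "GiB at 128,000 tokens")     # 15.625 GiB at 128,000 tokens
```

Under these assumptions a single 128,000-token request needs over 15 GiB for the cache alone, before counting model weights or activations.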
Evolution of context length
In early 2023, most production language models operated with context windows of 2,000 to 8,000 tokens. By the end of 2025, 128,000-token and 200,000-token windows had become standard across frontier models, and several systems supported context windows of one million tokens or more.
| Model | Context window |
|-------|----------------|
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8K / 32K tokens |
| Claude 2 (2023) | 100K tokens |
| Claude 3 / 4 (2024–2026) | 200K / 1M tokens |
| Gemini 2.5 Pro (2025) | 1M tokens |
| GPT-5 (2025) | 400K tokens |
| Qwen2.5-1M (2025) | 1M tokens |
Techniques for extending context
Several architectural and training innovations enable long context windows. Positional schemes play a part: Attention with Linear Biases (ALiBi) was designed to extrapolate to sequences longer than those seen during training, while rotary position embeddings (RoPE) are usually extended through scaling methods such as position interpolation or YaRN, typically combined with some long-context fine-tuning. Sparse attention mechanisms, including sliding-window attention, dilated attention, and the Longformer pattern, reduce the quadratic cost by attending to a subset of positions. State-space models such as Mamba avoid quadratic attention entirely, and hybrid architectures confine it to a subset of layers. FlashAttention computes exact attention without materialising the full attention matrix in GPU memory, and PagedAttention manages the KV cache in fixed-size blocks to reduce fragmentation.
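To make the sparse-attention idea concrete, the sketch below builds a causal sliding-window mask in plain Python. The function name `sliding_window_mask` is illustrative; production implementations apply the same pattern inside fused GPU kernels instead of materialising a mask.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Each position sees itself and the (window - 1) positions before it.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

The number of allowed pairs is bounded by `seq_len * window`, so for a fixed window the cost grows linearly with sequence length instead of quadratically.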
Training data must also be adapted: hierarchical synthetic data generation and long-context instruction-tuning corpora are used to ensure that models actually learn to use the additional space rather than ignoring it.
The lost-in-the-middle problem
A long advertised context window does not automatically mean strong recall across that window. Empirical studies have shown that many models exhibit a U-shaped recall curve, attending most reliably to tokens at the very beginning and the very end of the input while neglecting material in the middle, a phenomenon known as lost-in-the-middle (Liu et al., 2023). Newer models such as Claude Sonnet 4 and Gemini 2.5 Pro are reported to have largely closed this gap, but the effect remains a useful caution: a model with a one-million-token window may still produce better answers when given a tightly curated 20,000-token retrieval result than when handed the full corpus.
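One commonly suggested mitigation, which is not described in the sources cited here and is offered only as an illustration, is to order retrieved documents so that the most relevant ones sit at the start and end of the prompt and the least relevant fall in the middle. The function name `reorder_for_edges` is hypothetical, and the input is assumed to be sorted from most to least relevant.

```python
def reorder_for_edges(docs_by_relevance):
    """Place the highest-ranked documents at both ends of the list.

    Even-ranked items (0, 2, 4, ...) fill the front in order; odd-ranked
    items (1, 3, 5, ...) fill the back in reverse, so relevance falls
    towards the middle of the returned list.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_edges(ranked))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```

Whether this helps depends on the model; for models with flat recall across the window, the ordering should make little difference.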
Practical implications
For application developers, context window choice affects cost, latency, and quality. Larger context allows more documents to be passed in directly, simplifying retrieval-augmented generation pipelines, but inference cost typically scales linearly or worse with input length. Most production systems combine moderate context windows with vector retrieval, semantic caching, and prompt compression to balance the trade-offs.
References
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
- Liu, N., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism. arXiv:2307.08691.
- Anthropic. (2024). Long Context Engineering Notes. Anthropic Engineering Blog.
- Bank Negara Malaysia. (2023). Risk Management in Technology Policy Document. BNM.