vLLM
vLLM is an open-source library for fast and memory-efficient large language model inference and serving, built around the PagedAttention algorithm for optimised GPU memory management.
vLLM is an open-source library designed for high-throughput and memory-efficient serving of large language models. Originally developed by researchers at the UC Berkeley Sky Computing Lab and published in 2023, vLLM introduced PagedAttention, a novel memory management algorithm that treats the KV cache in transformer models analogously to virtual memory paging in operating systems. The library has since become a standard component in production LLM deployment pipelines and is used at scale by organisations including Meta, Mistral AI, and Cohere.
Background
Deploying large language models at scale involves a fundamental memory management challenge. During autoregressive text generation, transformer models maintain a key-value (KV) cache that stores intermediate attention computations for all tokens in the context window. In naive implementations, this cache is allocated as a contiguous block of GPU memory sized for the maximum possible sequence length. This approach leads to two forms of waste: internal fragmentation when actual sequences are shorter than the maximum, and external fragmentation when multiple requests of varying lengths compete for GPU memory. The Berkeley team estimated that naive implementations waste 60-80% of GPU memory allocated to KV caches.
PagedAttention
PagedAttention, the core algorithmic contribution of vLLM, manages KV cache memory in non-contiguous fixed-size blocks rather than pre-allocated contiguous regions. Each block holds key-value vectors for a fixed number of tokens, and a logical-to-physical block mapping allows the system to compose a sequence's cache from non-adjacent physical blocks. This design has several consequences.
Memory waste is nearly eliminated because physical blocks are only allocated as tokens are generated, not speculatively for the maximum length. Blocks can be shared between concurrent requests that share a common prefix, such as a system prompt applied to many user queries. When a sequence ends, its blocks are immediately returned to a pool for reuse by new requests. Together, these properties allow vLLM to serve significantly more concurrent requests on the same hardware than alternative implementations.
Benchmark comparisons published in the original vLLM paper showed 2-24x higher throughput than Hugging Face Transformers on identical hardware, depending on model size, batch composition, and sequence length distribution.
Architecture and Components
vLLM operates as an inference server with two primary layers. The engine layer handles scheduling, memory management, and model execution. The API layer exposes an OpenAI-compatible REST interface supporting both the chat completions and text completions endpoints, allowing client applications built for the OpenAI API to switch to vLLM without code changes.
The scheduler in vLLM uses a preemption mechanism to handle cases where GPU memory becomes exhausted during generation. When memory pressure is detected, the scheduler can swap KV cache blocks to CPU RAM or recompute them from scratch when the sequence resumes, enabling graceful degradation rather than request failure.
Key Features
vLLM supports continuous batching, which processes tokens from different requests in the same forward pass rather than waiting for requests to form a complete batch before beginning computation. This reduces latency for interactive use cases. The library also supports speculative decoding, prefix caching, chunked prefill, guided decoding (constrained generation to JSON schemas or regular expressions), and LoRA adapters for fine-tuned models.
| Feature | Description | |---|---| | PagedAttention | Block-based KV cache for near-zero memory waste | | Continuous batching | Processes mixed requests without fixed batch boundaries | | Prefix caching | Reuses cached KV blocks for shared prompt prefixes | | Speculative decoding | Uses a smaller draft model to accelerate output | | Multi-modal support | Handles image and text inputs for vision-language models | | Tensor parallelism | Distributes model layers across multiple GPUs |
Performance and Use Cases
vLLM is optimised for server-side deployments handling many concurrent users. The library is less suitable than Ollama or llama.cpp for single-user local inference because its Python-based architecture introduces overhead that matters at low concurrency. At scale, however, vLLM consistently outperforms alternatives. Stripe reported a 73% reduction in inference costs after migrating to vLLM for an internal service processing 50 million daily API calls.
The library integrates with model hubs such as Hugging Face and supports models in various formats including safetensors and GGUF. Popular models served via vLLM in production include Llama 3.1, Mistral, Qwen 2.5, and DeepSeek.
Versions and Ecosystem
The vLLM project transitioned to a V1 engine architecture in 2025 that further improved scheduling efficiency and added disaggregated prefill/decode stages, allowing the computationally intensive prefill phase to be separated from the incremental decode phase and allocated to different hardware. This separation is relevant for optimising the mix of latency and throughput in multi-tenant deployments.
See Also
References
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., ... and Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). ACM.
- vLLM Team. (2025). Inside vLLM: Anatomy of a High-Throughput LLM Inference System. vLLM Blog. https://vllm.ai/blog/
- vLLM Documentation. (2026). vLLM: Easy, Fast, and Cheap LLM Serving. https://docs.vllm.ai/
- MDEC. (2025). Malaysia AI Ecosystem Report 2025. Malaysia Digital Economy Corporation.