What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

vLLM

vLLM is an open-source library for fast and memory-efficient large language model inference and serving, built around the PagedAttention algorithm for optimised GPU memory management.

6 min readLast updated June 2026Infrastructure

vLLM is an open-source library designed for high-throughput and memory-efficient serving of large language models. Originally developed by researchers at the UC Berkeley Sky Computing Lab and published in 2023, vLLM introduced PagedAttention, a novel memory management algorithm that treats the KV cache in transformer models analogously to virtual memory paging in operating systems. The library has since become a standard component in production LLM deployment pipelines and is used at scale by organisations including Meta, Mistral AI, and Cohere.

Background

Deploying large language models at scale involves a fundamental memory management challenge. During autoregressive text generation, transformer models maintain a key-value (KV) cache that stores intermediate attention computations for all tokens in the context window. In naive implementations, this cache is allocated as a contiguous block of GPU memory sized for the maximum possible sequence length. This approach leads to two forms of waste: internal fragmentation when actual sequences are shorter than the maximum, and external fragmentation when multiple requests of varying lengths compete for GPU memory. The Berkeley team estimated that naive implementations waste 60-80% of GPU memory allocated to KV caches.

PagedAttention

PagedAttention, the core algorithmic contribution of vLLM, manages KV cache memory in non-contiguous fixed-size blocks rather than pre-allocated contiguous regions. Each block holds key-value vectors for a fixed number of tokens, and a logical-to-physical block mapping allows the system to compose a sequence's cache from non-adjacent physical blocks. This design has several consequences.

Memory waste is nearly eliminated because physical blocks are only allocated as tokens are generated, not speculatively for the maximum length. Blocks can be shared between concurrent requests that share a common prefix, such as a system prompt applied to many user queries. When a sequence ends, its blocks are immediately returned to a pool for reuse by new requests. Together, these properties allow vLLM to serve significantly more concurrent requests on the same hardware than alternative implementations.

Benchmark comparisons published in the original vLLM paper showed 2-24x higher throughput than Hugging Face Transformers on identical hardware, depending on model size, batch composition, and sequence length distribution.

Architecture and Components

vLLM operates as an inference server with two primary layers. The engine layer handles scheduling, memory management, and model execution. The API layer exposes an OpenAI-compatible REST interface supporting both the chat completions and text completions endpoints, allowing client applications built for the OpenAI API to switch to vLLM without code changes.

The scheduler in vLLM uses a preemption mechanism to handle cases where GPU memory becomes exhausted during generation. When memory pressure is detected, the scheduler can swap KV cache blocks to CPU RAM or recompute them from scratch when the sequence resumes, enabling graceful degradation rather than request failure.

Key Features

vLLM supports continuous batching, which processes tokens from different requests in the same forward pass rather than waiting for requests to form a complete batch before beginning computation. This reduces latency for interactive use cases. The library also supports speculative decoding, prefix caching, chunked prefill, guided decoding (constrained generation to JSON schemas or regular expressions), and LoRA adapters for fine-tuned models.

| Feature | Description | |---|---| | PagedAttention | Block-based KV cache for near-zero memory waste | | Continuous batching | Processes mixed requests without fixed batch boundaries | | Prefix caching | Reuses cached KV blocks for shared prompt prefixes | | Speculative decoding | Uses a smaller draft model to accelerate output | | Multi-modal support | Handles image and text inputs for vision-language models | | Tensor parallelism | Distributes model layers across multiple GPUs |

Performance and Use Cases

vLLM is optimised for server-side deployments handling many concurrent users. The library is less suitable than Ollama or llama.cpp for single-user local inference because its Python-based architecture introduces overhead that matters at low concurrency. At scale, however, vLLM consistently outperforms alternatives. Stripe reported a 73% reduction in inference costs after migrating to vLLM for an internal service processing 50 million daily API calls.

The library integrates with model hubs such as Hugging Face and supports models in various formats including safetensors and GGUF. Popular models served via vLLM in production include Llama 3.1, Mistral, Qwen 2.5, and DeepSeek.

Versions and Ecosystem

The vLLM project transitioned to a V1 engine architecture in 2025 that further improved scheduling efficiency and added disaggregated prefill/decode stages, allowing the computationally intensive prefill phase to be separated from the incremental decode phase and allocated to different hardware. This separation is relevant for optimising the mix of latency and throughput in multi-tenant deployments.

Malaysian Context — Production AI Serving Infrastructure

Malaysian enterprises and cloud-hosted services deploying large language models in production have increasingly adopted vLLM as a standard serving layer. The library's OpenAI-compatible API means that organisations building on top of the OpenAI API can migrate to locally-hosted open-weight models served by vLLM with minimal code changes, an important consideration for organisations seeking to reduce dependency on overseas API providers.

AWS Malaysia, Microsoft Azure Malaysia, and Google Cloud Malaysia all offer GPU-equipped virtual machine instances (such as instances with NVIDIA A100 or H100 accelerators) suitable for running vLLM in production. Malaysian financial institutions and government agencies exploring on-premises model deployment for data sovereignty reasons can use vLLM on GPU servers co-located in Malaysian data centres to maintain local data residency.

Malaysian technology companies participating in the MSC Malaysia ecosystem and startups backed by initiatives such as MDEC's Tech Accelerator have used vLLM to build internal AI tools for document analysis, customer service automation, and code generation without routing sensitive data to external cloud APIs.

MDEC's AI Centre of Excellence and Hugging Face have together promoted awareness of open-source inference tools among Malaysian developers. Training programmes offered through HRD Corp-accredited providers increasingly include vLLM deployment as a module within MLOps curricula, reflecting industry demand for engineers with practical experience in production LLM serving.

The Teragrid AI platform, based in Penang, uses open-source serving infrastructure including vLLM-compatible endpoints to deliver AI capabilities to Malaysian enterprise clients, illustrating how production-grade open-source inference tooling is being adopted across the local AI ecosystem.

References

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., ... and Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). ACM.
vLLM Team. (2025). Inside vLLM: Anatomy of a High-Throughput LLM Inference System. vLLM Blog. https://vllm.ai/blog/
vLLM Documentation. (2026). vLLM: Easy, Fast, and Cheap LLM Serving. https://docs.vllm.ai/
MDEC. (2025). Malaysia AI Ecosystem Report 2025. Malaysia Digital Economy Corporation.

Tags:inference model-serving gpu high-throughput

Type	LLM inference and serving library
Developed by	UC Berkeley Sky Computing Lab
Initial release	2023
License	Apache 2.0
Key innovation	PagedAttention
Related	Ollama, TGI, TensorRT-LLM

Background

PagedAttention

Architecture and Components

Key Features

Performance and Use Cases

Versions and Ecosystem

See Also

References