Model Serving

Model serving is the discipline of deploying trained machine learning models behind APIs or runtimes so that production applications can request predictions at scale with predictable latency, throughput, and reliability.

5 min read · Last updated May 2026 · Infrastructure

Model serving is the engineering practice of making trained machine learning models available to production applications, usually through a network API, with the operational properties required of any production service: defined latency budgets, predictable throughput, graceful scaling under load, observability, fault tolerance, version control, and security. It is one of the core disciplines of MLOps and sits at the boundary between model development and production engineering.

Why a dedicated layer exists

Naive deployment patterns — loading a model directly inside an application process — work for prototypes but break under production conditions. Memory footprints of large models exceed those of typical application servers; GPU resources demand careful allocation; tail latency under traffic spikes requires batching and autoscaling; multiple model versions must coexist for A/B tests and canary rollouts; observability must capture both system metrics and prediction-level diagnostics; and security boundaries between models and application code must be enforced. A dedicated serving layer addresses these concerns once for all consuming applications.

Architectural patterns

Several common patterns are used in practice.

Online synchronous serving answers a single request with a single prediction, used for chat assistants, search ranking, fraud scoring, and recommendation. Latency budgets are tight — typically tens to hundreds of milliseconds.

Streaming serving returns partial output as it is generated, the dominant pattern for autoregressive language models. Time-to-first-token and inter-token latency are the user-facing metrics.
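The two streaming metrics named above can be measured on the client side. The following is a minimal sketch, assuming a hypothetical `fake_token_stream` generator that stands in for a real model endpoint; the function names and the 1 ms delay are illustrative only and do not come from any particular runtime.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def fake_token_stream(tokens: List[str], delay_s: float = 0.001) -> Iterator[str]:
    """Hypothetical stand-in for a model that emits tokens one at a time."""
    for token in tokens:
        time.sleep(delay_s)  # simulate per-token decode time
        yield token


def measure_stream(stream: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a token stream and return (text, ttft_s, mean_inter_token_s)."""
    start = time.perf_counter()
    arrivals: List[float] = []
    pieces: List[str] = []
    for token in stream:
        arrivals.append(time.perf_counter())  # timestamp each token on arrival
        pieces.append(token)
    if not arrivals:
        return "", 0.0, 0.0
    ttft = arrivals[0] - start  # time-to-first-token
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    return "".join(pieces), ttft, mean_gap


text, ttft, gap = measure_stream(fake_token_stream(["Model", " serving", " works"]))
print(f"text={text!r} ttft={ttft * 1000:.2f} ms inter-token={gap * 1000:.2f} ms")
```

In a real deployment the same timestamps would be taken around the HTTP or gRPC stream rather than a local generator.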

Batch serving scores large datasets offline on a schedule, used for nightly recommendations, marketing scores, and analytics enrichment. Throughput dominates and latency is largely irrelevant.
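A batch job of this kind can be sketched as a loop that scores fixed-size chunks of a dataset. The example below is illustrative only: `score_batch` is a hypothetical placeholder for a real model call, and the chunk size is an assumption that would normally be tuned to the accelerator's memory.

```python
from typing import Iterable, Iterator, List


def chunked(rows: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[float] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final partial chunk
        yield batch


def score_batch(batch: List[float]) -> List[float]:
    """Hypothetical model: a fixed linear function applied to every row."""
    return [2 * x + 1 for x in batch]


def run_batch_job(rows: Iterable[float], batch_size: int) -> List[float]:
    """Score every row, one chunk at a time, preserving input order."""
    scores: List[float] = []
    for batch in chunked(rows, batch_size):
        scores.extend(score_batch(batch))
    return scores


print(run_batch_job([1, 2, 3, 4, 5], batch_size=2))  # [3, 5, 7, 9, 11]
```

Larger chunks usually raise throughput because each forward pass amortises fixed overhead across more rows, which matters more here than the latency of any single row.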

Embedded or edge serving runs the model inside the consuming application or device, removing network hops at the cost of update friction.

Components of a serving stack

A production serving stack typically contains several layers stacked from hardware upward.

A runtime — Triton Inference Server, vLLM, TorchServe, TensorFlow Serving, BentoML, Ray Serve, ONNX Runtime Server, or a model-specific server such as Text Generation Inference (TGI) — loads the model and executes forward passes, often with optimised kernels.

An orchestrator — typically Kubernetes via projects such as KServe, Seldon Core, or KubeRay — places serving pods on GPUs, scales replicas with traffic, and routes requests across model versions.

A gateway layer handles authentication, rate limiting, request shaping, tokenisation, and routing across providers. It is often implemented with Envoy, a general-purpose API gateway, or a dedicated LLM gateway such as LiteLLM or Portkey.
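Rate limiting at the gateway is commonly built on a token bucket. The sketch below is a generic illustration, not the implementation used by any of the gateways named above; the `TokenBucket` class and its parameters are hypothetical, and the clock is passed in explicitly so that the behaviour is deterministic.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_s=1.0, capacity=2.0)
decisions = [
    bucket.allow(now=0.0),  # first request uses the burst allowance
    bucket.allow(now=0.0),  # second request empties the bucket
    bucket.allow(now=0.0),  # third request is rejected
    bucket.allow(now=1.0),  # one second later, one token has been refilled
]
print(decisions)  # [True, True, False, True]
```

For LLM traffic the `cost` argument could be set to the number of prompt tokens rather than 1, so that large requests consume more of the budget.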

An observability layer captures latency histograms, GPU utilisation, batch sizes, prompt and completion content (with privacy controls), evaluation metrics, and prediction drift. Common tools include Prometheus, Grafana, OpenTelemetry, Datadog, Arize, WhyLabs, and Langfuse.
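Latency is usually reported as percentiles rather than an average, because a small number of slow requests can dominate user experience. The following sketch uses the nearest-rank method on made-up latency samples; production systems would normally derive the same figures from histogram buckets exported by the monitoring tool.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative request latencies in milliseconds, including one slow outlier.
latencies_ms: List[float] = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean_ms} ms p50={p50} ms p99={p99} ms")
```

Here the median is 13 ms while the mean is 37 ms and the 99th percentile is 250 ms, which shows why tail percentiles are tracked separately.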

A model registry — MLflow, Vertex AI Model Registry, SageMaker Model Registry, Weights & Biases, or BentoML's bento store — versions trained artefacts and links them to deployment manifests.
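The core idea of a registry, versioned artefacts with a stage that deployments can query, can be sketched in a few lines. The `InMemoryRegistry` class, the stage names, and the example URIs below are hypothetical and do not reflect the API of MLflow or any other product named above.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    artifact_uri: str
    stage: str = "staging"  # "staging", "production", or "archived"


class InMemoryRegistry:
    """Toy registry: one ordered list of versions per model name."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, artifact_uri)
        versions.append(entry)
        return entry

    def promote(self, name: str, version: int) -> None:
        """Mark one version as production and archive the previous production version."""
        versions = self._versions.get(name, [])
        if not any(v.version == version for v in versions):
            raise KeyError(f"{name} v{version} is not registered")
        updated: List[ModelVersion] = []
        for v in versions:
            if v.version == version:
                updated.append(replace(v, stage="production"))
            elif v.stage == "production":
                updated.append(replace(v, stage="archived"))
            else:
                updated.append(v)
        self._versions[name] = updated

    def production(self, name: str) -> Optional[ModelVersion]:
        for v in self._versions.get(name, []):
            if v.stage == "production":
                return v
        return None


registry = InMemoryRegistry()
registry.register("fraud-scorer", "s3://example-bucket/fraud-scorer/1")
registry.register("fraud-scorer", "s3://example-bucket/fraud-scorer/2")
registry.promote("fraud-scorer", 1)
registry.promote("fraud-scorer", 2)
live = registry.production("fraud-scorer")
print(live)
```

A deployment manifest would then reference the model by name and stage rather than by file path, so that promoting a new version does not require editing the manifest.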

Optimisation techniques

Serving teams routinely apply continuous batching, paged attention, key-value caching, quantisation (INT8, FP8, INT4), speculative decoding, tensor parallelism across multiple GPUs, and graph compilation through TensorRT, ONNX Runtime, or compiler stacks such as XLA and TVM. The right combination depends on model architecture, accelerator, batch size, and latency budget.
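Of these techniques, quantisation is the simplest to illustrate in isolation. The sketch below shows symmetric per-tensor INT8 quantisation on a short list of made-up weights; real runtimes use calibrated, often per-channel schemes and fused kernels, so this is only a conceptual example.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]  # illustrative values, not from a real model
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)  # [50, -127, 2, 100]
print(f"scale={scale:.4f} max_error={max_error:.6f}")
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is bounded by half the scale, which is the trade-off between memory footprint and accuracy that quantisation exploits.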

Common deployment patterns

| Pattern | Trade-off | Typical use |
|---|---|---|
| Single replica per model | Simple, no isolation across versions | Internal services |
| Blue/green deployment | Atomic cutover, double cost briefly | Critical paths |
| Canary release | Gradual rollout with limited blast radius, slower full rollout | LLM upgrades |
| Shadow traffic | Compare versions on real traffic | Pre-launch validation |
| Multi-model server | High GPU utilisation, shared failure modes | Diverse internal models |
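A canary release needs a rule for deciding which requests reach the new version. One common approach, sketched here under the assumption that every request carries a stable identifier, is to hash that identifier into one of 100 buckets; the `route` function and the identifier format are hypothetical.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` percent of request ids to the canary version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 99]
    return "canary" if bucket < canary_percent else "stable"


request_ids = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(r, 10) == "canary" for r in request_ids) / len(request_ids)
print(f"canary share at 10%: {canary_share:.3f}")
```

Because the bucket depends only on the identifier, a given user stays on the same version between requests, and raising the percentage only moves additional users onto the canary without moving anyone back.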

Managed services

Managed platforms have absorbed much of the operational complexity. Hyperscaler services (Amazon Bedrock, Amazon SageMaker, Google Vertex AI, Azure AI Foundry), model-provider APIs (OpenAI API, Anthropic API), and specialist inference hosts (Together AI, Fireworks AI, Replicate, Hugging Face Inference Endpoints) offer hosted serving with managed scaling and, on many of them, region selection. Billing is usage-based: typically per token on API-style services and per unit of compute time on dedicated endpoints. Self-hosted serving remains common for organisations with strict data-residency requirements, large fixed workloads, or operations in regulated industries.

Malaysian Context — Serving infrastructure and regulatory expectations

Malaysian model-serving practice is shaped by data-residency requirements under the amendments to the Personal Data Protection Act (PDPA) effective 2025, sector-specific outsourcing rules from Bank Negara Malaysia (BNM) and the Securities Commission Malaysia (SC), and critical-information-infrastructure obligations overseen by the National Cyber Security Agency (NACSA) and CyberSecurity Malaysia.

For regulated workloads, banks such as Maybank, CIMB, Hong Leong Bank, RHB, Public Bank, and AmBank combine cloud regions hosted in Malaysia with on-premises GPU clusters for sensitive scoring. Petronas Digital serves predictive-maintenance and document-AI models from internal clusters. Telcos including TM, Maxis, and CelcomDigi operate serving infrastructure for customer chat, billing, and network operations.

Local infrastructure providers — YTL Communications (AI Data Centre Park in Kulai), TM (TM ONE Sovereign Cloud), TIME dotCom, AIMS Data Centre, and NTT Malaysia — offer GPU-backed serving environments certified for relevant Malaysian compliance regimes. VSTECS Berhad distributes NVIDIA hardware and software used in many of these stacks.

Malaysian engineering teams routinely use open-source runtimes (vLLM, Triton, BentoML, KServe) on top of Kubernetes platforms managed in Cyberjaya, Kuala Lumpur, Penang, and Johor data centres. HRD Corp funds MLOps and serving-engineering training delivered by AWS, NVIDIA, Microsoft, Google, and Malaysian universities including Universiti Malaya, UTM, USM, and MMU.


Tags: model-serving, mlops, inference, deployment

Type: MLOps discipline
Goal: Production-grade inference behind APIs
Common runtimes: Triton, vLLM, TorchServe, BentoML, KServe
Key concerns: Latency, throughput, scaling, observability
Related: Inference, MLOps, ONNX, quantisation