Model Serving

Model serving is the discipline of deploying trained machine learning models behind APIs or runtimes so that production applications can request predictions at scale with predictable latency, throughput, and reliability.

5 min read | Last updated May 2026 | Infrastructure

Model serving is the engineering practice of making trained machine learning models available to production applications, usually through a network API, with the operational properties required of any production service: defined latency budgets, predictable throughput, graceful scaling under load, observability, fault tolerance, version control, and security. It is one of the core disciplines of MLOps and sits at the boundary between model development and production engineering.

Why a dedicated layer exists

Naive deployment patterns — loading a model directly inside an application process — work for prototypes but break under production conditions. Memory footprints of large models exceed those of typical application servers; GPU resources demand careful allocation; tail latency under traffic spikes requires batching and autoscaling; multiple model versions must coexist for A/B tests and canary rollouts; observability must capture both system metrics and prediction-level diagnostics; and security boundaries between models and application code must be enforced. A dedicated serving layer addresses these concerns once for all consuming applications.

Architectural patterns

Several common patterns are used in practice.

Online synchronous serving answers a single request with a single prediction, used for chat assistants, search ranking, fraud scoring, and recommendation. Latency budgets are tight — typically tens to hundreds of milliseconds.

Streaming serving returns partial output as it is generated, the dominant pattern for autoregressive language models. Time-to-first-token and inter-token latency are the user-facing metrics.
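The two streaming metrics can be derived from token arrival timestamps. The following is a minimal sketch, assuming the client records a request start time and one timestamp per received token; the names `StreamMetrics` and `stream_metrics` are illustrative and do not come from any particular serving framework.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StreamMetrics:
    time_to_first_token: float       # seconds from request start to first token
    mean_inter_token_latency: float  # mean gap between consecutive tokens, seconds


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute TTFT and mean inter-token latency from arrival timestamps."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    # Gaps between each token and the one before it.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return StreamMetrics(ttft, mean_gap)


# Hypothetical timestamps (seconds) for a four-token response.
metrics = stream_metrics(request_start=0.0, token_times=[0.25, 0.5, 0.75, 1.0])
```

In practice these values are usually aggregated into percentiles across many requests rather than inspected per request.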

Batch serving scores large datasets offline on a schedule, used for nightly recommendations, marketing scores, and analytics enrichment. Throughput dominates and latency is largely irrelevant.
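A batch job of this kind can be sketched as a loop that feeds fixed-size chunks to the model, so that each forward pass processes many rows at once. This is an illustrative sketch only: `toy_model` is a stand-in for a real batched predict function, and `chunked` and `score_offline` are hypothetical helper names.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(rows: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def score_offline(
    rows: Iterable[float],
    predict_batch: Callable[[List[float]], List[float]],
    batch_size: int,
) -> List[float]:
    """Score every row, calling the model once per chunk."""
    scores: List[float] = []
    for chunk in chunked(rows, batch_size):
        scores.extend(predict_batch(chunk))
    return scores


def toy_model(batch: List[float]) -> List[float]:
    # Stand-in for a real model: doubles each input value.
    return [2.0 * x for x in batch]


scores = score_offline(range(5), toy_model, batch_size=2)
```

Larger batch sizes generally raise throughput until accelerator memory or compute saturates, which is why the batch size is left as a tunable parameter here.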

Embedded or edge serving runs the model inside the consuming application or device, removing network hops at the cost of update friction.

Components of a serving stack

A production serving stack typically contains several layers stacked from hardware upward.

A runtime — Triton Inference Server, vLLM, TorchServe, TensorFlow Serving, BentoML, Ray Serve, ONNX Runtime Server, or a model-specific server such as Text Generation Inference (TGI) — loads the model and executes forward passes, often with optimised kernels.

An orchestrator — typically Kubernetes via projects such as KServe, Seldon Core, or KubeRay — places serving pods on GPUs, scales replicas with traffic, and routes requests across model versions.
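Replica scaling with traffic is commonly driven by a ratio of observed to target load. The sketch below follows the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)); it omits the tolerance band and stabilisation windows of the real controller, and the function name, replica limits, and metric values are assumptions made for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Proportional scaling rule, clamped to configured replica limits."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical metric: average GPU utilisation (%) per replica, target 60%.
scale_up = desired_replicas(current_replicas=4, current_metric=90.0, target_metric=60.0)
scale_down = desired_replicas(current_replicas=4, current_metric=30.0, target_metric=60.0)
capped = desired_replicas(current_replicas=8, current_metric=120.0, target_metric=60.0)
```

For LLM workloads the metric is often queue depth or in-flight requests rather than utilisation, but the same proportional rule applies.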

A gateway layer handles authentication, rate limiting, request shaping, tokenisation, and routing across providers. It is often implemented with Envoy, a general-purpose API gateway, or a dedicated LLM gateway such as LiteLLM or Portkey.
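Rate limiting at the gateway is frequently implemented with a token bucket, which permits short bursts while capping the sustained request rate. The following is a minimal single-process sketch, not the implementation used by any of the gateways named above; the class name, parameters, and injectable clock are assumptions chosen to keep the example deterministic.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demonstration with a fake clock instead of wall time.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # third request exceeds the burst size
fake_now[0] = 1.0                           # one second later, one token refilled
after_refill = bucket.allow()
```

For LLM traffic the `cost` argument could represent prompt tokens rather than requests, so that long prompts consume more of the budget.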

An observability layer captures latency histograms, GPU utilisation, batch sizes, prompt and completion content (with privacy controls), evaluation metrics, and prediction drift. Common tools include Prometheus, Grafana, OpenTelemetry, Datadog, Arize, WhyLabs, and Langfuse.
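A latency histogram of the kind mentioned above can be sketched with fixed bucket boundaries, in the style of bucketed histograms used by metrics systems. This is an illustrative stand-alone version and not the API of Prometheus or any other listed tool; the class name, bucket bounds, and sample latencies are assumptions.

```python
from bisect import bisect_left
from typing import List, Sequence


class LatencyHistogram:
    """Count observations in buckets defined by inclusive upper bounds (ms)."""

    def __init__(self, upper_bounds_ms: Sequence[float]) -> None:
        self.bounds: List[float] = sorted(upper_bounds_ms)
        # One counter per bound, plus a final overflow bucket.
        self.counts: List[int] = [0] * (len(self.bounds) + 1)

    def observe(self, latency_ms: float) -> None:
        # bisect_left places a value equal to a bound inside that bound's bucket.
        self.counts[bisect_left(self.bounds, latency_ms)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Return the upper bound of the bucket containing quantile q."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations recorded")
        target = q * total
        seen = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            seen += count
            if seen >= target:
                return bound
        return float("inf")


hist = LatencyHistogram([10, 50, 100, 500])
for latency in [5, 12, 40, 45, 60, 80, 90, 120, 300, 700]:
    hist.observe(latency)

p50_bound = hist.quantile_upper_bound(0.5)
p95_bound = hist.quantile_upper_bound(0.95)
```

A result of infinity for a high percentile signals that the top bucket boundary is too low to resolve tail latency, which is a common reason to revisit bucket layout.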

A model registry — MLflow, Vertex AI Model Registry, SageMaker Model Registry, Weights & Biases, or BentoML's bento store — versions trained artefacts and links them to deployment manifests.

Optimisation techniques

Serving teams routinely apply continuous batching, paged attention, key-value caching, quantisation (INT8, FP8, INT4), speculative decoding, tensor parallelism across multiple GPUs, and graph compilation through TensorRT, ONNX Runtime, or compiler stacks such as XLA and TVM. Several of these, notably continuous batching and paged attention, are built into serving engines such as vLLM. The right combination depends on model architecture, accelerator, batch size, and latency budget.
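The memory pressure that motivates key-value caching optimisations, paged attention, and quantisation can be estimated with simple arithmetic: each token stores one key and one value vector per layer. The sketch below assumes a hypothetical decoder with 32 layers, 32 key-value heads, and a head dimension of 128 (roughly the shape of a 7B-parameter model); the function names and the block size of 16 tokens are illustrative assumptions.

```python
import math


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """Bytes of KV cache per token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def paged_blocks_needed(seq_len: int, block_size: int) -> int:
    """Fixed-size cache blocks required to hold a sequence of seq_len tokens."""
    return math.ceil(seq_len / block_size)


# Assumed model shape; FP16 stores 2 bytes per value, INT8 stores 1.
fp16_per_token = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=2)
int8_per_token = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=1)

# Cache size for one 4096-token sequence, in GiB.
fp16_seq_gib = fp16_per_token * 4096 / 2**30

# A 1000-token sequence with 16-token blocks wastes at most one partial block.
blocks = paged_blocks_needed(seq_len=1000, block_size=16)
```

Under these assumptions a single long sequence consumes about 2 GiB of cache in FP16, which illustrates why cache layout and precision have a direct effect on how many requests can be batched on one accelerator.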

Common deployment patterns

| Pattern | Trade-off | Typical use |
|---|---|---|
| Single replica per model | Simple, no isolation across versions | Internal services |
| Blue/green deployment | Atomic cutover, double cost briefly | Critical paths |
| Canary release | Gradual rollout, slow rollback | LLM upgrades |
| Shadow traffic | Compare versions on real traffic | Pre-launch validation |
| Multi-model server | High GPU utilisation, shared failure modes | Diverse internal models |
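A canary release needs a rule for deciding which requests reach the new version. One common approach, sketched here under stated assumptions, hashes a stable request key so that the same caller is always routed the same way and raising the canary fraction only adds callers to the canary group. The function name `route`, the key format, and the 5% fraction are illustrative; production systems usually express this as traffic weights in the orchestrator or gateway.

```python
import hashlib


def route(request_key: str, canary_fraction: float) -> str:
    """Deterministically assign a request key to 'canary' or 'stable'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to one of 10,000 buckets using the first 8 bytes of the hash.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "stable"


# Hypothetical user identifiers routed with a 5% canary.
assignments = [route(f"user-{i}", 0.05) for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
```

Setting the fraction to zero acts as an immediate rollback, and the same hashing rule can be reused for shadow traffic by copying, rather than redirecting, the selected requests.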

Managed services

Managed platforms have absorbed much of the operational complexity. Hyperscaler offerings (Amazon Bedrock, Amazon SageMaker, Google Vertex AI, Azure AI Foundry), model-provider APIs (OpenAI API, Anthropic API), and specialist inference hosts (Together AI, Fireworks AI, Replicate, Hugging Face Inference Endpoints) all offer hosted serving, typically with autoscaling, region selection, and usage-based billing (per token or per unit of compute time, depending on the service). Self-hosted serving remains common for organisations with strict data-residency requirements, large fixed workloads, or obligations in regulated industries.

References

  1. Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP.
  2. NVIDIA. (2024). Triton Inference Server Documentation. NVIDIA Corporation.
  3. KServe Project. (2024). KServe Architecture and Concepts. Cloud Native Computing Foundation.
  4. Hugging Face. (2024). Text Generation Inference (TGI) Documentation. Hugging Face.
  5. BNM. (2023). Risk Management in Technology (RMiT) Policy Document. Bank Negara Malaysia.