Inference (Machine Learning)

Inference is the phase in which a trained machine learning model is used to generate predictions or outputs from new input data, distinct from the earlier training phase.

5 min read · Last updated May 2026 · Infrastructure

Inference in machine learning refers to the execution of a trained model on previously unseen inputs to produce predictions, classifications, embeddings, or generated content. It is the operational counterpart to training: where training adjusts model parameters to fit data, inference holds those parameters fixed and uses them to answer queries. In production AI systems, inference often accounts for the majority of compute cost over a model's lifetime, because it is paid on every request while training is a largely one-time cost. For heavily used models the gap can reach an order of magnitude.

Inference workflow

A typical inference request passes through several stages. Input is received, validated, and tokenised or otherwise preprocessed. A serving runtime loads the model weights — often kept resident in GPU memory — and executes a forward pass through the network. The raw output is then post-processed: logits become probabilities, embeddings are normalised, generated tokens are detokenised into text, or bounding boxes are decoded into image annotations. Results are returned synchronously to the caller or written to a downstream queue.
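The stages above can be sketched with a toy text classifier. This is a minimal illustration, not a real serving runtime: the vocabulary, labels, and weights are hypothetical stand-ins for a trained model, and the "forward pass" is a single bag-of-words linear layer.

```python
import math

# Hypothetical vocabulary, labels, and fixed weights standing in for a trained model.
VOCAB = {"good": 0, "bad": 1, "film": 2}
LABELS = ["negative", "positive"]
WEIGHTS = [
    [-1.0, 2.0, 0.0],   # logit contributions for "negative"
    [2.0, -1.0, 0.0],   # logit contributions for "positive"
]

def preprocess(text):
    """Validate the input and tokenise it into known vocabulary ids."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    return [VOCAB[w] for w in text.lower().split() if w in VOCAB]

def forward(token_ids):
    """Forward pass: bag-of-words counts multiplied by the fixed weights."""
    counts = [0.0] * len(VOCAB)
    for t in token_ids:
        counts[t] += 1.0
    return [sum(w * c for w, c in zip(row, counts)) for row in WEIGHTS]

def postprocess(logits):
    """Softmax turns logits into probabilities; return the top label and its probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

def infer(text):
    """Full request path: preprocess, forward pass, post-process."""
    return postprocess(forward(preprocess(text)))

label, prob = infer("Good film")
```

The weights never change during `infer`, which is the defining difference from training. A production system would replace `forward` with a call into a serving runtime and add batching and error handling around it.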

For large language models, inference is autoregressive: each new token is sampled from a probability distribution conditioned on the prompt and all previously generated tokens, requiring a separate forward pass per token. Techniques such as key-value caching, speculative decoding, and continuous batching are used to amortise compute and improve throughput.
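A toy decoding loop can make the role of key-value caching concrete. The sketch below assumes a hypothetical four-token vocabulary with fixed two-dimensional embeddings and a single attention head with identity projections, and uses greedy selection in place of sampling. It is far simpler than a real transformer, but it shows the structure: one step per generated token, with keys and values appended to a cache rather than recomputed.

```python
import math

# Hypothetical 4-token vocabulary with fixed 2-d embeddings (stand-ins for trained weights).
EMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

def attend(query, keys, values):
    """Single-head dot-product attention of one query over all cached positions."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(query))]

def next_token(context_vector):
    """Greedy choice: the vocabulary entry whose embedding scores highest."""
    logits = [sum(c * e for c, e in zip(context_vector, emb)) for emb in EMBED]
    return max(range(len(logits)), key=logits.__getitem__)

def generate(prompt, steps):
    """Generate `steps` tokens, one decoding step per token, reusing cached keys/values."""
    keys, values = [], []
    for tok in prompt:                  # prefill: populate the cache from the prompt
        keys.append(EMBED[tok])
        values.append(EMBED[tok])
    out = list(prompt)
    for _ in range(steps):              # decode: only the newest token is processed
        context = attend(EMBED[out[-1]], keys, values)
        tok = next_token(context)
        out.append(tok)
        keys.append(EMBED[tok])         # extend the cache instead of recomputing it
        values.append(EMBED[tok])
    return out

tokens = generate([0, 1], steps=3)
```

Without the cache, every step would recompute keys and values for the whole sequence, so the work per token would grow with sequence length even before attention itself is applied.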

Key performance dimensions

Inference engineering centres on four measurable properties.

Latency is the time between request submission and response delivery. For chat assistants, time-to-first-token (TTFT) dominates perceived responsiveness, while time-per-output-token (TPOT) governs the rate at which the response streams to the user.
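Both latency figures can be derived from timestamps recorded while a response streams. The helper below is a minimal sketch; the function name and the convention of averaging the gaps between consecutive tokens are assumptions made for illustration, and real benchmarks may define the per-token figure slightly differently.

```python
def latency_metrics(request_time, token_times):
    """Return (time-to-first-token, mean time-per-output-token, total latency) in seconds.

    request_time: when the request was submitted.
    token_times:  arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - request_time
    total = token_times[-1] - request_time
    gaps = len(token_times) - 1
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return ttft, tpot, total

ttft, tpot, total = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
```

In this example the first token arrives 0.5 s after submission, later tokens arrive every 0.25 s, and the full response takes 1.5 s.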

Throughput is the number of requests or tokens served per second per accelerator. Operators trade latency against throughput by batching: larger batches improve hardware utilisation but increase queueing delay.
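The batching trade-off can be illustrated with a deliberately simple cost model. The sketch assumes that a batched forward pass costs a fixed overhead plus a small per-request increment, and that requests arrive at a steady interval. All three parameter values are hypothetical and chosen only to show the shape of the trade-off.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0, arrival_gap_ms=5.0):
    """Return (peak throughput in requests/s, worst-case latency in ms) for one batch size.

    Assumes a linear cost model for the forward pass and evenly spaced arrivals.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    compute_ms = fixed_ms + per_item_ms * batch_size
    throughput = batch_size / compute_ms * 1000.0
    # The first request in a batch waits for the remaining requests to arrive.
    queue_wait_ms = arrival_gap_ms * (batch_size - 1)
    return throughput, queue_wait_ms + compute_ms

small = batch_stats(1)
large = batch_stats(8)
```

Under these assumptions a batch of 8 serves roughly five times as many requests per second as a batch of 1, while the worst-case latency rises from 22 ms to 71 ms.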

Cost per inference combines accelerator time, the memory capacity and bandwidth the model requires, networking, and energy. For large models served at scale, even small reductions in cost per token translate into substantial savings.
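A back-of-envelope estimate of cost per token needs only the hourly price of an accelerator, its sustained token rate, and how busy it is kept. The function and the figures in the example are hypothetical. The estimate ignores networking, storage, and engineering overhead.

```python
def cost_per_million_tokens(hourly_cost, tokens_per_second, utilisation=1.0):
    """Estimate the accelerator cost of serving one million tokens.

    hourly_cost:       price of one accelerator-hour, in any currency.
    tokens_per_second: sustained throughput while the accelerator is busy.
    utilisation:       fraction of each hour spent serving traffic (0 < u <= 1).
    """
    if tokens_per_second <= 0 or not 0.0 < utilisation <= 1.0:
        raise ValueError("throughput must be positive and utilisation in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return hourly_cost / tokens_per_hour * 1_000_000

# Hypothetical figures: 3.60 per hour, 1,000 tokens/s, accelerator busy half the time.
estimate = cost_per_million_tokens(3.60, 1000, utilisation=0.5)
```

With these inputs the estimate is 2.00 per million tokens. Doubling utilisation or throughput halves it, which is why batching and compression have such a direct effect on serving cost.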

Accuracy under optimisation is the degree to which compression techniques preserve task quality. Quantisation, pruning, and distillation reduce cost at the price of some quality loss, and the trade-off must be measured against task-specific benchmarks.

Optimisation techniques

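A range of techniques reduce the cost of inference. Some apply directly to an already-trained model, while others, such as distillation and many forms of pruning, involve some additional training or fine-tuning.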

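Quantisation reduces the numerical precision of weights and activations from 32-bit or 16-bit floating point to 8-bit integer or lower. Modern formats such as INT4 and FP8 can preserve most of the model's accuracy. Relative to 16-bit weights, FP8 halves and INT4 quarters the memory footprint and the bandwidth needed to read the weights; relative to 32-bit weights the savings are fourfold and eightfold respectively.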
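The simplest form of weight quantisation maps each value to an 8-bit integer using a single scale factor. The sketch below shows symmetric per-tensor INT8 quantisation in plain Python. It is one scheme among many (production toolchains commonly use per-channel or per-group scales and calibration data), and the function names are illustrative.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantise_int8(weights)
restored = dequantise(codes, scale)
# Rounding error per weight is at most half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Small weights such as 0.003 collapse to zero here because one scale must cover the whole tensor, which is the main motivation for finer-grained scales.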

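Pruning removes weights, neurons, or attention heads whose contribution is small. Structured pruning removes whole rows, channels, or heads, so the resulting smaller dense model runs faster on commodity hardware. Unstructured pruning zeroes individual weights and yields speed-ups only where the hardware or kernels support sparse matrices.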
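The difference between the two pruning styles can be shown on plain lists. The sketch uses weight magnitude and row norm as the importance criteria, which are common heuristics but an assumption here, since the text does not name a criterion. The function names are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the given fraction of weights with smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def prune_rows(matrix, keep):
    """Structured pruning: keep the `keep` rows with the largest squared L2 norm."""
    norms = [sum(x * x for x in row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])        # preserve the original row order
    return [matrix[i] for i in kept]

sparse = magnitude_prune([0.1, -2.0, 0.05, 1.5], sparsity=0.5)
smaller = prune_rows([[1.0, 1.0], [0.1, 0.0], [3.0, 0.0]], keep=2)
```

`sparse` keeps its original shape and merely contains zeros, so it is only faster with sparse kernels. `smaller` is a genuinely smaller dense matrix.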

Knowledge distillation trains a smaller "student" model to mimic a larger "teacher". A well-trained student can deliver similar quality on the target domain at a fraction of the inference cost.
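A common way to make the student mimic the teacher is to minimise the divergence between their temperature-softened output distributions. The sketch below computes that loss for one example in plain Python. The temperature value and the scaling by its square follow the widely used formulation, but they are assumptions here, since the text does not specify a loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as their distributions diverge. In training it is typically combined with the ordinary loss on ground-truth labels.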

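Speculative decoding uses a small draft model to propose several tokens at a time, which the large model then verifies in a single forward pass. Tokens the large model agrees with are accepted together, so the output is unchanged while fewer large-model passes are needed. The speed-up depends on how often the draft is accepted. Published results report roughly two- to three-fold reductions in generation latency in some settings.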
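The accept-or-correct logic can be sketched with two toy "models" that are ordinary functions. This is the greedy variant, in which a drafted token is accepted only if it equals the large model's own choice. The original method also covers sampling through a rejection step, which is omitted here. Both model functions are hypothetical.

```python
def target_next(context):
    """Hypothetical large model: next token is the sum of the last two tokens mod 10."""
    return (context[-1] + context[-2]) % 10

def draft_next(context):
    """Hypothetical small model: agrees with the target unless the last token is 7."""
    return 0 if context[-1] == 7 else (context[-1] + context[-2]) % 10

def speculative_step(context, k):
    """Draft k tokens, then keep the prefix the target agrees with plus one target token."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))
    # A real model scores all drafted positions in one batched forward pass;
    # this toy calls the target function once per position instead.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)   # correct the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token if all accepted
    return accepted

def generate(prompt, n_tokens, k=4):
    """Speculative generation of n_tokens tokens after the prompt."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, k))
    return out[:len(prompt) + n_tokens]

def generate_baseline(prompt, n_tokens):
    """Ordinary decoding with the target model only, one token per step."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

fast = generate([1, 1], 12)
slow = generate_baseline([1, 1], 12)
```

`fast` and `slow` are identical by construction: every emitted token is one the target model would have chosen. The saving comes from each verification step emitting between one and `k + 1` tokens.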

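Compilation via toolchains such as TensorRT, ONNX Runtime, OpenVINO, and TVM fuses operators, optimises memory layouts, and selects efficient kernels for the target hardware. Serving engines such as vLLM are complementary rather than compilers: they focus on memory management for the key-value cache and on request batching.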

Deployment patterns

Inference is deployed in several patterns. Cloud inference runs on managed services such as Amazon Bedrock, Google Vertex AI, and Azure AI, where customers pay per token or per request. Self-hosted inference runs on dedicated GPU instances, allowing greater control over data residency and cost at scale. On-device inference runs models directly on phones, laptops, vehicles, or embedded systems using runtimes such as Core ML, TensorFlow Lite, and ONNX Runtime Mobile. Edge inference sits between the device and the cloud — typically in a regional point of presence — to reduce latency and bandwidth.

Hardware

Accelerators dominate large-model inference. NVIDIA's H100 and H200 GPUs, AMD's MI300X, Google's TPU v5e and v5p, AWS Inferentia2 and Trainium, and a growing field of dedicated AI silicon from Groq, Cerebras, and SambaNova target different points on the latency–throughput–cost curve. CPUs remain competitive for small models, batch workloads, and tabular data.
