What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Inference (Machine Learning)

Inference is the phase in which a trained machine learning model is used to generate predictions or outputs from new input data, distinct from the earlier training phase.

5 min read | Last updated May 2026 | Infrastructure

Inference in machine learning refers to the execution of a trained model on previously unseen inputs to produce predictions, classifications, embeddings, or generated content. It is the operational counterpart to training: where training adjusts model parameters to fit data, inference holds those parameters fixed and uses them to answer queries. In production AI systems, inference accounts for the majority of compute cost over a model's lifetime, often by an order of magnitude relative to the one-time cost of training.

Inference workflow

A typical inference request passes through several stages. Input is received, validated, and tokenised or otherwise preprocessed. A serving runtime loads the model weights — often kept resident in GPU memory — and executes a forward pass through the network. The raw output is then post-processed: logits become probabilities, embeddings are normalised, generated tokens are detokenised into text, or bounding boxes are decoded into image annotations. Results are returned synchronously to the caller or written to a downstream queue.
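The stages above can be sketched as a small pipeline. This is a minimal illustration, not a production serving stack: the two-label classifier, the `WEIGHTS` table, and the function names are all hypothetical stand-ins for a trained model and its runtime.

```python
import math

# Hypothetical fixed "weights"; a real runtime would load trained parameters.
LABELS = ["negative", "positive"]
WEIGHTS = {"good": 1.5, "great": 2.0, "bad": -1.5, "awful": -2.0}


def preprocess(text: str) -> list[str]:
    """Validate the input and split it into lowercase tokens."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    return text.lower().split()


def forward(tokens: list[str]) -> list[float]:
    """Stand-in forward pass: fixed weights yield one logit per label."""
    score = sum(WEIGHTS.get(tok, 0.0) for tok in tokens)
    return [-score, score]


def softmax(logits: list[float]) -> list[float]:
    """Turn logits into probabilities (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits: list[float]) -> dict:
    """Map raw logits to a label and a confidence value."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "confidence": probs[best]}


def infer(text: str) -> dict:
    """Preprocess, run the forward pass, then post-process."""
    return postprocess(forward(preprocess(text)))
```

Calling `infer("a great film")` returns a dictionary with a label and a confidence; the parameters never change between calls, which is the defining property of inference as opposed to training.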

For large language models, inference is autoregressive: each new token is sampled from a probability distribution conditioned on the prompt and all previously generated tokens, requiring a separate forward pass per token. Techniques such as key-value caching, speculative decoding, and continuous batching are used to amortise compute and improve throughput.
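To make the role of key-value caching concrete, the sketch below runs a toy single-head attention model two ways: recomputing keys and values for the whole sequence at every step, and appending only the new token's key and value to a cache. The embedding tables `EMBED` and `UNEMBED` are invented for illustration, and keys, values, and queries share one table for brevity, which a real transformer would not do.

```python
import math

# Hypothetical toy parameters: token id -> vector, and output projection.
EMBED = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
UNEMBED = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.6]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [dot(query, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def next_token(hidden):
    """Greedy choice: the token with the highest logit."""
    logits = [dot(hidden, u) for u in UNEMBED]
    return max(range(len(logits)), key=logits.__getitem__)


def generate_uncached(prompt, n_new):
    """Rebuild keys and values for the full sequence on every step."""
    tokens = list(prompt)
    for _ in range(n_new):
        keys = [EMBED[t] for t in tokens]
        values = [EMBED[t] for t in tokens]
        hidden = attend(EMBED[tokens[-1]], keys, values)
        tokens.append(next_token(hidden))
    return tokens


def generate_cached(prompt, n_new):
    """Keep keys and values in a cache; each step adds one entry."""
    keys, values = [], []

    def step(token):
        keys.append(EMBED[token])
        values.append(EMBED[token])
        return attend(EMBED[token], keys, values)

    tokens = list(prompt)
    hidden = None
    for t in tokens:  # prefill: process the prompt once
        hidden = step(t)
    for _ in range(n_new):  # decode: one new token per step
        new = next_token(hidden)
        tokens.append(new)
        hidden = step(new)
    return tokens
```

Both functions produce the same tokens for a non-empty prompt. The cached version does a constant amount of projection work per new token, whereas the uncached version repeats work proportional to the sequence length.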

Key performance dimensions

Inference engineering centres on four measurable properties.

Latency is the time between request submission and response delivery. For chat assistants, time-to-first-token dominates perceived responsiveness, while time-per-output-token governs how quickly the rest of the response streams to the user.
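Both latency figures can be derived from timestamps recorded while a response streams. The helper below is a minimal sketch; the function name and the returned keys are illustrative choices, and timestamps are assumed to be in seconds on a common clock.

```python
def latency_metrics(request_time: float, token_times: list[float]) -> dict:
    """Compute streaming latency metrics from arrival timestamps.

    request_time: when the request was submitted.
    token_times: when each output token arrived, in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    n = len(token_times)
    # Time-to-first-token: delay before anything is shown to the user.
    ttft = token_times[0] - request_time
    # Time-per-output-token: mean gap between consecutive tokens.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft": ttft,
        "tpot": tpot,
        "total": token_times[-1] - request_time,
    }
```

For example, a request sent at time 0.0 whose tokens arrive at 0.5, 0.6, 0.7, and 0.8 seconds has a time-to-first-token of 0.5 s and a time-per-output-token of about 0.1 s.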

Throughput is the number of requests or tokens served per second per accelerator. Operators trade latency against throughput by batching: larger batches improve hardware utilisation but increase queueing delay.
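The batching trade-off can be illustrated with a deliberately simple cost model. It assumes a fixed per-batch overhead plus a linear per-request cost, and a server that waits for a batch to fill before running it; the default timings are hypothetical, not measurements from any accelerator.

```python
def batch_tradeoff(batch_size: int, arrival_rate: float,
                   fixed_s: float = 0.020, per_item_s: float = 0.002) -> dict:
    """Estimate throughput and worst-case latency for a batch size.

    arrival_rate: incoming requests per second (assumed steady).
    fixed_s, per_item_s: hypothetical compute costs in seconds.
    """
    if batch_size < 1 or arrival_rate <= 0:
        raise ValueError("batch_size must be >= 1 and arrival_rate > 0")
    # The first request in a batch waits for the remaining ones to arrive.
    fill_wait = (batch_size - 1) / arrival_rate
    compute = fixed_s + per_item_s * batch_size
    return {
        "throughput": batch_size / compute,      # requests per second
        "worst_latency": fill_wait + compute,    # seconds
    }


for size in (1, 8, 32):
    print(size, batch_tradeoff(size, arrival_rate=100.0))
```

Under these assumptions larger batches amortise the fixed overhead, so throughput rises, but the first request in each batch waits longer. Continuous batching exists largely to soften this trade-off.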

Cost per inference combines accelerator hours, memory bandwidth, networking, and energy. For large models served at scale, even small reductions in cost per token translate into substantial savings.
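A back-of-the-envelope calculation shows why small efficiency gains matter. The function below considers accelerator rental only and ignores networking, energy, and idle time; the hourly price and throughput in the example are made-up figures.

```python
def cost_per_million_tokens(hourly_cost: float,
                            tokens_per_second: float) -> float:
    """Accelerator cost to generate one million tokens.

    hourly_cost: price of one accelerator-hour, in any currency.
    tokens_per_second: sustained output throughput on that accelerator.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical example: 4.00 per hour at 2,000 tokens per second.
baseline = cost_per_million_tokens(4.0, 2000.0)
# A 10% throughput gain at the same hourly price.
improved = cost_per_million_tokens(4.0, 2200.0)
saving = 1 - improved / baseline
```

With these illustrative inputs the baseline is roughly 0.56 per million tokens, and a 10% throughput gain cuts the cost per token by about 9%.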

Accuracy under optimisation is the degree to which compression techniques preserve task quality. Quantisation, pruning, and distillation reduce cost at the price of some quality loss, and the trade-off must be measured against task-specific benchmarks.

Optimisation techniques

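A range of techniques reduce inference cost. Quantisation, speculative decoding, and compilation can be applied to an already trained model, whereas pruning usually benefits from fine-tuning afterwards and distillation trains a new, smaller model.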

Quantisation reduces the numerical precision of weights and activations from 32-bit or 16-bit floating point to 8-bit integer or lower. Formats such as FP8 and INT4 preserve most of the model's accuracy on many tasks while cutting weight memory and bandwidth to roughly a half and a quarter, respectively, of a 16-bit baseline (a quarter and an eighth of a 32-bit one).
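As a concrete illustration, the sketch below applies symmetric per-tensor INT8 quantisation to a list of weights and estimates weight memory at different bit widths. It is one simple scheme among many; production toolchains typically use per-channel or per-group scales and calibration data, and the function names here are illustrative.

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantisation to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised: list[int], scale: float) -> list[float]:
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantised]


def weight_gigabytes(n_params: float, bits: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits / 8 / 1e9


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
# Rounding error per weight is at most half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

For a model with 7 billion parameters, `weight_gigabytes` gives 28 GB at 32 bits, 14 GB at 16 bits, 7 GB at 8 bits, and 3.5 GB at 4 bits, before accounting for activations or the KV cache.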

Pruning removes weights or attention heads whose contribution is small. Structured pruning yields speed-ups on commodity hardware; unstructured pruning requires sparse-matrix support.
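The difference between the two pruning styles shows up in the shape of the result. In this minimal sketch, magnitude is used as the importance criterion, which is a common but not the only choice; the matrix `W` is an arbitrary example.

```python
def prune_unstructured(matrix: list[list[float]],
                       sparsity: float) -> list[list[float]]:
    """Zero the given fraction of entries with the smallest magnitude.

    The shape is unchanged, so a speed-up needs sparse-matrix kernels.
    """
    cells = sorted((abs(w), i, j)
                   for i, row in enumerate(matrix)
                   for j, w in enumerate(row))
    n_drop = int(len(cells) * sparsity)
    pruned = [row[:] for row in matrix]
    for _, i, j in cells[:n_drop]:
        pruned[i][j] = 0.0
    return pruned


def prune_structured(matrix: list[list[float]],
                     n_rows_to_drop: int) -> list[list[float]]:
    """Remove whole rows with the smallest squared L2 norm.

    The result is a smaller dense matrix that ordinary kernels can run.
    """
    norms = sorted((sum(w * w for w in row), i)
                   for i, row in enumerate(matrix))
    drop = {i for _, i in norms[:n_rows_to_drop]}
    return [row[:] for i, row in enumerate(matrix) if i not in drop]


W = [[0.9, -0.05, 0.4],
     [0.01, 0.02, -0.03],
     [-0.7, 0.6, 0.1]]
sparse_W = prune_unstructured(W, sparsity=0.5)   # still 3 x 3, with zeros
small_W = prune_structured(W, n_rows_to_drop=1)  # now 2 x 3, dense
```

Here `sparse_W` keeps its dimensions and simply contains zeros, while `small_W` has one row fewer, which is why structured pruning translates more directly into speed on commodity hardware.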

Knowledge distillation trains a smaller "student" model to mimic a larger "teacher". The student delivers similar quality on the target domain at a fraction of the inference cost.
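One common way to express "mimic the teacher" is a loss that compares temperature-softened output distributions. The sketch below computes the KL divergence from teacher to student for a single example, scaled by the squared temperature; it is an illustration of that formulation, with invented logits, and in practice it is usually combined with a standard loss on ground-truth labels.

```python
import math


def softmax_t(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by a temperature (higher is softer)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: list[float],
                      teacher_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, times T^2."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]          # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]    # nearly agrees with the teacher
far_student = [0.5, 4.0, 1.0]      # prefers a different class
```

The loss is zero when student and teacher logits match, and it is smaller for `close_student` than for `far_student`, which is the signal that gradient descent follows during student training.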

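Speculative decoding uses a small draft model to propose several tokens at a time, which the large model then verifies in a single forward pass. Each accepted token saves the large model a pass of its own, so the gain depends on how often the draft is right; Leviathan et al. (2023) report speed-ups of roughly 2x to 3x for autoregressive generation without changing the output distribution.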
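The propose-and-verify loop can be sketched with toy models. This is the greedy variant, in which a drafted token is accepted only if it equals the large model's own greedy choice; it is not the sampling-based acceptance rule of Leviathan et al. The `target` and `draft` functions are hypothetical stand-ins that map a token list to the next token, and the verification loop stands in for what a real system computes in one batched forward pass.

```python
def target(tokens: list[int]) -> int:
    """Hypothetical large model: a deterministic next-token rule."""
    return (3 * tokens[-1] + len(tokens)) % 10


def draft(tokens: list[int]) -> int:
    """Hypothetical small model: agrees with the target most of the
    time, but is wrong whenever the context length is a multiple of 5."""
    guess = (3 * tokens[-1] + len(tokens)) % 10
    return guess if len(tokens) % 5 else (guess + 1) % 10


def greedy_decode(model, prompt: list[int], n_new: int) -> list[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Return (tokens, number of target verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position in one round.
        rounds += 1
        accepted = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)  # match, correction, or bonus token
            if i < k and proposal[i] != expected:
                break  # first mismatch: keep the correction and stop
        tokens.extend(accepted)
    return tokens[:goal], rounds


spec_tokens, spec_rounds = speculative_decode(target, draft, [1, 2], 20)
plain_tokens = greedy_decode(target, [1, 2], 20)
```

The speculative output is identical to plain greedy decoding with the target model, but it needs far fewer target rounds than the 20 calls of the baseline. How many fewer depends entirely on how often the draft agrees with the target.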

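Compilers and optimised runtimes such as TensorRT, ONNX Runtime, OpenVINO, and TVM fuse operators, optimise memory layouts, and select efficient kernels for the target hardware. Serving engines such as vLLM work at a different layer: they manage request batching and KV-cache memory (Kwon et al., 2023) rather than compiling the model graph.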

Deployment patterns

Inference is deployed in several patterns. Cloud inference runs on managed services such as Amazon Bedrock, Google Vertex AI, and Azure AI, where customers pay per token or per request. Self-hosted inference runs on dedicated GPU instances, allowing greater control over data residency and cost at scale. On-device inference runs models directly on phones, laptops, vehicles, or embedded systems using runtimes such as Core ML, TensorFlow Lite, and ONNX Runtime Mobile. Edge inference sits between the device and the cloud — typically in a regional point of presence — to reduce latency and bandwidth.

Hardware

Accelerators dominate large-model inference. NVIDIA's H100 and H200 GPUs, AMD's MI300X, Google's TPU v5e and v5p, AWS Inferentia2 and Trainium, and a growing field of dedicated AI silicon from Groq, Cerebras, and SambaNova target different points on the latency–throughput–cost curve. CPUs remain competitive for small models, batch workloads, and tabular data.

Malaysian Context — Inference workloads in Malaysia

Malaysia has positioned itself as a regional inference hub. The wave of data-centre investments in Johor and Selangor (including Cyberjaya) by hyperscalers including Microsoft, Google, Amazon Web Services, Oracle, and ByteDance is driven primarily by inference demand from Southeast Asian markets, both spillover from Singapore and rising domestic usage.

Sovereign-cloud and locally hosted inference are receiving particular attention. YTL Communications' AI Data Centre Park in Kulai, TM's upcoming AI cloud, and TIME dotCom's AI infrastructure offerings target customers subject to PDPA data-residency expectations and BNM outsourcing requirements. VSTECS Berhad distributes NVIDIA inference hardware and software to system integrators.

Adoption spans regulated sectors. Maybank, CIMB, Hong Leong Bank, and Public Bank deploy inference workloads for fraud detection, document understanding, and customer chat. Petronas uses inference for predictive maintenance across upstream assets. AirAsia, Grab Malaysia, and Shopee run recommendation and pricing models with strict latency budgets. HRD Corp funds training in inference engineering and MLOps for local engineers, and universities such as Universiti Malaya, UPM, and Multimedia University offer postgraduate modules on model serving and edge AI.

References

Reddi, V. et al. (2020). MLPerf Inference Benchmark. ISCA.
Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML.
Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP.
NVIDIA. (2024). TensorRT-LLM Developer Guide. NVIDIA Corporation.
MDEC. (2024). Malaysia Digital AI Infrastructure Report. Malaysia Digital Economy Corporation.

Tags: inference, deployment, serving, latency

Type	Deployment phase of ML lifecycle
Opposite of	Training
Key metrics	Latency, throughput, cost per token
Hardware	GPU, TPU, CPU, NPU, custom ASIC
Related	Quantisation, model serving, ONNX