LLM Routing

The practice of dynamically selecting which large language model should handle a given query in order to balance cost, latency, and output quality across a pool of models.

5 min read | Last updated July 2026 | Infrastructure

LLM routing is the practice of dynamically deciding which of several large language models should answer a given request, with the aim of balancing cost, latency, and output quality across a pool of models. Rather than sending every query to a single powerful and expensive model, a routing system inspects each request and directs it to the cheapest model likely to produce an acceptable answer, escalating only harder requests to stronger models. As organisations deploy applications backed by many models from different providers, routing has become a standard component of production LLM infrastructure.

Motivation

Large models are costly to run and slower to respond, yet many real-world queries are simple enough for a smaller model to handle well. Sending a trivial classification or greeting to a frontier model wastes money and time. Routing exploits this variance in query difficulty. Routing research and industry deployments report cost reductions of roughly thirty to seventy percent while maintaining quality. In specific benchmarks, a matrix-factorisation router sent only a small fraction of queries to the strong model while preserving most of its quality. These savings compound at scale, making routing attractive for high-volume services.
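The arithmetic behind such savings can be sketched as a blended cost per query. This is a minimal illustration: the prices and the traffic split are hypothetical, and the overhead of running the router itself is ignored.

```python
def blended_cost(strong_fraction: float, strong_cost: float, weak_cost: float) -> float:
    """Average cost per query when a fraction of traffic goes to the strong model."""
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    return strong_fraction * strong_cost + (1.0 - strong_fraction) * weak_cost


def savings_vs_strong_only(strong_fraction: float, strong_cost: float, weak_cost: float) -> float:
    """Fractional saving relative to sending every query to the strong model."""
    return 1.0 - blended_cost(strong_fraction, strong_cost, weak_cost) / strong_cost


# Hypothetical prices: the strong model costs 10 units per query, the weak model 1 unit.
# Hypothetical split: 30% of traffic is escalated to the strong model.
saving = savings_vs_strong_only(0.3, strong_cost=10.0, weak_cost=1.0)
print(f"Saving versus strong-only: {saving:.0%}")
```

The saving depends on both the price gap between the models and the share of traffic that still needs the strong model, so the same router can yield very different results on different workloads.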

How routing decisions are made

A useful way to organise the design space is by three questions: when the decision is made, what information it uses, and how it is computed.

The decision can be made before the request reaches any model, during inference, or after a first model has produced a draft. Pre-request routing is fastest but must predict difficulty without seeing an answer. Post-response routing, often called a cascade, first tries a cheap model and only escalates when the initial answer looks inadequate.
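A cascade can be sketched in a few lines. The example below is an illustration under stated assumptions, not a production design: models are plain callables, and `cheap_model`, `strong_model`, and `looks_adequate` are hypothetical stand-ins for real model calls and a real quality check.

```python
from typing import Callable, List, Tuple

# A model is represented as a plain callable: prompt -> answer.
Model = Callable[[str], str]


def cascade(
    prompt: str,
    models: List[Tuple[str, Model]],
    is_acceptable: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try models from cheapest to strongest and return (model_name, answer).

    Escalates whenever is_acceptable(prompt, answer) is False. The last
    model's answer is returned without a further check.
    """
    if not models:
        raise ValueError("at least one model is required")
    for name, model in models[:-1]:
        answer = model(prompt)
        if is_acceptable(prompt, answer):
            return name, answer
    name, model = models[-1]
    return name, model(prompt)


# Hypothetical stand-ins for real model calls.
def cheap_model(prompt: str) -> str:
    return "Hello!" if "hello" in prompt.lower() else "I am not sure."


def strong_model(prompt: str) -> str:
    return f"Detailed answer to: {prompt}"


def looks_adequate(prompt: str, answer: str) -> bool:
    # Toy quality check: treat an explicit hedge as an inadequate answer.
    return "not sure" not in answer.lower()


tiers = [("cheap", cheap_model), ("strong", strong_model)]
print(cascade("hello there", tiers, looks_adequate))
print(cascade("Explain PDPA consent rules", tiers, looks_adequate))
```

The extra latency of a cascade is visible in the structure: a hard query pays for the cheap model's attempt before the strong model is called.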

The information feeding the decision may include features of the query itself, metadata about each candidate model such as cost and known strengths, and historical performance on similar queries. Semantic routers convert the query into an embedding and use it to predict which model is most suitable, an approach exemplified by systems that route based on learned representations of intent.
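The embedding-based idea can be illustrated with a toy router. This sketch makes several assumptions: `embed` uses bag-of-words counts in place of a real sentence encoder, the model names are hypothetical, and routing is by nearest example query rather than a trained predictor.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real router would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticRouter:
    """Routes a query to the model whose example queries it most resembles."""

    def __init__(self, examples: Dict[str, List[str]], default: str):
        self.default = default
        self.routes = {model: [embed(q) for q in queries] for model, queries in examples.items()}

    def route(self, query: str, min_similarity: float = 0.2) -> str:
        q = embed(query)
        best_model, best_score = self.default, min_similarity
        for model, vectors in self.routes.items():
            score = max((cosine(q, v) for v in vectors), default=0.0)
            if score > best_score:
                best_model, best_score = model, score
        return best_model


# Hypothetical model names and example queries.
router = SemanticRouter(
    {
        "small-model": ["what are your opening hours", "reset my password"],
        "code-model": ["write a python function", "fix this python bug"],
    },
    default="large-model",
)
print(router.route("write a python function to sort a list"))
print(router.route("summarise the contract clause"))
```

Queries that resemble no example fall through to the default model, which is one simple way to limit misrouting when the router has little evidence.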

The computation itself ranges from simple hand-written rules, through trained classifiers such as lightweight encoder models, to reinforcement learning and cascade strategies. RouteLLM, a widely cited approach, trains routers on human preference data to decide between a strong and a weak model.
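The two ends of this range can be sketched side by side. The code below is illustrative only: the rules are arbitrary, `toy_scorer` is a hypothetical placeholder for a trained classifier, and `threshold_route` shows the general strong-versus-weak thresholding idea rather than the actual RouteLLM API.

```python
import re
from typing import Callable


def rule_based_route(query: str) -> str:
    """Hand-written rules: simple and transparent, but brittle."""
    text = query.lower()
    if len(text.split()) <= 4 and re.search(r"\b(hi|hello|thanks|thank you)\b", text):
        return "weak"
    if re.search(r"\b(prove|derive|refactor|step by step)\b", text):
        return "strong"
    return "weak"


def threshold_route(
    query: str,
    strong_needed_score: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Route to the strong model when a scorer predicts it is likely to be needed."""
    return "strong" if strong_needed_score(query) >= threshold else "weak"


def toy_scorer(query: str) -> float:
    """Placeholder for a trained classifier: longer queries simply score higher."""
    return min(len(query.split()) / 20.0, 1.0)


print(rule_based_route("hello"))
print(rule_based_route("Prove that the sum converges"))
print(threshold_route("hi", toy_scorer))
```

Raising `threshold` sends less traffic to the strong model and lowers cost at the risk of more misrouted queries, so the threshold is the main tuning knob in this kind of design.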

| Strategy | Timing | Trade-off |
| --- | --- | --- |
| Rule-based routing | Before request | Simple, but brittle |
| Classifier or embedding router | Before request | Learns difficulty, needs training data |
| Cascade | After first response | High quality retention, extra latency on hard queries |

Relationship to other techniques

Application-level routing, which chooses among distinct models, is conceptually related to the mixture-of-experts architecture but distinct from it: in mixture of experts, a gating network routes tokens among expert sub-networks inside a single model. LLM gateways often bundle routing with other cross-cutting concerns such as caching, rate limiting, observability, and failover across providers. Open-source projects, including semantic routers integrated with serving engines, have made routing accessible beyond large technology companies. The main risks are misrouting, where a query is sent to a model too weak to handle it, and the added complexity of maintaining and evaluating the router itself.
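Failover across providers, one of the gateway concerns mentioned above, can be sketched as an ordered list of fallbacks. The provider functions here are hypothetical stand-ins, and a real gateway would also need timeouts, retries, and narrower error handling.

```python
from typing import Callable, List, Tuple

# A provider is represented as a plain callable: prompt -> answer.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the list has raised an error."""


def call_with_failover(prompt: str, providers: List[Tuple[str, Provider]]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, answer) from the first that succeeds."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("timed out")


def healthy_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


print(call_with_failover("hello", [("primary", flaky_primary), ("backup", healthy_backup)]))
```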

Malaysian Context — Cost-Efficient AI Deployment

Cost efficiency is a central concern for Malaysian organisations adopting generative AI, particularly small and medium enterprises that dominate the economy and lack the budgets of large multinationals. LLM routing directly addresses this by letting a business serve most traffic with cheaper models while reserving premium models for genuinely difficult tasks, an approach that aligns with the affordability goals of the MyDigital agenda and MDEC's SME digitalisation grants.

Malaysian banks and telcos such as Maybank, CIMB, Maxis, and CelcomDigi operate high-volume customer-service and analytics workloads where routing can materially reduce inference spending. For services handling personal or financial data, routing systems must also respect the Personal Data Protection Act (PDPA) and Bank Negara Malaysia guidance, which may require that sensitive queries stay within approved, locally hosted, or on-premise models rather than being routed to external providers, adding a compliance dimension to routing policy.
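A compliance constraint of this kind can be expressed as a filter applied before cost-based selection. The sketch below is purely illustrative: the endpoint names, costs, and sensitivity patterns are hypothetical, and the patterns are not a complete or authoritative test of what PDPA or Bank Negara Malaysia guidance covers.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelEndpoint:
    name: str
    hosted_locally: bool
    cost_per_query: float  # hypothetical relative cost units


# Illustrative patterns only; a real policy would come from compliance review.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{6}-\d{2}-\d{4}\b"),  # shaped like a Malaysian identity card number
    re.compile(r"\b(account number|salary|medical)\b", re.IGNORECASE),
]


def is_sensitive(query: str) -> bool:
    return any(pattern.search(query) for pattern in SENSITIVE_PATTERNS)


def eligible_endpoints(query: str, endpoints: List[ModelEndpoint]) -> List[ModelEndpoint]:
    """Sensitive queries may only use locally hosted endpoints."""
    if is_sensitive(query):
        return [e for e in endpoints if e.hosted_locally]
    return list(endpoints)


def route(query: str, endpoints: List[ModelEndpoint]) -> ModelEndpoint:
    """Pick the cheapest endpoint among those the hosting policy allows."""
    candidates = eligible_endpoints(query, endpoints)
    if not candidates:
        raise LookupError("no endpoint satisfies the hosting policy for this query")
    return min(candidates, key=lambda e: e.cost_per_query)


endpoints = [
    ModelEndpoint("external-api", hosted_locally=False, cost_per_query=1.0),
    ModelEndpoint("on-prem", hosted_locally=True, cost_per_query=3.0),
]
print(route("What are your branch opening hours?", endpoints).name)
print(route("My IC is 900101-07-1234, please check my loan", endpoints).name)
```

Applying the policy filter first means cost optimisation can never override the hosting constraint, at the price of failing loudly when no compliant endpoint is available.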

The rise of Malaysian and regional large language models, including MaLLaM, ILMU, and the regional SEA-LION family, creates opportunities for routing systems that direct Malay-language or locally contextual queries to specialised local models while sending general English queries to global models. This mixed-provider routing lets organisations benefit from local language competence without abandoning frontier capabilities.
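Language-aware routing of this kind can be sketched with a crude heuristic. Everything here is an assumption for illustration: the model names are placeholders, and counting a short list of common Malay function words stands in for a proper language-identification model.

```python
# A small, illustrative set of common Malay function words.
MALAY_FUNCTION_WORDS = {
    "saya", "yang", "dan", "untuk", "tidak", "dengan", "ini", "itu", "boleh", "adalah",
}


def malay_share(query: str) -> float:
    """Fraction of words in the query that are known Malay function words."""
    words = [w.strip(".,?!").lower() for w in query.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in MALAY_FUNCTION_WORDS for w in words) / len(words)


def route_by_language(query: str, threshold: float = 0.2) -> str:
    """Send likely Malay queries to a local model and everything else to a global one."""
    return "local-malay-model" if malay_share(query) >= threshold else "global-model"


print(route_by_language("Saya tidak boleh log masuk ke akaun saya"))
print(route_by_language("How do I reset my password?"))
```

A word-list heuristic will struggle with code-switched text that mixes Malay and English in one sentence, which is one reason a production router would more likely rely on a trained language identifier.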

Malaysia's expanding data-centre capacity in Johor and Cyberjaya, including sovereign and commercial AI cloud offerings, supports the local hosting that compliance-sensitive routing requires, and NACSA cybersecurity guidance is relevant where routing spans multiple hosting environments.

References

Ong, I., et al. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv.
vLLM Semantic Router project. (2025). Open-Source LLM Router for Mixture-of-Models. vllm-semantic-router.com.
Survey on Dynamic Routing for LLMs. (2026). Towards Generalized Routing: Model and Agent Orchestration. arXiv.

Tags: MLOps inference cost optimization LLM serving

Type	Inference orchestration technique
Goal	Balance cost, latency, and quality
Decision input	Query features, model metadata, history
Common methods	Classifiers, cascades, embeddings
Related	Model serving, Mixture of experts, Inference