AIWiki

LLM Routing

The practice of dynamically selecting which large language model should handle a given query in order to balance cost, latency, and output quality across a pool of models.

5 min read · Last updated July 2026 · Infrastructure

LLM routing is the practice of dynamically deciding which of several large language models should answer a given request, with the aim of balancing cost, latency, and output quality across a pool of models. Rather than sending every query to a single powerful and expensive model, a routing system inspects each request and directs it to the cheapest model likely to produce an acceptable answer, escalating only harder requests to stronger models. As organisations deploy applications backed by many models from different providers, routing has become a standard component of production LLM infrastructure.

Motivation

Large models are costly to run and slow to respond, yet many real-world queries are simple enough for a smaller model to handle well. Sending a trivial classification task or a greeting to a frontier model wastes money and time. Routing exploits this variance in query difficulty. Routing research and industry deployments report cost reductions of roughly thirty to seventy percent with quality maintained. In some benchmarks, a matrix-factorisation router sent only a small fraction of queries to the strong model while preserving most of its quality. These savings compound at scale, which makes routing attractive for high-volume services.
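The arithmetic behind such savings is simple to sketch. The example below is illustrative only: the per-query prices are hypothetical units rather than any provider's real pricing, and the function names are invented for this article.

```python
def blended_cost(strong_fraction: float, strong_cost: float, weak_cost: float) -> float:
    """Average cost per query when `strong_fraction` of traffic goes to the strong model."""
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    return strong_fraction * strong_cost + (1.0 - strong_fraction) * weak_cost


def saving_vs_strong_only(strong_fraction: float, strong_cost: float, weak_cost: float) -> float:
    """Fractional saving relative to sending every query to the strong model."""
    return 1.0 - blended_cost(strong_fraction, strong_cost, weak_cost) / strong_cost


# Hypothetical prices in arbitrary units per query: strong = 10, weak = 1.
# With 30% of queries escalated, the blended cost is 0.3 * 10 + 0.7 * 1 = 3.7.
print(round(saving_vs_strong_only(0.3, 10.0, 1.0), 2))  # 0.63
```

The saving depends on both the price gap between the models and the share of traffic that still needs the strong model, so it says nothing about quality on its own.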

How routing decisions are made

A useful way to organise the design space is by three questions: when the decision is made, what information it uses, and how it is computed.

The decision can be made before the request reaches any model, during inference, or after a first model has produced a draft. Pre-request routing is fastest but must predict difficulty without seeing an answer. Post-response routing, often called a cascade, first tries a cheap model and only escalates when the initial answer looks inadequate.
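A cascade can be sketched in a few lines. This is a minimal illustration, assuming each model returns a confidence score alongside its answer; the stub models, the `cascade` function, and the `min_confidence` parameter are all hypothetical, and real systems often judge adequacy with a separate verifier instead.

```python
from typing import Callable, List, Tuple

# A model is any callable that returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, models: List[Model], min_confidence: float = 0.8) -> Tuple[str, int]:
    """Try models from cheapest to strongest.

    Returns the first sufficiently confident answer and the index of the
    model that produced it.
    """
    if not models:
        raise ValueError("at least one model is required")
    answer = ""
    for index, model in enumerate(models):
        answer, confidence = model(query)
        if confidence >= min_confidence:
            return answer, index
    # No model was confident: fall back to the last (strongest) model's answer.
    return answer, len(models) - 1


# Stub models standing in for real API calls.
def cheap_model(query: str) -> Tuple[str, float]:
    # Pretend the cheap model is only confident on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.4
    return "cheap: " + query, confidence


def strong_model(query: str) -> Tuple[str, float]:
    return "strong: " + query, 0.95


print(cascade("hello there", [cheap_model, strong_model]))
print(cascade("prove that the sum of two even integers is always even",
              [cheap_model, strong_model]))
```

The second call pays for both models, which is the extra latency on hard queries that cascades trade for quality retention.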

The information feeding the decision may include features of the query itself, metadata about each candidate model such as cost and known strengths, and historical performance on similar queries. Semantic routers convert the query into an embedding and use it to predict which model is most suitable, routing on a learned representation of the query's intent.
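The embedding-based idea can be illustrated with a toy example. The sketch below substitutes bag-of-words counts for a learned embedding so that it runs without external libraries; a production router would use a sentence encoder. The class name, model names, and example queries are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticRouter:
    def __init__(self, examples: Dict[str, List[str]], default: str):
        # One prototype per model: the summed embeddings of its example queries.
        self.prototypes: Dict[str, Counter] = {}
        for model, queries in examples.items():
            prototype: Counter = Counter()
            for query in queries:
                prototype.update(embed(query))
            self.prototypes[model] = prototype
        self.default = default

    def route(self, query: str, min_similarity: float = 0.1) -> str:
        """Return the model whose prototype is most similar to the query,
        or the default model when nothing clears `min_similarity`."""
        vector = embed(query)
        best_model, best_score = self.default, min_similarity
        for model, prototype in self.prototypes.items():
            score = cosine(vector, prototype)
            if score > best_score:
                best_model, best_score = model, score
        return best_model


router = SemanticRouter(
    examples={
        "small-model": ["hello how are you", "thanks for your help", "what time is it"],
        "code-model": ["write a python function to sort a list",
                       "fix this bug in my code", "explain this stack trace"],
    },
    default="large-model",
)
print(router.route("can you fix a bug in this python function"))
print(router.route("derive eigenvalues of rotation matrices"))
```

Queries that resemble none of the examples fall through to the default model, which here is assumed to be the strongest and most expensive one.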

The computation itself ranges from simple hand-written rules, through trained classifiers such as lightweight encoder models, to reinforcement learning and cascade strategies. RouteLLM, a widely cited approach, trains routers on human preference data to decide between a strong and a weak model.
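For the strong-versus-weak case, one common pattern is to have the router emit a score for each query and compare it with a threshold chosen to meet a budget. The sketch below is not RouteLLM's API: `toy_score` is a made-up stand-in for a trained router, and the calibration simply picks the threshold that sends a target share of a sample to the strong model.

```python
from typing import Sequence


def toy_score(query: str) -> float:
    """Stand-in for a trained router: longer queries score higher, capped at 1.0."""
    return min(len(query.split()) / 20.0, 1.0)


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Pick a threshold so that about `strong_fraction` of the sample scores
    lie at or above it."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))
    if k == 0:
        return float("inf")  # never route to the strong model
    return ranked[k - 1]


def route(score: float, threshold: float) -> str:
    """Send the query to the strong model only when its score clears the threshold."""
    return "strong" if score >= threshold else "weak"


# Hypothetical router scores collected from a sample of past queries.
sample_scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5, 0.4, 0.8, 0.6, 0.05]
threshold = calibrate_threshold(sample_scores, strong_fraction=0.3)
print(threshold)  # 0.7
print(route(toy_score("hi"), threshold))  # weak
```

Lowering `strong_fraction` reduces cost at the risk of more misrouted hard queries, so the threshold is usually tuned against a quality metric as well as a budget.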

| Strategy | Timing | Trade-off |
| --- | --- | --- |
| Rule-based routing | Before request | Simple, but brittle |
| Classifier or embedding router | Before request | Learns difficulty, needs training data |
| Cascade | After first response | High quality retention, extra latency on hard queries |
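The simplest strategy, rule-based routing, can be sketched as an ordered list of patterns. The rules, model names, and length cut-off below are hypothetical and chosen only to show why such routers are easy to write and easy to fool.

```python
import re

# Hypothetical rules: (pattern, model name). The first matching rule wins.
RULES = [
    (re.compile(r"\b(prove|derive|proof)\b", re.IGNORECASE), "strong-model"),
    (re.compile(r"\b(traceback|exception|compile)\b", re.IGNORECASE), "code-model"),
]


def route_by_rules(query: str, max_weak_chars: int = 200) -> str:
    """Route on keywords first, then on length; default to the weak model."""
    for pattern, model in RULES:
        if pattern.search(query):
            return model
    if len(query) > max_weak_chars:
        return "strong-model"
    return "weak-model"


print(route_by_rules("Prove that sqrt(2) is irrational"))        # strong-model
print(route_by_rules("Demonstrate that sqrt(2) is irrational"))  # weak-model
```

The two queries ask for the same thing, but only the first contains a listed keyword, which illustrates the brittleness of hand-written rules.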

Relationship to other techniques

Application-level routing, which chooses among separate models, is conceptually related to but distinct from the mixture-of-experts architecture, in which a gating network routes tokens among expert sub-networks inside a single model. LLM gateways often bundle routing with other cross-cutting concerns such as caching, rate limiting, observability, and failover across providers. Open-source projects, including semantic routers integrated with serving engines, have made routing accessible beyond large technology companies. The main risks are misrouting, where a query is sent to a model too weak to handle it, and the added complexity of maintaining and evaluating the router itself.
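Failover across providers, one of the gateway concerns mentioned above, can be sketched as an ordered retry loop. The provider names, stub functions, and `AllProvidersFailed` exception are hypothetical; a real gateway would also add timeouts, backoff, and narrower error handling.

```python
from typing import Callable, List, Tuple

# A provider is a (name, call) pair; `call` takes a query and returns an answer.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has raised an error."""


def call_with_failover(query: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(query)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers standing in for real API clients.
def flaky_provider(query: str) -> str:
    raise TimeoutError("upstream timed out")


def backup_provider(query: str) -> str:
    return "answer to: " + query


print(call_with_failover("ping", [("primary", flaky_provider), ("backup", backup_provider)]))
```

Because the fallback provider may serve a different model, failover can change answer quality as well as availability, which is one reason gateways pair it with observability.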

References

  1. Ong, I., et al. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv.
  2. vLLM Semantic Router project. (2025). Open-Source LLM Router for Mixture-of-Models. vllm-semantic-router.com.
  3. Survey on Dynamic Routing for LLMs. (2026). Towards Generalized Routing: Model and Agent Orchestration. arXiv.