BEST-Route: Adaptive LLM Routing
- BEST-Route is an adaptive large language model routing framework that dynamically selects both the model and the number of responses needed to meet a defined quality threshold.
- It employs a proxy reward model and multi-head classifier to efficiently estimate response quality and guide decision-making for cost-effective LLM deployment.
- Experimental results show up to 60% cost reduction with minimal quality drop, making BEST-Route ideal for enterprise-scale applications and API providers.
BEST-Route is an adaptive LLM routing framework that combines dynamic model selection with best-of-n sampling to achieve cost-effective, quality-preserving inference at deployment time. It is designed for scenarios where LLM deployments must balance response quality against substantial infrastructure costs, such as enterprise-scale chat assistants, content generation services, or API providers operating a portfolio of models ranging from compact open-source systems to expensive, state-of-the-art commercial APIs.
1. Motivation and Problem Formulation
Traditional LLM deployment uses either a fixed model for all queries or "router" frameworks that select a single model per request based on query characteristics. However, small/inexpensive models often cannot match the quality of the largest ones with a single sample. This leads prior routers to overuse the largest, most expensive models, missing available cost savings. Empirically, generating multiple responses from small models and selecting the best can bridge much of the quality gap with large models at a lower cumulative cost.
BEST-Route formalizes the query routing problem as:
- Given: A pool of LLMs $\{M_1, \dots, M_K\}$ of increasing cost/quality, a reference model $M_{\text{ref}}$ (typically the most expensive, highest-quality model), and a user-defined quality-match threshold $t$.
- Goal: For each input query $q$, select the lowest-cost pair $(M, n)$ (model and sample count) such that the best-of-$n$ response from $M$ matches or approaches $M_{\text{ref}}$'s response quality, as judged by predicted match probabilities.
This joint model/sample-count routing generalizes and sharpens the LLM routing task by including both which model to use and how many generations to attempt.
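To make the formulation concrete, the sketch below captures these inputs in Python. All class and field names (`ModelSpec`, `RoutingProblem`, the pricing fields) are illustrative assumptions for this summary, not identifiers from the paper:

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    """One candidate LLM in the pool (illustrative fields)."""
    name: str
    input_token_price: float   # $ per input token
    output_token_price: float  # $ per output token
    avg_output_length: int     # historical average output tokens

@dataclass
class RoutingProblem:
    """Inputs to BEST-Route's joint (model, n) selection."""
    pool: list[ModelSpec]         # candidate models, cheap to expensive
    reference: ModelSpec          # e.g., GPT-4o
    candidate_n: tuple[int, ...]  # sample counts to consider, e.g., (1, 2, 4, 8)
    match_threshold: float        # user-defined quality-match threshold t
```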
2. Core Methodology: Joint Model & n-Response Routing
BEST-Route employs three main technical components:
a) Proxy Reward Model
Because running reference LLM evaluators or obtaining human labels is costly during inference, BEST-Route trains an efficient proxy reward model $R_{\text{proxy}}$ using supervised pairwise-ranking learning to approximate a gold reference reward (e.g., a high-performing reward model like ArmoRM, or human votes). Given a set of candidate responses $\mathcal{S}$ from a small model, $R_{\text{proxy}}$ assigns a score to each response and the best is selected:

$$s^{*} = \arg\max_{s \in \mathcal{S}} R_{\text{proxy}}(s)$$

This enables best-of-n selection in a computationally efficient manner.
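As a hedged illustration, the sketch below pairs one common instantiation of a pairwise ranking objective (a Bradley-Terry style logistic loss; the paper only specifies pairwise-ranking supervision, so the exact loss form is an assumption) with the best-of-$n$ selection rule. `proxy_reward` stands in for any lightweight scorer:

```python
import torch
import torch.nn.functional as F
from typing import Callable, Sequence

def pairwise_ranking_loss(score_preferred: torch.Tensor,
                          score_rejected: torch.Tensor) -> torch.Tensor:
    # Logistic (Bradley-Terry style) ranking loss: pushes the preferred
    # response's proxy score above the rejected one's.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def best_of_n(query: str,
              responses: Sequence[str],
              proxy_reward: Callable[[str, str], float]) -> str:
    # s* = argmax_{s in S} R_proxy(s): score every candidate response
    # and return the highest-scoring one.
    scores = [proxy_reward(query, r) for r in responses]
    return responses[max(range(len(responses)), key=scores.__getitem__)]
```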
b) Adaptive Multi-Head Router
A multi-head classifier is trained, taking as input an encoding $\mathbf{h}_q$ of the query $q$ (from a pre-trained transformer) and outputting per-model/per-$n$ probabilities $p_{k,n}(q)$. Each head predicts, for its candidate combination $(M_k, n)$, the likelihood that best-of-$n$ from $M_k$ will meet or exceed the reference model's response quality on $q$:

$$p_{k, n}(q) = \sigma(\mathbf{w}_{k, n}^\top \mathbf{h}_q + b_{k, n})$$

where $\sigma$ is the sigmoid function.
Training targets are empirical: for each $(q, M_k, n)$, the match label is 1 if best-of-$n$ from $M_k$ met the reference's quality, else 0. The router is supervised by a cross-entropy loss.
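A minimal PyTorch sketch of such a router follows, assuming query embeddings come from a frozen pre-trained encoder; the head layout, dimensions, and training snippet are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MultiHeadRouter(nn.Module):
    """One sigmoid head per (model k, sample count n) combination."""
    def __init__(self, embed_dim: int, num_models: int, num_n_options: int):
        super().__init__()
        self.heads = nn.Linear(embed_dim, num_models * num_n_options)
        self.num_models = num_models
        self.num_n_options = num_n_options

    def forward(self, h_q: torch.Tensor) -> torch.Tensor:
        # p[k, n] = sigmoid(w_{k,n}^T h_q + b_{k,n}) for each head.
        logits = self.heads(h_q)                  # (batch, K * N)
        probs = torch.sigmoid(logits)
        return probs.view(-1, self.num_models, self.num_n_options)

# Training: binary cross-entropy against empirical match labels,
# where labels[k, n] = 1 if best-of-n from model k matched the reference.
router = MultiHeadRouter(embed_dim=768, num_models=5, num_n_options=4)
h_q = torch.randn(32, 768)                        # batch of query embeddings
labels = torch.randint(0, 2, (32, 5, 4)).float()  # dummy match labels
loss = nn.functional.binary_cross_entropy(router(h_q), labels)
```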
c) Test-Time Optimal Selection
At inference, for a query $q$, BEST-Route evaluates all combinations $(M_k, n)$:
- Discards any $(M_k, n)$ for which $p_{k,n}(q) < t$ (the user-chosen match threshold).
- For the remaining, computes the estimated cost:

$$\text{cost}(M, n) = n \times \text{avg\_output\_length}[M] \times \text{output\_token\_price}[M] + \text{input\_length} \times \text{input\_token\_price}[M]$$

- Selects the valid combination with the lowest estimated cost; if no combination passes, routes to the reference model $M_{\text{ref}}$.
See Algorithm 1 (in the paper) for pseudocode of the end-to-end process.
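The test-time loop is short enough to sketch directly. The Python below mirrors the three steps above, reusing the illustrative `ModelSpec`/`RoutingProblem` types from the Section 1 sketch; it is a minimal reading of Algorithm 1 under those assumptions, not the paper's implementation:

```python
def estimate_cost(model: ModelSpec, n: int, input_length: int) -> float:
    # cost(M, n) = n * avg_output_length[M] * output_token_price[M]
    #              + input_length * input_token_price[M]
    return (n * model.avg_output_length * model.output_token_price
            + input_length * model.input_token_price)

def route(problem: RoutingProblem, probs, input_length: int):
    """Return the cheapest (model, n) predicted to match the reference.

    `probs[k][i]` holds the router's p_{k,n}(q) for problem.pool[k]
    and n = problem.candidate_n[i].
    """
    best, best_cost = None, float("inf")
    for k, model in enumerate(problem.pool):
        for i, n in enumerate(problem.candidate_n):
            if probs[k][i] < problem.match_threshold:
                continue  # step 1: predicted not to match reference quality
            cost = estimate_cost(model, n, input_length)  # step 2
            if cost < best_cost:                          # step 3: keep cheapest
                best, best_cost = (model, n), cost
    # Fallback: no (model, n) passed the threshold, so use the reference.
    return best if best is not None else (problem.reference, 1)
```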
3. Experimental Validation
Datasets and Models
- 10,000 mixed queries from question-answering, coding, and safety data.
- Models tested include small open-weight models (Phi-3-mini, Llama-3.1-8B, Mistral-7B), large open-weight models (Mixtral-8x22B, Codestral-22B), and commercial API models (GPT-3.5, GPT-4o).
Metrics
- Quality: ArmoRM reward model score (correlates with human judgments).
- Cost: Real token and call-based API costs as charged by model vendors.
Results
- Cost reduction: Up to 60% reduction in total deployment cost at less than 1% quality drop compared to always using GPT-4o.
- Trade-off controllability: Users select a match threshold $t$ to define their quality/cost balance. A higher $t$ leads to more GPT-4o usage for higher quality; a lower $t$ favors greater savings.
- Traffic assignment: BEST-Route shifts most queries to the cheapest $(M, n)$ combination meeting the user's target. For "easy" queries, a single response from a small model suffices. "Hard" queries route to best-of-$n$ sampling or, if needed, GPT-4o.
- Comparisons: BEST-Route outperforms N-class routers and model cascade approaches, which either overuse large models or exhibit larger quality drops at the same cost.
Cost vs. Quality Table:

| Cost Reduction (%) | Quality Drop, BEST-Route (%) | Quality Drop, N-Label Baseline (%) |
|---|---|---|
| 10 | 0.19 | 0.63 |
| 20 | 0.21 | 1.17 |
| 40 | 0.47 | 3.26 |
| 60 | 0.80 | 5.08 |
- Specialization: When including a domain model like Codestral-22B for coding tasks, BEST-Route can route relevant queries to it, achieving negative quality drop (i.e., outperforming GPT-4o on those queries).
4. Technical and Operational Details
- Latency overhead: The router and proxy reward model add minimal latency compared to LLM inference (roughly 0.6 seconds of overhead).
- Token cost estimation is highly accurate, with a mean error of less than $0.003 per query.
- No need for LLM-based judges or humans at inference: The proxy reward model allows automated best-of-n selection and routing.
- Deployment: The approach is model-agnostic and can integrate new or specialized models as they become available. It is suitable for both API aggregation and in-house multi-model hosting.
5. Implications, Limitations, and Future Work
- Industrial impact: Enables significant cost savings at scale with flexible control over quality targets, useful for cloud AI service providers and private LLM clusters.
- Extensibility: The router can incorporate specialized models, additional metrics (e.g., latency), and domain-specific reward models.
- Limitations: The accuracy of routing hinges on the fidelity of the proxy reward model to the chosen reference; mismatches may misroute queries. Scaling the router to very large model pools (e.g., $\sim$100 models) may require further architectural improvements.
- Potential research directions: Enhancements to proxy reward reliability, increased sample efficiency, real-time latency awareness, and broader task generalization.
6. Decision Processes and Key Equations
Multi-head router decision:

$$p_{k, n}(q) = \sigma(\mathbf{w}_{k, n}^\top \mathbf{h}_q + b_{k, n})$$

Best-of-$n$ selection under the proxy reward:

$$s^{*} = \arg\max_{s \in \mathcal{S}} R_{\text{proxy}}(s)$$

Estimated cost of best-of-$n$ from model $M$:

$$\text{cost}(M, n) = n \times \text{avg\_output\_length}[M] \times \text{output\_token\_price}[M] + \text{input\_length} \times \text{input\_token\_price}[M]$$

Routing rule: among all $(M_k, n)$ with $p_{k,n}(q) \geq t$, select the combination with the lowest estimated cost; if none qualifies, route to $M_{\text{ref}}$.
BEST-Route introduces a principled, empirically validated solution for LLM query routing that exploits best-of-n sampling with adaptive test-time selection of both model and sample count, delivering substantial cost savings (up to 60% in the reported experiments) while preserving near-reference quality for virtually all requests. Its innovations in multi-head routing, proxy reward modeling, and data-driven, cost-aware selection position it as a foundational approach for the next generation of cost-efficient, high-quality LLM deployments.