Adaptive LLM Routing under Budget Constraints

Published 28 Aug 2025 in cs.LG | (2508.21141v1)

Abstract: LLMs have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. LLM routing addresses this by dynamically selecting the most suitable LLM for each query/task. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. However, real-world scenarios lack such comprehensive mappings and face evolving user queries. We thus propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback without requiring exhaustive inference across all LLMs for all queries (in contrast to supervised routing). To address this problem, we develop a shared embedding space for queries and LLMs, where query and LLM embeddings are aligned to reflect their affinity. This space is initially learned from offline human preference data and refined through online bandit feedback. We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB. To handle diverse user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, ensuring resource-efficient routing.

Summary

  • The paper proposes PILOT, which integrates human preference priors with contextual bandit feedback to enable efficient LLM routing under budget constraints.
  • PILOT achieves 93% of GPT-4's performance at only 25% of its cost in multi-LLM routing, demonstrating significant improvements over traditional approaches.
  • The study validates PILOT via extensive experiments on diverse tasks, highlighting its robustness, rapid adaptation, and low computational overhead.

Adaptive LLM Routing under Budget Constraints: A Technical Analysis

Problem Formulation and Motivation

The paper addresses the challenge of deploying multiple LLMs in real-world systems where both performance and cost constraints are critical. The central problem is dynamic LLM routing: selecting the most suitable LLM for each incoming query, given a pool of models with varying capabilities and costs. Unlike prior work that treats routing as a supervised learning problem requiring exhaustive query-LLM pairings, this work reformulates routing as a contextual bandit problem, leveraging only bandit feedback (i.e., user evaluation of the selected model's response) and enforcing budget constraints via an online cost policy.

Methodology: Preference-Prior Informed LinUCB (PILOT)

The proposed solution, PILOT, extends the LinUCB algorithm by incorporating human preference priors and an online cost policy. The approach consists of three main components:

  1. Shared Embedding Space Pretraining: Queries and LLMs are embedded into a shared space, pretrained using human preference data. Query embeddings are projected via a learned linear transformation, and LLM embeddings are optimized to align with preferred responses.
  2. Online Bandit Feedback Adaptation: The router adapts LLM embeddings online using contextual bandit feedback. The expected reward for a query-LLM pair is modeled as the cosine similarity between their normalized embeddings. PILOT initializes the bandit algorithm with preference-based priors, theoretically achieving lower regret bounds than standard LinUCB/OFUL when the prior is close to the true reward vector.
  3. Budget-Constrained Routing via Online Cost Policy: The cost policy is formulated as an online multi-choice knapsack problem, using the ZCL algorithm to allocate budget across queries. The policy dynamically selects eligible LLMs based on estimated reward-to-cost ratios and current budget utilization, with binning to avoid underutilization in finite-horizon settings.

    Figure 1: Overview of the two-phase pretraining process for query and LLM embeddings using human preference data.

    Figure 2: Bandit router framework: the router receives user queries, cost constraints, and a model pool, adapting LLM selection based on user feedback.
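The three components can be sketched together in code. The following is a minimal, hypothetical reconstruction, not the paper's exact algorithm: per-arm ridge estimates are warm-started from the preference-pretrained LLM embeddings, contexts are normalized query embeddings so the estimated reward behaves like a cosine similarity, and arm selection adds a LinUCB-style uncertainty bonus. The class name and warm-start details are assumptions.

```python
import numpy as np

class PriorInformedLinUCB:
    """Sketch of a preference-prior LinUCB router (illustrative only).

    `prior_embeddings` stands in for the LLM embeddings learned from
    offline human preference data; each row is one arm's prior.
    """

    def __init__(self, prior_embeddings, alpha=1.0, lam=1.0):
        self.alpha = alpha
        d = prior_embeddings.shape[1]
        # Ridge Gram matrix per arm; warm-start b so the initial
        # estimate theta = A^{-1} b equals the prior embedding.
        self.A = [lam * np.eye(d) for _ in prior_embeddings]
        self.b = [lam * mu.astype(float) for mu in prior_embeddings]

    def select(self, query_emb):
        x = query_emb / np.linalg.norm(query_emb)  # normalized context
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                      # per-arm reward estimate
            # Exploitation term plus exploration (uncertainty) bonus.
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, query_emb, reward):
        x = query_emb / np.linalg.norm(query_emb)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In a deployment loop, `select` would pick the LLM for each incoming query embedding and `update` would incorporate the observed bandit feedback for the chosen arm only.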

Experimental Setup

Experiments are conducted on the RouterBench dataset, which includes 36,497 samples across 64 tasks and 11 LLMs (both open-source and proprietary). The evaluation simulates online learning with a split into tuning, learning, and deployment buckets. Baselines include all-to-one routers, HybridLLM (supervised), and several contextual bandit algorithms (LinUCB, Epoch-Greedy, Explore Only, Random Policy). The cost policy is uniformly applied across all baselines for fair comparison.

Results

Performance vs. Cost

PILOT achieves 93% of GPT-4's performance at only 25% of its cost in multi-LLM routing, and 86% of GPT-4's performance at 27% of its cost in single-task routing. Across all cost budgets and learning bucket sizes, PILOT consistently outperforms bandit and supervised baselines in both deployment set performance and cumulative regret.

Figure 3: Performance vs cost, learning bucket size, and cumulative regret for PILOT and baselines on single-task and multi-task settings.

Qualitative Routing Analysis

PILOT demonstrates task-aware routing: for complex reasoning tasks (MMLU, ARC Challenge), it routes ~90% of queries to GPT-4; for coding (MBPP), Claude models handle 28% of queries; for math (GSM8K), Claude-v1 is selected for 94% of queries, reflecting cost-effective exploitation of model strengths.

Cost Policy Evaluation

The online cost policy outperforms simple per-query budget allocation and even an offline policy with perfect hindsight, as shown by higher mean reciprocal rank and deployment set performance.

Figure 4: Comparison of cost policies: mean reciprocal rank and performance across budgets.

Computational Overhead

PILOT's routing time is negligible compared to LLM inference: 0.065–0.239s for routing vs. 2.5s for GPT-4 inference, ensuring minimal latency impact.

Embedding Model Sensitivity

PILOT maintains superior performance over baselines when using alternative embedding models (Instructor-XL), indicating robustness to the choice of embedder.

Figure 5: Sensitivity analysis of PILOT's performance with different embedding models.

Binary LLM Routing and Adaptability

PILOT matches or surpasses HybridLLM in binary routing (GPT-4 vs. Mistral-7b/Mixtral-8x7b), despite requiring only bandit feedback. It adapts rapidly to shifts in query distribution, increasing exploration during drift and stabilizing post-drift.

Figure 6: Performance vs cost for PILOT and HybridLLM in binary routing scenarios.

Figure 7: Binary LLM routing evaluation: performance, learning bucket size, and cumulative regret for different LLM pairs.

Theoretical Implications

The preference-prior informed initialization in PILOT is shown to yield lower cumulative regret than standard bandit algorithms when the prior is close to the true reward vector. This provides a formal justification for leveraging human preference data in contextual bandit settings for LLM routing.

Practical Implications and Future Directions

PILOT enables adaptive, cost-efficient LLM deployment in dynamic environments, requiring only partial supervision and minimal annotation. The decoupling of bandit learning and cost policy allows for robust, user-controllable budget management. Limitations include the lack of budget constraints during online learning and focus on single-turn queries; future work should address budget-aware online learning and multi-turn conversational routing.

Conclusion

The paper presents a principled, empirically validated approach to adaptive LLM routing under budget constraints, combining preference-informed contextual bandit learning with an efficient online cost policy. PILOT achieves near state-of-the-art performance at a fraction of the cost, adapts to evolving query distributions, and is robust to embedding model choices, making it suitable for practical LLM deployment in cost-sensitive, dynamic settings.

Explain it Like I'm 14

What is this paper about?

This paper is about choosing the best AI model to answer each question while keeping costs under control. Different LLMs have different strengths and prices. The authors build a smart “router” that looks at a user’s question, decides which model is the best fit, learns from simple thumbs up/down feedback, and makes sure the total cost stays within a budget.

What questions did the researchers ask?

  • How can we automatically pick the right LLM for each query without testing every model each time (which is too expensive)?
  • How can the system learn and improve over time from simple feedback like “good answer” or “bad answer”?
  • How can we keep performance high while staying under a money or token budget?

How did they study it?

The authors treat model routing like three everyday ideas: matching, learning from feedback, and budgeting.

1) Matching questions to models (shared space)

Imagine each question and each model has a “fingerprint” (called an embedding) in the same map. If a question’s point is close to a model’s point, that model is likely a good match. To build this map:

  • Pretraining with human preferences: They used public data where people chose which model gave the better answer for the same question (like head-to-head matchups). This helps the router learn which kinds of questions go well with which models before going live.

In simple terms: they first learn a “sense of taste” from people’s past choices.
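The fingerprint-matching idea fits in a few lines. This toy function (my own illustration, not the paper's code) picks the model whose embedding has the highest cosine similarity with the query's embedding; the actual system additionally passes the query through a learned linear projection first.

```python
import numpy as np

def best_match(query_emb, model_embs):
    """Return the index of the model closest to the query in the
    shared embedding space (toy cosine-similarity matcher)."""
    q = query_emb / np.linalg.norm(query_emb)
    # Normalize each model embedding so the dot product is cosine similarity.
    M = model_embs / np.linalg.norm(model_embs, axis=1, keepdims=True)
    return int(np.argmax(M @ q))
```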

2) Learning from feedback as it goes (contextual bandit)

When the router is live, it only sees how the chosen model did (thumbs up or down), not how all models would have done. This setup is called a “contextual bandit” problem:

  • Context = the question’s embedding (the question’s fingerprint).
  • Arms = the different LLMs to choose from.
  • Reward = did the chosen model do well (a score between 0 and 1)?

The router uses a strategy similar to “try your best guess, but sometimes explore new options.” Their version is called PILOT (Preference-prior Informed LinUCB). “LinUCB” is a method that picks the model with the best mix of:

  • Expected quality (“exploitation”—choose what looks best),
  • And uncertainty (“exploration”—occasionally try something else in case it’s even better).

Because the router already learned from human preferences, it starts with a good “prior,” so it improves faster.

In simple terms: it’s like picking the player most likely to score, but still giving other players a few chances, especially when you’re not sure yet.
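The exploit/explore mix can be made concrete with the classic UCB score (shown here as plain UCB1 for simplicity; PILOT uses the contextual LinUCB variant): a model tried only a few times carries a larger uncertainty bonus, so it still gets chances.

```python
import math

def ucb1_score(mean_reward, times_chosen, total_rounds, c=2.0):
    """Expected quality plus an uncertainty bonus that shrinks as an
    arm is tried more often (UCB1 illustration, not the paper's rule)."""
    return mean_reward + math.sqrt(c * math.log(total_rounds) / times_chosen)

# A rarely tried model can outscore a well-tested one despite a lower mean:
# ucb1_score(0.6, 100, 1000) < ucb1_score(0.5, 5, 1000)
```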

3) Sticking to a budget (online knapsack)

They also add a budget manager. Think of your total budget like a backpack with limited space. Each model’s answer “weighs” a certain amount (cost). You want to pack as much value (good answers) as possible without overfilling the backpack. The system:

  • Estimates the cost for each model per query,
  • Picks a model that both looks promising and fits the remaining budget,
  • Spreads spending wisely across many queries (so you don’t spend too much too early).

They use a known approach for this kind of problem (an “online knapsack” policy) and apply it in small chunks (“bins”) to make sure the budget is used well over time.
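A threshold-style budget policy of this kind can be sketched as follows, assuming known bounds on the reward-to-cost ratio. The helper names and the cheapest-model fallback are illustrative assumptions, not the paper's exact ZCL implementation (which also bins the horizon).

```python
import math

def zcl_threshold(frac_used, lower, upper):
    """ZCL-style online-knapsack threshold: starts near the lowest
    value-to-cost ratio and rises toward the highest as budget fills,
    assuming ratios lie in [lower, upper]."""
    return (upper * math.e / lower) ** frac_used * (lower / math.e)

def pick_model(estimates, costs, spent, budget, lower=0.1, upper=10.0):
    """Pick the best-looking model whose estimated reward-to-cost
    ratio clears the current threshold and still fits the budget."""
    phi = zcl_threshold(spent / budget, lower, upper)
    eligible = [i for i, (r, c) in enumerate(zip(estimates, costs))
                if r / c >= phi and spent + c <= budget]
    if not eligible:  # late in the budget, fall back to the cheapest model
        return min(range(len(costs)), key=costs.__getitem__)
    return max(eligible, key=estimates.__getitem__)
```

Early on the threshold is low, so a strong but pricey model can be chosen; as spending approaches the budget, the threshold rises and cheaper models with good value-per-cost win out.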

What did they find?

  • Strong performance for less money:
    • On a large, mixed set of tasks (RouterBench), their router reached about 93% of GPT-4’s performance at only 25% of GPT-4’s cost.
    • On a single-task set (MMLU), it reached about 86% of GPT-4’s performance at 27% of the cost.
  • Learns efficiently: Compared to other learning methods (like standard LinUCB, Epoch-Greedy, random choices), PILOT learned faster and made better routing decisions over time (lower “regret,” which means fewer missed opportunities).
  • Smart choices by task:
    • It sent tough reasoning questions to stronger (but pricier) models like GPT-4.
    • For coding or math where cheaper models did well, it used those to save money.
  • Low overhead: Deciding which model to pick is quick—dozens of times faster than the time it usually takes a big model like GPT-4 to produce an answer.
  • Budget policy works: Their budgeting method beat simple baselines (like splitting the same budget per question) and performed similarly to or better than an “offline” method that gets to tune itself with hindsight.

Why does this matter?

This work shows a practical way to run AI systems that are:

  • Adaptive: They improve from simple thumbs up/down feedback without needing expensive labels for every model.
  • Cost-aware: They deliver high-quality answers while using money wisely.
  • Ready for the real world: They can handle changing questions over time and different user budgets.

A note on limitations and future ideas

  • During learning, the main router doesn’t directly optimize under the budget; budgeting is applied during deployment. A future step is to learn and budget at the same time.
  • The study focuses on single-turn questions. Extending this to multi-turn conversations would make it even more practical for chatbots and assistants.

Knowledge Gaps

Below is a concise, actionable list of the knowledge gaps, limitations, and open questions the paper leaves unresolved.

  • Learning without budget constraints: the bandit is trained ignoring budget and only constrained at deployment; develop budget-aware learning (e.g., bandits with knapsacks, constrained LinUCB/Thompson) and quantify end-to-end regret under budget.
  • Reward signal realism: the method assumes immediate, binary, and unbiased feedback; study robustness to sparse, delayed, noisy, and biased user ratings and design debiasing or off-policy correction.
  • Reward model misspecification: expected reward is modeled as a linear cosine similarity; benchmark against generalized linear, neural, or bilinear bandits and test calibration between predicted cosine and true success probability.
  • Frozen query projection: only arm embeddings are updated online; evaluate joint online adaptation of both query projection and arm embeddings with stability controls (e.g., trust-region, elastic weight consolidation).
  • Preference pretraining transferability: quantify domain shift between Chatbot Arena preferences and RouterBench tasks; test cross-domain pretraining, task-balanced pretraining, and ablations on negative sampling strategies.
  • Cold-start for new LLMs: provide a principled way to initialize embeddings for unseen or updated models (meta-learning, side-information features, prompt probes) and measure cold-start regret.
  • Assumptions in cost policy (UB/LB): specify how upper/lower bounds on reward-to-cost ratios are estimated; analyze sensitivity and provide guarantees under mis-specification.
  • Output token cost prediction: current approach uses per-LLM mean output length; build per-query output-length predictors with uncertainty and assess downstream impact on budget adherence and performance.
  • Uncertainty-aware cost policy: the budget allocator uses point reward estimates; integrate UCB/Thompson confidence into value-to-cost decisions to mitigate overconfident misallocations.
  • Missing theory for PILOT: provide formal regret bounds for the preference-prior LinUCB variant under cosine reward, and extend theory to the budgeted setting and to the binning heuristic.
  • Decoupled routing and budgeting: study joint optimization (e.g., Bandits with Knapsacks) and compare to the proposed decoupled policy in terms of regret, budget utilization, and stability.
  • Single-objective constraint: only token cost is considered; extend to multi-objective or constrained routing (latency/SLA, energy, carbon, privacy), and evaluate trade-offs.
  • Personalization gap: the method mentions personalization but uses only query embeddings; incorporate user/context features and evaluate per-user improvements and privacy implications.
  • Safe exploration: exploration may degrade user experience; design safe or conservatively constrained exploration and quantify cost-quality trade-offs.
  • Multi-turn routing: only single-turn inputs are handled; develop stateful/contextual routing that incorporates conversation history and measures cumulative conversation quality.
  • Delayed/missing feedback: account for asynchronous feedback and partial observability; evaluate bandit updates under delays and propose correction mechanisms.
  • Evaluation bias: RouterBench relies on GPT-4 judging and task-specific metrics; validate with human raters, alternative judges, and scorer-agnostic metrics to test robustness to evaluation bias.
  • Embedder dependence and budget: only two embedders are tested and embedder API costs are excluded from budgets; evaluate more embedders (including on-prem) and include embedding latency and cost in the budget accounting.
  • Scalability to large model pools: per-arm independent regressions may be sample-inefficient; explore shared-parameter or factorized models across arms and benchmark runtime/memory scaling.
  • Non-stationarity handling: model capabilities and query distributions evolve; add forgetting/sliding windows, change-point detection, and analyze performance under distribution drift.
  • OOD and adversarial queries: detect and handle out-of-distribution or adversarial prompts; integrate abstention/escalation and measure robustness.
  • Single-call routing vs cascades: compare to cascaded strategies (e.g., FrugalGPT escalation) learned via bandits; design hybrid policies that combine routing and escalation.
  • Fairness and service guarantees: the binning/spillover policy may starve later queries; add fairness constraints or service-level guarantees and evaluate worst-case per-query outcomes.
  • Privacy/security: online learning stores user-derived embeddings and may rely on external APIs; analyze privacy risks and test on-device embeddings and differential privacy methods.
  • Hyperparameter auto-tuning: the exploration parameter α and prior strength λ_a are tuned offline; design online tuning/adaptation strategies and test sensitivity in non-stationary settings.
  • Negative sampling bias: hard negatives depend on model size, potentially biasing against larger models; evaluate alternative negative mining strategies and their effect on routing.
  • Task/tool generality: the approach is text-only; extend to tool-use, function-calling, or multimodal tasks, and measure generalization.
  • Reward normalization across tasks: unify and calibrate task-specific scores into [0,1] to avoid skew; study multi-task reward calibration impacts on routing.
  • Baseline coverage: add strong bandit baselines (e.g., Thompson Sampling, GLM/Neural Linear bandits, BwK algorithms) under the same cost policy for a more comprehensive comparison.
  • Real-world budget dynamics: handle indefinite horizons, time-varying or per-user budgets, and multi-tenant settings; analyze stability and fairness in shared-resource deployments.
