
Adaptive LLM Routing under Budget Constraints (2508.21141v1)

Published 28 Aug 2025 in cs.LG

Abstract: LLMs have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. LLM routing addresses this by dynamically selecting the most suitable LLM for each query/task. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. However, real-world scenarios lack such comprehensive mappings and face evolving user queries. We thus propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback without requiring exhaustive inference across all LLMs for all queries (in contrast to supervised routing). To address this problem, we develop a shared embedding space for queries and LLMs, where query and LLM embeddings are aligned to reflect their affinity. This space is initially learned from offline human preference data and refined through online bandit feedback. We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB. To handle diverse user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, ensuring resource-efficient routing.


Summary

  • The paper proposes PILOT, which integrates human preference priors with contextual bandit feedback to enable efficient LLM routing under budget constraints.
  • PILOT achieves 93% of GPT-4's performance at only 25% of its cost in multi-LLM routing, demonstrating significant improvements over supervised and bandit baselines.
  • The study validates PILOT via extensive experiments on diverse tasks, highlighting its robustness, rapid adaptation, and low computational overhead.

Adaptive LLM Routing under Budget Constraints: A Technical Analysis

Problem Formulation and Motivation

The paper addresses the challenge of deploying multiple LLMs in real-world systems where both performance and cost constraints are critical. The central problem is dynamic LLM routing: selecting the most suitable LLM for each incoming query, given a pool of models with varying capabilities and costs. Unlike prior work that treats routing as a supervised learning problem requiring exhaustive query-LLM pairings, this work reformulates routing as a contextual bandit problem, leveraging only bandit feedback (i.e., user evaluation of the selected model's response) and enforcing budget constraints via an online cost policy.
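
Concretely, bandit feedback means the router only ever observes a reward signal for the single model it chose. The following toy sketch of that interaction protocol is illustrative only (the model pool, the random placeholder policy, and the feedback stub are assumptions, not the paper's implementation):

```python
import random

# Toy protocol sketch: the router observes feedback only for the one LLM it
# selects per query (bandit feedback), unlike supervised routing, which
# presumes optimal labels for every query-LLM pair.
LLMS = ["gpt-4", "claude-v1", "mixtral-8x7b"]    # illustrative model pool

def route(query: str) -> int:
    return random.randrange(len(LLMS))           # placeholder routing policy

def get_feedback(llm: str, query: str) -> float:
    return random.random()                       # stands in for a user rating

history = []
for query in ["solve 2+2", "draft an email"]:    # stands in for an evolving stream
    arm = route(query)
    reward = get_feedback(LLMS[arm], query)      # reward for the chosen LLM only
    history.append((query, arm, reward))         # unchosen LLMs yield no signal
```

This partial observability is what makes exhaustive inference across all LLMs unnecessary, and what the contextual bandit machinery below is designed to exploit.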

Methodology: Preference-Prior Informed LinUCB (PILOT)

The proposed solution, PILOT, extends the LinUCB algorithm by incorporating human preference priors and an online cost policy. The approach consists of three main components:

  1. Shared Embedding Space Pretraining: Queries and LLMs are embedded into a shared space, pretrained using human preference data. Query embeddings are projected via a learned linear transformation, and LLM embeddings are optimized to align with preferred responses (a minimal alignment sketch follows Figure 1 below).
  2. Online Bandit Feedback Adaptation: The router adapts LLM embeddings online using contextual bandit feedback. The expected reward for a query-LLM pair is modeled as the cosine similarity between their normalized embeddings. PILOT initializes the bandit algorithm with preference-based priors, theoretically achieving lower regret bounds than standard LinUCB/OFUL when the prior is close to the true reward vector (a LinUCB-style sketch follows Figure 2 below).
  3. Budget-Constrained Routing via Online Cost Policy: The cost policy is formulated as an online multi-choice knapsack problem, using the ZCL algorithm to allocate budget across queries. The policy dynamically selects eligible LLMs based on estimated reward-to-cost ratios and current budget utilization, with binning to avoid underutilization in finite-horizon settings (a budget-pacing sketch appears under Cost Policy Evaluation below).

    Figure 1: Overview of the two-phase pretraining process for query and LLM embeddings using human preference data.
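
The paper's exact pretraining objective is not reproduced here; the following is a minimal sketch of one plausible ranking-style alignment loss over offline preference pairs, assuming a frozen text encoder and illustrative dimensions (W, llm_emb, and preference_loss are hypothetical names):

```python
import torch
import torch.nn.functional as F

D_QUERY, D_SHARED, N_LLMS = 768, 128, 11     # illustrative dimensions

W = torch.randn(D_QUERY, D_SHARED, requires_grad=True)       # learned query projection
llm_emb = torch.randn(N_LLMS, D_SHARED, requires_grad=True)  # one embedding per LLM

def preference_loss(query_vecs, winner_idx, loser_idx, margin=0.1):
    """Align queries with the LLM preferred by human annotators.

    query_vecs: (B, D_QUERY) frozen text-encoder embeddings of queries
    winner_idx / loser_idx: (B,) indices of the preferred / rejected LLM
    """
    q = F.normalize(query_vecs @ W, dim=-1)      # project and normalize queries
    m = F.normalize(llm_emb, dim=-1)             # normalize LLM embeddings
    sim_win = (q * m[winner_idx]).sum(-1)        # cosine affinity to preferred LLM
    sim_lose = (q * m[loser_idx]).sum(-1)        # cosine affinity to rejected LLM
    # Hinge-style ranking loss: the preferred LLM should score higher by a margin.
    return F.relu(margin - sim_win + sim_lose).mean()

# Both the projection and the LLM embeddings are trained on offline preference data.
opt = torch.optim.Adam([W, llm_emb], lr=1e-3)
```

Minimizing a loss of this shape pulls each query toward the LLM humans preferred for it, producing the affinity structure that the bandit later refines online.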

    Figure 2: Bandit router framework: the router receives user queries, cost constraints, and a model pool, adapting LLM selection based on user feedback.
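
Below is a minimal sketch of the prior-informed LinUCB idea, assuming disjoint per-arm linear models whose parameters are the LLM embeddings. Seeding b so that the initial estimate A^{-1} b equals the pretrained embedding encodes the preference prior; class and hyperparameter names are illustrative, not the authors' implementation:

```python
import numpy as np

class PilotRouter:
    """Disjoint LinUCB over LLM arms, seeded with preference-pretrained LLM
    embeddings. A sketch under stated assumptions; alpha and lam are assumed
    hyperparameters."""

    def __init__(self, prior_llm_embs, alpha=0.5, lam=1.0):
        d = prior_llm_embs.shape[1]
        self.alpha = alpha
        # Ridge statistics per arm; seeding b makes A^{-1} b equal the
        # pretrained embedding before any online feedback arrives.
        self.A = [lam * np.eye(d) for _ in prior_llm_embs]
        self.b = [lam * theta.copy() for theta in prior_llm_embs]

    def select(self, query_emb):
        x = query_emb / np.linalg.norm(query_emb)    # cosine-style normalization
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                        # current LLM embedding estimate
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))                # index of the chosen LLM

    def update(self, arm, query_emb, reward):
        x = query_emb / np.linalg.norm(query_emb)
        self.A[arm] += np.outer(x, x)                # standard LinUCB rank-one update
        self.b[arm] += reward * x
```

Each interaction calls select, queries the chosen model, and passes the observed feedback to update. Because the cost policy is decoupled, budget filtering can simply restrict which arms select is allowed to consider.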

Experimental Setup

Experiments are conducted on the RouterBench dataset, which includes 36,497 samples across 64 tasks and 11 LLMs (both open-source and proprietary). The evaluation simulates online learning by splitting the data into tuning, learning, and deployment buckets. Baselines include all-to-one routers, HybridLLM (supervised), and several contextual bandit algorithms (LinUCB, Epoch-Greedy, Explore Only, Random Policy). The cost policy is applied uniformly across all baselines for fair comparison.

Results

Performance vs. Cost

PILOT achieves 93% of GPT-4's performance at only 25% of its cost in multi-LLM routing, and 86% of GPT-4's performance at 27% of its cost in single-task routing. Across all cost budgets and learning bucket sizes, PILOT consistently outperforms bandit and supervised baselines in both deployment set performance and cumulative regret.

Figure 3: Performance vs cost, learning bucket size, and cumulative regret for PILOT and baselines on single-task and multi-task settings.

Qualitative Routing Analysis

PILOT demonstrates task-aware routing: for complex reasoning tasks (MMLU, ARC Challenge), it routes ~90% of queries to GPT-4; for coding (MBPP), Claude models handle 28% of queries; for math (GSM8K), Claude-v1 is selected for 94% of queries, reflecting cost-effective exploitation of model strengths.

Cost Policy Evaluation

The online cost policy outperforms simple per-query budget allocation and even an offline policy with perfect hindsight, as shown by higher mean reciprocal rank and deployment set performance.

Figure 4: Comparison of cost policies: mean reciprocal rank and performance across budgets.
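
The exact ZCL instantiation is in the paper; as a rough sketch of the threshold idea behind online multi-choice knapsack policies, the sketch below assumes known bounds L and U on reward-to-cost ratios, uses hypothetical function names, adds a cheapest-model fallback that is this sketch's own assumption, and omits the paper's binning over the horizon:

```python
import math

def zcl_threshold(frac_used, L, U):
    """Exponential price curve from the online-knapsack literature: rises
    from roughly L toward U as the budget fills (treating L and U as known
    constants is an assumption of this sketch)."""
    return (U * math.e / L) ** frac_used * (L / math.e)

def eligible_llms(est_rewards, costs, spent, budget, L, U):
    """Keep LLMs whose estimated reward-to-cost ratio clears the current
    threshold; the bandit then selects among the survivors."""
    psi = zcl_threshold(spent / budget, L, U)
    keep = [i for i, (r, c) in enumerate(zip(est_rewards, costs))
            if c == 0 or r / c >= psi]
    # Fallback (assumption): route to the cheapest model if none qualifies.
    return keep or [min(range(len(costs)), key=costs.__getitem__)]
```

The threshold rises as the budget depletes, so expensive models must promise proportionally more estimated reward to remain eligible.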

Computational Overhead

PILOT's routing time is negligible compared to LLM inference: 0.065–0.239s for routing vs. 2.5s for GPT-4 inference, ensuring minimal latency impact.

Embedding Model Sensitivity

PILOT maintains superior performance over baselines when using alternative embedding models (Instructor-XL), indicating robustness to the choice of embedder.

Figure 5: Sensitivity analysis of PILOT's performance with different embedding models.

Binary LLM Routing and Adaptability

PILOT matches or surpasses HybridLLM in binary routing (GPT-4 vs. Mistral-7b/Mixtral-8x7b), despite requiring only bandit feedback. It adapts rapidly to shifts in query distribution, increasing exploration during drift and stabilizing post-drift.

Figure 6: Performance vs cost for PILOT and HybridLLM in binary routing scenarios.

Figure 7: Binary LLM routing evaluation: performance, learning bucket size, and cumulative regret for different LLM pairs.

Theoretical Implications

The preference-prior informed initialization in PILOT is shown to yield lower cumulative regret than standard bandit algorithms when the prior is close to the true reward vector. This provides a formal justification for leveraging human preference data in contextual bandit settings for LLM routing.
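
The paper's precise statement is not reproduced here. As a rough illustration under the standard OFUL analysis, centering the ridge regularizer at the prior rather than at the origin replaces the norm term in the confidence radius with the prior's error, so the regret bound takes a form like the following (this specific form is an assumption of this sketch, not a quotation of the paper's theorem):

```latex
% Illustrative only: OFUL-style regret with regularization centered at the
% preference prior \theta_0 instead of 0.
R_T \;=\; \tilde{O}\!\Big(\big(d \;+\; \sqrt{\lambda d}\,
    \lVert \theta^{*} - \theta_{0} \rVert_{2}\big)\sqrt{T}\Big)
% Setting \theta_0 = 0 recovers the usual dependence on \lVert\theta^*\rVert;
% a prior close to \theta^* shrinks the bound, matching the claim that good
% preference priors yield lower cumulative regret.
```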

Practical Implications and Future Directions

PILOT enables adaptive, cost-efficient LLM deployment in dynamic environments, requiring only partial supervision and minimal annotation. The decoupling of bandit learning and cost policy allows for robust, user-controllable budget management. Limitations include the absence of budget constraints during the online learning phase and the focus on single-turn queries; future work should address budget-aware online learning and multi-turn conversational routing.

Conclusion

The paper presents a principled, empirically validated approach to adaptive LLM routing under budget constraints, combining preference-informed contextual bandit learning with an efficient online cost policy. PILOT achieves near state-of-the-art performance at a fraction of the cost, adapts to evolving query distributions, and is robust to embedding model choices, making it suitable for practical LLM deployment in cost-sensitive, dynamic settings.
