
Cost- and Performance-Aware Routing

Updated 8 June 2026
  • Cost- and performance-aware model routing is a method that assigns inference queries to large language models while balancing accuracy and resource expenditure.
  • It employs joint optimization of model selection and batching, using proxy utility modeling and Pareto frontier scheduling to enhance cost-effectiveness.
  • Empirical results demonstrate significant cost reductions and superior performance compared to routing-only and batching-only strategies across multiple benchmarks.

Cost- and Performance-Aware Model Routing

Cost- and performance-aware model routing is a family of algorithmic techniques for optimally assigning inference queries to LLMs from a pool of candidate models, with the explicit goal of balancing utility (such as answer accuracy or task completion rate) against economic or resource expenditure (monetary cost, latency, or FLOPs). This paradigm generalizes classic “model routing” by incorporating both statistical prediction of model utility on a per-query basis and dynamic adaptation to cost, batch size, and other operational constraints. Recent work has established a unified framework for joint optimization over model selection and batching, clarified the theoretical complexity of the resulting decision problems, and developed scalable algorithms that outperform prior routing-only and batching-only pipelines (Xu et al., 27 May 2026).

1. Problem Formulation and Theoretical Landscape

The fundamental setting consists of a query workload $Q = \{q_1,\ldots,q_n\}$ and a set of $K$ LLMs $M = \{m_1,\ldots,m_K\}$, each characterized by distinct cost and utility profiles. Every model $m_k$ incurs a per-token input cost $c_k^{in}$, a per-token output cost $c_k^{out}$, and a fixed system-prompt cost. Valid batch sizes $B_k$ are determined by each model’s context-window constraints and desired amortization. For any routing assignment of $q_i$ to $(m_k, b)$ (model and batch size), the amortized per-query cost is:

$$C_{q_i}(m_k, b) = \frac{C_{sys}(m_k)}{b} + t_{q_i}^{in} c_k^{in} + t_{q_i}^{out} c_k^{out}$$

where $C_{sys}(m_k)$ accounts for fixed prompt overhead and $t_{q_i}^{in}$, $t_{q_i}^{out}$ are the input and output token counts of $q_i$. The expected utility $U_{q_i}(m_k, b)$ (e.g., probability of correct answer) and cost $C_{q_i}(m_k, b)$ define a two-dimensional trade-off landscape.
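The amortized cost formula above can be sketched in a few lines of Python. This is only an illustration: the `ModelPricing` record, its field names, and the numeric prices are hypothetical and do not come from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical pricing record; field names and values are illustrative."""
    c_in: float   # cost per input token
    c_out: float  # cost per output token
    c_sys: float  # fixed system-prompt cost paid once per call


def amortized_cost(p: ModelPricing, t_in: int, t_out: int, b: int) -> float:
    """Per-query cost when the system prompt is shared by a batch of b queries."""
    if b < 1:
        raise ValueError("batch size must be >= 1")
    return p.c_sys / b + t_in * p.c_in + t_out * p.c_out


# Assumed example prices, chosen only to make the arithmetic easy to follow.
pricing = ModelPricing(c_in=0.001, c_out=0.002, c_sys=0.4)
solo = amortized_cost(pricing, t_in=100, t_out=50, b=1)
batched = amortized_cost(pricing, t_in=100, t_out=50, b=8)
print(f"b=1: {solo:.3f}  b=8: {batched:.3f}")
```

Only the fixed system-prompt term shrinks with the batch size; the per-token terms are unchanged, which is why batching helps most when the shared prompt is long relative to the query.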

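The “Route with Batching Problem,” which seeks to maximize total expected utility subject to a global budget $B$, is formulated as an integer program:

$$\max_{x} \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{b \in B_k} x_{i,k,b}\, U_{q_i}(m_k, b) \quad \text{s.t.} \quad \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{b \in B_k} x_{i,k,b}\, C_{q_i}(m_k, b) \le B, \quad \sum_{k=1}^{K} \sum_{b \in B_k} x_{i,k,b} = 1 \ \ \forall i, \quad x_{i,k,b} \in \{0, 1\}$$

where $x_{i,k,b}$ indicates the assignment of $q_i$ to model $m_k$ in batch size $b$, and $U_{q_i}(m_k, b)$ and $C_{q_i}(m_k, b)$ denote the expected utility and amortized cost of that assignment.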

The problem is NP-hard by reduction from Maximum Coverage, highlighting the complexity of exact solutions for realistic workloads (Xu et al., 27 May 2026).

2. Proxy Utility Modeling and Cost-Utility Factorization

Exhaustively evaluating utility $U_{q_i}(m_k, b)$ for all triplets $(q_i, m_k, b)$ via LLM calls is computationally prohibitive. Efficient estimation is achieved by decomposing utility into single-query and batch-size-dependent factors:

$$U_{q_i}(m_k, b) \approx u_{q_i}(m_k) \cdot \phi_k(b)$$

Here, $u_{q_i}(m_k)$ is predicted (by MLP/kNN routers) for $q_i$ routed to $m_k$ alone, while $\phi_k(b)$ is a model-specific batch decay curve, empirically fitted (piecewise-linear or power-law) from a small coreset. This factorization removes the need to profile every triplet: batch effects are measured once per model on the coreset rather than once per query, with $\phi_k(1) = 1$ and monotonic decay as $b$ increases.
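As a minimal sketch of this factorization, the snippet below multiplies a single-query utility estimate by a piecewise-linear batch decay curve. The helper names (`make_decay_curve`, `proxy_utility`) and the coreset measurements are assumptions made for illustration; the router that would supply the single-query estimate is replaced by a constant.

```python
from bisect import bisect_right


def make_decay_curve(points):
    """Build a piecewise-linear decay curve from (batch_size, relative_utility) pairs.

    The points stand in for coreset measurements; the values used below are made up.
    """
    pts = sorted(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]

    def decay(b):
        # Clamp outside the measured range.
        if b <= xs[0]:
            return ys[0]
        if b >= xs[-1]:
            return ys[-1]
        j = bisect_right(xs, b)  # xs[j - 1] <= b < xs[j]
        x0, x1 = xs[j - 1], xs[j]
        y0, y1 = ys[j - 1], ys[j]
        return y0 + (y1 - y0) * (b - x0) / (x1 - x0)

    return decay


def proxy_utility(single_query_utility, decay, b):
    """Proxy utility: single-query estimate times the model's batch decay factor."""
    return single_query_utility * decay(b)


# Hypothetical decay measurements for one model: no loss at b=1, growing loss after.
decay = make_decay_curve([(1, 1.0), (4, 0.9), (16, 0.6)])
u = proxy_utility(0.8, decay, 8)  # 0.8 stands in for a router's prediction
print(round(u, 3))
```

A power-law fit could be swapped in for `make_decay_curve` without changing `proxy_utility`, since the two factors are estimated independently.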

The effective batch size $b_k^{*}$ is calibrated to minimize the cost-to-utility ratio (RCU):

$$b_k^{*} = \arg\min_{b \in B_k} \mathrm{RCU}_k(b)$$

Ternary search exploits the unimodality of $\mathrm{RCU}_k(b)$ in $b$ for efficient optimization.
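The calibration step can be illustrated with an integer ternary search over batch sizes. The sketch below assumes the cost-to-utility ratio is strictly unimodal on the search interval; the cost constants and the linear decay used in `rcu` are invented for the example and are not taken from the cited work.

```python
def ternary_search_min(f, lo, hi):
    """Return the integer argmin of a strictly unimodal function f on [lo, hi]."""
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            hi = m2 - 1  # the minimum cannot lie at or beyond m2
        else:
            lo = m1 + 1  # the minimum cannot lie at or before m1
    return min(range(lo, hi + 1), key=f)


def rcu(b):
    """Illustrative cost-to-utility ratio for one model (all constants assumed)."""
    cost = 0.4 / b + 0.2                       # amortized prompt cost + per-query cost
    utility = 0.8 * (1.0 - 0.05 * (b - 1))     # single-query utility times linear decay
    return cost / utility


best_b = ternary_search_min(rcu, 1, 16)
print(best_b)
```

Larger batches lower the amortized cost but also lower utility, so the ratio first falls and then rises; the search needs only a logarithmic number of evaluations instead of one per candidate batch size.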

3. Unified Scheduling and Pareto-Optimal Routing

The heart of cost- and performance-aware routing is a two-stage algorithm, exemplified by RoBatch (Xu et al., 27 May 2026):

  • Modeling Stage: For each pair $(m_k, b)$ with $b$ up to the effective batch size $b_k^{*}$, estimate cost/utility, prune dominated states (higher cost, no higher utility), and retain the individual cost–utility Pareto frontier $\mathcal{P}_i$ for each query.
  • Routing Stage: Globally allocate budget using a greedy scheduling algorithm. Each query is initially assigned its lowest-cost frontier state, with the option to “upgrade” to better (but more expensive) assignments by following the steepest utility-gain-per-cost (inverse-RCU) slope. States are selected from the union of all per-query Pareto frontiers by progressive allocation, continuing until either the budget is exhausted or all queries reach their highest-utility state.

This design guarantees lossless pruning (dominated states can be discarded without affecting optimality) and scales well: running time is governed by the number of queries $n$, the number of candidate batch sizes, and the maximum number of Pareto states per query.
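The two stages can be sketched as follows. This is a simplified illustration rather than the RoBatch implementation: states are plain `(cost, utility)` tuples, the function names are hypothetical, and each upgrade moves a query only to the next state on its own frontier.

```python
import heapq


def pareto_frontier(states):
    """Keep undominated (cost, utility) states, sorted by increasing cost."""
    frontier = []
    for cost, util in sorted(states, key=lambda s: (s[0], -s[1])):
        if not frontier or util > frontier[-1][1]:
            frontier.append((cost, util))
    return frontier


def greedy_route(frontiers, budget):
    """Start every query at its cheapest state, then apply the steepest upgrades."""
    choice = [0] * len(frontiers)
    spent = sum(f[0][0] for f in frontiers)
    if spent > budget:
        raise ValueError("budget cannot cover the cheapest assignment")
    heap = []

    def push(i):
        # Queue the next upgrade of query i, keyed by utility gain per unit cost.
        j = choice[i]
        if j + 1 < len(frontiers[i]):
            (c0, u0), (c1, u1) = frontiers[i][j], frontiers[i][j + 1]
            heapq.heappush(heap, (-(u1 - u0) / (c1 - c0), i))

    for i in range(len(frontiers)):
        push(i)
    while heap:
        _, i = heapq.heappop(heap)
        j = choice[i]
        extra = frontiers[i][j + 1][0] - frontiers[i][j][0]
        if spent + extra <= budget:
            choice[i] = j + 1
            spent += extra
            push(i)
        # Otherwise the upgrade is unaffordable and query i stays where it is.
    return choice, spent


# Made-up candidate states for two queries.
raw = [
    [(1.0, 0.5), (2.0, 0.9), (3.0, 0.8), (4.0, 0.95)],
    [(1.0, 0.6), (3.0, 0.7)],
]
frontiers = [pareto_frontier(s) for s in raw]
choice, spent = greedy_route(frontiers, budget=5.0)
print(frontiers[0], choice, spent)
```

In this example the state `(3.0, 0.8)` is pruned because `(2.0, 0.9)` is both cheaper and better, and the greedy pass spends the budget first on the steepest upgrade (query 0) before upgrading query 1.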

4. Comparative Empirical Evaluation

Experiments on six benchmarks (AGNews, IMDB, MRPC, SNLI, MMLU, GSM8K) with LLM pools Qwen3-{4B,14B,32B} and Gemma3-{4B,12B,27B} demonstrate that joint routing and batching yields strictly superior cost–accuracy Pareto frontiers compared to routing-only (FrugalGPT, RouteLLM), batch-only (Optimized Batch Prompting, similarity/diversity batching), and all single-model baselines (Xu et al., 27 May 2026).

| Method class | Cost (reasoning) | Cost (classification) | Utility curve | Comments |
|---|---|---|---|---|
| Routing-only | 25–40% higher | 10–20% higher | Sub-Pareto | Misses batch effects |
| Batch-only | 25–40% higher | 10–20% higher | Sub-Pareto | Ignores query heterogeneity |
| RoBatch (joint) | Lowest | Lowest | Strictly dominates | Joint optimality |

RoBatch achieves up to 40% cost reduction on reasoning tasks and 10–20% on classification tasks at fixed utility. Ablations verify that neither routing nor batching alone recovers this frontier. Sensitivity studies confirm robustness to coreset choice, model architecture, scaling function, and batch size.

End-to-end scheduling time is linear in the number of queries $n$ and typically under 50 s on CPU at the evaluated workload size.

5. Key Design Principles and Practitioner Guidelines

Several design lessons are established:

  • Joint optimization is essential: Loosely coupling batch prompting and routing can forfeit up to 30% of possible cost savings.
  • Proxy decomposition reduces profiling cost: Factorizing utility as the product of single-query accuracy and batch decay dramatically reduces calibration workload.
  • Batch size calibration must be data-driven: Effective batch sizes should be computed by minimizing empirical RCU on representative queries, not by heuristics.
  • Scheduling along Pareto frontiers: Global greedy upgrades, tracking the best utility-gain-to-cost slope, consistently approach the optimal budget-utility trade-off.
  • Model and decay-curve estimator choice is flexible: Simple architectures (e.g., shallow MLPs, piecewise linear scalers) suffice for practical deployment.
  • Lossless pruning of dominated states: Discarding states with both higher cost and no higher utility preserves global optimality under per-query amortization.

6. Theoretical Hardness and Algorithmic Guarantees

The full Route with Batching Problem is proven NP-hard by reduction from Maximum Coverage: the assignment of queries to model-batch pairs within a fixed budget encodes a combinatorially hard selection. Nonetheless, the two-stage RoBatch design achieves near-optimal results in practice, leveraging the decomposed proxy model and prioritized greedy scheduling to traverse the cost–utility landscape efficiently.

Moreover, the per-query pruning of dominated assignments is guaranteed to be lossless: every globally optimal routing can be constructed solely from undominated states along each query’s Pareto frontier.
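The lossless-pruning claim can be checked on a toy instance by exhaustive search, which is feasible only because the instance is tiny. The sketch below assumes each query independently picks one `(cost, utility)` state and that per-query costs simply add up; the states and the budget are made up, and the helper names are hypothetical.

```python
from itertools import product


def undominated(states):
    """Drop every state for which another state is no costlier and no worse."""
    return [
        s for s in states
        if not any(o != s and o[0] <= s[0] and o[1] >= s[1] for o in states)
    ]


def best_utility(per_query_states, budget):
    """Exhaustively find the best total utility within budget (exponential time)."""
    best = None
    for combo in product(*per_query_states):
        cost = sum(c for c, _ in combo)
        if cost <= budget:
            util = sum(u for _, u in combo)
            if best is None or util > best:
                best = util
    return best


# Made-up candidate states for two queries.
states = [
    [(1.0, 0.5), (2.0, 0.9), (3.0, 0.8), (4.0, 0.95)],
    [(1.0, 0.6), (3.0, 0.7), (3.5, 0.65)],
]
pruned = [undominated(s) for s in states]
full_opt = best_utility(states, budget=5.0)
pruned_opt = best_utility(pruned, budget=5.0)
print(full_opt, pruned_opt)
```

Both searches reach the same optimum, while the pruned search enumerates fewer combinations. The enumeration grows exponentially with the number of queries, which is consistent with the need for a greedy scheduler at realistic workload sizes.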

7. Scope, Limitations, and Future Directions

While the unified cost- and performance-aware routing framework subsumes prior art on query-level LLM selection and batching, several open directions remain. The current approach is agnostic to the choice of routing predictor or decay-curve estimator, supporting rapid adaptation to diverse LLM pools and workloads. However, generalization to settings with multi-dimensional or non-monetary cost constraints, dynamically changing model pools, or more complex utility functions (e.g., response latency, energy, or privacy metrics) is not addressed.

Further, while RoBatch focuses on single-turn, stateless inference, extensions to multi-turn, long-horizon, or interactive task settings will require explicit treatment of cross-turn dependencies and context-induced utility degradation.

In summary, cost- and performance-aware model routing via joint optimization of model selection and batch prompting offers a practically tractable approach to a provably NP-hard problem: minimizing inference cost while preserving task utility in modern LLM-serving systems (Xu et al., 27 May 2026). This paradigm establishes a new standard for cost-effective deployment, inviting refinement and extension across broader multi-dimensional inference optimization settings.

References (1)
