Cost- and Performance-Aware Routing
- Cost- and performance-aware model routing is a method that assigns inference queries to large language models while balancing accuracy and resource expenditure.
- It employs joint optimization of model selection and batching, using proxy utility modeling and Pareto frontier scheduling to enhance cost-effectiveness.
- Empirical results demonstrate significant cost reductions and superior performance compared to routing-only and batching-only strategies across multiple benchmarks.
Cost- and Performance-Aware Model Routing
Cost- and performance-aware model routing is a family of algorithmic techniques for optimally assigning inference queries to LLMs from a pool of candidate models, with the explicit goal of balancing utility (such as answer accuracy or task completion rate) against economic or resource expenditure (monetary cost, latency, or FLOPs). This paradigm generalizes classic “model routing” by incorporating both statistical prediction of model utility on a per-query basis and dynamic adaptation to cost, batch size, and other operational constraints. Recent work has established a unified framework for joint optimization over model selection and batching, clarified the theoretical complexity of the resulting decision problems, and developed scalable algorithms that outperform prior routing-only and batching-only pipelines (Xu et al., 27 May 2026).
1. Problem Formulation and Theoretical Landscape
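The fundamental setting consists of a query workload $Q = \{q_1, \dots, q_n\}$ and a set of LLMs $\mathcal{M} = \{m_1, \dots, m_K\}$, each characterized by distinct cost and utility profiles. Every model $m$ incurs a per-token input cost $c_m^{\text{in}}$ and output cost $c_m^{\text{out}}$, and has a fixed system-prompt cost. Valid batch sizes $b \in \mathcal{B}_m$ are determined by each model’s context-window constraints and desired amortization. For any routing assignment of $q$ to $(m, b)$ (model and batch size), the amortized per-query cost is: $$c(q, m, b) = c_m^{\text{in}} \left( |q| + \frac{|p_m|}{b} \right) + c_m^{\text{out}} \, |o_q|,$$ where $|q|$, $|o_q|$, and $|p_m|$ are the token counts of the query, its output, and the system prompt, and $|p_m| / b$ accounts for fixed prompt overhead. The expected utility $u(q, m, b)$ (e.g., probability of correct answer) and cost $c(q, m, b)$ define a two-dimensional trade-off landscape.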
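The amortization idea can be illustrated with a short sketch: the system prompt is paid once per batch, so each query in a batch of size `batch_size` carries only its share. The `ModelPricing` record, its field names, and the linear token pricing are assumptions made for illustration, not definitions taken from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical pricing record; field names are illustrative."""
    input_cost_per_token: float
    output_cost_per_token: float
    system_prompt_tokens: int


def amortized_query_cost(pricing, query_tokens, output_tokens, batch_size):
    """Per-query cost when one system prompt is shared by `batch_size` queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Each query pays for its own tokens plus an equal share of the prompt.
    shared_prompt_tokens = pricing.system_prompt_tokens / batch_size
    input_cost = pricing.input_cost_per_token * (query_tokens + shared_prompt_tokens)
    output_cost = pricing.output_cost_per_token * output_tokens
    return input_cost + output_cost
```

With a 400-token system prompt, moving from a batch size of 1 to 4 cuts the prompt share per query from 400 to 100 tokens, while the query and output terms are unchanged.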
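The “Route with Batching Problem,” which seeks to maximize total expected utility subject to a global budget $B$, is formulated as an integer program: $$\max_{x} \; \sum_{q} \sum_{m} \sum_{b} u(q, m, b) \, x_{q,m,b} \quad \text{s.t.} \quad \sum_{q} \sum_{m} \sum_{b} c(q, m, b) \, x_{q,m,b} \le B, \quad \sum_{m} \sum_{b} x_{q,m,b} = 1 \;\; \forall q, \quad x_{q,m,b} \in \{0, 1\},$$ where $u(q, m, b)$ and $c(q, m, b)$ are the expected utility and amortized cost of an assignment, and $x_{q,m,b} = 1$ indicates the assignment of query $q$ to model $m$ in batch size $b$.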
The problem is NP-hard by reduction from Maximum Coverage, highlighting the complexity of exact solutions for realistic workloads (Xu et al., 27 May 2026).
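To make the budgeted selection concrete, the sketch below solves a toy instance exactly by enumerating one state per query. It is a simplification: each query's candidate states are given as independent `(cost, utility)` pairs with amortized costs, and the function name `solve_exact` is hypothetical. The enumeration grows exponentially with the number of queries, which is why exact search is impractical beyond very small workloads.

```python
from itertools import product


def solve_exact(states_per_query, budget):
    """Exhaustively pick one (cost, utility) state per query under a budget.

    states_per_query: list (one entry per query) of lists of (cost, utility).
    Returns (best_utility, best_choice), or (None, None) if nothing fits.
    """
    best_utility, best_choice = None, None
    index_ranges = (range(len(states)) for states in states_per_query)
    for choice in product(*index_ranges):
        cost = sum(states_per_query[q][i][0] for q, i in enumerate(choice))
        if cost > budget:
            continue  # infeasible combination
        utility = sum(states_per_query[q][i][1] for q, i in enumerate(choice))
        if best_utility is None or utility > best_utility:
            best_utility, best_choice = utility, choice
    return best_utility, best_choice
```

For two queries with states `[(1.0, 0.5), (3.0, 0.9)]` and `[(1.0, 0.6), (2.0, 0.8)]` and a budget of 4.0, the search upgrades only the first query, since upgrading both would cost 5.0.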
2. Proxy Utility Modeling and Cost-Utility Factorization
Exhaustively evaluating utility $u(q, m, b)$ for all triplets $(q, m, b)$ via LLM calls is computationally prohibitive. Efficient estimation is achieved by decomposing utility into single-query and batch-size-dependent factors: $$\hat{u}(q, m, b) = \hat{u}(q, m) \cdot g_m(b).$$ Here, $\hat{u}(q, m)$ is predicted (by MLP/kNN routers) for $q$ routed to $m$ alone, while $g_m(b)$ is a model-specific batch decay curve, empirically fitted (piecewise-linear or power-law) from a small coreset. This factorization reduces profiling from one evaluation per (query, model, batch size) triplet to single-query predictions per (query, model) pair plus one decay curve per model, with $g_m(1) = 1$ and monotonic decay as $b$ increases.
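A minimal sketch of the factorized proxy, assuming a piecewise-linear decay curve fitted through accuracies measured on a coreset at a few batch sizes. The helper names `make_decay_curve` and `proxy_utility`, and the choice to hold the curve flat beyond the last measured batch size, are illustrative assumptions.

```python
from bisect import bisect_right


def make_decay_curve(points):
    """Build a piecewise-linear decay g(b), normalized so that g(1) == 1.

    points: dict {batch_size: accuracy measured on a coreset}; must include 1.
    """
    xs = sorted(points)
    if xs[0] != 1:
        raise ValueError("need a measurement at batch size 1")
    base = points[1]
    ys = [points[x] / base for x in xs]

    def g(b):
        if b <= xs[0]:
            return ys[0]
        if b >= xs[-1]:
            return ys[-1]  # assumption: flat beyond the last measured point
        hi = bisect_right(xs, b)
        lo = hi - 1
        t = (b - xs[lo]) / (xs[hi] - xs[lo])
        return ys[lo] + t * (ys[hi] - ys[lo])

    return g


def proxy_utility(single_query_utility, decay, batch_size):
    """Estimated utility = single-query prediction times the batch decay factor."""
    return single_query_utility * decay(batch_size)
```

Only the decay curve requires batched LLM calls, and only on the coreset; the per-query factor comes from whatever single-query router is already in place.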
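The effective batch size $b_m^*$ of each model $m$ is calibrated to minimize the cost-to-utility ratio (RCU): $$b_m^* = \arg\min_{b} \mathrm{RCU}_m(b), \qquad \mathrm{RCU}_m(b) = \frac{\bar{c}_m(b)}{\bar{u}_m(b)},$$ where $\bar{c}_m(b)$ and $\bar{u}_m(b)$ are the average amortized cost and estimated utility of model $m$ at batch size $b$ on representative queries. Ternary search exploits unimodality of $\mathrm{RCU}_m(b)$ for efficient optimization.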
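The batch-size calibration can be sketched as an integer ternary search, assuming the cost-to-utility ratio is strictly unimodal over the candidate range (plateaus can mislead the search). The function `ternary_search_min` and the toy ratio `toy_rcu` are hypothetical; the numbers in `toy_rcu` are made up for illustration.

```python
def ternary_search_min(f, lo, hi):
    """Minimize a strictly unimodal function f over the integers in [lo, hi]."""
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            hi = m2 - 1  # the minimum lies strictly left of m2
        else:
            lo = m1 + 1  # the minimum lies strictly right of m1
    return min(range(lo, hi + 1), key=f)


def toy_rcu(b):
    """Toy ratio: amortized cost falls with b while utility decays linearly."""
    cost = 0.2 + 0.4 / b              # per-query part plus shared prompt share
    utility = 1.0 - 0.03 * (b - 1)    # assumed linear batch decay
    return cost / utility
```

Calling `ternary_search_min(toy_rcu, 1, 30)` needs only a handful of evaluations instead of all 30, which matters when each evaluation of the ratio involves profiling on representative queries.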
3. Unified Scheduling and Pareto-Optimal Routing
The heart of cost- and performance-aware routing is a two-stage algorithm, exemplified by RoBatch (Xu et al., 27 May 2026):
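- Modeling Stage: For each query $q$ and each candidate state $(m, b)$ with $b$ up to the model's effective batch size $b_m^*$, estimate cost/utility, prune dominated states (higher cost, no higher utility), and retain the individual cost–utility Pareto frontier $\mathcal{F}_q$ for each query.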
- Routing Stage: Globally allocate budget using a greedy scheduling algorithm. Each query $q$ is initially assigned its lowest-cost Pareto state $s_q^{(0)}$, with the option to “upgrade” to better (but more expensive) assignments by following the steepest utility-gain-per-cost (inverse-RCU) slope. States are selected from the union of all per-query Pareto frontiers by progressive allocation, continuing until either the budget is exhausted or all queries reach their highest-utility state.
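The pruning step of the modeling stage can be sketched as a single pass over a query's candidate states sorted by cost. The tuple layout `(cost, utility, label)` and the function name `pareto_frontier` are assumptions for illustration; `label` stands in for whatever identifies the model and batch size.

```python
def pareto_frontier(states):
    """Return the undominated (cost, utility, label) states, cheapest first.

    A state is dropped if another state costs no more and is at least as useful.
    """
    frontier = []
    best_utility = float("-inf")
    # Sort by cost; among equal costs, visit the highest utility first.
    for state in sorted(states, key=lambda s: (s[0], -s[1])):
        if state[1] > best_utility:
            frontier.append(state)
            best_utility = state[1]
    return frontier
```

The result is strictly increasing in both cost and utility, which is the form a greedy upgrade scheduler needs: every step along the frontier buys additional utility at additional cost.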
This design guarantees lossless pruning (dominated states can be discarded without affecting optimality) and scales well: scheduling cost is governed by the number of queries $n$, the number of candidate batch sizes, and the number of Pareto states retained per query.
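The greedy upgrade loop of the routing stage can be sketched with a max-heap keyed on utility gain per unit of extra cost. This is a simplified illustration, not the RoBatch implementation: it considers only upgrades to the next state on each query's frontier, skips upgrades that do not fit the remaining budget, and assumes each frontier is sorted with strictly increasing cost and utility. The name `greedy_route` is hypothetical.

```python
import heapq


def greedy_route(frontiers, budget):
    """Greedy budget allocation over per-query frontiers of (cost, utility).

    Returns (choice, spent), where choice[q] indexes the state picked for query q.
    """
    choice = [0] * len(frontiers)
    spent = sum(frontier[0][0] for frontier in frontiers)
    if spent > budget:
        raise ValueError("budget cannot cover the lowest-cost assignment")

    heap = []

    def push_upgrade(q):
        i = choice[q]
        if i + 1 < len(frontiers[q]):
            (c0, u0), (c1, u1) = frontiers[q][i], frontiers[q][i + 1]
            # heapq is a min-heap, so negate the slope to pop the steepest first.
            heapq.heappush(heap, (-(u1 - u0) / (c1 - c0), q))

    for q in range(len(frontiers)):
        push_upgrade(q)
    while heap:
        _, q = heapq.heappop(heap)
        i = choice[q]
        extra = frontiers[q][i + 1][0] - frontiers[q][i][0]
        if spent + extra > budget:
            continue  # this upgrade does not fit; try the remaining ones
        choice[q] = i + 1
        spent += extra
        push_upgrade(q)
    return choice, spent
```

Because the allocation is greedy, it can miss the exact optimum on some instances: with frontiers `[(1.0, 0.5), (3.0, 0.9)]` and `[(1.0, 0.6), (2.0, 0.9), (5.0, 0.95)]` and a budget of 4.0, it upgrades the second query first and then cannot afford the first, even though upgrading only the first query would yield more total utility.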
4. Comparative Empirical Evaluation
Experiments on six benchmarks (AGNews, IMDB, MRPC, SNLI, MMLU, GSM8K) with LLM pools Qwen3-{4B,14B,32B} and Gemma3-{4B,12B,27B} demonstrate that joint routing and batching yields strictly superior cost–accuracy Pareto frontiers compared to routing-only (FrugalGPT, RouteLLM), batch-only (Optimized Batch Prompting, similarity/diversity batching), and all single-model baselines (Xu et al., 27 May 2026).
| Method class | Cost on reasoning tasks (vs. RoBatch) | Cost on classification tasks (vs. RoBatch) | Cost–utility curve | Comments |
|---|---|---|---|---|
| Routing-only | 25–40% higher | 10–20% higher | Sub-Pareto | Misses batch effects |
| Batch-only | 25–40% higher | 10–20% higher | Sub-Pareto | Ignores query heterogeneity |
| RoBatch (joint) | Lowest | Lowest | Strictly dominates | Joint optimization |
At fixed utility, RoBatch reduces cost by up to 40% on reasoning tasks and by 10–20% on classification tasks. Ablations verify that neither routing nor batching alone recovers this frontier. Sensitivity studies confirm robustness to coreset choice, model architecture, scaling function, and batch size.
End-to-end scheduling time is linear in the number of queries and typically under 50s on CPU at the workload sizes tested.
5. Key Design Principles and Practitioner Guidelines
Several design lessons are established:
- Joint optimization is essential: Loosely coupling batch prompting and routing can forfeit up to 30% of possible cost savings.
- Proxy decomposition reduces profiling cost: Factorizing utility as the product of single-query accuracy and batch decay dramatically reduces calibration workload.
- Batch size calibration must be data-driven: Effective batch sizes should be computed by minimizing empirical RCU on representative queries, not by heuristics.
- Scheduling along Pareto frontiers: Global greedy upgrades, tracking the best utility-gain-to-cost slope, consistently approach the optimal budget-utility trade-off.
- Model and decay-curve estimator choice is flexible: Simple architectures (e.g., shallow MLPs, piecewise linear scalers) suffice for practical deployment.
- Lossless pruning of dominated states: Discarding states with both higher cost and no higher utility preserves global optimality under per-query amortization.
6. Theoretical Hardness and Algorithmic Guarantees
The full Route with Batching Problem is proven NP-hard by reduction from Maximum Coverage: the assignment of queries to model-batch pairs within a fixed budget encodes a combinatorially hard selection. Nonetheless, the two-stage RoBatch design achieves near-optimal results in practice, leveraging the decomposed proxy model and prioritized greedy scheduling to traverse the cost–utility landscape efficiently.
Moreover, the per-query pruning of dominated assignments is guaranteed to be lossless: every globally optimal routing can be constructed solely from undominated states along each query’s Pareto frontier.
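The lossless-pruning claim can be checked empirically on small random instances by comparing the exact optimum with and without pruning. The sketch below uses integer costs and utilities so that equality comparisons are exact; the helper names and the instance sizes are arbitrary choices for illustration, and a passing check is supporting evidence rather than a proof.

```python
import random
from itertools import product


def frontier(states):
    """Keep undominated (cost, utility) states, cheapest first."""
    kept, best = [], float("-inf")
    for cost, utility in sorted(states, key=lambda s: (s[0], -s[1])):
        if utility > best:
            kept.append((cost, utility))
            best = utility
    return kept


def best_utility(states_per_query, budget):
    """Exact optimum by enumeration; None if no combination fits the budget."""
    best = None
    for combo in product(*states_per_query):
        if sum(cost for cost, _ in combo) <= budget:
            total = sum(utility for _, utility in combo)
            if best is None or total > best:
                best = total
    return best


def check_lossless(trials=200, seed=0):
    """Return True if pruning never changes the exact optimum on random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        queries = [[(rng.randint(1, 9), rng.randint(1, 9)) for _ in range(4)]
                   for _ in range(3)]
        budget = rng.randint(3, 20)
        pruned = [frontier(states) for states in queries]
        if best_utility(queries, budget) != best_utility(pruned, budget):
            return False
    return True
```

The check passes because any dominated state in a feasible solution can be swapped for a state that costs no more and is at least as useful, so the swap never breaks the budget and never lowers total utility.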
7. Scope, Limitations, and Future Directions
While the unified cost- and performance-aware routing framework subsumes prior art on query-level LLM selection and batching, several open directions remain. The current approach is agnostic to the choice of routing predictor or decay-curve estimator, supporting rapid adaptation to diverse LLM pools and workloads. However, generalization to settings with multi-dimensional or non-monetary cost constraints, dynamically changing model pools, or more complex utility functions (e.g., response latency, energy, or privacy metrics) is not addressed.
Further, while RoBatch focuses on single-turn, stateless inference, extensions to multi-turn, long-horizon, or interactive task settings will require explicit treatment of cross-turn dependencies and context-induced utility degradation.
In summary, cost- and performance-aware model routing via joint optimization of model selection and batch prompting offers a practically tractable approach to a provably NP-hard problem: minimizing inference cost while preserving task utility in modern LLM-serving systems (Xu et al., 27 May 2026). The framework sets a strong baseline for cost-effective deployment and invites refinement and extension to broader multi-dimensional inference optimization settings.