LLMRouterBench: LLM Routing Benchmark
- LLMRouterBench is a comprehensive benchmark for assessing language model routing algorithms across 21 datasets and multiple LLMs.
- It standardizes evaluation using metrics such as accuracy, cost curves, Pareto analysis, and latency to enable rigorous method comparisons.
- Empirical insights include model complementarity, cost-efficiency, and diminishing returns, guiding future research in tri-objective LLM routing.
LLMRouterBench is a large-scale benchmark and open framework for evaluating and comparing LLM routing algorithms. In LLM routing, a policy assigns each user query to an optimal choice from a pool of models, trading off performance (e.g., accuracy, judged response quality) and resource cost. LLMRouterBench provides a standardized, reproducible testbed comprising over 400,000 query–model instances, 21 datasets across diverse domains, 33 open and proprietary LLMs, and 10 representative routing methods. By enabling rigorous comparative analysis, the benchmark has established itself as a primary reference for both the empirical characterization and systematic advancement of LLM routing (Li et al., 12 Jan 2026).
1. Formalization of the LLM Routing Problem
LLMRouterBench formalizes the LLM routing problem as a decision function $R$ that maps each input query $q$ to a specific candidate LLM $m_{R(q)}$ from the set $\mathcal{M} = \{m_1, \dots, m_K\}$, where each model $m_i$ carries a fixed inference cost $c_i$. The routed model produces output $m_{R(q)}(q)$, scored against a ground-truth reference $y^{*}$ by a binary or graded scoring function $s(\cdot, y^{*}) \in [0, 1]$ (Li et al., 12 Jan 2026). The per-query utility is $u(q) = s(m_{R(q)}(q), y^{*})$.
Two primary settings are supported:
- Performance-oriented routing: maximize expected utility (accuracy) over a benchmark distribution $\mathcal{D}$: $\max_R \; \mathbb{E}_{q \sim \mathcal{D}}\big[\, s(m_{R(q)}(q), y^{*}) \,\big]$
- Performance–cost trade-off routing: for each parameterization $\lambda$ (e.g., a cost threshold), induce a routing function $R_\lambda$ and evaluate the trade-off curve traced by $\big(\, \mathbb{E}_{q \sim \mathcal{D}}[c_{R_\lambda(q)}], \; \mathbb{E}_{q \sim \mathcal{D}}[s(m_{R_\lambda(q)}(q), y^{*})] \,\big)$ as $\lambda$ varies
Classical baselines include Random (uniform selection), BestSingle (globally best-in-hindsight model), and Oracle (cheapest correct model per query).
Key summary statistics supported:
- Gain@$b$: accuracy gain at a given cost budget $b$ relative to a baseline router.
- Gap@Oracle: remaining accuracy gap to instance-wise optimal routing.
- PerfGain/CostSave: gain or savings relative to BestSingle at matched performance.
- ParetoDist: $\ell_1$ distance to the empirical (accuracy, cost) Pareto frontier.
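The classical baselines and the Gap@Oracle statistic above can be sketched against a toy correctness matrix. This is a minimal sketch with entirely synthetic sizes, costs, and success rates, not the benchmark's actual scoring pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correctness matrix: S[q, m] = 1 iff model m answers query q correctly.
# All sizes, costs, and per-model success rates are synthetic.
n_queries, n_models = 1000, 5
p_correct = np.array([0.45, 0.55, 0.60, 0.70, 0.75])
S = (rng.random((n_queries, n_models)) < p_correct).astype(float)
cost = np.array([0.1, 0.2, 0.5, 1.0, 2.0])   # hypothetical per-call costs

random_acc = S.mean()                   # Random: uniform model selection
best_single_acc = S.mean(axis=0).max()  # BestSingle: best-in-hindsight model
oracle_acc = S.max(axis=1).mean()       # Oracle: solved if any model solves it

# The Oracle also charges the cheapest *correct* model on each solved query.
cheapest_correct = np.where(S > 0, cost, np.inf).min(axis=1)
oracle_cost = cheapest_correct[np.isfinite(cheapest_correct)].mean()

gap_at_oracle = oracle_acc - best_single_acc  # Gap@Oracle
```

By construction Random ≤ BestSingle ≤ Oracle: the column-mean maximum dominates the mean of column means, and the row-wise maximum dominates any single column.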
2. Benchmark Structure and Dataset Composition
LLMRouterBench assembles a comprehensive dataset with two distinct model pools (Li et al., 12 Jan 2026):
- Performance-oriented pool: 20 open-source models (e.g., Qwen3-8B, Intern-S1-mini, GLM-Z1) focused on tasks solvable by lightweight LLMs.
- Performance–cost pool: 13 flagship/proprietary models (GPT-5, Claude-4, Gemini-Pro, Qwen3-235B, etc.) accessed through commercial APIs.
The benchmark spans 21 datasets, including 15 tasks solvable by lightweight LLMs (covering mathematics, code, logic, affective reasoning, etc.) and 10 high-difficulty or tool-use tasks (the two groups overlap). The raw data comprises 23,945 user prompts, expanded to 391,645 query–model tuples (∼1.8 billion tokens), each annotated with ground-truth references, extracted answers, token counts, and per-inference cost. This modular design supports plug-and-play addition of new LLMs, domains, or evaluation protocols.
All data and code are open-sourced for reproducibility and community extension.
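A query–model tuple carrying the annotations described above might be modeled as a record like the following. The field names are hypothetical, chosen only to illustrate the shape of the data; they are not the benchmark's actual column names:

```python
from dataclasses import dataclass

# Hypothetical schema for one query–model tuple; field names are
# illustrative, not the benchmark's actual column names.
@dataclass(frozen=True)
class RoutingRecord:
    dataset: str            # source dataset, e.g. a math or code task
    query_id: str
    model: str              # e.g. "Qwen3-8B"
    response: str           # raw model output
    extracted_answer: str   # parsed final answer
    reference: str          # ground-truth reference
    correct: bool           # domain-specific score, binarized here
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float         # per-inference cost

rec = RoutingRecord("toy-math", "q0001", "Qwen3-8B",
                    "The answer is 42.", "42", "42",
                    True, 180, 64, 0.00012)
```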
3. Framework Architecture and Evaluation Methodology
LLMRouterBench is architected as three principal modules (Li et al., 12 Jan 2026):
- Collector: orchestrates and caches model calls, tracks cost, retries failures.
- Evaluator: applies domain-specific scoring, such as Pass@1 for code or LLM-judged scores for open-ended tasks.
- Adaptor: auto-converts the unified table into features compatible with diverse routing algorithms.
Primary evaluation metrics are:
- Per-dataset and overall accuracy
- Cost curves and Pareto analysis
- Aggregated performance–cost frontier
- Latency (through token-level throughput and provider stats; not a primary metric but supported)
Methodologies include fixed-split evaluation (shared train/test splits for all routers), controlled embedding backbones, and baseline comparison against upper/lower bounds (Oracle, BestSingle, Random). Cross-validation and ablation studies (including embedding backbone swaps) are supported.
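Pareto analysis over (cost, accuracy) operating points can be sketched as follows. The `pareto_dist` helper is a simple L1 stand-in for the benchmark's ParetoDist metric and may differ from the paper's exact definition; the points are invented:

```python
import numpy as np

# Illustrative (cost, accuracy) points for a handful of routers/models;
# all values are invented for this sketch.
points = np.array([
    [0.10, 0.62],
    [0.25, 0.70],
    [0.40, 0.69],   # dominated: the 0.25-cost point is cheaper and better
    [0.80, 0.78],
    [1.50, 0.79],
])

def pareto_frontier(pts):
    """Keep points not dominated under (lower cost, higher accuracy)."""
    keep = []
    for i, (c_i, a_i) in enumerate(pts):
        dominated = any(
            c_j <= c_i and a_j >= a_i and (c_j < c_i or a_j > a_i)
            for j, (c_j, a_j) in enumerate(pts) if j != i
        )
        if not dominated:
            keep.append(i)
    return pts[keep]

def pareto_dist(pt, front):
    """L1 distance to the nearest frontier point -- a simple stand-in
    for ParetoDist; the benchmark's exact definition may differ."""
    return float(np.abs(front - pt).sum(axis=1).min())

front = pareto_frontier(points)
```

A router sitting on the frontier has distance zero; dominated points have strictly positive distance.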
4. Baseline Routing Algorithms
LLMRouterBench integrates a suite of 10 routing methods, encompassing both classical and recent architectures (Li et al., 12 Jan 2026):
| Method | Core Technique | Notable Features |
|---|---|---|
| RouterDC | Dual-contrastive encoder | Selects by predicted score |
| EmbedLLM | Query/model embedding + probe | Pairwise correctness prediction |
| MODEL-SAT | Task-instruction tuning | Embeddings for capability alignment |
| GraphRouter | Heterogeneous graph edge prediction | Performance/Balance/Cost modes |
| Avengers | Query clustering + best expert | No neural training |
| HybridLLM | Binary cascade with difficulty pred | Small/large model cascade |
| FrugalGPT | Cascaded inference, learned stop | Stops cascade once a scorer accepts the answer |
| RouteLLM | Preference data, matrix factorization | Estimates win rates |
| Avengers-Pro | Expanded Avengers, tunable cost | Performance-cost coefficient |
| OpenRouter | Commercial black-box API | 30+ commercial LLMs |
These methods are benchmarked on identical splits and features, enabling controlled comparison. Plug-in adaptors ensure experimental uniformity across approaches.
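To make the training-free clustering family (Avengers-style) concrete, here is a minimal sketch: cluster training queries by embedding, record the empirically best model per cluster, and route new queries to the nearest centroid's model. Everything below (2-D "embeddings", correctness matrix, k-means details) is synthetic and illustrative, not the published method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D "embeddings" standing in for real query embeddings; three
# query families, each best served by a different model.
X_train = np.vstack([rng.normal(m, 0.2, (100, 2))
                     for m in ([0, 0], [3, 0], [0, 3])])
family = np.repeat([0, 1, 2], 100)

# Correctness matrix S[q, m]: the matching model succeeds ~90% of the time,
# the others ~30% (all numbers invented for the sketch).
n_models = 3
p = np.full((300, n_models), 0.3)
p[np.arange(300), family] = 0.9
S = (rng.random(p.shape) < p).astype(float)

def fit_cluster_router(X, S, k=3, iters=20):
    """Plain k-means, then pick the empirically best model per cluster."""
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        assign = ((X[:, None, :] - centroids) ** 2).sum(-1).argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(axis=0)
    best_model = np.array([
        S[assign == c].mean(axis=0).argmax() if (assign == c).any() else 0
        for c in range(k)
    ])
    return centroids, best_model

def route(q_emb, centroids, best_model):
    """Send a query embedding to the best model of its nearest cluster."""
    return int(best_model[((centroids - q_emb) ** 2).sum(-1).argmin()])

centroids, best_model = fit_cluster_router(X_train, S)
```

The appeal of this family is that it needs no neural training: fitting is a clustering pass plus a per-cluster argmax over observed correctness.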
5. Empirical Results and Key Insights
LLMRouterBench reveals several fundamental properties of LLM routing systems (Li et al., 12 Jan 2026):
- Model Complementarity: No single model achieves dominance across all tasks. Ensemble routing consistently outperforms all fixed single-model policies by approximately 7% in average accuracy, but a persistent ∼19% (relative) gap to the Oracle remains.
- Performance Convergence: The best open-source routers (EmbedLLM, GraphRouter, Avengers, MODEL-SAT) report nearly identical performance, with gains over BestSingle but nontrivial recall failures on specialist or rare queries.
- Cost-Efficiency: In mixed or cost-first regimes, flagship routers (GraphRouter PF, Avengers-Pro) match Oracle-like trade-offs, achieving up to ∼32% cost reduction with no accuracy loss. Several prominent baselines and commercial APIs (HybridLLM, FrugalGPT, OpenRouter) fail to outperform BestSingle.
- Diminishing Returns & Curation: Oracle performance plateaus after roughly 10 models; curated subsets outperform larger, randomly assembled pools.
- Latency Impact: Empirical example (Qwen3-Thinking vs GLM-4.6) demonstrates large latency variation at matched cost–accuracy, highlighting the importance of tri-objective optimization.
Ablation studies show that embedding backbone choice (gte-qwen2-7B-instruct vs. nli-bert-base or all-MiniLM-L6-v2) affects accuracy by less than 1%, indicating that bottlenecks lie elsewhere in the routing stack.
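The performance–cost trade-off regime underlying these results can be illustrated with a toy sweep over a cost weight. Using the true 0/1 correctness as the score makes this an oracle-style upper bound; a real router would substitute predicted scores. All numbers are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 0/1 scores for four models of increasing cost and quality.
n = 2000
cost = np.array([0.1, 0.3, 1.0, 3.0])
p_correct = np.array([0.55, 0.65, 0.75, 0.80])
S = (rng.random((n, 4)) < p_correct).astype(float)

# For each cost weight lam, route query q to argmax_i S[q, i] - lam * cost[i]
# and record the induced (average cost, accuracy) operating point.
curve = []
for lam in [0.0, 0.1, 0.3, 1.0, 3.0, 10.0]:
    choice = np.argmax(S - lam * cost, axis=1)
    acc = S[np.arange(n), choice].mean()
    avg_cost = cost[choice].mean()
    curve.append((lam, float(avg_cost), float(acc)))
# As lam grows, the router slides from oracle-like accuracy toward the
# cheapest model: accuracy and average cost both fall monotonically.
```

Sweeping the weight traces exactly the kind of accuracy–cost curve the benchmark's trade-off evaluation aggregates into Pareto frontiers.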
6. Limitations and Future Directions
Present limitations of LLMRouterBench (Li et al., 12 Jan 2026):
- The current pool comprises 9 open-source routers and 1 commercial router; coverage of recent methods (e.g., IRT-Router, ProxRouter, RADAR) is incomplete.
- The 21-task suite omits very long-context, multimodal, or specialized vertical domains.
- Latency is estimated rather than directly measured and may not reflect real-world deployment environments.
- Model-recall failures, especially on rare/specialist queries, remain unsolved, contributing to the persistent gap to Oracle performance.
Recommended research avenues include: development of uncertainty/difficulty-aware routers; joint pool curation and selection protocols; fully tri-objective frameworks (accuracy, cost, latency); and extension to multilingual, multimodal, or domain-adaptive routing tasks.
7. Related Benchmarks and Position in the Ecosystem
LLMRouterBench consolidates and advances prior router benchmarks such as RouterBench (Hu et al., 2024) and RouterEval (Huang et al., 8 Mar 2025), which focus on smaller pools, different metrics, or lack large-scale, multi-domain, and cost-aware evaluation. It extends classical methodology from binary and preference-based classification (Shnitzer et al., 2023, Kassem et al., 20 Mar 2025), incorporates empirical insights from Router-Dataset paradigms (e.g., RouteMix in InferenceDynamics (Shi et al., 22 May 2025)), and supports both supervised and heuristic routing families. PersonaeRoute-Bench (Dai et al., 21 Nov 2025) introduces user-personalized routing, which is orthogonal and complementary. LLMRouterBench’s unified protocol and modular design have become the standard backbone for empirical router comparison, reproducible ablation studies, and scalable integration of new LLMs and tasks.
Key references: (Li et al., 12 Jan 2026, Hu et al., 2024, Huang et al., 8 Mar 2025, Dai et al., 21 Nov 2025, Shnitzer et al., 2023, Jin et al., 4 Jun 2025, Shi et al., 22 May 2025).