RouterBench Dataset for LLM Routing
- RouterBench Dataset is a large-scale, curated benchmark that systematically evaluates multi-LLM routing systems using over 405,000 inference outcomes across 64 tasks.
- It provides a detailed cost–quality analytic framework using metrics like AIQ and convex hull operations to compare routing strategies and model selection.
- The dataset offers actionable insights for both theoretical research and practical deployments by balancing cost efficiency with quality in LLM responses.
RouterBench Dataset is a large-scale, curated benchmark and evaluation framework designed to systematically assess the efficacy of LLM routing systems. Routing systems select from among multiple LLMs to optimize for cost, accuracy, or other resource-constrained objectives. RouterBench formalizes the evaluation of these systems via a comprehensive dataset containing over 405,000 inference outcomes, a rigorous cost–quality analytic framework, and standardized metrics, providing both theoretical and practical tools for the field of multi-LLM routing.
1. Dataset Composition and Structure
The RouterBench Dataset comprises more than 405,000 inference outcomes generated by 11 different LLMs evaluated on 8 representative datasets spanning 64 distinct tasks. These benchmarks cover key domains such as:
- Commonsense reasoning: Hellaswag, Winogrande, ARC Challenge
- Knowledge-based language understanding: MMLU
- Conversational intelligence: MT-Bench
- Mathematical reasoning: GSM8K
- Coding: MBPP
- Retrieval-augmented generation: RAG
Each entry in the dataset is annotated with:
- Sample identifier (e.g., mmlu-astronomy.val.5) encoding the sub-task, split, and index
- Model name (e.g., GPT-4, Llama-70B-chat)
- Source dataset name (e.g., hellaswag.dev.v0)
- Prompt text
- Model response
- Performance indicator (binary accuracy, normalized score, etc.)
- Cost estimation (USD-equivalent; API pricing for proprietary, Together AI estimates for open-source)
- Ground truth label
This organization enables multi-dimensional analysis of cost, quality, and task-specific performance. The dataset is suited specifically for the evaluation of routing strategies whose goal is to balance output quality with monetary expense (Hu et al., 18 Mar 2024).
| Attribute | Type/Example | Description |
|---|---|---|
| Identifier | mmlu-astronomy.val.5 | Encodes dataset, split, and index |
| Model Name | GPT-4 | LLM identity |
| Source Dataset | hellaswag.dev.v0 | Benchmark dataset reference |
| Cost Estimate | $0.002 | API or ledger-based per-query cost (USD) |
| Performance | 1 / 0 / float | Binary or graded quality score |
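For illustration, the following minimal Python sketch loads and summarizes records of this form with pandas; the file name and column names are assumptions based on the schema above, so the repository documentation should be consulted for the actual release format.

```python
import pandas as pd

# Hypothetical file name; the repository documents the actual release artifacts.
df = pd.read_pickle("routerbench_0shot.pkl")

# Inspect the available columns (names used below are assumptions).
print(df.columns.tolist())

# Per-model average quality and cost: the raw material for the cost–quality
# analysis in Section 2. Assumes a long format with one row per (prompt, model) pair.
summary = (
    df.groupby("model")[["quality", "cost"]]
      .mean()
      .sort_values("cost")
)
print(summary)
```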
2. Evaluation Metrics and Analytic Framework
RouterBench implements a joint cost–quality evaluation schema. Performance and cost are both empirically estimated over the dataset, yielding metrics:
- Expected cost of an LLM $m$: $c_m = \mathbb{E}[c(\text{LLM}_m(x)) \mid x \in D]$
- Expected quality of an LLM $m$: $q_m = \mathbb{E}[q(\text{LLM}_m(x)) \mid x \in D]$
- Expected cost of a router $R_\theta$: $c_{R_\theta} = \mathbb{E}[c(R_\theta(x)) \mid x \in D]$, with its expected quality $q_{R_\theta}$ defined analogously
- Average Improvement in Quality: $\text{AIQ}(R_\theta) = \frac{1}{c_{\text{max}}-c_{\text{min}}} \int_{c_{\text{min}}}^{c_{\text{max}}} \widetilde{R}_\theta(c) \, dc$, where $\widetilde{R}_\theta(c)$ is the (interpolated) quality the router achieves at cost $c$.
Linear interpolation and extrapolation model router behavior between and beyond existing cost–quality operating points. The Zero router baseline, which routes among models along the convex hull without using any information from the query, serves as a lower bound that a useful predictive router should exceed.
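As an illustration of the AIQ computation, the sketch below linearly interpolates a router's cost–quality operating points and averages the resulting quality over the cost range; the function and variable names are illustrative and not taken from the RouterBench codebase.

```python
import numpy as np

def aiq(costs, qualities, c_min=None, c_max=None):
    """Average quality of a router's cost–quality curve over [c_min, c_max]."""
    costs = np.asarray(costs, dtype=float)
    qualities = np.asarray(qualities, dtype=float)
    order = np.argsort(costs)
    costs, qualities = costs[order], qualities[order]

    c_min = costs[0] if c_min is None else c_min
    c_max = costs[-1] if c_max is None else c_max

    # Linearly interpolate the curve on a dense cost grid, then take the
    # normalized trapezoidal integral (the mean quality over the cost range).
    grid = np.linspace(c_min, c_max, 1001)
    interp = np.interp(grid, costs, qualities)
    area = np.sum(0.5 * (interp[1:] + interp[:-1]) * np.diff(grid))
    return area / (c_max - c_min)

# Three hypothetical operating points of a router (cost in USD, quality in [0, 1]).
print(aiq([0.001, 0.005, 0.02], [0.55, 0.70, 0.78]))
```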
3. Theoretical Model of LLM Routing
RouterBench formalizes the routing problem by representing individual LLMs and routing systems as points or curves in a two-dimensional cost–quality space. Key theoretical constructs include:
- Linear interpolation: Allows construction of router policies that probabilistically mix between two existing routing strategies, resulting in intermediate cost–quality trade-offs.
- Extrapolation: Extends a router's curve into higher-cost regimes via redundant processing, increasing cost without any gain in quality.
- Non-decreasing convex hull: Extracts the set of Pareto-optimal cost–quality pairs from possibly complex router outputs, yielding a succinct cost–quality frontier for comparative analysis.
This framework rigorously defines how routing systems may be evaluated against the landscape of available LLMs, capturing the essential trade-off that governs multi-LLM deployments. The envelope perspective clarifies whether new routing algorithms truly advance the practical frontier or merely interpolate between existing approaches.
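The sketch below shows one way to extract such a non-decreasing convex hull from per-model (cost, quality) points; it is a simplified illustration under the assumptions stated in the comments, not the benchmark's own implementation.

```python
def _cross(o, a, b):
    """Cross product of vectors OA and OB; > 0 indicates a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def nondecreasing_convex_hull(points):
    """Upper convex hull of (cost, quality) points, restricted to its rising segment."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Monotone-chain upper hull: pop points that would make the envelope non-concave.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Keep only the non-decreasing prefix: once quality starts to fall, the
    # remaining hull points are dominated by cheaper points already kept.
    frontier = [hull[0]]
    for p in hull[1:]:
        if p[1] > frontier[-1][1]:
            frontier.append(p)
        else:
            break
    return frontier

# Hypothetical per-model (mean cost, mean quality) points.
models = [(0.001, 0.60), (0.005, 0.70), (0.010, 0.65), (0.020, 0.80)]
print(nondecreasing_convex_hull(models))  # [(0.001, 0.6), (0.005, 0.7), (0.02, 0.8)]
```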
4. Comparative Analysis of Routing Approaches
Empirical investigations using RouterBench include both predictive routers (e.g., k-nearest neighbors, multi-layer perceptrons) and non-predictive/cascading routers (sequential evaluation of models until a quality threshold is met).
Findings include:
- Simple predictive routers can match or slightly outperform the best monolithic LLMs on quality, at lower cost for certain tasks.
- The “Oracle” router, which selects for each query the cheapest LLM that attains the best achievable quality, demonstrates substantial cost savings, revealing that the most expensive models are frequently unnecessary for high response quality.
- Cascading routers with relaxed quality thresholds may degrade rapidly in performance, underscoring parameter sensitivity.
- These comparisons also indicate which router designs generalize robustly across diverse task families.
A plausible implication is that cost-aware routing, especially with properly tuned predictive methodologies, facilitates efficient use of resources without a significant sacrifice in output quality. The convex hull analysis quantifies where each router lies in the overall trade-off spectrum.
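To make the predictive-router pattern concrete, the following sketch implements a simple k-nearest-neighbour router: it predicts each model's quality from similar training prompts and picks the cheapest model expected to clear a quality threshold. The class interface, embedding inputs, and threshold are hypothetical and only illustrate the approach.

```python
import numpy as np

class KNNRouter:
    """Toy predictive router: per-model quality predicted from k nearest training prompts."""

    def __init__(self, embeddings, qualities, costs, k=10):
        # embeddings: (n_train, d) prompt embeddings from a training split
        # qualities:  (n_train, n_models) observed quality of each model per prompt
        # costs:      (n_models,) average per-query cost of each model in USD
        self.embeddings = embeddings
        self.qualities = qualities
        self.costs = costs
        self.k = k

    def route(self, query_embedding, quality_threshold=0.7):
        # Predict each model's quality as its mean quality on the k nearest prompts.
        dists = np.linalg.norm(self.embeddings - query_embedding, axis=1)
        nearest = np.argsort(dists)[: self.k]
        predicted = self.qualities[nearest].mean(axis=0)  # shape: (n_models,)

        # Choose the cheapest model predicted to clear the threshold;
        # fall back to the highest-predicted-quality model otherwise.
        viable = np.where(predicted >= quality_threshold)[0]
        if viable.size:
            return int(viable[np.argmin(self.costs[viable])])
        return int(np.argmax(predicted))
```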
5. Use in Subsequent Research and Benchmarks
RouterBench has been adopted as a primary validation bed for later routing methods, notably CARROT (Cost AwaRe Rate Optimal rouTer) (Somerstep et al., 5 Feb 2025). In these studies:
- CARROT uses plug-in estimators for both cost and accuracy given a query, implements risk-weighted selection of LLMs, and empirically achieves minimax-optimal trade-offs on RouterBench prompts (a simplified selection rule is sketched at the end of this section).
- RouterBench’s structure (single-correct-answer prompts, explicit cost metrics, diverse but classical task representation) supports rigorous risk calculations and permits direct comparison to other routers and to the cost–accuracy curves of individual models.
- Other benchmarks, such as SPROUT (Smart Price-aware ROUTing), extend the diversity (e.g., chat-style prompts, additional models, automated evaluation), but RouterBench remains a standard zero-shot baseline for model-selection algorithms.
A plausible implication is that RouterBench’s clarity and standardization have contributed substantively to the axiomatic and empirical development of LLM routing theory.
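To illustrate plug-in, cost-aware selection in the spirit of such routers, the sketch below scores each model by its estimated accuracy minus a cost penalty for a single query; the estimators and the weight lam are placeholders and do not reproduce CARROT's actual procedure.

```python
import numpy as np

def select_model(acc_hat, cost_hat, lam=0.5):
    """Pick the model maximizing estimated accuracy minus lam times estimated cost.

    acc_hat, cost_hat: (n_models,) plug-in estimates for a single query.
    Sweeping lam from 0 upward traces a cost–accuracy curve that can be compared
    against RouterBench's convex-hull frontier.
    """
    scores = np.asarray(acc_hat) - lam * np.asarray(cost_hat)
    return int(np.argmax(scores))

# Example with three hypothetical models (cheap/weak, mid-range, expensive/strong).
print(select_model([0.55, 0.70, 0.80], [0.0005, 0.002, 0.02], lam=5.0))
```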
6. Practical Implications and Accessibility
RouterBench provides actionable assets for both theoretical and applied development of multi-LLM routing systems:
- Quantifies trade-offs between monetary cost and response quality, informing model selection policies for resource-constrained deployments.
- Facilitates economically viable LLM architectures by enabling selective invocation of competitive open-source models when possible, reserving higher-cost options for tasks that strictly require them.
- Reduces the barrier to empirical research by providing pre-generated inference outcomes, obviating the need for expensive online model queries.
- The extensible dataset and clear documentation support further integration of new metrics (e.g., latency, throughput) and additional tasks or models.
Access to the dataset and code is openly provided at https://github.com/withmartian/routerbench. Users should be proficient with Python and modern data analysis tools to manipulate, extend, and apply the dataset to emerging routing strategies and model pools.
7. Position within the LLM Routing Landscape
RouterBench establishes both a practical and theoretical standard for multi-LLM routing system evaluation. Its cost–quality framework formalizes critical analytic dimensions; its dataset supports reproducible experiments in cost-aware decision-making; its impact is evident in subsequent research that uses it as a baseline and validation suite. RouterBench’s existence enables new algorithmic innovations to be quantitatively compared and rigorously benchmarked, fostering the development of scalable, efficient, and robust LLM routing systems suitable for large-scale deployment and future research (Hu et al., 18 Mar 2024, Somerstep et al., 5 Feb 2025).