
RouterBench Dataset for LLM Routing

Updated 2 September 2025
  • RouterBench Dataset is a large-scale, curated benchmark that systematically evaluates multi-LLM routing systems using over 405,000 inference outcomes across 64 tasks.
  • It provides a detailed cost–quality analytic framework using metrics like AIQ and convex hull operations to compare routing strategies and model selection.
  • The dataset offers actionable insights for both theoretical research and practical deployments by balancing cost efficiency with quality in LLM responses.

RouterBench Dataset is a large-scale, curated benchmark and evaluation framework designed to systematically assess the efficacy of LLM routing systems. Routing systems select from among multiple LLMs to optimize for cost, accuracy, or other resource-constrained objectives. RouterBench formalizes the evaluation of these systems via a comprehensive dataset containing over 405,000 inference outcomes, a rigorous cost–quality analytic framework, and standardized metrics, providing both theoretical and practical tools for the field of multi-LLM routing.

1. Dataset Composition and Structure

The RouterBench Dataset comprises more than 405,000 inference outcomes generated by 11 different LLMs evaluated on 8 representative datasets spanning 64 distinct tasks. These benchmarks cover key domains such as:

  • Commonsense reasoning: Hellaswag, Winogrande, ARC Challenge
  • Knowledge-based language understanding: MMLU
  • Conversational intelligence: MT-Bench
  • Mathematical reasoning: GSM8K
  • Coding: MBPP
  • Retrieval-augmented generation: RAG

Each entry in the dataset is annotated with:

  • Sample identifier (e.g., mmlu-astronomy.val.5) encoding the sub-task, split, and index
  • Model name (e.g., GPT-4, Llama-70B-chat)
  • Source dataset name (e.g., hellaswag.dev.v0)
  • Prompt text
  • Model response
  • Performance indicator (binary accuracy, normalized score, etc.)
  • Cost estimation (USD-equivalent; API pricing for proprietary, Together AI estimates for open-source)
  • Ground truth label

This organization enables multi-dimensional analysis of cost, quality, and task-specific performance. The dataset is suited specifically for the evaluation of routing strategies whose goal is to balance output quality with monetary expense (Hu et al., 18 Mar 2024).

Attribute        Type/Example            Description
---------------  ----------------------  ------------------------------------------
Identifier       mmlu-astronomy.val.5    Encodes dataset, split, and index
Model Name       GPT-4                   LLM identity
Source Dataset   hellaswag.dev.v0        Benchmark dataset reference
Cost Estimate    $0.002                  API- or ledger-based per-query cost (USD)
Performance      1 / 0 / float           Binary or graded quality score
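
As a concrete illustration of this schema, the sketch below builds a toy in-memory table with the attributes above (all values and column names are invented for illustration; consult the released files for the actual layout) and computes each model's average quality and per-query cost:

```python
import pandas as pd

# Toy rows mirroring the RouterBench schema above. All values are invented
# for illustration; the released files should be consulted for the real
# layout and column names.
rows = [
    {"sample_id": "mmlu-astronomy.val.5", "model": "gpt-4",
     "dataset": "mmlu", "performance": 1.0, "cost": 0.0020},
    {"sample_id": "mmlu-astronomy.val.5", "model": "llama-70b-chat",
     "dataset": "mmlu", "performance": 0.0, "cost": 0.0003},
    {"sample_id": "gsm8k.test.17", "model": "gpt-4",
     "dataset": "gsm8k", "performance": 1.0, "cost": 0.0031},
    {"sample_id": "gsm8k.test.17", "model": "llama-70b-chat",
     "dataset": "gsm8k", "performance": 1.0, "cost": 0.0004},
]
df = pd.DataFrame(rows)

# Per-model mean quality and mean per-query cost: the raw material for the
# cost-quality analysis in the next section.
summary = df.groupby("model")[["performance", "cost"]].mean().sort_values("cost")
print(summary)
```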

2. Evaluation Metrics and Analytic Framework

RouterBench implements a joint cost–quality evaluation schema. Performance and cost are both empirically estimated over the dataset, yielding metrics:

  • Expected cost for an LLM $m$: $c_m = \mathbb{E}[c(\text{LLM}_m(x)) \mid x \in D]$
  • Expected quality for an LLM $m$: $q_m = \mathbb{E}[q(\text{LLM}_m(x)) \mid x \in D]$
  • Router cost: $c_{R_\theta} = \mathbb{E}[c(R_\theta(x)) \mid x \in D]$

Trade-offs are visualized in the cost–quality $(c, q)$ plane, with convex hull operations applied to highlight non-dominated routing strategies (ensuring quality does not decrease with increasing cost).

To enable quantitative comparison, RouterBench introduces the AIQ (Average Improvement in Quality) metric:

$$\text{AIQ}(R_\theta) = \frac{1}{c_{\max} - c_{\min}} \int_{c_{\min}}^{c_{\max}} \widetilde{R}_\theta(c)\, dc$$

where $\widetilde{R}_\theta(c)$ is the interpolated quality of router $R_\theta$ at cost $c$.

Linear interpolation and extrapolation model router behaviors between and beyond the observed cost–quality pairs. The Zero router baseline, which mixes the convex-hull models probabilistically without inspecting the query, serves as the lower bound that any learned router should beat.
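
A minimal sketch of the AIQ computation under the definitions above, assuming the router's curve is given as sampled (cost, quality) points sorted by cost (function name and interface are mine, not from the RouterBench codebase):

```python
import numpy as np

def aiq(costs, qualities):
    """Average Improvement in Quality of a router's cost-quality curve.

    costs, qualities: points on the (already non-decreasing) curve, sorted
    by cost. Linear interpolation between points matches the piecewise-
    linear curve assumed in the AIQ formula above.
    """
    c = np.asarray(costs, dtype=float)
    q = np.asarray(qualities, dtype=float)
    # The trapezoidal rule integrates the piecewise-linear curve exactly;
    # we then normalise by the width of the cost range.
    integral = np.sum(0.5 * (q[1:] + q[:-1]) * np.diff(c))
    return integral / (c[-1] - c[0])

# Example: quality rising from 0.62 to 0.81 as the per-query budget grows.
print(aiq([0.001, 0.004, 0.010], [0.62, 0.75, 0.81]))  # ~0.748
```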

3. Theoretical Model of LLM Routing

RouterBench formalizes the routing problem by representing individual LLMs and routing systems as points or curves in a two-dimensional cost–quality space. Key theoretical constructs include:

  • Linear interpolation: constructs router policies that probabilistically mix two existing routing strategies, yielding intermediate cost–quality trade-offs.
  • Extrapolation: extends a router's coverage to higher cost domains through redundant processing (increasing cost without quality gain).
  • Non-decreasing convex hull: extracts the Pareto-optimal cost–quality pairs from possibly complex router outputs, yielding a succinct cost–quality frontier for comparative analysis (see the sketch below).

This framework rigorously defines how routing systems may be evaluated against the landscape of available LLMs, capturing the essential trade-off that governs multi-LLM deployments. The envelope perspective clarifies whether new routing algorithms truly advance the practical frontier or merely interpolate between existing approaches.
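
The non-decreasing convex hull is straightforward to compute. The sketch below (a minimal implementation of the construct as described, not the reference code) takes each model's or router's (cost, quality) point, builds the upper convex hull with Andrew's monotone chain, and then discards points where quality would drop as cost rises:

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); >= 0 means a non-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def nondecreasing_convex_hull(points):
    """Monotone upper frontier of (cost, quality) points, as described above.

    A minimal sketch: upper convex hull via Andrew's monotone chain, then
    points whose quality drops as cost rises are discarded.
    """
    pts = sorted(set(points))                 # ascending cost
    hull = []
    for p in pts:
        # Pop until the chain turns clockwise, keeping only the upper hull.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    frontier, best_q = [], float("-inf")
    for c, q in hull:                         # enforce non-decreasing quality
        if q > best_q:
            frontier.append((c, q))
            best_q = q
    return frontier

# Four models as (per-query cost, quality); the last one is dominated.
models = [(0.0003, 0.55), (0.0008, 0.62), (0.0020, 0.78), (0.0040, 0.74)]
print(nondecreasing_convex_hull(models))      # drops the (0.0040, 0.74) model
```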

4. Comparative Analysis of Routing Approaches

Empirical investigations using RouterBench cover both predictive routers (e.g., k-nearest neighbors, multi-layer perceptrons) and non-predictive, cascading routers, which query models sequentially until a quality threshold is met (see the sketch below).

Findings include:

  • Simple predictive routers can match or slightly outperform the best monolithic LLMs on quality, at lower cost for certain tasks.
  • The “Oracle” router, which selects per query the cheapest LLM that attains the best achievable quality, demonstrates substantial cost savings, revealing that expensive models are frequently unnecessary for high response quality.
  • Cascading routers with relaxed quality thresholds can degrade rapidly in performance, underscoring their parameter sensitivity.
  • These comparative studies indicate which designs generalize robustly across diverse task families.

A plausible implication is that cost-aware routing, especially with properly tuned predictive methodologies, enables efficient use of resources without a significant sacrifice in output quality. The convex hull analysis quantifies where each router lies in the overall trade-off spectrum.
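
Because every (prompt, model) outcome is pre-recorded, cascading routers can be replayed offline. The sketch below is schematic: in deployment the accept/reject decision requires an online quality judge, and all names and numbers here are invented:

```python
def simulate_cascade(outcomes, model_order, threshold=1.0):
    """Replay a cascading router on one query's recorded outcomes.

    outcomes: model -> (quality, cost) from the dataset for this query.
    model_order: models tried cheapest-first.
    Every attempted call is paid for; the cascade stops at the first
    response whose recorded quality meets the threshold.
    """
    total_cost, quality = 0.0, 0.0
    for model in model_order:
        quality, cost = outcomes[model]
        total_cost += cost            # each attempt is billed
        if quality >= threshold:      # accept this response and stop
            break
    return quality, total_cost

# Recorded outcomes for one query (invented numbers).
outcomes = {
    "mixtral-8x7b": (0.0, 0.0004),
    "gpt-3.5":      (1.0, 0.0009),
    "gpt-4":        (1.0, 0.0031),
}
print(simulate_cascade(outcomes, ["mixtral-8x7b", "gpt-3.5", "gpt-4"]))
# -> (1.0, 0.0013): correct answer at far less than gpt-4's 0.0031 per query.
```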

5. Use in Subsequent Research and Benchmarks

RouterBench has been adopted as a primary validation bed for later routing methods, notably CARROT (Cost AwaRe Rate Optimal rouTer) (Somerstep et al., 5 Feb 2025). In these studies:

  • CARROT uses plug-in estimators for both cost and accuracy given a query $x$, implements risk-weighted selection of LLMs, and empirically achieves minimax-optimal trade-offs on RouterBench prompts (see the sketch after this list).
  • RouterBench's structure (single-correct-answer prompts, explicit cost metrics, diverse but classical task representation) supports rigorous risk calculations and permits direct comparison to other routers and to the cost–accuracy curves of individual models.
  • Other benchmarks, such as SPROUT (Smart Price-aware ROUTing), extend the diversity (e.g., chat-style prompts, additional models, automated evaluation), but RouterBench remains the standard zero-shot baseline for model selection algorithms.
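
A schematic of the plug-in routing rule described above (an illustration of the idea, not CARROT's actual code; the estimator interfaces, the risk form, and the trade-off weight lam are all assumptions here):

```python
def plug_in_route(x, models, q_hat, c_hat, lam=0.5):
    """Route query x to the model minimising estimated risk:
    predicted cost plus lam times predicted error (1 - predicted accuracy).
    q_hat and c_hat stand in for fitted plug-in estimators, e.g. regressors
    over prompt embeddings.
    """
    return min(models, key=lambda m: c_hat(m, x) + lam * (1.0 - q_hat(m, x)))

# Toy constant estimators standing in for learned ones: (accuracy, cost).
table = {"gpt-4": (0.86, 0.0030), "llama-70b-chat": (0.71, 0.0004)}
q_hat = lambda m, x: table[m][0]
c_hat = lambda m, x: table[m][1]

query = "Prove that sqrt(2) is irrational."
# A small lam prioritises cost (routes to the open model); a larger lam
# prioritises accuracy (routes to gpt-4).
print(plug_in_route(query, list(table), q_hat, c_hat, lam=0.01))  # llama-70b-chat
print(plug_in_route(query, list(table), q_hat, c_hat, lam=0.50))  # gpt-4
```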

A plausible implication is that RouterBench's clarity and standardization have contributed substantively to the theoretical and empirical development of LLM routing.

6. Practical Implications and Accessibility

RouterBench provides actionable assets for both theoretical and applied development of multi-LLM routing systems:

  • Quantifies trade-offs between monetary cost and response quality, informing model selection policies for resource-constrained deployments.
  • Facilitates economically viable LLM architectures by enabling selective invocation of competitive open-source models where possible, reserving higher-cost options for tasks that strictly require them.
  • Reduces the barrier to empirical research by providing pre-generated inference outcomes, obviating expensive online model queries.
  • The extensible dataset and clear documentation support integration of new metrics (e.g., latency, throughput) and additional tasks or models.

Access to the dataset and code is openly provided at https://github.com/withmartian/routerbench. Users should be proficient with Python and modern data analysis tools to manipulate, extend, and apply the dataset to emerging routing strategies and model pools.

7. Position within the LLM Routing Landscape

RouterBench establishes both a practical and a theoretical standard for multi-LLM routing system evaluation. Its cost–quality framework formalizes the critical analytic dimensions; its dataset supports reproducible experiments in cost-aware decision-making; its impact is evident in subsequent research that uses it as a baseline and validation suite. RouterBench enables new algorithmic innovations to be quantitatively compared and rigorously benchmarked, fostering the development of scalable, efficient, and robust LLM routing systems suitable for large-scale deployment and future research (Hu et al., 18 Mar 2024; Somerstep et al., 5 Feb 2025).

References (2)

  • Hu et al. (18 Mar 2024). RouterBench: A Benchmark for Multi-LLM Routing System.
  • Somerstep et al. (5 Feb 2025). CARROT: A Cost Aware Rate Optimal Router.