
RouterBench Benchmark

Updated 11 November 2025
  • RouterBench is a benchmark and dataset designed to assess multi-LLM routing systems by providing over 405k precomputed inference records and standardized evaluation metrics.
  • It covers diverse application domains such as commonsense reasoning, QA, conversation, math, coding, and retrieval-augmented generation with 11 representative LLMs.
  • The framework leverages cost–quality analysis, convex-hull evaluation, and the AIQ metric to compare routing policies without incurring real-time compute costs.

RouterBench is a benchmark and dataset designed to systematically assess multi-LLM routing systems by providing a large pool of precomputed inference outcomes, model metadata, and a formal evaluation framework (Hu et al., 18 Mar 2024). Its primary goal is to catalyze methodological rigor and reproducibility in LLM router research by enabling training and evaluation of routing policies under diverse cost–quality regimes without incurring real-time compute costs. RouterBench should not be confused with approaches that repurpose existing benchmarks for LLM selection; it is a purpose-built artifact with well-defined data structures, representative tasks, and standardized metrics.

1. Scope, Motivation, and Design Principles

The core motivation for RouterBench is the proliferation of LLMs with heterogeneous performance profiles: models differ substantially in cost (e.g., API pricing, inference-time latency) and accuracy (task-specific quality). No single LLM dominates across all real-world tasks, creating demand for serving strategies that dispatch user prompts to the optimal model for each case. RouterBench addresses the previously missing ingredient—a unified benchmark and dataset for rigorous router design, testing, and statistical comparison.

Design goals:

  • To furnish over 405k pre-computed LLM outputs augmented with cost, performance, and ground-truth metadata
  • To span major application domains: commonsense reasoning, knowledge QA, conversation, math, code generation, and retrieval-augmented generation (RAG)
  • To provide a theoretical evaluation framework that enables analytical comparison between routing policies, leveraging cost–quality trade-offs, convex-hull analysis, and the Average Improvement in Quality (AIQ) metric

Intended use cases include:

  • Developing predictive routing models (e.g., K-Nearest Neighbors, MLP, or transformer-based classifiers) that select a model per instance; a minimal sketch follows this list
  • Evaluating non-predictive strategies (e.g., cascades, rule-based selectors)
  • Benchmarking novel routing methodologies against established oracles, baseline static selectors, and theoretical bounds
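
As a concrete illustration of the first use case, the following is a minimal sketch of a KNN-based predictive router trained on RouterBench-style records. It assumes the records have been loaded into a pandas DataFrame with the fields described in Section 2; the embed function is a stand-in for any fixed-dimensional text encoder, and all function names are illustrative rather than part of the RouterBench codebase.

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def embed(texts):
    # Placeholder encoder: a byte-level bag-of-counts vector. In practice this
    # would be replaced by a proper sentence embedding model.
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for byte in text.encode("utf-8"):
            vecs[i, byte] += 1.0
    return vecs

def train_knn_router(df, k=16):
    # Label every prompt with the model that achieved the highest quality on it
    # (ties broken by lower cost), then fit a KNN classifier on prompt embeddings.
    ranked = df.sort_values(["performance", "cost"], ascending=[False, True])
    best = ranked.groupby("sample_id", as_index=False).first()
    X = embed(best["prompt"].tolist())
    y = best["model_name"].to_numpy()
    return KNeighborsClassifier(n_neighbors=k).fit(X, y)

def route(router, prompts):
    # Dispatch each new prompt to the model predicted to perform best.
    return router.predict(embed(prompts))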

2. Data Collection, Contents, and Structure

RouterBench comprises 405,467 inference records, sampled from eight standard NLP/math/programming datasets plus an in-house RAG set. It covers 11 representative LLMs, ranging from open-source models (Llama-70B-chat, Mixtral-8x7B, Yi-34B-chat, Code-Llama-34B, Mistral-7B, WizardLM-13B) to proprietary API models (GPT-4, GPT-3.5-turbo, Claude-v1/v2, the You.com API, sonar-small/medium-online).

Domains and dataset breakdown:

Domain | Datasets / Tasks | Example sizes
Commonsense Reasoning | HellaSwag, Winogrande, ARC-Challenge | 10k / 40k / 2.5k
Knowledge-based QA | MMLU (57 subject splits) | ~3k total
Conversation | MT-Bench (GPT-4 judged) | 80 prompts
Math | GSM8K | 8k
Coding | MBPP | 1k
Retrieval-augmented Gen. | In-house RAG queries | 800

Metadata per inference record includes:

  • sample_id: formatted as {dataset}.{split}.{index}
  • eval_name: dataset source tag
  • model_name: LLM identifier
  • prompt: full text input
  • model_response: LLM output text
  • true_label: gold answer(s)
  • performance: binary (exact match) or continuous ([0,1] normalized rating)
  • cost: float, USD estimate per inference

All entries are stored in the JSON Lines format, facilitating streaming access and efficient querying.
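
For illustration, a single line in one of the JSONL files has the following shape (the field names match the schema above; the specific identifiers, score, and cost shown here are invented placeholders rather than actual dataset contents):

{"sample_id": "gsm8k.test.0042", "eval_name": "gsm8k", "model_name": "gpt-3.5-turbo", "prompt": "...", "model_response": "...", "true_label": "18", "performance": 1.0, "cost": 0.00087}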

3. Annotation, Labeling, and Performance Statistics

Ground-truth labels are directly sourced from the underlying benchmark (exact-match for classification/QA, correctness test-suites for code, etc.). Performance scoring varies by domain: exact-match accuracy is used for structured tasks, while normalized GPT-4 ratings are used for conversational, code, and RAG sets.

RouterBench does not store a fixed “best-LLM” label for each example; researchers must compute the oracle choice per record, typically by selecting the highest-quality response and breaking ties on cost, as sketched below.
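
A minimal sketch of this oracle computation, assuming the records have been concatenated into a pandas DataFrame df with the fields listed in Section 2 (the function name and loading step are illustrative, not part of the released tooling):

import pandas as pd

def oracle_choices(df):
    # For every sample, keep the record with the highest quality score,
    # breaking ties by choosing the cheaper model.
    ranked = df.sort_values(["sample_id", "performance", "cost"],
                            ascending=[True, False, True])
    return ranked.groupby("sample_id", as_index=False).first()

# Example: mean quality and mean cost achieved by the oracle router.
# oracle = oracle_choices(df)
# print(oracle["performance"].mean(), oracle["cost"].mean())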

Statistical summaries (a sketch for recomputing such aggregates from the raw records follows this list):

  • Per-model accuracy across tasks: ranges from ~38% (Claude-instant) to ~83% (GPT-4); standard deviation ≈ 15–20 pp
  • Per-model mean cost: from $0.08 (Mistral-7B) to $4.09 (GPT-4); σ ≈ 1.2 USD
  • Oracle router: overall mean performance ≈ 0.96, mean cost ≈ $0.30; it routinely picks mid-range open-source models on specific tasks
  • Cost–quality correlation: ρ ≈ 0.7, but variance is substantial; some cheaper models match or exceed costly ones on select domains
  • Task-wise difficulty: ARC-Challenge yields the lowest oracle accuracy, while GSM8K sits at ~75% with notably lower cost
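
A minimal sketch for recomputing these per-model aggregates and the cost–quality correlation, again assuming a pandas DataFrame df built from the JSONL files (column names follow Section 2; the function names are illustrative):

import pandas as pd

def per_model_summary(df):
    # Mean quality and mean cost for each of the 11 LLMs.
    return df.groupby("model_name")[["performance", "cost"]].mean()

def cost_quality_correlation(df):
    # Pearson correlation between per-model mean cost and per-model mean quality.
    summary = per_model_summary(df)
    return summary["cost"].corr(summary["performance"])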

Performance scoring for the non-exact-match domains relies on automated GPT-4 judging, which may introduce systematic bias because a single model serves as the sole evaluation source.

4. Theoretical Framework: Cost–Quality Analysis and AIQ Metric

RouterBench formalizes router evaluation on the cost–quality plane. Let $c_m = \mathbb{E}_{x \in D}[c(\text{LLM}_m(x))]$ denote the expected cost and $q_m = \mathbb{E}_{x \in D}[q(\text{LLM}_m(x))]$ the expected quality of LLM $m$.

Routers $R_\theta : X \rightarrow L$ are scored by expected cost $c_{R_\theta} = \mathbb{E}_{x \in D}[c(R_\theta(x))]$ and quality $q_{R_\theta} = \mathbb{E}_{x \in D}[q(R_\theta(x))]$. Mixing two routers $R_1, R_2$ with weight $t \in [0,1]$ yields interpolated policies whose cost and quality are the corresponding convex combinations.

The non-decreasing convex hull (NDCH) is constructed over cost–quality points to define the Pareto envelope. The Average Improvement in Quality (AIQ) is computed as

$$\text{AIQ}(R) = \frac{1}{c_{\max} - c_{\min}} \int_{c_{\min}}^{c_{\max}} \widetilde{R}(c) \, \mathrm{d}c$$

where $\widetilde{R}(c)$ is the NDCH-interpolated quality at cost $c$. Higher AIQ indicates a strictly superior cost–quality trade-off.
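
The NDCH construction and the AIQ integral can be approximated numerically; the sketch below summarizes each router or model by its (mean cost, mean quality) point and uses trapezoidal quadrature over a linear interpolation of the hull (function names and the grid resolution are illustrative choices, not the paper's reference implementation):

import numpy as np

def ndch(points):
    # points: iterable of (cost, quality) pairs, one per router or model.
    # Returns the vertices of the non-decreasing convex hull: the concave,
    # non-decreasing upper boundary of the point set in the cost-quality plane.
    pts = sorted(set(map(tuple, points)))
    frontier, best_q = [], float("-inf")
    for c, q in pts:
        if q <= best_q:
            continue                      # dominated: no quality gain for extra cost
        if frontier and frontier[-1][0] == c:
            frontier[-1] = (c, q)         # same cost, better quality: replace
        else:
            frontier.append((c, q))
        best_q = q
    hull = []
    for p in frontier:
        # Pop the last vertex while it lies on or below the chord to p,
        # which keeps the upper envelope concave.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def aiq(points, grid_size=256):
    # Average Improvement in Quality: mean NDCH-interpolated quality over the
    # observed cost range [c_min, c_max], approximated by trapezoidal quadrature.
    hull = ndch(points)
    xs = [c for c, _ in hull]
    ys = [q for _, q in hull]
    c_min = min(c for c, _ in points)
    c_max = max(c for c, _ in points)
    grid = np.linspace(c_min, c_max, grid_size)
    interp = np.interp(grid, xs, ys)      # flat extrapolation outside the hull
    return np.trapz(interp, grid) / (c_max - c_min)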

5. Access, Usage, and Repository Details

RouterBench is hosted at https://github.com/withmartian/routerbench. The data is organized into per-dataset JSONL files (≈100 MB each), supplemented by evaluation scripts, routing baselines, and analysis notebooks. The compressed dataset size is ≈800 MB.

Installation prerequisites:

  • Python ≥ 3.8, pandas, numpy, scikit-learn, transformers
  • Clone the repository and install dependencies via pip install -r requirements.txt

Common access pattern:

import json
from itertools import islice

# Stream the first few records from one of the per-dataset JSONL files.
with open("data/jsonl/hellaswag.dev.jsonl", "r") as f:
    for line in islice(f, 5):
        rec = json.loads(line)
        print(rec["sample_id"], rec["model_name"], rec["performance"], f"{rec['cost']:.5f}")

Licensing is Apache 2.0. Users are expected to cite Q.J. Hu et al., “RouterBench: A Benchmark for Multi-LLM Routing System,” ICML 2024.

6. Applications, Limitations, and Context

RouterBench supports a range of investigations:

  • Training of predictive routers using precomputed labels
  • Benchmarking hybrid or non-predictive routing approaches without incurring further inference cost
  • Comparative study of routing formulations under the AIQ and NDCH theoretical metrics
  • Analysis of LLM complementarity and domain specialization

Key limitations:

  • No latency or throughput metrics are included; the dataset quantifies only cost and accuracy
  • Dataset focuses on 11 LLMs; coverage must be updated manually as new models emerge
  • Tasks are drawn from general benchmarks; niche domains (e.g., low-resource translation) are not addressed
  • RAG set is limited to 800 queries; generalizability to broader retrieval systems is uncertain
  • Quality signals on some tasks derive from automated GPT-4 judging, which may introduce systematic bias

A plausible implication is that RouterBench enables reproducible, large-scale routing studies but should be augmented or extended by researchers addressing real-time, low-resource, or latency-sensitive domains.

7. Common Misconceptions and Relationship to Precursor Work

RouterBench is distinct from prior approaches that repurpose existing evaluation suites for router training. For example, “LLM Routing with Benchmark Datasets” (Shnitzer et al., 2023) leverages per-instance results from HELM and MixInstruct as training sources for correctness predictors, but does not define or create a packaged dataset named “RouterBench,” nor does it provide standardized splits, metadata, or usage APIs. All evaluation in that work is batched on benchmarks originally designed for other purposes.

In summary, RouterBench constitutes the first rigorously constructed dataset and benchmark specifically targeting the multi-LLM routing problem with cost–quality analysis, unified task coverage, and standardized access protocols.
