RouterBench Benchmark
- RouterBench is a benchmark and dataset designed to assess multi-LLM routing systems by providing over 405k precomputed inference records and standardized evaluation metrics.
- It covers diverse application domains such as commonsense reasoning, QA, conversation, math, coding, and retrieval-augmented generation with 11 representative LLMs.
- The framework leverages cost–quality analysis, convex-hull evaluation, and the AIQ metric to compare routing policies without incurring real-time compute costs.
RouterBench is a benchmark and dataset designed to systematically assess multi-LLM routing systems by providing a large pool of precomputed inference outcomes, model metadata, and a formal evaluation framework (Hu et al., 18 Mar 2024). Its primary goal is to catalyze methodological rigor and reproducibility in LLM router research by enabling training and evaluation of routing policies under diverse cost–quality regimes without incurring real-time compute costs. RouterBench should not be confused with approaches that repurpose existing benchmarks for LLM selection; it is a purpose-built artifact with well-defined data structures, representative tasks, and standardized metrics.
1. Scope, Motivation, and Design Principles
The core motivation for RouterBench is the proliferation of LLMs with heterogeneous performance profiles: models differ substantially in cost (e.g., API pricing, inference-time latency) and accuracy (task-specific quality). No single LLM dominates across all real-world tasks, creating demand for serving strategies that dispatch user prompts to the optimal model for each case. RouterBench addresses the previously missing ingredient—a unified benchmark and dataset for rigorous router design, testing, and statistical comparison.
Design goals:
- To furnish over 405k pre-computed LLM outputs augmented with cost, performance, and ground-truth metadata
- To span major application domains: commonsense reasoning, knowledge QA, conversation, math, code generation, and retrieval-augmented generation (RAG)
- To provide a theoretical evaluation framework that enables analytical comparison between routing policies, leveraging cost–quality trade-offs, convex-hull analysis, and the Average Improvement in Quality (AIQ) metric
Intended use cases include:
- Developing predictive routing models (e.g., K-Nearest Neighbor, MLP, transformer-based classifiers) that select the model per instance; a minimal sketch appears after this list
- Evaluating non-predictive strategies (e.g., cascades, rule-based selectors)
- Benchmarking novel routing methodologies against established oracles, baseline static selectors, and theoretical bounds
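To make the first use case concrete, the sketch below trains a simple nearest-neighbor router on precomputed RouterBench records. It is a minimal illustration under stated assumptions, not a reference implementation: TF-IDF features stand in for learned prompt embeddings, and the `records` structure, the per-model cost table, and the `lambda_cost` willingness-to-pay weight are hypothetical.

```python
# Minimal sketch of a predictive (KNN-style) router trained on precomputed records.
# Assumptions: `records` is a list of dicts with the RouterBench fields described in
# Section 2; TF-IDF stands in for learned prompt embeddings; lambda_cost is illustrative.
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor


def train_knn_router(records, k=10):
    """Fit one quality regressor per model from (prompt, performance) pairs."""
    by_model = defaultdict(list)
    for r in records:
        by_model[r["model_name"]].append((r["prompt"], r["performance"]))

    vectorizer = TfidfVectorizer(max_features=5000)
    vectorizer.fit([r["prompt"] for r in records])

    regressors = {}
    for model, pairs in by_model.items():
        X = vectorizer.transform([prompt for prompt, _ in pairs])
        y = np.array([perf for _, perf in pairs])
        regressors[model] = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    return vectorizer, regressors


def route(prompt, vectorizer, regressors, model_costs, lambda_cost=0.1):
    """Send the prompt to the model with the best predicted quality minus a cost penalty."""
    x = vectorizer.transform([prompt])
    scores = {m: reg.predict(x)[0] - lambda_cost * model_costs[m]
              for m, reg in regressors.items()}
    return max(scores, key=scores.get)
```

Sweeping `lambda_cost` traces the router's operating points on the cost–quality plane, which is what the evaluation framework in Section 4 aggregates.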
2. Data Collection, Contents, and Structure
RouterBench comprises 405,467 inference records, sampled from seven standard NLP, math, and programming datasets plus an in-house RAG set (eight task sources in total, broken down below). It covers 11 representative LLMs, ranging from open-source models (Llama-70B-chat, Mixtral-8x7B, Yi-34B-chat, Code-Llama-34B, Mistral-7B, WizardLM-13B) to commercial API models (GPT-4, GPT-3.5-turbo, Claude-v1/v2, You.com API, sonar-small/medium-online).
Domains and dataset breakdown:
| Domain | Datasets / Task | Example sizes |
|---|---|---|
| Commonsense Reasoning | HellaSwag, Winogrande, ARC-Challenge | 10k / 40k / 2.5k |
| Knowledge-based QA | MMLU (57 subject splits) | ~3k total |
| Conversation | MT-Bench (GPT-4 judged) | 80 prompts |
| Math | GSM8K | 8k |
| Coding | MBPP | 1k |
| Retrieval-augmented Gen. | In-house RAG queries | 800 |
Metadata per inference record includes:
- `sample_id`: formatted as `{dataset}.{split}.{index}`
- `eval_name`: dataset source tag
- `model_name`: LLM identifier
- `prompt`: full text input
- `model_response`: LLM output text
- `true_label`: gold answer(s)
- `performance`: binary (exact match) or continuous ([0,1] normalized rating)
- `cost`: float, USD estimate per inference
All entries are stored in the JSON Lines format, facilitating streaming access and efficient querying.
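For orientation, a single parsed record might look roughly like the Python dict below; the field values are invented for illustration and only the schema follows the field list above.

```python
# Hypothetical record after json.loads(); values are invented for illustration only.
record = {
    "sample_id": "gsm8k.test.0017",   # {dataset}.{split}.{index}
    "eval_name": "gsm8k",
    "model_name": "gpt-3.5-turbo",
    "prompt": "Natalia sold clips to 48 of her friends in April...",
    "model_response": "... so she sold 72 clips altogether. The answer is 72.",
    "true_label": "72",
    "performance": 1.0,               # exact match for this structured task
    "cost": 0.00113,                  # USD estimate, invented value
}
```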
3. Annotation, Labeling, and Performance Statistics
Ground-truth labels are directly sourced from the underlying benchmark (exact-match for classification/QA, correctness test-suites for code, etc.). Performance scoring varies by domain: exact-match accuracy is used for structured tasks, while normalized GPT-4 ratings are used for conversational, code, and RAG sets.
RouterBench does not store a fixed “best-LLM” label for each example; researchers must compute the oracle choice per record, typically selecting the highest quality (with cost tie-breaking).
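A minimal sketch of that oracle computation is shown below; it assumes the records have already been grouped by `sample_id`, with one record per LLM in each group, and uses the field names from Section 2.

```python
def oracle_choice(records_for_sample):
    """Best model for one sample: highest performance, ties broken by lowest cost.

    `records_for_sample` is assumed to be the list of inference records
    (one per LLM) that share the same sample_id.
    """
    best = max(records_for_sample, key=lambda r: (r["performance"], -r["cost"]))
    return best["model_name"], best["performance"], best["cost"]
```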
Statistical summaries:
- Per-model accuracy across tasks: ranges from ~38% (Claude-instant) to ~83% (GPT-4); standard deviation ≈15–20pp
- Per-model mean cost: from $0.08 (Mistral-7B) to $4.09 (GPT-4); σ≈1.2
- Oracle router: overall mean perf ≈0.96, mean cost ≈$0.30; routinely picks mid-range open-source models on specific tasks
- Cost–quality correlation: ρ≈0.7, but variance is substantial—some cheaper models can match or exceed costly ones on select domains
- Task-wise difficulty: ARC-Challenge yields the lowest oracle accuracy; GSM8K reaches ~75% at notably lower cost
Performance scoring for the non-exact-match domains relies on automated GPT-4 judging, which may introduce bias due to a singular evaluation source.
4. Theoretical Framework: Cost–Quality Analysis and AIQ Metric
RouterBench formalizes router evaluation on the cost–quality plane. For each model $m$, define its mean cost $c_m = \mathbb{E}_{x \in D}[c(\text{LLM}_m(x))]$ and mean quality $q_m = \mathbb{E}_{x \in D}[q(\text{LLM}_m(x))]$ over a dataset $D$. A router $R_\theta : X \rightarrow L$ maps each input to a model in the pool, with induced cost $c_{R_\theta} = \mathbb{E}_{x \in D}[c(R_\theta(x))]$ and quality $q_{R_\theta} = \mathbb{E}_{x \in D}[q(R_\theta(x))]$. Any two routers $R_1, R_2$ can be linearly interpolated with a mixing weight $t \in [0,1]$, tracing intermediate operating points; taking the non-decreasing convex hull (NDCH) over the attainable points yields an extended quality-versus-cost curve $\widetilde{R}(c)$. The Average Improvement in Quality (AIQ) summarizes this curve as

$$\text{AIQ}(R) = \frac{1}{c_{\max} - c_{\min}} \int_{c_{\min}}^{c_{\max}} \widetilde{R}(c) \, \mathrm{d}c,$$

where $\widetilde{R}(c)$ is the (interpolated) quality of router $R$ at cost $c$. Higher AIQ quantifies a strictly superior cost–performance trade-off.
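A numerical sketch of the AIQ computation under these definitions follows. It assumes a router has already been evaluated at several (cost, quality) operating points, for example by sweeping a routing threshold, and it approximates the non-decreasing curve and the integral with simple NumPy operations; the point values in the example are illustrative.

```python
import numpy as np


def aiq(costs, qualities):
    """Approximate AIQ: average quality of the non-decreasing envelope over [c_min, c_max].

    costs/qualities are the (cost, quality) operating points of one router,
    e.g. obtained by sweeping a routing threshold.
    """
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    q = np.asarray(qualities, dtype=float)[order]
    q = np.maximum.accumulate(q)                         # enforce a non-decreasing quality curve
    area = np.sum(0.5 * (q[1:] + q[:-1]) * np.diff(c))   # trapezoidal integral of quality over cost
    return area / (c[-1] - c[0])


# Example: three hypothetical operating points of a single router
print(aiq([0.05, 0.50, 4.00], [0.55, 0.78, 0.83]))
```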
5. Access, Usage, and Repository Details
RouterBench is hosted at https://github.com/withmartian/routerbench. The data is organized into per-dataset JSONL files (≈100 MB each), supplemented by evaluation scripts, routing baselines, and analysis notebooks. The compressed dataset size is ≈800 MB.
Installation prerequisites:
- Python ≥ 3.8, pandas, numpy, scikit-learn, transformers
- Clone the repository and install dependencies via `pip install -r requirements.txt`
Common access pattern:
```python
import json
from itertools import islice

# Stream the first five records from one of the per-dataset JSONL files
with open("data/jsonl/hellaswag.dev.jsonl", "r") as f:
    for line in islice(f, 5):
        rec = json.loads(line)
        print(rec["sample_id"], rec["model_name"], rec["performance"], f"{rec['cost']:.5f}")
```
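Building on the same files, per-model cost–quality summaries of the kind reported in Section 3 can be reproduced with a short pandas aggregation. This is an illustrative sketch rather than a bundled script; the file path matches the example above and the column names follow the record schema.

```python
import pandas as pd

# Load one per-dataset JSONL file and summarize mean quality and cost per model
df = pd.read_json("data/jsonl/hellaswag.dev.jsonl", lines=True)
summary = (
    df.groupby("model_name")[["performance", "cost"]]
      .mean()
      .sort_values("performance", ascending=False)
)
print(summary)
```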
Licensing is Apache 2.0. Users are expected to cite Q.J. Hu et al., “RouterBench: A Benchmark for Multi-LLM Routing System,” ICML 2024.
6. Applications, Limitations, and Context
RouterBench supports a range of investigations:
- Training of predictive routers using precomputed labels
- Benchmarking hybrid or non-predictive routing approaches without incurring further inference cost (see the cascade sketch after this list)
- Comparative study of routing formulations under the AIQ and NDCH theoretical metrics
- Analysis of LLM complementarity and domain specialization
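As an example of the non-predictive case, the sketch below simulates a two-model cascade entirely from precomputed records: the cheap model is always charged, and the query escalates to the strong model when the cheap model's answer is inadequate. Because every record already carries performance and cost, the cascade's operating point can be scored without any new inference. The model names are illustrative stand-ins, and the escalation rule here inspects the stored performance score; a deployed cascade would replace that with a learned or heuristic acceptance check.

```python
# Simulate a two-model cascade from precomputed records (no new inference required).
# Assumptions: `by_sample` maps sample_id -> {model_name: record}; the model names and
# the threshold-based escalation rule are illustrative stand-ins.
def score_cascade(by_sample, cheap="mistral-7b-chat", strong="gpt-4", threshold=0.5):
    total_quality, total_cost, n = 0.0, 0.0, 0
    for records in by_sample.values():
        if cheap not in records or strong not in records:
            continue
        first = records[cheap]
        total_cost += first["cost"]                # the cheap attempt is always paid for
        if first["performance"] >= threshold:      # accept the cheap model's answer
            total_quality += first["performance"]
        else:                                      # escalate: also pay for the strong model
            second = records[strong]
            total_cost += second["cost"]
            total_quality += second["performance"]
        n += 1
    return total_quality / n, total_cost / n       # mean quality and mean cost per query
```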
Key limitations:
- No latency or throughput metrics are included; the dataset quantifies only cost and accuracy
- Dataset focuses on 11 LLMs; coverage must be updated manually as new models emerge
- Tasks are drawn from general benchmarks; niche domains (e.g., low-resource translation) are not addressed
- RAG set is limited to 800 queries; generalizability to broader retrieval systems is uncertain
- Quality signals on some tasks derive from automated GPT-4 judging, which may introduce systematic bias
A plausible implication is that RouterBench enables reproducible, large-scale routing studies but should be augmented or extended by researchers addressing real-time, low-resource, or latency-sensitive domains.
7. Common Misconceptions and Relationship to Precursor Work
RouterBench is distinct from prior approaches that repurpose existing evaluation suites for router training. For example, “LLM Routing with Benchmark Datasets” (Shnitzer et al., 2023) leverages per-instance results from HELM and MixInstruct as training sources for correctness predictors, but does not define or create a packaged dataset named “RouterBench,” nor does it provide standardized splits, metadata, or usage APIs. All evaluation in that work is batched on benchmarks originally designed for other purposes.
In summary, RouterBench constitutes the first rigorously constructed dataset and benchmark specifically targeting the multi-LLM routing problem with cost–quality analysis, unified task coverage, and standardized access protocols.