
RouterBench Benchmark

Updated 11 November 2025
  • RouterBench is a benchmark and dataset designed to assess multi-LLM routing systems by providing over 405k precomputed inference records and standardized evaluation metrics.
  • It covers diverse application domains such as commonsense reasoning, QA, conversation, math, coding, and retrieval-augmented generation with 11 representative LLMs.
  • The framework leverages cost–quality analysis, convex-hull evaluation, and the AIQ metric to compare routing policies without incurring real-time compute costs.

RouterBench is a benchmark and dataset designed to systematically assess multi-LLM routing systems by providing a large pool of precomputed inference outcomes, model metadata, and a formal evaluation framework (Hu et al., 18 Mar 2024). Its primary goal is to catalyze methodological rigor and reproducibility in LLM router research by enabling training and evaluation of routing policies under diverse cost–quality regimes without incurring real-time compute costs. RouterBench should not be confused with approaches that repurpose existing benchmarks for LLM selection; it is a purpose-built artifact with well-defined data structures, representative tasks, and standardized metrics.

1. Scope, Motivation, and Design Principles

The core motivation for RouterBench is the proliferation of LLMs with heterogeneous performance profiles: models differ substantially in cost (e.g., API pricing, inference-time latency) and accuracy (task-specific quality). No single LLM dominates across all real-world tasks, creating demand for serving strategies that dispatch user prompts to the optimal model for each case. RouterBench addresses the previously missing ingredient—a unified benchmark and dataset for rigorous router design, testing, and statistical comparison.

Design goals:

  • To furnish over 405k pre-computed LLM outputs augmented with cost, performance, and ground-truth metadata
  • To span major application domains: commonsense reasoning, knowledge QA, conversation, math, code generation, and retrieval-augmented generation (RAG)
  • To provide a theoretical evaluation framework that enables analytical comparison between routing policies, leveraging cost–quality trade-offs, convex-hull analysis, and the Average Improvement in Quality (AIQ) metric

Intended use cases include:

  • Developing predictive routing models (e.g., K-Nearest Neighbors, MLP, or transformer-based classifiers) that select a model per instance; a minimal sketch follows this list
  • Evaluating non-predictive strategies (e.g., cascades, rule-based selectors)
  • Benchmarking novel routing methodologies against established oracles, baseline static selectors, and theoretical bounds
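
As a concrete illustration of the first use case, the following is a minimal sketch of a KNN-based predictive router trained on RouterBench-style records. It assumes the records have been loaded into a pandas DataFrame with the fields described in Section 2; the embed function is a stand-in for any fixed-dimensional text encoder, and all function names are illustrative rather than part of the RouterBench codebase.

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def embed(texts):
    # Placeholder encoder: a byte-level bag-of-counts vector. In practice this
    # would be replaced by a proper sentence embedding model.
    vecs = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for byte in text.encode("utf-8"):
            vecs[i, byte] += 1.0
    return vecs

def train_knn_router(df, k=16):
    # Label every prompt with the model that achieved the highest quality on it
    # (ties broken by lower cost), then fit a KNN classifier on prompt embeddings.
    ranked = df.sort_values(["performance", "cost"], ascending=[False, True])
    best = ranked.groupby("sample_id", as_index=False).first()
    X = embed(best["prompt"].tolist())
    y = best["model_name"].to_numpy()
    return KNeighborsClassifier(n_neighbors=k).fit(X, y)

def route(router, prompts):
    # Dispatch each new prompt to the model predicted to perform best.
    return router.predict(embed(prompts))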

2. Data Collection, Contents, and Structure

RouterBench comprises 405,467 inference records, sampled from eight standard NLP/math/programming datasets plus an in-house RAG set. It covers 11 representative LLMs, ranging from open-source models (Llama-70B-chat, Mixtral-8x7B, Yi-34B-chat, Code-Llama-34B, Mistral-7B, WizardLM-13B) to proprietary API models (GPT-4, GPT-3.5-turbo, Claude-v1/v2, the You.com API, sonar-small/medium-online).

Domains and dataset breakdown:

Domain | Datasets / Tasks | Example sizes
Commonsense Reasoning | HellaSwag, Winogrande, ARC-Challenge | 10k / 40k / 2.5k
Knowledge-based QA | MMLU (57 subject splits) | ~3k total
Conversation | MT-Bench (GPT-4 judged) | 80 prompts
Math | GSM8K | 8k
Coding | MBPP | 1k
Retrieval-augmented Gen. | In-house RAG queries | 800

Metadata per inference record includes:

  • sample_id: formatted as {dataset}.{split}.{index}
  • eval_name: dataset source tag
  • model_name: LLM identifier
  • prompt: full text input
  • model_response: LLM output text
  • true_label: gold answer(s)
  • performance: binary (exact match) or continuous ([0,1] normalized rating)
  • cost: float, USD estimate per inference

All entries are stored in the JSON Lines format, facilitating streaming access and efficient querying.
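
For illustration, a single line in one of the JSONL files has the following shape (the field names match the schema above; the specific identifiers, score, and cost shown here are invented placeholders rather than actual dataset contents):

{"sample_id": "gsm8k.test.0042", "eval_name": "gsm8k", "model_name": "gpt-3.5-turbo", "prompt": "...", "model_response": "...", "true_label": "18", "performance": 1.0, "cost": 0.00087}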

3. Annotation, Labeling, and Performance Statistics

Ground-truth labels are directly sourced from the underlying benchmark (exact-match for classification/QA, correctness test-suites for code, etc.). Performance scoring varies by domain: exact-match accuracy is used for structured tasks, while normalized GPT-4 ratings are used for conversational, code, and RAG sets.

RouterBench does not store a fixed “best-LLM” label for each example; researchers must compute the oracle choice per record, typically by selecting the highest-quality response and breaking ties on cost, as sketched below.
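
A minimal sketch of this oracle computation, assuming the records have been concatenated into a pandas DataFrame df with the fields listed in Section 2 (the function name and loading step are illustrative, not part of the released tooling):

import pandas as pd

def oracle_choices(df):
    # For every sample, keep the record with the highest quality score,
    # breaking ties by choosing the cheaper model.
    ranked = df.sort_values(["sample_id", "performance", "cost"],
                            ascending=[True, False, True])
    return ranked.groupby("sample_id", as_index=False).first()

# Example: mean quality and mean cost achieved by the oracle router.
# oracle = oracle_choices(df)
# print(oracle["performance"].mean(), oracle["cost"].mean())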

Statistical summaries (a sketch for recomputing such aggregates from the raw records follows this list):

  • Per-model accuracy across tasks: ranges from ~38% (Claude-instant) to ~83% (GPT-4); standard deviation ≈ 15–20 pp
  • Per-model mean cost: from $0.08 (Mistral-7B) to $4.09 (GPT-4); σ ≈ 1.2 USD
  • Oracle router: overall mean performance ≈ 0.96, mean cost ≈ $0.30; it routinely picks mid-range open-source models on specific tasks
  • Cost–quality correlation: ρ ≈ 0.7, but variance is substantial; some cheaper models match or exceed costly ones on select domains
  • Task-wise difficulty: ARC-Challenge yields the lowest oracle accuracy, while GSM8K sits at ~75% with notably lower cost
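
A minimal sketch for recomputing these per-model aggregates and the cost–quality correlation, again assuming a pandas DataFrame df built from the JSONL files (column names follow Section 2; the function names are illustrative):

import pandas as pd

def per_model_summary(df):
    # Mean quality and mean cost for each of the 11 LLMs.
    return df.groupby("model_name")[["performance", "cost"]].mean()

def cost_quality_correlation(df):
    # Pearson correlation between per-model mean cost and per-model mean quality.
    summary = per_model_summary(df)
    return summary["cost"].corr(summary["performance"])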

Performance scoring for the non-exact-match domains relies on automated GPT-4 judging, which may introduce systematic bias because a single model serves as the sole evaluation source.

4. Theoretical Framework: Cost–Quality Analysis and AIQ Metric

RouterBench formalizes router evaluation on the cost–quality plane. Let $c_m = \mathbb{E}_{x \in D}[c(\text{LLM}_m(x))]$ denote the expected cost and $q_m = \mathbb{E}_{x \in D}[q(\text{LLM}_m(x))]$ the expected quality of LLM $m$.

Routers $R_\theta : X \rightarrow L$ are scored by expected cost $c_{R_\theta} = \mathbb{E}_{x \in D}[c(R_\theta(x))]$ and quality $q_{R_\theta} = \mathbb{E}_{x \in D}[q(R_\theta(x))]$. Mixing two routers $R_1, R_2$ with weight $t \in [0,1]$ yields interpolated policies whose cost and quality are the corresponding convex combinations.

The non-decreasing convex hull (NDCH) is constructed over cost–quality points to define the Pareto envelope. The Average Improvement in Quality (AIQ) is computed as

$$\text{AIQ}(R) = \frac{1}{c_{\max} - c_{\min}} \int_{c_{\min}}^{c_{\max}} \widetilde{R}(c) \, \mathrm{d}c$$

where $\widetilde{R}(c)$ is the NDCH-interpolated quality at cost $c$. Higher AIQ indicates a strictly superior cost–quality trade-off.
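
The NDCH construction and the AIQ integral can be approximated numerically; the sketch below summarizes each router or model by its (mean cost, mean quality) point and uses trapezoidal quadrature over a linear interpolation of the hull (function names and the grid resolution are illustrative choices, not the paper's reference implementation):

import numpy as np

def ndch(points):
    # points: iterable of (cost, quality) pairs, one per router or model.
    # Returns the vertices of the non-decreasing convex hull: the concave,
    # non-decreasing upper boundary of the point set in the cost-quality plane.
    pts = sorted(set(map(tuple, points)))
    frontier, best_q = [], float("-inf")
    for c, q in pts:
        if q <= best_q:
            continue                      # dominated: no quality gain for extra cost
        if frontier and frontier[-1][0] == c:
            frontier[-1] = (c, q)         # same cost, better quality: replace
        else:
            frontier.append((c, q))
        best_q = q
    hull = []
    for p in frontier:
        # Pop the last vertex while it lies on or below the chord to p,
        # which keeps the upper envelope concave.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def aiq(points, grid_size=256):
    # Average Improvement in Quality: mean NDCH-interpolated quality over the
    # observed cost range [c_min, c_max], approximated by trapezoidal quadrature.
    hull = ndch(points)
    xs = [c for c, _ in hull]
    ys = [q for _, q in hull]
    c_min = min(c for c, _ in points)
    c_max = max(c for c, _ in points)
    grid = np.linspace(c_min, c_max, grid_size)
    interp = np.interp(grid, xs, ys)      # flat extrapolation outside the hull
    return np.trapz(interp, grid) / (c_max - c_min)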

5. Access, Usage, and Repository Details

RouterBench is hosted at https://github.com/withmartian/routerbench. The data is organized into per-dataset JSONL files (≈100 MB each), supplemented by evaluation scripts, routing baselines, and analysis notebooks. The compressed dataset size is ≈800 MB.

Installation prerequisites:

  • Python ≥ 3.8, pandas, numpy, scikit-learn, transformers
  • Clone the repository and install dependencies via pip install -r requirements.txt

Common access pattern:

import json
from itertools import islice

# Stream the first few records from one of the per-dataset JSONL files.
with open("data/jsonl/hellaswag.dev.jsonl", "r") as f:
    for line in islice(f, 5):
        rec = json.loads(line)
        print(rec["sample_id"], rec["model_name"], rec["performance"], f"{rec['cost']:.5f}")

Licensing is Apache 2.0. Users are expected to cite Q.J. Hu et al., “RouterBench: A Benchmark for Multi-LLM Routing System,” ICML 2024.

6. Applications, Limitations, and Context

RouterBench supports a range of investigations:

  • Training of predictive routers using precomputed labels
  • Benchmarking hybrid or non-predictive routing approaches without incurring further inference cost
  • Comparative study of routing formulations under the AIQ and NDCH theoretical metrics
  • Analysis of LLM complementarity and domain specialization

Key limitations:

  • No latency or throughput metrics are included; the dataset quantifies only cost and accuracy
  • Dataset focuses on 11 LLMs; coverage must be updated manually as new models emerge
  • Tasks are drawn from general benchmarks; niche domains (e.g., low-resource translation) are not addressed
  • RAG set is limited to 800 queries; generalizability to broader retrieval systems is uncertain
  • Quality signals on some tasks derive from automated GPT-4 judging, which may introduce systematic bias

A plausible implication is that RouterBench enables reproducible, large-scale routing studies but should be augmented or extended by researchers addressing real-time, low-resource, or latency-sensitive domains.

7. Common Misconceptions and Relationship to Precursor Work

RouterBench is distinct from prior approaches that repurpose existing evaluation suites for router training. For example, “LLM Routing with Benchmark Datasets” (Shnitzer et al., 2023) leverages per-instance results from HELM and MixInstruct as training sources for correctness predictors, but does not define or create a packaged dataset named “RouterBench,” nor does it provide standardized splits, metadata, or usage APIs. All evaluation in that work is batched on benchmarks originally designed for other purposes.

In summary, RouterBench constitutes the first rigorously constructed dataset and benchmark specifically targeting the multi-LLM routing problem with cost–quality analysis, unified task coverage, and standardized access protocols.
