LR-bench: LLM Routing & Reviewer Matching

Updated 3 February 2026
  • LR-bench is a dual benchmark framework that evaluates LLM routing algorithms and reviewer-paper expertise ranking with high-fidelity, large-scale datasets.
  • LLMRouterBench rigorously tests routing policies by assessing model complementarity, performance–cost trade-offs, and key metrics like accuracy and latency.
  • LR-bench (RATE) employs expert self-ratings and dual-view training to fine-tune reviewer assignment, ensuring reliable and scalable peer review evaluation.

LR-bench designates two recent datasets for benchmarking in large-scale AI systems: (1) LLMRouterBench, a comprehensive testbed for LLM routing, and (2) LR-bench (RATE), a high-fidelity benchmark for reviewer-paper expertise ranking in peer review. The two are prominent references for evaluating routing algorithms in LLM ensembles and fine-grained reviewer assignment frameworks, respectively. This entry reviews each benchmark's technical design, key evaluation paradigms, empirical findings, and implications for deployment.

1. Definitions and Scope

“LR-bench” is an ambiguous identifier encompassing two prominent benchmarks:

  • LLMRouterBench (LR-bench in the LLM routing literature): A unified, large-scale evaluation framework for automated LLM-to-query assignment, supporting both performance maximization and accuracy-cost trade-off paradigms in model routing (Li et al., 12 Jan 2026).
  • LR-bench (RATE peer review context): A high-coverage dataset for reviewer-paper expertise ranking, constructed from 2024–2025 arXiv AI/NLP corpora, with annotation via direct expert self-ratings, supporting objective assessment of reviewer assignment models (Liu et al., 27 Jan 2026).

Both benchmarks emphasize up-to-date, high-quantity, and high-fidelity ground truth for their respective domains.

2. LLMRouterBench: Unified Benchmark for Model Routing

LLMRouterBench (“LR-bench”) is primarily designed to rigorously evaluate LLM routing algorithms tasked with assigning each query to the most suitable model in an ensemble. The central goal is to empirically quantify model complementarity, routing strategies, and performance–cost trade-offs on a massive, modern task suite (Li et al., 12 Jan 2026).

2.1 Problem Setting and Formalism

The framework formalizes routing as a mapping from a query set $Q = \{q_1, \ldots, q_N\}$ to a model ensemble $M = \{m_1, \ldots, m_K\}$. Model performance is quantified by $s(m, q) \in \{0, 1\}$ (or $[0, 1]$), and $c(m) > 0$ specifies cost:

  • Performance-oriented routing seeks $a^* = \arg\max_{a: Q \to M} \mathrm{Acc}(a)$, where $\mathrm{Acc}(a) = \frac{1}{N} \sum_{q \in Q} s(a(q), q)$.
  • Performance–cost trade-off routing either maximizes accuracy under a budget constraint ($\sum_q c(a(q)) \leq B$) or maximizes $\mathrm{Acc}(a) - \lambda \cdot \mathrm{Cost}(a)$.

Router methods are thus evaluated as policies $a: Q \rightarrow M$ parameterized over a hyperparameter space $\Theta$ (e.g., via cost scaling or confidence thresholds).
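To make the formalism concrete, the sketch below scores a toy routing policy under both paradigms. The score matrix, costs, and $\lambda$ value are invented for illustration and are not LR-bench data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 4                                       # queries, models
s = rng.integers(0, 2, size=(K, N)).astype(float)   # s(m, q) in {0, 1}
c = np.array([1.0, 2.0, 4.0, 8.0])                  # per-call cost c(m)

def accuracy(assign):
    """Acc(a) = (1/N) * sum_q s(a(q), q)."""
    return s[assign, np.arange(N)].mean()

def total_cost(assign):
    """Cost(a) = sum_q c(a(q))."""
    return c[assign].sum()

# Hindsight Oracle: the best model for each individual query.
oracle = s.argmax(axis=0)

# Trade-off routing, decomposed per query: each query picks the model
# maximizing s(m, q)/N - lambda * c(m), which optimizes Acc(a) - lambda * Cost(a).
lam = 0.001
policy = (s / N - lam * c[:, None]).argmax(axis=0)

print(f"oracle acc = {accuracy(oracle):.3f}")
print(f"policy acc = {accuracy(policy):.3f}, cost = {total_cost(policy):.1f}")
```

Because the objective decomposes over queries, the trade-off policy here needs no search over assignments; real routers must instead *predict* $s(m, q)$ before seeing the answer.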

2.2 Dataset and Model Coverage

LLMRouterBench spans:

  • 21 datasets, 23,945 prompts, ~1.8 billion tokens, partitioned for both performance-oriented and cost-aware evaluation.
  • 33 model endpoints: 20 open-source models (~7B parameters) and 13 proprietary/flagship models (e.g., GPT-5, Gemini-2.5-Pro), varying in size, latency, and token pricing.
  • Domains encompass mathematics, code, logic, knowledge-intensive tasks, affective reasoning, instruction following, and tool use.

Careful dataset curation excludes tasks that are saturated for top models or too difficult for lightweight models, ensuring meaningful variance across the ensemble.
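One way to operationalize this curation criterion is to drop tasks whose per-model scores show no spread. A minimal sketch follows; the thresholds and score vectors are illustrative, not taken from the benchmark:

```python
import numpy as np

def keep_task(scores, lo=0.05, hi=0.95):
    """scores: per-model accuracy on one task. Drop tasks that are
    saturated (every model near-perfect) or hopeless (every model fails)."""
    return not (scores.min() > hi or scores.max() < lo)

tasks = {
    "saturated":  np.array([0.97, 0.99, 0.96]),
    "hopeless":   np.array([0.01, 0.02, 0.00]),
    "meaningful": np.array([0.35, 0.80, 0.55]),
}
kept = [name for name, sc in tasks.items() if keep_task(sc)]
print(kept)  # only the task with meaningful variance survives
```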

2.3 Metrics

Evaluation metrics address both average-case and Pareto-optimal routing:

| Metric | Definition/Usage |
| --- | --- |
| $\mathrm{AvgAcc}(a)$ | Average per-dataset accuracy, $\frac{1}{|D|}\sum_{d}\mathrm{Acc}(a,d)$ |
| Gain@$b$ | Relative improvement over baseline $b$, e.g., Gain@$\mathcal{B}$ for the best single model |
| Gap@Oracle | Accuracy gap to the perfect hindsight router $\mathcal{O}$ |
| Pareto metrics | Performance–cost Pareto frontier, cost savings at fixed accuracy, and $\mathrm{ParetoDist}$ for frontier proximity |

Reference routers for normalization include random choice, best single-model, and the Oracle policy (best per-query assignment).
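The two normalization metrics are simple to compute once router, baseline, and Oracle accuracies are known. In this sketch Gain@$b$ is taken as *relative* improvement (an assumption about the exact definition), and the numeric values are illustrative, in the spirit of the findings reported in Section 3.2 rather than exact LR-bench figures:

```python
def gain_at_baseline(acc_router, acc_baseline):
    """Gain@b: relative improvement of a router over baseline b."""
    return (acc_router - acc_baseline) / acc_baseline

def gap_at_oracle(acc_router, acc_oracle):
    """Gap@Oracle: absolute accuracy gap to the hindsight-optimal router."""
    return acc_oracle - acc_router

# Illustrative numbers only:
acc_router, acc_best_single, acc_oracle = 0.71, 0.65, 0.92
print(f"Gain@BestSingle = {gain_at_baseline(acc_router, acc_best_single):+.2%}")
print(f"Gap@Oracle      = {gap_at_oracle(acc_router, acc_oracle):.2%}")
```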

3. LLMRouterBench Baseline Methods and Empirical Findings

3.1 Routing Baselines

Ten routers are integrated:

  • Performance-only: RouterDC (contrastive classifier), EmbedLLM (embedding nearest neighbor), MODEL-SAT (capability-instruction LLM), GraphRouter (graph-based GNN), Avengers (k-means cluster-to-model).
  • Performance–cost: HybridLLM (difficulty predictor), FrugalGPT (cascaded), RouteLLM (matrix-factorized pairwise wins), Avengers-Pro (cost-weighted clusters), OpenRouter (commercial black-box).

Simple unsupervised methods without neural training (e.g., Avengers) match the best performance of the neural/graph baselines.
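A cluster-to-model router of this unsupervised kind can be sketched in a few lines. The following is a toy illustration in the spirit of the Avengers baseline, not its actual implementation: plain Lloyd's k-means over stand-in query embeddings, with each cluster mapped to the model that scores best inside it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # stand-in query embeddings
scores = rng.random((200, 3))        # per-query scores for 3 models
k = 5

# Plain Lloyd's k-means (enough for a sketch; production code would use a library).
centroids = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    for c in range(k):
        if (labels == c).any():
            centroids[c] = X[labels == c].mean(axis=0)

# Map each cluster to the model with the best mean score inside it.
cluster_to_model = np.array(
    [scores[labels == c].mean(axis=0).argmax() if (labels == c).any() else 0
     for c in range(k)]
)

def route(x):
    """Assign query embedding x to its nearest cluster's designated model."""
    c = np.linalg.norm(centroids - x, axis=1).argmin()
    return int(cluster_to_model[c])

print(route(rng.normal(size=8)))
```

Nothing here requires gradient training, which is why such methods are cheap to refresh when the model pool changes.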

3.2 Oracle Gap and Analysis

Significant findings include:

  • No single model dominates all domains; model complementarity is pronounced (Mathematics, Code, Logic, Affective each led by different models).
  • Routers cluster in performance: Sophisticated and naïve routers yield near-identical accuracy under unified evaluation.
  • Large, persistent Oracle gap (~20%): top routers reach AvgAcc ≈ 70–72% vs. the Oracle's ≈ 92%.
  • Recall on rare-expert queries is low: When only 1–3 models are correct, router recall drops to ~24%.
  • Ensemble size has diminishing returns: Oracle gains plateau beyond ~10 models.
  • Latency and cost variance: models with similar cost/accuracy may differ by 5× in real-world latency.
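The rare-expert recall figure above can be made concrete. The sketch below assumes a natural reading of the metric (fraction of queries solvable by only 1–3 models on which the router picks a correct one); the score matrix and router choices are random toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
s = (rng.random((33, 500)) < 0.08).astype(int)  # s(m, q): 33 models, 500 queries
router_choice = rng.integers(0, 33, 500)        # some router's assignments

solvers = s.sum(axis=0)                         # how many models answer q correctly
rare = (solvers >= 1) & (solvers <= 3)          # "rare-expert" queries
hit = s[router_choice, np.arange(500)] == 1     # router picked a correct model

rare_recall = hit[rare].mean() if rare.any() else float("nan")
print(f"{rare.sum()} rare-expert queries, recall = {rare_recall:.2%}")
```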

3.3 Deployment Implications

  • Simple clustering or non-parametric methods suffice for many production settings.
  • Improving per-query recall, especially in “specialist” cases, is the main avenue for closing the Oracle gap.
  • Model pool curation is critical—carefully chosen small ensembles outperform large, undifferentiated ones.

All code, data, and adapter implementations are openly available (Li et al., 12 Jan 2026).

4. LR-bench for Reviewer-Paper Assignment (RATE Context)

LR-bench, in the context of peer review, addresses reviewer-paper expertise ranking using modern, large-scale, and behaviorally anchored human labels (Liu et al., 27 Jan 2026).

4.1 Dataset Construction

  • Source corpus: 2024–2025 arXiv manuscripts in cs.AI, cs.CL, cs.CV, cs.IR, cs.LG; 161,228 unique papers, 513,877 disambiguated authors.
  • Labeling: 1,055 expert self-assessed ratings on a behaviorally anchored (BARS) 1–5 scale, collected by survey from 407 respondents, capped at 6 queries per reviewer.
  • Preference sets: Pointwise and triplet forms, supporting both pairwise and listwise evaluation.

Held-out test papers are excluded from all training-derived data; reviewer overlap between splits is permitted.

4.2 Data Schema and Access

Each example is a JSON object with fields:

  • paper_id, title, abstract
  • reviewer_id (anonymized)
  • reviewer_profile (keyword-based, two-year pub list)
  • rating (integer 1–5)
  • Optional: paper_citation_ids

Distributed as JSONL, validated by schema; loadable out-of-the-box via Hugging Face (Gnociew/LR-bench).
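Given the field list above, a record can be checked with a few lines of standard-library code. The sample record is invented; only the field names and the rating range come from the schema described here:

```python
import json

REQUIRED = {"paper_id", "title", "abstract", "reviewer_id",
            "reviewer_profile", "rating"}

def validate(line: str) -> dict:
    """Parse one JSONL line and check the LR-bench-style schema."""
    rec = json.loads(line)
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not (isinstance(rec["rating"], int) and 1 <= rec["rating"] <= 5):
        raise ValueError("rating must be an integer in 1..5")
    return rec

sample = json.dumps({
    "paper_id": "2501.00001", "title": "A Toy Paper",
    "abstract": "...", "reviewer_id": "r_0042",
    "reviewer_profile": "retrieval; ranking; peer review", "rating": 4,
})
print(validate(sample)["rating"])  # -> 4
```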

4.3 Evaluation Protocol

The core evaluation metrics are:

| Metric | Definition |
| --- | --- |
| Normalized loss ($\mathcal{L}$) | Fraction of incorrectly ordered preference pairs, normalized by $|\epsilon_x - \epsilon_y|$ |
| Precision@k | Fraction of correctly ordered pairs/lists at $k$ |
| Human win rate | Experts compare top-3 reviewer lists from competing methods |
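The first two metrics can be sketched as follows. The exact normalization in the paper may differ; here each misordered pair is weighted by the rating gap $|\epsilon_x - \epsilon_y|$, and Precision@k is taken as overlap between the top-$k$ lists by score and by rating (both assumptions):

```python
from itertools import combinations

def normalized_loss(scores, ratings):
    """Rating-gap-weighted fraction of misordered preference pairs."""
    num = den = 0.0
    for i, j in combinations(range(len(ratings)), 2):
        gap = abs(ratings[i] - ratings[j])
        if gap == 0:
            continue
        den += gap
        # A pair is misordered if score order disagrees with rating order.
        if (scores[i] - scores[j]) * (ratings[i] - ratings[j]) < 0:
            num += gap
    return num / den if den else 0.0

def precision_at_k(scores, ratings, k=3):
    """Overlap between the top-k reviewers by score and by expert rating."""
    top_s = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    top_r = sorted(range(len(ratings)), key=lambda i: -ratings[i])[:k]
    return len(set(top_s) & set(top_r)) / k

scores  = [0.9, 0.2, 0.7, 0.4]
ratings = [5,   1,   4,   2]
print(normalized_loss(scores, ratings))    # perfectly ordered -> 0.0
print(precision_at_k(scores, ratings, 2))  # -> 1.0
```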

5. Methodology and Key Results (RATE Framework)

5.1 Reviewer Profiling and Annotation-free Learning

  • Reviewer profiles: For each reviewer, aggregate keywords from two-year publications using LLM (Qwen3-Max, GLM-4.6); serialize as a natural-language description.
  • Dual-view preference supervision: Construct training triplets both paper-anchored (best/worst reviewer) and reviewer-anchored (best/worst paper), based on sparse BM25 retrieval scores.
  • Fine-tuning: Embedding model (e.g., Qwen3-Embedding-8B) fine-tuned via LoRA on these triplets, using cosine similarity as the score.

The dual-view approach yields significant gains (+2.15% precision over best single-view).
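The dual-view triplet construction can be shown in miniature. RATE derives preferences from BM25 retrieval scores; here a simple token-overlap score stands in for BM25 so the sketch stays dependency-free, and the papers and reviewer profiles are invented:

```python
def overlap(a: str, b: str) -> float:
    """Jaccard overlap of token sets -- a stand-in for a BM25 score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

papers = {"p1": "sparse retrieval ranking", "p2": "image diffusion models"}
reviewers = {"r1": "retrieval ranking evaluation",
             "r2": "diffusion generative images"}

# Paper-anchored view: (anchor=paper, positive=best reviewer, negative=worst).
paper_view = []
for pid, ptxt in papers.items():
    ranked = sorted(reviewers, key=lambda r: overlap(ptxt, reviewers[r]),
                    reverse=True)
    paper_view.append((pid, ranked[0], ranked[-1]))

# Reviewer-anchored view: (anchor=reviewer, positive=best paper, negative=worst).
reviewer_view = []
for rid, rtxt in reviewers.items():
    ranked = sorted(papers, key=lambda p: overlap(rtxt, papers[p]),
                    reverse=True)
    reviewer_view.append((rid, ranked[0], ranked[-1]))

print(paper_view)     # [('p1', 'r1', 'r2'), ('p2', 'r2', 'r1')]
print(reviewer_view)  # [('r1', 'p1', 'p2'), ('r2', 'p2', 'p1')]
```

Both triplet sets would then feed the LoRA fine-tuning stage, with cosine similarity between anchor and candidate embeddings as the score.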

5.2 Empirical Performance

| Algorithm | Avg. Loss ↓ | Avg. Precision ↑ |
| --- | --- | --- |
| TPMS (TF-IDF baseline) | 0.2597 | 71.49% |
| SciNCL (max pooling) | 0.2203 | 73.64% |
| SPECTER2 PRX (max pooling) | 0.2074 | 75.17% |
| GLM-4.6 + RATE-8B | 0.1965 | 76.78% |
| Qwen3-Max + RATE-8B | 0.1904 | 77.41% |

Human win rate for the best method is 35–44% against strong baselines, validating listwise improvements.

5.3 Pipeline Integration

A full pipeline for live reviewer assignment, able to use live reviewer publication lists and match new submissions by direct embedding comparisons, is supported via the released codebase.
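At inference time, such matching reduces to cosine similarity between a submission's embedding and precomputed reviewer embeddings. A minimal sketch, with random vectors standing in for a fine-tuned encoder's output:

```python
import numpy as np

rng = np.random.default_rng(0)
reviewer_emb = rng.normal(size=(100, 64))  # precomputed reviewer-profile embeddings
paper_emb = rng.normal(size=64)            # new submission's embedding

def top_k_reviewers(paper, reviewers, k=3):
    """Rank reviewers by cosine similarity to the paper embedding."""
    r = reviewers / np.linalg.norm(reviewers, axis=1, keepdims=True)
    p = paper / np.linalg.norm(paper)
    return np.argsort(-(r @ p))[:k]

print(top_k_reviewers(paper_emb, reviewer_emb))  # indices of best-matching reviewers
```

Because reviewer embeddings can be cached, each new submission costs one encoder pass plus a matrix-vector product.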

6. Implications and Limitations

6.1 For LLM Routing

  • Main bottleneck is in the per-query decision mechanism, not model/embedding backbone.
  • Automated, large-scale, and latency-aware router evaluation is now feasible across diverse LLM pools.

6.2 For Peer Review Assignment

  • LR-bench (RATE) uniquely offers contemporary, expert-graded ground-truth spanning the LLM era.
  • Accurate profiling and embedding fine-tuning with dual-view training yields state-of-the-art reviewer–paper matching.
  • Reviewer-centric annotation-free frameworks efficiently scale as topical drift accelerates.

6.3 Commonalities

Both LR-bench frameworks exemplify modern benchmark development—transparent, broad coverage, reproducible methodology, and focus on operational deployment. They provide high-confidence reference points for both model selection and critical workflow automation in AI research.

7. Availability and Research Directions

Future research directions for both domains prioritize (i) improved recall of rare/single-expert cases, (ii) deeper usage of uncertainty and difficulty estimation, (iii) latency- and cost-aware deployment, and (iv) architecture and training that robustly generalize to new domains and query modalities.
