LR-bench: LLM Routing & Reviewer Matching
- LR-bench is a dual benchmark framework that evaluates LLM routing algorithms and reviewer-paper expertise ranking with high-fidelity, large-scale datasets.
- LLMRouterBench rigorously tests routing policies by assessing model complementarity, performance–cost trade-offs, and key metrics like accuracy and latency.
- LR-bench (RATE) employs expert self-ratings and dual-view training to fine-tune reviewer assignment, ensuring reliable and scalable peer review evaluation.
LR-bench designates two recent datasets for benchmarking large-scale AI systems: (1) LLMRouterBench, a comprehensive testbed for LLM routing, and (2) LR-bench (RATE), a high-fidelity benchmark for reviewer-paper expertise ranking in peer review. The two serve, respectively, as prominent references for evaluating routing algorithms in LLM ensembles and for fine-grained reviewer assignment frameworks. This entry reviews each benchmark's technical design, key evaluation paradigms, empirical findings, and implications for deployment.
1. Definitions and Scope
“LR-bench” is an ambiguous identifier encompassing two prominent benchmarks:
- LLMRouterBench (LR-bench in the LLM routing literature): A unified, large-scale evaluation framework for automated LLM-to-query assignment, supporting both performance maximization and accuracy-cost trade-off paradigms in model routing (Li et al., 12 Jan 2026).
- LR-bench (RATE peer review context): A high-coverage dataset for reviewer-paper expertise ranking, constructed from 2024–2025 arXiv AI/NLP corpora, with annotation via direct expert self-ratings, supporting objective assessment of reviewer assignment models (Liu et al., 27 Jan 2026).
Both benchmarks emphasize up-to-date, high-quantity, and high-fidelity ground truth for their respective domains.
2. LLMRouterBench: Unified Benchmark for Model Routing
LLMRouterBench (“LR-bench”) is primarily designed to rigorously evaluate LLM routing algorithms tasked with assigning each query to the most suitable model in an ensemble. The central goal is to empirically quantify model complementarity, routing strategies, and performance–cost trade-offs on a massive, modern task suite (Li et al., 12 Jan 2026).
2.1 Problem Setting and Formalism
The framework formalizes routing as a mapping R: Q → M from a query set Q = {q_1, …, q_n} to a model ensemble M = {m_1, …, m_K}. Model performance is quantified by a score s(q, m) ∈ [0, 1] (or {0, 1} for exact-match tasks), and c(q, m) specifies cost:
- Performance-oriented routing seeks R* = argmax_R Σ_q s(q, R(q)) over routers R: Q → M.
- Performance–cost trade-off routing either enforces a constrained budget (maximize accuracy under Σ_q c(q, R(q)) ≤ B) or maximizes the regularized objective Σ_q [s(q, R(q)) − λ · c(q, R(q))].
Router methods are thus evaluated as policies parameterized over a hyperparameter space (e.g., via cost scaling, confidence thresholds).
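The two objectives above can be made concrete with a toy score/cost matrix. This is a minimal sketch: the matrices `s` and `c` are invented numbers standing in for the benchmark's per-query scores and costs, not data from LLMRouterBench.

```python
import numpy as np

# Toy example: 4 queries routed over a 3-model ensemble.
# s[i, j] = quality score of model j on query i; c[i, j] = its cost.
s = np.array([[0.9, 0.4, 0.7],
              [0.2, 0.8, 0.6],
              [0.5, 0.5, 0.9],
              [0.7, 0.3, 0.4]])
c = np.array([[3.0, 1.0, 2.0]] * 4)  # per-query cost of each model

# Performance-oriented routing: pick the best-scoring model per query.
perf_route = s.argmax(axis=1)
perf_acc = s[np.arange(len(s)), perf_route].mean()

# Performance-cost trade-off: maximize s - lambda * c per query.
lam = 0.1
tradeoff_route = (s - lam * c).argmax(axis=1)
tradeoff_cost = c[np.arange(len(c)), tradeoff_route].sum()
```

With a larger `lam`, the trade-off router drifts toward the cheap model; the constrained-budget variant would instead solve an assignment problem under Σ c ≤ B.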
2.2 Dataset and Model Coverage
LLMRouterBench spans:
- 21 datasets, 23,945 prompts, ~1.8 billion tokens, partitioned for both performance-oriented and cost-aware evaluation.
- 33 model endpoints: 20 open-source models (7B parameters), 13 proprietary/flagship models (GPT-5, Gemini-2.5-Pro), with variation in size, latency, and token pricing.
- Domains encompass mathematics, code, logic, knowledge-intensive tasks, affective reasoning, instruction following, and tool use.
Careful dataset curation excludes tasks that are saturated for top models or too difficult for lightweight models, ensuring meaningful variance across the ensemble.
2.3 Metrics
Evaluation metrics address both average-case and Pareto-optimal routing:
| Metric | Definition/Usage |
|---|---|
| Accuracy | Average per-dataset accuracy |
| Gain@b | Relative improvement over a baseline b, e.g., the best single model |
| Gap@Oracle | Accuracy gap to the perfect hindsight (Oracle) router |
| Performance–cost | Pareto frontier, cost savings at fixed accuracy, and proximity to the frontier |
Reference routers for normalization include random choice, best single-model, and the Oracle policy (best per-query assignment).
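These normalizations can be sketched directly from a per-query accuracy matrix. The matrix and the router's policy below are invented illustrations; only the metric definitions follow the text above.

```python
import numpy as np

# Hypothetical per-query correctness (rows: queries, cols: models).
acc = np.array([[1, 0, 0],
                [0, 1, 0],
                [1, 1, 0],
                [0, 0, 1]], dtype=float)

best_single = acc.mean(axis=0).max()   # best single-model average accuracy
oracle = acc.max(axis=1).mean()        # perfect hindsight (Oracle) router

# A router's chosen model per query (an invented policy output).
route = np.array([0, 1, 0, 0])
router_acc = acc[np.arange(len(acc)), route].mean()

gain_at_b = (router_acc - best_single) / best_single  # Gain@b vs. best single model
gap_at_oracle = oracle - router_acc                   # Gap@Oracle
```

Here the router recovers 75% accuracy against a 50% best-single-model baseline and a 100% Oracle, i.e., Gain@b = 0.5 and Gap@Oracle = 0.25.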
3. LLMRouterBench Baseline Methods and Empirical Findings
3.1 Routing Baselines
Ten routers are integrated:
- Performance-only: RouterDC (contrastive classifier), EmbedLLM (embedding nearest neighbor), MODEL-SAT (capability-instruction LLM), GraphRouter (graph-based GNN), Avengers (k-means cluster-to-model).
- Performance–cost: HybridLLM (difficulty predictor), FrugalGPT (cascaded), RouteLLM (matrix-factorized pairwise wins), Avengers-Pro (cost-weighted clusters), OpenRouter (commercial black-box).
Simple unsupervised methods without neural training (e.g., Avengers) match the best performance of the neural/graph baselines.
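A cluster-to-model router of this kind can be sketched in a few lines. This is an illustrative reimplementation in the spirit of Avengers with toy data and a hand-rolled k-means, not the released code.

```python
import numpy as np

# Toy training data: query embeddings and per-model correctness labels.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(100, 8))
train_acc = rng.integers(0, 2, size=(100, 3)).astype(float)

# Simple k-means over query embeddings (a few iterations suffice here).
k = 4
centroids = train_emb[:k].copy()
for _ in range(10):
    assign = np.linalg.norm(train_emb[:, None] - centroids[None], axis=2).argmin(axis=1)
    for j in range(k):
        if (assign == j).any():
            centroids[j] = train_emb[assign == j].mean(axis=0)

# Map each cluster to the model with the best training accuracy on it.
cluster_to_model = np.array([
    int(train_acc[assign == j].mean(axis=0).argmax()) if (assign == j).any() else 0
    for j in range(k)
])

def route(query_emb):
    """Route a new query to the model of its nearest cluster centroid."""
    c = np.linalg.norm(centroids - query_emb, axis=1).argmin()
    return int(cluster_to_model[c])
```

No neural training is involved: the only fitted state is the centroids and the cluster-to-model table, which is what makes this family of routers cheap to deploy.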
3.2 Oracle Gap and Analysis
Significant findings include:
- No single model dominates all domains; model complementarity is pronounced (Mathematics, Code, Logic, Affective each led by different models).
- Routers cluster in performance: Sophisticated and naïve routers yield near-identical accuracy under unified evaluation.
- Large, persistent Oracle gap (~20%): top routers reach 70–72% accuracy versus 92% for the Oracle.
- Recall on rare-expert queries is low: When only 1–3 models are correct, router recall drops to ~24%.
- Ensemble size has diminishing returns: Oracle gains plateau beyond 10 models.
- Latency and cost variance: Models with similar cost/accuracy may differ ~5× in real-world latency.
3.3 Deployment Implications
- Simple clustering or non-parametric methods suffice for many production settings.
- Improving per-query recall, especially in “specialist” cases, is the main avenue for closing the Oracle gap.
- Model pool curation is critical—carefully chosen small ensembles outperform large, undifferentiated ones.
All code, data, and adapter implementations are openly available (Li et al., 12 Jan 2026).
4. LR-bench for Reviewer-Paper Assignment (RATE Context)
LR-bench, in the context of peer review, addresses reviewer-paper expertise ranking using modern, large-scale, and behaviorally anchored human labels (Liu et al., 27 Jan 2026).
4.1 Dataset Construction
- Source corpus: 2024–2025 arXiv manuscripts in cs.AI, cs.CL, cs.CV, cs.IR, cs.LG; 161,228 unique papers, 513,877 disambiguated authors.
- Labeling: 1,055 expert self-assessed ratings on a behaviorally anchored 1–5 scale (BARS), collected via survey from 407 respondents, capped at 6 queries per reviewer.
- Preference sets: Pointwise and triplet forms, supporting both pairwise and listwise evaluation.
Papers in the test split are excluded from all training-derived data; reviewer overlap between splits is permitted.
4.2 Data Schema and Access
Each example is a JSON object with fields:
- `paper_id`, `title`, `abstract`
- `reviewer_id` (anonymized)
- `reviewer_profile` (keyword-based, two-year publication list)
- `rating` (integer 1–5)
- Optional: `paper_citation_ids`
Distributed as JSONL, validated by schema; loadable out-of-the-box via Hugging Face (Gnociew/LR-bench).
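A minimal loading sketch, assuming the field names listed above; the validation logic is illustrative, not the official loader or schema validator.

```python
import json

# Required fields per the schema described above.
REQUIRED = {"paper_id", "title", "abstract", "reviewer_id", "reviewer_profile", "rating"}

def load_lr_bench(path):
    """Load a JSONL file of examples, checking required fields and rating range."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            missing = REQUIRED - obj.keys()
            if missing:
                raise ValueError(f"missing fields: {missing}")
            if not (isinstance(obj["rating"], int) and 1 <= obj["rating"] <= 5):
                raise ValueError(f"bad rating: {obj['rating']!r}")
            examples.append(obj)
    return examples
```

In practice the Hugging Face `datasets` loader handles this directly; the sketch only shows what a schema check over the JSONL records amounts to.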
4.3 Evaluation Protocol
The core evaluation metrics are:
| Metric | Definition |
|---|---|
| Normalized loss | Fraction of incorrectly ordered preference pairs, normalized by the total number of ordered pairs |
| Precision@k | Fraction of correctly ordered pairs/lists at cutoff k |
| Human win rate | Experts compare top-3 reviewer lists from competing methods |
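The pairwise metrics can be sketched on a toy example. All numbers are invented; Precision@k is computed here as overlap between predicted and ideal top-k sets, which is one common reading of the metric, not necessarily the benchmark's exact definition.

```python
from itertools import combinations

# Toy data: expert ratings and a method's predicted match scores
# for four candidate reviewers of one paper.
ratings = [5, 3, 4, 1]
scores  = [0.9, 0.6, 0.5, 0.2]

# Only strictly ordered preference pairs count.
pairs = [(i, j) for i, j in combinations(range(len(ratings)), 2)
         if ratings[i] != ratings[j]]

# A pair is wrong when the predicted ordering contradicts the expert ordering.
wrong = sum((ratings[i] - ratings[j]) * (scores[i] - scores[j]) < 0 for i, j in pairs)
normalized_loss = wrong / len(pairs)

# Set-overlap Precision@k between predicted and ideal top-k reviewers.
k = 2
top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
ideal_k = sorted(range(len(ratings)), key=lambda i: -ratings[i])[:k]
precision_at_k = len(set(top_k) & set(ideal_k)) / k
```

Here one of six pairs is inverted (reviewers rated 3 and 4 are swapped by the scores), giving a normalized loss of 1/6 and Precision@2 of 0.5.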
5. Methodology and Key Results (RATE Framework)
5.1 Reviewer Profiling and Annotation-free Learning
- Reviewer profiles: For each reviewer, aggregate keywords from two-year publications using an LLM (Qwen3-Max or GLM-4.6); serialize as a natural-language description.
- Dual-view preference supervision: Construct training triplets both paper-anchored (best/worst reviewer) and reviewer-anchored (best/worst paper), based on sparse BM25 retrieval scores.
- Fine-tuning: Embedding model (e.g., Qwen3-Embedding-8B) fine-tuned via LoRA on these triplets, using cosine similarity as the score.
The dual-view approach yields significant gains (+2.15% precision over best single-view).
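The dual-view triplet construction can be sketched as follows, assuming a precomputed BM25-style relevance matrix between reviewers and papers; the matrix here is random toy data and the best/worst selection is a simplification of the RATE procedure.

```python
import numpy as np

# Toy BM25-style relevance scores: bm25[r, p] for 5 reviewers x 6 papers.
rng = np.random.default_rng(1)
bm25 = rng.random((5, 6))

# Paper-anchored view: for each paper, (anchor paper, best reviewer, worst reviewer).
paper_triplets = [(p, int(bm25[:, p].argmax()), int(bm25[:, p].argmin()))
                  for p in range(bm25.shape[1])]

# Reviewer-anchored view: for each reviewer, (anchor reviewer, best paper, worst paper).
reviewer_triplets = [(r, int(bm25[r].argmax()), int(bm25[r].argmin()))
                     for r in range(bm25.shape[0])]

# Both triplet sets would then feed a contrastive (e.g., triplet-loss) LoRA
# fine-tune of the embedding model, with cosine similarity as the score.
```

Because the supervision comes from retrieval scores rather than labels, the training set scales with the corpus; the two anchoring directions are what the text calls the dual views.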
5.2 Empirical Performance
| Algorithm | Avg. Loss ↓ | Avg. Precision ↑ |
|---|---|---|
| TPMS (TF-IDF baseline) | 0.2597 | 71.49% |
| SciNCL (max pooling) | 0.2203 | 73.64% |
| SPECTER2 PRX (max pooling) | 0.2074 | 75.17% |
| GLM-4.6 + RATE-8B | 0.1965 | 76.78% |
| Qwen3-Max + RATE-8B | 0.1904 | 77.41% |
Human win rate for the best method is 35–44% against strong baselines, validating listwise improvements.
5.3 Pipeline Integration
The released codebase supports a full pipeline for live reviewer assignment, ingesting current reviewer publication lists and matching new submissions by direct embedding comparison.
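The matching step reduces to a cosine-similarity ranking, sketched below; the random vectors are stand-ins for the fine-tuned model's paper and reviewer-profile embeddings.

```python
import numpy as np

def cosine_rank(paper_emb, reviewer_embs, top_k=3):
    """Rank reviewers for a submission by cosine similarity of embeddings."""
    p = paper_emb / np.linalg.norm(paper_emb)
    r = reviewer_embs / np.linalg.norm(reviewer_embs, axis=1, keepdims=True)
    sims = r @ p                       # cosine similarity per reviewer
    order = np.argsort(-sims)[:top_k]  # indices of the top-k reviewers
    return order.tolist(), sims[order].tolist()

# Toy stand-ins for embedded reviewer profiles and a new submission.
rng = np.random.default_rng(2)
reviewers = rng.normal(size=(10, 16))
paper = rng.normal(size=16)
top, sims = cosine_rank(paper, reviewers)
```

With precomputed reviewer-profile embeddings, assigning a new submission is a single matrix-vector product, which is what makes the live pipeline cheap.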
6. Implications and Limitations
6.1 For LLM Routing
- Main bottleneck is in the per-query decision mechanism, not model/embedding backbone.
- Automated, large-scale, and latency-aware router evaluation is now feasible across diverse LLM pools.
6.2 For Peer Review Assignment
- LR-bench (RATE) uniquely offers contemporary, expert-graded ground-truth spanning the LLM era.
- Accurate profiling and embedding fine-tuning with dual-view training yield state-of-the-art reviewer–paper matching.
- Reviewer-centric annotation-free frameworks efficiently scale as topical drift accelerates.
6.3 Commonalities
Both LR-bench frameworks exemplify modern benchmark development—transparent, broad coverage, reproducible methodology, and focus on operational deployment. They provide high-confidence reference points for both model selection and critical workflow automation in AI research.
7. Availability and Research Directions
- LLMRouterBench: Released at https://github.com/ynulihao/LLMRouterBench, includes datasets, evaluation APIs, and router baselines (Li et al., 12 Jan 2026).
- LR-bench (RATE): Released at https://huggingface.co/datasets/Gnociew/LR-bench and https://github.com/Gnociew/RATE-Reviewer-Assign (Liu et al., 27 Jan 2026).
Future research directions for both domains prioritize (i) improved recall of rare/single-expert cases, (ii) deeper usage of uncertainty and difficulty estimation, (iii) latency- and cost-aware deployment, and (iv) architecture and training that robustly generalize to new domains and query modalities.