PeerReview Bench

Updated 22 May 2026

PeerReview Bench is a comprehensive framework that curates real-world datasets and standardized protocols for assessing automated peer review systems.
It encompasses diverse tasks such as review generation, reviewer assignment, and code review, evaluated with metrics such as BLEU, ROUGE-L, and F1.
The benchmarks reveal insights into LLM scaling, multimodal effectiveness, and robustness against adversarial challenges, guiding improvements in automated peer review.

PeerReview Bench is an umbrella term for systematically curated, multi-dimensional benchmark suites used to evaluate and develop automated peer review systems in academic publishing. The concept spans full-paper review generation, multi-turn reviewer–author dialogues, reviewer recommendation and assignment, and reviewer quality assessment. Benchmarks under this designation are characterized by standardized datasets, multi-faceted task definitions, and metrics that measure not only how closely LLM-generated reviews overlap with human-written ones, but also correctness, completeness, robustness, and process alignment with real-world peer review.

1. Foundations and Benchmark Design

PeerReview Bench implementations integrate carefully curated datasets, standardized evaluation protocols, and task taxonomies spanning the peer review lifecycle. Benchmarks such as MMReview, Re² (Review & Rebuttal), EchoReview-16K, CoCoReviewBench, and full-context code review datasets define the state of the art.

Dataset Composition: Datasets are constructed from real academic papers and their associated review artifacts. For example, MMReview includes 240 peer-reviewed manuscripts covering 17 domains across AI, Natural Sciences, Engineering, and Social Sciences, each with human-written review comments and multimodal content (text, figures, tables, PDF images) (Gao et al., 19 Aug 2025). Re² collects 19,926 initial submissions, 70,668 review comments, and 53,818 rebuttal turns from 45 venues (Zhang et al., 12 May 2025). CoCoReviewBench improves the correctness of its reference set by structuring, filtering, and adjudicating 134.8K atomic reviewer opinions across 3,900 ICLR and NeurIPS papers (Deng et al., 8 May 2026).
Taxonomy of Tasks: Task sets span summary generation, strengths and weaknesses extraction, soundness and presentation scoring, outcome formulation (including direct, conditional, and chain-of-thought decisions), preference ranking, adversarial robustness evaluation, reviewer assignment, code review, and review quality annotation.

2. Representative Benchmarks and Datasets

The following table summarizes characteristic PeerReview Bench resources:

Benchmark	Domains/Content	Key Features	Reference
MMReview	240 papers (4 major disciplines)	13 stepwise LLM/MLLM tasks, multimodality	(Gao et al., 19 Aug 2025)
Re²	19,926 AI papers	Full-stage review + multi-turn rebuttal	(Zhang et al., 12 May 2025)
EchoReview-16K	16,306 AI papers (cross-conference)	Citation-context mined, CoT review synthesis	(Zhang et al., 31 Jan 2026)
CoCoReviewBench	3,900 ML papers	Category-aligned, correctness-oriented	(Deng et al., 8 May 2026)
SWE-PRBench	350 PRs (OSS code)	Multi-context, issue-aligned code review	(Kumar, 27 Mar 2026)
SWRBench	1,000 PRs (OSS Python code)	Full-project, LLM-verified, F1/precision/recall	(Zeng et al., 1 Sep 2025)
OmniReview	202,756 verified review records	Three-tier reviewer recommendation	(Huang et al., 9 Feb 2026)
LR-Bench	1,055 annot (AI/NLP, 2024–2025)	Reviewer expertise ranking, annotation-free	(Liu et al., 27 Jan 2026)

Each resource aims for broad coverage of realistic review workflows and tasks. Multimodal and multidomain data are reported to be important for generalization (Gao et al., 19 Aug 2025).

3. Task Taxonomies and Evaluation Metrics

PeerReview Bench frameworks define rich task sets and scoring regimes:

Step-wise Review Generation: Models must generate summaries, bullet-point strengths/weaknesses, and provide scalar soundness and presentation ratings. Step-wise protocols match editorial review forms.
Outcome Formulation: Tasks require models to map from manuscript and/or review content to granular scores or binary accept/reject decisions, often in chain-of-thought steps to elicit reasoning (Gao et al., 19 Aug 2025).
Preference Alignment: Models rank candidate papers or choose which in a pair is superior, benchmarking fine-grained value alignment with human judgment.
Robustness Testing: Adversarial tasks probe susceptibility to prompt injection, inverted strengths/weaknesses, and score inflation.
Reviewer-Assignment and Profiling Tasks: Systems match reviewers with submissions based on semantic profiles, optimizing for recall, discrimination, and ranking metrics (OmniReview, RATE) (Huang et al., 9 Feb 2026, Liu et al., 27 Jan 2026).
Code Review Tasks: Models must detect human-flagged issues in real pull requests, stratified by context provision, issue type, and actionability (Kumar, 27 Mar 2026, Zeng et al., 1 Sep 2025).
Evaluation Protocols:
- Text similarity: BLEU, ROUGE-L, BERTScore, BARTScore, EmbedCos.
- Human-alignment: LLM-as-a-Judge, Likert/ordinal scoring, Cohen's/Fleiss' κ.
- Numeric predictions: MAE, MSE, accuracy.
- Detection tasks: precision, recall, F1.
- Completeness/correctness: category-level scoring restricted to categories for which reference opinions exist; meta-review adjudication to remove erroneous opinions (CoCoReviewBench).
- Reviewer assignment: MAP, nDCG, R-Precision, Reciprocal Rank (OmniReview); normalized ranking loss for expertise ranking (RATE).
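
Several of the metrics listed above have short closed-form definitions. The following is a minimal sketch of four of them (MAE, set-based precision/recall/F1, reciprocal rank, and nDCG@k) using their textbook formulas. The function names and input conventions are assumptions made for illustration; individual benchmarks may match predictions to references differently (for example, via an LLM judge rather than exact set membership).

```python
import math
from typing import Dict, Sequence, Set, Tuple


def mean_absolute_error(pred: Sequence[float], gold: Sequence[float]) -> float:
    """MAE between predicted and human scores (numeric prediction tasks)."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def precision_recall_f1(predicted: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 (detection tasks).

    Assumes predicted and reference issues were already mapped to shared IDs.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with graded relevance; items missing from `gains` count as 0."""
    def dcg(values: Sequence[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(v / math.log2(i + 2) for i, v in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

In this sketch, a score-prediction task would call `mean_absolute_error` on model and human ratings, a code review task would call `precision_recall_f1` on detected versus human-flagged issue IDs, and a reviewer assignment task would call `reciprocal_rank` or `ndcg_at_k` on a ranked list of reviewer IDs.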

4. Benchmarking Results and Insights

PeerReview Bench resources enable standardized, cross-model evaluations, which have revealed several domain-wide trends:

Model Scaling: LLM parameter count is positively correlated with decision- and ranking-task fidelity. Large closed models (GPT-4o, Gemini, Claude) yield the lowest MAE and the highest agreement with human rankings (Gao et al., 19 Aug 2025).
Multimodality: Incorporating extracted figures, tables, or PDF-image inputs reduces model vulnerability to prompt injection and increases review robustness (Gao et al., 19 Aug 2025).
Chain-of-Thought: CoT methods show consistent improvements in score prediction and overall review quality (Gao et al., 19 Aug 2025, Zhang et al., 31 Jan 2026).
Correctness over Overlap: Traditional overlap metrics systematically overestimate AI reviewer performance. Robust benchmarks use adjudicated, filtered, and category-aligned reference sets, resulting in stricter correctness metrics and lower but more meaningful scores (Deng et al., 8 May 2026).
Domain-dependence: Task difficulty and achievable accuracy vary by discipline; LLMs score engineering papers more accurately than natural-science papers (Gao et al., 19 Aug 2025).
Attention Dilution in Code Review: In software peer review, expanding context beyond the diff collapses contextual issue detection ("attention dilution"), with all models monotonically degrading from diff-only to full-project context (Kumar, 27 Mar 2026).
Reviewer Assignment: Modern reviewer recommendation models (Pro-MMoE, RATE) improve assignment quality through semantic profiling, mixed gating architectures, or annotation-free contrastive-preference training, outperforming TF-IDF and earlier embedding baselines on recent benchmarks (Huang et al., 9 Feb 2026, Liu et al., 27 Jan 2026).
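
For context on the TF-IDF baseline mentioned in the reviewer assignment results, the sketch below ranks reviewers by cosine similarity between TF-IDF vectors of a submission and of each reviewer's profile text. It is a generic illustration, not the baseline implementation used by OmniReview or LR-Bench: the tokenizer, the smoothed IDF formula, and the example profiles are all assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Build sparse TF-IDF vectors with smoothed IDF: log((1+N)/(1+df)) + 1."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return {name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(submission: str, profiles: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (reviewer, similarity) pairs sorted from best to worst match."""
    docs = dict(profiles)
    docs["__submission__"] = submission  # reserved key, assumed unused by reviewers
    vectors = tfidf_vectors(docs)
    query = vectors.pop("__submission__")
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical reviewer profiles, e.g. concatenated titles of past papers.
PROFILES = {
    "alice": "graph neural networks molecular property prediction",
    "bob": "prompt injection robustness of large language models",
    "carol": "code review automation pull requests static analysis",
}

ranking = rank_reviewers(
    "Evaluating robustness of language models to prompt injection in peer review",
    PROFILES,
)
```

A baseline like this relies only on lexical overlap, which is one reason embedding-based and profile-based models can outperform it when a reviewer's expertise is described in different vocabulary than the submission.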

5. Robustness, Adversarial Protocols, and Error Analysis

Rigorous PeerReview Bench protocols stress test reviewer systems for stability and adversarial resistance.

Prompt Injection: MMReview and SWE-PRBench measure the degradation in performance when hidden instruction strings (e.g., “IGNORE ALL PREVIOUS INSTRUCTIONS”) are injected. Multimodal models limit score inflation to under 1 point, compared with 1–2 points for language-only models (Gao et al., 19 Aug 2025).
Fake Strengths/Weaknesses: Strengths and weaknesses are inverted using GPT-4o-generated antonyms to probe vulnerability to reversed sentiment; the resulting MAE increase and hallucination rates are tracked (Gao et al., 19 Aug 2025).
Meta-review Adjudication: CoCoReviewBench's ground truth is filtered via meta-review opinion to resolve reviewer–author and reviewer–reviewer conflicts, reducing propagation of erroneous references (Deng et al., 8 May 2026).
Code Review Hallucination: SWE-PRBench and SWRBench report hallucination rates, actionability, and semantic alignment, identifying critical failure points in issue detection and diagnosis (Kumar, 27 Mar 2026, Zeng et al., 1 Sep 2025).
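
The robustness measurements described above compare a model's outputs on clean and perturbed inputs. The sketch below shows one plausible way to compute two such quantities: the mean score inflation under prompt injection and the increase in MAE against human scores. The exact definitions used by MMReview are not given here, so the formulas, function names, and example scores are assumptions for illustration.

```python
from statistics import mean
from typing import Sequence


def _check_lengths(*series: Sequence[float]) -> None:
    """Ensure all score lists are non-empty and aligned paper by paper."""
    lengths = {len(s) for s in series}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("score lists must be non-empty and of equal length")


def score_inflation(clean: Sequence[float], injected: Sequence[float]) -> float:
    """Mean per-paper change in predicted score after injection (positive = inflated)."""
    _check_lengths(clean, injected)
    return mean(i - c for c, i in zip(clean, injected))


def mae(pred: Sequence[float], human: Sequence[float]) -> float:
    """Mean absolute error between predicted and human scores."""
    _check_lengths(pred, human)
    return mean(abs(p - h) for p, h in zip(pred, human))


def mae_increase(
    clean: Sequence[float], perturbed: Sequence[float], human: Sequence[float]
) -> float:
    """How much the perturbation worsens agreement with human scores."""
    return mae(perturbed, human) - mae(clean, human)


# Hypothetical scores for three papers on a 1-10 scale.
human_scores = [5, 5, 4]
clean_scores = [5, 6, 4]
injected_scores = [6, 7, 6]

inflation = score_inflation(clean_scores, injected_scores)
degradation = mae_increase(clean_scores, injected_scores, human_scores)
```

Reporting both values is useful because inflation captures the direction of the shift, while the MAE increase captures how far the perturbed scores drift from human judgment regardless of direction.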

6. Integration, Extensibility, and Applications

PeerReview Bench resources are modular and extensible, enabling integration across research workflows and system development:

Open Data/Code: All major resources (MMReview, Re², EchoReview-16K, SWE-PRBench, SWRBench, OmniReview, LR-Bench, PeeriScope) release open-source datasets, task harnesses, and evaluation scripts, with Dockerized and cloud deployment support (Ebrahimi et al., 27 Apr 2026).
Evaluation Harnesses: Standardized APIs and JSON prompts facilitate model evaluation, cross-comparison, and dashboard generation (Ebrahimi et al., 27 Apr 2026).
Downstream Use Cases: Editorial triage (flagging low-quality reviews), reviewer self-assessment, large-scale audit for program chairs, dynamic reviewer–author dialogue simulation, and peer review process simulation are typical applications (Ebrahimi et al., 27 Apr 2026, Zhang et al., 12 May 2025).
Extensions: Future work includes incorporating multidisciplinary data, integrating multimodal signals, bias/fairness auditing, and dynamic updating of reviewer profiles (Huang et al., 9 Feb 2026, Liu et al., 27 Jan 2026).
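
To make the idea of JSON-based evaluation prompts concrete, the sketch below builds a task request as JSON and validates a model's JSON response before it is scored. The schema is entirely hypothetical: the field names (`task`, `paper_id`, `response_format`, `summary`, `strengths`, `weaknesses`, `soundness`) are assumptions and do not reproduce the format of any benchmark cited here.

```python
import json
from typing import Any, Dict

# Hypothetical response schema: field name -> expected Python type.
REQUIRED_FIELDS = {
    "summary": str,
    "strengths": list,
    "weaknesses": list,
    "soundness": int,
}


def build_prompt(paper_id: str, paper_text: str, task: str) -> str:
    """Serialize one evaluation request as a JSON prompt."""
    payload = {
        "task": task,
        "paper_id": paper_id,
        "paper_text": paper_text,
        # Tell the model which fields and types the harness expects back.
        "response_format": {name: typ.__name__ for name, typ in REQUIRED_FIELDS.items()},
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


def parse_response(raw: str) -> Dict[str, Any]:
    """Parse a model response and reject it if required fields are missing or mistyped."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for name, typ in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], typ):
            raise ValueError(f"field {name} must be of type {typ.__name__}")
    return data


prompt = build_prompt("paper-001", "Full text of the manuscript goes here.", "stepwise_review")
review = parse_response(
    '{"summary": "Proposes a benchmark.", "strengths": ["Clear setup"], '
    '"weaknesses": ["Small sample"], "soundness": 3}'
)
```

Validating responses at this boundary keeps malformed outputs from silently entering metric computation; a harness could count rejected responses as a separate failure category.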

7. Future Directions and Open Challenges

Despite rapid advances, PeerReview Bench analysis identifies ongoing limitations and areas for methodological refinement:

Generalization: While benchmarks exist for AI, ML, and computer science, multidisciplinary expansion remains incomplete, especially for humanities and medical domains (Zhang et al., 31 Jan 2026).
Bias and Hallucination: LLM-based judges and reference construction can propagate artifact bias or incomplete coverage. Systematic bias audits and active learning are needed for robust model calibration (Zhang et al., 31 Jan 2026, Deng et al., 8 May 2026).
Completeness vs. Specificity: AI reviewers now cover more categories per paper than single humans but generally score lower per-category in thoroughness; achieving both breadth and depth is an ongoing challenge (Deng et al., 8 May 2026).
Reviewer Assignment Complexity: Author-order significance, co-authorship, and social/citation networks are incompletely modeled for reviewer assignment (Liu et al., 27 Jan 2026).
Simulation Fidelity: Simulation-based benchmarks (e.g., CI/MBC-based ranking) must be instantiated using real submission/reviewer data to validate assumptions; current analytic results are concentrated in grant panel contexts (Steppi et al., 2018).

A plausible implication is that PeerReview Bench will continue to evolve in multidomain coverage, robustness audits, and benchmark scenarios, supporting iterative improvement for academic peer review automation and reviewer assignment pipelines.