ReviewBench: Automated Peer Review Benchmark

Updated 4 July 2026

ReviewBench is a benchmark framework for evaluating LLM-generated peer reviews using both human-alignment and rubric compliance metrics.
It comprises two lineages: one focusing on similarity to human reviews (ReviewAgents) and another on evidence-grounded, rubric-driven analysis (ReviewGrounder).
Key metrics include language diversity, semantic and sentiment consistency, and paired win rate, offering actionable insights into review quality.

REVIEWBENCH is a name used for benchmark formulations in the literature on automated scholarly reviewing with LLMs. In "ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews" (Gao et al., 11 Mar 2025), ReviewBench is a downstream evaluation suite built from recent ICLR and NeurIPS papers and their human reviews, designed to measure how close model-generated reviews are to human reviews along language diversity, semantic consistency, sentiment consistency, and pairwise preference. In "ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents" (Li et al., 15 Apr 2026), REVIEWBENCH denotes a benchmark built from DeepReview-13K that evaluates review text with paper-specific rubrics derived from official guidelines, the paper’s content, and human-written reviews, with an explicit focus on substantiveness, grounding, and rubric compliance.

1. Two benchmark lineages

The term REVIEWBENCH does not denote a single immutable benchmark. In the current literature, it refers to at least two closely related but methodologically distinct benchmark designs for LLM-generated peer review. The first centers on similarity to human reviews and comparative preference; the second centers on rubric-based assessment of review quality, evidence-grounding, and correctness. This distinction matters because the two benchmarks ask different evaluation questions and reward different reviewer behaviors (Gao et al., 11 Mar 2025, Li et al., 15 Apr 2026).

Variant	Source and scale	Core evaluation logic
ReviewBench	100 ICLR 2024 and NeurIPS 2024 papers	Human-reference alignment through diversity, overlap, sentiment, and Review Arena
REVIEWBENCH	≈1,300 ICLR 2024–2025 papers from DeepReview-13K	Paper-specific rubrics over 8 dimensions plus numeric-field prediction

The first formulation was introduced together with ReviewAgents and was explicitly presented as a benchmark for quantitatively evaluating LLM-generated paper reviews. The second was introduced together with ReviewGrounder and was presented as a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper’s content, and human-written reviews. A common misconception is that these are merely different releases of the same benchmark. The published descriptions instead support reading them as distinct benchmark frameworks sharing a name but differing in source data, scoring protocol, and target notion of review quality.

2. ReviewBench in ReviewAgents: construction and task setup

In the ReviewAgents formulation, ReviewBench is constructed from recent AI conference papers and their reviews, specifically ICLR 2024 and NeurIPS 2024 (Gao et al., 11 Mar 2025). ICLR papers and reviews come from OpenReview; NeurIPS papers come from the NeurIPS Proceedings, with reviews obtained via OpenReview where available. These years were chosen because ICLR 2024 and NeurIPS 2024 postdate the pretraining cutoffs of major models such as GPT-4o and Llama-3.1, so the benchmark’s test data should not be present in most LLMs’ pretraining corpora.

The benchmark contains 100 papers in total, with an accept:reject ratio of 3:7, matching the approximate acceptance rates of these conferences. The 100 papers are explicitly removed from the Review-CoT training data, so ReviewBench acts as a held-out test set. For each test paper, the benchmark uses the full paper, parsed from PDF into structured JSON with SciPDF Parser, and human review comments from the conference review system. Human reviews serve as the reference texts against which model-generated reviews or meta-reviews are compared.
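The held-out construction described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the record layout (`paper_id`, `decision`), the function name `build_heldout`, and the seed are all assumptions introduced here.

```python
import random


def build_heldout(papers, n_total=100, accept_share=0.3, seed=0):
    """Sample a held-out test set with a fixed accept:reject ratio.

    `papers` is a list of dicts with hypothetical keys "paper_id" and
    "decision" ("accept" or "reject"). Returns (test_set, training_pool).
    """
    rng = random.Random(seed)
    accepted = [p for p in papers if p["decision"] == "accept"]
    rejected = [p for p in papers if p["decision"] == "reject"]

    n_accept = round(n_total * accept_share)  # 30 of 100 for a 3:7 ratio
    n_reject = n_total - n_accept

    test_set = rng.sample(accepted, n_accept) + rng.sample(rejected, n_reject)
    test_ids = {p["paper_id"] for p in test_set}

    # Remove test papers from the training pool so the test set stays held out.
    training_pool = [p for p in papers if p["paper_id"] not in test_ids]
    return test_set, training_pool
```

Fixing the seed keeps the split reproducible, and filtering by identifier rather than by object identity guards against the same paper appearing twice in the source pool.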

The task setup in the paper focuses on meta-review generation quality. For ReviewAgents, reviewer agents first generate individual reviews using the full paper and retrieved relevant papers, and an area chair agent then aggregates these into a meta-review. For baseline models, the paper states that it evaluates the meta-reviews generated by different models. In this usage, the model input is the paper, including title, abstract, and full text, and the output is a single structured review or meta-review text. The benchmark itself is evaluation-only and does not provide an official training split; training is done on the separate Review-CoT dataset.

This design makes ReviewBench a held-out, contamination-resistant evaluation suite for review generation. It is explicitly intended to provide standardized, model-agnostic metrics for a task that had historically been evaluated mostly via ad-hoc model scores or human ratings.

3. ReviewBench in ReviewAgents: metrics, arena evaluation, and findings

ReviewBench evaluates model-generated review text along four dimensions: Language Diversity, Semantic Consistency, Sentiment Consistency, and Review Arena (Gao et al., 11 Mar 2025). The metric set is:

Language Diversity: $\text{Distinct}_4$, Inverse Self-BLEU@4
Semantic Consistency: ROUGE-1, ROUGE-L, SPICE
Sentiment Consistency: BERTScore, VADER Score
Review Arena: Win Rate

For language diversity, the benchmark uses the proportion of unique 4-grams,

$\text{Distinct}_4 = \frac{\text{number of unique 4-grams in the generated text}}{\text{total number of 4-grams in the generated text}},$

and an inverted Self-BLEU measure,

$\text{Inverse Self-BLEU} = 1 - \text{Self-BLEU}.$

Higher values indicate less repetition and more varied language.
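As a rough illustration of the 4-gram diversity ratio, the sketch below counts unique 4-grams over whitespace tokens. The tokenization and the handling of very short texts are assumptions made here; the benchmark's exact preprocessing is not described in this section.

```python
def distinct_n(text, n=4):
    """Fraction of n-grams in `text` that are unique (whitespace tokens)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens: no n-grams to count
    return len(set(ngrams)) / len(ngrams)


# Five 4-grams, one of which repeats, so four are unique.
print(distinct_n("a b c d a b c d"))  # 0.8
```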

For semantic consistency, ROUGE-1 and ROUGE-L are computed as F1 scores against human reviews, and SPICE is repurposed as a semantic similarity metric through semantic graph matching. For sentiment consistency, BERTScore is framed as viewpoint alignment and VADER Score measures closeness of sentiment signals between human and model-generated reviews. Review Arena complements these automatic metrics with pairwise preference judgments: for the same paper, two systems’ reviews are shown to GPT-4o together with the human review(s), and the judge chooses which system’s review is more similar in content and form to the human review. Each pair is evaluated twice with reversed ordering to reduce position bias. Win Rate is then

$\text{Win Rate} = \frac{\text{number of pairwise matches the model wins}}{\text{total number of pairwise matches involving the model}}.$
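The pairwise protocol and the resulting win rate can be sketched as below. The `judge` callable is hypothetical: it stands in for the GPT-4o judge, which in the paper's setup also sees the human review(s), and here it simply returns the index of the preferred review. All names are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def arena_win_rates(reviews, judge):
    """Compute per-system win rates from pairwise judged matches.

    `reviews` maps system name -> review text for one paper.
    `judge(first, second)` is a hypothetical callable returning 0 if the
    first review is preferred and 1 if the second is preferred.
    Each pair is judged twice, with the presentation order reversed.
    """
    wins = defaultdict(int)
    matches = defaultdict(int)
    for a, b in combinations(sorted(reviews), 2):
        for first, second in ((a, b), (b, a)):
            winner = (first, second)[judge(reviews[first], reviews[second])]
            wins[winner] += 1
            matches[first] += 1
            matches[second] += 1
    return {name: wins[name] / matches[name] for name in matches}
```

Judging both orderings means a judge that always favors the first slot splits each pair evenly instead of inflating one system's score.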

The reported quantitative results establish a sizable human–AI gap. Human reviews, serving as the reference upper bound, receive an Overall score of 98.39. Among evaluated LLM systems, ReviewAgents achieves the strongest Overall score at 54.72, with $\text{Distinct}_4$ 96.57, Inverse Self-BLEU@4 77.60, ROUGE-1 42.88, ROUGE-L 19.27, SPICE 15.75, BERTScore 44.71, and VADER Score 86.26. This surpasses Claude-3.5-sonnet at 53.40 and DeepSeek-R1 at 53.35, and substantially improves on its base model, Llama-3.1-8B-Instruct, which scores 43.31 overall. Figure 1 in the paper further shows that ReviewAgents achieves higher win rates in Review Arena than all evaluated closed-source and open-source LLMs.

ReviewBench also supports ablation on the number of reviewer agents $N$. In the reported experiments, $N=3$ yields the highest Overall score of 54.72, while $N=4$ is very close at 54.50. The paper attributes degradation for $N>4$ to longer context and conflicting opinions hindering the area chair agent, and notes that Review-CoT typically contains 3–4 human reviews per paper. This suggests that the benchmark is sensitive not only to model scale but also to review-system design choices such as structured reasoning, relevant-paper-aware conditioning, and multi-agent aggregation.

4. REVIEWBENCH in ReviewGrounder: rubric-driven design

In the ReviewGrounder formulation, REVIEWBENCH is defined as both a dataset and an evaluation protocol. For each paper $p$, it provides the paper text, a set of normalized human reviews, an aggregated reference review, a set of paper-specific rubrics, and a fixed evaluator that scores a candidate review along 8 dimensions and outputs an overall content score (Li et al., 15 Apr 2026).

The source dataset is DeepReview-13K, containing ICLR submissions and reviews for ICLR 2024–2025. After filtering out empty or incomplete submissions, desk-rejected or withdrawn papers, papers with fewer than three complete human reviews, and papers missing mandatory textual fields or numeric scores, the authors obtain a pool of approximately 12,000 papers. They then sample approximately 10% using a fixed random seed, yielding approximately 1.3K papers for the core ReviewBench. Human reviews are normalized into a standardized schema aligned with the official ICLR review template, with textual fields Summary, Strengths, Weaknesses, and Questions, and numeric fields Overall rating, Soundness, Presentation, Contribution, Confidence, and Final decision.
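The filtering and sampling steps above might look roughly like the following. This is one possible reading under stated assumptions: the `Paper` class, the field names, the status labels, and the seed are hypothetical, and the paper-level final decision is left out of the per-review completeness check.

```python
import random
from dataclasses import dataclass, field

TEXT_FIELDS = ("summary", "strengths", "weaknesses", "questions")
NUMERIC_FIELDS = ("rating", "soundness", "presentation", "contribution", "confidence")


@dataclass
class Paper:
    paper_id: str
    status: str   # hypothetical labels, e.g. "accepted", "withdrawn", "desk_rejected"
    text: str
    reviews: list = field(default_factory=list)  # each review is a dict


def is_complete(review):
    """A review is complete if all textual and numeric fields are present."""
    has_text = all(review.get(k) for k in TEXT_FIELDS)
    has_numbers = all(review.get(k) is not None for k in NUMERIC_FIELDS)
    return has_text and has_numbers


def keep(paper, min_reviews=3):
    """Apply the filters described above to a single paper."""
    if not paper.text.strip():
        return False  # empty submission
    if paper.status in {"withdrawn", "desk_rejected"}:
        return False
    complete = [r for r in paper.reviews if is_complete(r)]
    return len(complete) >= min_reviews


def sample_core(papers, fraction=0.10, seed=0):
    """Filter the pool, then draw a reproducible sample of about `fraction`."""
    pool = [p for p in papers if keep(p)]
    k = round(len(pool) * fraction)
    return random.Random(seed).sample(pool, k)
```

Using a dedicated `random.Random(seed)` instance, rather than the module-level generator, keeps the sample stable even if other code draws random numbers in between.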

For each paper, the aggregated reference review is constructed with DeepSeek-R1-Distill-Qwen-32B by consolidating the textual parts of all human reviews into a single structured review. The ground-truth rating is the mean overall rating across the paper's human reviews,

$\text{Ground-truth rating} = \frac{\text{sum of the overall ratings in the paper's human reviews}}{\text{number of human reviews for the paper}},$

and the ground-truth decision is taken from metadata.

A central innovation is the use of paper-specific rubrics instantiated from eight paper-agnostic meta-rubrics derived from ICML, ICLR, and NeurIPS official reviewer guidelines plus expert feedback and iteration. The eight dimensions are Core Contribution Accuracy, Results Interpretation, Comparative Analysis, Evidence-Based Critique, Critique Clarity, Completeness Coverage, Constructive Tone, and False or Contradictory Claims. Paper-specific rubric construction is performed by GPT-OSS-120B conditioned on the paper text, the meta-rubrics, and the aggregated reference review. Each rubric item must be grounded in verifiable evidence and independently checkable.

Given a paper, a candidate review, and one paper-specific rubric dimension, the evaluator produces a discrete score for that dimension. The seven positive dimensions contribute non-negative scores, while the pitfall dimension False or Contradictory Claims contributes a non-positive score. The overall content score is the sum over the eight dimensions,

$\text{Overall content score} = \sum_{k=1}^{8} \text{score}_k.$

The benchmark also evaluates numeric-field prediction with MSE and MAE for ratings, and ACC and F1 for Accept/Reject decisions. As in the earlier ReviewBench formulation, it is primarily a held-out evaluation suite rather than a benchmark with internal train/validation/test splits.
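The numeric-field metrics named above are standard and can be sketched in a few lines. The function names are assumptions, and F1 is computed here for the Accept class only; whether the paper uses binary or macro-averaged F1 is not stated in this section.

```python
def rating_errors(pred, gold):
    """Return (MSE, MAE) between predicted and ground-truth ratings."""
    diffs = [p - g for p, g in zip(pred, gold)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae


def decision_scores(pred, gold, positive="accept"):
    """Return (accuracy, F1) for binary Accept/Reject decisions.

    F1 is computed for the `positive` class only (an assumption of this sketch).
    """
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    acc = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1


print(rating_errors([6, 4], [5, 6]))  # (2.5, 1.5)
```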

5. REVIEWBENCH in ReviewGrounder: validation and empirical use

A defining feature of the ReviewGrounder formulation is explicit validation of its rubric-based evaluator against human experts (Li et al., 15 Apr 2026). The human study uses 120 ReviewBench papers and experts with strong publication records (approximately 2000 citations), who rate ReviewGrounder’s reviews according to the meta-rubrics. Comparing human overall scores with LLM-evaluator scores yields a mean absolute error of 0.0969 and a pairwise error of 0.1494; the paper also reports Pearson and Spearman correlations between the two sets of scores. This is presented as evidence that the paper-specific rubric scoring is closely aligned with expert judgments.
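For readers who want to reproduce this kind of agreement check on their own scores, a minimal sketch follows. The definition of pairwise error used here (the share of item pairs that the two raters order in opposite directions, skipping ties) is an assumption; the paper's exact definition is not given in this section. Spearman correlation can be obtained by applying the same `pearson` function to rank-transformed scores.

```python
from itertools import combinations
from math import sqrt


def mean_absolute_error(human, model):
    """Average absolute gap between human and evaluator scores."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_error(human, model):
    """Share of item pairs that the two raters order in opposite directions.

    Pairs tied under either rater are skipped (an assumption of this sketch).
    """
    discordant = total = 0
    for i, j in combinations(range(len(human)), 2):
        dh, dm = human[i] - human[j], model[i] - model[j]
        if dh == 0 or dm == 0:
            continue
        total += 1
        discordant += (dh > 0) != (dm > 0)
    return discordant / total if total else 0.0
```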

On rubric-based content evaluation, ReviewGrounder achieves Core Contribution Accuracy 1.8507, Results Interpretation 1.4075, Comparative Analysis 0.9059, Evidence-Based Critique 1.4831, Critique Clarity 1.9191, Completeness Coverage 1.3289, Constructive Tone 1.9992, False/Contradictory Claims −0.1245, and an overall content score of approximately 10.77. The paper reports that this surpasses Qwen3-32B at approximately 7.80, GPT-4.1 at approximately 7.66, GPT-4o at approximately 4.58, DeepReviewer-14B at approximately 7.90, AI Scientist at approximately 3.68, and AgentReview at approximately 4.87.
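As a quick arithmetic check, the reported per-dimension scores for ReviewGrounder add up to the reported overall content score, which is consistent with the overall score being a plain sum over the eight dimensions. The snippet below only restates the figures quoted above; reading off the weakest positive dimension is an illustrative use of the score vector.

```python
reported = {
    "Core Contribution Accuracy": 1.8507,
    "Results Interpretation": 1.4075,
    "Comparative Analysis": 0.9059,
    "Evidence-Based Critique": 1.4831,
    "Critique Clarity": 1.9191,
    "Completeness Coverage": 1.3289,
    "Constructive Tone": 1.9992,
    "False/Contradictory Claims": -0.1245,  # pitfall dimension, negative
}

overall = sum(reported.values())
print(f"{overall:.4f}")  # 10.7699, i.e. approximately 10.77

# Lowest-scoring positive dimension in the reported vector.
positive = {k: v for k, v in reported.items() if v >= 0}
weakest = min(positive, key=positive.get)
print(weakest)  # Comparative Analysis
```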

On numeric-field evaluation, ReviewGrounder attains Decision ACC 0.6939, Decision F1 0.6699, Rating MSE 1.1607, and Rating MAE 0.8597. Under adversarial instruction injection into paper text, the benchmark’s rubric-based evaluation shows that DeepReviewer-14B drops from 7.70 to 7.30 overall, whereas ReviewGrounder drops only minimally from 10.70 to 10.65.
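To put the two robustness figures on a common footing, the drops can be expressed relative to each system's clean score. The helper name `relative_drop` is an assumption of this sketch; the input numbers are the ones quoted above.

```python
def relative_drop(clean, attacked):
    """Fractional score loss when moving from clean to attacked input."""
    return (clean - attacked) / clean


results = [("DeepReviewer-14B", 7.70, 7.30), ("ReviewGrounder", 10.70, 10.65)]
for name, clean, attacked in results:
    print(f"{name}: -{clean - attacked:.2f} ({relative_drop(clean, attacked):.1%})")
# DeepReviewer-14B: -0.40 (5.2%)
# ReviewGrounder: -0.05 (0.5%)
```

On these figures, the relative loss for DeepReviewer-14B is roughly ten times that of ReviewGrounder.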

These results position REVIEWBENCH not merely as a dataset but as an analysis framework for review substantiveness, evidence-grounding, and robustness. The benchmark is also intended as a diagnostic tool: because it provides the full 8D score vector, it can reveal whether a model is strong on Constructive Tone but weak on Comparative Analysis, or whether it produces critiques that are clear but insufficiently grounded.

6. Interpretation, limitations, and significance

The two REVIEWBENCH lineages reflect two different evaluation philosophies. The ReviewAgents version asks how closely generated reviews resemble human reviews in language, semantics, sentiment, and comparative preference. The ReviewGrounder version asks whether generated reviews are substantive, evidence-grounded, and rubric-compliant under paper-specific criteria. This suggests a methodological shift from evaluating resemblance to evaluating review utility and grounding, while still anchoring the task in human-written reviews and venue guidelines (Gao et al., 11 Mar 2025, Li et al., 15 Apr 2026).

Each benchmark also carries explicit limitations. In the ReviewAgents formulation, the benchmark is built from AI conference papers, so generalization to other fields is not guaranteed; most metrics measure similarity to human reviews rather than absolute usefulness; Review Arena relies on GPT-4o as a judge; and the benchmark size is 100 papers. In the ReviewGrounder formulation, core ReviewBench covers only ICLR 2024–2025 from DeepReview-13K; paper-specific rubrics partially inherit biases from human reviewers through the aggregated reference review; LLM-as-a-judge risks remain despite validation; and the curated cross-venue extension is described as partial and uneven. Both formulations therefore evaluate human alignment under specific institutional and disciplinary conditions rather than defining a field-independent standard.

A plausible implication is that REVIEWBENCH has become a site of methodological debate inside automated peer review research. One line of work treats human review texts as the primary reference object and asks whether models can approach human-like review behavior. Another treats human reviews, official guidelines, and paper content as ingredients for constructing paper-specific evaluation rubrics, then asks whether models produce accurate, evidence-based, and actionable critique. Taken together, these benchmark formulations have made peer-review evaluation more structured, more diagnostic, and more explicit about what exactly is being measured when an LLM is said to act as a reviewer.


References

Gao et al. (11 Mar 2025). ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews.

Li et al. (15 Apr 2026). ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents.
