MMReview: Multimodal Peer Review Benchmark

Updated 5 April 2026

MMReview is a multidisciplinary benchmark that rigorously evaluates automated peer review systems by integrating text, figures, and tables from diverse scientific fields.
It simulates the full scholarly review process with specific tasks such as step-wise reasoning, meta decision making, and adversarial robustness assessment.
Empirical results show that model scale, chain-of-thought prompting, and multimodal inputs significantly improve review quality and alignment with expert human judgments.

MMReview is a multidisciplinary, multimodal benchmark specifically designed to evaluate and advance large language models (LLMs) and multimodal LLMs (MLLMs) for automated academic peer review tasks. It was developed in response to significant limitations in prior datasets and evaluation schemes, particularly a lack of modality coverage beyond text and inadequate simulation of the end-to-end scholarly review pipeline. MMReview provides a comprehensive testbed incorporating diverse disciplines, multimodal manuscript content, and expert-written reviews, and establishes rigorous evaluation of step-wise reasoning, human alignment, and robustness to adversarial manipulation (Gao et al., 19 Aug 2025).

1. Motivation and Benchmark Scope

MMReview addresses a critical need in the scholarly review community, where rapid growth in academic publications has rendered peer review a major bottleneck due to reviewer workload and the complexity of evaluating rich multimodal content (e.g., figures, tables). Prior LLM-based review studies were limited to AI or computer science domains, focused solely on textual inputs, and employed ad-hoc or narrowly tailored metrics. MMReview is designed to (a) expand coverage across scientific fields, (b) introduce all manuscript modalities commonly found in research papers, and (c) provide a unified task and evaluation suite that probes both outputs and reasoning processes in peer review automation (Gao et al., 19 Aug 2025).

The benchmark encompasses four major academic disciplines—Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences—spanning 17 research subfields. It targets the evaluation of both monomodal LLMs and MLLMs capable of processing combinations of text, figures, tables, and full PDF page images.

2. Dataset Composition and Structure

MMReview consists of 240 curated samples selected from an initial pool of over 50,000 papers with open peer reviews (sources include OpenReview, NeurIPS, and Nature Communications). Each sample includes:

Textual content: Abstract and key main-body excerpts.
Multimodal elements: Cropped figures, diagrams, data tables, and full PDF page images.
Human reviews: Expert-written pros/cons, integer scores for soundness and presentation (scales: 1–4), and final binary accept/reject decisions.

The disciplinary composition is summarized below:

Discipline	No. of Papers	Subfields Included
Artificial Intelligence	115	ML, Vision, NLP, RL, GNNs, Signal Proc., AI Apps
Natural Sciences	63	Biology/Medicine, Physics, Chemistry, Earth/Env, Math/Stat
Engineering Sciences	38	Materials, Control, Electronic Info, Energy
Social Sciences	24	Society, Economics, Finance

This structure enables evaluation of models in realistic, diverse peer-review scenarios.
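As a rough illustration of how one sample could be held in memory, the sketch below models the three components listed above. The class and field names (`HumanReview`, `ReviewSample`, `figure_paths`, and so on) are assumptions made for this example and are not taken from the released data format; only the 1–4 score range and the per-discipline paper counts come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HumanReview:
    """One expert review (hypothetical layout, not the official schema)."""
    strengths: List[str]
    weaknesses: List[str]
    soundness: int      # integer score on the 1-4 scale
    presentation: int   # integer score on the 1-4 scale

    def __post_init__(self) -> None:
        # Reject scores outside the 1-4 range described in the text.
        for name in ("soundness", "presentation"):
            value = getattr(self, name)
            if not 1 <= value <= 4:
                raise ValueError(f"{name} must be in 1..4, got {value}")


@dataclass
class ReviewSample:
    """One benchmark sample: text, multimodal elements, human reviews."""
    paper_id: str
    discipline: str
    abstract: str
    body_excerpts: List[str]
    figure_paths: List[str]      # cropped figures and diagrams
    table_paths: List[str]       # cropped data tables
    page_image_paths: List[str]  # full PDF page images
    reviews: List[HumanReview]
    accepted: bool               # final binary accept/reject decision


# Paper counts per discipline, copied from the table above.
PAPERS_PER_DISCIPLINE = {
    "Artificial Intelligence": 115,
    "Natural Sciences": 63,
    "Engineering Sciences": 38,
    "Social Sciences": 24,
}

# The four counts add up to the 240 curated samples.
total_papers = sum(PAPERS_PER_DISCIPLINE.values())
```

Validating the score range at construction time is a design choice for this sketch; a loader for the real data might instead log and skip malformed records.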

3. Task Suite and Categories

MMReview defines 13 core tasks organized into four thematic categories, designed to simulate the stages of human peer review:

a) Step-wise Review Generation

Summary (S): Manuscript summarization.
Strengths Evaluation (SE) / Weaknesses Evaluation (WE): Listing merits and limitations along Quality, Clarity, Significance, Originality.
Soundness/Presentation Scoring (SS/PS): Integer ratings (1–4).

b) Outcome Formulation

Conditional Decision (CD): Predicting a 1–10 quality score given human-written pros/cons.
Direct Decision (DD): Scoring from scratch.
Chain-of-Thought Decision (CoD): Stepwise reasoning—summary, pros/cons, SS, PS, then score.
Meta Decision (MD): Simulating an area chair’s accept/reject verdict by synthesizing multiple reviews.

c) Alignment with Human Preferences

Pairwise Rank (PR): Selecting which of two papers deserves a higher acceptance tier, reflecting human-like ranking.

d) Robustness to Adversarial Manipulation

Fake Strengths (FS) / Fake Weaknesses (FW): Detecting polarity-flipped reviewer comments.
Prompt Injection (PI): Measuring susceptibility to hidden instructions that can bias score outputs.

All tasks are carefully mapped to real peer review workflow steps, from summary and scoring to meta-level synthesis and adversarial response.
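The task taxonomy above can be captured in a small lookup table, which is convenient when grouping results by category. The sketch below is only an illustration: the `Category` enum, the `TASKS` dictionary, and the `tasks_in` helper are hypothetical names, while the abbreviations and category titles are those listed above.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Category(Enum):
    STEPWISE = "Step-wise Review Generation"
    OUTCOME = "Outcome Formulation"
    ALIGNMENT = "Alignment with Human Preferences"
    ROBUSTNESS = "Robustness to Adversarial Manipulation"


# Abbreviation -> (task name, category), following the listing above.
TASKS: Dict[str, Tuple[str, Category]] = {
    "S": ("Summary", Category.STEPWISE),
    "SE": ("Strengths Evaluation", Category.STEPWISE),
    "WE": ("Weaknesses Evaluation", Category.STEPWISE),
    "SS": ("Soundness Scoring", Category.STEPWISE),
    "PS": ("Presentation Scoring", Category.STEPWISE),
    "CD": ("Conditional Decision", Category.OUTCOME),
    "DD": ("Direct Decision", Category.OUTCOME),
    "CoD": ("Chain-of-Thought Decision", Category.OUTCOME),
    "MD": ("Meta Decision", Category.OUTCOME),
    "PR": ("Pairwise Rank", Category.ALIGNMENT),
    "FS": ("Fake Strengths", Category.ROBUSTNESS),
    "FW": ("Fake Weaknesses", Category.ROBUSTNESS),
    "PI": ("Prompt Injection", Category.ROBUSTNESS),
}


def tasks_in(category: Category) -> List[str]:
    """Return the task abbreviations that belong to one category."""
    return [abbr for abbr, (_, cat) in TASKS.items() if cat is category]
```

Counting the entries confirms the 13 tasks stated above: five step-wise, four outcome, one alignment, and three robustness tasks.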

4. Evaluation Metrics and Methodology

MMReview employs a mixture of automatic and human-centered metrics, with a focus on both outcome accuracy and alignment with expert human judgments. Key metrics include:

Text generation similarity (S, SE, WE): BARTScore, “LLM-as-Judge” human-likeness scoring.
Classification accuracy (MD, PR): e.g., area chair decisions, pairwise rankings.
Score deviation (SS, PS, CD, DD, CoD, FS, FW, PI): Mean Absolute Error (MAE).
Standard Generation and Classification Metrics:
- BLEU
- ROUGE-L ( $\text{ROUGE-L} = \frac{(1+\beta^2)\,R\,P}{R+\beta^2\,P}$ , with R = recall, P = precision of the LCS)
- F1-score ( $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}$ )
Human-alignment: Likert-scale matching by held-out LLM judge.
Adversarial robustness: Changes in MAE and score deviation under polarity-flip or prompt-injection attacks.

Multimodal input (text + figures/tables or PDF page images) is specifically tested for performance and robustness, and benchmark scripts are provided for systematic evaluation (Gao et al., 19 Aug 2025).
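To make the score-deviation and overlap metrics concrete, here is a minimal pure-Python sketch of MAE, classification accuracy, and the ROUGE-L F-measure as defined by the formula above. It is not the benchmark's own evaluation script: the function names are chosen for this example, tokenization is plain whitespace splitting, and `beta` defaults to 1 as an assumption.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Average absolute gap between predicted and human scores."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def accuracy(pred: Sequence[object], gold: Sequence[object]) -> float:
    """Fraction of exact matches, e.g. for accept/reject decisions."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(pred)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure: (1 + beta^2) * R * P / (R + beta^2 * P)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # LCS length over candidate length
    recall = lcs / len(ref)      # LCS length over reference length
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
```

For example, `mean_absolute_error([3, 2, 4], [4, 2, 2])` evaluates to 1.0, since the absolute gaps 1, 0, and 2 average to 1.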

5. Experimental Results and Empirical Insights

Benchmarking on MMReview covered 16 open-source and 5 closed-source LLM/MLLMs, including both small (<7B parameters) and large (GPT-4o, Claude-4, Gemini-2.5) models. Empirical findings include:

Scale effects: Larger (often closed-source) models achieve the highest accuracy on high-level decisions (MD up to ≈85%, PR above 75%).
Mid-size model advantage: On fine-grained rating tasks (SS, PS), mid-sized LLMs sometimes outperform the largest models (lower MAE), indicating diminishing returns for micro-rating accuracy at extreme scale.
Chain-of-Thought benefit: Chain-of-Thought (CoD) decision prompts consistently reduce MAE compared to direct (DD) scoring, evidencing superior alignment with human review processes.
Multimodal robustness: Including visual modalities reduces vulnerability to prompt injection attacks: forced score increases drop from ≈1.2 (text-only) to ≈0.7 (multimodal).
Ongoing limitations: All models struggle with factual consistency, hallucinations, and susceptibility to adversarial polarity flips, revealing limitations in current LLM/MLLM peer review alignment.
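The "forced score increase" reported for prompt injection can be read as the average difference between the score a model gives an attacked manuscript and the score it gives the clean one. The sketch below shows one way to compute such a shift; the function name and the toy scores are illustrative assumptions, not values or code from the benchmark.

```python
from typing import Sequence


def mean_score_shift(clean: Sequence[float], attacked: Sequence[float]) -> float:
    """Mean of (attacked - clean); positive means scores were pushed up."""
    if len(clean) != len(attacked) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a - c for c, a in zip(clean, attacked)) / len(clean)


# Toy scores on a 1-10 scale, invented purely for illustration.
clean_scores = [5, 6, 4]
injected_scores = [6, 7, 6]
shift = mean_score_shift(clean_scores, injected_scores)
```

A smaller positive shift for the same set of papers would indicate that a given input configuration is less susceptible to the injected instruction.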

6. Limitations and Research Directions

The MMReview benchmark, while broader in disciplinary and modality coverage than the prior text-only datasets it builds on, has explicit limitations:

Domain gaps: Coverage of Humanities and biological fields outside Nature Communications is limited.
Modality scope: Presently focused on text, figures, tables, and static PDF; does not incorporate code snippets or video supplements.
Length bias: Tendency for models to reward longer submissions; absence of length-normalized evaluation.
Adversarial vulnerabilities: Despite improvements, models remain sensitive to prompt injection and data-poisoning attacks.

Suggested extensions include expanding to under-represented disciplines, incorporating new modalities, constructing length-normalized scoring protocols, and developing more sophisticated adversarial detection schemes.

7. Significance and Impact

MMReview establishes a critical, standardized foundation for evaluating and comparing LLM/MLLM-based automated peer review systems beyond purely textual or single-stage evaluation (Gao et al., 19 Aug 2025). It does so by:

Enabling systematic comparison across academic disciplines, modalities, and review tasks.
Simulating a realistic end-to-end peer review process with stepwise and meta evaluation.
Requiring robust model behavior in the presence of adversarial manipulation and multimodal complexity.

A plausible implication is that MMReview will accelerate both empirical benchmarking and methodological innovation in automated scholarly peer review, moving the field toward systems that are more transparent, reliable, and aligned with expert human judgment. All data, tasks, prompts, and scripts are openly available, facilitating reproducibility and community advancement.


References

Gao et al. (19 Aug 2025). MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation.
