DeepReview-13K: Structured Review Dataset

Updated 16 April 2026

DeepReview-13K is a curated benchmark comprising 13,378 paper-review instances with detailed reviewer annotations and explicit chain-of-thought reasoning steps.
The dataset integrates full paper texts, multi-stage review data including reviewer scores, author rebuttals, and meta-reviews, offering deep insights into the peer review process.
It employs an automated pipeline with robust quality control to ensure logical consistency and comprehensive evaluation of scientific papers for training LLMs.

DeepReview-13K is a curated benchmark for multi-stage, structured scientific paper reviewing, pairing complete academic manuscripts with exhaustive review information, fine-grained reviewer annotations, and explicit chain-of-thought reasoning traces. This dataset consists of 13,378 samples—each capturing an ICLR 2024 or 2025 submission with its full structured review data—intended to train and evaluate LLMs in expert-level, evidence-based paper assessment and meta-review (Zhu et al., 11 Mar 2025).

1. Dataset Composition and Scope

DeepReview-13K comprises 13,378 unique paper–review instances. Each sample includes the following elements:

Full paper text in parseable Markdown (from arXiv source or PDF converted with MinerU)
A review set $R$, an array of reviewer entries, each containing:
- Strengths, weaknesses, and questions (free text)
- Author rebuttal excerpts (dialogue text)
- Three fine-grained scores per reviewer: soundness $\in \{1,2,3,4\}$ , presentation $\in \{1,2,3,4\}$ , contribution $\in \{1,2,3,4\}$
- Overall rating $\in [1, 10]$
- Accept/reject outcome (binary)
Meta-review and final meta-rating ( $\in [1,10]$ ), meta-decision (accept/reject)
Three explicit “thinking” steps (intermediate rationales):
- $z_1$ : Novelty verification (summary of literature comparison)
- $z_2$ : Multi-dimension review (reconstruction integrating reviewer commentary and rebuttal)
- $z_3$ : Reliability verification (validation against the manuscript, with confidence attribution)

This structure is designed to provide a comprehensive, end-to-end view of the peer review decision process, including both human and model-synthesized rationale (Zhu et al., 11 Mar 2025). The average combined length per sample (paper plus review) is approximately 10,178 tokens.

2. Collection and Quality Control Pipeline

Samples are built by cross-linking ICLR 2024 and 2025 submission records from two sources: manuscripts come from arXiv PDFs or LaTeX sources (converted to Markdown), while reviews, rebuttals, and meta-reviews come from OpenReview. A paper is included only if all of the following are present:

Parsed Markdown manuscript
At least one full reviewer entry (with required categories and scores)
Author rebuttal text
Published meta-review and final decision

For each included paper, automated agents (LLMs) generate $z_1$, $z_2$, $z_3$, and a finalized meta-review using explicit system prompts. After generation, all samples undergo automated quality control (QC) with Qwen-2.5-72B-Instruct:

Logical Consistency: Sanity-checking for contradictions across the $z_1$–$z_3$ chain and final scores.
Completeness: Filtering out missing fields or incomplete entries.

Samples failing these gates are excluded, ensuring high internal dataset quality (Zhu et al., 11 Mar 2025).
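The completeness gate can be approximated with a simple rule-based check. The following is a minimal sketch only: the actual pipeline uses Qwen-2.5-72B-Instruct for QC, and this stand-in covers presence of fields, not logical consistency. The top-level field names follow the JSON layout described for the archive; the per-review keys (`strengths`, `rating`, `rebuttal`, and so on) and the function name `is_complete` are assumptions made for illustration.

```python
from typing import Any

# Top-level keys taken from the described JSON layout.
REQUIRED_SAMPLE_FIELDS = ("paper_text", "reviews", "meta_review", "meta_decision")
# Per-review keys are hypothetical; the source lists the categories, not the key names.
REQUIRED_REVIEW_FIELDS = (
    "strengths", "weaknesses", "questions",
    "soundness", "presentation", "contribution", "rating",
)


def _filled(value: Any) -> bool:
    """Treat None, empty strings, and empty lists as missing."""
    return value is not None and value != "" and value != []


def is_complete(sample: dict[str, Any]) -> bool:
    """Return True if the sample meets the stated inclusion criteria."""
    if not all(_filled(sample.get(name)) for name in REQUIRED_SAMPLE_FIELDS):
        return False
    reviews = sample["reviews"]
    if not isinstance(reviews, list):
        return False
    # At least one reviewer entry with every required category and score.
    has_full_review = any(
        all(_filled(review.get(key)) for key in REQUIRED_REVIEW_FIELDS)
        for review in reviews
    )
    # Author rebuttal text must be present somewhere in the review set.
    has_rebuttal = any(_filled(review.get("rebuttal")) for review in reviews)
    return has_full_review and has_rebuttal
```

A filter such as `[s for s in samples if is_complete(s)]` would then drop incomplete entries before any model-based consistency check is run.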

3. Annotation Schema and Rationale Modeling

Each review in DeepReview-13K is decomposed into the following annotation fields:

Strengths, Weaknesses, Questions: Reviewer-written free-form text.
Soundness, Presentation, Contribution: Integer ratings, each $\in \{1,2,3,4\}$.
Overall Rating: $\in [1, 10]$ (integer or floating-point).
Accept/Reject: Final recommendation per review.
Author Rebuttal: Excerpted dialogue.
Meta-review, Meta-rating, Meta-decision: Synthesis by area chairs or model agent.
Chain-of-Thought (z₁, z₂, z₃): Explicit fields capturing structured intermediate reasoning:
- $z_1$ (Novelty Verification): Systematic literature retrieval and originality analysis.
- $z_2$ (Multi-dimension Review): Professional reformulation and enrichment, integrating reviewer/author dialogue, adhering to requirements for depth, actionable feedback, and technical rigor.
- $z_3$ (Reliability Verification): Stepwise evidence validation for each reviewer point, referring back to the manuscript and scoring comment strength.
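The score ranges in this schema lend themselves to a quick validity check. The sketch below assumes hypothetical per-review key names (`soundness`, `presentation`, `contribution`, `rating`); the source specifies the ranges but not the exact keys.

```python
def validate_review_scores(review: dict) -> list[str]:
    """Return a list of human-readable range violations (empty if valid)."""
    errors = []
    # Fine-grained scores are integers in {1, 2, 3, 4}.
    for key in ("soundness", "presentation", "contribution"):
        value = review.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 4:
            errors.append(f"{key} must be an integer in 1..4, got {value!r}")
    # The overall rating may be an integer or a float in [1, 10].
    rating = review.get("rating")
    if (
        isinstance(rating, bool)
        or not isinstance(rating, (int, float))
        or not 1 <= rating <= 10
    ):
        errors.append(f"rating must be a number in [1, 10], got {rating!r}")
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to report every violation in a reviewer entry at once.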

The review process is formalized as follows:

$p(a \mid q) \propto \int p(a \mid z_{1:3}, q) \prod_{t=1}^{3} p(z_t \mid z_{<t}, q) \, dZ$

where $q$ is the paper, $a$ the review outcome, and $z_{1:3} = (z_1, z_2, z_3)$ the rationale chain (Zhu et al., 11 Mar 2025).
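One way to read this formalization is as a sequential procedure: each rationale step is produced from the paper plus all earlier steps, and the outcome is produced last. The sketch below illustrates that ordering only and is not the authors' implementation. The function `deep_review`, the stage names, and the `generate(stage, context)` callable are all hypothetical; in practice `generate` would wrap an LLM call.

```python
from typing import Callable

# Stage labels mirror the three thinking steps; the strings themselves are assumptions.
STAGES = ("novelty_verification", "multi_dimension_review", "reliability_verification")


def deep_review(paper: str, generate: Callable[[str, str], str]) -> dict[str, str]:
    """Generate z1, z2, z3 in order, then the final outcome.

    Each call sees the paper followed by every earlier stage output.
    """
    context = paper
    trace: dict[str, str] = {}
    for stage in STAGES:
        trace[stage] = generate(stage, context)
        # Later stages are conditioned on everything produced so far.
        context += "\n\n" + trace[stage]
    trace["outcome"] = generate("final_review", context)
    return trace
```

Passing a stub such as `lambda stage, ctx: f"<{stage}>"` is enough to confirm that the conditioning order is what one expects before plugging in a real model.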

4. Structural Format and Dataset Access

DeepReview-13K is distributed as a line-delimited JSON archive. Each object includes:

"id" (string): Unique identifier
"paper_text" (string)
"reviews" (array of objects; each as above)
"meta_review" (string), "meta_rating" (integer), "meta_decision" (string)
"z1_novelty_verification", "z2_multi_dimension_review", "z3_reliability_verification" (strings)
"final_summary" (synthesized review; string)
"final_scores" (object; keys for each fine-grained score and "overall")
"final_decision" (string)
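Because the archive is line-delimited JSON, it can be streamed one record at a time rather than loaded whole. A minimal sketch using only the standard library follows; the function name `iter_samples` and the file name in the usage note are assumptions, since the source does not name the distributed file.

```python
import json
from typing import Iterator


def iter_samples(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so corrupt records are easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

For example, `for sample in iter_samples("deepreview_13k.jsonl"): print(sample["id"])` would print each identifier, where the file name is a placeholder. Streaming matters here because each record carries a full paper text of roughly ten thousand tokens.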

All code, pre-trained model checkpoints, and the complete DeepReview-13K set are open access at http://ai-researcher.net. Licensing adheres to originating sources: OpenReview under CC BY 4.0, arXiv texts under their respective Creative Commons (BY, BY-SA, or CC0) terms (Zhu et al., 11 Mar 2025).

5. Statistical Properties and Evaluation Splits

The dataset is split as follows:

Subset	Samples	Notes
Train (ICLR 2024, 2025)	13,378	4,131 (2024), 9,247 (2025)
Test (DeepReview-Bench)	1,286	652 (2024), 634 (2025); ~10% holdout

Key statistics:

Mean paper+review token count: $\approx$ 10,178
Label accept rate: 33.24%
Mean overall rating: 5.18
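Statistics of this kind can be recomputed from the records themselves. The sketch below is illustrative and rests on assumptions: it reads the label from `final_decision` and the rating from `final_scores["overall"]`, and it treats any decision string beginning with "accept" (case-insensitive) as an acceptance. The source does not state which fields or string values underlie the reported figures.

```python
from statistics import mean
from typing import Iterable


def summarize(samples: Iterable[dict]) -> dict[str, float]:
    """Compute accept rate (in percent) and mean overall rating."""
    records = list(samples)
    if not records:
        raise ValueError("no samples to summarize")
    # Assumption: accepted papers carry a decision string starting with "accept".
    accepted = sum(
        1
        for record in records
        if str(record["final_decision"]).strip().lower().startswith("accept")
    )
    ratings = [float(record["final_scores"]["overall"]) for record in records]
    return {
        "accept_rate_percent": 100.0 * accepted / len(records),
        "mean_overall_rating": mean(ratings),
    }
```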

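Reported benchmark metrics include mean squared error:

$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

and mean absolute error:

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|$

Here $y_i$ is the reference rating for paper $i$, $\hat{y}_i$ the predicted rating, and $N$ the number of evaluated papers. Accept rate is reported alongside these error metrics (Zhu et al., 11 Mar 2025).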
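Both error metrics are straightforward to compute for rating prediction. The sketch below assumes the standard definitions of mean squared error and mean absolute error over paired predicted and reference ratings; the function names are illustrative and not taken from the released code.

```python
from typing import Sequence


def _check(predicted: Sequence[float], reference: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before averaging."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one rating pair is required")


def mean_squared_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of squared differences between predicted and reference ratings."""
    _check(predicted, reference)
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def mean_absolute_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of absolute differences between predicted and reference ratings."""
    _check(predicted, reference)
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)
```

For predictions `[6, 4, 8]` against references `[5, 5, 5]`, the differences are 1, -1, and 3, so the squared error averages to 11/3 and the absolute error to 5/3. The squared form penalizes the single large miss much more heavily, which is why the two are usually reported together.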

6. Prompts, Agent Pipeline, and Quality Control

All $z_1$–$z_3$ and final summary fields are generated via LLM agents using carefully engineered prompts:

Prompt 1: Multi-dimension review synthesis—incorporates author rebuttals, preserves technical references, and converts criticisms into actionable suggestions.
Prompt 2: Novelty assessment—asset-based literature retrieval, explicit distinction from prior work, expansion of claimed contributions.
Prompt 3: Reliability verification—links comments directly to manuscript evidence (experimental, theoretical), assigns confidence levels.

No manual inter-annotator agreement is reported; pipeline quality is ensured through systematic LLM-based sanity-checking for logical consistency and completeness using Qwen-2.5-72B-Instruct.

7. Benchmarks, Use Cases, and Responsible Use

DeepReview-13K underpins the training of DeepReviewer-14B and the DeepReview-Bench suite for end-to-end reviewing and rating prediction. Evaluated tasks include:

Rating regression (overall, category-wise)
Accept/reject decision modeling
Chain-of-thought generation for reviewer emulation
Full review and meta-review drafting

Use of DeepReview-13K for automated peer reviewing is intended to support (not replace) human judgment, with recommendations for human-in-the-loop assessment and adherence to ethical guidelines outlined in the associated appendix (Zhu et al., 11 Mar 2025).

References:

"DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process" (Zhu et al., 11 Mar 2025)
