DeepReviewer-14B: LLM Academic Peer Review
- The paper introduces a structured, multi-stage review system that emulates expert evaluation via novelty verification, multi-dimension critique, and reliability checks.
- It fine-tunes a 14B-parameter transformer on DeepReview-13K, achieving state-of-the-art metrics in decision accuracy and Spearman correlation compared to baselines.
- The model leverages innovative LongRoPE embeddings for extended context handling, enabling robust review generation for long scientific manuscripts.
DeepReviewer-14B is an LLM-based academic paper review system developed as part of the DeepReview project to address limitations in automated scientific research assessment. Designed to emulate expert reviewers systematically, DeepReviewer-14B integrates structured evaluative stages and evidence-based reasoning, and reports new state-of-the-art (SOTA) results for LLM-driven peer review (Zhu et al., 11 Mar 2025).
1. Model Architecture
DeepReviewer-14B is derived by full-parameter supervised fine-tuning of Phi-4 14B, an open-source transformer LLM. The model preserves the original Phi-4 architectural specifications:
- Parameter count: 14 billion (B)
- Transformer depth: 40 layers
- Hidden dimension: 5120
- Attention heads: 40
To accommodate extended input sequences from long manuscripts and complex reasoning traces, the rotary positional embeddings are extended with LongRoPE, yielding a context window of up to 256k tokens at inference and 40k tokens during training. No adapters, LoRA, or external modules are introduced; all backbone parameters are trained end-to-end using DeepSpeed with ZeRO-3 for memory-efficient training (Zhu et al., 11 Mar 2025).
2. Data and Fine-tuning Regimen
The model is fine-tuned on DeepReview-13K, a curated dataset comprising structured review traces collected from ICLR 2024/2025 OpenReview and arXiv submissions. The dataset includes:
- Total submissions: 18,976 (filtered to 13,378 valid samples)
- Annotations per sample:
  - Reviewer comments (strengths, weaknesses, questions)
  - Rebuttal dialogues
  - Fine-grained scores: overall (1–10), soundness/presentation/contribution (1–4 each)
  - Meta-review text, final recommendation, binary (accept/reject) decision
- Split: 90% training (≈12,091), 10% test (1,286, DeepReview-Bench)
- Data conversion: OpenReview and arXiv PDF-to-Markdown via MinerU
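The annotation fields and the 90/10 split listed above can be pictured with a small sketch. The class name `ReviewRecord`, its field names, and the helper `split_train_test` are hypothetical and chosen for illustration only; the released dataset may use a different schema, store scores per reviewer rather than per sample, and use a different split procedure.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ReviewRecord:
    """One sample; field names are illustrative, not the released schema."""
    paper_markdown: str
    strengths: List[str]
    weaknesses: List[str]
    questions: List[str]
    rebuttal: List[str]
    overall: int        # documented range: 1-10
    soundness: int      # documented range: 1-4
    presentation: int   # documented range: 1-4
    contribution: int   # documented range: 1-4
    meta_review: str
    accept: bool

    def validate(self) -> None:
        """Raise ValueError if any score falls outside its documented range."""
        if not 1 <= self.overall <= 10:
            raise ValueError(f"overall score out of range: {self.overall}")
        for name in ("soundness", "presentation", "contribution"):
            value = getattr(self, name)
            if not 1 <= value <= 4:
                raise ValueError(f"{name} score out of range: {value}")


def split_train_test(
    records: Sequence[ReviewRecord], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[ReviewRecord], List[ReviewRecord]]:
    """Shuffle with a fixed seed and hold out `test_fraction` of the records."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Calling `validate()` on each record before training is a cheap way to catch scores that were parsed from the wrong field during PDF-to-Markdown conversion.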
The supervised objective is a next-token cross-entropy loss over concatenated review chains $(z_1, z_2, z_3, s, a)$, where $z_1$ is novelty verification, $z_2$ is multi-dimension review, $z_3$ is reliability verification, $s$ is the meta-review text, and $a$ is the accept/reject decision. Fine-tuning runs for 23,500 steps with batch size 16, a constant learning rate, the AdamW optimizer, and randomly truncated 40k-token context windows (Zhu et al., 11 Mar 2025).
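As a rough illustration of the two ingredients named above, the sketch below shows one way to truncate a token sequence to a random window and to compute a next-token cross-entropy from per-token probabilities. Reading "random-truncated" as "a contiguous window at a random offset" is an assumption, and the function names are hypothetical; the actual training code may truncate differently.

```python
import math
import random
from typing import List, Sequence


def random_truncate(tokens: Sequence[int], max_len: int, rng: random.Random) -> List[int]:
    """Return a contiguous window of at most `max_len` tokens.

    Assumption: the window start is drawn uniformly at random.
    """
    if len(tokens) <= max_len:
        return list(tokens)
    start = rng.randint(0, len(tokens) - max_len)  # randint bounds are inclusive
    return list(tokens[start:start + max_len])


def next_token_cross_entropy(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood.

    `target_probs[i]` is the probability the model assigned to the i-th
    ground-truth token given all preceding tokens.
    """
    if not target_probs:
        raise ValueError("need at least one target token")
    return -sum(math.log(p) for p in target_probs) / len(target_probs)
```

For example, `random_truncate(list(range(100)), 40, random.Random(0))` returns 40 consecutive token ids, and a model that assigns probability 0.5 to every target token has a cross-entropy of `math.log(2)`.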
3. Multi-Stage Review Framework
DeepReviewer-14B operationalizes academic peer review via three explicit, sequentially chained sub-tasks:
3.1. Novelty Verification ($z_1$)
- Goal: Verify research originality through retrieval of semantically similar prior work.
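- Process:
  - Semantic retrieval via Semantic Scholar/OpenScholar.
  - Embedding: the query and each candidate paper are mapped to embedding vectors.
  - Similarity: each candidate is scored by the similarity between its embedding and the query embedding.
  - Retrieval probability: the similarity scores are converted into a retrieval probability for each candidate.
- Output: Top-$k$ candidate papers and a textual novelty assessment $z_1$.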
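The retrieval step can be sketched as embed, score, normalize, and rank. The exact formulas are not recoverable from this text, so the sketch below assumes cosine similarity for scoring and a softmax for the retrieval probability; both choices, and the function names, are illustrative assumptions and not necessarily what the paper uses.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: Sequence[float], papers: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Rank candidate papers by softmax-normalized cosine similarity (assumed scoring)."""
    if not papers:
        return []
    sims = {pid: cosine(query, vec) for pid, vec in papers.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {pid: math.exp(s - peak) for pid, s in sims.items()}
    total = sum(exps.values())
    probs = {pid: e / total for pid, e in exps.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

In a real pipeline the embeddings would come from the retrieval service or an embedding model; here they are plain lists of floats so the ranking logic can be read on its own.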
3.2. Multi-Dimension Review ($z_2$)
- Input: Review set $R$ and associated rebuttals.
- Action: A model (Qwen-2.5-72B-Instruct) reconstructs actionable suggestions and synthesizes multi-axis critiques (soundness, novelty, clarity, ethics) into $z_2$.
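3.3. Reliability Verification ($z_3$)
- Self-reflection: Four-stage verification (methodology, experiment, consistency, overall) using Gemini-2.0-Flash-Thinking.
- Confidence assignment: Each comment in the multi-dimension review $z_2$ is evaluated for evidentiary support and assigned a scalar confidence $c_i$.
- Meta-review fusion: The Qwen model integrates the novelty assessment $z_1$, the multi-dimension review $z_2$, the reliability verification $z_3$, and the original meta-review to yield the final meta-review $s$.
- Implicit evidence fusion: Weighted sum of comment embeddings:
$$\mathbf{h} = \sum_i w_i \, \mathbf{e}_i$$
where $\mathbf{e}_i$ is the embedding of comment $i$ and $w_i$ is its weight.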
This explicit structuring is reflected in both the fine-tuning supervision and the inference pipeline (Zhu et al., 11 Mar 2025).
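The weighted sum of comment embeddings described for reliability verification can be sketched as follows. Using the per-comment confidence scores as the weights, and normalizing them to sum to one, are assumptions made for illustration; the function name is hypothetical.

```python
from typing import List, Sequence


def fuse_comment_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    normalize: bool = True,
) -> List[float]:
    """Return the weighted sum of comment embeddings.

    Assumption: `weights` are the per-comment confidence scores. With
    `normalize=True` they are rescaled to sum to 1 before fusion.
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding and at least one embedding")
    scaled = list(weights)
    if normalize:
        total = sum(scaled)
        if total <= 0:
            raise ValueError("weights must have a positive sum to be normalized")
        scaled = [w / total for w in scaled]
    fused = [0.0] * len(embeddings[0])
    for vector, weight in zip(embeddings, scaled):
        for index, value in enumerate(vector):
            fused[index] += weight * value
    return fused
```

With this weighting, a comment with three times the confidence of another contributes three times as much to the fused vector, so poorly supported comments have less influence on the summary.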
4. Objectives and Optimization
The loss combines autoregressive cross-entropy terms for each sub-stage:
$$\mathcal{L} = \sum_{k} \lambda_k \, \mathcal{L}_{\mathrm{CE}}^{(k)}$$
where $k$ indexes the sub-stages and $\lambda_k$ is the weight of stage $k$, with all $\lambda_k = 1$ in practice. This structure enforces sequential, evidence-based, and multidimensional reasoning throughout the review generation process. Optimization relies on DeepSpeed with ZeRO-3 and the hyperparameters given above for efficient large-scale training (Zhu et al., 11 Mar 2025).
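A small numerical sketch makes the role of the stage weights concrete. It assumes each stage loss is the summed (not averaged) negative log-likelihood of that stage's tokens; under that assumption, unit weights make the combined loss equal to the negative log-likelihood of the whole concatenated chain. The stage names and function names are illustrative.

```python
import math
from typing import Dict, Optional, Sequence


def stage_nll(target_probs: Sequence[float]) -> float:
    """Summed negative log-likelihood of one stage's target tokens."""
    return -sum(math.log(p) for p in target_probs)


def combined_loss(
    stage_probs: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-stage losses; weights default to 1 for every stage."""
    if weights is None:
        weights = {name: 1.0 for name in stage_probs}
    return sum(weights[name] * stage_nll(probs) for name, probs in stage_probs.items())


# Hypothetical per-token probabilities for three stages of one review chain.
stages = {
    "novelty": [0.5, 0.25],
    "multi_dimension": [0.5],
    "reliability": [0.125],
}
concatenated = [p for probs in stages.values() for p in probs]
```

Here `combined_loss(stages)` and `stage_nll(concatenated)` give the same value, which is why unit weights are equivalent to ordinary next-token training on the full chain. Non-unit weights would emphasize some stages over others.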
5. Evaluation and Empirical Performance
Evaluation is conducted on DeepReview-Bench (1,286 samples) across four principal tasks:
| Metric | Value (DeepReviewer-14B, ICLR 2024) | Notes |
|---|---|---|
| Rating MSE | 1.3137 | –44.8% vs. CycleReviewer-70B |
| Decision Accuracy | 0.6406 | +1.0 pt vs. baseline |
| Spearman correlation | 0.3559 | +6.04% vs. CycleReviewer-70B |
| Pairwise Accuracy | 0.6242 | |
| Win Rate vs. GPT-o1 | 88.21% | LLM-as-judge assessment |
| Win Rate vs. DeepSeek-R1 | 80.20% | |
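For readers who want to see how the scalar metrics in the table are typically computed, here is a minimal standard-library sketch. The definitions are the conventional ones (Spearman as the Pearson correlation of average ranks, pairwise accuracy as the fraction of paper pairs ordered consistently with the ground truth, ignoring ground-truth ties); the benchmark's own evaluation scripts may differ in details such as tie handling.

```python
import math
from itertools import combinations
from typing import List, Sequence


def mse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error between predicted and reference ratings."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def accuracy(pred: Sequence[bool], true: Sequence[bool]) -> float:
    """Fraction of accept/reject decisions that match the reference."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def _average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(pred: Sequence[float], true: Sequence[float]) -> float:
    """Spearman correlation; returns 0.0 if either input is constant."""
    rp, rt = _average_ranks(pred), _average_ranks(true)
    mp, mt = sum(rp) / len(rp), sum(rt) / len(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    denom = math.sqrt(sum((a - mp) ** 2 for a in rp) * sum((b - mt) ** 2 for b in rt))
    return cov / denom if denom else 0.0


def pairwise_accuracy(pred: Sequence[float], true: Sequence[float]) -> float:
    """Fraction of pairs ranked in the same order as the reference."""
    correct = total = 0
    for i, j in combinations(range(len(true)), 2):
        if true[i] == true[j]:
            continue  # skip pairs with no reference ordering
        total += 1
        if (pred[i] - pred[j]) * (true[i] - true[j]) > 0:
            correct += 1
    return correct / total if total else 0.0
```

The rating MSE and decision accuracy measure per-paper agreement, while Spearman correlation and pairwise accuracy measure whether the model orders papers the way human reviewers do, which is why the table reports both kinds of metric.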
Further, in test-time scaling (Fast/Standard/Best), the model produces 3k/8k/14.5k output tokens, respectively, with corresponding Spearman correlation increasing from 0.326 to 0.355 (+8.97%). Even the fast mode (3k tokens) surpasses CycleReviewer (6k output tokens). Against adversarial attacks, DeepReviewer-14B's score shifts only by +0.31, compared to +4.26 for Gemini (Zhu et al., 11 Mar 2025).
6. Resources and Reproducibility
All resources are open-source and publicly released:
- Model weights: https://huggingface.co/WestlakeNLP/DeepReviewer-7B /14B
- Code, dataset, demonstration: http://ai-researcher.net
- Licensing: Models, code, and dataset are released under open usage policies; the dataset incorporates OpenReview content licensed under CC BY 4.0.
DeepReviewer-14B is reproducible in full according to the project’s releases, and serves as a new LLM-based benchmark for structured, high-fidelity scientific paper review (Zhu et al., 11 Mar 2025).