DeepReviewer-14B: LLM Academic Peer Review

Updated 13 May 2026

The paper introduces a structured, multi-stage review system that emulates expert evaluation via novelty verification, multi-dimension critique, and reliability checks.
It fine-tunes a 14B-parameter transformer on DeepReview-13K, achieving state-of-the-art metrics in decision accuracy and Spearman correlation compared to baselines.
The model leverages innovative LongRoPE embeddings for extended context handling, enabling robust review generation for long scientific manuscripts.

DeepReviewer-14B is an LLM-based academic paper review system developed as part of the DeepReview project to address limitations in automated scientific research assessment. Designed to systematically emulate expert reviewers, DeepReviewer-14B integrates structured evaluative stages and evidence-based reasoning, setting new state-of-the-art (SOTA) results for LLM-driven peer review (Zhu et al., 11 Mar 2025).

1. Model Architecture

DeepReviewer-14B is derived by full-parameter supervised fine-tuning of Phi-4 14B, an open-source transformer LLM. The model preserves the original Phi-4 architectural specifications:

Parameter count: 14 billion (B)
Transformer depth: 40 layers
Hidden dimension: 8192
Attention heads: 32

To accommodate extended input sequences from long manuscripts and complex reasoning traces, rotary positional embeddings are replaced with LongRoPE, yielding a context window of up to 256k tokens at inference and 40k tokens during training. No adapters, LoRA, or external modules are introduced; all backbone parameters are trained end-to-end with DeepSpeed and ZeRO-3 for computational efficiency (Zhu et al., 11 Mar 2025).

2. Data and Fine-tuning Regimen

The model is fine-tuned on DeepReview-13K, a curated dataset comprising structured review traces collected from ICLR 2024/2025 OpenReview and arXiv submissions. The dataset includes:

Total submissions: 18,976 (filtered to 13,378 valid samples)
Annotations per sample:
- Reviewer comments (strengths, weaknesses, queries)
- Rebuttal dialogues
- Fine-grained scores: overall (1–10), soundness/presentation/contribution (1–4 each)
- Meta-review text, final recommendation, binary (accept/reject) decision
Split: 90% training (≈12,091), 10% test (1,286, DeepReview-Bench)
Data conversion: OpenReview and arXiv PDF-to-Markdown via MinerU
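The annotation layout above can be pictured as one record per submission. The following is a minimal sketch only: the class name, field names, and the `validate` helper are hypothetical and are not taken from the released dataset schema; only the score ranges come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewSample:
    """One DeepReview-13K-style record (all field names are hypothetical)."""
    paper_markdown: str           # manuscript text after PDF-to-Markdown conversion
    reviewer_comments: List[str]  # strengths, weaknesses, queries
    rebuttals: List[str]          # rebuttal dialogue turns
    overall: float                # overall rating, 1-10
    soundness: float              # 1-4
    presentation: float           # 1-4
    contribution: float           # 1-4
    meta_review: str
    accept: bool                  # binary accept/reject decision

    def validate(self) -> None:
        """Raise ValueError if a score falls outside its stated range."""
        if not 1 <= self.overall <= 10:
            raise ValueError(f"overall out of range: {self.overall}")
        for name in ("soundness", "presentation", "contribution"):
            value = getattr(self, name)
            if not 1 <= value <= 4:
                raise ValueError(f"{name} out of range: {value}")
```

A loader could call `validate()` on each record while filtering raw submissions down to valid samples; whether the original pipeline filters this way is not stated in the source.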

The supervised objective uses next-token cross-entropy loss over concatenated review chains $(z_1 \rightarrow z_2 \rightarrow z_3 \rightarrow (s, a))$, where $z_1$ is novelty verification, $z_2$ is multi-dimension review, $z_3$ is reliability verification, $s$ is meta-review text, and $a$ is the accept/reject decision. Fine-tuning is conducted over 23,500 steps with batch size 16, constant learning rate $5 \times 10^{-6}$, AdamW optimizer, and random-truncated 40k-token context windows (Zhu et al., 11 Mar 2025).
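As an illustration of how such a training target might be assembled, the sketch below concatenates the stages in order and cuts a random contiguous window when a token sequence exceeds the 40k-token training context. The stage tags, function names, and the reading of "random-truncated" as a random contiguous window are assumptions, not details confirmed by the source.

```python
import random
from typing import List, Sequence

MAX_TRAIN_TOKENS = 40_000  # training context length stated in the text


def build_chain(z1: str, z2: str, z3: str, s: str, a: bool) -> str:
    """Concatenate the stages in the order z1 -> z2 -> z3 -> (s, a).

    The bracketed stage tags are hypothetical separators.
    """
    decision = "accept" if a else "reject"
    parts = [
        "[novelty]\n" + z1,
        "[review]\n" + z2,
        "[reliability]\n" + z3,
        "[meta-review]\n" + s,
        "[decision]\n" + decision,
    ]
    return "\n\n".join(parts)


def random_truncate(tokens: Sequence[int], max_len: int,
                    rng: random.Random) -> List[int]:
    """Return the sequence unchanged if it fits, else a random contiguous window."""
    if len(tokens) <= max_len:
        return list(tokens)
    start = rng.randint(0, len(tokens) - max_len)  # inclusive upper bound
    return list(tokens[start:start + max_len])
```

In use, the chain text would be tokenized with the model's tokenizer and the resulting ids passed through `random_truncate(ids, MAX_TRAIN_TOKENS, rng)` before batching.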

3. Multi-Stage Review Framework

DeepReviewer-14B operationalizes academic peer review via three explicit, sequentially chained sub-tasks:

3.1. Novelty Verification ($z_1$)

Goal: Verify research originality through retrieval of semantically similar prior work.
Process:
- Semantic retrieval via Semantic Scholar/OpenScholar.
- Query/paper embedding: $E(\cdot)$.
- Similarity: $\mathrm{sim}(q, d) = \langle E(q), E(d) \rangle / (\|E(q)\|\|E(d)\|)$.
- Retrieval probability: $P(d \mid q) = \exp(\mathrm{sim}(q, d)) / \sum_{d'} \exp(\mathrm{sim}(q, d'))$.
Output: Top-$k$ candidate papers and textual novelty assessment $z_1$.
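The similarity-and-ranking step above can be sketched in a few lines. This is a minimal illustration assuming embeddings are already available as plain float vectors; the function names and the dictionary-based corpus are hypothetical, and the real system retrieves through Semantic Scholar/OpenScholar rather than an in-memory index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """sim(q, d) = <E(q), E(d)> / (||E(q)|| * ||E(d)||); 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          corpus: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Rank candidate papers by cosine similarity to the query and keep the best k."""
    scored = [(paper_id, cosine_similarity(query, emb))
              for paper_id, emb in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

For example, `top_k([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, k=2)` ranks `"a"` first and `"c"` second, since `"b"` is orthogonal to the query.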

3.2. Multi-Dimension Review ($z_2$)

Input: Review set $R$ and associated rebuttals.
Action: Model (Qwen-2.5-72B-Instruct) reconstructs actionable suggestions and synthesizes multi-axis critiques (soundness, novelty, clarity, ethics) into $z_2$.

3.3. Reliability Verification ($z_3$)

Self-reflection: Four-stage verification (methodology, experiment, consistency, overall) using Gemini-2.0-Flash-Thinking.
Confidence assignment: Each comment $r_i$ in $z_2$ is evaluated for evidentiary support and assigned a scalar confidence $c_i$.
Meta-review Fusion: The Qwen model integrates $z_1$, $z_2$, $z_3$, and the original meta-review to yield $s$.
Implicit evidence fusion: Weighted sum of comment embeddings:

$e_{\mathrm{fused}} = \sum_i c_i \, E(r_i)$
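A confidence-weighted sum of comment embeddings can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the weights are used as given without normalization, and the source does not specify how embeddings or confidences are actually produced.

```python
from typing import List, Sequence


def fuse_comments(embeddings: Sequence[Sequence[float]],
                  confidences: Sequence[float]) -> List[float]:
    """Return sum_i c_i * e_i for comment embeddings e_i and confidences c_i."""
    if len(embeddings) != len(confidences):
        raise ValueError("one confidence is required per comment embedding")
    if not embeddings:
        raise ValueError("at least one comment embedding is required")
    dim = len(embeddings[0])
    fused = [0.0] * dim
    for emb, conf in zip(embeddings, confidences):
        if len(emb) != dim:
            raise ValueError("all embeddings must share one dimension")
        for j, value in enumerate(emb):
            fused[j] += conf * value
    return fused
```

Comments with little evidentiary support receive small weights and therefore contribute little to the fused vector, which is the intuition behind weighting by confidence.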

This explicit structuring is reflected in both the fine-tuning supervision and the inference pipeline (Zhu et al., 11 Mar 2025).

4. Objectives and Optimization

The loss combines autoregressive cross-entropy terms for each sub-stage:

$\mathcal{L} = \lambda_1 \mathcal{L}_{z_1} + \lambda_2 \mathcal{L}_{z_2} + \lambda_3 \mathcal{L}_{z_3} + \lambda_4 \mathcal{L}_{(s, a)}$

with all $\lambda_i = 1$ in practice. This structure enforces sequential, evidence-based, and multidimensional reasoning throughout the review generation process. Optimization relies on DeepSpeed with ZeRO-3 and the aforementioned hyperparameters for efficient large-scale training (Zhu et al., 11 Mar 2025).
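The combination of per-stage cross-entropy terms can be illustrated numerically. The sketch below assumes each stage is represented by the probabilities the model assigned to its target tokens, that each stage term is a mean negative log-likelihood, and that every stage weight defaults to 1; the function names and the per-stage averaging are assumptions rather than the authors' training code.

```python
import math
from typing import Dict, Optional, Sequence


def stage_cross_entropy(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the target tokens in one stage."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def chained_loss(stage_probs: Dict[str, Sequence[float]],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of per-stage cross-entropy terms (weights default to 1)."""
    if weights is None:
        weights = {name: 1.0 for name in stage_probs}
    return sum(weights[name] * stage_cross_entropy(probs)
               for name, probs in stage_probs.items())
```

With unit weights, a stage that the model predicts perfectly contributes zero, while a stage whose target tokens each receive probability 0.5 contributes $\ln 2 \approx 0.693$.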

5. Evaluation and Empirical Performance

Evaluation is conducted on DeepReview-Bench (1,286 samples) across four principal tasks:

Metric	Value (DeepReviewer-14B, ICLR 2024)	Notes
Rating MSE	1.3137	–44.8% vs. CycleReviewer-70B
Decision Accuracy	0.6406	+1.0 pt vs. baseline
Spearman Correlation	0.3559	+6.04% vs. CycleReviewer-70B
Pairwise Accuracy	0.6242
Win Rate vs. GPT-o1	88.21%	LLM-as-judge assessment
Win Rate vs. DeepSeek-R1	80.20%

Further, in test-time scaling (Fast/Standard/Best), the model produces 3k/8k/14.5k output tokens, respectively, with corresponding Spearman correlation increasing from 0.326 to 0.355 (+8.97%). Even the fast mode (3k tokens) surpasses CycleReviewer (6k output tokens). Against adversarial attacks, DeepReviewer-14B's score shifts only by +0.31, compared to +4.26 for Gemini (Zhu et al., 11 Mar 2025).

6. Resources and Reproducibility

All resources are open-source and publicly released:

Model weights (7B and 14B variants): https://huggingface.co/WestlakeNLP/DeepReviewer-7B and https://huggingface.co/WestlakeNLP/DeepReviewer-14B
Code, dataset, demonstration: http://ai-researcher.net
Licensing: Models, code, and dataset adhere to open usage policies, with the dataset incorporating CC BY 4.0 OpenReview content.

DeepReviewer-14B is reproducible in full according to the project’s releases, and serves as a new LLM-based benchmark for structured, high-fidelity scientific paper review (Zhu et al., 11 Mar 2025).

References (1)

Zhu et al. (2025). DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process.