DeepReviewer-14B: LLM Academic Peer Review

Updated 13 May 2026
  • The paper introduces a structured, multi-stage review system that emulates expert evaluation via novelty verification, multi-dimension critique, and reliability checks.
  • It fine-tunes a 14B-parameter transformer on DeepReview-13K, achieving state-of-the-art metrics in decision accuracy and Spearman correlation compared to baselines.
  • The model leverages innovative LongRoPE embeddings for extended context handling, enabling robust review generation for long scientific manuscripts.

DeepReviewer-14B is an LLM-based academic paper review system developed as part of the DeepReview project to address limitations in automated scientific research assessment. Designed to systematically emulate expert reviewers, DeepReviewer-14B integrates structured evaluative stages and evidence-based reasoning, setting new state-of-the-art (SOTA) results for LLM-driven peer review (Zhu et al., 11 Mar 2025).

1. Model Architecture

DeepReviewer-14B is derived by full-parameter supervised fine-tuning of Phi-4 14B, an open-source transformer LLM. The model preserves the original Phi-4 architectural specifications:

  • Parameter count: 14 billion (B)
  • Transformer depth: 40 layers
  • Hidden dimension: 5120
  • Attention heads: 40

To accommodate extended input sequences from long manuscripts and complex reasoning traces, rotary positional embeddings are replaced with LongRoPE, yielding a context window of up to 256k tokens at inference and 40k tokens during training. No adapters, LoRA, or external modules are introduced; all backbone parameters are trained end-to-end under DeepSpeed+ZeRO3 optimization for computational efficiency (Zhu et al., 11 Mar 2025).
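As a rough illustration of the LongRoPE idea mentioned above, the sketch below computes rotary angles with a per-dimension rescale factor applied to each frequency. The function name `rope_angles`, the base of 10000, and the example rescale factors are assumptions made for illustration; in LongRoPE the factors come from a search procedure, and the values used for DeepReviewer-14B are not stated in this text.

```python
def rope_angles(position, head_dim, base=10000.0, rescale=None):
    """Rotation angles for one token position under (optionally rescaled) RoPE.

    rescale[i] >= 1 stretches the wavelength of frequency pair i; passing None
    gives standard RoPE. The factor values are hypothetical inputs.
    """
    half = head_dim // 2
    if rescale is None:
        rescale = [1.0] * half
    if len(rescale) != half:
        raise ValueError("need one rescale factor per frequency pair")
    # Standard RoPE frequency for pair i is base ** (-2i / head_dim);
    # dividing by rescale[i] slows the rotation for that pair.
    return [position / (rescale[i] * base ** (2 * i / head_dim)) for i in range(half)]
```

Each angle shrinks by its own factor, so positions beyond the original training range are mapped back toward angle ranges the model has already seen.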

2. Data and Fine-tuning Regimen

The model is fine-tuned on DeepReview-13K, a curated dataset comprising structured review traces collected from ICLR 2024/2025 OpenReview and arXiv submissions. The dataset includes:

  • Total submissions: 18,976 (filtered to 13,378 valid samples)
  • Annotations per sample:
    • Reviewer comments (strengths, weaknesses, queries)
    • Rebuttal dialogues
    • Fine-grained scores: overall (1–10), soundness/presentation/contribution (1–4 each)
    • Meta-review text, final recommendation, binary (accept/reject) decision
  • Split: 90% training (≈12,091), 10% test (1,286, DeepReview-Bench)
  • Data conversion: OpenReview and arXiv PDF-to-Markdown via MinerU
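The annotation fields and the train/test split listed above could be represented roughly as follows. This is a minimal sketch: the class name `ReviewSample`, its field names, and the shuffling seed are hypothetical and are not taken from the DeepReview-13K release.

```python
from dataclasses import dataclass
import random


@dataclass
class ReviewSample:
    """Hypothetical record mirroring the annotation fields listed above."""
    paper_markdown: str       # manuscript converted from PDF to Markdown
    reviewer_comments: list   # strengths, weaknesses, queries
    rebuttals: list           # rebuttal dialogue turns
    overall: int              # 1-10
    soundness: int            # 1-4
    presentation: int         # 1-4
    contribution: int         # 1-4
    meta_review: str
    accept: bool              # binary accept/reject decision

    def __post_init__(self):
        # Enforce the score ranges stated in the dataset description.
        if not 1 <= self.overall <= 10:
            raise ValueError("overall score must be in 1..10")
        for name in ("soundness", "presentation", "contribution"):
            if not 1 <= getattr(self, name) <= 4:
                raise ValueError(f"{name} must be in 1..4")


def split_dataset(samples, test_fraction=0.1, seed=0):
    """Shuffle and split into (train, test); the seed is an arbitrary choice."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```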

The supervised objective uses next-token cross-entropy loss over concatenated review chains $z_1 \rightarrow z_2 \rightarrow z_3 \rightarrow (s, a)$, where $z_1$ is novelty verification, $z_2$ is multi-dimension review, $z_3$ is reliability verification, $s$ is the meta-review text, and $a$ is the accept/reject decision. Fine-tuning is conducted over 23,500 steps with batch size 16, a constant learning rate of $5 \times 10^{-6}$, the AdamW optimizer, and randomly truncated 40k-token context windows (Zhu et al., 11 Mar 2025).
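One plausible reading of the chain concatenation and random truncation described above is sketched below. The flat token-list layout, the function names, and the choice of a uniformly random contiguous window are assumptions; the source does not spell out how truncation is performed.

```python
import random


def build_training_sequence(paper, z1, z2, z3, s, a):
    """Concatenate the paper and the review chain in the order z1 -> z2 -> z3 -> (s, a).

    Every argument is a list of token ids; the flat layout is an assumption.
    """
    return paper + z1 + z2 + z3 + s + a


def random_truncate(tokens, max_len=40_000, rng=random):
    """Return a random contiguous window of at most max_len tokens."""
    if len(tokens) <= max_len:
        return list(tokens)
    start = rng.randrange(len(tokens) - max_len + 1)
    return tokens[start:start + max_len]
```

Sequences shorter than the window are returned unchanged, so only long manuscripts are affected.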

3. Multi-Stage Review Framework

DeepReviewer-14B operationalizes academic peer review via three explicit, sequentially chained sub-tasks:

3.1. Novelty Verification ($z_1$)

  • Goal: Verify research originality through retrieval of semantically similar prior work.
  • Process:
    • Semantic retrieval via Semantic Scholar/OpenScholar.
    • Query/paper embedding: $E(\cdot)$.
    • Similarity: $\mathrm{sim}(q, d) = \langle E(q), E(d) \rangle / (\|E(q)\| \|E(d)\|)$.
    • Retrieval probability: computed for each candidate from its similarity score.
  • Output: Top-$k$ candidate papers and textual novelty assessment $z_1$.
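The similarity-based retrieval step can be sketched as a cosine-similarity ranking over precomputed embeddings. The helper names and the toy vectors are hypothetical; a real pipeline would obtain embeddings from the retrieval service rather than from hand-written lists.

```python
import math


def cosine_similarity(u, v):
    """<u, v> / (||u|| * ||v||), returning 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query and keep the best k.

    doc_vecs maps a document id to its embedding vector.
    """
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

For example, `top_k([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, k=2)` ranks `"a"` first and `"c"` second.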

3.2. Multi-Dimension Review ($z_2$)

  • Input: The set of reviewer comments and associated rebuttals.
  • Action: Model (Qwen-2.5-72B-Instruct) reconstructs actionable suggestions and synthesizes multi-axis critiques (soundness, novelty, clarity, ethics) into $z_2$.

3.3. Reliability Verification ($z_3$)

  • Self-reflection: Four-stage verification (methodology, experiment, consistency, overall) using Gemini-2.0-Flash-Thinking.
  • Confidence assignment: Each comment in $z_2$ is evaluated for evidentiary support and assigned a scalar confidence score.
  • Meta-review fusion: The Qwen model integrates $z_1$, $z_2$, $z_3$, and the original meta-review to yield the meta-review text $s$.
  • Implicit evidence fusion: Comment embeddings are combined as a weighted sum.

This explicit structuring is reflected in both the fine-tuning supervision and the inference pipeline (Zhu et al., 11 Mar 2025).
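The weighted sum of comment embeddings used for implicit evidence fusion might look like the sketch below. Using the per-comment confidence scores as the weights, and normalizing them to sum to 1, are both assumptions; the exact formula is not recoverable from this text, and `fuse_comments` is a hypothetical name.

```python
def fuse_comments(embeddings, confidences, normalize=True):
    """Confidence-weighted sum of comment embeddings.

    embeddings:  list of equal-length vectors, one per review comment
    confidences: list of non-negative scalars, one per comment (assumed weights)
    """
    if not embeddings or len(embeddings) != len(confidences):
        raise ValueError("need one confidence per embedding, and at least one comment")
    weights = list(confidences)
    if normalize:
        total = sum(weights)
        if total <= 0:
            raise ValueError("confidences must sum to a positive value")
        weights = [w / total for w in weights]
    fused = [0.0] * len(embeddings[0])
    for w, emb in zip(weights, embeddings):
        for i, x in enumerate(emb):
            fused[i] += w * x
    return fused
```

With normalization, a comment with confidence 3 contributes three times as much as one with confidence 1, so weakly supported comments have less influence on the fused representation.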

4. Objectives and Optimization

The loss combines autoregressive cross-entropy terms for each sub-stage:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{z_1} + \lambda_2 \mathcal{L}_{z_2} + \lambda_3 \mathcal{L}_{z_3} + \lambda_4 \mathcal{L}_{(s,a)}$$

with all $\lambda_i = 1$ in practice. This structure enforces sequential, evidence-based, and multidimensional reasoning throughout the review generation process. Optimization relies on DeepSpeed+ZeRO3 with the aforementioned hyperparameters for efficient large-scale training (Zhu et al., 11 Mar 2025).
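A combined loss of this kind can be sketched as a weighted sum of per-stage cross-entropy terms. The stage keys (`"z1"`, `"z2"`, `"z3"`, `"sa"`), the use of a per-stage mean, and the default weight of 1 for every stage are assumptions made for illustration, not details confirmed by the source.

```python
import math


def cross_entropy(target_probs):
    """Mean negative log-likelihood, given the model's probability of each target token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def total_loss(stage_probs, weights=None):
    """Weighted sum of per-stage cross-entropy terms.

    stage_probs maps a stage name to the probabilities the model assigned to
    that stage's target tokens. Weights default to 1.0 for every stage.
    """
    stages = ("z1", "z2", "z3", "sa")
    if weights is None:
        weights = {name: 1.0 for name in stages}
    return sum(weights[name] * cross_entropy(stage_probs[name]) for name in stages)
```

When every weight is 1 the result is simply the sum of the four stage losses, which treats the novelty, review, reliability, and decision segments equally.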

5. Evaluation and Empirical Performance

Evaluation is conducted on DeepReview-Bench (1,286 samples) across four principal tasks:

| Metric | Value (DeepReviewer-14B, ICLR 2024) | Notes |
|---|---|---|
| Rating MSE | 1.3137 | -44.8% vs. CycleReviewer-70B |
| Decision accuracy | 0.6406 | +1.0 pt vs. baseline |
| Spearman correlation | 0.3559 | +6.04% vs. CycleReviewer-70B |
| Pairwise accuracy | 0.6242 | |
| Win rate vs. GPT-o1 | 88.21% | LLM-as-judge assessment |
| Win rate vs. DeepSeek-R1 | 80.20% | |

Further, in test-time scaling (Fast/Standard/Best), the model produces 3k/8k/14.5k output tokens, respectively, with corresponding Spearman correlation increasing from 0.326 to 0.355 (+8.97%). Even the fast mode (3k tokens) surpasses CycleReviewer (6k output tokens). Against adversarial attacks, DeepReviewer-14B's score shifts only by +0.31, compared to +4.26 for Gemini (Zhu et al., 11 Mar 2025).

6. Resources and Reproducibility

All resources are open-source and publicly released.

DeepReviewer-14B is reproducible in full according to the project’s releases, and serves as a new LLM-based benchmark for structured, high-fidelity scientific paper review (Zhu et al., 11 Mar 2025).

