REMOR: AI-Driven Peer Review Framework
- REMOR is an automated peer review generation framework that uses LLM chain-of-thought reasoning and reinforcement learning to produce multi-aspect scientific reviews.
- It employs a three-stage pipeline including supervised fine-tuning on the PeerRT dataset, advanced reward modeling with HPRR, and GRPO-based optimization for precise human-aligned feedback.
- REMOR delivers focused and reliable reviews with faster generation times, outperforming existing agentic AI review systems in both efficiency and review quality.
REMOR is an automated peer review generation framework that leverages LLM reasoning and multi-objective reinforcement learning to synthesize high-quality scientific reviews. Designed to address the limitations of AI-generated feedback—typically characterized by superficial praise and lack of substantive criticism—REMOR integrates chain-of-thought-based reasoning with explicit multi-aspect optimization. Its core advances include a multi-aspect reward function aligned to human evaluation, a domain-specific curated reasoning-enriched dataset (PeerRT), and a Group Relative Policy Optimization (GRPO) training procedure yielding two principal model variants: REMOR-H (human-aligned) and REMOR-U (uniform-reward driven).
1. System Architecture and Reasoning Integration
REMOR is structured as a three-stage pipeline:
- Supervised Fine-Tuning (SFT): The base LLM (DeepSeek-R1-Distill-Qwen-7B) is fine-tuned—using LoRA (rank 8, 32k context tokens)—on PeerRT, a dataset of peer reviews and synthetic reasoning traces.
- Reward Modeling: A multi-aspect, sentence-normalized reward function, termed the Human-aligned Peer Review Reward (HPRR), is defined. It assesses review quality in criticism, examples, suggestions, importance/relevance, materials/methods, presentation/reporting, results/discussion, and manuscript-review relevance (measured by METEOR).
- Multi-Objective Optimization via GRPO: The model undergoes reinforcement learning, with GRPO aggregating group-based reward comparisons over batches and updating policies toward either human-aligned (REMOR-H) or uniform (REMOR-U) objectives.
Central to REMOR is the explicit generation and optimization of reasoning traces. During both SFT and RL phases, the LLM produces chain-of-thought style commentary, incentivizing not just surface-level review generation but deep analytical coverage across all reward facets.
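To make the first stage concrete, below is a minimal sketch of the SFT setup, assuming Hugging Face `trl`/`peft` and a PeerRT-style dataset flattened into a single `text` field (the data format and `lora_alpha` value are assumptions; LoRA rank 8, three epochs, and the 1e-4 learning rate follow the reported configuration). This is an illustrative sketch, not REMOR's released training code.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder records standing in for PeerRT (paper context + reasoning trace + review).
peerrt_like = Dataset.from_list([
    {"text": "<paper excerpt> <think>reasoning trace</think> <review>review text</review>"},
])

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",   # base reasoning model
    args=SFTConfig(output_dir="remor-sft", num_train_epochs=3, learning_rate=1e-4),
    train_dataset=peerrt_like,
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),  # rank 8 as reported; alpha assumed
)
trainer.train()
```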
2. Multi-Aspect Reward Function Construction
The HPRR is defined as:

$$\mathrm{HPRR}(R, M) = \sum_{i} w_i \, r_i(R) + w_{\mathrm{rel}} \, \mathrm{METEOR}(R, M)$$

where $r_i(R)$ is the sentence-normalized metric for aspect $i$ (criticism, example, ..., suggestion/solution) and $\mathrm{METEOR}(R, M)$ is the METEOR-based relevance between review $R$ and manuscript $M$. The weights $w_i$ are derived in two regimes:
- Uniform (REMOR-U): the weights $w_i$ are set equally for all aspects.
- Human-aligned (REMOR-H): the weights $w_i$ are obtained from human reviewer preference data using Adapted Bradley-Terry and Constrained Reward Models.
Constraints on the weights $w_i$ in the human-aligned setting encode the pairwise preference orderings recovered from reviewer data (e.g., $w_i > w_j$ whenever aspect $i$ is preferred to aspect $j$), with the strict inequalities enforced using Laplace smoothing and L1 regularization. Aspects such as importance/relevance and suggestion/solution receive dominant weights when approximating human preferences.
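A hedged sketch of this scoring scheme is given below. The per-aspect scorers are illustrative keyword-based stand-ins for the paper's trained aspect models; only the overall structure (sentence-normalized aspect scores, METEOR-based relevance, weighted aggregation) follows the description above.

```python
# Requires: pip install nltk; plus nltk.download("wordnet") and nltk.download("punkt").
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

ASPECTS = ["criticism", "example", "importance_relevance", "materials_methods",
           "presentation_reporting", "results_discussion", "suggestion_solution"]

def keyword_scorer(aspect):
    """Illustrative stand-in for the paper's trained per-aspect scorers."""
    kw = aspect.split("_")[0]
    def score(review: str) -> float:
        sentences = [s for s in review.split(".") if s.strip()]
        return sum(kw in s.lower() for s in sentences) / max(len(sentences), 1)
    return score

aspect_scorers = {a: keyword_scorer(a) for a in ASPECTS}

def hprr(review: str, manuscript: str, weights: dict, scorers: dict = aspect_scorers) -> float:
    """Weighted sum of normalized aspect scores plus METEOR-based manuscript relevance."""
    aspect_term = sum(weights[a] * scorers[a](review) for a in ASPECTS)
    relevance = meteor_score([word_tokenize(manuscript)], word_tokenize(review))
    return aspect_term + weights["relevance"] * relevance

# REMOR-U regime: equal weights over all reward terms (the normalization is illustrative).
uniform = {k: 1.0 / (len(ASPECTS) + 1) for k in ASPECTS + ["relevance"]}
```

A human-aligned weight vector fitted under the Bradley-Terry constraints described above would replace `uniform` to obtain the REMOR-H objective.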
3. Model Training, Datasets, and Efficiency
Supervised fine-tuning uses approximately 1,700 PeerRT samples (ICLR 2017–2020 reviews enriched with synthetic reasoning traces generated by Claude Sonnet 3.7) over three epochs at a learning rate of 1e-4, followed by one GRPO RL epoch (~864 steps, ~100 hours, temperature 0.9) with 4 generations per prompt. In cost/time terms, REMOR generates a review in about 1 minute, dramatically faster than multi-agent review systems (~20–30 minutes per review). Released models include SFT-only baselines and the GRPO-optimized REMOR-U/REMOR-H on Hugging Face.
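As an illustration of the RL stage, the snippet below wires the `hprr()` sketch from Section 2 into `trl`'s `GRPOTrainer`. The dataset column names, checkpoint path, and reward plumbing are assumptions; one epoch, 4 generations per prompt, and temperature 0.9 mirror the reported run.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompts derived from PeerRT manuscripts (column names are illustrative).
grpo_prompts = Dataset.from_list([
    {"prompt": "Review the following manuscript:\n<full paper text>",
     "manuscript": "<full paper text>"},
])

def hprr_reward(prompts, completions, manuscript, **kwargs):
    # One scalar reward per sampled completion; `uniform` selects the REMOR-U regime,
    # and a human-aligned weight vector would yield the REMOR-H objective.
    return [hprr(c, m, uniform) for c, m in zip(completions, manuscript)]

trainer = GRPOTrainer(
    model="remor-sft",                      # the SFT checkpoint from the previous stage
    reward_funcs=hprr_reward,
    args=GRPOConfig(output_dir="remor-grpo", num_train_epochs=1,
                    num_generations=4, temperature=0.9),
    train_dataset=grpo_prompts,
)
trainer.train()
```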
PeerRT construction integrates paper texts extracted via GROBID, per-aspect metric scores, and synthesized reasoning traces, providing both linguistic and reward-feature supervision. The METEOR score measures lexical overlap between a review and its manuscript, incentivizing relevance.
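A minimal sketch of such an extraction step, assuming a locally running GROBID server on its default port (REMOR's exact extraction settings are not reproduced here):

```python
import requests
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_fulltext(pdf_path: str, grobid_url: str = "http://localhost:8070") -> str:
    """Send a PDF to GROBID and return the concatenated paragraph text of the TEI result."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": f},
            timeout=120,
        )
    resp.raise_for_status()
    tei = ET.fromstring(resp.content)
    return "\n".join("".join(p.itertext()) for p in tei.findall(".//tei:p", TEI_NS))
```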
4. Model Performance and Feedback Characteristics
REMOR-H and REMOR-U respectively maximize human-aligned and uniform synthetic rewards:
| Model Variant | Uniform Reward | Human-aligned Reward | Review Quality Characteristics |
|---|---|---|---|
| REMOR-U | 3.884 | Lower than REMOR-H | Concise, substantive, broad aspect coverage |
| REMOR-H | Lower than REMOR-U | 0.670 | More redundant; closely matches the manuscript (METEOR-weighted) |
| Human reviews | ~1.8 | ~0.33 | Highly variable; long tail of low-quality scores |
Both REMOR variants consistently outperform human reviews, non-reasoning LLMs, and state-of-the-art agentic AI review systems by over a factor of two in average reward. REMOR-U generates focused and direct feedback addressing key paper dimensions; REMOR-H sacrifices some aspect richness for higher METEOR relevance. Variance analyses indicate REMOR-U provides more reliable (lower variance) review quality compared to human reviewers.
5. Implications for Peer Review and Scholarly Automation
REMOR demonstrates that explicit reasoning and targeted multi-objective RL can correct the superficiality seen in prior AI reviewer models. By operationalizing a balanced reward across all peer review dimensions and training on context-rich, reasoning-enriched examples, REMOR avoids the "long tail" of low-quality reviews and matches best-in-class human feedback. The released HPRR, PeerRT, and REMOR models establish a reproducible pipeline for further research.
Potential further uses include:
- Automated reviewer assignment and selection based on aspect coverage.
- Real-time peer review scoring for editorial processes.
- Self-assessment tools for human reviewers using HPRR metrics.
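To illustrate the last item, the `hprr()` sketch from Section 2 could score a draft review against the manuscript before submission (the draft and manuscript text below are placeholders):

```python
draft_review = "The contribution is clear, but the evaluation omits an ablation over ..."
manuscript_text = "<plain-text manuscript, e.g. extracted with GROBID as sketched above>"
print(f"HPRR (uniform weights): {hprr(draft_review, manuscript_text, uniform):.3f}")
```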
6. Dataset and Model Release
- PeerRT: Reasoning-enriched review dataset with manuscript associations and aspect scores (ICLR reviews, GROBID extraction, Claude-generated traces).
- HPRR: A fully specified multi-aspect peer review reward function, with weights derived from human preference data.
- REMOR Models: SFT and GRPO-optimized variants available on Hugging Face, with full codebase for reproducibility (https://github.com/Khempawin/remor.git).
7. Summary
REMOR establishes a technical foundation for automated, reliable, and aspect-balanced peer review generation via LLM reasoning and multi-objective reinforcement learning. By integrating deep reward modeling aligned with explicit human preferences, domain-specific fine-tuning, and efficient generation, REMOR sets new standards for both quantitative and qualitative review quality in AI-driven scholarly evaluation (Taechoyotin et al., 16 May 2025).