REMOR: AI-Driven Peer Review Framework
- REMOR is an automated peer review generation framework that uses LLM chain-of-thought reasoning and reinforcement learning to produce multi-aspect scientific reviews.
- It employs a three-stage pipeline including supervised fine-tuning on the PeerRT dataset, advanced reward modeling with HPRR, and GRPO-based optimization for precise human-aligned feedback.
- REMOR delivers focused and reliable reviews with faster generation times, outperforming existing agentic AI review systems in both efficiency and review quality.
REMOR is an automated peer review generation framework that leverages LLM reasoning and multi-objective reinforcement learning to synthesize high-quality scientific reviews. Designed to address the limitations of AI-generated feedback—typically characterized by superficial praise and lack of substantive criticism—REMOR integrates chain-of-thought-based reasoning with explicit multi-aspect optimization. Its core advances include a multi-aspect reward function aligned to human evaluation, a domain-specific curated reasoning-enriched dataset (PeerRT), and a Group Relative Policy Optimization (GRPO) training procedure yielding two principal model variants: REMOR-H (human-aligned) and REMOR-U (uniform-reward driven).
1. System Architecture and Reasoning Integration
REMOR is structured as a three-stage pipeline:
- Supervised Fine-Tuning (SFT): The base LLM (DeepSeek-R1-Distill-Qwen-7B) is fine-tuned—using LoRA (rank 8, 32k context tokens)—on PeerRT, a dataset of peer reviews and synthetic reasoning traces.
- Reward Modeling: A multi-aspect, sentence-normalized reward function, termed the Human-aligned Peer Review Reward (HPRR), is defined. It assesses review quality in criticism, examples, suggestions, importance/relevance, materials/methods, presentation/reporting, results/discussion, and manuscript-review relevance (measured by METEOR).
- Multi-Objective Optimization via GRPO: The model undergoes reinforcement learning, with GRPO aggregating group-based reward comparisons over batches and updating policies toward either human-aligned (REMOR-H) or uniform (REMOR-U) objectives.
Central to REMOR is the explicit generation and optimization of reasoning traces. During both SFT and RL phases, the LLM produces chain-of-thought style commentary, incentivizing not just surface-level review generation but deep analytical coverage across all reward facets.
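To make the first stage concrete, below is a minimal sketch of the SFT setup, assuming Hugging Face `trl`/`peft` and a PeerRT-style dataset flattened into a single `text` field (the data format and `lora_alpha` value are assumptions; LoRA rank 8, three epochs, and the 1e-4 learning rate follow the reported configuration). This is an illustrative sketch, not REMOR's released training code.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder records standing in for PeerRT (paper context + reasoning trace + review).
peerrt_like = Dataset.from_list([
    {"text": "<paper excerpt> <think>reasoning trace</think> <review>review text</review>"},
])

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",   # base reasoning model
    args=SFTConfig(output_dir="remor-sft", num_train_epochs=3, learning_rate=1e-4),
    train_dataset=peerrt_like,
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),  # rank 8 as reported; alpha assumed
)
trainer.train()
```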
2. Multi-Aspect Reward Function Construction
The HPRR is defined as:

$$\mathrm{HPRR}(R, M) = \sum_{i} w_i \, r_i(R) + w_{\mathrm{rel}} \, \mathrm{METEOR}(R, M)$$

where $r_i(R)$ is the sentence-normalized metric for aspect $i$ (criticism, example, ..., suggestion/solution) and $\mathrm{METEOR}(R, M)$ is the METEOR-based relevance between review $R$ and manuscript $M$. The weights $w_i$ are derived in two regimes:
- Uniform (REMOR-U): the weights $w_i$ are set equally for all aspects.
- Human-aligned (REMOR-H): the weights $w_i$ are obtained from human reviewer preference data using Adapted Bradley-Terry and Constrained Reward Models.
Constraints on the weights $w_i$ in the human-aligned setting encode the pairwise preference orderings recovered from reviewer data (e.g., $w_i > w_j$ whenever aspect $i$ is preferred to aspect $j$), with the strict inequalities enforced using Laplace smoothing and L1 regularization. Aspects such as importance/relevance and suggestion/solution receive dominant weights when approximating human preferences.
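A hedged sketch of this scoring scheme is given below. The per-aspect scorers are illustrative keyword-based stand-ins for the paper's trained aspect models; only the overall structure (sentence-normalized aspect scores, METEOR-based relevance, weighted aggregation) follows the description above.

```python
# Requires: pip install nltk; plus nltk.download("wordnet") and nltk.download("punkt").
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

ASPECTS = ["criticism", "example", "importance_relevance", "materials_methods",
           "presentation_reporting", "results_discussion", "suggestion_solution"]

def keyword_scorer(aspect):
    """Illustrative stand-in for the paper's trained per-aspect scorers."""
    kw = aspect.split("_")[0]
    def score(review: str) -> float:
        sentences = [s for s in review.split(".") if s.strip()]
        return sum(kw in s.lower() for s in sentences) / max(len(sentences), 1)
    return score

aspect_scorers = {a: keyword_scorer(a) for a in ASPECTS}

def hprr(review: str, manuscript: str, weights: dict, scorers: dict = aspect_scorers) -> float:
    """Weighted sum of normalized aspect scores plus METEOR-based manuscript relevance."""
    aspect_term = sum(weights[a] * scorers[a](review) for a in ASPECTS)
    relevance = meteor_score([word_tokenize(manuscript)], word_tokenize(review))
    return aspect_term + weights["relevance"] * relevance

# REMOR-U regime: equal weights over all reward terms (the normalization is illustrative).
uniform = {k: 1.0 / (len(ASPECTS) + 1) for k in ASPECTS + ["relevance"]}
```

A human-aligned weight vector fitted under the Bradley-Terry constraints described above would replace `uniform` to obtain the REMOR-H objective.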
3. Model Training, Datasets, and Efficiency
Supervised fine-tuning uses approximately 1,700 PeerRT samples (ICLR 2017–2020 reviews enriched with synthetic reasoning traces generated by Claude Sonnet 3.7) over three epochs at a learning rate of 1e-4, followed by one GRPO RL epoch (~864 steps, ~100 hours, temperature 0.9) with 4 generations per prompt. In cost/time terms, REMOR generates a review in about 1 minute, dramatically faster than multi-agent review systems (~20–30 minutes per review). Released models include SFT-only baselines and the GRPO-optimized REMOR-U/REMOR-H on Hugging Face.
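As an illustration of the RL stage, the snippet below wires the `hprr()` sketch from Section 2 into `trl`'s `GRPOTrainer`. The dataset column names, checkpoint path, and reward plumbing are assumptions; one epoch, 4 generations per prompt, and temperature 0.9 mirror the reported run.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompts derived from PeerRT manuscripts (column names are illustrative).
grpo_prompts = Dataset.from_list([
    {"prompt": "Review the following manuscript:\n<full paper text>",
     "manuscript": "<full paper text>"},
])

def hprr_reward(prompts, completions, manuscript, **kwargs):
    # One scalar reward per sampled completion; `uniform` selects the REMOR-U regime,
    # and a human-aligned weight vector would yield the REMOR-H objective.
    return [hprr(c, m, uniform) for c, m in zip(completions, manuscript)]

trainer = GRPOTrainer(
    model="remor-sft",                      # the SFT checkpoint from the previous stage
    reward_funcs=hprr_reward,
    args=GRPOConfig(output_dir="remor-grpo", num_train_epochs=1,
                    num_generations=4, temperature=0.9),
    train_dataset=grpo_prompts,
)
trainer.train()
```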
PeerRT construction integrates paper texts extracted via GROBID, per-aspect metric scores, and synthesized reasoning traces, providing both linguistic and reward-feature supervision. The METEOR score measures lexical overlap between a review and its manuscript, incentivizing relevance.
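A minimal sketch of such an extraction step, assuming a locally running GROBID server on its default port (REMOR's exact extraction settings are not reproduced here):

```python
import requests
from xml.etree import ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_fulltext(pdf_path: str, grobid_url: str = "http://localhost:8070") -> str:
    """Send a PDF to GROBID and return the concatenated paragraph text of the TEI result."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": f},
            timeout=120,
        )
    resp.raise_for_status()
    tei = ET.fromstring(resp.content)
    return "\n".join("".join(p.itertext()) for p in tei.findall(".//tei:p", TEI_NS))
```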
4. Model Performance and Feedback Characteristics
REMOR-H and REMOR-U respectively maximize human-aligned and uniform synthetic rewards:
| Model Variant | Uniform Reward | Human-aligned Reward | Review Quality Characteristics |
|---|---|---|---|
| REMOR-U | 3.884 | Lower than REMOR-H | Concise, substantive, broad aspect coverage |
| REMOR-H | Lower than REMOR-U | 0.670 | More redundant; closely matches the manuscript (METEOR-weighted) |
| Human reviews | ~1.8 | ~0.33 | Highly variable; long tail of low-quality scores |
Both REMOR variants consistently outperform human reviews, non-reasoning LLMs, and state-of-the-art agentic AI review systems by over a factor of two in average reward. REMOR-U generates focused and direct feedback addressing key paper dimensions; REMOR-H sacrifices some aspect richness for higher METEOR relevance. Variance analyses indicate REMOR-U provides more reliable (lower variance) review quality compared to human reviewers.
5. Implications for Peer Review and Scholarly Automation
REMOR demonstrates that explicit reasoning and targeted multi-objective RL can correct the superficiality seen in prior AI reviewer models. By operationalizing a balanced reward across all peer review dimensions and training on context-rich, reasoning-enriched examples, REMOR avoids the "long tail" of low-quality reviews and matches best-in-class human feedback. The released HPRR, PeerRT, and REMOR models establish a reproducible pipeline for further research.
Potential further uses include:
- Automated reviewer assignment and selection based on aspect coverage.
- Real-time peer review scoring for editorial processes.
- Self-assessment tools for human reviewers using HPRR metrics.
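To illustrate the last item, the `hprr()` sketch from Section 2 could score a draft review against the manuscript before submission (the draft and manuscript text below are placeholders):

```python
draft_review = "The contribution is clear, but the evaluation omits an ablation over ..."
manuscript_text = "<plain-text manuscript, e.g. extracted with GROBID as sketched above>"
print(f"HPRR (uniform weights): {hprr(draft_review, manuscript_text, uniform):.3f}")
```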
6. Dataset and Model Release
- PeerRT: Reasoning-enriched review dataset with manuscript associations and aspect scores (ICLR reviews, GROBID extraction, Claude-generated traces).
- HPRR: A fully specified multi-aspect peer review reward function, with weights derived from human preference data.
- REMOR Models: SFT and GRPO-optimized variants available on Hugging Face, with full codebase for reproducibility (https://github.com/Khempawin/remor.git).
7. Summary
REMOR establishes a technical foundation for automated, reliable, and aspect-balanced peer review generation via LLM reasoning and multi-objective reinforcement learning. By integrating deep reward modeling aligned with explicit human preferences, domain-specific fine-tuning, and efficient generation, REMOR sets new standards for both quantitative and qualitative review quality in AI-driven scholarly evaluation (Taechoyotin et al., 16 May 2025).