REMOR: AI-Driven Peer Review Framework

Updated 16 August 2025
  • REMOR is an automated peer review generation framework that uses LLM chain-of-thought reasoning and reinforcement learning to produce multi-aspect scientific reviews.
  • It employs a three-stage pipeline: supervised fine-tuning on the PeerRT dataset, multi-aspect reward modeling with the Human-aligned Peer Review Reward (HPRR), and GRPO-based reinforcement learning toward human-aligned feedback.
  • REMOR delivers focused and reliable reviews at roughly one minute per review, outperforming human reviews and prior agentic AI review systems in both efficiency and reward-measured quality.

REMOR is an automated peer review generation framework that leverages LLM reasoning and multi-objective reinforcement learning to synthesize high-quality scientific reviews. Designed to address the limitations of AI-generated feedback—typically characterized by superficial praise and lack of substantive criticism—REMOR integrates chain-of-thought-based reasoning with explicit multi-aspect optimization. Its core advances include a multi-aspect reward function aligned to human evaluation, a domain-specific curated reasoning-enriched dataset (PeerRT), and a Group Relative Policy Optimization (GRPO) training procedure yielding two principal model variants: REMOR-H (human-aligned) and REMOR-U (uniform-reward driven).

1. System Architecture and Reasoning Integration

REMOR is structured as a three-stage pipeline:

  • Supervised Fine-Tuning (SFT): The base LLM (DeepSeek-R1-Distill-Qwen-7B) is fine-tuned—using LoRA (rank 8, 32k context tokens)—on PeerRT, a dataset of peer reviews and synthetic reasoning traces.
  • Reward Modeling: A multi-aspect, sentence-normalized reward function, termed the Human-aligned Peer Review Reward (HPRR), is defined. It assesses review quality in criticism, examples, suggestions, importance/relevance, materials/methods, presentation/reporting, results/discussion, and manuscript-review relevance (measured by METEOR).
  • Multi-Objective Optimization via GRPO: The model undergoes reinforcement learning, with GRPO aggregating group-based reward comparisons over batches and updating policies toward either human-aligned (REMOR-H) or uniform (REMOR-U) objectives.

Central to REMOR is the explicit generation and optimization of reasoning traces. During both SFT and RL phases, the LLM produces chain-of-thought style commentary, incentivizing not just surface-level review generation but deep analytical coverage across all reward facets.
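
As a concrete illustration of the GRPO step, the sketch below is a minimal version assuming the standard group-relative formulation; REMOR's exact normalization and clipping are not reproduced here. It standardizes each sampled review's reward against its prompt group, the quantity a GRPO policy update uses in place of a learned critic.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Compute GRPO-style advantages for one prompt.

    `rewards` holds the scalar reward of each completion sampled for the
    same prompt (e.g. 4 generations per prompt, as in REMOR's RL stage).
    Each completion's advantage is its reward standardized against the
    group mean and standard deviation, so no learned value model is needed.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Hypothetical example: 4 sampled reviews for one manuscript,
# each scored by the multi-aspect reward function.
rewards = np.array([2.1, 3.4, 1.7, 2.9])
print(group_relative_advantages(rewards))
```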

2. Multi-Aspect Reward Function Construction

The HPRR is defined as:

$$\text{Reward} = \sum_{i} \left( w_i \cdot R_{(i)} \right) + w_{\mathrm{ReME}} \cdot R_{\mathrm{ReME}}$$

where $R_{(i)}$ is the normalized metric for aspect $i$ (criticism, example, ..., suggestion/solution) and $R_{\mathrm{ReME}}$ is the METEOR-based manuscript-review relevance score. The weights $w_i$ are derived in two regimes:

  • Uniform (REMOR-U): $w_i$ are set equally for all aspects.
  • Human-aligned (REMOR-H): $w_i$ are obtained from human reviewer preference data using Adapted Bradley-Terry and Constrained Reward Models.
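
The following sketch illustrates the HPRR aggregation under both weight regimes. The aspect names follow the list in this section; the weight values, normalization, and example scores are hypothetical placeholders rather than the paper's fitted parameters.

```python
# Illustrative sketch of the HPRR aggregation defined above.
ASPECTS = [
    "criticism", "example", "suggestion_solution", "importance_relevance",
    "materials_methods", "presentation_reporting", "results_discussion",
]

def hprr_reward(aspect_scores: dict, relevance_meteor: float,
                weights: dict, w_reme: float) -> float:
    """Reward = sum_i w_i * R_(i) + w_ReME * R_ReME."""
    aspect_term = sum(weights[a] * aspect_scores[a] for a in ASPECTS)
    return aspect_term + w_reme * relevance_meteor

# Uniform regime (REMOR-U): equal weight on every aspect.
uniform_w = {a: 1.0 / len(ASPECTS) for a in ASPECTS}

# Human-aligned regime (REMOR-H): hypothetical weights emphasizing
# importance/relevance and suggestion/solution, as noted in this section.
human_w = {a: 0.05 for a in ASPECTS}
human_w["importance_relevance"] = 0.35
human_w["suggestion_solution"] = 0.35

scores = {a: 0.4 for a in ASPECTS}  # placeholder sentence-normalized aspect scores
print(hprr_reward(scores, relevance_meteor=0.25, weights=uniform_w, w_reme=0.2))
print(hprr_reward(scores, relevance_meteor=0.25, weights=human_w, w_reme=0.2))
```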

Constraints on the weights $w_i$ (written as $c_i$ in the optimization below) in the human-aligned setting are formulated as:

$$\min \sum_{i=1}^{d} c_i \qquad \text{subject to} \quad c_i \geq 0,\; \sum_{i=1}^{d} c_i = 1$$

with strict inequalities enforced using Laplace smoothing and L1 regularization. Aspects such as importance/relevance and suggestion/solution receive dominant weights when approximating human preferences.
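
A minimal sketch of how the positivity and sum-to-one constraints can be enforced with Laplace smoothing follows. The smoothing strength alpha and the use of a smoothed simplex projection as the final step after preference fitting are assumptions of this illustration; the Adapted Bradley-Terry and L1-regularized fitting itself is not reproduced.

```python
import numpy as np

def smooth_simplex_weights(raw_weights: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Map raw (possibly zero) aspect weights onto the probability simplex.

    Laplace smoothing with a small alpha keeps every weight strictly
    positive, and the renormalization enforces c_i >= 0 and sum_i c_i = 1
    as in the constraint above. alpha is an assumed hyperparameter.
    """
    raw = np.clip(raw_weights, 0.0, None)   # c_i >= 0
    smoothed = raw + alpha                  # strict positivity
    return smoothed / smoothed.sum()        # sum to 1

# Hypothetical raw preference-derived weights for the 7 aspects + relevance term.
raw = np.array([0.30, 0.02, 0.28, 0.25, 0.0, 0.05, 0.04, 0.06])
print(smooth_simplex_weights(raw))
```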

3. Model Training, Datasets, and Efficiency

Supervised fine-tuning uses approximately 1,700 PeerRT samples (ICLR 2017–2020 reviews enriched with synthetic reasoning traces from Claude 3.7 Sonnet) over three epochs at a learning rate of 1e-4, followed by one GRPO RL epoch (~864 steps, ~100 hours, temperature 0.9) with 4 generations per prompt. Generation cost is dramatically lower than that of multi-agent review systems: roughly 1 minute per review versus ~20–30 minutes. Released models include SFT-only baselines and the GRPO-optimized REMOR-U/REMOR-H on Hugging Face.
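
The configuration sketch below mirrors the reported SFT hyperparameters (LoRA rank 8, three epochs, learning rate 1e-4) on the stated base model; the LoRA alpha, dropout, target modules, batch size, and the exact wiring of the GRPO stage are assumptions, not values taken from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"   # base model named in Section 1

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA rank 8 as reported; alpha, dropout, and target modules are assumptions.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Reported SFT hyperparameters: three epochs at learning rate 1e-4.
sft_args = TrainingArguments(
    output_dir="remor-sft",
    num_train_epochs=3,
    learning_rate=1e-4,
    per_device_train_batch_size=1,   # assumed
)

# GRPO-stage settings reported in the text, kept here as plain values;
# wiring them into an RL trainer is left out of this sketch.
grpo_settings = {"epochs": 1, "temperature": 0.9, "generations_per_prompt": 4}
```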

PeerRT construction integrates extracted paper texts (via GROBID), aspect metric scores, and synthesized thought traces, providing both linguistic and reward-feature supervision. The METEOR score measures the overlap between review content and the manuscript, incentivizing relevance.
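
A brief sketch of the manuscript-review relevance term using NLTK's METEOR implementation; whether REMOR uses NLTK, how texts are tokenized, and the reference/hypothesis framing are assumptions here.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR's synonym matching needs WordNet
nltk.download("omw-1.4", quiet=True)

def review_relevance(manuscript_text: str, review_text: str) -> float:
    """METEOR overlap between a review and its manuscript (the R_ReME term).

    The manuscript is used as the reference and the review as the hypothesis;
    whitespace tokenization and this framing are simplifying assumptions.
    """
    reference = manuscript_text.lower().split()
    hypothesis = review_text.lower().split()
    return meteor_score([reference], hypothesis)

print(review_relevance(
    "We propose a reinforcement learning method for generating peer reviews.",
    "The paper proposes a reinforcement learning approach; evaluation is limited.",
))
```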

4. Model Performance and Feedback Characteristics

REMOR-H and REMOR-U respectively maximize human-aligned and uniform synthetic rewards:

| Model Variant | Uniform Reward | Human-aligned Reward | Review Quality Characteristics |
|---|---|---|---|
| REMOR-U | 3.884 | Lower than REMOR-H | Concise, substantive, broad aspect coverage |
| REMOR-H | Lower than REMOR-U | 0.670 | Redundant, manuscript-matching, METEOR-weighted |
| Human reviews | ~1.8 | ~0.33 | Highly variable; long tail of low-quality scores |

Both REMOR variants consistently outperform human reviews, non-reasoning LLMs, and state-of-the-art agentic AI review systems by over a factor of two in average reward. REMOR-U generates focused and direct feedback addressing key paper dimensions; REMOR-H sacrifices some aspect richness for higher METEOR relevance. Variance analyses indicate REMOR-U provides more reliable (lower variance) review quality compared to human reviewers.

5. Implications for Peer Review and Scholarly Automation

REMOR demonstrates that explicit reasoning and targeted multi-objective RL can correct the superficiality seen in prior AI reviewer models. By operationalizing a balanced reward across all peer review dimensions and training on context-rich, reasoning-enriched examples, REMOR avoids the "long tail" of low-quality reviews and matches best-in-class human feedback. The released HPRR, PeerRT, and REMOR models establish a reproducible pipeline for further research.

Potential further uses include:

  • Automated reviewer assignment and selection based on aspect coverage.
  • Real-time peer review scoring for editorial processes.
  • Self-assessment tools for human reviewers using HPRR metrics.

6. Dataset and Model Release

  • PeerRT: Reasoning-enriched review dataset with manuscript associations and aspect scores (ICLR reviews, GROBID extraction, Claude-generated traces).
  • HPRR: Multi-aspect, sentence-normalized peer review reward function, with weights derived from human preference data.
  • REMOR Models: SFT and GRPO-optimized variants available on Hugging Face, with full codebase for reproducibility (https://github.com/Khempawin/remor.git).

7. Summary

REMOR establishes a technical foundation for automated, reliable, and aspect-balanced peer review generation via LLM reasoning and multi-objective reinforcement learning. By integrating deep reward modeling aligned with explicit human preferences, domain-specific fine-tuning, and efficient generation, REMOR sets new standards for both quantitative and qualitative review quality in AI-driven scholarly evaluation (Taechoyotin et al., 16 May 2025).
