ReviewerToo: AI-Assisted Peer Review Framework
- ReviewerToo is a modular, agent-based system designed to enhance AI-assisted peer review with diverse reviewer personas and systematic evaluation.
- It employs a five-stage pipeline integrating PDF parsing, literature review, persona-based evaluation, rebuttal drafting, and meta-review aggregation.
- Empirical validation on ICLR submissions shows near-human accuracy and improved review quality through ensemble consensus and structured scoring.
The ReviewerToo framework is a modular, agent-based system for AI-assisted peer review, designed to augment and systematize scientific paper evaluation in large-scale conference settings. Developed to address the long-standing problems of inconsistency, subjectivity, and scalability in traditional peer review, ReviewerToo models the review process as an orchestrated pipeline of specialized LLM agents with diverse reviewer personas, explicit evaluation criteria, and hybrid human–AI integration. The framework has been validated on thousands of ICLR 2025 submissions, providing quantitative evidence that AI-led reviews can achieve near-human predictive performance and high-quality feedback, while clarifying both the strengths and limitations of automated reviewers (Sahu et al., 9 Oct 2025).
1. Motivations and Foundational Objectives
Peer review at premier machine learning conferences suffers from low inter-reviewer agreement, random variation in accept/reject decisions, and reviewer fatigue due to massive submission volumes (e.g., 11,672 at ICLR 2025). Sources of inconsistency include idiosyncratic reviewer biases and divergent philosophies (e.g., theorist vs. empiricist), leading to unpredictable outcomes. The ReviewerToo framework seeks to resolve these issues via:
- Systematic Reproducibility: Modular, experiment-friendly architecture supporting controlled evaluation of AI-assisted reviewing.
- Persona Diversity: Encoding multiple reviewer stances using persona-specific system prompts to capture real-world heterogeneity.
- Structured Evaluation: Rubric-based criteria—such as novelty, soundness, clarity, impact—explicitly grounded in the submission or external literature.
- Ensemble Consensus: Aggregation of multiple persona judgments to yield calibrated, bias-resistant decisions.
- Human–AI Complementarity: Partial or full integration pathways, preserving complex evaluative judgments for domain experts.
2. Framework Architecture and Modules
ReviewerToo operates as a five-stage pipeline of coordinated LLM agents:
| Stage | Functionality | Output Type |
|---|---|---|
| Ingestion & Conversion | PDF/LaTeX parsing, summarization | Markdown summary |
| Literature Review Agent | Semantic Scholar retrieval, summary | Literature summary |
| Reviewer Agents | Persona-based rubric reviews, recommendations | Structured reviews |
| Author Agent | Rebuttal drafting using reviewer feedback | Response text |
| Metareviewer Agent | Fact-checking, aggregation, final decision | Meta-review, decision |
- Persona Specification: Reviewer agents use prompts encoding stance (critical, permissive), methodology (empiricist, theorist), or style (pedagogical), combined with official ICLR 2025 guidelines and, optionally, a literature summary.
- Structured Evaluation Criteria: Each review is generated in a template capturing summary, strengths, weaknesses, rubric scores for core criteria, and an accept/reject recommendation. Grounding is enforced by requiring citation of the manuscript or literature in every claim.
- Aggregation Engine: The metareviewer synthesizes all reviews—pre- and post-rebuttal—fact-checks assertions, discards unverified facts, assigns significance weights, and issues a final, calibrated summary and recommendation.
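The module descriptions above translate naturally into a thin orchestration layer. The sketch below shows one plausible wiring for stages 3–5, with stages 1–2 supplying the inputs; the class names, prompt text, and the `call_llm`/`parse_structured_review` stubs are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass

# Conference instructions included in every reviewer prompt (abbreviated here).
GUIDELINES = "Follow the ICLR 2025 reviewer guidelines: assess novelty, ..."

# Persona system prompts encoding stance, methodology, or style (illustrative).
PERSONAS = {
    "critical_empiricist": "You are a skeptical, experiment-driven reviewer.",
    "permissive_theorist": "You are a theory-minded reviewer who rewards ideas.",
    "pedagogical": "You are a reviewer who prizes clarity and exposition.",
}

@dataclass
class StructuredReview:
    persona: str
    summary: str
    strengths: list[str]
    weaknesses: list[str]
    scores: dict[str, int]  # rubric scores: novelty, soundness, clarity, impact
    accept: bool            # binary recommendation

def call_llm(system: str, user: str) -> str:
    """Stub for the LLM backend (e.g., an API or local gpt-oss-120b serving)."""
    raise NotImplementedError

def parse_structured_review(persona: str, raw: str) -> StructuredReview:
    """Stub: parse the templated LLM output into a StructuredReview."""
    raise NotImplementedError

def run_pipeline(paper_md: str, lit_summary: str) -> tuple[list[StructuredReview], str]:
    """Stages 3-5; stages 1-2 (PDF parsing, literature retrieval) produce the inputs."""
    # Stage 3: one structured, grounded review per persona.
    reviews = [
        parse_structured_review(
            name, call_llm(f"{stance}\n{GUIDELINES}\n{lit_summary}", paper_md)
        )
        for name, stance in PERSONAS.items()
    ]
    joined = "\n\n".join(r.summary for r in reviews)
    # Stage 4: the author agent drafts a rebuttal to the reviews.
    rebuttal = call_llm("You are the authors; address each weakness.", joined)
    # Stage 5: the metareviewer fact-checks, discards unverified claims,
    # weighs significance, and issues a calibrated meta-review and decision.
    meta = call_llm("You are the meta-reviewer; fact-check and aggregate.",
                    f"{joined}\n\n[Rebuttal]\n{rebuttal}")
    return reviews, meta
```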
3. Key Algorithmic Components
Central to ReviewerToo are standardized formulations for decision-making and measuring agreement:
- Accept/Reject Classification: Each of the $n$ reviewer personas outputs a binary decision $d_i \in \{0, 1\}$ (reject, accept). The ensemble decision is

  $$\hat{y} = f(d_1, \ldots, d_n),$$

  where $f$ is majority voting, i.e., $\hat{y} = \mathbb{1}\left[\sum_{i=1}^{n} d_i > n/2\right]$.
- Consistency Score: Agreement among personas is

  $$C = \frac{1}{n} \max\left(\sum_{i=1}^{n} d_i,\; n - \sum_{i=1}^{n} d_i\right),$$

  with $C = 1$ for unanimous decisions.
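Both computations are a few lines of code. A minimal sketch with decisions encoded as 0/1 integers (ties under even $n$ resolve to reject here, one reasonable convention):

```python
def ensemble_decision(decisions: list[int]) -> int:
    """Majority vote over binary persona decisions (1 = accept, 0 = reject)."""
    accepts = sum(decisions)
    return int(accepts > len(decisions) / 2)  # strict majority; tie -> reject

def consistency(decisions: list[int]) -> float:
    """Fraction of personas agreeing with the majority; 1.0 when unanimous."""
    n = len(decisions)
    accepts = sum(decisions)
    return max(accepts, n - accepts) / n

# Example: 5 personas, 4 accept -> decision is accept, consistency is 0.8.
assert ensemble_decision([1, 1, 1, 1, 0]) == 1
assert consistency([1, 1, 1, 1, 0]) == 0.8
```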
Evaluation uses predictive accuracy (binary and five-way), macro-averaged precision/recall/F1, Cohen's $\kappa$ for agreement, and review quality via Elo ratings from LLM-based judges on multiple dimensions.
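These metrics are all standard; a minimal sketch using scikit-learn, with toy labels (the Elo tournament over LLM judges is paper-specific and not reproduced here):

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support)

# Hypothetical predictions vs. ground-truth decisions for a handful of papers.
y_true = [1, 0, 1, 1, 0, 0]   # human accept/reject decisions
y_pred = [1, 0, 1, 0, 0, 1]   # ensemble (meta) decisions

acc = accuracy_score(y_true, y_pred)                    # binary accuracy
kappa = cohen_kappa_score(y_true, y_pred)               # chance-corrected agreement
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)   # macro P/R/F1
print(f"acc={acc:.2f} kappa={kappa:.2f} macro-F1={f1:.2f}")
```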
4. Empirical Validation and Performance
ReviewerToo was benchmarked on the ICLR-2k dataset (1,963 papers) sampled across decision strata and reviewer score quantiles. The gpt-oss-120b model (zero-/few-shot) was used for all agent roles, with orchestration on 8×H100 GPUs (no fine-tuning).
Quantitative Outcomes:
- Binary accept/reject accuracy: Meta(all) = 81.8%, Human(avg) = 83.9%.
- Five-way F1: Meta(all) = 28.1%, Human(avg) = 13.7%.
- Review quality (Elo): Meta(all) = 1657 vs. Human(avg) = 540; the best human reviewers (top 1%) reach 1316.
Domain-Specific Observations:
- High Elo ratings for fact-checking, literature coverage, and clarity of summary.
- Systematic weaknesses in assessing methodological novelty and distinguishing fine-grained acceptance tiers (e.g., Oral vs. Spotlight).
Ablation Findings:
- Removing conference instructions or literature summaries substantially degrades both predictive accuracy and Elo ratings.
- Conditioning on rebuttals alone increases the system’s deference to authors (“sycophancy”), raising the false-positive rate.
5. Integration Guidelines and Hybrid Pipeline Design
For practical adoption, ReviewerToo provides evidence-based integration recommendations:
- Stepwise Pipeline:
  - Deploy ReviewerToo as an auxiliary (secondary) reviewer providing structured “second opinions.”
  - Aggregate 3–5 diverse persona reviews for ensemble decisions.
  - Use the metareviewer to synthesize human and AI feedback into a single draft (a prompt-assembly sketch follows this list).
  - Focus area-chair oversight on borderline and high-impact decisions; let routine cases rely on hybrid consensus.
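For the synthesis step, one plausible prompt-assembly helper; the wording and provenance tags are assumptions, not the paper's exact prompts:

```python
def build_metareview_prompt(human_reviews: list[str],
                            ai_reviews: list[str],
                            rebuttal: str) -> str:
    """Assemble a fact-check-and-aggregate prompt over all available reviews.

    Labeling the provenance of each review lets the metareviewer weigh
    human expertise against AI coverage, per the hybrid-pipeline guidance.
    """
    parts = ["You are a meta-reviewer. Fact-check each claim against the "
             "manuscript, discard unverified assertions, and issue one "
             "calibrated recommendation."]
    for i, r in enumerate(human_reviews, 1):
        parts.append(f"[Human review {i}]\n{r}")
    for i, r in enumerate(ai_reviews, 1):
        parts.append(f"[AI persona review {i}]\n{r}")
    parts.append(f"[Author rebuttal]\n{rebuttal}")
    return "\n\n".join(parts)
```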
Operational Best Practices:
- Combine multiple AI reviewers, not single personas.
- Structure all agent prompts with conference guidelines and literature summaries.
- Calibrate rebuttal-handling to prevent over-acceptance.
- Monitor both decision accuracy and review quality (e.g., Elo ratings).
- Track persona bias trends (e.g., via Cohen’s $\kappa$) and adapt aggregation weights accordingly (see the sketch below).
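To make the last practice concrete, a sketch that turns per-persona agreement (Cohen's $\kappa$ against final decisions) into aggregation weights; the clip-and-normalize scheme is one simple choice, not prescribed by the paper:

```python
from sklearn.metrics import cohen_kappa_score

def persona_weights(persona_decisions: dict[str, list[int]],
                    final_decisions: list[int]) -> dict[str, float]:
    """Weight each persona by its chance-corrected agreement with outcomes.

    Personas with systematic bias (low or negative kappa) are down-weighted
    in future ensembles; weights are clipped at zero and normalized to sum to 1.
    """
    kappas = {p: cohen_kappa_score(d, final_decisions)
              for p, d in persona_decisions.items()}
    clipped = {p: max(k, 0.0) for p, k in kappas.items()}
    total = sum(clipped.values()) or 1.0  # guard against all-zero weights
    return {p: k / total for p, k in clipped.items()}

# Example: the permissive persona over-accepts and ends up down-weighted.
weights = persona_weights(
    {"critical": [1, 0, 0, 1], "permissive": [1, 1, 1, 1]},
    [1, 0, 0, 1],
)  # -> {"critical": 1.0, "permissive": 0.0}
```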
6. Limitations, Open Challenges, and Extensions
ReviewerToo demonstrates near-human accept/reject accuracy and superior review structuring, but several open issues remain:
- Fine-grained Calibration: Current models struggle to reliably separate acceptance tiers beyond reject/accept.
- Rebuttal Dynamics: Post-rebuttal agents exhibit sycophantic tendencies, requiring improved prompt engineering to limit undue deference.
- Multi-Turn Deliberation: There is a need to develop stable multi-turn reviewer panels without context drift.
- Domain Adaptation: Extending the system to new scientific areas (e.g., theoretical physics, biomedicine) may require persona specialization and adapted guidelines.
- Bias Mitigation: Research is ongoing into weighting and correcting systematic biases of individual personas.
ReviewerToo establishes a robust foundation for scalable, hybrid peer-review systems tailored to the growth of scientific publishing and the preservation of nuanced human judgment (Sahu et al., 9 Oct 2025).