ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews

Published 11 Mar 2025 in cs.CL | (2503.08506v2)

Abstract: Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers' judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages LLMs to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers: summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.


Authors (5)

Summary

Bridging Human and AI-Generated Paper Reviews with ReviewAgents

This paper introduces a comprehensive approach to automating academic paper reviews through a novel framework named ReviewAgents. The process of reviewing academic papers is critical yet increasingly cumbersome due to the proliferating volume of submissions. The authors propose leveraging LLMs as a viable solution to this challenge, aiming to produce reviews that align closely with those generated by human experts.

Core Contributions and Methodology

Review-CoT Dataset Development: At the heart of this effort is the Review-CoT dataset, a collection of 142,324 review comments on 37,403 academic papers sourced from open review platforms. The dataset is structured to emulate the reasoning process of human reviewers: summarizing the paper, evaluating its strengths and weaknesses, and deriving a conclusion that reflects its novelty and relevance. It also incorporates references to relevant literature, mirroring the practice of human reviewers, which supports the relevant-paper-aware training method.
Structured Reasoning and Multi-Agent Framework: Existing methods typically have an LLM generate review comments directly, which the authors consider inadequate. ReviewAgents instead uses a multi-agent system that structures the LLMs' reasoning to follow the sequence of steps a human reviewer takes. The system employs multiple roles, including reviewer agents and an area chair agent, to simulate the multi-step and collaborative nature of peer review. Synthesizing multiple reviews into a cohesive meta-review mitigates the biases inherent in single-agent reviews.
Benchmarking with ReviewBench: To critically assess the effectiveness of ReviewAgents, the authors develop ReviewBench, a benchmarking framework that evaluates LLM-generated reviews across dimensions like language diversity, semantic consistency, and sentiment alignment. It also introduces the Review Arena, a tournament-style evaluation that ranks review comments based on their alignment with human-generated reviews.
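
The staged reasoning and the reviewer/area-chair roles described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stage names in STAGES are taken from the abstract's description, while the prompt wording, the Review record, the function names, and the echo_model stand-in (used in place of a real LLM call) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed stage order, based on the reasoning steps listed in the abstract.
STAGES = ["summary", "relevant_works", "strengths", "weaknesses", "conclusion"]

# Hypothetical model interface: any function from a prompt to a completion.
Model = Callable[[str], str]


@dataclass
class Review:
    summary: str
    relevant_works: str
    strengths: str
    weaknesses: str
    conclusion: str


def run_reviewer(model: Model, paper: str, related: List[str]) -> Review:
    """Produce one review stage by stage; each stage sees the earlier ones."""
    context = f"PAPER:\n{paper}\n\nRELATED:\n" + "\n".join(related)
    outputs = {}
    for stage in STAGES:
        outputs[stage] = model(f"{context}\n\nTASK: write the {stage} section.")
        # Append the finished stage so later stages can build on it.
        context += f"\n\n{stage.upper()}:\n{outputs[stage]}"
    return Review(**outputs)


def run_area_chair(model: Model, paper: str, reviews: List[Review]) -> str:
    """Merge several reviews into a single meta-review."""
    joined = "\n\n".join(
        f"REVIEW {i + 1}:\nStrengths: {r.strengths}\n"
        f"Weaknesses: {r.weaknesses}\nConclusion: {r.conclusion}"
        for i, r in enumerate(reviews)
    )
    return model(f"PAPER:\n{paper}\n\n{joined}\n\nTASK: write the meta-review.")


def review_paper(
    reviewers: List[Model], area_chair: Model, paper: str, related: List[str]
) -> Tuple[List[Review], str]:
    reviews = [run_reviewer(m, paper, related) for m in reviewers]
    return reviews, run_area_chair(area_chair, paper, reviews)


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM: returns the task line so the flow can be traced."""
    return prompt.rsplit("TASK: ", 1)[1]


if __name__ == "__main__":
    reviews, meta = review_paper(
        [echo_model] * 3, echo_model, "paper text", ["related abstract"]
    )
    print(len(reviews), meta)
```

Passing models in as plain callables keeps the sketch independent of any particular LLM API; in practice each reviewer could be a different model, which is one way to read the "multi-LLM" aspect of the framework.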

Empirical Findings

Experimental results demonstrate that the ReviewAgents framework markedly narrows the performance gap between machine-generated and human-generated reviews. The approach surpasses other state-of-the-art LLMs, achieving superior semantic and sentiment alignment as evidenced by metrics such as ROUGE and SPICE in the ReviewBench framework. The multi-agent nature of the framework, with optimal reviewer counts identified through ablation studies, contributes to balanced diversity and consistency in the generated reviews.
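
To make the kind of measurement concrete, the sketch below computes a simplified ROUGE-1 F1 score (unigram overlap between a generated and a human review) and a distinct-n ratio as a proxy for language diversity. Both are assumptions for illustration: the tokenizer is plain whitespace splitting with no stemming, and this summary does not state which diversity metric ReviewBench uses, so distinct-n stands in as a common choice rather than the benchmark's actual definition.

```python
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplification: lowercase whitespace tokens, no stemming or punctuation handling.
    return text.lower().split()


def rouge1_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram overlap between candidate and reference."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def distinct_n(text: str, n: int = 2) -> float:
    """Share of n-grams in the text that are unique (higher means more diverse)."""
    tokens = tokenize(text)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    generated = "the method is novel but the evaluation is limited"
    human = "the method is novel although the experiments are limited"
    print(rouge1_f1(generated, human), distinct_n(generated))
```

Reported results would normally come from an established implementation of these metrics; the point here is only to show what "semantic alignment" and "diversity" reduce to numerically.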

Implications and Future Directions

By providing a scalable and structured approach to academic reviews, ReviewAgents addresses critical bottlenecks in the scholarly peer review process, offering a potential tool for authors to pre-emptively assess and improve their work. This could also serve as a supplementary resource for human reviewers, enhancing review quality and efficiency. Future research directions could explore extending the dataset to encompass diverse academic disciplines and refining the joint training methodologies to seamlessly integrate the reasoning stages.

In conclusion, ReviewAgents is a step toward reconciling the capabilities of LLMs with the nuanced demands of academic peer review, and toward more robust automated scholarly evaluation systems.
