
Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Published 3 Oct 2023 in cs.LG, cs.AI, cs.CL, and cs.HC | (2310.01783v1)

Abstract: Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings find it especially hard to get timely feedback. With the breakthrough of LLMs such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback with human peer reviewer feedback in 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is larger for the weaker papers. We then conducted a prospective user study with 308 researchers from 110 US institutions in the fields of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4 generated feedback helpful/very helpful and 82.4% found it more beneficial than feedback from at least some human reviewers. While our findings show that LLM-generated feedback can help researchers, we also identify several limitations.

Citations (81)

Summary

  • The paper presents a robust empirical analysis showing GPT-4's feedback overlap with human reviews at 30-39%, indicating its potential to mimic conventional peer evaluations.
  • It details a novel automated pipeline processing thousands of manuscripts, demonstrating GPT-4's capacity for generating paper-specific and context-aware feedback.
  • User studies reveal that over half of surveyed researchers find GPT-4 feedback beneficial, highlighting its role in supplementing human critiques in scientific review.

Overview of "Can large language models provide useful feedback on research papers? A large-scale empirical analysis"

This paper addresses the utility of LLMs, specifically GPT-4, in providing feedback on scientific manuscripts, a burgeoning area of interest in computational science. As scholarly output grows and expertise becomes deeply specialized, obtaining high-quality, timely peer reviews has become increasingly challenging. The study undertakes a structured exploration of using LLMs to autonomously generate feedback on research papers and employs rigorous empirical methods to assess the viability of these models as a complement to traditional human review.

Methodology and Evaluation

The authors developed a sophisticated automated pipeline that leverages GPT-4 to generate feedback on scientific manuscripts. This system processes a paper's PDF and produces structured evaluations. They conducted retrospective and prospective analyses to compare GPT-4-generated feedback against human feedback.
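The pipeline described above can be sketched as two steps: build a structured review prompt from the paper's extracted text, then send it to the model. This is a minimal illustration, not the authors' released code; the function names and the `llm` callable are assumptions, though the four comment sections mirror those reported in the paper.

```python
def build_review_prompt(paper_text: str) -> str:
    """Assemble a structured review prompt from text extracted from a paper's PDF."""
    return (
        "Draft feedback for the following paper. Organize the comments into "
        "four sections: significance and novelty, potential reasons for "
        "acceptance, potential reasons for rejection, and suggestions for "
        "improvement.\n\n" + paper_text
    )

def generate_feedback(paper_text: str, llm) -> str:
    """Run one paper through the pipeline: prompt construction, then one LLM call.

    `llm` is any callable mapping a prompt string to a completion string,
    e.g. a thin wrapper around the GPT-4 chat API.
    """
    return llm(build_review_prompt(paper_text))
```

Keeping the model behind a plain callable makes the pipeline easy to test with a stub and to swap between model versions.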

In the retrospective study, a large dataset was employed: 3,096 papers from 15 Nature family journals and 1,709 papers from ICLR, together with their reviewer comments. The similarity of comments between GPT-4 and human reviewers was assessed, revealing overlap rates of 30.85% for Nature journals and 39.23% for ICLR, closely aligned with the overlaps between pairs of human reviewers (28.58% and 35.25%, respectively).
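The overlap statistic can be illustrated as a hit rate over extracted comments: the fraction of one reviewer's points that are matched by some point in the other review. The matching predicate is left pluggable here; this is a sketch of the metric's shape (the paper used an LLM-based semantic matcher rather than the exact-string matching shown in the usage example).

```python
def overlap_rate(comments_a, comments_b, is_match):
    """Fraction of comments in `comments_a` matched by at least one
    comment in `comments_b`, under a pairwise matching predicate."""
    if not comments_a:
        return 0.0
    hits = sum(any(is_match(a, b) for b in comments_b) for a in comments_a)
    return hits / len(comments_a)

# Toy usage with exact-string matching (the simplest possible predicate):
human = ["limited baselines", "no ablation study", "unclear notation"]
gpt = ["no ablation study", "unclear notation", "missing dataset details"]
rate = overlap_rate(human, gpt, lambda x, y: x == y)  # 2 of 3 matched
```

Note the metric is asymmetric: `overlap_rate(a, b, m)` need not equal `overlap_rate(b, a, m)` when the reviews differ in length.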

A prospective study involving 308 researchers from 110 US institutions explored user perceptions of GPT-4 feedback on their own papers. Over half of the respondents (57.4%) found GPT-4 feedback helpful or very helpful, and 82.4% viewed it as more beneficial than feedback from at least some human reviewers.

Key Findings

  1. Feedback Overlap: GPT-4's feedback showed a level of overlap with human reviewers comparable to that between two human reviewers, indicating that the model can mimic human-like feedback to a degree.
  2. Paper-Specific Feedback: Shuffling experiments demonstrated that GPT-4 feedback is specific rather than generic, as shuffling feedback between papers drastically reduced overlap.
  3. User Perception: Many researchers found GPT-4 feedback useful, particularly valuing its ability to highlight overlooked perspectives.
  4. Comment Emphasis: The study noted differences in focus, with GPT-4 offering more comments on implications but fewer on novelty, suggesting variability in the depth and scope of LLM feedback.
  5. Potential and Limitations: While LLM feedback can complement human review, it lacks depth in critiquing methodologies and providing actionable insights, underscoring the necessity of human expertise in scientific evaluation.
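The shuffling experiment in point 2 above can be sketched as a control: recompute GPT-human overlap after pairing each paper's human comments with GPT feedback generated for a *different* paper. A large drop relative to the true pairing indicates the feedback is paper-specific rather than generic. This is an illustrative reconstruction of the experimental logic, not the authors' code; the helper names are assumptions.

```python
import random

def shuffled_control(human_comments, gpt_comments, is_match, seed=0):
    """Return (true_overlap, shuffled_overlap) means across papers.

    `human_comments[i]` and `gpt_comments[i]` are the comment lists for
    paper i; the shuffled pass reassigns GPT feedback so no paper keeps
    its own (a derangement).
    """
    n = len(human_comments)
    rng = random.Random(seed)
    idx = list(range(n))
    # Reshuffle until no paper is paired with its own GPT feedback.
    while any(i == j for i, j in enumerate(idx)):
        rng.shuffle(idx)

    def rate(a, b):
        # Fraction of human comments matched by some GPT comment.
        return sum(any(is_match(x, y) for y in b) for x in a) / max(len(a), 1)

    true_mean = sum(rate(h, g) for h, g in zip(human_comments, gpt_comments)) / n
    shuf_mean = sum(rate(human_comments[i], gpt_comments[idx[i]]) for i in range(n)) / n
    return true_mean, shuf_mean
```

If the feedback were generic boilerplate, the two means would be close; the paper reports that shuffling drastically reduced the overlap.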

Implications and Future Directions

The integration of LLMs in scientific feedback mechanisms offers promising avenues for enhancing manuscript evaluation, particularly in scenarios where human reviews are delayed or inaccessible. This capability could democratize access to scientific feedback, aiding under-resourced researchers by providing immediate, albeit general, feedback during early manuscript stages.

Future research should focus on refining LLMs to generate more nuanced and specific feedback, potentially through integrating domain-specific training or hybrid models combining human expertise with machine efficiency. The responsible application of LLMs in scientific review remains pivotal, ensuring they augment rather than replace the nuanced critique achieved through human expertise.

Conclusion

The study contributes significantly to the discourse on AI's role in scientific processes, suggesting that while LLMs like GPT-4 hold potential in enhancing accessibility to feedback, human reviewers remain indispensable. The paper advocates for a balanced approach, leveraging LLMs for preliminary feedback while preserving the integral role of human insight in the scientific review process.
