LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
The paper "LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing" explores the capability of LLMs in the context of assisting NLP researchers with the review and meta-review processes of academic papers. The paper is primarily motivated by the dual trends of increasing adoption of LLMs for various tasks and the growing burdens on researchers to review a large and increasing number of submissions. The authors present a comprehensive examination of how LLMs perform as both reviewers and meta-reviewers, utilizing a newly constructed ReviewCritique dataset.
Dataset and Methodology
The ReviewCritique dataset is a core contribution of this paper. It comprises two key components: NLP paper submissions paired with both human-written and LLM-generated reviews, and detailed segment-level annotations. These annotations, produced by NLP experts, include deficiency labels and explanations, enabling a granular comparison of review quality.
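To make the dataset layout concrete, the sketch below shows one plausible way to represent a ReviewCritique entry in code. The class and field names (e.g. `segment_text`, `is_deficient`) are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReviewSegment:
    """One annotated segment of a review (field names are hypothetical)."""
    segment_text: str
    is_deficient: bool                 # expert-assigned deficiency flag
    explanation: Optional[str] = None  # expert rationale when deficient

@dataclass
class Review:
    """A single human- or LLM-written review, split into annotated segments."""
    author_type: str                   # "human" or "llm"
    segments: List[ReviewSegment] = field(default_factory=list)

@dataclass
class Submission:
    """A paper submission with its reviews and, when available, extras."""
    title: str
    reviews: List[Review] = field(default_factory=list)
    meta_review: Optional[str] = None
    author_rebuttal: Optional[str] = None
```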
The paper formulates two main research questions:
- How do LLM-generated reviews compare to those written by human reviewers in terms of quality and distinguishability?
- Can LLMs effectively identify deficiencies in individual reviews when acting as meta-reviewers?
To address these questions, the dataset includes not only initial submissions and their corresponding reviews but also meta-reviews and author rebuttals where available. Annotation and quality control followed a rigorous process carried out by highly experienced NLP researchers.
Experimental Results
LLMs as Reviewers
The analysis revealed several nuanced insights into how well LLMs perform in generating reviews compared to human reviewers. Key findings include:
- Error Type Analysis: Human reviewers are prone to errors such as misunderstanding paper content and neglecting crucial details. In contrast, LLMs frequently introduce errors such as out-of-scope suggestions and superficial comments, indicating a lack of depth and paper-specific critique.
- Review Component Analysis: LLMs performed relatively well in summarizing papers, producing fewer inaccuracies in the summary sections than human reviewers. However, LLMs tend to accept authors' claimed strengths uncritically and to provide generic, unspecific feedback on weaknesses and writing quality.
- Recommendation Scores: LLMs tended to assign higher scores across the board, failing to distinguish effectively between high-quality and lower-quality submissions.
- Review Diversity: Measured with the ITF-IDF metric, human reviews showed higher diversity than LLM-generated reviews. Furthermore, LLMs exhibited high inter-model similarity, suggesting that combining multiple LLMs does not meaningfully increase review diversity (an illustrative similarity computation is sketched after this list).
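The paper's ITF-IDF metric is not reproduced here; as a rough, illustrative proxy for the same idea, the snippet below measures pairwise lexical similarity among reviews of one paper using standard TF-IDF vectors and cosine similarity, where higher average similarity indicates lower diversity.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(reviews: list[str]) -> float:
    """Average cosine similarity over all review pairs for one paper.

    Illustrative proxy only, not the paper's ITF-IDF metric: higher values
    mean the reviews overlap more lexically (i.e., are less diverse).
    """
    vectors = TfidfVectorizer(stop_words="english").fit_transform(reviews)
    sims = [
        cosine_similarity(vectors[i], vectors[j])[0, 0]
        for i, j in combinations(range(len(reviews)), 2)
    ]
    return sum(sims) / len(sims)

# Hypothetical reviews of the same submission
reviews = [
    "The method is novel but the evaluation lacks strong baselines.",
    "Interesting idea; however, baseline comparisons are missing.",
    "The paper should compare against stronger baselines and add ablations.",
]
print(f"Mean pairwise similarity: {mean_pairwise_similarity(reviews):.3f}")
```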
LLMs as Meta-Reviewers
The paper evaluated the performance of closed-source models (GPT-4, Claude Opus, Gemini 1.5) and open-source models (Llama3-8B, Llama3-70B, Qwen2-72B) in identifying deficient segments in human-written reviews. The results indicate that:
- Even top-tier LLMs struggle to match human meta-reviewers in identifying and explaining deficiencies in reviews.
- Precision and Recall: LLMs achieved modest recall in identifying deficient segments, but precision was relatively low, producing many false positives (see the sketch after this list).
- Explanation Quality: Claude Opus achieved the highest scores in providing explanations, but overall, LLMs struggled to articulate reasoning comparable to human experts.
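As a concrete illustration of how segment-level precision and recall might be computed in this setting, the function below compares predicted deficient-segment indices against expert-annotated ones. The exact matching criterion used in the paper (e.g. how partial overlaps or explanations are scored) may differ; this exact-match version is a simplification.

```python
def precision_recall(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Segment-level precision/recall for deficient-segment identification.

    predicted: segment indices the LLM meta-reviewer flags as deficient.
    gold: segment indices the human experts annotated as deficient.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: the model flags many segments but few are truly
# deficient, yielding the low-precision / modest-recall pattern noted above.
pred, gold = {0, 2, 3, 5, 7}, {2, 7, 9}
p, r = precision_recall(pred, gold)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.40, recall=0.67
```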
Implications and Future Directions
The findings of this paper have significant implications for the integration of AI in academic peer review processes. While LLMs show promise in generating summaries and offering some level of assistance in review tasks, their current capabilities fall short of fully replacing human expertise in both reviewing and meta-reviewing. The high incidence of generic and superficial feedback from LLMs, along with their difficulty in identifying nuanced deficiencies, highlights the need for continued human oversight.
Practically, LLMs could serve as preliminary reviewers, providing initial feedback that can be refined by human experts. This hybrid approach might alleviate some of the workload on human reviewers while ensuring the high standards of peer review are maintained.
Theoretically, the paper underscores areas for future research in enhancing LLMs' ability to understand and critique domain-specific content. Developing models with deeper reasoning and contextual understanding remains necessary.
Conclusion
The paper "LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing" provides a meticulous evaluation of the role LLMs can play in the academic peer review process. The constructed ReviewCritique dataset offers a valuable resource for ongoing research in AI-assisted peer review and benchmarking. The paper's findings encourage cautious optimism about the benefits of LLMs, with a clear recognition of their current limitations and the need for comprehensive human oversight. Moving forward, the integration of AI in peer review will likely involve a collaborative approach, leveraging both the efficiency of LLMs and the nuanced judgment of human experts.