Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Published 21 Apr 2026 in cs.CL, cs.AI, cs.DL, and cs.IR | (2604.19578v1)

Abstract: With the rapid advancement of LLMs, the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along evaluation aspects such as clarity and originality. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.


Authors (4)

Summary

The paper shows that LLM assistance produces longer, more fluent reviews with reduced lexical complexity.
The methodology employs fine-grained, aspect-based linguistic analysis and an MLE-based detection pipeline on ICLR and NeurIPS data.
Findings indicate LLMs improve clarity but may dilute critical evaluation of originality and replicability in peer reviews.

Fine-Grained Analysis of LLM Impact on Peer Review in Top AI Conferences

Introduction

The paper "Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI" (2604.19578) presents an empirical study of how LLMs are altering the structure and content of peer review reports in two flagship AI venues, ICLR and NeurIPS. The central focus is a multi-level, aspect-based linguistic analysis of reviews written before and after LLM adoption, with particular attention to the differential effects of LLM assistance on review quality, reviewer expressiveness, and evaluative frameworks.

Methodology

The study leverages peer review corpora spanning ICLR 2017–2025 and NeurIPS 2016–2024, enabling temporal comparisons bracketing the introduction of models such as ChatGPT. The analysis operates at three granularities:

Linguistic Complexity: Lexical and syntactic sophistication are quantified using standardized tools (TAALES, TAASSC). Metrics track word and sentence length, n-gram frequency, use of nominal subjects, and clause composition.
Aspect-Based Content Analysis: Review texts are auto-annotated for eight evaluation aspects (summary, clarity, originality, soundness, substance, replicability, meaningful comparison, motivation), using a refined BERT-based sequence labeling model (92.75% accuracy).
LLM Assistance Detection: An MLE-based detector, using expert- and LLM-authored reference corpora, is applied to filter reviews likely affected by automated rewriting or generation. Lexicon-based prefilters identify high-probability LLM linguistic markers.

This fine-grained design supports side-by-side comparison between LLM-assisted and non-assisted reviews, and enables correlation analysis with reviewer-assigned scores and confidence.
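The surface-level part of the linguistic analysis (word and sentence length) can be illustrated with a minimal sketch. This is not the TAALES or TAASSC tooling named above, which computes far richer indices; the function name `surface_features`, the regex-based sentence splitter, and the word pattern are all assumptions made for illustration.

```python
import re


def surface_features(review_text):
    """Return simple length-based features for one review (illustrative only)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", review_text.strip()) if s]
    # Assumption: a word is a run of ASCII letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", review_text)
    n_sent = len(sentences)
    n_words = len(words)
    return {
        "sentence_count": n_sent,
        "word_count": n_words,
        "mean_sentence_length": n_words / n_sent if n_sent else 0.0,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


example = surface_features("The paper is clear. The experiments are weak.")
print(example)
```

Computing such features per review, and then averaging them by year or by reviewer confidence level, is one way the pre- and post-LLM comparisons described here could be set up.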

Results

Linguistic and Structural Changes

Text Length and Fluency: LLM-assisted reviews are consistently longer (word and sentence count) and more fluent. This effect is magnified for reviewers self-reporting lower confidence, suggesting LLMs serve as a compensatory writing scaffold. However, reviewers with higher confidence produce increasingly concise reports post-LLM adoption, indicating an emerging bifurcation in review styles.

Lexical and Syntactic Sophistication: There is a measurable decline in lexical complexity post-LLM, with increased reliance on nominal subjects and more standardized clause structures. Auxiliary verb frequency and use of diverse phrasal constructions are notably reduced, reflecting the templated, formal genre preferred by contemporary LLM outputs. Bigram/trigram frequency also changes, indicating a drift toward canonicalized academic phraseology.

Aspect-Based Content Shifts

Over-Emphasis on Summaries and Clarity: LLM-assisted reviews show a substantial increase in the proportion of text dedicated to summaries and surface clarity. The aspect ‘summary’ dominates content allocation, with LLM-assisted reviewers especially likely to reproduce or expand upon the paper's own abstract and introduction sections, often at the expense of deeper argumentation.

Decreased Originality, Replicability, and Comparative Assessment: Aspect mining demonstrates a decline in explicit evaluation of originality, replicability, and meaningful comparison in LLM-assisted reviews. This trend holds both across time and in matched cohort comparisons between LLM-assisted and non-assisted groups. Non-LLM reviews retain higher density of originality and replicability analysis, despite shorter overall length.

Confidence-Aspect Interactions: After LLM adoption, low-confidence reviewers substantially increase their focus on summary, likely offloading analytical synthesis to LLM outputs. High-confidence reviewers, while also affected, maintain a relative emphasis on soundness and substance but with less detailed original analysis.
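The aspect shares discussed in this subsection can be pictured as per-review proportions of sentence-level labels. The sketch below assumes the labels have already been produced by some classifier. The aspect names come from the study's eight-aspect scheme, while the function name `aspect_shares` and the toy label lists are hypothetical.

```python
from collections import Counter

# The eight evaluation aspects listed in the methodology.
ASPECTS = (
    "summary", "clarity", "originality", "soundness",
    "substance", "replicability", "meaningful comparison", "motivation",
)


def aspect_shares(sentence_labels):
    """Return the fraction of labeled sentences devoted to each aspect."""
    # Labels outside the eight-aspect scheme are ignored.
    counts = Counter(label for label in sentence_labels if label in ASPECTS)
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in ASPECTS}


# Toy inputs (not real data): one summary-heavy review, one more balanced.
summary_heavy = aspect_shares(["summary", "summary", "summary", "clarity"])
balanced = aspect_shares(["summary", "originality", "soundness", "replicability"])
print(summary_heavy["summary"], balanced["summary"])
```

Comparing these shares between cohorts, rather than raw sentence counts, keeps the comparison meaningful even when one group writes longer reviews.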

Association with Reviewer Scores and Confidence

Spearman correlation analyses reveal uniformly weak associations between the diversity or sentiment of aspect mentions and both overall reviewer score and reviewer confidence. LLM-assisted reviews do not systematically shift scores higher or lower after adoption, which contradicts recent speculation that automated text tools might inflate acceptance rates (Latona et al., 2024). Notably, clarity is negatively correlated with reviewer confidence, indicating that even with improved surface fluency, LLM assistance does not increase actual critical certainty.
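For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation of the ranks of two variables, with tied values sharing their average rank. The sketch below is a dependency-free illustration of that definition, not the authors' analysis code; the function names and the toy values are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)


# Toy values (not real data): number of distinct aspects vs. overall score.
aspect_diversity = [1, 2, 3, 4, 5]
overall_score = [2, 4, 6, 8, 10]
print(spearman(aspect_diversity, overall_score))
```

Because the statistic depends only on ranks, it suits ordinal quantities such as review scores and confidence levels, where the spacing between values is not meaningful.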

LLM Assistance Detection

Applying the MLE-based detection pipeline reveals that a non-trivial proportion (≥15%, in line with previous estimates [Liang et al., 2024; Latona et al., 2024]) of reviews in 2024–2025 are likely LLM-assisted, with this share growing yearly. The combination of aspect drift and surface normalization is magnified in the detected cohort, which lends support to the identification methodology.
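One common way to frame MLE-based estimation of an LLM-assisted share is as a two-component mixture: each review is assumed to come from a human reference distribution with probability 1 − α and from an LLM reference distribution with probability α, and α is chosen to maximize the corpus log-likelihood. The sketch below assumes that formulation; the summary does not spell out the detector's internals, so this may differ from the pipeline actually used. The function names, the grid search, and the toy log-likelihood values are hypothetical.

```python
import math


def mixture_log_likelihood(alpha, log_p_human, log_q_llm):
    """Corpus log-likelihood under (1 - alpha) * P_human + alpha * Q_llm."""
    total = 0.0
    for lp, lq in zip(log_p_human, log_q_llm):
        terms = []
        if alpha < 1.0:
            terms.append(math.log1p(-alpha) + lp)
        if alpha > 0.0:
            terms.append(math.log(alpha) + lq)
        # Log-sum-exp keeps the computation stable for very small likelihoods.
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def estimate_llm_share(log_p_human, log_q_llm, grid_size=1001):
    """Grid-search the mixture weight alpha in [0, 1] that maximizes likelihood."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda a: mixture_log_likelihood(a, log_p_human, log_q_llm))


# Toy per-review log-likelihoods (not real data): 8 human-like, 2 LLM-like.
log_p = [-10.0] * 8 + [-20.0] * 2  # under the human reference model
log_q = [-20.0] * 8 + [-10.0] * 2  # under the LLM reference model
alpha_hat = estimate_llm_share(log_p, log_q)
print(alpha_hat)  # close to 0.2 for this toy corpus
```

Note that an estimate of this kind describes a corpus-level share rather than a verdict on any single review, which is why a figure such as the ≥15% share above is best read as a population statistic.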

Theoretical and Practical Implications

From a theoretical perspective, this study provides evidence that LLMs, as deployed in current reviewing workflows, substantially modulate the form of academic peer review, making it more verbose, polished, and standardized, without corresponding gains in depth of evaluation or critical engagement with core scientific dimensions such as originality and replicability.

Practically, these findings suggest LLMs offer substantial value as linguistic aids, particularly for novice or domain-mismatched reviewers, but risk shifting review culture toward generic, summary-heavy templates and away from substantive critique. Notably, the use of LLMs does not significantly alter reviewer-assigned scores or confidence, which suggests that core decision metrics remain robust. The concern is instead the dilution of rigorous scientific debate in peer review reports.

The authors advocate for transparent, regulated integration of LLM assistance in reviewing (guidance rather than prohibition), with the goal of balancing workload alleviation against risks to diversity of evaluative styles. They further suggest that publishers and conference organizers should provide official LLM-based tools, restricting functionalities (e.g., aspect coverage, originality assessment) until the models demonstrate maturity in nuanced critical reasoning.

Future Directions

The study opens several lines for further research:

Development of LLMs capable of deeper critical analysis and originality assessment.
Causal analysis using controlled reviewer assignment and intervention studies, to move beyond correlational findings.
Cross-domain generalization, leveraging review datasets from biomedical, physical, and interdisciplinary sciences.
Integration of content-based and author/recency-based expert assessments to triangulate true review quality and informativeness.

Conclusion

This work finds that LLM adoption is producing reviews that are measurably longer, more fluent, and more standardized, but with diminished focus on deeper evaluative aspects such as originality and replicability. While LLM assistance can improve accessibility and reduce entry barriers for less confident reviewers, the evidence points to a trade-off between linguistic polish and substantive, critical review content. These findings provide an empirical basis for both future LLM design in academic workflows and policy decisions regarding the regulation of LLM use in scientific peer review (2604.19578).
