- The paper shows that LLM assistance produces longer, more fluent reviews with reduced lexical complexity.
- The methodology employs fine-grained, aspect-based linguistic analysis and an MLE-based detection pipeline on ICLR and NeurIPS data.
- Findings indicate LLMs improve clarity but may dilute critical evaluation of originality and replicability in peer reviews.
Fine-Grained Analysis of LLM Impact on Peer Review in Top AI Conferences
Introduction
The paper "Impact of LLMs on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI" (2604.19578) presents a comprehensive empirical study of how LLMs are altering the structure and content of peer review reports in flagship AI venues (ICLR and NeurIPS). The central focus is a multi-level, aspect-based linguistic analysis pre- and post-LLM adoption, with particular attention to the differential effects of LLM assistance on review quality, reviewer expressiveness, and evaluative frameworks.
Methodology
The study leverages peer review corpora spanning ICLR 2017–2025 and NeurIPS 2016–2024, enabling temporal comparisons bracketing the introduction of models such as ChatGPT. The analysis operates at three granularities:
- Linguistic Complexity: Lexical and syntactic sophistication are quantified using standardized tools (TAALES, TAASSC). Metrics track word and sentence length, n-gram frequency, use of nominal subjects, and clause composition.
- Aspect-Based Content Analysis: Review texts are auto-annotated for eight evaluation aspects (summary, clarity, originality, soundness, substance, replicability, meaningful comparison, motivation), using a refined BERT-based sequence labeling model (92.75% accuracy).
- LLM Assistance Detection: An MLE-based detector, using expert- and LLM-authored reference corpora, is applied to filter reviews likely affected by automated rewriting or generation. Lexicon-based prefilters identify high-probability LLM linguistic markers.
This fine-grained design supports side-by-side comparison between LLM-assisted and non-assisted reviews, and enables correlation analysis with reviewer-assigned scores and confidence.
Results
Linguistic and Structural Changes
Text Length and Fluency: LLM-assisted reviews are consistently longer (word and sentence count) and more fluent. This effect is magnified for reviewers self-reporting lower confidence, suggesting LLMs serve as a compensatory writing scaffold. However, reviewers with higher confidence produce increasingly concise reports post-LLM adoption, indicating an emerging bifurcation in review styles.
Lexical and Syntactic Sophistication: There is a measurable decline in lexical complexity post-LLM, with increased reliance on nominal subjects and more standardized clause structures. Auxiliary verb frequency and use of diverse phrasal constructions are notably reduced, reflecting the templated, formal genre preferred by contemporary LLM outputs. Bigram/trigram frequency also changes, indicating a drift toward canonicalized academic phraseology.
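To make the length and lexical measures above concrete, the following is a minimal sketch of a few surface-level proxies (word count, sentence count, mean sentence length, mean word length, type-token ratio). It is not TAALES or TAASSC, which compute far richer indices; the function name `surface_metrics` and the naive tokenization rules are assumptions made purely for illustration.

```python
import re


def surface_metrics(text: str) -> dict:
    """Compute simple surface proxies for review length and lexical variety.

    Hypothetical helper for illustration only; not the study's tooling.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    # Abbreviations such as "e.g." will be over-split by this rule.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased alphabetic tokens; an internal apostrophe is kept (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    n_words = len(words)
    return {
        "n_sentences": len(sentences),
        "n_words": n_words,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Distinct words divided by total words; sensitive to text length.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(surface_metrics("The paper is clear. The method is sound."))
```

Because the type-token ratio falls as texts get longer, comparing it across reviews of very different lengths would need a length-normalized variant; that caveat applies to this sketch, not necessarily to the indices used in the study.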
Aspect-Based Content Shifts
Over-Emphasis on Summaries and Clarity: LLM-assisted reviews show a substantial increase in the proportion of text dedicated to summaries and surface clarity. The aspect ‘summary’ dominates content allocation, with LLM-assisted reviewers especially likely to reproduce or expand upon the paper's own abstract and introduction sections, often at the expense of deeper argumentation.
Decreased Originality, Replicability, and Comparative Assessment: Aspect mining demonstrates a decline in explicit evaluation of originality, replicability, and meaningful comparison in LLM-assisted reviews. This trend holds both across time and in matched cohort comparisons between LLM-assisted and non-assisted groups. Non-LLM reviews retain higher density of originality and replicability analysis, despite shorter overall length.
Confidence-Aspect Interactions: After LLM adoption, low-confidence reviewers substantially increase their focus on summary, likely offloading analytical synthesis to LLM outputs. High-confidence reviewers, while also affected, maintain a relative emphasis on soundness and substance but with less detailed original analysis.
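The content-allocation comparisons in this section rest on the share of review text assigned to each aspect. A minimal sketch of that bookkeeping is shown below, assuming the sequence labeler has already produced `(aspect, text)` spans. The function name `aspect_shares`, the whitespace token count, and the example spans are hypothetical and not taken from the paper.

```python
# The eight evaluation aspects named in the methodology.
ASPECTS = (
    "summary", "clarity", "originality", "soundness",
    "substance", "replicability", "meaningful comparison", "motivation",
)


def aspect_shares(labeled_spans):
    """Return the fraction of labeled tokens devoted to each aspect.

    labeled_spans: iterable of (aspect, text) pairs, assumed to come from
    an upstream aspect tagger. Tokens are counted by whitespace splitting.
    """
    counts = {aspect: 0 for aspect in ASPECTS}
    for aspect, text in labeled_spans:
        if aspect not in counts:
            raise ValueError(f"unknown aspect: {aspect!r}")
        counts[aspect] += len(text.split())
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in ASPECTS}


if __name__ == "__main__":
    # Invented spans, for illustration only.
    review = [
        ("summary", "This paper proposes a new detector"),
        ("originality", "The idea is novel"),
    ]
    for aspect, share in aspect_shares(review).items():
        print(f"{aspect}: {share:.2f}")
```

Normalizing by total labeled tokens makes reviews of different lengths comparable, which matters here because LLM-assisted reviews are longer overall.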
Association with Reviewer Scores and Confidence
Spearman correlation analyses reveal uniformly weak associations between the diversity or sentiment of aspect mentions and both overall reviewer score and reviewer confidence. LLM-assisted reviews do not systematically shift scores higher or lower after adoption, which contradicts recent speculation that automated text tools might inflate acceptance rates (Latona et al., 2024). Notably, clarity is negatively correlated with reviewer confidence, indicating that improved surface fluency under LLM assistance does not translate into greater critical certainty.
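For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks, with tied values receiving their average rank. The sketch below implements it in plain Python; the helper names and the example data are hypothetical, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented values: share of text on 'clarity' vs. reviewer confidence.
    clarity_share = [0.10, 0.25, 0.25, 0.40]
    confidence = [5, 4, 3, 2]
    print(spearman(clarity_share, confidence))
```

Rank-based correlation is a natural fit for ordinal quantities such as review scores and confidence levels, since it does not assume the gaps between levels are equal.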
LLM Assistance Detection
Applying the MLE-based detection pipeline reveals that a non-trivial proportion (≥15%, in line with previous estimates [Liang et al., 2024; Latona et al., 2024]) of reviews in 2024–2025 are likely LLM-assisted, with this share growing yearly. Aspect drift and surface normalization are both more pronounced in the detected cohort, which supports the identification methodology.
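The summary above does not spell out the detector's internals. One common way to frame MLE-based detection with human and LLM reference corpora is as a two-component mixture: each review has a likelihood under a human reference distribution and under an LLM reference distribution, and the LLM share alpha is chosen to maximize the corpus log-likelihood. The sketch below fits alpha with expectation-maximization. The mixture formulation, the function name `estimate_llm_fraction`, and the input log-likelihoods are all assumptions for illustration, not the paper's confirmed implementation.

```python
import math


def estimate_llm_fraction(log_p_human, log_p_llm, n_iter=500, tol=1e-10):
    """Estimate the LLM share alpha in a two-component mixture via EM.

    Maximizes sum_i log((1 - alpha) * P_human(x_i) + alpha * P_llm(x_i)).
    Inputs are per-review log-likelihoods under each reference distribution
    (assumed to be supplied by some upstream model). Returns alpha and the
    per-review posterior probabilities from the last E-step.
    """
    if len(log_p_human) != len(log_p_llm) or not log_p_human:
        raise ValueError("need two equal-length, non-empty sequences")
    eps = 1e-12  # keeps alpha away from 0 and 1 so the logs stay finite
    alpha = 0.5
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior probability that each review is LLM-assisted,
        # computed in log space to avoid underflow.
        posteriors = []
        for lp, lq in zip(log_p_human, log_p_llm):
            a = math.log(alpha) + lq
            b = math.log(1.0 - alpha) + lp
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            posteriors.append(ea / (ea + eb))
        # M-step: the updated alpha is the mean posterior, clamped to (0, 1).
        new_alpha = min(max(sum(posteriors) / len(posteriors), eps), 1.0 - eps)
        converged = abs(new_alpha - alpha) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha, posteriors


if __name__ == "__main__":
    # Invented log-likelihoods: three human-like reviews, one LLM-like review.
    alpha, post = estimate_llm_fraction([0, 0, 0, -50], [-50, -50, -50, 0])
    print(round(alpha, 3), [round(p, 3) for p in post])
```

Under this framing, alpha gives a corpus-level estimate of the LLM-assisted share, while the per-review posteriors could serve as a soft filter for building the detected cohort.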
Theoretical and Practical Implications
From a theoretical perspective, this study provides evidence that LLMs, as deployed in current reviewing workflows, substantially modulate the form of academic peer review: reviews become more verbose, polished, and standardized, without corresponding gains in depth of evaluation or critical engagement with core scientific dimensions such as originality and replicability.
Practically, these findings suggest LLMs offer substantial value as linguistic aids, particularly for novice or domain-mismatched reviewers, but risk shifting review culture toward generic, summary-heavy templates and away from substantive critique. Notably, the use of LLMs does not significantly alter reviewer-assigned scores or confidence, indicating that core decision metrics remain robust, but raises concern about the dilution of rigorous scientific debate in peer review reports.
The authors advocate for transparent, regulated integration of LLM assistance in reviewing (guidance rather than prohibition), with the goal of balancing workload alleviation against risks to diversity of evaluative styles. They further suggest that publishers and conference organizers should provide official LLM-based tools, restricting functionalities (e.g., aspect coverage, originality assessment) until the models demonstrate maturity in nuanced critical reasoning.
Future Directions
The study opens several lines for further research:
- Development of LLMs capable of deeper critical analysis and originality assessment.
- Causal analysis using controlled reviewer assignment and intervention studies, to move beyond correlational findings.
- Cross-domain generalization, leveraging review datasets from biomedical, physical, and interdisciplinary sciences.
- Integration of content-based and author/recency-based expert assessments to triangulate true review quality and informativeness.
Conclusion
This work shows that LLM adoption is associated with reviews that are longer, more fluent, and more standardized, but with diminished focus on deeper evaluative aspects such as originality and replicability. While LLM assistance can improve accessibility and reduce entry barriers for less confident reviewers, there is clear evidence of a trade-off between linguistic polish and substantive, critical review content. These findings provide an empirical basis for both future LLM design in academic workflows and policy decisions regarding the regulation of LLM use in scientific peer review (2604.19578).