- The paper introduces an Anchor Embedding method that raises AI-generated review detection to true positive rates above 60% at a 0.1% false positive rate.
- It constructs a large-scale dataset of 788,984 paired reviews from top AI conferences to benchmark both commercial and open-source LLMs.
- The study also reveals significant qualitative differences between human and LLM reviews, raising concerns about fairness in peer review processes.
Benchmarking LLM Usage in Peer Review: Dataset Construction and AI Text Detection Performance
This paper addresses a critical and timely issue in scientific publishing: the potential for peer reviews to be secretly authored or edited by LLMs rather than by qualified human experts. The work focuses on establishing benchmark resources and evaluating algorithmic approaches to detect AI-generated text in the peer review context, which has distinct characteristics from broader factual or creative content detection tasks.
Dataset: Scale, Construction, and Coverage
A central contribution is the creation of a large-scale dataset of 788,984 paired peer reviews: each review is tied to the same paper and is either written by a human or generated by one of five state-of-the-art LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Qwen 2.5 72B, and Llama 3.1 70B). The authors sourced paper-review pairs from eight years of leading AI conferences (ICLR and NeurIPS) using the OpenReview API and the ASAP dataset, ensuring coverage across evolving review templates and conference guidelines. For each paper, matching AI-generated reviews were created via carefully constructed prompts that incorporate conference-specific templates, reviewer guidelines, and explicit alignment with the decision made in the human review.
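The matched-generation step might look roughly like the sketch below; the prompt wording, field names, and the choice of the OpenAI chat API are illustrative assumptions, not the authors' exact pipeline.

```python
from openai import OpenAI  # one of the five review-generating LLMs could sit behind this call

client = OpenAI()

def generate_matched_review(paper_text, review_template, reviewer_guidelines, human_decision):
    # Assemble a conference-specific prompt and ask the LLM for a review whose
    # overall recommendation is aligned with the paired human review's decision.
    prompt = (
        "You are a peer reviewer for this conference.\n\n"
        f"Review template:\n{review_template}\n\n"
        f"Reviewer guidelines:\n{reviewer_guidelines}\n\n"
        f"Your overall recommendation must be: {human_decision}\n\n"
        f"Paper:\n{paper_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```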
To further assess real-world adversarial scenarios, human reviews were also processed by LLMs under four staged levels of “editing,” simulating use cases where a reviewer might use an LLM for anything from light grammar checking to complete rewriting or content enhancement.
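A minimal sketch of this staged-editing setup is given below; the level wording and the `llm.complete` wrapper are placeholder assumptions, since the paper's exact instructions are not reproduced here.

```python
# Hypothetical prompts for the four staged editing levels described above,
# ranging from light grammar fixes to full rewriting with added content.
EDIT_LEVEL_INSTRUCTIONS = {
    1: "Correct only grammar and spelling in the review below; change nothing else.",
    2: "Improve the clarity and flow of the review below without altering its points.",
    3: "Rewrite the review below in your own words, keeping all judgments unchanged.",
    4: "Rewrite and expand the review below, adding strengths or weaknesses as you see fit.",
}

def edit_review(llm, human_review, level):
    # Apply one editing level to a human-written review via an LLM wrapper
    # (llm.complete is a placeholder interface, not a specific library call).
    return llm.complete(EDIT_LEVEL_INSTRUCTIONS[level] + "\n\n" + human_review)
```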
Methodological Innovations: Anchor Embedding Approach
The paper evaluates 18 open-source AI-text detection baselines applied in prior work, including approaches based on perplexity, entropy, syntactic or statistical features, and supervised fine-tuned classifiers. Empirically, these methods perform poorly in the peer review setting, especially when operating under the constraint of low false positive rates—which is vital to avoid unjustly accusing human reviewers.
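For context, a perplexity-style baseline of the kind evaluated here can be sketched with an open language model; the model choice and decision rule below are illustrative, not the specific baselines benchmarked in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open LM used only to score text; the specific model is an arbitrary choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def review_perplexity(review_text):
    # Perplexity of the review under the LM: unusually low values are the usual
    # signal that the text was machine-generated rather than human-written.
    inputs = tokenizer(review_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```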
To address these shortcomings, the authors introduce the “Anchor Embedding” methodology. This approach leverages the contextual one-to-one relationship between a review and the manuscript: for each paper, an “anchor” AI review is generated with a generic prompt using a known LLM. The semantic embedding of the test review (to be classified) is then compared by cosine similarity against this anchor. A threshold, learned on a calibration subset, is used for classification. Because the detector cannot know which LLM a suspect reviewer may have used, anchor reviews from multiple LLMs are generated and combined in a voting ensemble: a sample is classified as AI-generated if any anchor LLM produces a similarity above its threshold.
Pseudocode Sketch for Anchor Embedding Detection
```python
from openai import OpenAI  # embedding backend; the embedding model choice is illustrative
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

def embed(text):
    # Semantic embedding of a piece of text.
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def anchor_embedding_detect(test_review, paper_text, anchor_LLMs, thresholds):
    # Compare the test review against one generic-prompt anchor review per candidate LLM;
    # flag it as AI-generated if any similarity exceeds that LLM's calibrated threshold.
    emb_test = embed(test_review)
    for LLM, threshold in zip(anchor_LLMs, thresholds):
        anchor_review = LLM.generate_review(paper_text)  # anchor review for this paper
        emb_anchor = embed(anchor_review)
        score = cosine_similarity([emb_test], [emb_anchor])[0][0]
        if score > threshold:
            return 1  # classified as AI-generated
    return 0  # classified as human-written
```
This approach is computationally tractable with modern embedding APIs and can run efficiently in deployment settings where the full manuscript text is available.
Empirical Results
The quantitative results show that standard detection methods achieve true positive rates (TPR) below 20% at a 0.1% false positive rate (FPR) against advanced commercial LLMs (GPT-4o, Claude), while the Anchor Embedding approach achieves TPRs above 60% at the same 0.1% FPR. Notably, for open-source LLMs such as Llama or Qwen, simpler detection methods (e.g., Binoculars) achieve nearly perfect identification, reflecting a gap in detectability between open-source and commercial-grade AI reviewers.
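The TPR-at-fixed-FPR protocol behind these numbers can be expressed compactly; the following is a generic sketch of the metric, not the paper's evaluation code.

```python
import numpy as np

def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.001):
    # Choose the detection threshold so that only target_fpr of human-written
    # reviews would be flagged (0.1% here), then report the fraction of
    # AI-generated reviews whose detector scores clear that threshold.
    threshold = np.quantile(np.asarray(human_scores), 1.0 - target_fpr)
    return float(np.mean(np.asarray(ai_scores) > threshold))
```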
When evaluating detection of human-written reviews edited to varying degrees by LLMs, the Anchor approach shows stricter calibration than existing methods, flagging maximally edited reviews with higher confidence and maintaining low false positive rates for lightly edited content.
Analysis of Human vs. AI Review Characteristics
A manual audit of matched sets of human and GPT-4o reviews found qualitative differences: human reviews frequently cite specific figures, experiments, or related work, while LLM-generated reviews tend to be general, lacking citations or in-depth technical scrutiny. Quantitatively, across all tested LLMs, AI-generated reviews systematically assigned higher scores for soundness, contribution, presentation, and confidence—raising fairness and inflation concerns, especially if such reviews influence acceptance outcomes disproportionately.
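A score-inflation check of this kind reduces to a grouped mean comparison; the sketch below uses placeholder values and column names, not the paper's data.

```python
import pandas as pd

# Placeholder paired-review scores; in the actual dataset each paper has a
# human review and matched AI reviews with these rubric fields.
reviews = pd.DataFrame({
    "author": ["human", "ai", "human", "ai"],
    "soundness": [3, 4, 2, 3],
    "contribution": [2, 3, 3, 4],
    "presentation": [3, 3, 2, 4],
    "confidence": [4, 5, 3, 4],
})

# Mean score per rubric dimension by author type; a consistent gap favoring
# "ai" is the inflation pattern described above.
print(reviews.groupby("author").mean(numeric_only=True))
```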
Limitations and Practical Considerations
- Domain Coverage: The dataset, while large, only spans two leading AI conferences, limiting generalizability to other scientific communities or disciplines with different peer review norms.
- Operational Constraints: The proposed detection method relies on having access to the full manuscript at test time and necessitates calling LLMs to generate anchor reviews—a feasible workflow in manuscript management systems, but less so for lightweight document or snippet-level screening.
- Prompt Sensitivity: The detection results (for both baselines and the proposed method) are inherently influenced by the prompts used to elicit LLM-generated reviews; adversaries with advanced prompt engineering could reduce semantic similarity to anchor reviews, evading detectors.
Implications and Outlook
This work demonstrates that many widely deployed AI-text detection methods are insufficient in the context of peer review, particularly as reviewers leverage state-of-the-art commercial LLMs. The anchor-based methodology establishes a pragmatic baseline for real-world deployment within submission systems, though it comes at a moderate computational cost and requires access to the full paper text.
The dataset is a notable contribution, offering a resource for future research to benchmark detection strategies, experiment with more advanced ensemble or adversarial approaches, and evaluate evolving LLMs. As generative models improve and reviewers become more sophisticated in their AI usage (e.g., prompt tuning, adversarial paraphrasing), the arms race between detection and evasion will intensify. Effective governance for disclosures, reviewer education, and integration of detection mechanisms into peer review platforms remains a necessary complement to technical detection.
From an ethical and policy perspective, the paper substantiates concerns that undisclosed LLM involvement in peer review can substantially skew scientific assessment and acceptance outcomes. The social and institutional implications suggest the necessity for community norms, stronger editorial oversight, and continued vigilance against the invisible incursion of generative AI into critical scientific processes.