Generation–Review Asymmetry in Peer Review

Updated 6 May 2026

Generation–Review Asymmetry is the systematic gap between an LLM's ability to produce plausible peer review text and its ability to reproduce human evaluative reasoning.
Empirical analyses from GenReview and PeerPrism reveal that AI-generated reviews exhibit significant positive bias and calibration failures compared to human critiques.
This asymmetry challenges binary authorship detection, urging the development of multidimensional attribution frameworks to safeguard review integrity.

Generation–review asymmetry refers to the systematic divergence between the capability of LLMs to generate plausible peer review text and their ability to authentically replicate the evaluative depth, calibration, and instruction-following fidelity of human reviewers. This phenomenon becomes particularly salient as LLMs are deployed for scientific peer review, both as explicit agents of review generation and as silent assistants in hybrid workflows. The term encompasses empirical findings that AI-generated reviews—while structurally fluent and superficially convincing—reveal strong biases, calibration failures, and a disconnect between surface realization and substantive evaluative reasoning. This asymmetry also challenges the robustness of AI-authorship detection methods, especially where human intellectual contribution is separated from the surface form of the review text (Demetrio et al., 24 Oct 2025, Sadeghian et al., 16 Apr 2026).

1. Formalization and Taxonomy of Generation Regimes

Peer review generation in the age of LLMs spans multiple provenance regimes defined along two axes: the origin of evaluative ideas and the origin of surface textual realization. Let $R$ denote a set of peer reviews, each $r\in R$ annotated as $(O_{\rm idea}(r), O_{\rm text}(r))$:

Fully human reviews (HH): $O_{\rm idea}(r) = \text{Human}$ and $O_{\rm text}(r) = \text{Human}$
Fully synthetic (AI) reviews (AA): $O_{\rm idea}(r) = \text{AI}$ and $O_{\rm text}(r) = \text{AI}$
Hybrid reviews with human ideas and AI text (HA): $O_{\rm idea}(r) = \text{Human}$ and $O_{\rm text}(r) = \text{AI}$
Hybrid reviews with AI ideas and human text (AH): $O_{\rm idea}(r) = \text{AI}$ and $O_{\rm text}(r) = \text{Human}$ (not instantiated in PeerPrism)
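The two-axis annotation above maps naturally onto a small data structure. The following is a minimal sketch, not code from GenReview or PeerPrism; the names `Origin`, `Review`, and `regime` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    HUMAN = "Human"
    AI = "AI"


@dataclass(frozen=True)
class Review:
    """A review annotated with the origin of its ideas and of its text."""
    text: str
    idea_origin: Origin  # corresponds to O_idea(r)
    text_origin: Origin  # corresponds to O_text(r)


def regime(review: Review) -> str:
    """Return the two-letter regime code: ideas first, text second."""
    code = {Origin.HUMAN: "H", Origin.AI: "A"}
    return code[review.idea_origin] + code[review.text_origin]


# Hypothetical example: a human critique whose wording was rewritten by an LLM.
example = Review("The ablation in Section 4 is incomplete.", Origin.HUMAN, Origin.AI)
print(regime(example))  # HA
```

Keeping the two origins as separate fields, rather than storing a single label, makes it possible to ask about idea provenance and text provenance independently.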

The GenReview dataset operationalizes large-scale generation of LLM-written peer reviews via three-branch prompting (neutral, positive, negative), enabling direct comparison with human-authored counterparts across 32,652 ICLR submissions (Demetrio et al., 24 Oct 2025). PeerPrism further establishes a regime-annotated corpus to systematically evaluate the multidimensionality of authorship in review processes (Sadeghian et al., 16 Apr 2026).

2. Empirical Manifestations of Asymmetry: Calibration and Bias

LLMs demonstrate a pronounced calibration asymmetry when tasked with peer review generation. Quantitative evidence from GenReview shows that:

Under neutral prompting, the mean rating sits above the notional midline of the rating scale. Only 35 of 32,652 neutral reviews gave a rating of 5; the modal AI-generated rating is 8.
Positive prompting almost always yields ratings of 8 or 9.
Negative prompting leads to universal recommendations to reject, with 85% of reviews assigning a 4.

Statistical testing confirms that these effects are significant. Notably, even neutral LLM reviews exhibit an overwhelming acceptance bias, failing to replicate the critical selectivity and finer granularity of human calibration (Demetrio et al., 24 Oct 2025). This points to a systematic generation–review asymmetry: LLM reviews, even when correctly structured, do not operationalize nuanced negative evaluations as human reviews do.
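The bias discussed above is the mean rating minus a reference midline. The sketch below shows one way such a summary could be computed; the function name `calibration_summary`, the toy ratings, and the midline value of 5.5 are all assumptions made for illustration and are not GenReview data or GenReview's definition of the midline.

```python
from collections import Counter
from statistics import mean


def calibration_summary(ratings, midline):
    """Summarise integer ratings relative to a caller-chosen midline."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    avg = mean(ratings)
    # most_common(1) returns the single most frequent rating and its count.
    modal_rating, _ = Counter(ratings).most_common(1)[0]
    return {"mean": avg, "bias": avg - midline, "mode": modal_rating}


# Toy ratings, invented for this example only.
toy_ratings = [8, 8, 7, 8, 6, 9, 8, 5]
summary = calibration_summary(toy_ratings, midline=5.5)  # midline is an assumed value
print(summary)
```

A positive `bias` combined with a mode near the top of the scale is the pattern the section describes for neutral prompting.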

3. Detection Methodologies and the Limits of Binary Attribution

AI-authorship detectors, such as Binoculars (Demetrio et al., 24 Oct 2025) and the state-of-the-art detectors benchmarked in PeerPrism (Sadeghian et al., 16 Apr 2026), nominally achieve high precision and recall in distinguishing fully human (HH) from fully synthetic (AA) reviews. For example, Binoculars yields 100% recall and ≈99.3% precision on GenReview, with low false positive rates on pre-2023 human reviews.

However, when applied to hybrid regimes—particularly HA (human ideas, AI text)—detection accuracy and inter-detector agreement degrade sharply. In PeerPrism:

On HA-rewritten reviews, Anchor predicts 35% Human/65% AI, while Fast-DetectGPT yields 74% Human/26% AI; disagreement rates exceed 0.5 between detectors.
Average detector accuracy against the reference label for HA reviews (Human) varies from <30% to >70% depending on the method.

This divergence arises because detectors conflate surface stylometric properties with semantic provenance. Some detectors latch onto text-level artifacts (flagging any AI-generated surface as AI), while others are sensitive to preserved evaluative structure (semantically labeling as Human when the critique originates from human inputs) (Sadeghian et al., 16 Apr 2026). This exposes the inadequacy of binary attribution frameworks in complex real-world workflows.
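Inter-detector disagreement and accuracy against a reference label are both simple proportions over paired predictions. The sketch below is illustrative only; the function names and the toy predictions are hypothetical and do not reproduce PeerPrism's evaluation code or data.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of reviews on which two detectors output different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must be non-empty")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def accuracy(preds, reference_label):
    """Fraction of predictions equal to a single reference label."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(p == reference_label for p in preds) / len(preds)


# Toy predictions on four HA reviews, invented for this example.
detector_a = ["AI", "AI", "Human", "AI"]
detector_b = ["Human", "Human", "Human", "AI"]

print(disagreement_rate(detector_a, detector_b))  # 0.5
print(accuracy(detector_a, "Human"))              # 0.25
print(accuracy(detector_b, "Human"))              # 0.75
```

Marginal rates alone already bound disagreement from below: if one detector labels 35% of reviews Human and another labels 74% Human, they must disagree on at least 39% of reviews, whichever reviews those are.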

4. Stylometric and Semantic Analyses Across Regimes

Empirical measurements in PeerPrism demonstrate stylometric and semantic shifts induced by LLM rewriting:

Regime	TTR	Readability (Flesch)	1st-person pronouns (per 100 tokens)
HH (human)	0.55	37.8	5.04
AA (AI)	0.61	14.0	0.37
HA (hybrid)	0.56	19.5	2.34

Type–Token Ratio (TTR): Increases slightly in AI and hybrid reviews, suggesting greater lexical diversity under LLM generation.
Readability (Flesch): Decreases (lower score) in LLM regimes, indicating syntactically denser, potentially less accessible prose.
Authorial Markers: 1st-person pronoun usage is suppressed by LLMs (5.04 → 0.37–2.34 per 100 tokens), reflecting the surface neutralization characteristic of automated generation.
Paper-grounding Features: Citation and explicit manuscript reference counts increase in LLM and hybrid-generated text.
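Two of the surface measures above, type–token ratio and first-person pronoun rate, can be computed with a few lines of code. This is a rough sketch: the regex tokenizer and the pronoun list are assumptions chosen for brevity, and PeerPrism's actual tokenization and feature definitions may differ.

```python
import re

# Assumed pronoun inventory; the source does not specify which forms were counted.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def first_person_per_100(tokens):
    """First-person pronouns per 100 tokens."""
    if not tokens:
        return 0.0
    return 100 * sum(t in FIRST_PERSON for t in tokens) / len(tokens)


sample = "I think the proofs are sound, but I could not follow the ablation."
tokens = tokenize(sample)
print(type_token_ratio(tokens), first_person_per_100(tokens))
```

Type–token ratio is sensitive to text length, so comparisons across regimes are most meaningful when reviews are of similar length or the measure is computed over fixed-size windows.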

Semantic analysis further reveals that the core evaluative structure of HA reviews is inherited from the source human critique, as measured by semantic similarity, while surface features increasingly approximate synthetic norms. This illustrates that generation–review asymmetry is not merely a superficial artifact but persists across both content and form dimensions (Sadeghian et al., 16 Apr 2026).

5. Instruction-Following and Fidelity Failures

In GenReview, instruction-following by LLMs is uneven:

Summary Section Length: Compliance is high (≈94% in the 100–300 word range).
Main Review Length: Compliance collapses to ≈12% (target 800–1000 words), with many reviews truncated below 700 words.
Sectional Structure and Ratings: Most LLM reviews include the required structural keywords and explicit rating (0.4% omission rate), with rare hallucinations or mis-ordered sections (<1%).

This inconsistency highlights a further form of generation–review asymmetry: LLMs reliably follow some shallow syntactic constraints but struggle with compound or nested instructions, compromising evaluative substance (Demetrio et al., 24 Oct 2025). A plausible implication is that generation quality is not uniform across requested review dimensions, and post-hoc or calibrated prompting may be required to enforce compliance.
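Length compliance of the kind reported above is the share of texts whose word count falls inside a target range. The following sketch assumes whitespace tokenization and inclusive bounds; both are assumptions, and the function names are hypothetical rather than taken from GenReview.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def length_compliance(texts, lo, hi):
    """Fraction of texts whose word count lies in [lo, hi], bounds inclusive."""
    if not texts:
        raise ValueError("texts must be non-empty")
    return sum(lo <= word_count(t) <= hi for t in texts) / len(texts)


# Toy "summaries" of 150, 50, and 300 words, built from a repeated placeholder word.
toy_summaries = ["word " * 150, "word " * 50, "word " * 300]
print(length_compliance(toy_summaries, lo=100, hi=300))
```

The same function could be applied to main review bodies with `lo=800` and `hi=1000` to check the longer target the section mentions.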

6. Implications for Authorship Attribution and Peer Review Integrity

The collective evidence demonstrates that generation–review asymmetry undermines both the procedural authenticity of peer review and the operational validity of LLM-authorship detectors. Specifically:

Binary Detection Pitfall: AI-authorship detection, when restricted to HH/AA dichotomies, yields deceptively high metrics. In hybrid and real-world workflows, detectors' predictions become regime-dependent, converging neither on surface nor substantive provenance (Sadeghian et al., 16 Apr 2026).
Authorship as Multidimensional Construct: Authorship must be modeled as a vectorial combination of semantic (idea provenance) and stylistic (surface realization) contributions, not as a scalar label. Policy and tooling in review venues will need to distinguish between language assistance (stylistic mediation) and intellectual delegation (evaluative reasoning).
Bias and Calibration Risks: Systematic positive skew in LLM reviews risks eroding the function of peer review as a critical filter, especially for lower-quality or borderline submissions.
Robustness and Integrity: Future detection, attribution, and fidelity-checking tools must move toward provenance quantification and multidimensional authorship modeling to sustain the integrity of scientific evaluation (Demetrio et al., 24 Oct 2025, Sadeghian et al., 16 Apr 2026).
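One way to picture authorship as a two-component quantity rather than a scalar label is sketched below. Everything here is hypothetical: the `Provenance` class, the `describe` function, the 0.5 threshold, and the assumption that some upstream estimator supplies the two scores. Neither GenReview nor PeerPrism is known to define such an interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Assumed scores in [0, 1]; how they would be estimated is left open."""
    idea_ai: float  # share of evaluative content attributed to AI
    text_ai: float  # share of surface wording attributed to AI

    def __post_init__(self):
        for name in ("idea_ai", "text_ai"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


def describe(p: Provenance, threshold: float = 0.5) -> str:
    """Map the two scores to a coarse policy category (illustrative threshold)."""
    if p.idea_ai >= threshold:
        return "intellectual delegation"  # evaluative reasoning came from AI
    if p.text_ai >= threshold:
        return "language assistance"      # human ideas, AI wording
    return "human-authored"


print(describe(Provenance(idea_ai=0.1, text_ai=0.9)))  # language assistance
```

A binary detector collapses both fields into one label, which is why the HA case above would be reported as either "AI" or "Human" depending on which component the detector happens to track.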

7. Future Directions and Open Problems

Empirical insights from GenReview and PeerPrism indicate several open research challenges:

Provenance Quantification: Developing detector frameworks that disentangle and quantify the independent contributions of human and AI agents across both semantic and stylistic axes.
Calibration Mitigation: Designing calibrated prompting strategies or post-hoc adjustment mechanisms to correct for LLM bias and achieve human-like critical discrimination.
Hybrid Workflow Transparency: Building detection and attribution pipelines that are robust to an increasing prevalence of hybrid (HA, AH) review generation regimes.
Instructional Robustness: Enhancing LLMs’ ability to follow complex, context-dependent reviewing guidelines without shortcutting or hallucinating evaluative structures.

Taken together, generation–review asymmetry marks both a technical and normative frontier in the integration of LLMs within scientific peer review. Its resolution will require co-evolution of dataset construction, detector design, policy frameworks, and continuous evaluation of both surface and substantive provenance in scholarly communication (Demetrio et al., 24 Oct 2025, Sadeghian et al., 16 Apr 2026).

References

Demetrio et al. (2025). Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews.

Sadeghian et al. (2026). PeerPrism: Peer Evaluation Expertise vs Review-writing AI.
