BadScientist: LLM Paper Fabrication Analysis
- BadScientist Framework is an evaluation system that analyzes vulnerabilities in LLM-driven research by fabricating and reviewing AI-generated papers.
- It employs a modular pipeline combining paper generation, multi-model review, and rigorous statistical calibration to validate manuscript integrity.
- The framework exposes a concern–acceptance conflict where reviewers flag issues yet accept unsound papers, urging the need for robust integrity safeguards.
The BadScientist framework is an evaluation system designed to analyze the vulnerability of LLM-based research agents and automated peer review systems to paper fabrication attacks. It provides both a modular pipeline for the end-to-end generation and review of research papers composed entirely without authentic experiments, and a rigorous statistical evaluation methodology to assess whether AI-generated, unsound papers can successfully pass through contemporary multi-model LLM review workflows. The framework exposes structural weaknesses in automated academic publishing processes and motivates the development of more robust integrity defense mechanisms (Jiang et al., 20 Oct 2025).
1. System Architecture and Pipeline
BadScientist is architected around three core modules: (1) a Paper Generation Agent , (2) a Review Agent comprising multiple LLM reviewers, and (3) an Analysis/Aggregation module for calibration, aggregation, and statistical error guarantee computation.
The framework pipeline is represented as:
- Seed Prompt : Specifies topic and attack strategy .
- Data Synthesizer : Generates pseudo‐experimental data conditioned on .
- Visualization Module : Renders plots/tables from .
- Manuscript Composer : Assembles a fully structured LaTeX manuscript.
- Review Agent : Queries multiple LLMs per paper using a fixed rubric, aggregates individual rubric vectors and textual comments .
- Calibration/Aggregation : Uses real paper data for threshold calibration and computes statistical concentration bounds.
Manuscript output is required to satisfy the constraint , ensuring compilability and structural correctness.
High-level agent pseudocode:
1 2 3 4 5 6 7 8 9 10 11 12 |
\begin{algorithm}[h]
\caption{BadScientist Paper Generation Agent %%%%16%%%%}
\begin{algorithmic}[1]
\Require topic %%%%17%%%%, strategy set %%%%18%%%%
\State Sample pseudo‐data %%%%19%%%%
\State Generate figures/tables %%%%20%%%%
\State Assemble paper %%%%21%%%%
\If{%%%%22%%%%} reject and \textbf{goto} 1 \;\,\Comment{ensure valid LaTeX}
\EndIf
\State \Return %%%%23%%%%
\end{algorithmic}
\end{algorithm} |
2. Presentation-Manipulation Strategies
BadScientist operationalizes five atomic attack strategies for paper fabrication, as well as their joint application:
- TooGoodGains (): Artificially amplifies performance improvements over state-of-the-art (SOTA).
- BaselineSelect (): Selectively reports weaker baselines and omits confidence intervals.
- StatTheater (): Constructs sophisticated statistical tables and -values that create an illusion of validity.
- CoherencePolish (): Focuses on producing flawless document structure, consistent notation, and professional typography.
- ProofGap (): Inserts "rigorous" proofs concealing subtle logical gaps.
- All: Applies all – strategies simultaneously.
Example of a fabricated table generated under (TooGoodGains):
1 2 3 4 5 6 7 8 9 10 11 12 13 |
\begin{table}[h]
\centering
\caption{Fabricated Main Results under %%%%33%%%% (TooGoodGains)}
\label{tab:fake-results}
\begin{tabular}{lcccc}
\toprule
Method & Accuracy (\%) & %%%%34%%%% vs.\ SOTA & %%%%35%%%%-value & 95\% CI \
\midrule
Proposed (FakeNet) & \textbf{92.3} & +7.8 & %%%%36%%%% & [91.8, 92.8] \
SOTA Baseline & 84.5 & – & – & [83.9, 85.1] \
\bottomrule
\end{tabular}
\end{table} |
Example of a misleading fabricated loss curve:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[
width=0.6\linewidth,
xlabel={Epochs},ylabel={Validation Loss},
ymin=0, ymax=1
]
\addplot[blue, thick] table {
1 0.8
2 0.6
3 0.4
4 0.25
5 0.15
};
% This curve is entirely fabricated.
\end{axis}
\end{tikzpicture}
\caption{Fabricated loss curve with unrealistically steady improvements.}
\end{figure} |
3. Formal Evaluation and Error Guarantees
The framework adopts a mathematically rigorous approach to aggregate review scores, calibrate thresholds, and provide formal error bounds.
Notation:
- Paper space:
- Strategies: ; Topics:
- Generator distribution:
- Review Models: ; rubric output .
- Aggregate score:
Concentration Bound (Theorem 1 — Bernstein-McDiarmid):
Given centered, vector sub-Gaussian rubric outputs and -Lipschitz aggregation:
where the terms control variance and worst-case deviations among LLM reviewers.
For binary acceptance predictors with margin ,
Calibration Error Bounds (Propositions 1 & 2):
- Use the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality to bound deviations between empirical and true accept rates.
- Isotonic regression estimates the threshold where estimated acceptance probability crosses 0.5, with explicit error bounds in terms of the regression slope and estimation error.
4. Experimental Results and Empirical Vulnerabilities
BadScientist reveals systematic weaknesses in LLM-based review:
| Strategy | ACPT | ACPT | ICR | ICR | ICR | ICR@M |
|---|---|---|---|---|---|---|
| 67.0% | 82.0% | 38.4% | 4.7% | 2.3% | 39.5% | |
| 32.0% | 49.0% | 35.2% | 4.5% | 2.3% | 35.2% | |
| 53.5% | 69.7% | 29.4% | 2.4% | 4.7% | 31.8% | |
| 44.0% | 59.0% | 28.2% | 5.9% | 1.2% | 30.6% | |
| 35.4% | 53.5% | 25.9% | 8.2% | 7.1% | 34.1% | |
| All | 52.0% | 69.0% | 50.6% | 5.7% | 8.0% | 51.7% |
- Acceptance rates for fabricated papers reach up to 82% for specific strategies.
- Integrity Concern Rate (ICR) varies by reviewer model and attack method; many papers flagged for integrity still receive acceptance-level scores.
- Score distributions demonstrate that reviewer models differ in their score tendencies (e.g., o4-mini is more lenient, o3 is more variable, GPT-4.1 is stricter near the threshold).
5. Concern–Acceptance Conflict
BadScientist identifies the phenomenon of concern–acceptance conflict: LLM reviewers often flag papers for integrity concerns in their free-form comments, yet still assign scores above the acceptance threshold.
Conflict rate is defined as the fraction of papers where a reviewer both raises an integrity concern and recommends acceptance. For example:
| Model | All | |||||
|---|---|---|---|---|---|---|
| o3 | 33.3% | 25.8% | 52.0% | 30.0% | 40.9% | 29.5% |
| o4-mini | 100.0% | 50.0% | 100.0% | 80.0% | 71.9% | 100.0% |
| GPT-4.1 | 50.0% | 50.0% | 75.0% | 0.0% | 33.3% | 57.1% |
Despite recognizing the unsound nature of fabricated submissions, LLM reviewers systematically fail to translate these concerns into actual rejections.
6. Detection and Mitigation Attempts
Two principal mitigation strategies are evaluated:
- Review-with-Detection (ReD): LLM reviewers return their scores, a binary "AI-Fabricated/Non-Fabricated" label, and quoted evidence.
- Detection-Only (DetOnly): Reviewers provide only the binary detection label with supporting evidence.
Performance metrics include true positive rate (TPR), false positive rate (FPR), accuracy (Acc), and score. Even with explicit prompts for fabrication detection, detection rates barely surpass random guessing. ReD sometimes results in higher acceptance rates compared to when no explicit detection prompt is given.
Detection results summary:
| Method | o3 Acc | o4-mini Acc | GPT-4.1 Acc |
|---|---|---|---|
| Random | 50.0% | 50.0% | 50.0% |
| ReD | 67.0% | 46.0% | 50.0% |
| DetOnly | 57.0% | 45.0% | 56.0% |
A plausible implication is that prompting LLMs for integrity checking in isolation is not effective in reliably identifying fabricated submissions (Jiang et al., 20 Oct 2025).
7. Implications and Safeguards
Key findings are:
- Automated AI-driven publication pipelines are fundamentally susceptible to fabrication: with certain attack strategies, up to 82% acceptance is observed.
- Concern–acceptance conflict undermines integrity signals, as LLM reviewers both flag and accept unsound work.
- Statistical aggregation and calibration, even with sound mathematical guarantees, fail to provide adequate defense against these attacks.
Recommended safeguards include:
- Defense-in-depth such as provenance verification (e.g., artifact timestamping, data checkpoints).
- Integrating integrity-weighted scoring, where acceptance is conditioned on the absence of flagged concerns.
- Mandatory human review for papers near or above the concern threshold.
- Audit logging of reviewer model actions and transparent post-publication review.
- Pursuing future research in adversarial reviewer training, multimodal credible-interval reasoning, and rigorous cross-validation with authentic experiments.
These results highlight urgent limitations of current AI-driven scientific publishing and the need for robust, multi-layered integrity verification systems (Jiang et al., 20 Oct 2025).