Papers
Topics
Authors
Recent
2000 character limit reached

BadScientist: LLM Paper Fabrication Analysis

Updated 6 December 2025
  • BadScientist Framework is an evaluation system that analyzes vulnerabilities in LLM-driven research by fabricating and reviewing AI-generated papers.
  • It employs a modular pipeline combining paper generation, multi-model review, and rigorous statistical calibration to validate manuscript integrity.
  • The framework exposes a concern–acceptance conflict where reviewers flag issues yet accept unsound papers, urging the need for robust integrity safeguards.

The BadScientist framework is an evaluation system designed to analyze the vulnerability of LLM-based research agents and automated peer review systems to paper fabrication attacks. It provides both a modular pipeline for the end-to-end generation and review of research papers composed entirely without authentic experiments, and a rigorous statistical evaluation methodology to assess whether AI-generated, unsound papers can successfully pass through contemporary multi-model LLM review workflows. The framework exposes structural weaknesses in automated academic publishing processes and motivates the development of more robust integrity defense mechanisms (Jiang et al., 20 Oct 2025).

1. System Architecture and Pipeline

BadScientist is architected around three core modules: (1) a Paper Generation Agent G\mathcal{G}, (2) a Review Agent R\mathcal{R} comprising multiple LLM reviewers, and (3) an Analysis/Aggregation module A\mathcal{A} for calibration, aggregation, and statistical error guarantee computation.

The framework pipeline is represented as:

  • Seed Prompt (t,s)(t,s): Specifies topic tt and attack strategy ss.
  • Data Synthesizer q(Dt,s)q(D|t,s): Generates pseudo‐experimental data conditioned on (t,s)(t,s).
  • Visualization Module viz(D)\mathrm{viz}(D): Renders plots/tables from DD.
  • Manuscript Composer compose(u,D,V)\mathrm{compose}(u,D,V): Assembles a fully structured LaTeX manuscript.
  • Review Agent R\mathcal{R}: Queries multiple LLMs per paper using a fixed rubric, aggregates individual rubric vectors rm(x)\mathbf{r}_m(x) and textual comments ωm(x)\omega_m(x).
  • Calibration/Aggregation A\mathcal{A}: Uses real paper data for threshold calibration and computes statistical concentration bounds.

Manuscript output is required to satisfy the constraint C(x)=I[compile(x)=successstruct(x)C]=1C(x) = \mathbb{I}[\text{compile}(x)=\text{success} \wedge \mathrm{struct}(x) \in \mathcal{C}] = 1, ensuring compilability and structural correctness.

High-level agent pseudocode:

1
2
3
4
5
6
7
8
9
10
11
12
\begin{algorithm}[h]
  \caption{BadScientist Paper Generation Agent %%%%16%%%%}
  \begin{algorithmic}[1]
    \Require topic %%%%17%%%%, strategy set %%%%18%%%%
    \State Sample pseudo‐data %%%%19%%%%
    \State Generate figures/tables %%%%20%%%%
    \State Assemble paper %%%%21%%%%
    \If{%%%%22%%%%} reject and \textbf{goto} 1 \;\,\Comment{ensure valid LaTeX}
    \EndIf
    \State \Return %%%%23%%%%
  \end{algorithmic}
\end{algorithm}

2. Presentation-Manipulation Strategies

BadScientist operationalizes five atomic attack strategies for paper fabrication, as well as their joint application:

  • TooGoodGains (s1s_1): Artificially amplifies performance improvements over state-of-the-art (SOTA).
  • BaselineSelect (s2s_2): Selectively reports weaker baselines and omits confidence intervals.
  • StatTheater (s3s_3): Constructs sophisticated statistical tables and pp-values that create an illusion of validity.
  • CoherencePolish (s4s_4): Focuses on producing flawless document structure, consistent notation, and professional typography.
  • ProofGap (s5s_5): Inserts "rigorous" proofs concealing subtle logical gaps.
  • All: Applies all s1s_1s5s_5 strategies simultaneously.

Example of a fabricated table generated under s1s_1 (TooGoodGains):

1
2
3
4
5
6
7
8
9
10
11
12
13
\begin{table}[h]
  \centering
  \caption{Fabricated Main Results under %%%%33%%%% (TooGoodGains)}
  \label{tab:fake-results}
  \begin{tabular}{lcccc}
    \toprule
    Method & Accuracy (\%) & %%%%34%%%% vs.\ SOTA & %%%%35%%%%-value & 95\% CI \
    \midrule
    Proposed (FakeNet) & \textbf{92.3} & +7.8 & %%%%36%%%% & [91.8, 92.8] \
    SOTA Baseline & 84.5 & – & – & [83.9, 85.1] \
    \bottomrule
  \end{tabular}
\end{table}

Example of a misleading fabricated loss curve:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
\begin{figure}[h]
  \centering
  \begin{tikzpicture}
    \begin{axis}[
      width=0.6\linewidth,
      xlabel={Epochs},ylabel={Validation Loss},
      ymin=0, ymax=1
    ]
      \addplot[blue, thick] table {
        1 0.8
        2 0.6
        3 0.4
        4 0.25
        5 0.15
      };
      % This curve is entirely fabricated.
    \end{axis}
  \end{tikzpicture}
  \caption{Fabricated loss curve with unrealistically steady improvements.}
\end{figure}

3. Formal Evaluation and Error Guarantees

The framework adopts a mathematically rigorous approach to aggregate review scores, calibrate thresholds, and provide formal error bounds.

Notation:

  • Paper space: X\mathcal{X}
  • Strategies: S\mathcal{S}; Topics: T\mathcal{T}
  • Generator distribution:

pG(xs,t)=p(xD,s,t)q(Ds,t,θ)dD.p_\mathcal{G}(x\mid s,t)=\int p(x\mid D,s,t)\,q(D\mid s,t,\theta)\,dD.

  • Review Models: M={m1,,mM}\mathcal{M} = \{m_1,\dots,m_M\}; rubric output rm(x){1,,L}K\mathbf{r}_m(x)\in\{1,\dots,L\}^K.
  • Aggregate score:

rˉ(x)=wmrm(x),s(x)=ϕ(rˉ(x)),y^(x)=I[s(x)τ].\bar{\mathbf{r}}(x) = \sum w_m \mathbf{r}_m(x),\quad s(x) = \phi(\bar{\mathbf{r}}(x)),\quad \hat y(x) = \mathbb{I}[s(x)\ge\tau].

Concentration Bound (Theorem 1 — Bernstein-McDiarmid):

Given centered, vector sub-Gaussian rubric outputs and LϕL_\phi-Lipschitz aggregation:

Pr(s(x)μs(x)t)exp(t22σw2+23cmaxt),\Pr\left(s(x) - \mu_s(x) \ge t\right) \le \exp\left(-\frac{t^2}{2\sigma_w^2 + \tfrac23\,c_{\max} t}\right),

where the terms control variance and worst-case deviations among LLM reviewers.

For binary acceptance predictors with margin γ(x)\gamma(x),

Pr(y^(x)y(x))    exp ⁣(γ(x)22σw2+23cmaxγ(x)).\Pr\bigl(\hat y(x)\neq y^\star(x)\bigr)\;\le\; \exp\!\left(-\frac{\gamma(x)^2}{2\sigma_w^2+\frac23\,c_{\max}\gamma(x)}\right).

Calibration Error Bounds (Propositions 1 & 2):

  • Use the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality to bound deviations between empirical and true accept rates.
  • Isotonic regression estimates the threshold where estimated acceptance probability crosses 0.5, with explicit error bounds in terms of the regression slope and estimation error.

4. Experimental Results and Empirical Vulnerabilities

BadScientist reveals systematic weaknesses in LLM-based review:

Strategy ACPTτrate_{\tau_{\rm rate}} ACPTτ0.5_{\tau_{0.5}} ICRo3_{\rm o3} ICRo4_{\rm o4} ICRGPT4.1_{\rm GPT4.1} ICR@M
s1s_1 67.0% 82.0% 38.4% 4.7% 2.3% 39.5%
s2s_2 32.0% 49.0% 35.2% 4.5% 2.3% 35.2%
s3s_3 53.5% 69.7% 29.4% 2.4% 4.7% 31.8%
s4s_4 44.0% 59.0% 28.2% 5.9% 1.2% 30.6%
s5s_5 35.4% 53.5% 25.9% 8.2% 7.1% 34.1%
All 52.0% 69.0% 50.6% 5.7% 8.0% 51.7%
  • Acceptance rates for fabricated papers reach up to 82% for specific strategies.
  • Integrity Concern Rate (ICR) varies by reviewer model and attack method; many papers flagged for integrity still receive acceptance-level scores.
  • Score distributions demonstrate that reviewer models differ in their score tendencies (e.g., o4-mini is more lenient, o3 is more variable, GPT-4.1 is stricter near the threshold).

5. Concern–Acceptance Conflict

BadScientist identifies the phenomenon of concern–acceptance conflict: LLM reviewers often flag papers for integrity concerns in their free-form comments, yet still assign scores above the acceptance threshold.

Conflict rate is defined as the fraction of papers where a reviewer both raises an integrity concern and recommends acceptance. For example:

Model s1s_1 s2s_2 s3s_3 s4s_4 s5s_5 All
o3 33.3% 25.8% 52.0% 30.0% 40.9% 29.5%
o4-mini 100.0% 50.0% 100.0% 80.0% 71.9% 100.0%
GPT-4.1 50.0% 50.0% 75.0% 0.0% 33.3% 57.1%

Despite recognizing the unsound nature of fabricated submissions, LLM reviewers systematically fail to translate these concerns into actual rejections.

6. Detection and Mitigation Attempts

Two principal mitigation strategies are evaluated:

  • Review-with-Detection (ReD): LLM reviewers return their scores, a binary "AI-Fabricated/Non-Fabricated" label, and quoted evidence.
  • Detection-Only (DetOnly): Reviewers provide only the binary detection label with supporting evidence.

Performance metrics include true positive rate (TPR), false positive rate (FPR), accuracy (Acc), and F1F_1 score. Even with explicit prompts for fabrication detection, detection rates barely surpass random guessing. ReD sometimes results in higher acceptance rates compared to when no explicit detection prompt is given.

Detection results summary:

Method o3 Acc o4-mini Acc GPT-4.1 Acc
Random 50.0% 50.0% 50.0%
ReD 67.0% 46.0% 50.0%
DetOnly 57.0% 45.0% 56.0%

A plausible implication is that prompting LLMs for integrity checking in isolation is not effective in reliably identifying fabricated submissions (Jiang et al., 20 Oct 2025).

7. Implications and Safeguards

Key findings are:

  • Automated AI-driven publication pipelines are fundamentally susceptible to fabrication: with certain attack strategies, up to 82% acceptance is observed.
  • Concern–acceptance conflict undermines integrity signals, as LLM reviewers both flag and accept unsound work.
  • Statistical aggregation and calibration, even with sound mathematical guarantees, fail to provide adequate defense against these attacks.

Recommended safeguards include:

  • Defense-in-depth such as provenance verification (e.g., artifact timestamping, data checkpoints).
  • Integrating integrity-weighted scoring, where acceptance is conditioned on the absence of flagged concerns.
  • Mandatory human review for papers near or above the concern threshold.
  • Audit logging of reviewer model actions and transparent post-publication review.
  • Pursuing future research in adversarial reviewer training, multimodal credible-interval reasoning, and rigorous cross-validation with authentic experiments.

These results highlight urgent limitations of current AI-driven scientific publishing and the need for robust, multi-layered integrity verification systems (Jiang et al., 20 Oct 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Whiteboard

Follow Topic

Get notified by email when new papers are published related to BadScientist Framework.