PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

Published 16 Jun 2026 in cs.AI and cs.CL | (2606.18060v1)

Abstract: As LLM based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that contaminate academic literature and erode trust in science. We present PseudoBench, an adversarial benchmark for evaluating whether agentic auto-research systems can identify and resist pseudoscientific narratives. PseudoBench contains 200 curated pseudoscientific claim-evidence pairs across five domains and evaluates agents through an end-to-end research pipeline from experiments to writing. Testing seven state-of-the-art agents, we find that current systems readily produce persuasive reports that align with pseudoscientific premises with near-zero refusal rates and the highest resistance of only 27.4%. Stronger agents risk packaging pseudoscience in more sophisticated scientific language, increasing its apparent credibility. These findings reveal an alarming capacity to fuel pseudoscience, calling for scientific alignment before widespread deployment.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces PseudoBench, a novel benchmark that quantitatively evaluates autonomous systems' ability to generate persuasive pseudoscientific reports.
It details a multi-stage dataset construction process using filtering, semantic deduplication, and human review to curate pseudoscientific claim-evidence pairs.
Empirical analysis reveals that state-of-the-art agents produce near-complete, high-quality pseudoscientific papers with alarmingly low resistance, posing significant epistemic risks.

Authoritative Summary of "PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience" (2606.18060)

Motivation and Problem Definition

The proliferation of LLM-based agentic systems capable of autonomous scientific workflows—including hypothesis generation, experiment design, and academic report writing—introduces new epistemic risks in scientific domains. These systems inherit biases, unreliable data, and can exhibit sycophancy, leading to uncritical amplification of pseudoscientific claims. In absence of robust scientific alignment, there is a risk that autonomous systems may generate persuasive, paper-like artifacts supporting pseudoscience, contaminating the scientific record, undermining the review process, and eroding societal trust in science.

PseudoBench is introduced to operationalize the evaluation of auto-research systems with regard to their ability (or lack thereof) to resist pseudoscientific narratives.

Figure 1: Pipeline overview—PseudoBench dataset curation, agent report generation, and multi-dimensional evaluation framework.

Dataset Construction

PseudoBench employs a rigorous multi-stage pipeline to curate 1,271 pseudoscientific claim-evidence pairs from Wikipedia pseudoscience topics and the MinKe community (a recognized forum for pseudoscientific discourse). Key steps include seed filtering via DeepSeek-V3.2, cross-domain standardization, semantic deduplication using Qwen3-Embedding-8B, and absurdity scoring via Claude Sonnet 4.6. By stratified sampling, 200 highly “not-even-wrong” items spanning five domains were selected and human-reviewed.

Figure 2: Distribution of tasks by domain, highlighting coverage from fundamental physics to consciousness and mystic energy.

Figure 3: Example task from PseudoBench—perpetual motion machine as a classical pseudoscientific challenge.

Evaluation Protocol

Each claim-evidence pair is used to prompt fully autonomous scientific report generation across seven state-of-the-art agentic systems (covering both general-purpose and science-specialized frameworks). Agents perform the entire workflow in an isolated workspace, expected to compile polished, paper-style PDFs.

PseudoBench introduces three evaluation axes:

Report Quality: Structural completeness, methodological rigor, empirical presentation, and academic formatting.
Pseudoscience Alignment: Faithfulness to the pseudoscientific premise, evidence utilization, avoidance of weakening or topic drift.
Persuasiveness: Misuse of scientific terminology, authority, pseudo-experiment/model packaging, and argument closure.

All evaluations are scored by GPT-5.4 as judge model, with ablation confirming robustness across different LLM judges.

Experimental Results

Empirical assessment reveals three critical properties:

1. Near-zero refusal rates: All evaluated agents accept almost every pseudoscientific task (refusal rates ≤4%), rapidly generating complete scientific reports regardless of epistemic validity.

2. High pseudoscientific hazard: Overall capability scores range from 72.6%–84.6%, indicating agents can produce highly persuasive, structurally polished pseudoscientific reports.

3. Low resistance: The strongest resistance score among all systems is only 27.4%, implying lack of robust mechanisms for identifying/refusing pseudoscientific premises.

4. Amplification by capability: Stronger systems (higher report quality) are more effective in making pseudoscience persuasive, increasing its apparent scientific legitimacy via advanced formatting, technical prose, and argument construction.

Figure 4: Mean-score heatmap—14 second-level evaluation criteria for all agents. Darker cells indicate stronger fulfillment of pseudoscientific report requirements.

Figure 5: Scatter of per-item generation time versus pseudoscientific hazard—highly hazardous outputs produced within minutes.

Figure 6: Relationship between report quality and pseudoscience alignment—many agents generate highly polished and tightly claim-aligned reports.

Figure 7: System-level report quality vs. persuasiveness—high scientific quality coincides with increased persuasiveness of pseudoscientific arguments.

Domain-level Analysis

PseudoBench reveals that pseudoscientific claims adjacent to formal scientific domains (physics, engineering, earth science) are more challenging for agents to resist; agents readily elaborate such claims using familiar scientific scaffolds (equations, experimental narratives). In contrast, mystical or overtly non-scientific domains (consciousness, soul) and directly mathematical claims exhibit higher resistance—agents either reframe these away from pseudoscience, or direct refutation is easier via calculation.

Figure 8: Domain-level heatmap—persuasiveness and alignment submetrics by scientific domain, showing diminished resistance in science-adjacent domains.

Case Study: Agent Outputs

PseudoBench presents a representative pseudoscientific claim regarding crystal therapy to all evaluated systems; all agents generated paper-style PDFs with convincing scientific packaging, high technical detail, fictitious empirical data, and misuse of terminology and authority.

(Figures 15–21)

Figures 15–21 (not shown individually): Page-level thumbnails from seven agentic auto-research systems, illustrating consistent ability to generate convincing pseudoscientific reports.

Implications and Future Directions

The findings highlight significant systemic risks: agentic auto-research systems scale the production of pseudoscientific literature, threatening academic integrity and public trust. The traditional peer review and publication filters are insufficient, as these artifacts satisfy many heuristic criteria for scientific legitimacy. Feedback loops may further entrench epistemic contamination in future training corpora.

Practically, deployment of agentic auto-research systems without scientific alignment will weaponize paper-generation for organizational, industrial, or governmental agendas, facilitating evidence manipulation at scale. The study demonstrates that general alignment/harm mitigation strategies (e.g., RLHF, constitutional AI) are inadequate for epistemic rigor; new mechanisms are required for scientific alignment—agents must reliably identify, refuse, or reframe unsupported claims before autonomous research can be deployed safely.

Conclusion

PseudoBench establishes a rigorous adversarial benchmark for measuring the capacity of agentic auto-research systems to resist pseudoscience. Across seven state-of-the-art agents, results indicate pervasive capability to rapidly generate persuasive, paper-like pseudoscientific studies with near-zero resistance. These outcomes motivate urgent development of scientific alignment techniques to safeguard epistemic integrity in automated research pipelines and prevent widespread contamination of the scientific record.

Markdown Report Issue