Agents4Science Conference
- Agents4Science is an academic conference where AI agents serve as both authors and peer reviewers, integrating automation with scientific research.
- The conference features fully autonomous and hybrid research submissions, with peer review performed by ensembles of state-of-the-art LLMs that assess methodological rigor and reproducibility.
- Case studies such as the Jr. AI Scientist system demonstrate automated hypothesis generation and full-cycle experimentation with executable artifacts, while highlighting the challenges of achieving transformative advances.
Agents4Science is an academic conference established in 2025 as the first venue where AI agents serve as both authors and peer reviewers. Originating from the intersection of scientific research automation and advanced LLM capabilities, its goal is to provide a rigorous framework for the validation and evaluation of AI-driven scientific contributions. The conference, co-organized by Stanford University and Together AI, is characterized by fully autonomous or hybrid (human–AI) authorship, strict artifact submission requirements, and a review process mediated exclusively by ensembles of state-of-the-art LLM reviewers. As a practical instantiation of principles articulated in the Agent4S framework (2506.23692), Agents4Science stands at the frontier of machine-mediated scientific discovery, responsible AI authorship, and peer evaluation.
1. Foundational Context and Objectives
Agents4Science was founded to address the need for transparent and reproducible assessment of AI-generated scientific work within an academic framework, reflecting the shift toward the “Fifth Scientific Paradigm”—automation and integration of the complete research process by intelligent agents. The venue’s explicit objective is to assess the scientific merit of AI-authored contributions, set standards for responsible AI participation in research, and investigate the practical reality of AI-in-the-loop scientific exploration (Miyai et al., 6 Nov 2025). Agents4Science accepts submissions from fully automated research systems as well as hybrid collaborations, aiming to balance methodological rigor with openness to unconventional, machine-driven approaches.
2. Submission Criteria and Workflow
Submissions to Agents4Science must adhere to a formal specification designed to maximize reproducibility and facilitate automated review. The core requirements are:
- Authorship: The primary author must be an AI agent; human co-authorship is permitted but not mandated.
- Manuscript Formatting: Strict LaTeX template compliance, with defined page limits (typically 8 pages) and allowed figure formats.
- Artifacts: Complete code, data, and supplementary appendices must accompany the submission; these enable downstream verification by AI reviewers (a minimal compliance-check sketch follows this list).
- Research Content: The manuscript must extend or critically evaluate existing work, present original hypotheses, and include experimental results validated by executable artifacts.
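As an illustration of how these requirements might be checked mechanically before upload, the sketch below validates a submission directory against an assumed layout. The file names, directory structure, and page limit are illustrative assumptions rather than the official Agents4Science specification.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical submission layout and limits -- illustrative assumptions only,
# not the official Agents4Science specification.
PAGE_LIMIT = 8
REQUIRED_ARTIFACTS = ["paper/main.tex", "code", "data", "appendix"]


@dataclass
class ComplianceReport:
    missing: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.missing


def check_submission(root: str, page_count: int) -> ComplianceReport:
    """Check that a submission directory contains the expected artifacts."""
    root_path = Path(root)
    report = ComplianceReport()

    # Manuscript, code, data, and appendix must all be present.
    for artifact in REQUIRED_ARTIFACTS:
        if not (root_path / artifact).exists():
            report.missing.append(artifact)

    # Enforce the page limit on the compiled manuscript.
    if page_count > PAGE_LIMIT:
        report.warnings.append(f"{page_count} pages exceeds the {PAGE_LIMIT}-page limit")

    return report


if __name__ == "__main__":
    report = check_submission("submission/", page_count=8)
    print("compliant" if report.ok else f"missing artifacts: {report.missing}")
```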
The submission workflow is designed for full-cycle automation. For example, in the Jr. AI Scientist system, the AI agent performed: baseline analysis, hypothesis generation, candidate improvement filtering (via literature search), automated experiment execution (including ablation studies), and final manuscript preparation—entirely in compliance with Agents4Science requirements.
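Such a workflow can be pictured as a linear pipeline of stages. The sketch below is a schematic reconstruction only: the stage names mirror the description above, but the function bodies are stubs, since the actual Jr. AI Scientist implementation is not reproduced here.

```python
from dataclasses import dataclass

# Schematic pipeline mirroring the stages described above. The stage bodies are
# deliberately stubbed: the real Jr. AI Scientist internals are not reproduced here.


@dataclass
class ResearchState:
    baseline: str
    hypotheses: list[str] | None = None
    results: dict[str, dict] | None = None
    manuscript: str | None = None


def analyze_baseline(state: ResearchState) -> ResearchState:
    # Extract limitations of the baseline method (stub).
    return state


def generate_and_filter_hypotheses(state: ResearchState) -> ResearchState:
    # Propose candidate improvements, then prune near-duplicates via literature search (stub).
    state.hypotheses = ["candidate improvement A"]
    return state


def run_experiments(state: ResearchState) -> ResearchState:
    # Execute experiments and ablation studies for each surviving hypothesis (stub).
    state.results = {"candidate improvement A": {"AUROC": 0.0}}
    return state


def write_manuscript(state: ResearchState) -> ResearchState:
    # Draft, reflect on, and format the paper to the venue template (stub).
    state.manuscript = "main.tex"
    return state


PIPELINE = [analyze_baseline, generate_and_filter_hypotheses, run_experiments, write_manuscript]


def run_pipeline(baseline: str) -> ResearchState:
    """Run every stage in order, threading a shared research state through them."""
    state = ResearchState(baseline=baseline)
    for stage in PIPELINE:
        state = stage(state)
    return state
```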
3. AI-Based Peer Review Mechanism
Peer review at Agents4Science is performed entirely by ensembles of advanced LLMs: OpenAI's GPT-5, Google DeepMind's Gemini 2.5, and Anthropic's Claude Sonnet 4. Each reviewer is conditioned via in-context learning on review samples from major conferences (e.g., ICLR 2024–2025) and prompted to follow the official review forms. No human-in-the-loop processes are invoked in manuscript evaluation.
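The exact reviewer prompts have not been published. The sketch below only illustrates the general shape of such in-context conditioning, in which example reviews and the review form are prepended to the manuscript text; the form wording and example content are assumptions.

```python
# Illustrative prompt assembly for an LLM reviewer conditioned via in-context
# learning. The review form wording and example content are assumptions; the
# actual Agents4Science reviewer prompts are not public.

REVIEW_FORM = (
    "Review the paper along these axes, scoring each from 1 to 10: "
    "Soundness, Novelty, Experimental rigor, Clarity of presentation, Significance. "
    "Conclude with an overall recommendation and a justification."
)


def build_review_prompt(manuscript: str, example_reviews: list[str]) -> str:
    """Prepend sample reviews (e.g., from ICLR 2024-2025) and the review form to the paper."""
    shots = "\n\n".join(
        f"Example review {i + 1}:\n{review}" for i, review in enumerate(example_reviews)
    )
    return f"{REVIEW_FORM}\n\n{shots}\n\nPaper under review:\n{manuscript}"
```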
Each AI reviewer assesses manuscripts along standard axes:
- Soundness
- Novelty
- Experimental rigor
- Clarity of presentation
- Significance
Scoring is conducted on a 1–10 scale per axis, with an overall recommendation generated via a reviewer-specific weighting scheme. The scoring formula is:

$$\text{Overall} = w_S \cdot S + w_N \cdot N + w_R \cdot R + w_C \cdot C + w_G \cdot G$$

where $S$ is soundness, $N$ is novelty, $R$ is rigor, $C$ is clarity, $G$ is significance, and $w_S, \dots, w_G$ are reviewer-specific weights. The weights themselves remain proprietary.
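As a concrete reading of this weighting scheme, the snippet below aggregates per-axis scores into an overall recommendation; the weights shown are placeholders, since the actual reviewer-specific weights are not disclosed.

```python
# Weighted aggregation of per-axis review scores (each on a 1-10 scale).
# The weights below are placeholders: the reviewer-specific weights are proprietary.

AXES = ("soundness", "novelty", "rigor", "clarity", "significance")


def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-axis scores, as in the formula above."""
    return sum(weights[axis] * scores[axis] for axis in AXES)


example_scores = {"soundness": 8, "novelty": 5, "rigor": 7, "clarity": 8, "significance": 5}
# Placeholder weights chosen to sum to 1 so the result stays on the 1-10 scale.
example_weights = {"soundness": 0.30, "novelty": 0.25, "rigor": 0.20, "clarity": 0.10, "significance": 0.15}

print(round(overall_score(example_scores, example_weights), 2))  # 6.6 with these placeholder values
```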
A distinguishing feature of Agents4Science is the requirement that all claims must be verifiable by the AI reviewers, who are granted unrestricted access to submitted code, data, and appendices.
4. Case Study: Jr. AI Scientist System at Agents4Science
The Jr. AI Scientist system represents a canonical example of an autonomous research workflow specifically optimized for the Agents4Science venue (Miyai et al., 6 Nov 2025). Its submission comprised three full-length papers, each produced through:
- (i) Automated extraction of baseline limitations and hypothesis generation (pruned via literature search to avoid duplication).
- (ii) Three-stage coding-agent workflow:
- Implementation of candidate improvements into runnable Python scripts.
- Iterative parameter tuning and performance benchmarking (metrics such as AUROC).
  - Automated ablation studies (hyperparameter and component variants), with the resulting data exported in structured formats such as JSON and LaTeX tables (see the export sketch after this list).
- (iii) Multi-pass draft writing, including agent-based logical reflection, formatting, and simulated peer critique to ensure template compliance.
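The export step in stage (ii) can be illustrated as follows; the configurations, metric values, and table layout are placeholders rather than outputs of the actual system.

```python
import json

# Illustrative export of ablation results to JSON and a LaTeX table.
# The configurations and metric values are placeholders, not results from the submitted papers.

ablation_results = [
    {"configuration": "baseline", "AUROC": 0.0},
    {"configuration": "baseline + candidate improvement", "AUROC": 0.0},
]


def export_json(results: list[dict], path: str) -> None:
    """Write the ablation results in a machine-readable form."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)


def export_latex(results: list[dict]) -> str:
    """Render the ablation results as a LaTeX tabular environment."""
    rows = "\n".join(f"{r['configuration']} & {r['AUROC']:.3f} \\\\" for r in results)
    return (
        "\\begin{tabular}{lc}\n\\toprule\nConfiguration & AUROC \\\\\n\\midrule\n"
        + rows
        + "\n\\bottomrule\n\\end{tabular}"
    )


if __name__ == "__main__":
    export_json(ablation_results, "ablation.json")
    print(export_latex(ablation_results))
```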
Although their overall technical soundness and experimental rigor exceeded many human-authored baselines, all three Jr. AI Scientist manuscripts were rejected. Reviewers cited methodological soundness, exhaustive ablation studies, and clarity of presentation as strengths. Key weaknesses identified included:
- Limited improvement over existing baselines, with no transformative advances.
- Moderate novelty due to incremental, baseline-centric contributions.
- Insufficient experimentation, owing to a lack of benchmarking against contemporary state-of-the-art methods.
- Shallow theoretical justification, with empirical improvements lacking formal explanation or principled generalization.
These outcomes highlight present limitations of autonomous AI-driven research and inform iterative system design and venue review policy.
5. Review Criteria, Metrics, and Artifacts
The Agents4Science review methodology centers on compliance and reproducibility, with assessment organized along the following axes:
| Axis | Assessment Method | Requirements |
|---|---|---|
| Soundness | Methodological error checks | Defensible claims, bug-free code |
| Novelty | Literature comparison (AI) | Must extend/challenge prior work |
| Rigor | Automated artifact testing | Comprehensive ablation and validation |
| Clarity | Structured template review | No formatting or sectioning defects |
| Significance | Impact assessment (AI) | Advancement of field state/capability |
All submissions must provide artifacts permitting full verification by LLM reviewers. Human review is excluded from the evaluation pipeline, positioning Agents4Science as a testbed for “AI-to-AI” accountability and auditability.
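One way to picture this "AI-to-AI" verification is as a rerun-and-compare pass over the submitted artifacts. The sketch below assumes a hypothetical entry point (run_experiment.py) that writes a metrics.json file; neither name comes from the conference specification, and the claimed value and tolerance are illustrative.

```python
import json
import subprocess
import sys

# Hypothetical verification pass: rerun a submitted experiment script and compare
# its regenerated metric with the value claimed in the manuscript. The script name,
# output file, and tolerance are illustrative assumptions.


def verify_claim(claimed_auroc: float, tolerance: float = 0.01) -> bool:
    """Rerun the artifact and check the reproduced AUROC against the claimed value."""
    # The submitted entry point is assumed to write its metrics to metrics.json.
    subprocess.run([sys.executable, "run_experiment.py"], check=True)

    with open("metrics.json") as f:
        metrics = json.load(f)

    return abs(metrics["AUROC"] - claimed_auroc) <= tolerance


if __name__ == "__main__":
    print("claim reproduced" if verify_claim(claimed_auroc=0.91) else "claim not reproduced")
```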
6. Position within the Agent4S and AI for Science Ecosystem
Agents4Science operationalizes many concepts from the broader Agent4S framework (2506.23692), particularly regarding:
- L4/L5 intelligent research pipelines, where AI agents exercise autonomy over the complete scientific process or collaborate across laboratories.
- Automation of artifact generation, experimental design, and result synthesis.
- Integration of fully autonomous peer review, fulfilling the Agent4S ambition of closed-loop research evaluation.
A plausible implication is that Agents4Science provides a microcosm for the investigation of the socio-technical, methodological, and epistemic challenges raised by autonomous scientific practice—issues such as creativity, generalization, interpretability, and explainability.
7. Current Limitations and Research Trajectories
Despite engineering advances, current AI Scientist systems assessed at Agents4Science exhibit several persistent limitations:
- Incremental advances rather than paradigm-shifting contributions.
- Difficulties in reproducing or outperforming disparate state-of-the-art baselines autonomously.
- Lack of integrated modules for formal theoretical reasoning or proof-based validation.
- Challenges in benchmarking generalization across unseen datasets or domains.
Reviewers recommend the development of systems that broaden ideation (through tree search or hybrid human–AI interfaces), expand comparative benchmarking, and embed theoretical validation alongside empirical performance improvements. The evolution of Agents4Science as both a venue and a crucible for AI-in-the-loop research is likely to influence future standards for responsible research automation, reproducibility mandates, and the epistemic norms underlying both machine and human-generated scientific discovery.