The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (2504.08066v1)

Published 10 Apr 2025 in cs.AI, cs.CL, and cs.LG

Abstract: AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-LLM (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.

Summary

The paper introduces an autonomous system that eliminates human-coded templates to streamline the complete research process from idea generation to publication.
The paper employs an innovative agentic tree search method combined with VLM feedback to efficiently explore hypotheses and refine experimental outcomes.
The paper demonstrates practical impact by producing a fully autonomously generated manuscript that passed peer review at an ICLR workshop, while highlighting areas for further improvement in hypothesis novelty.

This paper introduces The AI Scientist-v2 (2504.08066), an advanced automated system designed to perform end-to-end scientific discovery and manuscript generation. Building upon its predecessor (AI Scientist-v1), which demonstrated the feasibility of automated workflows but relied on human-coded templates, v2 significantly enhances autonomy and exploration capabilities.

The core problem addressed is automating the entire scientific process, from hypothesis generation and experiment design to execution, data analysis, visualization, and manuscript writing, aiming to reduce human intervention and accelerate discovery.

Key innovations in The AI Scientist-v2 include:

Elimination of Template Dependency: The system no longer requires human-authored code templates for specific domains. It starts with a more generalized idea generation process and autonomously generates necessary code.
Agentic Tree Search with Experiment Manager: A novel progressive agentic tree search method, orchestrated by a dedicated experiment manager agent, enables deeper and more systematic exploration of hypotheses. This moves away from the linear experimentation of v1.
VLM Feedback Loop: Vision-LLMs (VLMs) are integrated into the workflow to provide feedback on generated figures during experimentation and to review figures and captions during the manuscript writing phase, improving visualization quality and alignment.
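
The VLM feedback loop described above can be pictured as a review-and-revise cycle that stops once the reviewer raises no further issues or a round budget runs out. The following is a minimal sketch under stated assumptions, not the system's actual implementation: the names `Figure`, `stub_review`, `stub_revise`, and `refine_figure`, the two issue strings, and the `max_rounds` budget are all hypothetical, and the reviewer is a rule-based stand-in where a real system would call a vision-language model on a rendered image.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Figure:
    """Hypothetical description of a plot; a real system would pass an image."""
    title: str
    has_axis_labels: bool = False
    has_legend: bool = False


def stub_review(fig: Figure) -> List[str]:
    """Stand-in for a VLM call: returns a list of issues (empty means approved)."""
    issues = []
    if not fig.has_axis_labels:
        issues.append("missing axis labels")
    if not fig.has_legend:
        issues.append("missing legend")
    return issues


def stub_revise(fig: Figure, issues: List[str]) -> Figure:
    """Stand-in for LLM-driven regeneration: fixes only the first reported issue."""
    first = issues[0]
    if first == "missing axis labels":
        return replace(fig, has_axis_labels=True)
    if first == "missing legend":
        return replace(fig, has_legend=True)
    return fig


def refine_figure(
    fig: Figure,
    review: Callable[[Figure], List[str]],
    revise: Callable[[Figure, List[str]], Figure],
    max_rounds: int = 3,
) -> Tuple[Figure, List[str], int]:
    """Alternate review and revision until approval or the budget is spent.

    Returns the final figure, any remaining issues, and the revisions applied.
    """
    for revisions in range(max_rounds):
        issues = review(fig)
        if not issues:
            return fig, [], revisions
        fig = revise(fig, issues)
    return fig, review(fig), max_rounds


if __name__ == "__main__":
    final, remaining, revisions = refine_figure(
        Figure("validation loss"), stub_review, stub_revise
    )
    print(final, remaining, revisions)
```

Passing the reviewer and reviser in as callables keeps the loop independent of any particular model API, so the stubs can be swapped for real model calls without changing the control flow.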

The AI Scientist-v2 workflow operates through several phases:

Generalized Idea Generation: The system starts with high-level prompts and uses tools like Semantic Scholar to formulate research ideas and hypotheses, moving beyond incremental modifications of existing codebases.
Experimentation (Structured Stages): An Experiment Progress Manager guides the process through four defined stages:
- Stage 1: Preliminary Investigation (establishing a minimal working prototype)
- Stage 2: Hyperparameter Tuning (refining the baseline)
- Stage 3: Research Agenda Execution (implementing core research)
- Stage 4: Ablation Studies (assessing component importance)
Each stage uses explicit stopping criteria and selects the best-performing node to seed the next stage.
Parallelized Agentic Tree Search: Within each experimentation stage, the system employs a tree search. Each node represents an experiment instance, involving LLM-driven code generation, execution, data saving (in .npy files), and plotting. Nodes are classified as "buggy" (if errors occur or VLM feedback is negative) or "non-buggy". The system selects nodes to expand (prioritizing debugging buggy nodes), generates child nodes with refined/debugged code, and executes them in parallel. Specialized node types exist for hyperparameter tuning, ablations, replication (for statistical robustness), and aggregation (for summarizing results).
Vision-LLM Review: VLMs review generated plots during experimentation and later figures, captions, and associated text in the manuscript, providing feedback to improve visual clarity and coherence.
Manuscript Writing: The system generates a complete scientific manuscript from the experimental results, using a simpler single-pass generation step followed by a reflection stage that incorporates VLM feedback and checks the draft against format and length constraints. Where possible, datasets are loaded through standard methods such as Hugging Face's load_dataset.
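
The staged experimentation and parallel tree search described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the released code: `Node`, `select_nodes`, `run_stage`, `run_pipeline`, and the toy `propose`/`execute` functions are hypothetical names, the stopping criterion is reduced to a fixed step budget, and node selection is a deterministic simplification (undebugged buggy nodes first, then non-buggy nodes by score). In the real system, proposing a child would be an LLM code-generation call and executing it would run the generated script.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

STAGES = ["preliminary", "hyperparameter_tuning", "research_agenda", "ablations"]


@dataclass
class Node:
    code: str                               # stand-in for a generated script
    stage: str
    parent: Optional["Node"] = None
    buggy: bool = False
    score: float = float("-inf")
    children: List["Node"] = field(default_factory=list)


def select_nodes(nodes: List[Node], k: int) -> List[Node]:
    """Pick up to k parents: buggy nodes not yet debugged first, then best scores."""
    undebugged = [n for n in nodes if n.buggy and not n.children]
    healthy = sorted((n for n in nodes if not n.buggy),
                     key=lambda n: n.score, reverse=True)
    return (undebugged + healthy)[:k]


def evaluate(node: Node, execute: Callable[[Node], Tuple[bool, float]]) -> None:
    """Run a node and record whether it is buggy and how it scored."""
    ok, score = execute(node)
    node.buggy = not ok
    node.score = score if ok else float("-inf")


def run_stage(root: Node, propose, execute, steps: int = 3, width: int = 2):
    """Grow a tree from root for a fixed budget; return the best non-buggy node."""
    evaluate(root, execute)
    nodes = [root]
    with ThreadPoolExecutor(max_workers=width) as pool:
        for _ in range(steps):
            parents = select_nodes(nodes, width)
            children = [Node(code=propose(p), stage=root.stage, parent=p)
                        for p in parents]
            # Children of one step are executed in parallel.
            list(pool.map(lambda c: evaluate(c, execute), children))
            for child in children:
                child.parent.children.append(child)
                nodes.append(child)
    good = [n for n in nodes if not n.buggy]
    return max(good, key=lambda n: n.score) if good else None


def run_pipeline(seed_code: str, propose, execute, steps: int = 3, width: int = 2):
    """Run the stages in order, seeding each with the previous stage's best node."""
    best = None
    for stage in STAGES:
        best = run_stage(Node(code=seed_code, stage=stage),
                         propose, execute, steps, width)
        if best is None:                     # no working node: stop early
            return None
        seed_code = best.code
    return best


def toy_propose(parent: Node) -> str:
    """Toy stand-in: 'debug' a buggy parent, otherwise extend its code."""
    return parent.code.replace("BUG", "") if parent.buggy else parent.code + "x"


def toy_execute(node: Node) -> Tuple[bool, float]:
    """Toy stand-in: code containing 'BUG' fails; longer code scores higher."""
    return "BUG" not in node.code, float(len(node.code))


if __name__ == "__main__":
    best = run_pipeline("BUG", toy_propose, toy_execute)
    print(best.stage, best.score)
```

Only the best node's code is carried across a stage boundary here, which mirrors the seeding rule in the text; the specialized replication and aggregation node types are left out to keep the sketch short.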

To evaluate AI Scientist-v2's capabilities, the authors submitted three fully autonomous manuscripts generated by the system to a peer-reviewed ICLR workshop focusing on negative results. One of these manuscripts achieved an average reviewer score high enough to be accepted under standard human submission criteria, marking the first reported instance of a fully AI-generated paper successfully navigating a peer-review process.

The accepted paper investigated whether temporal consistency regularization improves compositional generalization in sequence models on synthetic arithmetic tasks. Contrary to the hypothesis, the regularization did not yield significant improvements and sometimes harmed performance. The AI Scientist team's internal review highlighted issues like unclear method descriptions, missing citations, inaccuracies in figure captions and interpretations, and potential dataset overlap. Human reviewers appreciated the clear presentation of negative results but also noted the need for clearer justification of the method, broader architectural/dataset evaluation, and more depth.

Despite this success, the authors acknowledge significant limitations:

Acceptance was at the workshop level, where acceptance rates are higher than at top-tier conferences.
Only one of three submissions was accepted, indicating inconsistency.
The system still struggles with formulating truly novel, high-impact hypotheses and designing innovative methodologies requiring deep domain expertise.

The paper discusses the ethical implications of AI-generated research, emphasizing the importance of transparency, open discussion, and establishing community norms for disclosure and evaluation. The authors conducted the workshop evaluation with IRB approval and coordination with ICLR leadership, withdrawing the accepted paper before official publication to facilitate broader discussion.

The codebase for The AI Scientist-v2 and the experimental data for the workshop submissions have been open-sourced to encourage further research and public discussion on the evolving role of AI in science. The authors anticipate that future advancements will likely overcome current limitations, potentially leading to AI-generated research that matches or exceeds human quality.