EXP-Bench: Autonomous AI Research Evaluation

Updated 30 July 2025
  • EXP-Bench is a benchmark suite that evaluates AI agents on full, end-to-end autonomous research workflows using tasks derived from influential AI papers.
  • It employs a three-stage, semi-automated curation pipeline to extract experimental details and verify reproducible execution in containerized environments.
  • Evaluations of current agents show partial success on isolated subtasks, but only 0.5% of attempts complete the entire research cycle correctly, owing to failures in design, implementation, and execution.

EXP-Bench is a benchmark suite developed to systematically evaluate the capacity of AI agents to autonomously conduct the full spectrum of AI research experimentation. Distinct from isolated code generation or task-specific testbeds, EXP-Bench directly operationalizes authentic research workflows from influential AI publications, requiring agents to formulate hypotheses, design and implement experimental procedures, execute experiments, and analyze results—all grounded in real research questions and code drawn from leading AI literature. As such, EXP-Bench provides a rigorous and realistic assessment of progress toward AI agents that can perform end-to-end scientific experimentation without human supervision.

1. Motivation and Benchmark Scope

EXP-Bench was conceived to measure whether LLM-based agents and their orchestration frameworks can autonomously coordinate all steps of a genuine AI research cycle. While significant advances have been made in code generation, question answering, and dataset-specific research agents, no prior benchmark had required agents to take incomplete starter code and carry out the complete cycle of experimental design, implementation, execution, and conclusion in the context of authentic AI research challenges.

The benchmark comprises 461 experimental tasks derived from 51 highly cited, code-linked papers in the AI field (selected from venues such as NeurIPS and ICLR), and is further decomposed into 12,737 individually gradable subtasks spanning algorithmic design, implementation, execution, and result analysis. Tasks range across subfields, including reinforcement learning, computer vision, and generative modeling.

2. Semi-Autonomous Curation Pipeline

The curation of EXP-Bench leverages a three-stage, semi-automated process designed to maximize fidelity and minimize human workload:

  1. Source Selection and Filtering: High-quality research papers are identified based on citation and popularity metrics, with the requirement that open-source code repositories are available.
  2. Experimental Detail Extraction: Using a multi-pass, multi-modal extraction system, the pipeline parses the papers (employing OCR on figures, tables, and text) to distill the core research question, methods, procedural steps, and expected results. An agent with capabilities for PDF analysis, codebase exploration, and basic web search is deployed to recover implementation chains, scripts, and configuration necessary to replicate the experimental pipeline.
  3. Verification and Refinement: Each candidate task undergoes execution in a clean, containerized environment. Results are compared against ground-truth outputs as reported in the original papers. Discrepancies (e.g., missing code, dependencies, execution errors) are identified via this loop and inform the refinement of the extracted task definition. Only tasks passing this high-fidelity check are incorporated into the benchmark.

This pipeline ensures that tasks are both authentic and executable, closely mirroring the demands faced when adapting actual research compendia for automated experimentation.
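
To make the verification stage concrete, the sketch below shows one way such a loop could look in Python: run a candidate task inside its container, parse the metrics it emits, and compare them against the values reported in the paper. The task schema, Docker invocation, JSON output convention, and tolerance are all assumptions for illustration, not EXP-Bench's actual implementation.

```python
# Minimal sketch of a stage-3 verification loop. The CandidateTask fields,
# the "metrics as last JSON line on stdout" convention, and the 5% tolerance
# are hypothetical choices for illustration only.
import json
import subprocess
from dataclasses import dataclass

@dataclass
class CandidateTask:
    task_id: str
    docker_image: str   # containerized environment built for the paper's repo
    entrypoint: str     # command that runs the extracted experiment
    ground_truth: dict  # numeric metrics reported in the original paper

def verify(task: CandidateTask, rel_tol: float = 0.05) -> list[str]:
    """Execute the task in a clean container and list any discrepancies."""
    result = subprocess.run(
        ["docker", "run", "--rm", task.docker_image, "bash", "-lc", task.entrypoint],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return [f"execution error: {result.stderr.strip()[-200:]}"]

    # Assume the experiment prints its metrics as JSON on the last stdout line.
    try:
        observed = json.loads(result.stdout.strip().splitlines()[-1])
    except (json.JSONDecodeError, IndexError):
        return ["no parseable metrics in output"]

    discrepancies = []
    for metric, expected in task.ground_truth.items():
        got = observed.get(metric)
        if got is None:
            discrepancies.append(f"missing metric: {metric}")
        elif abs(got - expected) > rel_tol * abs(expected):
            discrepancies.append(f"{metric}: expected ~{expected}, got {got}")
    return discrepancies  # empty list => task passes the fidelity check
```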

3. Structure of Benchmark Tasks and Agent Interfaces

Each benchmark task provides the agent with:

  • The core research question (from the paper)
  • A high-level methodological outline describing the experiment
  • A masked code repository (key implementation details removed or obscured), so that the agent must infer and reconstruct missing components rather than simply reproduce them

Agents must generate:

  • A design specification (D): experimental configuration and protocols, derived from the high-level description
  • Implementation (I): filling in missing code, scripts, or parameter files
  • Executable pipeline (E): code/script orchestration enabling successful automated execution
  • Conclusion (C): reasoned synthesis or analysis of final results, as supported by the outputs

Grading is performed on both atomic (D, I, E, C) and conjunctive (e.g., full-chain correctness, where all steps are required for credit) axes. This nested, multifaceted structure distinguishes EXP-Bench from prior code-generation or research QA benchmarks.
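
As a rough illustration of this interface, the following Python sketch models the task inputs, the four agent outputs (D, I, E, C), and a grader that awards atomic credit per component and conjunctive credit only when every component passes. The field names and grader signatures are hypothetical; the benchmark's real schema and scoring scripts live in the repository.

```python
# Illustrative sketch (not the official EXP-Bench schema) of task inputs,
# agent outputs, and atomic vs. conjunctive scoring.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskInput:
    research_question: str  # taken from the source paper
    method_outline: str     # high-level experimental description
    masked_repo_path: str   # repository with key implementation details removed

@dataclass
class AgentSubmission:
    design: str             # D: experimental configuration and protocol
    implementation: str     # I: code/scripts filling the masked gaps
    execution_cmd: str      # E: command orchestrating the automated run
    conclusion: str         # C: analysis grounded in the produced outputs

def grade(sub: AgentSubmission,
          graders: dict[str, Callable[[str], bool]]) -> dict[str, float]:
    """Score each atomic component, then require all of them for full-chain credit."""
    atomic = {
        "D": float(graders["design"](sub.design)),
        "I": float(graders["implementation"](sub.implementation)),
        "E": float(graders["execution"](sub.execution_cmd)),
        "C": float(graders["conclusion"](sub.conclusion)),
    }
    atomic["full_chain"] = float(all(atomic.values()))  # conjunctive axis
    return atomic
```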

4. Experimental Evaluation of AI Agents

Current LLM-based research agents—including OpenHands and IterativeAgent—were benchmarked on EXP-Bench. Results indicate some partial proficiency on atomic subtasks, with design and implementation correctness occasionally reaching 20–35%. However, performance degrades substantially when end-to-end experiment execution is demanded: only 0.5% of full workflows resulted in correct, reproducible experimental outcomes.

The principal sources of failure are:

  • Design Misclassification (≈16% of errors): Agents frequently misinterpret dispersed or implicit experimental variables.
  • Implementation Errors (≈39%): Missing or partially correct code fragments hinder full pipeline assembly.
  • Execution Failures (≈29% environment/dependency, ≈24% script-level): Incomplete or misconfigured environments, along with script errors, block successful runs.

These results consistently highlight the conjunctive complexity of autonomous research: performance falls off sharply when design, implementation, and execution must all be correct together. As a rough illustration, if each of four stages succeeded independently at about 30%, the full chain would succeed only around 1% of the time, the same order of magnitude as the observed 0.5%.

5. Challenges Revealed and Bottleneck Analysis

EXP-Bench’s results profile the primary limitations of current automation approaches:

  • Dispersed Experimental Information: Research papers often present experimental detail in non-contiguous, semi-structured formats (e.g., cross-referencing main text, appendices, and external code). Agents lack robust mechanisms to reconstruct these dependencies.
  • Environment Reproducibility: Automated configuration of run environments, including obscure library versions and hardware targets, remains a significant stumbling block for agent frameworks that lack targeted OS- and package-management reasoning.
  • Masked Code and Reasoning Generalization: By obscuring key implementation elements, EXP-Bench prohibits rote code copying, requiring true abstraction, synthesis, and reasoning abilities that remain weakly developed in contemporary LLM agents.

These bottlenecks suggest that progress toward autonomous research agents requires advances not only in LLM code-generation, but in extraction of procedural, contextual, and environmental information from semi-structured, cross-modal artifacts.
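
The environment-reproducibility bottleneck in particular is easy to underestimate. As a minimal, hypothetical example of the kind of check an agent (or the curation pipeline) needs to perform, the sketch below compares the versions pinned in a repository's requirements file against what is actually installed in the running environment; the file format and error handling are assumptions for illustration.

```python
# Hedged sketch of an environment-reproducibility check: compare the package
# versions pinned by a paper's repository against what is installed in the
# current environment. Only exact "name==version" pins are handled here.
from importlib.metadata import version, PackageNotFoundError

def check_pins(requirements_path: str) -> list[str]:
    """Return human-readable mismatches between pinned and installed versions."""
    problems = []
    with open(requirements_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments, blanks, and non-exact pins
            name, pinned = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                problems.append(f"{name} not installed (pinned {pinned})")
                continue
            if installed != pinned:
                problems.append(f"{name}: pinned {pinned}, installed {installed}")
    return problems
```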

6. Implications and Future Outlook

The introduction of EXP-Bench refocuses the automation of AI research on authentic, end-to-end experimentation, providing not only an evaluation yardstick but also a resource to assist in developing and training future agents. The benchmark is positioned to:

  • Supply a training and evaluation ground directly coupled to the most challenging aspects of the automated scientific method
  • Identify where modular (subtask-solvable) capabilities break down in the presence of conjunctive, multi-phase requirements
  • Serve as a template for extending task curation, grading, and execution verification to new domains and experimental modalities

As agent performance on EXP-Bench improves, there is potential to catalyze a substantive transformation in how computational science is practiced, reducing manual overhead and potentially accelerating both the design and validation of novel AI systems.

7. Public Release and Community Engagement

EXP-Bench is open-sourced and accessible to the research community at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench. The repository includes structured benchmarks, documentation, evaluation scripts, and analyses of current agent performance across experimental phases. This access is intended to drive transparency, reproducibility, and collaborative progress toward robust, agent-driven scientific experimentation in AI research.