- The paper introduces EXP-Bench, a benchmark that assesses AI agents' ability to conduct full, end-to-end research experiments.
- The benchmark uses a semi-automated pipeline to extract structured experimental procedures from peer-reviewed publications and open-source code.
- Initial evaluations show top agents achieving only 0.5% full correctness, highlighting key limitations in current agents' ability to conduct rigorous, end-to-end research experiments.
This paper introduces EXP-Bench, a novel benchmark designed to evaluate the capability of AI agents to conduct complete, end-to-end AI research experiments. Recognizing that current AI agents often struggle with the complexities of rigorous experimentation beyond isolated tasks like code generation or data analysis, EXP-Bench provides a platform to assess agents across the full research workflow: formulating hypotheses, designing procedures, implementing code, executing experiments, and analyzing results.
EXP-Bench curates realistic research tasks directly from influential, peer-reviewed AI publications (specifically from NeurIPS and ICLR 2024) and their associated open-source codebases. This approach ensures tasks are grounded in actual scientific procedures validated by the community. The benchmark currently comprises 461 such tasks derived from 51 papers, spanning diverse AI subfields like computer vision, NLP, and reinforcement learning. Each task is structured to provide an agent with a research question, a high-level method description, and a code repository (with certain components masked to prevent direct copying). The ground truth for each task includes a detailed experimental design, necessary code modifications (represented as a git diff), and a final conclusion based on expected results. These ground truths are broken down into 12,737 fine-grained, gradable subtasks.
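To make this structure concrete, the following is a minimal, hypothetical sketch of how a single task record might be organized. The field names and types are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExpBenchTask:
    """Illustrative shape of a single EXP-Bench task (field names assumed, not the official schema)."""
    task_id: str
    paper_id: str                    # source NeurIPS/ICLR 2024 paper
    research_question: str           # what the experiment should answer
    method_description: str          # high-level description of the method under test
    repo_path: str                   # checkout of the paper's open-source codebase
    masked_files: List[str]          # files hidden from the agent to prevent copying the solution
    # Ground truth, used only at grading time:
    gt_design: str                   # expected experimental design (variables, constants, procedure)
    gt_diff: str                     # reference code changes, represented as a git diff
    gt_conclusion: str               # expected conclusion given the experimental results
    gradable_subtasks: List[str] = field(default_factory=list)  # fine-grained requirements for scoring
```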
A significant challenge in creating such a benchmark is extracting structured experimental details from heterogeneous research artifacts (papers, supplementary materials, code). To address this, the authors developed a semi-automated dataset construction pipeline, outlined in the sketch after the list below. This pipeline involves:
- Source Selection and Filtering: Identifying high-quality papers with open-source code based on metrics like citation counts and repository activity.
- Experiment Procedure Extraction: Decomposing high-level research goals into structured sub-steps. This stage uses a multi-modal extraction process (combining OCR, retrieval-augmented querying, and semantic extraction) to derive the core research task (question, method, expected outcome) from the paper. It then employs an implementation-extraction AI agent, operating in a tool-augmented environment, to identify the sequence of codebase scripts that realize the experimental procedure; this sequence is then used to define step-by-step implementation requirements.
- Verification and Refinement: Executing the extracted implementations in a controlled environment to validate functionality against expected outputs and employing lightweight human review to ensure alignment with source materials. Tasks are finalized with masked files to prevent agents from directly accessing solutions.
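The sketch below outlines this three-stage flow in Python. The callables, thresholds, and dictionary keys are assumptions made for illustration; they do not reflect the authors' actual implementation.

```python
from typing import Any, Callable, Dict, Iterable, List

def build_tasks(
    candidate_papers: Iterable[Dict[str, Any]],
    extract_core_task: Callable[[Dict[str, Any]], Dict[str, Any]],
    extract_implementation: Callable[[Dict[str, Any], Dict[str, Any]], List[str]],
    validate: Callable[[Dict[str, Any], List[str]], bool],
    min_citations: int = 20,      # assumed threshold for illustration
    min_stars: int = 50,          # assumed threshold for illustration
) -> List[Dict[str, Any]]:
    """Hypothetical outline of the three-stage construction pipeline."""
    tasks = []
    # 1) Source selection and filtering: keep influential papers with active open-source repos.
    papers = [p for p in candidate_papers
              if p.get("citations", 0) >= min_citations and p.get("repo_stars", 0) >= min_stars]
    for paper in papers:
        # 2) Experiment procedure extraction: derive the core task (question, method,
        #    expected outcome) from the paper, then map the procedure onto the sequence
        #    of repo scripts that realize it.
        core = extract_core_task(paper)
        steps = extract_implementation(paper, core)
        # 3) Verification and refinement: execute the extracted steps, check outputs,
        #    and (after a lightweight human pass) finalize the task with masked files.
        if validate(paper, steps):
            tasks.append({"core": core, "steps": steps, "masked": True})
    return tasks
```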
The evaluation of AI agents on EXP-Bench uses a multi-metric pipeline assessed by an LLM-based judge and a code execution validator. The metrics include:
- M (Monitor): Integrity check (detecting disallowed behaviors like accessing masked files or faking data).
- D (Design Correctness): Evaluating if the agent's experimental design (variables, constants, procedures) aligns with the ground truth.
- I (Implementation Correctness): Assessing if the agent's code modifications fulfill the required implementation components (measured via git diff comparison against ground truth requirements).
- E (Execution): Verifying if the agent's generated code is executable and produces expected outputs in a controlled environment.
- C (Conclusion Correctness): Judging if the agent's derived conclusion correctly answers the research question based on the experimental results.
Conjunctive metrics, such as I·E, All√ (D, I, and C all correct), and All·E√ (D, I, C, and E all correct), are used for a more rigorous assessment of end-to-end correctness; a sketch of how such scores can be aggregated appears below.
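The following Python sketch illustrates the aggregation, assuming a simple boolean grade per metric for each task; the data structures and function names are hypothetical, not the benchmark's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TaskGrade:
    """Boolean per-task grades from the LLM judge and execution validator (schema assumed)."""
    design_ok: bool       # D: experimental design aligns with the ground truth
    impl_ok: bool         # I: code modifications satisfy the implementation requirements
    executed_ok: bool     # E: generated code runs and produces the expected outputs
    conclusion_ok: bool   # C: conclusion correctly answers the research question

def conjunctive_scores(grades: List[TaskGrade]) -> Dict[str, float]:
    """Aggregate the conjunctive metrics (I*E, All, All*E) over a set of graded tasks."""
    n = max(len(grades), 1)
    return {
        "I*E":   sum(g.impl_ok and g.executed_ok for g in grades) / n,
        "All":   sum(g.design_ok and g.impl_ok and g.conclusion_ok for g in grades) / n,
        "All*E": sum(g.design_ok and g.impl_ok and g.conclusion_ok and g.executed_ok
                     for g in grades) / n,
    }
```

Because every component must hold simultaneously, a single failed stage zeroes out All·E√ for that task, which is why conjunctive scores fall so far below the individual metrics.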
Initial evaluations of leading agents, including OpenHands and IterativeAgent powered by various LLMs (like Claude-Sonnet 3.7, Haiku 3.5, Amazon Nova Pro, DeepSeek R1, OpenAI o3-mini), revealed significant limitations. While agents achieved partial scores (e.g., 20-35% correctness on individual design or implementation aspects), the success rate for completing a fully correct and executable experiment (All·E√) was a mere 0.5% for the top-performing configuration (OpenHands + o3-mini), and often 0% for others. This highlights major bottlenecks in current agents' ability to handle the complexities of real-world AI experimentation.
Detailed analysis of agent failures identified prevalent patterns across different phases:
- Design: Frequent errors included incomplete or misclassified experimental variables (16.05% prevalence) and the inclusion of extraneous procedural additions (7.62%).
- Implementation: The most common failure was omitting essential implementation components such as crucial code snippets (39.71%); less frequent issues included improper data preprocessing (1.83%) and incorrect evaluation metric implementation (2.15%).
- Execution: Failures were dominated by environment or dependency misconfigurations (29.38%) and script-level issues (23.84%) like unrecognized model names or missing files, indicating persistent reproducibility challenges.
- Conclusion: Agents often failed by providing missing or underdeveloped conclusions (26.18%) or through incorrect interpretation of results (19.66%).
The analysis also showed that conjunctive metrics significantly reduce agent scores compared to evaluating individual components, effectively revealing the brittleness of end-to-end performance. Cost-time analysis per task showed variations across agents but little direct correlation with overall performance, suggesting efficiency issues alongside correctness problems.
The authors discuss limitations, noting that EXP-Bench currently focuses primarily on the experimentation procedure and does not fully capture the broader, less structured aspects of the research lifecycle like initial ideation or literature review. Future work aims to leverage the dataset to train more capable AI agents, potentially using reinforcement learning with verifiable rewards to automate aspects of the research lifecycle and accelerate scientific discovery.
In conclusion, EXP-Bench provides a valuable benchmark and dataset for evaluating AI agents on realistic, end-to-end AI research experiments extracted from peer-reviewed literature and codebases. The evaluations highlight significant challenges and failure modes for current agents, emphasizing the need for further research and development to enable truly autonomous scientific experimentation in AI. The open-sourced EXP-Bench (arXiv:2505.24785) and its associated pipeline offer a resource for the community to track progress and guide future efforts in this area.