
Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents (2502.16069v2)

Published 22 Feb 2025 in cs.AI and cs.LG

Abstract: Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of LLMs in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers, and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.

Summary

  • The paper introduces Curie, an AI agent framework that automates scientific experimentation with rigor enforced by intra-agent, inter-agent, and knowledge modules.
  • A new benchmark of 46 computer science experimentation tasks was developed to evaluate AI agent performance under rigorous conditions.
  • Empirical evaluation showed Curie achieved a 3.4x improvement over the strongest baseline on the new experimentation benchmark.

The paper introduces Curie, an AI agent framework designed to automate scientific experimentation with rigor. Rigor is achieved through three modules: an intra-agent rigor module for reliability, an inter-agent rigor module for methodical control, and an experiment knowledge module for interpretability. The framework is evaluated on a novel benchmark of 46 questions across four computer science domains, demonstrating a 3.4× improvement over the strongest baseline.

The authors observe that scientific experimentation transforms curiosity into verifiable knowledge, relying on both creativity and rigor. Existing approaches leveraging LLMs to automate scientific research are limited in their ability to support rigorous experimentation. Rigorous experimentation involves formulating hypotheses, designing experiments, executing controlled trials, and analyzing results. To address these limitations, the paper proposes Curie.

Curie accepts an experimental question and relevant context as input. The Architect Agent generates high-level experimental plans, coordinates the process, and reflects on findings. Technician Agents implement and execute controlled experiments. The Experimental Rigor Engine includes:

  • Intra-Agent Rigor Module (Intra-ARM): Secures reliability within individual agents by enforcing extensible rigor policies.
  • Inter-Agent Rigor Module (Inter-ARM): Maintains methodical control over agent coordination, ensuring correct task transitions and efficient task scheduling.
  • Experiment Knowledge Module: Enhances interpretability by maintaining well-structured documentation.
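
To make the division of responsibilities concrete, here is a minimal sketch of how the three rigor-engine modules could expose their core operations. All class names, methods, and the toy stage names are hypothetical illustrations, not the actual Curie API:

```python
class IntraAgentRigorModule:
    """Reliability: validate an individual agent's output against rigor policies."""

    def __init__(self, policies):
        self.policies = policies  # extensible list of callables: artifact -> bool

    def validate(self, artifact) -> bool:
        return all(policy(artifact) for policy in self.policies)


class InterAgentRigorModule:
    """Methodical control: only allow legal transitions between experiment stages."""

    ALLOWED = {("design", "setup"), ("setup", "execute"), ("execute", "analyze")}

    def can_transition(self, current: str, nxt: str) -> bool:
        return (current, nxt) in self.ALLOWED


class ExperimentKnowledgeModule:
    """Interpretability: keep an append-only, structured record of progress."""

    def __init__(self):
        self.log = []

    def record(self, stage: str, entry: dict) -> None:
        self.log.append({"stage": stage, **entry})


control = InterAgentRigorModule()
print(control.can_transition("design", "setup"))    # True
print(control.can_transition("design", "analyze"))  # False: skipping stages is blocked
```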

The paper focuses on computer science research problems using LLM-friendly interfaces. The authors introduce an Experimentation Benchmark comprising 46 tasks across multiple computer science domains, derived from influential research papers and open-source projects.

The paper describes scientific experimentation as a three-stage process: experimental design, experiment execution, and data documentation and analysis. Rigor is grounded in three core principles: methodical procedure, reliability, and interpretability.

Related work has leveraged AI to accelerate scientific discovery, focusing on literature reviews, brainstorming, hypothesis generation, and data analysis. Existing agents for end-to-end scientific research rely on ad-hoc prompts and predefined workflows but lack systematic enforcement of methodical procedure, reliability, and interpretability. A wide range of benchmarks have been developed to assess the capabilities of AI agents across diverse domains, but the authors argue that these benchmarks do not capture the iterative hypothesis refinement, complex experiment setup and execution, and robust result interpretation required for experimentation.

Curie comprises an Architect Agent, Technician Agents, and the Experimental Rigor Engine. Given an experimental question, the Architect designs high-level experimental plans. The Inter-ARM intercepts and enforces methodical procedure. The Intra-ARM validates partition integrity. The Technician then sets up the controlled experiment. All components use the Experiment Knowledge Module for storing and tracking experimental progress.
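
The workflow described above could be wired together roughly as in the following driver. This is a heavily simplified, hypothetical sketch; in Curie the ordering is enforced by the rigor engine rather than a hand-written loop, and all function names are invented for illustration:

```python
def architect_design(question: str) -> dict:
    """Architect: draft a high-level plan; Inter-ARM would split it into partitions."""
    return {"question": question, "partitions": [{"id": 1, "setup": "baseline vs. variant"}]}

def intra_arm_ok(partition: dict) -> bool:
    """Intra-ARM: check partition integrity before it is handed to a Technician."""
    return bool(partition.get("setup"))

def technician_run(partition: dict) -> dict:
    """Technician: set up and execute the controlled experiment for one partition."""
    return {"partition": partition["id"], "observation": "placeholder result"}

def run_experiment(question: str) -> list[dict]:
    plan = architect_design(question)
    knowledge: list[dict] = []                       # Experiment Knowledge Module, simplified
    for partition in plan["partitions"]:             # Inter-ARM schedules these partitions
        if not intra_arm_ok(partition):              # rigor gate before execution
            continue
        knowledge.append(technician_run(partition))  # progress recorded for interpretability
    return knowledge

print(run_experiment("Does request batching improve LLM serving throughput?"))
```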

Intra-ARM verifies the tasks of the Architect and Technicians step by step, using modular validation. The Experimental Setup Validator verifies that the experimental setup aligns with the plan before execution. The Execution Validator enhances reproducibility by executing the setup in a controlled environment.
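
As one illustration of how the two validators might operate (a hypothetical sketch, not Curie's implementation): the setup validator checks that what the Technician built matches the plan, and the execution validator re-runs the setup in an isolated child process to confirm it completes cleanly.

```python
import subprocess
import sys


def validate_setup(plan_requirements: set[str], setup_files: set[str]) -> bool:
    """Setup validator: every artifact the plan requires must exist in the setup."""
    missing = plan_requirements - setup_files
    return not missing


def validate_execution(script: str, timeout_s: int = 60) -> bool:
    """Execution validator: run the setup in a controlled child process and check
    that it completes cleanly, which also makes the run easier to reproduce."""
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        timeout=timeout_s,
    )
    return proc.returncode == 0


# Example usage with toy inputs.
print(validate_setup({"train.py", "config.yaml"}, {"train.py", "config.yaml", "env.sh"}))
print(validate_execution("print('controlled trial ran')"))
```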

Inter-ARM enables collaboration between the Architect, Technicians, and Intra-ARM through fine-grained plan partitioning, control flow enforcement, and partition scheduling. Inter-ARM breaks complex experimental plans into smaller, independent partitions, which facilitates modular execution and enables parallelization. Control flow enforcement ensures that transitions follow a logical sequence. The scheduler utilizes partition execution priorities, allowed partition state transitions, and agent availability.
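
A compact way to picture control-flow enforcement and scheduling (the states, priorities, and partition names below are chosen only for illustration, not taken from the paper): partitions move through a fixed set of allowed states, and a priority queue picks the next runnable partition.

```python
import heapq

# Allowed partition state transitions (illustrative, not Curie's actual states).
TRANSITIONS = {
    "pending": {"running"},
    "running": {"validated", "failed"},
    "failed": {"pending"},  # a failed partition may be retried
    "validated": set(),
}

def transition(state: str, new_state: str) -> str:
    """Reject any transition that does not follow the allowed sequence."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# Priority scheduler: lower number = higher priority.
queue = []
for priority, partition_id in [(2, "ablation"), (1, "baseline"), (3, "scaling")]:
    heapq.heappush(queue, (priority, partition_id))

while queue:
    _, partition_id = heapq.heappop(queue)
    state = transition("pending", "running")
    state = transition("running", "validated")
    print(partition_id, state)
```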

The Experiment Knowledge Module addresses two challenges: inconsistent reads and inconsistent writes. The module uses structured knowledge reads and tiered write access to organize experimental progress, transforming the experimental plan and process into a structured and enriched format. Subsequent modifications are recorded to maintain a DAG-like history of changes. The tiered write access policy restricts and validates updates, ensuring that components can only modify the portions of the plan they are responsible for, while all changes undergo rigorous validation.
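
A minimal sketch of the two ideas in this paragraph (illustrative only; the ownership map and plan fields are invented): each component may write only to the fields it owns, and every accepted change is appended with a pointer to its parent version, yielding a DAG-like history of the plan.

```python
# Which plan fields each component is allowed to modify (illustrative ownership map).
WRITE_ACCESS = {
    "architect": {"hypotheses", "partitions"},
    "technician": {"setup", "results"},
}

history = [{"version": 0, "parents": [], "plan": {"hypotheses": [], "setup": None}}]

def write(component: str, field: str, value, parent_version: int) -> int:
    """Apply an access-controlled update and record it as a new version in the history."""
    if field not in WRITE_ACCESS[component]:
        raise PermissionError(f"{component} may not modify '{field}'")
    new_plan = dict(history[parent_version]["plan"])
    new_plan[field] = value
    history.append({"version": len(history), "parents": [parent_version], "plan": new_plan})
    return len(history) - 1

v1 = write("architect", "hypotheses", ["batching improves throughput"], 0)
v2 = write("technician", "setup", "serving stack with a fixed request trace", v1)
print(history[v2])
```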

The authors designed a benchmark to stress test Curie's ability to automate experiments while enforcing rigor. The benchmark consists of 46 tasks across 4 domains within computer science. Tasks are structured as a full experimental process, requiring hypothesis formation, iterative refinement, and rigorous validation. Each task specifies the experiment question, practical constraints, and high-level setup requirements. Experimental context includes domain knowledge and starter code. Ground truth is defined in experimental design and result analysis. The benchmark structures complexity along experiment-driven dimensions, including design complexity, experiment setup complexity, relationship complexity, and experiment goal complexity.
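
To make the task structure concrete, here is a hypothetical example of what one benchmark entry might contain. The fields mirror the description above, but the specific question and values are invented and not taken from the benchmark:

```python
# Illustrative task specification; the field names follow the description above,
# the concrete content is made up for this sketch.
task = {
    "question": "Does increasing batch size reduce per-token latency for this serving stack?",
    "constraints": ["single GPU", "fixed request trace", "limited experiment budget"],
    "setup_requirements": "Compare at least three batch sizes under identical load.",
    "context": {
        "domain_knowledge": "Throughput typically rises with batch size until memory saturates.",
        "starter_code": "serve.py",
    },
    "ground_truth": {
        "experimental_design": "Controlled comparison varying only the batch size.",
        "result_analysis": "Latency/throughput trade-off curve with a saturation point.",
    },
    "complexity": {
        "design": "medium",
        "setup": "medium",
        "relationship": "simple",
        "goal": "analysis",
    },
}

print(task["question"])
```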

The authors compare Curie with OpenHands and Microsoft's Magentic-One, using GPT-4o as the underlying LLM. Performance is assessed using four key metrics: experiment design, execution setup, implementation alignment, and conclusion correctness. An LLM judge handles the metrics that are straightforward to verify, while implementation alignment is assessed manually.
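
One way such an evaluation loop could be organized (an illustrative sketch; the actual judging prompts and rubric are defined by the authors): each metric is scored by a judge function, which here is a trivial stand-in for an LLM call.

```python
METRICS = [
    "experiment_design",
    "execution_setup",
    "implementation_alignment",
    "conclusion_correctness",
]

def llm_judge(metric: str, agent_output: str, ground_truth: str) -> bool:
    """Stand-in for an LLM judge: in practice this would prompt a model to compare
    the agent's output against the ground truth for one metric."""
    return bool(ground_truth) and ground_truth.lower() in agent_output.lower()  # toy check

def score(agent_output: str, ground_truth: dict) -> dict:
    return {m: llm_judge(m, agent_output, ground_truth.get(m, "")) for m in METRICS}

print(score(
    "Controlled comparison varying only batch size; latency saturates past batch 32.",
    {"experiment_design": "varying only batch size", "conclusion_correctness": "saturates"},
))
```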

The results show that Curie consistently outperforms the baselines across all four metrics. For experiment design correctness, all frameworks perform well. For execution setup and implementation alignment, Curie demonstrates higher reliability. For conclusion correctness, Curie maintains a strong lead. The authors also provide a performance breakdown by domain and by complexity, observing that increasing complexity along any dimension correlates with a decline in performance for all agents.

In conclusion, the authors introduce Curie, an AI agent framework designed to automate and enhance the rigor of scientific experimentation. The Experimental Rigor Engine enforces methodical control, reliability, and interpretability. The empirical evaluation demonstrates Curie's capability to automate rigorous experimentation.
