- The paper demonstrates that Curie achieves a 3.4x improvement over the strongest state-of-the-art AI agent baseline in correctly answering experimental questions.
- It details a dual-agent architecture in which an Architect agent and Technician agents are coordinated by an Experimental Rigor Engine to ensure reliability and interpretability.
- The introduced benchmark spans 46 tasks across four computer science domains, rigorously evaluating experiment design, setup, implementation alignment, and conclusion correctness.
Curie: Automating Scientific Experimentation with AI Agents
The paper "Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents" (2502.16069) introduces Curie, an AI agent framework designed to enhance the rigor and automation of scientific experimentation. Rigor is enforced through methodical control, reliability, and interpretability. Curie is evaluated using a novel benchmark derived from influential computer science research papers and open-source projects. The results demonstrate Curie's ability to automate complex experimentation tasks with significant improvements over existing state-of-the-art agents.
Core Components of Curie
Curie's architecture consists of two primary agent types, the Architect Agent and Technician Agents, mediated by an Experimental Rigor Engine. The high-level workflow (Figure 1) involves the Architect Agent designing experimental plans, while Technician Agents implement and execute these plans. The Experimental Rigor Engine, the innovation at the core of Curie, ensures rigor throughout the experimental process via three key modules:
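To make the division of labor concrete, here is a minimal Python sketch of this workflow under assumed interfaces; the class and method names (e.g. `Architect.design`, `RigorEngine.validate`) are illustrative, not Curie's actual API. The Architect proposes a plan, Technicians execute each step, and the rigor engine vets every result before it is accepted.

```python
# Minimal sketch of Curie's high-level workflow (hypothetical interfaces).
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    hypothesis: str
    steps: list[str] = field(default_factory=list)


class Architect:
    def design(self, question: str) -> ExperimentPlan:
        # In Curie this would involve an LLM; here we return a stub plan.
        return ExperimentPlan(
            hypothesis=f"Hypothesis for: {question}",
            steps=["set up environment", "run trials", "collect metrics"],
        )


class Technician:
    def execute(self, step: str) -> dict:
        # Placeholder for actually implementing and running one step.
        return {"step": step, "status": "ok", "output": ""}


class RigorEngine:
    def validate(self, result: dict) -> bool:
        # Placeholder check; the real rigor modules would re-verify the
        # setup, the outputs, and alignment with the plan.
        return result["status"] == "ok"


def run_experiment(question: str) -> list[dict]:
    architect, technician, rigor = Architect(), Technician(), RigorEngine()
    plan = architect.design(question)
    results = []
    for step in plan.steps:
        result = technician.execute(step)
        if not rigor.validate(result):
            # A failed check would be sent back for revision rather than
            # silently propagated downstream.
            continue
        results.append(result)
    return results


if __name__ == "__main__":
    print(run_experiment("Does batching improve throughput?"))
```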
Intra-Agent Rigor Module
The Intra-ARM ensures reliability within individual agents. Large-scale experiments are vulnerable to error propagation, particularly with LLM-based agents prone to hallucination. Intra-ARM therefore employs modular validation: dedicated validators continuously verify each stage of the experiment, including a setup validator that catches configuration errors before they can propagate (Figure 3).
Figure 3: Example errors that can be captured by the setup validator.
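The stage-by-stage validation pattern can be illustrated with a small, self-contained sketch; the validator names, config fields, and error messages below are assumptions for demonstration, not Curie's actual checks.

```python
# Illustrative sketch of the Intra-ARM idea: independent validators check
# each stage and report errors before results are allowed to propagate.
from typing import Callable

Validator = Callable[[dict], list[str]]


def setup_validator(state: dict) -> list[str]:
    """Flags setup problems, e.g. missing config fields, before execution."""
    errors = []
    for key in ("dataset", "model", "metric"):
        if key not in state.get("config", {}):
            errors.append(f"setup: missing required config field '{key}'")
    return errors


def result_validator(state: dict) -> list[str]:
    """Flags results that are absent or outside a plausible range."""
    value = state.get("result")
    if value is None:
        return ["result: no output produced"]
    if not (0.0 <= value <= 1.0):
        return [f"result: value {value} outside expected [0, 1] range"]
    return []


def validate_stage(state: dict, validators: list[Validator]) -> list[str]:
    """Runs every validator and aggregates errors instead of failing fast."""
    errors: list[str] = []
    for validator in validators:
        errors.extend(validator(state))
    return errors


if __name__ == "__main__":
    state = {"config": {"dataset": "gsm8k", "model": "gpt-4o"}, "result": 0.62}
    print(validate_stage(state, [setup_validator, result_validator]))
    # -> ["setup: missing required config field 'metric'"]
```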
Inter-Agent Rigor Module
The Inter-ARM ensures methodical control over agent coordination and resource efficiency. Traditional agentic conversational patterns are often unsuitable for rigorous experimentation; Inter-ARM instead structures the collaboration between the Architect, the Technicians, and the Intra-ARM so that every handoff remains controlled and verifiable.
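A hedged sketch of this kind of controlled coordination is shown below; the queue-based scheme, the `Task` fields, and the re-queue policy are illustrative assumptions, not the paper's actual protocol.

```python
# Sketch of controlled agent coordination: instead of free-form chat, work
# items move through an explicit queue, and each handoff is checked before
# it is accepted (names and fields are hypothetical).
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    owner: str          # "architect" or "technician"
    validated: bool = False


def coordinate(tasks: list[Task]) -> list[Task]:
    """Processes tasks in a controlled order, re-queuing failed handoffs."""
    queue = deque(tasks)
    completed: list[Task] = []
    while queue:
        task = queue.popleft()
        # Validation hook at the handoff point (stubbed as a trivial check).
        task.validated = len(task.description) > 0
        if task.validated:
            completed.append(task)
        else:
            # Invalid work goes back to the Architect for revision rather
            # than being silently dropped or passed along.
            queue.append(Task(description="revise: " + task.description,
                              owner="architect"))
    return completed


if __name__ == "__main__":
    plan = [Task("partition the dataset", "technician"),
            Task("run ablation over step counts", "technician")]
    for t in coordinate(plan):
        print(t)
```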
Experiment Knowledge Module
The Experiment Knowledge Module enhances interpretability by addressing the limitations of LLMs in knowledge management: LLMs can exhibit inconsistent recall and are prone to hallucination. The module integrates two mechanisms for maintaining structured, persistent records of the experimental process.
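One way to picture such record-keeping is an append-only, structured experiment log that can be replayed later; the JSONL schema and the `ExperimentLog` class below are assumptions for illustration, not Curie's actual format.

```python
# Minimal sketch of structured, persistent experiment records: every event
# is appended as one JSON line, so conclusions can be traced back to
# concrete runs instead of relying on an LLM's recall.
import json
import time
from pathlib import Path


class ExperimentLog:
    def __init__(self, path: str = "experiment_log.jsonl"):
        self.path = Path(path)

    def record(self, stage: str, detail: dict) -> None:
        """Appends one structured entry per experimental event."""
        entry = {"time": time.time(), "stage": stage, **detail}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def history(self) -> list[dict]:
        """Replays the full record for interpretability or reuse."""
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]


if __name__ == "__main__":
    log = ExperimentLog()
    log.record("plan", {"hypothesis": "more reasoning steps help"})
    log.record("run", {"config": {"steps": 4}, "accuracy": 0.71})
    print(log.history())
```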
Experimentation Benchmark and Evaluation
The paper introduces a novel experimentation benchmark composed of 46 tasks across four computer science domains (Table 1). These tasks are derived from real-world research papers and open-source projects. The benchmark assesses the agent's ability to formulate hypotheses, iteratively refine experiments, and rigorously validate results.
Tasks are designed to test how well an agent navigates multi-step experimentation, adapts to unexpected results, and maintains structured records over long-term iterative processes. The benchmark structures complexity along experiment-driven dimensions:
- Design Complexity
- Experiment Setup Complexity
- Relationship Complexity
- Experiment Goal Complexity
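A hypothetical task specification along these dimensions might look like the following sketch; the field names, the 1-3 difficulty scale, and the aggregate score are illustrative assumptions rather than the benchmark's actual schema (the example question is taken from Figure 7).

```python
# Hypothetical benchmark task record structured along the four complexity
# dimensions listed above.
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkTask:
    question: str
    domain: str                 # one of the four CS domains (illustrative value)
    design_complexity: int      # 1 (easy) .. 3 (hard), per dimension
    setup_complexity: int
    relationship_complexity: int
    goal_complexity: int

    def difficulty(self) -> int:
        """Simple aggregate used here only for sorting/illustration."""
        return (self.design_complexity + self.setup_complexity
                + self.relationship_complexity + self.goal_complexity)


if __name__ == "__main__":
    task = BenchmarkTask(
        question="Does the optimal number of reasoning steps vary across different LLMs?",
        domain="LLM reasoning",
        design_complexity=2, setup_complexity=2,
        relationship_complexity=3, goal_complexity=2,
    )
    print(task.difficulty())
```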
In the experiments, Curie is compared against two state-of-the-art AI agent frameworks: OpenHands [wang2024openhands] and Microsoft's Magentic [fourney2024magentic]. Performance is evaluated using four key metrics:
- Experiment Design
- Execution Setup
- Implementation Alignment
- Conclusion Correctness
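For illustration only, a per-task score card over these four metrics could be assembled as below; the 0-1 scale and the unweighted average are assumptions, not the paper's grading rubric.

```python
# Hypothetical per-task score card over the four evaluation metrics.
from statistics import mean


def score_task(design: float, setup: float,
               alignment: float, conclusion: float) -> dict:
    """Collects the four metric scores (each assumed in [0, 1]) plus an average."""
    scores = {
        "experiment_design": design,
        "execution_setup": setup,
        "implementation_alignment": alignment,
        "conclusion_correctness": conclusion,
    }
    scores["average"] = mean(scores.values())
    return scores


if __name__ == "__main__":
    print(score_task(design=0.9, setup=0.8, alignment=0.85, conclusion=1.0))
```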
Figure 6: Average scores across different complexity dimensions at varying difficulty levels for Curie, OpenHands, and Magentic. Curie outperforms the others consistently, with performance generally dropping as complexity increases.
Empirical results demonstrate that Curie achieves a 3.4x improvement in correctly answering experimental questions compared to the strongest baseline.

Figure 7: Question 6: “Does the optimal number of reasoning steps vary across different LLMs?”
Figure 8: Average alignment scores across different complexity dimensions at varying difficulty levels for Curie, OpenHands, and Magentic. Curie outperforms the others consistently, with performance generally dropping as complexity increases.
Figure 9: Average conclusion scores across different complexity dimensions at varying difficulty levels for Curie, OpenHands, and Magentic. Curie outperforms the others consistently, with performance generally dropping as complexity increases.
Figure 10: Average design scores across different complexity dimensions at varying difficulty levels for Curie, OpenHands, and Magentic. Curie outperforms the others consistently, with performance generally dropping as complexity increases.
Figure 11: Curie overview.
Figure 12: Case Study. Curie can help researchers validate, expand, and critique existing research on the benefits of repeated sampling in LLM reasoning [brown2024large].
Conclusion
Curie represents a significant advancement toward rigorous and automated scientific experimentation. The Experimental Rigor Engine ensures methodical control, reliability, and interpretability throughout the experimentation process. The introduced Experimentation Benchmark provides a means to evaluate AI agents on real-world research-level challenges. Empirical results demonstrate Curie's capability to automate rigorous experimentation, outperforming state-of-the-art AI agents. Future work may include adapting Curie for interdisciplinary research, accommodating domain-specific methodologies and uncertainty control, and enabling knowledge reuse across experiments.