Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery (2507.07257v2)

Published 9 Jul 2025 in cs.AI, astro-ph.IM, cs.CL, and cs.MA

Abstract: We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 LLM agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

Summary

The paper introduces cmbagent, a multi-agent system leveraging around 30 specialized LLM agents to autonomously decompose and execute complex scientific tasks using a robotics-inspired planning and control strategy.
The system employs context and retrieval-augmented agents to overcome LLM limitations, significantly enhancing performance on domain-specific benchmarks compared to baseline models.
Empirical evaluations demonstrate robust, cost-efficient analysis with publication-quality outputs in cosmology tasks, achieving success rates up to 78% in iterative planning and execution.

Open Source Multi-Agent Planning & Control for Autonomous Scientific Discovery

System Architecture and Agentic Workflow

The paper introduces cmbagent, an open-source, fully autonomous multi-agent system designed to automate complex scientific research tasks, with a focus on quantitative domains such as cosmology. The architecture leverages approximately 30 specialized LLM agents, each responsible for distinct subtasks including information retrieval, code generation, result interpretation, and peer critique. The orchestration of these agents is governed by a robotics-inspired Planning & Control (P&C) strategy, implemented atop the AG2 agent framework.

The P&C workflow is bifurcated into two phases:

Planning Phase: The system receives a user-specified main task and decomposes it into a sequence of structured subtasks. This decomposition is performed by a planner agent, with iterative feedback from a plan reviewer agent. The process is strictly agentic—no human-in-the-loop intervention occurs. The finalized plan is serialized and stored for execution.
Control Phase: The controller agent dispatches subtasks to domain-specialized agents (e.g., researcher, engineer). Execution is monitored, with context and outputs from each step propagated to subsequent agents via context injection. Code execution is performed locally, and failures trigger automated retries, package installation, or session termination based on configurable thresholds.

This architecture enables robust, traceable, and cost-efficient agentic workflows, with session memory managed through selective context retention and chat history truncation to optimize LLM token usage and API costs.

Figure 2: High-level Planning & Control workflow in cmbagent, illustrating agent roles and transitions during task decomposition and execution.

Domain-Specific Context and Retrieval-Augmented Agents

A key innovation is the deployment of context agents and RAG agents to overcome LLM context window and knowledge limitations:

Context Agents: For critical scientific libraries (e.g., camb, class), agents are instantiated with extended, up-to-date documentation injected into their system prompts. This is achieved via automated Markdown extraction from Read the Docs builds or manual curation, ensuring agents have direct access to evolving API and theoretical details.
RAG Agents: For tasks requiring access to large corpora (e.g., hundreds of papers or extensive codebases), agents employ vector-embedding-based retrieval to dynamically augment their context with relevant information. This hybridizes semantic search with LLM reasoning, enabling scalable, low-latency access to domain knowledge.

Empirical evaluation demonstrates that context agents significantly outperform baseline LLMs on domain-specific benchmarks, particularly for advanced tasks where generalist models systematically fail.

Figure 4: Example output from cmbagent's cosmology research task—distance modulus vs. redshift for the Union2.1 SNe Ia sample, with best-fit model overlay.

Evaluation on Scientific and Data Science Benchmarks

The system is benchmarked on both domain-specific and general data science tasks:

Cosmology Task: cmbagent autonomously performs cosmological parameter inference using the Union2.1 Type Ia supernovae dataset, including data ingestion, model specification, MCMC sampling, and result interpretation. The system produces publication-quality plots and statistical summaries without human intervention.
Figure 6: Posterior distributions for $H_0$ and $\Omega_\Lambda$ from the autonomous supernova analysis, demonstrating credible parameter estimation.
DS-1000 Benchmark: On a subset of the DS-1000 code generation benchmark (pandas, numpy, matplotlib), the Planning & Control mode achieves a success rate of 78%, compared to 66% for a one-shot baseline. This demonstrates the efficacy of iterative planning and agentic feedback for complex, multi-step problems.
Context Agent Ablation: On 14 cosmology problems, the camb context agent (with gemini-2.5-pro) outperforms gpt-4o, gpt-4.1, and gemini-2.5-pro without context, especially on advanced tasks where baseline models fail completely.

Integration, Distribution, and User Interfaces

cmbagent is distributed as open-source software (GitHub, PyPI), with containerization support (Docker) and a web-based GUI (HuggingFace Spaces). The GUI exposes multiple operational modes: Planning & Control, One Shot, and Human-in-the-Loop, catering to varying task complexity and user oversight requirements.

Figure 7: GUI welcome page, providing access to different agentic workflow modes.

Figure 8: Planning & Control interface, allowing users to specify tasks and monitor agentic execution.

End-to-End Autonomous Research and Manuscript Generation

cmbagent is integrated as the backend for the denario project, a multi-agent system for fully autonomous scientific research. In denario, agentic workflows generate research ideas, methodologies, execute experiments, and synthesize results into publication-ready manuscripts, including automated literature search and citation management.

Figure 10: denario GUI results page, displaying outputs from an end-to-end autonomous research session.

Implementation Considerations and Trade-offs

LLM Backend: cmbagent supports multiple LLM providers (OpenAI, Google, Anthropic), with agent selection and context size tailored to task requirements and cost constraints.
Session Management: Context injection and chat history truncation are used to balance memory retention and API cost, typically halving session expenses compared to naive approaches.
Code Execution: All code is executed locally, with robust error handling, package installation, and retry logic. This enables reproducibility and traceability, but requires careful sandboxing for security.
Scalability: The modular agent design and containerization facilitate deployment in both local and cloud environments, supporting batch and interactive workloads.

Implications and Future Directions

The results demonstrate that multi-agent LLM systems, when equipped with domain-specific context and robust orchestration, can autonomously perform complex scientific analyses at or above the level of human-in-the-loop workflows. The ability to generate research-quality outputs—including parameter inference, visualization, and manuscript drafting—suggests a paradigm shift in scientific automation.

Key implications include:

Scalability and Reproducibility: Automated workflows can be scaled across domains, increasing throughput and standardizing analysis pipelines.
Human Displacement and Oversight: The removal of the human-in-the-loop raises concerns about automation bias, error propagation, and the need for rigorous validation and oversight.
Ethical and Environmental Impact: Increased automation may exacerbate issues related to resource consumption, transparency, and the societal role of scientific expertise.

Future work will likely focus on:

Extending agentic frameworks to multi-modal and experimental domains (e.g., automated laboratories).
Enhancing agent collaboration and self-critique for improved robustness.
Developing standardized benchmarks and evaluation protocols for autonomous scientific agents.
Addressing ethical, legal, and environmental challenges associated with large-scale scientific automation.

Conclusion

cmbagent represents a significant advance in the automation of scientific research, demonstrating that multi-agent LLM systems can autonomously plan, execute, and interpret complex quantitative tasks. The integration of domain-specific context, retrieval-augmented reasoning, and robust orchestration enables performance that surpasses state-of-the-art LLMs on both domain-specific and general benchmarks. The system's open-source distribution and modular design position it as a foundation for future developments in autonomous scientific discovery, while also highlighting the urgent need for responsible governance and oversight as such technologies mature.

PDF Markdown

Follow-up Questions

Related Papers

Authors (27)

First 10 authors:

Tweets

alphaXiv

Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery (21 likes, 0 questions)