Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery (2507.07257v2)
Abstract: We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 LLM agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
Summary
- The paper introduces cmbagent, a multi-agent system leveraging around 30 specialized LLM agents to autonomously decompose and execute complex scientific tasks using a robotics-inspired planning and control strategy.
- The system employs context and retrieval-augmented agents to overcome LLM limitations, significantly enhancing performance on domain-specific benchmarks compared to baseline models.
- Empirical evaluations demonstrate robust, cost-efficient analysis with publication-quality outputs in cosmology tasks, achieving success rates up to 78% in iterative planning and execution.
Open Source Multi-Agent Planning & Control for Autonomous Scientific Discovery
System Architecture and Agentic Workflow
The paper introduces cmbagent, an open-source, fully autonomous multi-agent system designed to automate complex scientific research tasks, with a focus on quantitative domains such as cosmology. The architecture leverages approximately 30 specialized LLM agents, each responsible for distinct subtasks including information retrieval, code generation, result interpretation, and peer critique. The orchestration of these agents is governed by a robotics-inspired Planning & Control (P&C) strategy, implemented atop the AG2 agent framework.
The P&C workflow is bifurcated into two phases:
- Planning Phase: The system receives a user-specified main task and decomposes it into a sequence of structured subtasks. This decomposition is performed by a planner agent, with iterative feedback from a plan reviewer agent. The process is strictly agentic—no human-in-the-loop intervention occurs. The finalized plan is serialized and stored for execution.
- Control Phase: The controller agent dispatches subtasks to domain-specialized agents (e.g., researcher, engineer). Execution is monitored, with context and outputs from each step propagated to subsequent agents via context injection. Code execution is performed locally, and failures trigger automated retries, package installation, or session termination based on configurable thresholds.
This architecture enables robust, traceable, and cost-efficient agentic workflows, with session memory managed through selective context retention and chat history truncation to optimize LLM token usage and API costs.
Figure 2: High-level Planning & Control workflow in cmbagent, illustrating agent roles and transitions during task decomposition and execution.
Domain-Specific Context and Retrieval-Augmented Agents
A key innovation is the deployment of context agents and RAG agents to overcome LLM context window and knowledge limitations:
- Context Agents: For critical scientific libraries (e.g.,
camb
,class
), agents are instantiated with extended, up-to-date documentation injected into their system prompts. This is achieved via automated Markdown extraction from Read the Docs builds or manual curation, ensuring agents have direct access to evolving API and theoretical details. - RAG Agents: For tasks requiring access to large corpora (e.g., hundreds of papers or extensive codebases), agents employ vector-embedding-based retrieval to dynamically augment their context with relevant information. This hybridizes semantic search with LLM reasoning, enabling scalable, low-latency access to domain knowledge.
Empirical evaluation demonstrates that context agents significantly outperform baseline LLMs on domain-specific benchmarks, particularly for advanced tasks where generalist models systematically fail.
Figure 4: Example output from cmbagent's cosmology research task—distance modulus vs. redshift for the Union2.1 SNe Ia sample, with best-fit model overlay.
Evaluation on Scientific and Data Science Benchmarks
The system is benchmarked on both domain-specific and general data science tasks:
- Cosmology Task: cmbagent autonomously performs cosmological parameter inference using the Union2.1 Type Ia supernovae dataset, including data ingestion, model specification, MCMC sampling, and result interpretation. The system produces publication-quality plots and statistical summaries without human intervention.
Figure 6: Posterior distributions for H0 and ΩΛ from the autonomous supernova analysis, demonstrating credible parameter estimation.
- DS-1000 Benchmark: On a subset of the DS-1000 code generation benchmark (pandas, numpy, matplotlib), the Planning & Control mode achieves a success rate of 78%, compared to 66% for a one-shot baseline. This demonstrates the efficacy of iterative planning and agentic feedback for complex, multi-step problems.
- Context Agent Ablation: On 14 cosmology problems, the camb context agent (with gemini-2.5-pro) outperforms gpt-4o, gpt-4.1, and gemini-2.5-pro without context, especially on advanced tasks where baseline models fail completely.
Integration, Distribution, and User Interfaces
cmbagent is distributed as open-source software (GitHub, PyPI), with containerization support (Docker) and a web-based GUI (HuggingFace Spaces). The GUI exposes multiple operational modes: Planning & Control, One Shot, and Human-in-the-Loop, catering to varying task complexity and user oversight requirements.
Figure 7: GUI welcome page, providing access to different agentic workflow modes.

Figure 8: Planning & Control interface, allowing users to specify tasks and monitor agentic execution.
End-to-End Autonomous Research and Manuscript Generation
cmbagent is integrated as the backend for the denario project, a multi-agent system for fully autonomous scientific research. In denario, agentic workflows generate research ideas, methodologies, execute experiments, and synthesize results into publication-ready manuscripts, including automated literature search and citation management.
Figure 10: denario GUI results page, displaying outputs from an end-to-end autonomous research session.
Implementation Considerations and Trade-offs
- LLM Backend: cmbagent supports multiple LLM providers (OpenAI, Google, Anthropic), with agent selection and context size tailored to task requirements and cost constraints.
- Session Management: Context injection and chat history truncation are used to balance memory retention and API cost, typically halving session expenses compared to naive approaches.
- Code Execution: All code is executed locally, with robust error handling, package installation, and retry logic. This enables reproducibility and traceability, but requires careful sandboxing for security.
- Scalability: The modular agent design and containerization facilitate deployment in both local and cloud environments, supporting batch and interactive workloads.
Implications and Future Directions
The results demonstrate that multi-agent LLM systems, when equipped with domain-specific context and robust orchestration, can autonomously perform complex scientific analyses at or above the level of human-in-the-loop workflows. The ability to generate research-quality outputs—including parameter inference, visualization, and manuscript drafting—suggests a paradigm shift in scientific automation.
Key implications include:
- Scalability and Reproducibility: Automated workflows can be scaled across domains, increasing throughput and standardizing analysis pipelines.
- Human Displacement and Oversight: The removal of the human-in-the-loop raises concerns about automation bias, error propagation, and the need for rigorous validation and oversight.
- Ethical and Environmental Impact: Increased automation may exacerbate issues related to resource consumption, transparency, and the societal role of scientific expertise.
Future work will likely focus on:
- Extending agentic frameworks to multi-modal and experimental domains (e.g., automated laboratories).
- Enhancing agent collaboration and self-critique for improved robustness.
- Developing standardized benchmarks and evaluation protocols for autonomous scientific agents.
- Addressing ethical, legal, and environmental challenges associated with large-scale scientific automation.
Conclusion
cmbagent represents a significant advance in the automation of scientific research, demonstrating that multi-agent LLM systems can autonomously plan, execute, and interpret complex quantitative tasks. The integration of domain-specific context, retrieval-augmented reasoning, and robust orchestration enables performance that surpasses state-of-the-art LLMs on both domain-specific and general benchmarks. The system's open-source distribution and modular design position it as a foundation for future developments in autonomous scientific discovery, while also highlighting the urgent need for responsible governance and oversight as such technologies mature.
Follow-up Questions
- How does the cmbagent system coordinate its multi-agent workflow without human intervention?
- What benefits does the robotics-inspired Planning & Control strategy offer in complex scientific tasks?
- In what ways do context and RAG agents improve domain-specific performance in cmbagent?
- How are error handling and session management implemented to ensure reproducible scientific analyses?
- Find recent papers about autonomous scientific discovery.
Related Papers
- Emergent autonomous scientific research capabilities of large language models (2023)
- Agent Laboratory: Using LLM Agents as Research Assistants (2025)
- AgentRxiv: Towards Collaborative Autonomous Research (2025)
- The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (2025)
- From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery (2025)
- InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification (2025)
- MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research (2025)
- ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (2025)
- AI-Researcher: Autonomous Scientific Innovation (2025)
- Deep Research Agents: A Systematic Examination And Roadmap (2025)
Authors (27)
Tweets
alphaXiv
- Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery (21 likes, 0 questions)