Cmbagent: Autonomous Scientific Workflow

Updated 30 July 2025

Cmbagent is a multi-agent system designed for fully autonomous, end-to-end scientific research in cosmology and astrophysics, integrating diverse LLM experts.
The system employs specialized agents for planning, retrieval, code generation, execution, and critique, ensuring robust error recovery and scalable performance.
Demonstrated in precise cosmological parameter estimation, cmbagent sets a new paradigm in agent-based scientific discovery with open-source benchmarks.

Cmbagent is a multi-agent system designed for fully autonomous, end-to-end scientific research workflows in cosmology and astrophysics. It orchestrates specialized LLM agents (each expert in retrieval, code generation, reasoning, execution, or critique) to analyze datasets, perform model inference, generate code and plots, and synthesize results at a level commensurate with advanced research tasks. The system operates without human intervention, executing experiments such as cosmological parameter estimation and surpassing the performance of single-LLM baselines and one-shot agent frameworks (Xu et al., 9 Jul 2025, Laverick et al., 30 Nov 2024). With open-source deployment and rigorous benchmarking, cmbagent exemplifies a new paradigm of agent-based scientific discovery.

1. Multi-Agent System Architecture

The cmbagent system encompasses approximately 30 interacting LLM agents, each specializing in sub-tasks spanning planning, retrieval, coding, execution, critique, formatting, and logging (Xu et al., 9 Jul 2025, Laverick et al., 30 Nov 2024). The architecture is hierarchically organized:

Planner Agents: Generate decomposed, multi-step research plans from the main scientific task. Each plan details discrete sub-tasks, the agent assignment, required input/output formats, and expected artifacts (e.g., code, plots).
Reviewer Agents: Critique and refine plans, ensuring stepwise correctness and plausibility.
Researcher and Retrieval Agents: Perform retrieval-augmented generation (RAG) over curated repositories of scientific papers, code documentation, and tutorials. These agents are further specialized—e.g., experimental data RAG, theory RAG, software RAG, and a memory agent storing past dialogue for error avoidance.
Engineer and Executor Agents: Engineer agents generate and refine domain-specific code (e.g., for data analysis, MCMC sampling), and executor agents run this code locally, capturing output and exceptions.
Critiquer and Formatter Agents: Critique scientific interpretations, evaluate results consistency, and format outputs (plots, JSON, Markdown) for downstream consumption. Recorder agents archive agentic transitions and session metadata for reproducibility.

Inter-agent communication is tightly controlled and predominantly one-way, following a strict protocol: context variables are carried forward, while prior chat histories are expunged when the session moves to the next plan step. This reset strategy, analogous to modular planning and control frameworks in robotics, preserves the context each step needs while preventing memory bloat (Xu et al., 9 Jul 2025).
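The handoff rule described above (context variables persist, chat history does not) can be illustrated with a small sketch. This is not cmbagent's implementation: `StepResult`, `run_step`, and `run_plan` are hypothetical names, and the "agent" is a stub that records what it was asked instead of calling an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    """Output of one plan step: updated shared context plus that step's chat log."""
    context: dict
    chat_history: list = field(default_factory=list)


def run_step(agent_name, instruction, context):
    """Hypothetical stand-in for an LLM agent call; it only records what it saw."""
    history = [f"{agent_name} received: {instruction}"]
    new_context = dict(context)  # copy so earlier steps are not mutated
    new_context[f"{agent_name}_output"] = f"result of '{instruction}'"
    return StepResult(context=new_context, chat_history=history)


def run_plan(plan, initial_context):
    """Run steps in order, carrying context forward and dropping each chat history."""
    context = dict(initial_context)
    for agent_name, instruction in plan:
        result = run_step(agent_name, instruction, context)
        context = result.context  # context variables survive the handoff
        # result.chat_history is not passed on, so the next agent starts clean
    return context


plan = [
    ("researcher", "find the Union2.1 data format"),
    ("engineer", "write a parser for the table"),
]
final_context = run_plan(plan, {"main_task": "fit flat LCDM"})
print(sorted(final_context))
# ['engineer_output', 'main_task', 'researcher_output']
```

Keeping only a compact context dictionary between steps bounds the prompt size per step, at the cost of requiring each agent to write everything downstream steps need into that dictionary.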

2. Planning & Control Strategy

Combining a planning phase with a control (execution) phase, cmbagent employs a formal Planning & Control approach. Upon receipt of a user main task, a planner agent decomposes the task into explicit, atomic steps—often guided by templates derived from robotics and autonomous systems (Xu et al., 9 Jul 2025). Each plan step is:

Proposed and reviewed (via plan reviewer agent), with feedback loops executed for up to $n_\text{reviews}$ refinement rounds.
Passed to a controller agent that sequentially dispatches the steps, ensuring agents receive the correct context, code history, and instructions. Upon completion, successful outputs are formatted and archived, while failed executions invoke retries (up to $n_\text{fails}$) or session termination according to error-handling policy.

This design supports robust error recovery, maintains session determinism, and enables complex task orchestration such as nested code execution, documentation retrieval, and output validation. Notably, session transitions—e.g., from retrieval to coding to execution—are explicitly recorded and inspectable via structured logs.
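The review and retry logic of this section can be sketched as two small loops. The sketch rests on assumptions that the source does not spell out: `review_plan`, `control`, `toy_reviewer`, and `toy_execute` are hypothetical names, and `n_fails` is interpreted here as the maximum number of failed attempts allowed per step before the session terminates.

```python
def review_plan(plan, reviewer, n_reviews):
    """Let the reviewer revise the plan for up to n_reviews rounds."""
    for _ in range(n_reviews):
        revised = reviewer(plan)
        if revised == plan:  # reviewer has no further changes
            break
        plan = revised
    return plan


def control(plan, execute, n_fails):
    """Dispatch steps in order; stop the session once a step has failed n_fails times."""
    log = []
    for step in plan:
        for attempt in range(1, n_fails + 1):
            try:
                output = execute(step, attempt)
            except Exception as exc:
                log.append((step, attempt, f"failed: {exc}"))
            else:
                log.append((step, attempt, output))
                break  # step succeeded, move on to the next one
        else:
            # every allowed attempt failed: terminate the session
            log.append((step, None, "terminated"))
            return log, False
    return log, True


def toy_reviewer(plan):
    # Hypothetical reviewer: insists that the plan ends with a validation step.
    return plan if plan[-1] == "validate" else plan + ["validate"]


def toy_execute(step, attempt):
    # Hypothetical executor: the "code" step fails once, then succeeds.
    if step == "code" and attempt == 1:
        raise RuntimeError("NameError in generated script")
    return f"{step} ok"


plan = review_plan(["retrieve", "code"], toy_reviewer, n_reviews=2)
log, ok = control(plan, toy_execute, n_fails=3)
for entry in log:
    print(entry)
print("session succeeded:", ok)
```

The returned `log` plays the role of the structured, inspectable record of transitions mentioned above: each entry names the step, the attempt number, and the outcome.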

3. Autonomous Scientific Analysis: Example Application

Cmbagent's capabilities were demonstrated through a PhD-level cosmological parameter analysis using the Union2.1 Type Ia supernova compilation (Xu et al., 9 Jul 2025). Key steps orchestrated entirely by agents included:

Data Acquisition and Parsing: Engineer agents autonomously download and parse ASCII tables with supernova name, redshift, distance modulus, and uncertainties.
Model Specification: A flat ΛCDM model is selected, with the system automatically implementing the relevant formulas:
- Luminosity distance:

$d_L(z) = \frac{c (1+z)}{H_0} \int_0^z \frac{dz'}{\sqrt{\Omega_m (1+z')^3 + \Omega_\Lambda}}$

- Distance modulus (with $d_L$ expressed in Mpc):

$\mu = 5 \log_{10}(d_L) + 25$

Only $H_0$ and $\Omega_\Lambda$ are varied (imposing $\Omega_m = 1 - \Omega_\Lambda$).
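As a numerical illustration of the two formulas above, the following standard-library sketch evaluates $d_L$ with Simpson's rule and converts it to a distance modulus. It is not the code produced by the agents; the function names and the choice of integrator are assumptions made for this example, with $H_0$ in km/s/Mpc so that $d_L$ comes out in Mpc.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def luminosity_distance(z, h0, omega_lambda, n=200):
    """d_L in Mpc for flat LCDM with Omega_m = 1 - Omega_Lambda.

    The redshift integral is evaluated with Simpson's rule on n (even) intervals.
    """
    omega_m = 1.0 - omega_lambda

    def inv_e(zp):
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_lambda)

    h = z / n
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    integral = total * h / 3.0
    return C_KM_S * (1.0 + z) / h0 * integral


def distance_modulus(z, h0, omega_lambda):
    """mu = 5 log10(d_L / Mpc) + 25, valid for z > 0."""
    return 5.0 * math.log10(luminosity_distance(z, h0, omega_lambda)) + 25.0


for z in (0.1, 0.5, 1.0):
    print(f"z = {z:.1f}: mu = {distance_modulus(z, 70.0, 0.7):.3f}")
```

Two limiting cases are convenient checks: for $\Omega_\Lambda = 1$ the integral equals $z$, and for $\Omega_\Lambda = 0$ it equals $2\,(1 - 1/\sqrt{1+z})$.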

Parameter Estimation: An optimized MCMC sampler, generated by the engineer and verified by the critiquer agent, is executed to sample the posterior:

$P(\theta | \text{data}) \propto \exp\left( -\frac{1}{2} \sum_i \frac{(\mu_i^\text{obs} - \mu^\text{th}(z_i, \theta))^2}{\sigma_i^2} \right)$

with $\theta = (H_0, \Omega_\Lambda)$.
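The posterior above can be sampled with a plain random-walk Metropolis algorithm, sketched below. This is a swapped-in textbook sampler, not the optimized one generated by the engineer agent. The flat prior bounds, step sizes, and chain length are assumptions, and the data are synthetic points drawn around an assumed truth rather than the Union2.1 table.

```python
import math
import random

C_KM_S = 299792.458  # speed of light in km/s


def mu_th(z, theta, n=32):
    """Flat-LCDM distance modulus; midpoint-rule integral, d_L in Mpc."""
    h0, omega_lambda = theta
    omega_m = 1.0 - omega_lambda
    dz = z / n
    integral = sum(
        dz / math.sqrt(omega_m * (1.0 + (i + 0.5) * dz) ** 3 + omega_lambda)
        for i in range(n)
    )
    return 5.0 * math.log10(C_KM_S * (1.0 + z) / h0 * integral) + 25.0


def log_posterior(theta, data):
    """-chi^2 / 2 with flat priors (assumed bounds: 50 < H0 < 100, 0 < Omega_Lambda < 1)."""
    h0, omega_lambda = theta
    if not (50.0 < h0 < 100.0 and 0.0 < omega_lambda < 1.0):
        return -math.inf
    return -0.5 * sum(((mu_obs - mu_th(z, theta)) / sigma) ** 2 for z, mu_obs, sigma in data)


def metropolis(data, start, step_sizes, n_steps, rng):
    """Random-walk Metropolis with independent Gaussian proposals per parameter."""
    theta, logp = start, log_posterior(start, data)
    chain, accepted = [], 0
    for _ in range(n_steps):
        proposal = tuple(t + rng.gauss(0.0, s) for t, s in zip(theta, step_sizes))
        logp_new = log_posterior(proposal, data)
        # accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1]
        if math.log(1.0 - rng.random()) < logp_new - logp:
            theta, logp = proposal, logp_new
            accepted += 1
        chain.append(theta)
    return chain, accepted / n_steps


rng = random.Random(0)
truth = (70.0, 0.7)  # assumed values used only to generate synthetic data
data = []
for i in range(20):
    z = 0.05 + 0.06 * i
    sigma = 0.15
    data.append((z, mu_th(z, truth) + rng.gauss(0.0, sigma), sigma))

chain, acc = metropolis(data, start=(65.0, 0.5), step_sizes=(0.8, 0.05), n_steps=2000, rng=rng)
burned = chain[500:]  # discard burn-in
h0_mean = sum(t[0] for t in burned) / len(burned)
ol_mean = sum(t[1] for t in burned) / len(burned)
print(f"acceptance = {acc:.2f}, <H0> = {h0_mean:.1f}, <Omega_Lambda> = {ol_mean:.2f}")
```

In practice a dedicated sampler with convergence diagnostics would replace this loop; the sketch only makes explicit how the $\chi^2$ sum enters the acceptance rule.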

Result Synthesis and Visualization: The system autonomously produces 2D marginalized posteriors for $(H_0, \Omega_\Lambda)$ and overlays the best-fit theoretical model atop data. All code, plots, and intermediate outputs are locally executed and saved with provenance metadata.

Throughout, no human-in-the-loop intervention is required; all agent transitions, code generations, and critiques occur within the platform. The same architecture extends to other cosmological tasks, e.g., ACT lensing power spectrum analysis (Laverick et al., 30 Nov 2024), with configuration files and likelihood module setup orchestrated by retrieval and engineer agents.

4. Validation, Benchmarking, and Evaluation

The system's efficacy was evaluated using a range of technical and scientific benchmarks:

General Benchmarking: On DS-1000 (pandas, numpy, matplotlib), the Planning & Control strategy improved task success from 66% (One Shot mode) to 78%, demonstrating the advantage of structured multi-agent execution (Xu et al., 9 Jul 2025).
Domain-Specific Performance: On specialized cosmology tasks (e.g., camb usage), context-injection agents outperformed state-of-the-art LLM baselines (gpt-4o, gpt-4.1, gemini-2.5-pro), succeeding at tasks where others failed.
Autonomous Operation: The system can autonomously download datasets, preprocess data, implement physics models, and perform parameter inference, including the production of plots and manuscript-ready outputs, with no manual oversight.
RAG Agent Selection: Systematic evaluation of retrieval-augmented generation pipelines using the CosmoPaperQA benchmark (105 expert-curated QAs) and human/LLM-as-a-Judge evaluation showed that OpenAI-based RAG agents achieve 91.4% accuracy and are deployed in cmbagent (Xu et al., 9 Jul 2025).
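The quoted figures can be unpacked with a few lines of arithmetic. The sketch assumes each CosmoPaperQA question is scored as simply correct or incorrect. The count of 96 correct answers is inferred from the rounded 91.4% and is not stated in the source.

```python
def success_rate(n_correct, n_total):
    """Percentage of tasks or questions answered correctly."""
    return 100.0 * n_correct / n_total


# DS-1000: One Shot versus Planning & Control success rates (percent)
one_shot, planning_control = 66.0, 78.0
absolute_gain = planning_control - one_shot        # in percentage points
relative_gain = 100.0 * absolute_gain / one_shot   # relative to the One Shot baseline
print(f"{absolute_gain:.0f} percentage points, {relative_gain:.1f}% relative")
# 12 percentage points, 18.2% relative

# CosmoPaperQA: 96 of 105 is the only whole-number count that rounds to 91.4%
print(f"{success_rate(96, 105):.1f}%")
# 91.4%
```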

5. Technical Implementation, Codebase, and Accessibility

Cmbagent is implemented upon the autogen/ag2 framework (originating from Microsoft and now maintained under ag2ai) (Laverick et al., 30 Nov 2024). Key technical pillars include:

Agent Configuration: Each agent's parameters (model type, temperature, TopP, system prompts) are controlled via yaml and Python API. Contextual enrichment is leveraged (e.g., camb/class documentation) for precision.
Code Generation & Execution: Engineer agents employ LLM-based code synthesis, with executor agents running code locally and propagating exceptions or outputs for critique. Detailed logs and session histories ensure reproducibility.
Open-Source Distribution: The platform and all associated resources (CosmoPaperQA dataset, evaluation pipelines, documentation) are publicly available on GitHub (https://github.com/CMBAgents/cmbagent) and via PyPI for pip installation. Additional demonstration resources include a HuggingFace Space and YouTube walkthroughs.
Interface and Usability: A GUI is available for non-expert access; containerized deployments (Docker) facilitate secure environment management. All outputs are produced in standardized, citable formats.

A minimal usage example, restricting the session to a single domain agent:

```python
from cmbagent import CMBAgent

# Near-zero temperature and a low top_p make the agent's output close to deterministic.
agent_temperature = {'cosmocnc_agent': 0.000001}
agent_top_p = {'cosmocnc_agent': 0.1}

cmbagent = CMBAgent(
    agent_list=['cosmocnc'],
    verbose=True,
    agent_instructions={},
    agent_temperature=agent_temperature,
    agent_top_p=agent_top_p
)

# The task is given in natural language; the system plans, writes, and runs the code.
task = """Use cosmocnc to write code to compute the unbinned log-likelihood for the "SO_sim_0" catalogue, only for one mass observable, "q_so_sim", for 40 values of "bias_sz", linearly spaced between 0.79 and 0.81. The code must plot the exponential of the log-likelihood, normalising it to one at its highest value, and save the plot as a pdf file and the code as a .py file."""
cmbagent.solve(task)
```
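The executor behaviour described in this section (run generated code locally, then hand outputs or exceptions back for critique) can be approximated with the standard library alone. The helper below is an illustrative sketch, not part of the cmbagent API: `execute_locally` and its return format are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execute_locally(code, timeout=30):
    """Write code to a temporary file, run it with the current interpreter,
    and return its exit status together with captured stdout and stderr."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "step.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout} s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


good = execute_locally("print(2 + 2)")
bad = execute_locally("undefined_name")
print(good["ok"], good["stdout"].strip())
print(bad["ok"], bad["stderr"].strip().splitlines()[-1])
```

Returning the traceback as text rather than raising lets a downstream critique or engineer agent read the error and propose a corrected script.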

6. Limitations, Challenges, and Future Directions

Challenges identified for cmbagent and similar MAS approaches include:

Resource Usage and Efficiency: High token usage and model invocation costs per session necessitate cost tracking and new benchmarks for the operating expenses (OPEX) of agentic systems.
Physics Consistency and Validation: Occasional over-confidence or subtle factual errors remain in LLM responses, highlighting the need for frequent cross-agent critique and robust diagnostic agents.
Scalability and Robustness: Efforts are underway to shift from human-in-the-loop to “zero-player” operation, employing techniques such as multi-agent reinforcement learning (proximal policy optimization).
Domain Adaptation: Fine-tuned, domain-specific LLMs (AstroLLaMA, cosmosage) are to be developed and integrated to further improve task reliability.
Onboarding and Software Management: Extending support for installation, software setup, and error recovery to increase accessibility for less expert users.

Planned research includes scaling to thousands of simultaneous, independent scientific research tasks (leveraging cloud/hybrid deployment), and autonomy in generating, verifying, and synthesizing research manuscripts directly from agentic analyses.

7. Impact and Role in Contemporary Astrophysics

Cmbagent systematizes and automates complex, technical workflows in cosmology, representing a foundational step toward autonomous scientific discovery (Xu et al., 9 Jul 2025, Xu et al., 9 Jul 2025, Laverick et al., 30 Nov 2024). By orchestrating multi-agent LLM-driven planning, retrieval, code synthesis, and result critique, the system not only demonstrates technically robust parameter analyses (e.g., MCMC estimation from supernova and CMB lensing data) but also sets new standards for reproducibility, efficiency, and scalability. Its open-source release and comprehensive benchmarking further facilitate broad adoption and extension within the astrophysics community.

The platform’s integration of advanced agent architectures, proven superior performance on domain-specific tasks compared to state-of-the-art LLMs, and ability to execute all components of quantitative science workflows make it a significant development in the path toward scalable, autonomous scientific research infrastructure.