cmbagent: Multi-Agent LLM for Autonomous Research

Updated 4 July 2026

cmbagent is an open-source multi-agent LLM system that automates quantitative, code-based scientific research by decomposing tasks and executing end-to-end workflows.
It employs a robotics-inspired Planning & Control architecture, using stepwise plan generation and autonomous execution to manage complex scientific analyses.
Validated in cosmology and astrophysics, cmbagent integrates retrieval-augmented synthesis, local code execution, and self-correction for reliable research automation.

cmbagent is an open-source multi-agent LLM system for quantitative scientific research automation, developed initially in cosmology and later used as a general-purpose backend for autonomous scientific discovery workflows. Across the literature, it is described as a system that can plan research tasks, retrieve scientific and software knowledge, write and execute code locally, interpret outputs, critique intermediate results, and terminate only when the plan is completed or execution has irrecoverably failed. Its defining architectural motif is a robotics-inspired Planning & Control strategy in which planning, review, execution, failure recovery, and result synthesis are all agent-driven; in its fully autonomous form, the system operates with no human-in-the-loop once the main task is specified (Xu et al., 9 Jul 2025).

1. Historical development and system identity

cmbagent emerged from earlier work on AI-assisted cosmological analysis in which a multi-agent system combined Retrieval-Augmented Generation (RAG), local code execution, and controlled orchestration through AutoGen/AG2. In that earlier stage, the architecture was explicitly human-in-the-loop: the planner decomposed a task, retrieval agents gathered information from papers and documentation, an engineer wrote code, an executor ran it locally, and an admin mediated every step. The motivating demonstration was cosmological parameter inference for the ACT DR6 CMB lensing likelihood using MCMC, reproduced on a laptop with no human-written code (Laverick et al., 2024).

The later formulation redefined cmbagent as a fully autonomous system for scientific discovery. The 2025 paper presents it as an “Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery”, formed by about 30 LLM agents and intended to automate end-to-end scientific workflows, especially quantitative and code-based ones. The system is explicitly positioned as a step beyond earlier human-supervised multi-agent cosmology workflows because planning and execution are both handled by agents rather than by an approval loop (Xu et al., 9 Jul 2025).

The scope of cmbagent broadened further when it became the “deep-research backend” of Denario, a larger modular scientific assistant with Idea, Literature, Methods, Analysis, Paper, and Review modules. In that architecture, Denario functions as the top-level workflow shell, while cmbagent provides the deliberative execution substrate for multi-step tool use, especially analysis and “Planning + Control” variants of idea and methods generation. This suggests that cmbagent evolved from a cosmology-native research assistant into a reusable orchestration engine for autonomous scientific workflows across multiple disciplines (Villaescusa-Navarro et al., 30 Oct 2025).

2. Planning & Control architecture

The central organizational principle of cmbagent is a two-phase Planning & Control workflow. In the planning phase, the system receives a user-defined main task and produces a stepwise plan; in the control phase, it executes the approved plan subtask by subtask. This strategy is described as inspired by robotics and implemented entirely through language agents, without a human approval loop in the autonomous mode (Xu et al., 9 Jul 2025).

Planning is performed by a planner–reviewer loop. In the autonomous formulation, a plan_setter first selects which agents should participate in the session and writes those selections into context, after which a planner and plan_reviewer iteratively refine the plan. The number of review rounds is a hyperparameter, $n_{\mathrm{reviews}}$, usually set to 1. The papers note that increasing review rounds often makes plans “overly complex and ineffective,” indicating an intentional preference for concise operational decomposition over prolonged meta-deliberation. The resulting plan is capped by $n_{\mathrm{steps}}$ and typically contains 3 to 8 steps in Denario-backed workflows (Villaescusa-Navarro et al., 30 Oct 2025).
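The planner–reviewer loop can be sketched as follows. This is a minimal illustration rather than cmbagent's implementation: the function plan_with_review, the callable signatures, the stub agents, and the default cap of 8 steps are all assumptions made here; only the hyperparameter names n_reviews and n_steps come from the description above.

```python
from typing import Callable, List

# Hypothetical signatures: a planner maps (task, feedback) to a list of steps,
# and a reviewer maps a plan to feedback text. Real agents would be LLM calls.
Planner = Callable[[str, str], List[str]]
Reviewer = Callable[[List[str]], str]


def plan_with_review(task: str, planner: Planner, reviewer: Reviewer,
                     n_reviews: int = 1, n_steps: int = 8) -> List[str]:
    """Draft a plan, refine it n_reviews times, and cap it at n_steps steps."""
    plan = planner(task, "")[:n_steps]
    for _ in range(n_reviews):
        feedback = reviewer(plan)
        plan = planner(task, feedback)[:n_steps]
    return plan


# Stub agents standing in for the planner and plan_reviewer.
def toy_planner(task: str, feedback: str) -> List[str]:
    steps = [f"step {i} for {task}" for i in range(1, 11)]
    return steps[:6] if "shorter" in feedback else steps


def toy_reviewer(plan: List[str]) -> str:
    return "make it shorter" if len(plan) > 6 else "ok"


final_plan = plan_with_review("fit model", toy_planner, toy_reviewer)
unreviewed_plan = plan_with_review("fit model", toy_planner, toy_reviewer, n_reviews=0)
```

With one review round the toy plan shrinks from the capped draft to six steps; with n_reviews=0 the draft is only truncated at n_steps.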

The planner’s output is structured. In the Denario-integrated description, each subtask has exactly three fields:

sub_task
sub_task_agent
bullet_points

In the standalone autonomous formulation, each plan step contains a sub-task, a set of actions or instructions, and the agent assigned to carry it out. Plans are stored in context, saved as JSON, and associated with recorded cost in USD (Xu et al., 9 Jul 2025).
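A plan with the three Denario-style fields can be represented and saved as JSON along the following lines. The PlanStep dataclass and the bullet-point texts are illustrative assumptions; only the field names sub_task, sub_task_agent, and bullet_points are taken from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PlanStep:
    """One plan entry; the class itself is hypothetical, the field names are not."""
    sub_task: str
    sub_task_agent: str
    bullet_points: List[str]


plan = [
    PlanStep("Download and preprocess the supernova data", "engineer",
             ["fetch the dataset", "clean and tabulate it"]),   # illustrative bullets
    PlanStep("Comment on the results", "researcher",
             ["interpret the posterior", "summarize the findings"]),
]

# Serialize the plan to JSON and restore it to check the round trip.
plan_json = json.dumps([asdict(step) for step in plan], indent=2)
restored = [PlanStep(**entry) for entry in json.loads(plan_json)]
```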

Execution is controlled by a controller or control agent. The full plan is injected into its system message, and the agent updates operational state through a record_status function. The context dictionary explicitly tracks the current step, whether it is completed, failed, or in progress, whether new plots or code were produced, whether code execution failed, and which agent should act next. Two termination conditions are explicit: the workflow ends either when the final step succeeds or when failed code executions exceed a user-defined maximum. A terminator agent handles graceful shutdown. The system also imposes a hard cap on message count, with $n_{\mathrm{rounds}} = 500$ by default, to prevent infinite debugging loops (Villaescusa-Navarro et al., 30 Oct 2025).
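The termination logic described above can be sketched as a small control loop. The names ControlContext, run_control, and max_failures are hypothetical, and each call to execute stands in for one round of agent messages; the sketch only mirrors the stated rules: finish when the last step succeeds, stop when failed executions exceed a maximum, and stop at the message cap n_rounds.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ControlContext:
    """Hypothetical stand-in for the state updated through record_status."""
    current_step: int = 0
    status: str = "in progress"        # "in progress", "completed", or "failed"
    n_failed_executions: int = 0
    n_messages: int = 0


def run_control(steps: List[str], execute: Callable[[str], bool],
                max_failures: int = 3, n_rounds: int = 500) -> ControlContext:
    """Execute steps in order until success, too many failures, or the message cap."""
    ctx = ControlContext()
    while ctx.current_step < len(steps):
        if ctx.n_messages >= n_rounds:          # hard cap against endless debugging
            ctx.status = "failed"
            return ctx
        ctx.n_messages += 1
        if execute(steps[ctx.current_step]):
            ctx.current_step += 1               # step succeeded, move on
        else:
            ctx.n_failed_executions += 1
            if ctx.n_failed_executions > max_failures:
                ctx.status = "failed"
                return ctx
    ctx.status = "completed"
    return ctx


ok = run_control(["a", "b", "c"], lambda step: True)
broken = run_control(["a", "b", "c"], lambda step: False)
capped = run_control(["a", "b", "c"], lambda step: True, n_rounds=2)
```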

The main execution roles are stable across papers. The researcher handles reasoning, scientific interpretation, summarization, and the final academic-style writeup of results. The engineer writes Python analysis pipelines, generates plots and statistics, and interacts with code execution loops. Supporting agents include formatting agents, recorder agents, an executor, a post-execution interpreter, an installer for missing packages, and a terminator. Communication and handoffs are implemented with AG2 transition mechanisms, including function calls returning agent targets and native handoff methods (Xu et al., 9 Jul 2025).

3. Retrieval, domain context, and execution environment

cmbagent is not only an orchestration framework; it is also a retrieval-grounded and execution-capable environment. The system uses two main knowledge-injection strategies: full-context injection and RAG over embeddings. Large documentation corpora can be injected directly into system prompts when affordable, while vector-based retrieval is used when full-context inclusion is too expensive or too long. The literature describes context agents specialized for CAMB and CLASS, as well as RAG-based agents for classy_sz and for corpora of scientific papers (Xu et al., 9 Jul 2025).
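The choice between the two knowledge-injection strategies can be illustrated with a toy example. Everything here is an assumption for illustration: build_context is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and bag-of-words cosine similarity stands in for embedding-based retrieval.

```python
import math
from collections import Counter
from typing import List


def _bow(text: str) -> Counter:
    """Bag-of-words vector, a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_context(query: str, chunks: List[str], token_budget: int,
                  top_k: int = 2) -> List[str]:
    """Inject all chunks if they fit the budget, otherwise the top_k most similar."""
    total_tokens = sum(len(c.split()) for c in chunks)    # crude token count
    if total_tokens <= token_budget:
        return list(chunks)                                # full-context injection
    q = _bow(query)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _bow(c)), reverse=True)
    return ranked[:top_k]                                  # retrieval fallback


docs = [
    "how to set cosmological parameters",
    "compute the lensed power spectrum",
    "install the package with pip",
]
everything = build_context("lensed power spectrum", docs, token_budget=1000)
retrieved = build_context("lensed power spectrum", docs, token_budget=5, top_k=1)
```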

The precursor system already established this pattern. It included experiment RAG agents for papers and data releases, community software RAG agents for packages such as CAMB, CLASS, Cobaya, and GetDist, research-software RAG agents for smaller packages such as classy_sz and cosmocnc, and a memory RAG agent storing summaries of past sessions. These agents were implemented as OpenAI assistants with file_search, using text-embedding-3-large for embeddings. The earlier architecture thereby treated scientific workflow automation as a combination of retrieval-grounded synthesis, code generation, and controlled execution (Laverick et al., 2024).

A later dedicated evaluation of astrophysical RAG agents clarified which retrieval stack best supports cmbagent-like systems. On the CosmoPaperQA benchmark of 105 expert-curated cosmology QA pairs, the best-performing configuration used OpenAI retrieval and generation, specifically text-embedding-3-large with GPT-4.1, achieving 91.4% human-evaluated accuracy. The study attributes the strong performance of the OpenAI Assistant configuration to richer retrieval orchestration, including query rewriting, keyword plus semantic search, parallel search, and reranking. The same paper states that these results are intended to inform the literature-understanding component of systems such as cmbagent (Xu et al., 9 Jul 2025).

Local execution is equally central. cmbagent can run code locally, capture stdout and errors, detect missing packages, and invoke an installer agent that performs pip install when dependencies are absent. In the autonomous workflow, engineer-generated code is passed through a formatter, executed locally, and then interpreted by a post-execution interpreter that decides whether to continue, retry, install dependencies, or terminate. A notable systems choice in the Denario analysis backend is that the researcher does not access saved data files directly; instead, the engineer must print all quantitative information needed for interpretation to the console. This creates a text-mediated interface between computation and interpretation rather than a shared binary state (Villaescusa-Navarro et al., 30 Oct 2025).
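The execute-then-interpret step can be sketched with the standard library. The functions run_locally and interpret are hypothetical and much simpler than an LLM-driven interpreter: they run code in a subprocess, capture stdout and stderr, and map the result to one of three actions. The retry budget and the terminate decision are left to the caller, and the missing module name is not always the name of the package to install.

```python
import re
import subprocess
import sys
from typing import Tuple


def run_locally(code: str, timeout: float = 60.0) -> Tuple[int, str, str]:
    """Run code in a fresh interpreter and return (returncode, stdout, stderr)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr


def interpret(returncode: int, stderr: str) -> Tuple[str, str]:
    """Return (action, detail), with action in {'continue', 'install', 'retry'}."""
    if returncode == 0:
        return "continue", ""
    missing = re.search(r"No module named '([^']+)'", stderr)
    if missing:
        return "install", missing.group(1)       # hand over to an installer step
    lines = stderr.strip().splitlines()
    last_line = lines[-1] if lines else ""
    return "retry", last_line                     # feed the error back to the engineer


rc_ok, out_ok, err_ok = run_locally("print(6 * 7)")
rc_missing, _, err_missing = run_locally("import a_module_that_does_not_exist_xyz")
rc_bug, _, err_bug = run_locally("1 / 0")
```

Because only captured text flows back to the caller, the same pattern fits the text-mediated interface described above, in which quantitative results must be printed to the console to be interpreted.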

The system also uses a cost-controlling memory design. During control, shared context includes final code from previous steps, execution outputs, and researcher messages, but after each step the agents and full chat history are reset while a distilled system context is preserved. The authors state that this typically cuts session cost by about a factor of two relative to preserving the full chat history across steps (Xu et al., 9 Jul 2025).
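The reset-and-distill pattern can be sketched as follows. The message shape and the helper distill_step are assumptions made for illustration; the sketch only shows the idea that final code, execution output, and researcher messages survive a step while the full chat history does not.

```python
from typing import Dict, List

Message = Dict[str, str]   # hypothetical shape: {"role": ..., "content": ...}


def distill_step(history: List[Message], final_code: str,
                 execution_output: str) -> Dict[str, str]:
    """Keep only what later steps need from a finished step."""
    notes = [m["content"] for m in history if m["role"] == "researcher"]
    return {
        "final_code": final_code,
        "execution_output": execution_output,
        "researcher_notes": "\n".join(notes),
    }


history = [
    {"role": "engineer", "content": "first attempt, raised an exception"},
    {"role": "engineer", "content": "second attempt, fixed"},
    {"role": "researcher", "content": "The fit converged."},
]
shared_context = [distill_step(history, "print('ok')", "ok")]
history = []   # agents and the full chat history are reset before the next step
```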

4. Scientific workflows and empirical demonstrations

cmbagent’s flagship demonstrations are end-to-end scientific analyses. The 2025 autonomous-scientific-discovery paper reports successful application to a PhD-level cosmology task: fitting a flat $\Lambda$CDM model with free parameters $H_0$ and $\Omega_\Lambda$ to the Union2.1 Type Ia supernova dataset. The approved six-step plan was: download and preprocess the supernova data; implement the flat $\Lambda$CDM model, likelihood, and priors; perform preliminary MCMC timing and optimization; run the full MCMC; generate plots and summary statistics; and comment on the results. The paper states that this task was solved successfully the first time it was run (Xu et al., 9 Jul 2025).
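The model-and-likelihood step of such a plan can be sketched with the textbook flat-universe relations: the comoving distance is $(c/H_0)\int_0^z dz'/E(z')$ with $E(z)=\sqrt{(1-\Omega_\Lambda)(1+z)^3+\Omega_\Lambda}$, the luminosity distance is $(1+z)$ times that, and the distance modulus is $5\log_{10}(d_L/\mathrm{Mpc})+25$. The code below is not the code generated in the paper; it neglects radiation, assumes independent Gaussian errors, and does not treat the supernova absolute-magnitude calibration, and all function names are hypothetical.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance_mpc(z: float, h0: float, omega_lambda: float,
                          n: int = 512) -> float:
    """Flat-universe comoving distance in Mpc by composite Simpson's rule (n even)."""
    omega_m = 1.0 - omega_lambda                 # flatness, radiation neglected

    def inv_e(zp: float) -> float:
        return 1.0 / math.sqrt(omega_m * (1.0 + zp) ** 3 + omega_lambda)

    h = z / n
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * inv_e(i * h)
    return (C_KM_S / h0) * total * h / 3.0


def distance_modulus(z: float, h0: float, omega_lambda: float) -> float:
    d_l = (1.0 + z) * comoving_distance_mpc(z, h0, omega_lambda)  # in Mpc
    return 5.0 * math.log10(d_l) + 25.0


def chi2(params, data) -> float:
    """data is an iterable of (z, observed distance modulus, sigma) tuples."""
    h0, omega_lambda = params
    return sum(((mu - distance_modulus(z, h0, omega_lambda)) / sigma) ** 2
               for z, mu, sigma in data)


def log_likelihood(params, data) -> float:
    return -0.5 * chi2(params, data)             # Gaussian, up to a constant
```

A sampler would then explore log_likelihood plus a log-prior over $(H_0, \Omega_\Lambda)$.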

The same paper reports benchmark-style evaluations. On 14 CAMB problems, the camb context agent using gemini-2.5-pro compared “very favorably” with baseline frontier models queried through the engineer agent, and on Problems 12, 13, and 14 the baselines failed systematically while the context agent achieved much better performance. On a subset of DS-1000 covering pandas, numpy, and matplotlib, Planning & Control improved total success rate from 66% in One Shot mode to 78% (Xu et al., 9 Jul 2025).

The earlier ACT DR6 case study provides a complementary demonstration under human supervision. There, cmbagent reproduced cosmological parameter constraints from ACT DR6 CMB lensing data using Cobaya, classy_sz, and GetDist, running 4 MCMC chains in parallel over 10 CPU cores on a MacBook Pro. The paper reports that the Gelman–Rubin diagnostic reached $R - 1 = 0.01$ and that the resulting contours overlapped almost perfectly with the original collaboration chains, with only statistically insignificant differences attributed to theory precision settings (Laverick et al., 2024).
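For reference, the classic single-parameter Gelman–Rubin statistic can be computed as below. This is the textbook form; the $R - 1$ value reported by a given sampler may use a different, multivariate variant, so the sketch is illustrative rather than a reproduction of the quoted number.

```python
import math
import random
import statistics
from typing import Sequence


def gelman_rubin_minus_one(chains: Sequence[Sequence[float]]) -> float:
    """Classic R - 1 for one parameter, given equal-length chains."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between_over_n = statistics.variance(means)   # variance of the chain means
    var_hat = (n - 1) / n * within + between_over_n
    return math.sqrt(var_hat / within) - 1.0


random.seed(0)
# Four chains sampling the same distribution: R - 1 should be close to zero.
mixed = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# Two chains stuck around different values: R - 1 should be large.
stuck = [[random.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 5.0)]

r_mixed = gelman_rubin_minus_one(mixed)
r_stuck = gelman_rubin_minus_one(stuck)
```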

Within Denario, cmbagent underlies the most scientifically substantive phase of end-to-end paper generation. The analysis module consumes input.md, idea.md, and methods.md, then uses Planning + Control to generate results.md and plots. The paper reports representative operating costs and latencies for these cmbagent-backed workflows: the curated idea-generation mode costs about \$1 and takes about 4 minutes per idea; the curated methods mode costs roughly \$0.50 and also takes around 4 minutes; in analysis, each control step with gemini-2.5-pro or gpt-4.1 as engineer or researcher costs about \$0.30, so a 6-step plan totals around \$2. For simple analyses with less than 1 GB of data, total run time is around 30 minutes on a personal computer. The authors summarize that Denario can generate a paper in around 30 minutes for about \$4, implying that cmbagent performs most of the heavy execution in that budget (Villaescusa-Navarro et al., 30 Oct 2025).

The black hole–stellar mass relation case in Denario is especially revealing. The authors first used a Planning + Control cmbagent session to restructure a dataset of 1000 catalogs into parquet tables and generate a structured textual schema guide. They then launched a second cmbagent session to produce a richer quantitative dataset description for downstream prompt construction. Later Denario runs, using the improved prompt and refined methods, moved from simplistic linear regression to XGBoost, SHAP values, and Huber-loss regression, yielding the conclusion that supernova feedback dominates the black hole–stellar mass relation in low-mass galaxies while AGN feedback dominates in massive ones, with cosmology as a secondary factor. The authors suggest that this result may be genuinely new, while noting that the study does not isolate which improvements came from prompt refinement and which from backend capability (Villaescusa-Navarro et al., 30 Oct 2025).

5. Integration into Denario and multimodal extension

Denario makes cmbagent selectable as an explicit backend rather than hiding it as infrastructure. The Python API exposes this directly through calls such as den.get_idea(mode="fast") and den.get_idea(mode="cmbagent"). Fast modes are typically implemented in LangGraph and used for directed-graph workflows with state passing and lower latency, whereas cmbagent is invoked when Denario needs richer conversational multi-agent patterns, recursive team composition, tool use, conditional control, and planning-and-execution loops (Villaescusa-Navarro et al., 30 Oct 2025).

In this larger system, cmbagent powers more than analysis. It can serve as a proposal-and-critique engine for curated idea generation, where the choreography asks idea_maker to generate five ideas, idea_hater to critique them, idea_maker to improve or select two, idea_hater to critique those two, and then idea_maker to choose the best and report it as a title plus five-sentence description. The methods module uses a related pattern centered on a researcher persona that clarifies hypotheses and assumptions and then writes a roughly 500-word methodology description in markdown (Villaescusa-Navarro et al., 30 Oct 2025).

A later extension incorporated vision-LLM capabilities into cmbagent. That work treats plots as intermediate checkpoints rather than as merely final outputs. When newly executed code produces a figure, a GPT-4o model generates a domain-specific scientific rubric, and a VLM judge then evaluates the plot against that rubric. In correction mode, the specialized loop is Plot Judge $\rightarrow$ Plot Debugger $\rightarrow$ engineering team; in discovery mode, it becomes Plot Scientist $\rightarrow$ Experiment Proposer $\rightarrow$ engineering team. The judged plot is evaluated as a black-box artifact rather than through code inspection alone (Gandhi et al., 18 Nov 2025).
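The correction-mode routing can be sketched with stub agents. The Verdict type, the callables judge, debug, and regenerate, the dictionary standing in for a plot, and the max_passes bound are all hypothetical; a real judge would be a VLM call on the rendered figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Plot = Dict[str, int]   # toy stand-in for a rendered figure


@dataclass
class Verdict:
    passed: bool
    problems: List[str] = field(default_factory=list)


def correction_loop(plot: Plot, rubric: List[str],
                    judge: Callable[[Plot, List[str]], Verdict],
                    debug: Callable[[Plot, List[str]], List[str]],
                    regenerate: Callable[[Plot, List[str]], Plot],
                    max_passes: int = 3) -> Tuple[Plot, List[str]]:
    """Judge the plot; on failure route it to the debugger and back to engineering."""
    trace: List[str] = []
    for _ in range(max_passes):
        verdict = judge(plot, rubric)
        trace.append("plot_judge")
        if verdict.passed:
            break
        hints = debug(plot, verdict.problems)
        trace.append("plot_debugger")
        plot = regenerate(plot, hints)
        trace.append("engineering_team")
    return plot, trace


# Stub agents: the toy plot is correct when the scaling was applied exactly once.
def toy_judge(plot: Plot, rubric: List[str]) -> Verdict:
    ok = plot["scalings_applied"] == 1
    return Verdict(ok, [] if ok else ["amplitudes far too large"])


def toy_debug(plot: Plot, problems: List[str]) -> List[str]:
    return ["remove the duplicate scaling"]


def toy_regenerate(plot: Plot, hints: List[str]) -> Plot:
    return {"scalings_applied": 1}


fixed_plot, route = correction_loop({"scalings_applied": 2},
                                    ["peak amplitudes are physically plausible"],
                                    toy_judge, toy_debug, toy_regenerate)
```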

The benchmark reported in that VLM paper contains 10 tasks spanning oscillators, spectral-line models, epidemiology, and cosmology. Average pass@1 rises from 0.2–0.3 for code-only baselines and 0.4–0.5 for code-and-text variants to 0.7–0.8 for VLM-augmented systems. In the cosmology case study, the system was asked to compute the lensed CMB TT power spectrum and plot

$D_\ell^{TT} = \ell(\ell+1)\,C_\ell^{TT}/(2\pi)$

over a specified multipole range. The initial plot was incorrect because CAMB’s get_cmb_power_spectra() output had already been scaled as $\ell(\ell+1)\,C_\ell/(2\pi)$, and the script applied the same scaling a second time. The VLM-guided correction loop identified the discrepancy through peak positions and amplitudes, traced it to the double-scaling bug, and corrected the figure in one additional pass (Gandhi et al., 18 Nov 2025).
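The size of a double-scaling error can be illustrated without running CAMB. The toy spectrum below is not a CMB spectrum; it is chosen so that the correctly scaled quantity is flat, which makes the spurious factor easy to read off. In CAMB itself, get_cmb_power_spectra() returns the scaled spectra unless raw_cl=True is passed.

```python
import math


def to_d_ell(ell: int, c_ell: float) -> float:
    """Apply the conventional ell(ell+1)/(2 pi) scaling to a raw C_ell."""
    return ell * (ell + 1) * c_ell / (2.0 * math.pi)


ells = [2, 220, 1000]
raw_cl = [1.0 / (ell * (ell + 1)) for ell in ells]              # toy input only
d_ell = [to_d_ell(ell, c) for ell, c in zip(ells, raw_cl)]       # correct: 1/(2 pi)
double_scaled = [to_d_ell(ell, d) for ell, d in zip(ells, d_ell)]  # bug: scaled twice

# The spurious factor equals ell(ell+1)/(2 pi), so it grows roughly as ell squared.
error_factor = [bad / good for bad, good in zip(double_scaled, d_ell)]
```

Because the factor grows with multipole, the bug distorts the shape of the spectrum as well as its amplitude, which is consistent with a judge detecting it from peak positions and amplitudes.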

This suggests that the evolving conception of cmbagent is not limited to text, code, and scalar metrics. In the later literature it becomes a system that can plan, execute, inspect its own visual outputs, judge them against dynamically generated domain criteria, and either self-correct or branch into exploratory analysis (Gandhi et al., 18 Nov 2025).

6. Reliability, failure modes, and open problems

The most systematic reliability study of cmbagent frames its central risk as “plausible but wrong.” Across eighteen astrophysical tasks, the system performs strongly when tasks are well specified and grounded with domain context, but often fails silently when numerical correctness, physical consistency, or parameter identifiability are stressed. The authors evaluate two automated modes: One-Shot and Deep Research (Rawat et al., 28 Apr 2026).

In One-Shot CAMB workflows, domain-specific context is decisive. With CAMB context, the paper reports ESR = 0.96, PAS = 0.95, NAS = 0.86, and a Final Score = 0.85. Without context, performance drops to ESR = 0.62, PAS = 0.54, NAS = 0.18, and Final Score = 0.15. A base LLM baseline, direct GPT-4o-mini, achieves ESR = 0.09 and a Final Score approximately equal to zero. The paper summarizes the context benefit by comparing the two Final Scores, 0.85 versus 0.15, a ratio of about 5.7 (Rawat et al., 28 Apr 2026).

The evaluation formalizes these quantities. For One-Shot mode, execution success is binary, parameter accuracy is given an explicit definition, numerical accuracy combines NRMSE, SMAPE, and Lin’s CCC into a single score, and the final score is given in closed form.

The failure taxonomy distinguishes code failure, wrong parameters, wrong computation, and correct execution (Rawat et al., 28 Apr 2026).
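The three ingredients of the numerical accuracy score have standard definitions, sketched below. These are the textbook forms under stated choices (range normalization for NRMSE, population moments for Lin's CCC); the normalization and the weights used in the paper are not reproduced here and may differ.

```python
import math
from typing import Sequence


def nrmse(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Root-mean-square error normalized by the range of the reference."""
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))
    return rmse / (max(ref) - min(ref))


def smape(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, as a fraction in [0, 2]."""
    total = 0.0
    for p, r in zip(pred, ref):
        denom = abs(p) + abs(r)
        total += 2.0 * abs(p - r) / denom if denom else 0.0
    return total / len(ref)


def lin_ccc(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient using population moments."""
    n = len(ref)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_r = sum((r - mean_r) ** 2 for r in ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref)) / n
    return 2.0 * cov / (var_p + var_r + (mean_p - mean_r) ** 2)


reference = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated but wrong in amplitude
```

The doubled curve shows why concordance is used alongside correlation: its Pearson correlation with the reference is 1, yet Lin's CCC is only 0.4 because the amplitude is wrong.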

The paper’s main warning is that agent scaffolding converts many overt failures into silent scientific errors. In the no-context condition, about 47% of trials fall into Mode C: wrong computation, meaning code executes and outputs appear plausible while the computation is scientifically invalid. Task 10 is emblematic: most trials omit raw_cl=True, which introduces an amplitude error in the returned spectra, and all trials omit the tensor B-mode contribution. Task 14 correctly computes the per-$\ell$ delensing efficiency but then collapses it to a scalar mean, discarding the scale dependence (Rawat et al., 28 Apr 2026).
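The four-way taxonomy lends itself to a simple classifier over per-trial checks. The Outcome enum, the three boolean checks, and the order in which they are applied are assumptions made for this sketch; only the four category names come from the text.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    CODE_FAILURE = "code failure"
    WRONG_PARAMETERS = "wrong parameters"
    WRONG_COMPUTATION = "wrong computation"
    CORRECT = "correct execution"


def classify(executed: bool, params_ok: bool, numbers_ok: bool) -> Outcome:
    """Assign a trial to one category, checking the most overt failure first."""
    if not executed:
        return Outcome.CODE_FAILURE
    if not params_ok:
        return Outcome.WRONG_PARAMETERS
    if not numbers_ok:
        return Outcome.WRONG_COMPUTATION
    return Outcome.CORRECT


def silent_failure_fraction(outcomes: List[Outcome]) -> float:
    """Fraction of trials that ran to completion but are scientifically wrong."""
    silent = {Outcome.WRONG_PARAMETERS, Outcome.WRONG_COMPUTATION}
    return sum(o in silent for o in outcomes) / len(outcomes)


trials = [
    classify(False, False, False),   # crashed
    classify(True, False, False),    # ran with the wrong parameters
    classify(True, True, False),     # ran, right parameters, wrong numbers
    classify(True, True, True),      # fully correct
]
```

Only the first category is visible from the exit status alone; the two silent categories require a reference solution or an independent check to detect.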

Deep Research mode reveals a more serious epistemic problem: weak self-diagnosis. On four Bayesian astrophysical inference tasks, the reported outcomes are T1 SN1a: 4/5 completed, PRS = 0.97; T2 NGC 3198: 5/5 completed, PRS = 0.05; T3 Exoplanets: 5/5 completed, PRS = 0.73; T4 SLACS: 1/5 completed, PRS = 0.00. Across all four tasks, failure transparency fails. The system can return polished posteriors for unconstrained or physically implausible quantities without flagging prior domination, degeneracy, boundary pathologies, inconsistent units, or systematic calibration bias. In the SN1a task, for example, it reports a precise $H_0$ estimate from Union2.1 data despite the unbroken degeneracy between $H_0$ and the supernova absolute magnitude; in the NGC 3198 task it can fit the rotation curve while inferring unphysical NFW concentrations (Rawat et al., 28 Apr 2026).
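Two of the missing flags, prior domination and boundary pile-up, can be approximated with very simple posterior diagnostics. The sketch below is not taken from the paper; the function names and thresholds are arbitrary assumptions, and it applies only to a parameter with a uniform prior on a known interval.

```python
import math
import statistics
from typing import Sequence


def prior_dominated(samples: Sequence[float], prior_low: float,
                    prior_high: float, threshold: float = 0.8) -> bool:
    """Flag a posterior whose width is comparable to that of its uniform prior."""
    prior_std = (prior_high - prior_low) / math.sqrt(12.0)   # std of a uniform prior
    return statistics.pstdev(samples) / prior_std >= threshold


def piles_up_at_boundary(samples: Sequence[float], prior_low: float,
                         prior_high: float, edge: float = 0.05,
                         max_fraction: float = 0.2) -> bool:
    """Flag a posterior with too much mass within `edge` of either prior bound."""
    width = prior_high - prior_low
    near_edge = sum(s < prior_low + edge * width or s > prior_high - edge * width
                    for s in samples)
    return near_edge / len(samples) > max_fraction


unconstrained = [i / 999 for i in range(1000)]                   # fills the prior
constrained = [0.5 + 0.01 * (i - 500) / 500 for i in range(1000)]
railed = [0.99 + 0.01 * i / 999 for i in range(1000)]            # stuck at the top
```

Checks of this kind do not establish that a result is correct, but they turn two of the silent pathologies listed above into explicit warnings.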

Several limitations recur across the cmbagent literature. Loopiness and brittleness motivate hard caps such as $n_{\mathrm{rounds}} = 500$ and bounded retry budgets. Coding fragility arises from deprecated APIs, missing packages, and execution failures. Over-review in planning can produce plans that are too complex. Prompt sensitivity is consistently emphasized: broad or underspecified prompts yield shallow or incorrect analyses, whereas highly detailed prompts with explicit analytical tasks produce materially better results. A particularly severe failure mode described in Denario is that downstream writing modules may receive a plausible narrative even when an upstream implementation failed repeatedly, because the paper-writing stage only sees summarized outputs rather than raw execution evidence (Villaescusa-Navarro et al., 30 Oct 2025).

The stated roadmap focuses on more adaptive planning and stronger checking. The Denario paper argues that fixed plans are unlike real research and proposes future controllers that can reassess the remaining plan after each task, deleting, modifying, or adding steps dynamically. Other directions include more asynchronous execution, more feedback and checking agents, parallelized workflows, tighter coupling between claims-checking and literature agents, and more rigorous benchmarking frameworks (Villaescusa-Navarro et al., 30 Oct 2025).

Taken together, the literature defines cmbagent as a modular scientific execution engine whose main strengths are explicit task decomposition, domain-grounded retrieval, local code execution, bounded self-correction, and breadth across quantitative scientific domains. Its main weaknesses are prompt sensitivity, coding brittleness, silent scientific failure, and limited failure transparency. A plausible implication is that cmbagent is most effective not as a replacement for scientific validation, but as an autonomous research worker whose outputs must still be checked against domain knowledge, physical constraints, and independent reproductions (Rawat et al., 28 Apr 2026).