Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

Published 29 Apr 2026 in cs.SE and cs.AI | (2604.27209v1)

Abstract: LLMs can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long-horizon follow-ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research-software repositories across two dozen domains. We study A3 in depth, a Python static-analysis tool built entirely within the loop, which reaches F1 = 0.768 on a 90-case benchmark, compared with a next-best baseline of 0.364. Across approximately 400 commits, we find that audit-and-contraction passes dominate the later phases of every successful trajectory.

Authors (2)

Summary

The paper introduces Comet-H, a framework that synchronizes evolving research software artifacts to prevent hallucination accumulation and desynchronization.
It employs a stateful controller with a workspace abstraction spanning six surfaces to iteratively ground and audit theory, code, claims, and evidence.
The evaluation covers a portfolio of 46 repositories and includes an in-depth study of a Python static-analysis tool, which reaches F1 = 0.768 on a 90-case benchmark against a next-best baseline of 0.364.

Orchestrating Evolving Research Software with LLMs: The Comet-H Approach

Introduction and Motivation

This paper introduces Comet-H, an automated orchestration framework for using LLMs in the iterative development of research software where the core specification—encompassing mathematical theory, executable artifacts, empirical benchmarks, and public claims—co-evolves rather than being fixed a priori. Previous work on LLM-driven software engineering predominantly targets well-specified, static tasks and lacks mechanisms for controlling specification, auditability, and synchronization across theory, implementation, and claims as the project unfolds. The authors conceptualize research software development as a co-evolutionary process and identify two distinctive failure modes arising from LLM involvement: hallucination accumulation and desynchronization. Comet-H is designed to explicitly counter these failure modes through architectural and methodological innovations.

Co-Evolution Failure Modes and the Artifact Synchronization Problem

The central technical challenge addressed is the drift between coupled artifact surfaces:

Hallucination Accumulation: Unsupported assertions introduced at any point in the loop (e.g., empirical claims in text or README not reflected in code or data) can propagate and compound over multiple LLM completions, since the LLM cannot disambiguate its prior outputs from grounded facts.
Desynchronization: Theory, code, empirical evidence, and claims diverge, for example when code changes outpace the formal specification or claims continue to reference superseded theory. The result is global incoherence that conventional task-based evaluation pipelines do not observe.

Crucially, the paper advances the view that in research, theory is not a static input but a mutable element, adapting under empirical pressure. This drives the need for orchestration approaches capable of managing the long-term integrity of mutually dependent artifacts as goals and specifications evolve.

The Comet-H Architecture: A Prompt Automaton and Workspace Model

Comet-H operationalizes artifact co-evolution via a stateful controller operating on a workspace abstraction:

The workspace consists of six explicit surfaces: theory (T), repository code (R), public projection (P), evidence surface (E), utility hypothesis (U), and open obligations (Q).
Seventeen prompt "families" are orchestrated by Comet-H in a loop mapped to four phases: seed (problem/theory definition), generation (repository and artifacts), hardening (iterative tightening of code, theory, claims, evidence), and tail (final audits and polish).
Prompt selection is cast as a contextual bandit problem: prompts are arms, the workspace deficit vector is the context, and a linear, hand-weighted function, rather than a learned policy, scores the options. Unfinished obligations are carried forward with a half-life (exponential decay), so recent quality debts keep high priority while long-horizon follow-ups stay bounded.
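
The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the family names, weights, deficit values, and the half-life of 4 steps are all hypothetical, and the additive way the decayed obligations enter the score is an assumption.

```python
from dataclasses import dataclass

# The six workspace surfaces named in the text: theory, repository code,
# public projection, evidence, utility hypothesis, open obligations.
SURFACES = ("T", "R", "P", "E", "U", "Q")


@dataclass
class Obligation:
    family: str    # hypothetical: prompt family that would discharge this follow-up
    weight: float  # urgency when the obligation was recorded
    age: int = 0   # controller steps elapsed since then


def decayed(ob: Obligation, half_life: float) -> float:
    """Exponential decay: the weight halves every `half_life` steps."""
    return ob.weight * 0.5 ** (ob.age / half_life)


def score(family, weights, deficits, obligations, half_life=4.0):
    """Hand-weighted linear score over surface deficits, plus decayed obligations.

    Adding the obligation term to the linear term is an assumption of this sketch.
    """
    linear = sum(weights[family].get(s, 0.0) * deficits.get(s, 0.0) for s in SURFACES)
    carry = sum(decayed(ob, half_life) for ob in obligations if ob.family == family)
    return linear + carry


def select(weights, deficits, obligations, half_life=4.0):
    """Pick the highest-scoring family; ties go to the alphabetically first name."""
    return max(sorted(weights), key=lambda f: score(f, weights, deficits, obligations, half_life))


# Hypothetical weights and deficits, chosen only to illustrate the mechanics.
weights = {
    "implement": {"R": 1.0},
    "ground": {"P": 1.0, "E": 0.5},
    "audit": {"Q": 1.0},
}
deficits = {"R": 0.2, "P": 0.6, "E": 0.4, "Q": 0.1}

stale = [Obligation("audit", 1.0, age=8)]  # two half-lives old: contributes 0.25
fresh = [Obligation("audit", 1.0, age=0)]  # just recorded: contributes 1.0

print(select(weights, deficits, stale))  # ground
print(select(weights, deficits, fresh))  # audit
```

Because every term is a plain product of a fixed weight and an observable deficit, each choice can be explained by printing the per-family scores, which is the legibility property the abstract emphasizes.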

Reactive triggers ensure that any change to public-facing documentation (paper, README) forces immediate grounding and audit, sharply limiting the propagation of potential hallucinations. Adjacency constraints guarantee that expansions in functionality remain within a conceptual single step of existing capability, preventing unconstrained, ungrounded drift.
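
One way to picture the two guards in this paragraph is the Python sketch below. It rests on assumptions: the file names, the prompt-family names, and the reading of "a conceptual single step" as "at most one new capability per expansion" are all hypothetical and are not taken from the paper.

```python
# Hypothetical names for the public-facing documents the trigger watches.
PUBLIC_DOCS = {"README.md", "paper.tex"}


def forced_prompts(changed_files):
    """Return the prompt families forced by a change set, ahead of normal scoring.

    Any edit to a public-facing document forces grounding and audit; other
    edits force nothing, and the regular scorer chooses the next prompt.
    """
    if PUBLIC_DOCS & set(changed_files):
        return ["ground_claims", "audit_docs"]  # hypothetical family names
    return []


def is_adjacent(existing, proposed):
    """Assumed adjacency rule: an expansion may add at most one new capability."""
    return len(set(proposed) - set(existing)) <= 1


print(forced_prompts(["README.md", "src/analyzer.py"]))  # ['ground_claims', 'audit_docs']
print(forced_prompts(["src/analyzer.py"]))               # []
print(is_adjacent({"parse"}, {"parse", "typecheck"}))    # True
print(is_adjacent({"parse"}, {"parse", "typecheck", "fuzz"}))  # False
```

Running the trigger before the scorer means a documentation change cannot be outvoted by other deficits, which is the behavior the text describes as "forces immediate grounding and audit".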

Empirical Evaluation: Portfolio and Numerical Results

The methodology is evaluated on a portfolio of 46 automatically created research-software repositories spanning two dozen domains. The in-depth case study is "a3", a Python static-analysis and bug-finding tool built entirely within the Comet-H loop:

a3 achieved $F_1 = 0.768$ on a 90-case benchmark, compared with a next-best baseline of $0.364$; the gap is attributed to the co-evolutionary coupling of theory and implementation.
Each added orchestration layer (benchmarks, theory, code, grounding) yielded monotonic ablation gains, supporting the claim that the controller's design enables true synergy among artifact surfaces rather than superficial, decorative complexity.
The system was able to reduce hundreds of candidate bug warnings to a handful of confirmed, actionable issues, validating practical value.
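
For readers less familiar with the metric, $F_1$ is the harmonic mean of precision and recall. The Python sketch below shows the arithmetic only; the confusion counts are hypothetical and are not the a3 benchmark's actual counts, which this summary does not report.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion counts.

    tp: true positives, fp: false positives, fn: false negatives.
    F1 = 2*tp / (2*tp + fp + fn), which equals the harmonic mean of
    precision and recall. Undefined ratios are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)  # 0.8 0.8 0.8
```

Because $F_1$ penalizes both false alarms and misses, cutting a large pool of candidate warnings down to confirmed issues raises it only if real bugs are not discarded along the way.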

Across the broader portfolio, late-stage commits were consistently dominated by audit, benchmarking, and claim contraction, confirming that forced grounding and audit prompts are necessary to prevent endless expansion and the compounding of error or unsupported claims.

Observed Model Behaviors and Theoretical Implications

Across approximately 400 commits produced in the orchestrated loop, the authors report several consistent patterns:

Audit Dominance: Verification, grounding, and contraction phases dominate successful developmental trajectories; audit triggers are critical for artifact integrity.
Theory Revision: LMs are capable of substantive theoretical pivots if the controller permits it; rigidity in theory results from workflow constraints, not inherent LM deficiency.
Visible Failure Reporting: The most trustworthy outputs are those that explicitly publish uncertainty and negative results.
Self-Organization: Over long horizons, LMs exhibit emergent modularization and compositionality in artifact structure if provided sufficiently rich development context.

A key theoretical implication is that evaluation of automated research agents should shift from static endpoint metrics to process-oriented criteria: synchronization lag, recoverability after drift, prompt allocation relative to artifact state, and the ability to maintain supportable claims as specifications evolve.
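
As an illustration of what a process-oriented criterion could look like, the Python sketch below computes one possible definition of synchronization lag from a commit history. The definition is an assumption made for this example (the summary names the criterion but does not define it), and the history is invented.

```python
def sync_lag(history, source, dependent):
    """Number of commits for which `dependent` has been stale relative to `source`.

    `history` is a list of sets, one per commit, naming the surfaces touched.
    The lag is 0 when `dependent` was updated in or after the last commit that
    touched `source`; otherwise it counts commits from that change to the end.
    This definition is a hypothetical choice made for this sketch.
    """
    last_source = last_dependent = -1
    for i, touched in enumerate(history):
        if source in touched:
            last_source = i
        if dependent in touched:
            last_dependent = i
    if last_source <= last_dependent:
        return 0
    return len(history) - last_source


# Invented history: R = repository code, P = public projection, E = evidence.
history = [{"R", "P"}, {"R"}, {"E"}]
print(sync_lag(history, source="R", dependent="P"))  # 2
print(sync_lag(history + [{"P"}], source="R", dependent="P"))  # 0
```

A metric of this kind scores the trajectory rather than the endpoint: two runs that finish with identical repositories can differ in how long their public claims trailed the code.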

Comet-H extends beyond program search/optimization agents (FunSearch, AlphaEvolve), repository-patch agents (SWE-Agent, Devin), and staged pipelines (AutoResearchClaw) by making the mutation and coupling of metrics, specifications, and evaluation surfaces a first-class, auditable process. The controller formalism differs from flat finite-state machines, learned policies, and reward-maximizing strategies: it emphasizes transparency and bounded memory, drawing on contextual bandit models while being tailored to nonstationary, multi-surface research settings.

Limitations and Outlook

Limitations arise primarily from the fixed, hand-tuned scoring and orchestration rule and from uneven external expert validation across the full portfolio. While a3 is thoroughly tested, evidence for optimality and generality is domain-dependent and empirical rather than theoretical. The approach may need adaptation for research regimes with significantly different co-evolutionary dynamics or more demanding symbolic state requirements.

Conclusion

Comet-H presents a formal and practical framework for managing the co-evolution of theory, code, evidence, and claims in research software projects guided by LMs, directly confronting the unique failure modes of hallucination accumulation and desynchronization. Empirical evaluation showcases its ability to generate mutually consistent, auditable, and high-performance research software across diverse domains. The architectural mechanisms grounded in prompt selection, decaying obligation memory, and forced audit loops provide a principled foundation for further developments in autonomous research orchestration. Future work will likely involve integrating adaptive or learned scoring strategies while maintaining auditability and refining artifact-surface couplings for broader classes of scientific automation.

Reference: "Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves" (2604.27209)
