
How Much LLM Does a Self-Revising Agent Actually Need?

Published 8 Apr 2026 in cs.AI and cs.CL | (2604.07236v2)

Abstract: Recent LLM-based agents often place world modeling, planning, and reflection inside a single LLM loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent's competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards $\times$ 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism -- with prediction tracking, confidence gating, and guarded revision actions -- even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3\% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31$\rightarrow$29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly.

Authors (3)

Summary

  • The paper presents a declared reflective runtime protocol that decomposes agent functions for precise measurement of LLM contributions.
  • Explicit world-model planning dramatically boosts win rates, illustrating the value of externalized reasoning in agent performance.
  • Sparse LLM revision offers only marginal F1 score improvements while introducing performance tradeoffs, emphasizing the benefit of symbolic control mechanisms.

Empirical Decomposition of LLM, Planning, and Reflection in Self-Revising Agents

Introduction

This work rigorously investigates how much of the competence in LLM-based agents is attributable to the LLM itself versus the explicit symbolic and reflective scaffolding wrapped around it. Recent practice in LLM agents often conflates world modeling, planning, and reflection within a monolithic prompt loop, obscuring the provenance and impact of each component. The paper takes an empirical approach, instantiating a protocol that partitions agent mechanisms into inspectable runtime constructs. This “declared reflective runtime protocol” isolates posterior belief tracking, world-model planning, symbolic reflection, and sparse LLM revision into discrete layers. By evaluating these decomposed variants on noisy Collaborative Battleship, the authors measure each layer’s marginal contribution to agent performance.

Declared Runtime Protocol and Agent Decomposition

The core protocol externalizes key agent mechanisms as explicit, declarative structures, tracked and manipulated outside the LLM, with four principal elements:

  • State: explicitly tracked world and agent state.
  • Signals: deterministic confidence and revision eligibility computations.
  • Guarded Actions: revision and policy patching only when precisely stipulated runtime criteria are met.
  • Hypothetical Transitions: forward-simulated state transitions for planning.
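The four declared elements above can be sketched in code. This is an illustrative sketch only: the paper's runtime is declarative, and all names here (`AgentState`, `confidence_signal`, `GuardedAction`, `hypothetical_step`) are hypothetical stand-ins, not the authors' API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AgentState:
    """Element 1: explicitly tracked world and agent state."""
    posterior: Dict[str, float]              # belief over hypotheses
    turn: int = 0
    history: List[str] = field(default_factory=list)

def confidence_signal(state: AgentState) -> float:
    """Element 2: a deterministic confidence computation --
    here, the probability mass on the most likely hypothesis."""
    return max(state.posterior.values(), default=0.0)

@dataclass
class GuardedAction:
    """Element 3: a revision that fires only when its runtime guard holds."""
    name: str
    guard: Callable[[AgentState], bool]
    apply: Callable[[AgentState], AgentState]

    def maybe_fire(self, state: AgentState) -> AgentState:
        # the guard is an inspectable predicate, not a latent LLM decision
        return self.apply(state) if self.guard(state) else state

def hypothetical_step(state: AgentState, action: str) -> AgentState:
    """Element 4: a forward-simulated transition for planning,
    evaluated on a copy without committing to the real state."""
    return AgentState(dict(state.posterior), state.turn + 1,
                      state.history + [action])
```

Because every element is a plain data structure or pure function, each can be logged, ablated, and measured independently of the LLM.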

Agent instantiations form a sequence of ablated variants:

  • greedy+MCMC: Posterior-based policy with no planning or revision.
  • WMA: Adds explicit world-model planning and question selection.
  • MRA: Adds symbolic in-episode self-revision, realized without invoking any LLM.
  • MRA-LLM: Permits LLM intervention as a conditional, runtime-gated revision mechanism.
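The ablation ladder above can be viewed as a small configuration table; the flag names below are illustrative, not taken from the paper.

```python
# Hypothetical configuration view of the four ablated variants:
# each variant adds exactly one capability to its predecessor.
VARIANTS = {
    "greedy+MCMC": dict(planning=False, reflection=False, llm_revision=False),
    "WMA":         dict(planning=True,  reflection=False, llm_revision=False),
    "MRA":         dict(planning=True,  reflection=True,  llm_revision=False),
    "MRA-LLM":     dict(planning=True,  reflection=True,  llm_revision=True),
}
```

Comparing adjacent rows isolates the marginal effect of one mechanism at a time, which is what enables the per-layer measurements reported below.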

The main architectural distinction is that metacognitive and reflective operations are not latent behaviors emergent from LLM prompting, but declared, inspectable control flows amenable to precise ablation and measurement.

Empirical Findings

World-Model Planning and Interrogative Strategy

Explicit world-model planning, realized in WMA, dramatically outperforms the pure posterior baseline. The win rate increase (+24.1 points, from 50.0% to 74.1%) vastly outpaces the change in average F1 (+0.017), indicating that explicit question selection affects game success more strongly than fine-grained targeting precision. These results support the hypothesis that reasoning over explicit world structure, even with simple policies, is a major driver of agent success.
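The contrast between the two policies can be sketched as follows. This is a minimal illustration under assumed simplifications (a discrete posterior and a known answer table per question); the paper's actual planner and question set are not specified here, and expected information gain is used as one plausible planning criterion.

```python
import math
from typing import Dict, List

def greedy_shot(posterior: Dict[str, float]) -> str:
    """Posterior-following baseline: act on the most probable hypothesis."""
    return max(posterior, key=posterior.get)

def entropy(probs: List[float]) -> float:
    return -sum(p * math.log2(p) for p in probs if p > 0)

def best_question(posterior: Dict[str, float],
                  questions: Dict[str, Dict[str, str]]) -> str:
    """World-model planning sketch: pick the question whose simulated
    answers most reduce posterior entropy (expected information gain).
    `questions[q][hyp]` gives the answer q would receive under hyp."""
    prior_h = entropy(list(posterior.values()))

    def gain(q: str) -> float:
        # partition hypotheses by the answer each one would produce
        by_answer: Dict[str, float] = {}
        for hyp, p in posterior.items():
            ans = questions[q][hyp]
            by_answer[ans] = by_answer.get(ans, 0.0) + p
        # expected posterior entropy after observing the answer
        cond = 0.0
        for ans, mass in by_answer.items():
            cond_p = [p / mass for hyp, p in posterior.items()
                      if questions[q][hyp] == ans]
            cond += mass * entropy(cond_p)
        return prior_h - cond

    return max(questions, key=gain)
```

A question that splits the hypothesis space is selected over one whose answer is already determined, which is the kind of interrogative leverage greedy posterior-following cannot exploit.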

Symbolic Reflection and Self-Revision

Introducing symbolic reflection (MRA) does not uniformly improve performance; average win rate and F1 hold constant or regress marginally. Case analysis, however, shows that symbolic revisions are decisive in specific board/seed configurations (e.g., B17-seed0), while suboptimal policy presets induce harmful revisions elsewhere. The key contribution is methodological: because the reflective mechanisms are isolated as diagnosable runtime structures, failure modes can be identified and revision strategies calibrated directly, rather than through opaque prompt tuning.
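One way such a symbolic reflection mechanism could track its own predictions is sketched below. The window size and trigger rate stand in for the paper's revision presets and are hypothetical values, as is the `PredictionTracker` name.

```python
from collections import deque

class PredictionTracker:
    """Records predicted vs. observed outcomes in-episode and flags a
    revision when the recent miss rate exceeds a preset threshold --
    no LLM call involved."""

    def __init__(self, window: int = 4, trigger_rate: float = 0.5):
        self.window = window
        self.trigger_rate = trigger_rate
        self.hits: deque = deque(maxlen=window)  # True = prediction held

    def record(self, predicted, observed) -> None:
        self.hits.append(predicted == observed)

    def should_revise(self) -> bool:
        # only fire on a full window, so early-episode noise is ignored
        if len(self.hits) < self.window:
            return False
        miss_rate = 1 - sum(self.hits) / len(self.hits)
        return miss_rate >= self.trigger_rate
```

Because the trigger is an explicit threshold rather than a latent prompt behavior, a harmful preset (the aggregate regression noted above) shows up directly in the runtime log and can be recalibrated.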

Sparse LLM Revision

MRA-LLM introduces LLM-based revision as a runtime-instrumented intervention, invoked on about 4.3% of turns under permissive gating. The effect is non-monotonic and tradeoff-prone: average F1 increases slightly (+0.005, to 0.557), but win rate drops (to 53.7%). This indicates that while LLM revision can improve local action quality, it can impair completion efficiency by consuming scarce question/shot resources. Because the reflective protocol is explicitly externalized, these effects are measurable and reproducible, in contrast to prompt-embedded alternatives.
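The gating pattern described here can be sketched as follows. The threshold value, the `llm_propose` callable, and the log format are all illustrative placeholders, not the paper's implementation.

```python
from typing import Callable, List, Optional

def gated_revision(confidence: float,
                   symbolic_action: str,
                   llm_propose: Callable[[], Optional[str]],
                   threshold: float = 0.25,
                   log: Optional[List[str]] = None) -> str:
    """Invoke the LLM only when the deterministic confidence signal
    falls below the gate; otherwise keep the symbolic action.
    Every invocation is recorded, so an invocation rate (like the
    ~4.3% of turns reported) is directly measurable from the log."""
    if confidence >= threshold:
        return symbolic_action            # gate closed: no LLM call
    proposal = llm_propose()              # gate open: sparse intervention
    if log is not None:
        log.append(f"llm_invoked: accepted={proposal is not None}")
    return proposal if proposal is not None else symbolic_action
```

The design choice matters: the LLM never decides when it runs; a deterministic signal does, which is what makes its marginal contribution separable from the symbolic substrate.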

Theoretical and Methodological Implications

The primary scientific impact is the demonstration that symbolic, declaratively implemented layers—explicit planning, runtime-calibrated reflection—can account for the majority of agent efficacy, with the LLM confined to a residual, quantitatively bounded role. This reconfigures the locus of intelligence in LLM agents: the protocol demonstrates that with careful scaffolding, most world modeling, planning, and even reflection can be externalized and directly interrogated, reserving LLM invocation for well-justified, context-specific cases.

Notably, this decomposition allows agent designers to precisely attribute and optimize where LLM interventions are empirically beneficial, moving away from undiagnosable prompt engineering paradigms. The method proposed offers a template for robust agent evaluation that can, in principle, be generalized beyond the Battleship domain to any structured, partially observable environment.

Potential Future Directions

  • Cross-domain Validation: The protocol’s generality allows extension to other domains with hybrid symbolic-LLM agent architectures.
  • Adaptive Calibration: Future work may optimize revision policy parameters dynamically, informed by meta-level learning rather than hand-tuned presets.
  • Broader LLM Invocations: Investigating more granular or strategic LLM invocation schedules could further clarify the boundaries and utility of LLM-driven revision.
  • Integration with Language-grounded Belief Models: Combining this decomposition with language-informed belief updates (e.g., LIPS-style mechanisms) offers a promising avenue for agents operating in more open-ended linguistic environments.

Conclusion

Explicit decomposition of agent mechanisms into belief tracking, world-model-based planning, symbolic reflection, and sparsely gated LLM revision yields a rigorous framework for measuring the marginal contribution of LLMs in self-revising agents. The majority of agent competence in the tested regime is attributable to explicit planning; symbolic reflection operates as a tangible runtime mechanism, though its aggregate benefit is sensitive to calibration. Sparse LLM revision exerts only bounded and non-monotonic influence. These results recommend a design posture: maximize the declared, symbolic substrate and reserve LLMs for empirically justified intervention, instrumented through declarative reflective protocols. This approach shifts the core methodology of LLM-agent research from prompt entanglement to auditable, structure-centric development.
