Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures

Published 3 Apr 2026 in cs.SE, cs.AI, and cs.ET | (2604.03515v1)

Abstract: LLM-based coding agents can localize bugs, generate patches, and run tests with diminishing human oversight, yet the scaffolding code that surrounds the LLM (the control loop, tool definitions, state management, and context strategy) remains poorly understood. Existing surveys classify agents by abstract capabilities (tool use, planning, reflection) that cannot distinguish between architecturally distinct systems, and trajectory studies observe what agents do without examining the scaffold code that determines why. This paper presents a source-code-level architectural taxonomy derived from analysis of 13 open-source coding agent scaffolds at pinned commit hashes. Each agent is characterized across 12 dimensions organized into three layers: control architecture, tool and environment interface, and resource management. The analysis reveals that scaffold architectures resist discrete classification: control strategies range from fixed pipelines to Monte Carlo Tree Search, tool counts range from 0 to 37, and context compaction spans seven distinct strategies. Five loop primitives (ReAct, generate-test-repair, plan-execute, multi-attempt retry, tree search) function as composable building blocks that agents layer in different combinations; 11 of 13 agents compose multiple primitives rather than relying on a single control structure. Dimensions converge where external constraints dominate (tool capability categories, edit formats, execution isolation) and diverge where open design questions remain (context compaction, state management, multi-model routing). All taxonomic claims are grounded in file paths and line numbers, providing a reusable reference for researchers studying agent behavior and practitioners designing new scaffolds.

Abstract PDF Upgrade to Chat

Authors (1)

Benjamin Rombaut

Summary

The paper introduces a 12-dimensional taxonomy that links source-level architectural choices to variations in agent behavior and performance.
It employs open coding to dissect 13 agents along control, tool interfacing, and resource management dimensions, revealing a design spectrum rather than discrete categories.
The study shows that choices in control loops and context compaction critically influence agent reliability, cost profiles, and overall robustness.

Architectural Taxonomy of Coding Agent Scaffolds: An Expert Review

Overview and Motivation

The study "Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures" (2604.03515) undertakes a systematic, source-level analysis of coding agent scaffolds, offering a granular taxonomy grounded in direct inspection of open-source implementations. Thirteen agents, representative of both research and production-grade tools, are dissected across twelve architectural dimensions, each pinned to specific file paths and commits. The primary objective is to advance beyond conceptual or black-box taxonomies, elucidating precisely how implementation intricacies shape agent behavior, capability, cost profile, and failure modes.

This work identifies persistent gaps in the field's understanding—particularly the confounding effects of architectural scaffolds versus LLM model choices—which prior empirical and survey-based analyses fail to resolve. By formalizing architectural spectra and compositional control primitives, the study sets a foundation for reproducible, architecture-aware agent research and practical design.

Taxonomy Structure and Methodology

The taxonomy aggregates nine initially derived dimensions (eventually splitting to twelve) into three architectural strata:

Control Architecture: Decision-making and orchestration topologies (e.g., control loop, driver type, code-level implementation).
Tool and Environment Interface: Agent-exposed toolsets, invocation protocols, codebase context retrieval, and code execution isolation.
Resource Management: State persistence, context window compaction, model routing, and persistent memory.

The methodology leverages open coding and iterative analysis templates, focusing exclusively on readable, open-source agents with architectural distinctiveness. Each architectural claim is empirically justified via in-repository traces, supporting both analytical rigor and reproducibility.

Key Findings: Spectra, Compositionality, and Architectural Primitives

Spectrum vs. Discrete Classification

Analysis demonstrates that agent scaffolds do not cluster into discrete architectural categories. Instead, every major dimension (e.g., control flow, state management, tool granularity) presents as a spectrum:

Control Loops: Range from static, feedback-free pipelines (Agentless) to full MCTS (Moatless Tools), with most agents layering and composing primitives such as ReAct, generate-test-repair, and plan-execute.
Tool Exposure: Spans from zero LLM-callable tools (Aider) to highly granular action classes (Moatless Tools, 37+), though core capability types (read, search, edit, execute) converge across agents.
Context Compaction: Exhibits seven distinct mechanisms, from simple truncation to LLM-initiated summarization with self-verification (Gemini CLI).

Architectural Compositionality

A major empirical assertion is that most agent scaffolds are not instantiated as monolithic patterns; instead, they are composed of loosely coupled loop primitives and resource strategies, e.g., integrating ReAct at inner phases of pipelines or overlaying tree search atop sequential agents. As a result, architectural evaluation must decompose along dimensions, not agent-level monikers.

Divergence and Convergence

There is conspicuous convergence on dimensions constrained by externalities—edit format (e.g., string replacement tools for code edits), core tool classes, and execution isolation (Docker containers for autonomous SWE-bench agents). Conversely, divergence persists in context management, persistent memory strategies, and multi-model routing—dimensions reflecting open, unsolved problems in agent design.

Numerical Results and Architectural Claims

Control Strategies: Of 13 agents, 7 utilize the ReAct loop as primary control architecture; only 2 agents rely on a single architectural primitive.
Tool Sets: Tool counts vary from 0 (purely user-driven) to 37 (fully autonomous, highly granular command sets), yet all LLM-driven agents converge on at least four capability classes.
Context Management: Only one agent (mini-swe-agent) employs no context compaction, resulting in agent crashes on overflow—demonstrating compaction as mandatory for sustained autonomy.
Persistent Memory: Five CLI-focused agents maintain persistent memory, while SWE-bench/batch agents typically eschew cross-session learning. Notably, most agents with persistent memory employ LLM-involved rule writing or session-level SQLite/JSONL persistence.
Control Flow: Only one agent (Cline) employs recursive main loop implementation; open-source SWE-bench agents overwhelmingly use explicit while-loops with stateful event logs.

Theoretical and Practical Implications

Model-Scaffold Interface

The findings strongly substantiate the position that performance, reliability, cost, and behavioral profiles of coding agents are inextricably tied to scaffold-level decisions—orthogonally to LLM capabilities. This undercuts current leaderboard-style benchmarks as they systematically confound variables, and reinforces the need for architecture-controlled, hold-constant ablation studies.

Design Patterns and Open Questions

Loop Driver as Foundational: The demarcation between user-driven, scaffold-sequenced, and LLM-driven loop drivers underpins fundamental agentic capabilities and delegation of navigation/localization.
Toolset Granularity and Action Space: Increased tool specialization improves targeted reasoning but inflates action space, potentially impairing LLM tool selection.
Emergent Sub-agent Delegation: Multiple mechanisms (tool-based, role-based, event-sourced) exist, but lack of dominant patterns indicates an unsettled design area.
Safety and Execution Isolation: The use of platform-level isolation (Docker, Bubblewrap, OS sandbox) is predominant for unattended agents, whereas interactive agents rely more on fine-grained policy engines or direct user supervision; LLM-based call risk assessment (Codex CLI Guardian) is unique.

Future Ecosystem Trajectory

Standardization Prospects: Converging tool capability categories support the development of shared protocols such as the Model Context Protocol (MCP), but compositional control and state management layers are more promising targets for standardization as the field matures.
Ecosystem Maturity: The coexistence of codebase-forking reuse and protocol-driven extension reveals an ecosystem in flux; architectural modularity (as in Moatless Tools' separation of step logic and control strategy) is rare but would enable greater experimentation and stability.

Implications for Future Research and Evaluation

The taxonomy enables the field to move from opaque, leaderboard-driven evaluation to architecture-aware, factor-controlled comparisons. The twelve dimensions identified provide concrete axes for architectural ablation and causal inference studies, facilitating the design of experiments where only one scaffold variable—such as context compaction method or control loop structure—is manipulated while holding others fixed.

Longitudinal studies should track whether divergence on open dimensions continues or stabilizes, and the taxonomy provides a ready framework for extension to proprietary agents and multi-language codebases, pending documentation availability.

Conclusion

This taxonomy decisively advances the field's capacity to characterize, compare, and understand coding agent architectures by foregrounding source-level compositionality and continuous design spectra over abstract capability-based bins. The composable loop primitives identified function as genuine architectural building blocks, and current divergence on state and resource management marks out the frontier of agent design. The practical utility of this work is twofold: it informs research methodology for empirical agent studies and supplies a design map for practitioners developing new scaffolds. Further experimental and longitudinal work leveraging this taxonomy will be essential for the principled development and evaluation of next-generation coding agents.

Markdown Report Issue