CodeTracer: Towards Traceable Agent States

Published 13 Apr 2026 in cs.SE and cs.AI | (2604.11641v2)

Abstract: Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, making the agent's state transitions and error propagation hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.

Abstract PDF Upgrade to Chat

Authors (16)

First 10 authors:

Summary

The paper introduces CodeTracer, a scalable framework that decomposes agent execution into hierarchical trace trees for precise failure onset localization.
It employs evolving extraction and tree indexing to standardize diverse agent logs, enhancing traceability and actionable debugging signals.
Empirical analysis on CodeTraceBench reveals systematic failure patterns, highlighting evidence-to-action gaps and efficiency trade-offs in code agents.

CodeTracer: Formal Summary and Expert Analysis

Motivation and Problem Formulation

Code agents powered by frontier LLMs have become increasingly capable in executing complex software engineering tasks, yet their debugging, process transparency, and failure diagnosis remain underdeveloped. The agent orchestration of heterogeneous actions, including parallel tool calls and multi-stage workflows, obfuscates the provenance of failures and error chains. Traditional evaluation, relying primarily on coarse outcome-level metrics, fails to provide actionable insight into how agent trajectories diverge from optimality. Trace analyses of real agent workflows require scalable approaches to systematically localize the onset and propagation of critical errors.

CodeTracer Architecture and Trace Standardization

CodeTracer implements a scalable trajectory tracing framework, decomposing agent logs from heterogeneous execution environments into hierarchical trace trees with persistent and reusable memory. The pipeline consists of three modular stages:

Evolving Extraction: Automatically discovers run directory layouts and adapts parsers for step normalization, enabling artifact standardization across diverse agent frameworks.
Tree Indexing: Converts normalized step sequences into a hierarchical trace tree, distinguishing exploration steps (non-state-changing) from state-changing actions. This structure supports compressed and navigable indexing of execution traces, crucial for context-aware diagnosis.
Figure 1: Overview of the CodeTracer pipeline, from trajectory normalization to hierarchical trace creation and failure onset localization.
Failure Onset Localization: Traverses the trace tree to pinpoint the earliest causally critical error stage and extracts a compact set of failure-relevant steps and supporting evidence. This stage outputs actionable debugging signals designed for reflective replay.
Figure 2: Hierarchical trace tree construction, differentiating exploration and state-changing steps for compressed trace navigation.

CodeTraceBench: Benchmark Construction

To drive systematic evaluation, CodeTraceBench aggregates 4,354 standardized and annotated trajectories spanning five model backbones (including Claude-sonnet-4, GPT-5, DeepSeek-V3.2, Qwen3-Coder-480B, Kimi-K2-Instruct) and four agent frameworks (SWE-Agent, MiniSWE-Agent, OpenHands, Terminus 2) across 236 tasks and 26 categories. Each instance features granular step-level supervision, failure critical stage labels, and incorrect step annotations. Annotation reliability is confirmed with interannotator agreement ( $\kappa = 0.73$ ), attesting to consistent error critical labeling.

Empirical Analysis and Failure Patterns

The empirical study reveals key behavioral and architectural insights:

Task Regime Preference and Masked Failures: Backbones exhibit task-specific reliability, diverging more on regime coverage than aggregate resolved count, yet converge to identical fabrication and stagnation patterns on universally hard categories.
Figure 3: Task categories solved per backbone, evidencing convergent failure patterns on unsolved tasks.
Architectural Complexity vs. Performance: Increasing orchestration and token budget in agent frameworks yields diminishing returns; backbone LLM capability is the primary determinant of success. Lightweight agents (MiniSWE-Agent) are competitive in end-to-end metrics, highlighting the inefficacy of overengineering.
Workflow Stage-Dependent Error Types: Error critical steps cluster predictably by workflow stage—environment and dependency errors are early, mislocalized edits and incorrect hypotheses late—enabling stage-aware guardrails and preemptive interventions.
Figure 4: Distribution of error critical steps across workflow stages for solved and unsolved runs.
Iteration Budget Saturation: Resolved rate curves saturate rapidly (~40 steps), with additional iterations primarily amplifying unproductive loops rather than enabling recovery. Model capability, not budget, sets the success ceiling.

Step-Level Failure Localization: CodeTracer vs. Baselines

On CodeTraceBench, CodeTracer substantially outperforms direct log prompting and lightweight baselines. Macro F1 scores for step-level localization reach 48.02% (GPT-5), 46.57% (Claude-sonnet-4), and 46.14% (DeepSeek-V3.2), with token-efficiency tradeoffs determined by search depth strategy: GPT-5 exhibits early commitment and higher precision, Claude-sonnet-4 exhaustive search and higher recall, DeepSeek-V3.2 a balanced approach.

Evidence-to-Action Gap: Ineffective step fractions double in unsolved trajectories, evidencing a significant agent bottleneck: LLMs frequently gather pertinent evidence but fail to translate this into correct state transitions and hypotheses.
Figure 5: Step budget decomposition per backbone showing evidence-action gap between solved and unsolved runs.
Component Ablation: Hierarchical tree indexing delivers the largest improvement in localization accuracy, validating the centrality of structural trace representation.

Reflective Replay and Recovery

CodeTracer's diagnostic outputs, when injected via reflective replay under matched budgets, reliably recover failed runs, enhancing Pass@1 across all backbones. The diagnosis pass incurs minimal token overhead and isolates the benefit to localization quality rather than increased computation.

Figure 6: Reflective replay improves pass rates on originally failed runs after injecting CodeTracer's diagnostic signals.

Action Efficiency and Industrial Agent Comparison

Analysis of effective action ratios reveals broad distributions irrespective of backbone strength—every model has numerous trajectories with suboptimal action efficiency, accentuating the necessity for early detection of unproductive behavior and targeted intervention.

Figure 7: Effective action ratio histogram for Claude-sonnet-4.

Industrial agents (e.g., Claude Code) feature specialized tooling, aggressive context management, and parallel tool execution, resulting in lower exploration-to-change ratios and higher trajectory efficiency. However, parallelism introduces ordering-sensitivity challenges absent from academic frameworks.

Implications and Future Directions

The study underscores the limits of outcome-only evaluation, advocating for process-level, step-wise diagnostic supervision. CodeTracer structurally enables failure onset localization, supporting both empirical analysis and practical agent improvement via reflective replay. The evidence-action gap and stage-dependent error clustering motivate integrating hierarchical trace analyses and diagnosis signals into agent architectures as dense feedback sources.

Future research should expand trace standardization to cover richer agent toolsets and diverse repository environments, enhance annotation schemas for more granular supervision, and develop online adaptive replay strategies for interactive environments. The localization pipeline could serve as the foundation for reinforcement and offline RL, leveraging per-step deviation signals to close the behavioral gap between industrial and academic agents.

Conclusion

CodeTracer delivers scalable hierarchical tracing and automated failure onset localization for code agent trajectories, validated on CodeTraceBench—a large benchmark of step-level annotated runs across multiple backbones and frameworks. Experimental results reveal systematic failure patterns, actionable evidence-to-action gaps, and agent architecture inefficiencies, with CodeTracer achieving robust gains in step-level localization and effective replay-driven recovery. The framework constitutes a powerful diagnostic tool for structured evaluation and targeted improvement in software-engineering-oriented autonomous agents (2604.11641).

Markdown Report Issue