AgenTracer-8B: LLM Error Attribution
- AgenTracer-8B is a failure attribution model that precisely identifies critical (agent, step) errors in multi-agent LLM systems using automated trajectory data and reinforcement learning.
- It leverages the TracerTraj dataset, built from over 2,000 annotated multi-agent traces via counterfactual replay and fault injection, to cover a wide range of failure modes.
- Empirical evaluations show that AgenTracer-8B surpasses leading models with over 69% agent-level and 20% step-level accuracy, delivering actionable diagnostics for self-correcting workflows.
AgenTracer-8B is a dedicated, lightweight, 8-billion-parameter failure attribution model built to diagnose and localize errors within LLM-based multi-agent systems. These systems, comprising multiple collaborating models, perform substantially more complex tasks than monolithic agents, but at the cost of increased fragility and opaque failure modes. AgenTracer-8B was developed by leveraging automated, precisely labeled trajectory data and reinforcement learning to surpass previous attribution solutions, offering actionable diagnostics and enabling self-correcting LLM agentic workflows (Zhang et al., 3 Sep 2025).
1. Problem Definition and Motivation
LLM-based multi-agent systems orchestrate a sequence of heterogeneous agents and tool invocations, enabling compositionally complex workflows. This structural sophistication, however, makes failures harder to diagnose. The agentic system failure attribution problem is to pinpoint the decisive failure: the specific agent ($i^*$) and step ($t^*$) whose correction would change the outcome from failure to success. Formally, if a trajectory $\tau$ yields outcome $Z(\tau) \in \{0, 1\}$ ($0$ for failure, $1$ for success), and $\tau^{(t \to a_t^*)}$ denotes a trajectory where the action $a_t$ at step $t$ is replaced by the oracle action $a_t^*$, then the set of “corrective actions” is
$$\mathcal{C}(\tau) = \{ (i, t) : Z(\tau) = 0,\ Z(\tau^{(t \to a_t^*)}) = 1 \},$$
where $i$ is the agent acting at step $t$. The decisive error is given by the earliest such correction:
$$(i^*, t^*) = \arg\min_{(i, t) \in \mathcal{C}(\tau)} t.$$
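The definition above can be sketched as a search over single-step corrections. This is a minimal illustration, not the authors' implementation: the names `decisive_error`, `oracle_actions`, and `outcome` are hypothetical, and the toy `outcome` function scores the patched list directly, whereas a real system would have to re-execute the steps after the replaced action.

```python
from typing import Callable, List, Optional, Tuple

# A step is (agent_name, action); a trajectory is an ordered list of steps.
Step = Tuple[str, str]


def decisive_error(
    trajectory: List[Step],
    oracle_actions: List[str],
    outcome: Callable[[List[Step]], int],
) -> Optional[Tuple[str, int]]:
    """Return the earliest (agent, step) whose correction flips failure to success."""
    if outcome(trajectory) == 1:
        return None  # successful runs have no decisive error
    for t, (agent, _) in enumerate(trajectory):
        # Counterfactual: replace only the action at step t with the oracle action.
        patched = trajectory[:t] + [(agent, oracle_actions[t])] + trajectory[t + 1:]
        if outcome(patched) == 1:
            return agent, t  # steps are scanned in order, so this is the earliest
    return None  # no single-step correction rescues the run


# Toy run: it succeeds if the coder writes the correct loop OR the tester rejects.
trajectory = [("planner", "plan"), ("coder", "off_by_one"), ("tester", "approve")]
oracle_actions = ["plan", "correct_loop", "reject"]


def outcome(traj: List[Step]) -> int:
    return int(traj[1][1] == "correct_loop" or traj[2][1] == "reject")


print(decisive_error(trajectory, oracle_actions, outcome))  # ('coder', 1)
```

Both step 1 and step 2 are corrective in this toy run, and the earlier one is reported as decisive.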
Empirical studies have shown that state-of-the-art LLMs (e.g., GPT-4, DeepSeek-R1) achieve less than 10% step-level accuracy on attribution tasks such as the Who & When benchmark. This low performance precludes reliable self-debugging or efficient retraining in practical agentic deployments (Zhang et al., 3 Sep 2025).
2. TracerTraj Dataset Construction
The development of AgenTracer-8B centers on the TracerTraj-2.5K dataset, which comprises over 2,000 multi-agent execution traces annotated with (agent, step) attribution labels. Its construction combines two complementary automated techniques:
- Counterfactual Replay: For each failed trajectory from six source frameworks (MetaGPT, AutoGen, AgentPrune, AFlow, OWL-Workforce, Smolagents) across six benchmarks (MBPP+, KodCode, Blackjack, GAIA, MATH, GSM8K), an analyzer LLM (DeepSeek-R1) proposes minimal corrective actions. The trace is replayed with each correction, and the earliest step whose correction flips the outcome from failure to success is annotated, yielding the negative set.
- Programmatic Fault Injection: For successful trajectories, lightweight perturbations (such as flipping a function’s return value or corrupting an API call) are injected at randomly chosen steps. If an injection causes failure, the perturbed step and the responsible agent are annotated, yielding the positive set.
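The fault-injection step can be sketched as follows. This is an illustrative toy under stated assumptions: `inject_fault`, `perturb`, and `outcome` are hypothetical names, trajectories are plain lists of (agent, action) pairs, and the outcome is scored on the perturbed list without re-running later steps.

```python
import random
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (agent_name, action)


def inject_fault(
    trajectory: List[Step],
    perturb: Callable[[str], str],
    outcome: Callable[[List[Step]], int],
    rng: random.Random,
) -> Optional[Tuple[List[Step], str, int]]:
    """Perturb one random step of a successful run; label it if the run then fails."""
    if outcome(trajectory) != 1:
        return None  # fault injection only applies to successful runs
    t = rng.randrange(len(trajectory))
    agent, action = trajectory[t]
    corrupted = trajectory[:t] + [(agent, perturb(action))] + trajectory[t + 1:]
    if outcome(corrupted) == 0:
        return corrupted, agent, t  # labeled failure sample
    return None  # the perturbation was harmless, so the sample is discarded


# Toy setup: flipping a return value breaks the run; the planner step is unaffected.
successful_run = [("planner", "plan"), ("coder", "return True")]


def flip_return(action: str) -> str:
    return action.replace("True", "False")


def run_passes(traj: List[Step]) -> int:
    return int(all("False" not in action for _, action in traj))


samples = [inject_fault(successful_run, flip_return, run_passes, random.Random(seed))
           for seed in range(20)]
labeled = [s for s in samples if s is not None]
```

Only perturbations that actually turn a success into a failure produce a label, which is why harmless injections are dropped rather than annotated.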
Table: Breakdown of the TracerTraj-2.5K Dataset
| Domain | # Curated | # Annotated (TracerTraj-2.5K) |
|---|---|---|
| Coding | 2,170 | 1,288 |
| Math | 1,185 | 630 |
| Agentic | 1,300 | 558 |
The dataset thus captures a wide spectrum of real, complex multi-agent failure modes, supporting supervised training of specialized tracers.
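A quick tally of the table confirms the size implied by the dataset's name. The counts are copied from the rows above; the variable names are illustrative.

```python
# (curated, annotated) counts per domain, copied from the table above.
counts = {"Coding": (2170, 1288), "Math": (1185, 630), "Agentic": (1300, 558)}

total_curated = sum(curated for curated, _ in counts.values())        # 4655
total_annotated = sum(annotated for _, annotated in counts.values())  # 2476

for domain, (_, annotated) in counts.items():
    share = annotated / total_annotated
    print(f"{domain}: {annotated} annotated traces ({share:.1%} of the annotated set)")
```

The 2,476 annotated traces round to the 2.5K in the dataset name, with coding accounting for roughly half of them.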
3. Model Architecture and Input/Output Protocol
AgenTracer-8B is built on the Qwen3-8B backbone. During inference, it consumes a serialized multi-agent trace that encodes agent names, step indices, agent actions, tool invocation logs, and environment feedback in a linearized, interleaved format.
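The exact serialization is not reproduced here, so the following is only a hypothetical layout showing what a linearized, interleaved trace could look like. The `TraceStep` fields and the `[step N] agent=...` line format are assumptions made for illustration, not the model's actual input format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceStep:
    agent: str          # name of the acting agent
    action: str         # what the agent did or said
    tool_log: str = ""  # optional tool invocation log
    feedback: str = ""  # optional environment feedback


def serialize_trace(steps: List[TraceStep]) -> str:
    """Flatten a trace into one text block, one indexed section per step."""
    lines = []
    for idx, step in enumerate(steps):
        lines.append(f"[step {idx}] agent={step.agent}")
        lines.append(f"  action: {step.action}")
        if step.tool_log:
            lines.append(f"  tool: {step.tool_log}")
        if step.feedback:
            lines.append(f"  env: {step.feedback}")
    return "\n".join(lines)


example = serialize_trace([
    TraceStep("planner", "Split the task into search and summarize."),
    TraceStep("web_surfer", "Open the first result.", tool_log="GET /page -> 200"),
    TraceStep("summarizer", "Write the answer.", feedback="answer rejected"),
])
print(example)
```

Keeping the step index and agent name on the same header line makes it easy for an attribution model to quote both in its answer.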
A lightweight prompt template instructs the model to produce both a free-text rationale between <think> ... </think> tags and a structured output naming the agent and step in the form <answer><agentID>|<stepID></answer>.
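A consumer of this output needs to pull the rationale and the (agent, step) pair out of the tagged text. The parser below is a minimal sketch that assumes the step ID is a non-negative integer and the agent ID contains no `|` or angle brackets; `parse_attribution` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Assumed shapes: <think>free text</think> and <answer>agentID|stepID</answer>.
_ANSWER = re.compile(r"<answer>\s*([^|<>]+?)\s*\|\s*(\d+)\s*</answer>", re.DOTALL)
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_attribution(output: str) -> Optional[Tuple[str, int, str]]:
    """Return (agent_id, step_id, rationale), or None if the answer tag is malformed."""
    answer = _ANSWER.search(output)
    if answer is None:
        return None
    think = _THINK.search(output)
    rationale = think.group(1).strip() if think else ""
    return answer.group(1), int(answer.group(2)), rationale


raw = "<think>Step 1 drops a filter.</think><answer>WebSurfer|1</answer>"
print(parse_attribution(raw))  # ('WebSurfer', 1, 'Step 1 drops a filter.')
```

Returning `None` for malformed output gives callers a single place to detect format failures.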
The token embedding layer and transformer parameters are inherited from Qwen3-8B; fine-tuning is performed exclusively on the TracerTraj-2.5K dataset, using special end-of-trace tokens and prompt engineering to target the attribution output format.
4. Multi-Granular Reinforcement Learning Optimization
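Training uses a multi-granular reinforcement learning procedure based on Group Relative Policy Optimization (GRPO), an online policy-optimization method in the PPO family. It optimizes failure attribution accuracy directly by sampling candidate (agent, step) pairs for each trajectory and computing a per-candidate reward:
[reward equation not recoverable from the source]
The symbols were lost in extraction; the surviving term descriptions are:
- a format term, granted if the output is formatted correctly,
- a piecewise term that takes one value if its condition holds and another otherwise,
- a third term with two associated constants.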
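One plausible reading of the three reward terms described above is a format gate, an agent-match score, and a step-proximity score. The sketch below is an assumption-laden illustration, not the paper's formula: the gating and mixing rule, the Gaussian-shaped step score, and the defaults `lam=0.5` and `sigma=1.0` are all hypothetical choices.

```python
import math


def attribution_reward(
    well_formatted: bool,
    pred_agent: str,
    pred_step: int,
    true_agent: str,
    true_step: int,
    lam: float = 0.5,    # hypothetical weight between agent and step terms
    sigma: float = 1.0,  # hypothetical width of the step-proximity score
) -> float:
    """Illustrative multi-granular reward: format gate, agent match, step proximity."""
    if not well_formatted:
        return 0.0  # assumed: malformed outputs earn nothing
    agent_term = 1.0 if pred_agent == true_agent else 0.0
    # Assumed smooth score: 1 at the exact step, decaying with distance from it.
    step_term = math.exp(-((pred_step - true_step) ** 2) / (2 * sigma ** 2))
    return lam * agent_term + (1 - lam) * step_term


exact = attribution_reward(True, "coder", 3, "coder", 3)
near_miss = attribution_reward(True, "coder", 4, "coder", 3)
malformed = attribution_reward(False, "coder", 3, "coder", 3)
```

A graded step score of this kind gives partial credit for near misses, which provides a denser learning signal than exact-match step accuracy alone.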
The PPO-style objective, with dynamically annealed clipping, encourages exploration early in training and exploitation later.
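The interaction between group-relative advantages, the clipped surrogate, and an annealed clip range can be sketched in plain Python. The linear schedule and its endpoints (`eps_start=0.3`, `eps_end=0.1`) are hypothetical, since the actual annealing rule is not given here; the advantage normalization and clipping follow the standard GRPO and PPO forms.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group, as in GRPO."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]


def annealed_clip(step: int, total_steps: int,
                  eps_start: float = 0.3, eps_end: float = 0.1) -> float:
    """Hypothetical linear schedule: wide clip range early, narrow late."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def clipped_surrogate(ratios: List[float], advantages: List[float], eps: float) -> float:
    """PPO-style clipped objective averaged over the sampled candidates."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1 - eps), 1 + eps)
        terms.append(min(rho * adv, clipped * adv))
    return mean(terms)


advantages = group_relative_advantages([1.0, 0.0])
objective = clipped_surrogate([1.5, 1.5], [1.0, -1.0], eps=0.2)
```

A wider clip range permits larger policy updates, which matches the stated intent of exploring early, while the narrower late range keeps updates conservative.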
5. Empirical Performance and Evaluation
AgenTracer-8B is evaluated on the Who & When benchmark (both handcrafted and automated), in addition to held-out splits of the TracerTraj-2.5K test set across the coding, math, and agentic domains. Agent-level accuracy measures correct identification of the responsible agent; step-level accuracy measures exact localization of the decisive step.
| Benchmark | Model | Agent-Level | Step-Level |
|---|---|---|---|
| Who & When (handcrafted, w/ ground truth) | Gemini-2.5-Pro | 51.72% | 9.72% |
| Who & When (handcrafted, w/ ground truth) | Claude-Sonnet-4 | 56.90% | 17.24% |
| Who & When (handcrafted, w/ ground truth) | AgenTracer-8B | 69.10% | 20.68% |
| Coding split (w/o ground truth) | Gemini-2.5-Pro | 66.92% | 6.29% |
| Coding split (w/o ground truth) | Claude-Sonnet-4 | 63.78% | 11.02% |
| Coding split (w/o ground truth) | AgenTracer-8B | 72.21% | 18.85% |
AgenTracer-8B surpasses major proprietary LLMs in both agent-level and step-level accuracy, with step-level gains of up to 22.68% and agent-level gains of up to 18.18%.
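The two metrics can be computed from predicted and labeled (agent, step) pairs as sketched below. This assumes step-level accuracy counts exact matches on the step index; the function name and the sample predictions are illustrative, not data from the evaluation.

```python
from typing import List, Tuple

Label = Tuple[str, int]  # (agent, step)


def attribution_accuracy(preds: List[Label], golds: List[Label]) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy) over paired predictions."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and of equal length")
    agent_hits = sum(p[0] == g[0] for p, g in zip(preds, golds))
    step_hits = sum(p[1] == g[1] for p, g in zip(preds, golds))
    return agent_hits / len(golds), step_hits / len(golds)


# Made-up predictions for four failed traces.
preds = [("coder", 2), ("planner", 0), ("coder", 5), ("tester", 4)]
golds = [("coder", 2), ("coder", 1), ("coder", 4), ("tester", 4)]
print(attribution_accuracy(preds, golds))  # (0.75, 0.5)
```

The third trace shows why step-level accuracy is the harder metric: the agent is right, but the step is off by one and earns no credit.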
6. Application Scenarios and Downstream Utility
AgenTracer-8B can be integrated downstream to drive self-correcting loops in LLM agentic workflows. On failed trajectories from frameworks such as MetaGPT and MaAS, repeated cycles of tracing (i.e., identifying the decisive agent and step), formatting feedback, and system-level correction yield cumulative gains of up to 14.21% (MaAS+MATH-500) and 4.8% (OWL+GAIA) over three iterations. By contrast, classical self-refinement approaches such as Self-Refine and CRITIC degrade performance in these settings.
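The trace, feedback, and correct cycle can be sketched as a small loop. Everything here is a hypothetical stand-in: `run_system` represents re-running the multi-agent system with optional feedback, `trace_failure` represents a call to the attribution model, and the feedback wording is invented for illustration.

```python
from typing import Callable, Optional, Tuple


def self_correct(
    run_system: Callable[[Optional[str]], Tuple[str, bool]],
    trace_failure: Callable[[str], Tuple[str, int, str]],
    max_rounds: int = 3,
) -> Tuple[bool, int]:
    """Re-run a failing system with tracer feedback; return (solved, rounds_used)."""
    feedback: Optional[str] = None
    for round_idx in range(max_rounds + 1):
        trace, success = run_system(feedback)
        if success:
            return True, round_idx
        if round_idx == max_rounds:
            break  # correction budget exhausted
        agent, step, rationale = trace_failure(trace)
        feedback = f"Agent {agent} made the decisive error at step {step}: {rationale}"
    return False, max_rounds


# Toy stand-ins: the system succeeds once the feedback names the coder.
def toy_system(feedback: Optional[str]) -> Tuple[str, bool]:
    fixed = feedback is not None and "coder" in feedback
    return ("trace with fix" if fixed else "trace with bug"), fixed


def toy_tracer(trace: str) -> Tuple[str, int, str]:
    return "coder", 2, "off-by-one loop bound"


print(self_correct(toy_system, toy_tracer))  # (True, 1)
```

Capping the loop at three correction rounds mirrors the three iterations reported above.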
A case study demonstrates the tracing of a subtle error by a Web Surfer agent in a multi-agent document analysis pipeline, where AgenTracer-8B accurately attributes the failure to an early, otherwise unobvious step. Competing models either misattribute or provide ambiguous rationales, highlighting AgenTracer-8B’s specificity.
7. Limitations and Prospective Advancements
Although AgenTracer-8B is small relative to other LLMs, inference on very long traces can introduce several seconds of latency; windowed attention or retrieval-based mechanisms are proposed to address this bottleneck. TracerTraj-2.5K covers six multi-agent frameworks and six benchmarks, but agentic paradigms in deployment may vary significantly, so unsupervised domain adaptation and trace clustering are proposed directions for broader generalization. Finally, the free-text <think> rationales enable rich explanations but may remain ambiguous for non-expert users; structured, causal explanation formats are identified as a future enhancement (Zhang et al., 3 Sep 2025).