AgenTracer-8B: LLM Error Attribution

Updated 3 July 2026

AgenTracer-8B is a failure attribution model that precisely identifies critical (agent, step) errors in multi-agent LLM systems using automated trajectory data and reinforcement learning.
It leverages the TracerTraj dataset, built from over 2,000 annotated multi-agent traces via counterfactual replay and fault injection, to cover a wide range of failure modes.
Empirical evaluations show that AgenTracer-8B surpasses leading models, reaching over 69% agent-level and 20% step-level accuracy on the Who & When benchmark and delivering actionable diagnostics for self-correcting workflows.

AgenTracer-8B is a dedicated, lightweight, 8-billion-parameter failure attribution model built to diagnose and localize errors within LLM-based multi-agent systems. These systems, comprising multiple collaborating models, perform substantially more complex tasks than monolithic agents, but at the cost of increased fragility and opaque failure modes. AgenTracer-8B was developed by leveraging automated, precisely labeled trajectory data and reinforcement learning to surpass previous attribution solutions, offering actionable diagnostics and enabling self-correcting LLM agentic workflows (Zhang et al., 3 Sep 2025).

1. Problem Definition and Motivation

LLM-based multi-agent systems orchestrate a sequence of heterogeneous agents and tool invocations, allowing for compositionally complex workflows. However, this structural sophistication amplifies the challenge of diagnosing failures. Pinpointing the decisive failure—identifying both the specific agent ( $i^*$ ) and the step ( $t^*$ ) whose correction would change the outcome from failure to success—defines the agentic system failure attribution problem. Formally, if a trajectory $\tau$ yields outcome $\Omega(\tau)$ ($0$ for failure, $1$ for success), and $\mathcal{R}(\tau, t, a'_t)$ denotes a trajectory where action $a_t$ at step $t$ is replaced by oracle action $a'_t$ , then the set of “corrective actions” is

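$$\mathcal{C}(\tau) = \left\{ (i, t) \;\middle|\; \Omega(\tau) = 0 \,\wedge\, \Omega\big(\mathcal{R}(\tau, t, a'_t)\big) = 1 \right\},$$

where $i$ denotes the agent acting at step $t$. The decisive error is given by the earliest element of this set:

$$(i^*, t^*) = \arg\min_{(i, t) \in \mathcal{C}(\tau)} t.$$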
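
The definition can be made concrete with a small sketch. It assumes a hypothetical `outcome` function standing in for $\Omega$ and a dictionary of oracle actions $a'_t$ that is already available; the toy replay swaps a single action and keeps later steps fixed, whereas a real system would re-execute the downstream steps.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (agent name, action); a deliberately simplified trace step


def decisive_error(
    trajectory: List[Step],
    outcome: Callable[[List[Step]], int],
    oracle_actions: Dict[int, str],
) -> Optional[Tuple[str, int]]:
    """Return (agent, step) for the earliest step whose oracle
    replacement turns a failed trajectory into a successful one."""
    if outcome(trajectory) == 1:
        return None  # successful runs have no decisive error
    for t in sorted(oracle_actions):  # scan steps from earliest to latest
        agent, _ = trajectory[t]
        replayed = trajectory[:t] + [(agent, oracle_actions[t])] + trajectory[t + 1:]
        if outcome(replayed) == 1:
            return agent, t
    return None  # no single-step correction rescues the run


# Toy setting (hypothetical): the run fails whenever a "buggy patch" is present.
def toy_outcome(traj: List[Step]) -> int:
    return 0 if any(action == "buggy patch" for _, action in traj) else 1


toy_trajectory = [("planner", "plan"), ("coder", "buggy patch"), ("tester", "run tests")]
toy_oracle = {0: "refined plan", 1: "fixed patch", 2: "rerun tests"}

print(decisive_error(toy_trajectory, toy_outcome, toy_oracle))
```

Scanning steps in increasing order and returning on the first success is what realizes the "earliest step" requirement.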

Empirical studies have shown that state-of-the-art LLMs (e.g., GPT-4, DeepSeek-R1) achieve less than 10% step-level accuracy on attribution tasks such as the Who & When benchmark. This low performance precludes reliable self-debugging or efficient retraining in practical agentic deployments (Zhang et al., 3 Sep 2025).

2. TracerTraj Dataset Construction

The development of AgenTracer-8B centers on the TracerTraj-2.5K dataset, which comprises over 2,000 multi-agent execution traces annotated with (agent, step) attribution labels. Its construction combines two complementary automated techniques:

Counterfactual Replay: For each failed trajectory from six source frameworks (MetaGPT, AutoGen, AgentPrune, AFlow, OWL-Workforce, Smolagents) across six benchmarks (MBPP+, KodCode, Blackjack, GAIA, MATH, GSM8K), an analyzer LLM (DeepSeek-R1) proposes minimal corrective actions. Each proposed correction is replayed, and the earliest step whose correction flips the trace from failure to success is annotated, yielding the negative set $\mathcal{D}^{-}$.
Programmatic Fault Injection: For successful trajectories, lightweight perturbations (such as flipping a function's return value or corrupting an API call) are injected at randomly chosen steps. If an injection causes failure, the perturbed step and the responsible agent are annotated, yielding the positive set $\mathcal{D}^{+}$.
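
The fault-injection procedure can be sketched in a few lines. The `outcome` and `perturb` callables, the `Step` layout, and the toy success rule below are illustrative assumptions, not the pipeline used to build the dataset.

```python
import random
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str]  # (agent name, action); simplified for illustration


def inject_fault(
    trajectory: List[Step],
    outcome: Callable[[List[Step]], int],
    perturb: Callable[[str], str],
    rng: random.Random,
) -> Optional[Tuple[str, int]]:
    """Perturb one randomly chosen step of a successful trajectory and
    return the (agent, step) label if the perturbation causes failure."""
    if outcome(trajectory) != 1:
        return None  # injection only applies to successful runs
    t = rng.randrange(len(trajectory))
    agent, action = trajectory[t]
    corrupted = trajectory[:t] + [(agent, perturb(action))] + trajectory[t + 1:]
    if outcome(corrupted) == 0:
        return agent, t  # the injected step is the known decisive error
    return None  # the perturbation was harmless, so no label is produced


# Toy setting (hypothetical): any action containing "CORRUPTED" breaks the run.
def toy_outcome(traj: List[Step]) -> int:
    return 0 if any("CORRUPTED" in action for _, action in traj) else 1


toy_success = [("planner", "plan"), ("coder", "patch"), ("tester", "run tests")]
label = inject_fault(toy_success, toy_outcome, lambda a: a + " CORRUPTED", random.Random(0))
print(label)
```

Because the perturbed step is chosen by the procedure itself, the label comes for free whenever the run flips to failure.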

Table: Breakdown of the TracerTraj-2.5K dataset

Domain	#Curated $\tau$	#Annotated (TracerTraj-2.5K)
Coding	2,170	1,288
Math	1,185	630
Agentic	1,300	558

The dataset thus captures a wide spectrum of real, complex multi-agent failure modes, supporting supervised training of specialized tracers.

3. Model Architecture and Input/Output Protocol

AgenTracer-8B is constructed atop the Qwen3-8B backbone. During inference, it consumes a serialized multi-agent trace—encoding agent names, step indices, agent actions, tool invocation logs, and environment feedback—in linearized, interleaved format:

$\Omega(\tau)$ 1
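
As a rough illustration of what a linearized, interleaved trace might look like, the sketch below serializes steps into plain text. The `TraceStep` fields and the line layout are hypothetical; they are not the template AgenTracer-8B was trained on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceStep:
    agent: str          # name of the acting agent
    action: str         # what the agent did or said
    tool_log: str = ""  # optional tool invocation log
    feedback: str = ""  # optional environment feedback


def serialize_trace(steps: List[TraceStep]) -> str:
    """Flatten a multi-agent trace into interleaved text, one block per step."""
    lines: List[str] = []
    for idx, step in enumerate(steps):
        lines.append(f"[step {idx}] agent={step.agent}")
        lines.append(f"  action: {step.action}")
        if step.tool_log:
            lines.append(f"  tool: {step.tool_log}")
        if step.feedback:
            lines.append(f"  feedback: {step.feedback}")
    return "\n".join(lines)


example_text = serialize_trace([
    TraceStep("planner", "split the task into two subtasks"),
    TraceStep("coder", "write solve()", tool_log="python exec: exit code 1",
              feedback="AssertionError in test_2"),
])
print(example_text)
```

Keeping the step index and agent name on the same line makes it straightforward for the model to echo both back in its answer.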

A lightweight prompt template instructs the model to produce both a free-text rationale between <think> ... </think> tags and a structured output naming the agent and step as <answer><agentID>|<stepID></answer>.
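
A downstream consumer needs to pull the rationale and the (agent, step) pair out of the generated text. The parser below is a minimal sketch that assumes the step ID is a non-negative integer and that agent IDs contain neither `|` nor `<`; neither constraint is stated in the source.

```python
import re
from typing import Optional, Tuple

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>\s*([^|<]+?)\s*\|\s*(\d+)\s*</answer>")


def parse_attribution(text: str) -> Optional[Tuple[str, int, str]]:
    """Return (agent_id, step_id, rationale), or None if the answer tag
    is missing or malformed."""
    answer = _ANSWER.search(text)
    if answer is None:
        return None
    think = _THINK.search(text)
    rationale = think.group(1).strip() if think else ""
    return answer.group(1), int(answer.group(2)), rationale


sample_output = (
    "<think>The surfer opened the wrong document.</think>"
    "<answer>WebSurfer|3</answer>"
)
print(parse_attribution(sample_output))
```

Returning `None` for malformed outputs gives callers a simple signal for the format check used during training and evaluation.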

The token embedding layer and transformer architecture are unchanged from Qwen3-8B; fine-tuning is performed exclusively on the TracerTraj-2.5K dataset, using special end-of-trace tokens and prompt engineering to target the attribution output format.

4. Multi-Granular Reinforcement Learning Optimization

Training uses multi-granular reinforcement learning with Group Relative Policy Optimization (GRPO), an online PPO-style algorithm. This method directly optimizes for failure attribution accuracy by sampling candidate (agent, step) pairs $(\hat{i}, \hat{t})$ for each trajectory $\tau$ and computing per-candidate rewards:

$t^*$ 7

where

$\mathbb{I}_{\text{format}}$ is $1$ if the output is formatted correctly,
$r_{\text{agent}} = 1$ if $\hat{i} = i^*$, else $0$,
$\tau$ 3, with $\tau$ 4 and $\tau$ 5.
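
A reward of this multi-granular kind can be sketched as follows. Everything specific here is an assumption made for illustration: the multiplicative format gate, the Gaussian-shaped step-proximity term, and the default `step_weight` and `sigma` values are not taken from the source.

```python
import math


def attribution_reward(
    pred_agent: str,
    pred_step: int,
    gold_agent: str,
    gold_step: int,
    well_formatted: bool,
    step_weight: float = 0.5,  # assumed mixing weight between step and agent terms
    sigma: float = 1.0,        # assumed width of the step-proximity kernel
) -> float:
    """Combine a format gate, an exact-match agent term, and a
    distance-decaying step term into one scalar reward."""
    if not well_formatted:
        return 0.0  # malformed outputs earn nothing
    agent_term = 1.0 if pred_agent == gold_agent else 0.0
    step_term = math.exp(-((pred_step - gold_step) ** 2) / (2 * sigma ** 2))
    return step_weight * step_term + (1.0 - step_weight) * agent_term


print(attribution_reward("coder", 2, "coder", 2, True))
print(attribution_reward("coder", 3, "coder", 2, True))
print(attribution_reward("coder", 2, "coder", 2, False))
```

A smooth step term gives partial credit to near misses, which makes the learning signal denser than an exact-match reward would.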

The PPO-style objective $\mathcal{J}(\theta)$, with dynamically annealed clipping, encourages exploration early in training and exploitation later.
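
The two ingredients named here, group-relative advantages and a clipped surrogate with a shrinking clip range, can be sketched in plain Python. The linear schedule and its `start` and `end` values are hypothetical; the source does not specify how the clipping is annealed.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled candidate's reward against its own group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def annealed_clip(step: int, total_steps: int, start: float = 0.3, end: float = 0.1) -> float:
    """Linearly shrink the clip range from `start` to `end` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def clipped_surrogate(ratio: float, advantage: float, clip: float) -> float:
    """PPO-style per-sample objective: the pessimistic choice between the
    unclipped and clipped policy-ratio terms."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


group_rewards = [1.0, 0.8, 0.0, 0.2]  # rewards of four sampled candidates
advantages = group_relative_advantages(group_rewards)
print(advantages)
print(clipped_surrogate(1.5, advantages[0], annealed_clip(0, 100)))
```

A wider clip range early on permits larger policy updates, and narrowing it later keeps updates conservative once the policy is already good.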

5. Empirical Performance and Evaluation

AgenTracer-8B is evaluated on the Who & When benchmark (both handcrafted and automated), in addition to held-out splits of the TracerTraj-2.5K test set across the coding, math, and agentic domains. Agent-level accuracy measures correct identification of $i^*$; step-level accuracy measures exact localization of $t^*$.
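
Both metrics reduce to simple hit rates over the evaluated trajectories. The sketch below assumes that step-level accuracy is scored on the step index alone; a stricter variant could also require the agent to match.

```python
from typing import List, Tuple

Attribution = Tuple[str, int]  # (agent, step)


def attribution_accuracy(
    preds: List[Attribution], golds: List[Attribution]
) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy)."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equally long")
    agent_hits = sum(p[0] == g[0] for p, g in zip(preds, golds))
    step_hits = sum(p[1] == g[1] for p, g in zip(preds, golds))
    return agent_hits / len(golds), step_hits / len(golds)


example_preds = [("coder", 2), ("coder", 5), ("tester", 1), ("planner", 0)]
example_golds = [("coder", 2), ("coder", 3), ("coder", 4), ("planner", 0)]
print(attribution_accuracy(example_preds, example_golds))
```

Step-level accuracy is the harder target because a prediction one step off counts as a full miss.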

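Benchmark	Model	Agent-Level	Step-Level
Who & When (handcrafted, w/ ground-truth)	Gemini-2.5-Pro	51.72%	9.72%
Who & When (handcrafted, w/ ground-truth)	Claude-Sonnet-4	56.90%	17.24%
Who & When (handcrafted, w/ ground-truth)	AgenTracer-8B	69.10%	20.68%
Coding split (w/o ground-truth)	Gemini-2.5-Pro	66.92%	6.29%
Coding split (w/o ground-truth)	Claude-Sonnet-4	63.78%	11.02%
Coding split (w/o ground-truth)	AgenTracer-8B	72.21%	18.85%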

AgenTracer-8B surpasses major proprietary LLMs in both agent-level and step-level accuracy, with step-level gains of up to 22.68% and agent-level gains of up to 18.18%.

6. Application Scenarios and Downstream Utility

AgenTracer-8B supports downstream integration for actionable, self-correcting loops within LLM agentic workflows. On failed trajectories from frameworks such as MetaGPT and MaAS, repeated cycles of tracing (i.e., identifying $(i^*, t^*)$), formatting feedback, and system-level correction yield cumulative gains up to 14.21% (MaAS+MATH-500) and 4.8% (OWL+GAIA) over three iterations. Classical approaches to self-refinement, such as Self-Refine and CRITIC, degrade performance in these settings.
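
The trace, feedback, and correction cycle can be sketched as a short loop. The `run_system` and `trace_failure` callables and the feedback wording are hypothetical stand-ins for a real multi-agent framework and for a call to the tracer.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (agent name, action)


def self_correct(
    run_system: Callable[[List[str]], Tuple[bool, List[Step]]],
    trace_failure: Callable[[List[Step]], Tuple[str, int]],
    max_rounds: int = 3,
) -> Tuple[bool, int]:
    """Re-run a failing system with attribution feedback, for at most
    `max_rounds` correction rounds. Returns (success, rounds used)."""
    feedback: List[str] = []
    success, trajectory = run_system(feedback)
    rounds = 0
    while not success and rounds < max_rounds:
        agent, step = trace_failure(trajectory)  # identify the decisive error
        feedback.append(
            f"Agent {agent} made the decisive error at step {step}; revise that action."
        )
        success, trajectory = run_system(feedback)  # retry with accumulated feedback
        rounds += 1
    return success, rounds


# Toy stand-ins: the system only succeeds after two pieces of feedback.
def toy_system(feedback: List[str]) -> Tuple[bool, List[Step]]:
    return len(feedback) >= 2, [("planner", "plan"), ("coder", "patch")]


def toy_tracer(trajectory: List[Step]) -> Tuple[str, int]:
    return trajectory[1][0], 1


print(self_correct(toy_system, toy_tracer))
```

Bounding the number of rounds mirrors the three-iteration setting reported above and keeps the loop from running indefinitely on unrecoverable failures.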

A case study demonstrates the tracing of a subtle error by a Web Surfer agent in a multi-agent document analysis pipeline, where AgenTracer-8B accurately attributes the failure to an early, otherwise unobvious step. Competing models either misattribute or provide ambiguous rationales, highlighting AgenTracer-8B’s specificity.

7. Limitations and Prospective Advancements

Although AgenTracer-8B is smaller than many other LLMs, inference on very long traces can introduce several seconds of latency. Windowed attention or retrieval-based mechanisms are proposed to address this bottleneck. Although TracerTraj-2.5K covers six multi-agent frameworks and six tasks, agentic paradigms in deployment may vary significantly; unsupervised domain adaptation and trace clustering are proposed directions for broader generalization. The current use of free-text <think> rationales enables rich explanations but may remain ambiguous for non-expert users; integration of structured, causal explanation formats is identified as a future enhancement (Zhang et al., 3 Sep 2025).

References (1)

Zhang et al. (2025). AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?