TracerTraj: Multi-Agent Failure Attribution
- TracerTraj is a large-scale, high-fidelity corpus designed for precise agent-level and step-level error attribution in complex LLM-driven systems.
- It uses counterfactual replay and controlled fault injection to identify decisive correction points, significantly enhancing automated debugging and reliability.
- TracerTraj underpins advanced monitoring methods, including the TRACE system's adaptive TIJ architecture, to detect covert sabotage in multi-agent workflows.
TracerTraj is a large-scale, high-fidelity, automatically annotated corpus constructed to support rigorous training and evaluation of models for failure attribution and covert sabotage detection within multi-agent, LLM-driven agentic systems. It provides explicit supervisory signals at the agent and step level, facilitating research on robust diagnosis, accountability, and credit assignment in complex automated workflows. The name TracerTraj has also been adopted for trajectory tracing components within advanced agentic system monitors.
1. Origin and Motivation
LLM agentic systems, which comprise heterogeneous agents, toolchains, and often multi-stage orchestration, offer substantial gains over monolithic paradigms but are more fragile because of their added complexity. In such systems, pinpointing which agent (e.g., “PlannerAgent,” “CoderAgent,” “TesterAgent,” “WebSurferAgent”) and which specific time step precipitated a failure is nontrivial but crucial for meaningful error attribution and automated debugging. Before TracerTraj, state-of-the-art reasoning LLMs achieved less than 10% accuracy on agentic failure localization, which motivated a dedicated dataset for supervised learning and evaluation (Zhang et al., 3 Sep 2025).
2. Construction Methodology
TracerTraj’s construction involves a multi-phase, automated pipeline to obtain ground-truth annotations of failure points in multi-agent trajectories. The process comprises:
- Raw Trajectory Collection: Six representative agentic frameworks are instrumented (MetaGPT, AutoGen, Smolagents, AgentPrune, AFlow, OWL-Workforce) across six task domains (coding: MBPP+, KodCode, Blackjack; mathematical reasoning: MATH, GSM8K; general agentic: GAIA, HotpotQA).
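- Counterfactual Replay for Failures: For each failed trajectory $\tau$, an “analyzer” LLM (DeepSeek-R1) proposes a minimally invasive correction $a_t'$ for each step $t$. A replay operator $\mathcal{R}$ substitutes $a_t'$ at step $t$ and re-simulates the trajectory from step $t$ onward. If the intervention flips the outcome $\Omega$ from failure ($\Omega(\tau) = 0$) to success ($\Omega(\mathcal{R}(\tau, t, a_t')) = 1$), step $t$ is marked “decisive.” The set of such decisive points is:
$$\mathcal{C}(\tau) = \{\, t : \Omega(\mathcal{R}(\tau, t, a_t')) = 1 \,\}$$
The root cause is defined as the earliest such step: $t^* = \min \mathcal{C}(\tau)$.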
- Programmatic Fault Injection for Successes: To diversify the corpus, successful trajectories are synthetically corrupted via action perturbations at random steps, followed by re-simulation. If the injected fault flips the label to failure, the corrupted agent/step is labeled as ground truth.
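The counterfactual replay phase described above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the authors' pipeline: actions are integers added to a running total, success means hitting a hypothetical `TARGET`, the `AGENTS` policies stand in for real LLM agents, and the `proposals` dictionary stands in for the corrections an analyzer LLM would suggest.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    agent: str
    action: int  # toy action: an integer added to a running total


Policy = Callable[[int], int]  # maps the current running total to an action

TARGET = 6  # hypothetical success criterion: actions must sum to TARGET

# Hypothetical three-agent system. CoderAgent is deterministically faulty:
# it contributes 5 where 3 would be needed to reach TARGET.
AGENTS: List[Tuple[str, Policy]] = [
    ("PlannerAgent", lambda total: 2),
    ("CoderAgent", lambda total: 5),
    ("TesterAgent", lambda total: 1),
]


def outcome(traj: List[Step]) -> int:
    """Return 1 on success and 0 on failure."""
    return int(sum(s.action for s in traj) == TARGET)


def replay(traj: List[Step], t: int, corrected: int) -> List[Step]:
    """Keep steps before t, substitute the corrected action at t,
    then re-simulate every later step with the agents' own policies."""
    steps = traj[:t] + [Step(traj[t].agent, corrected)]
    total = sum(s.action for s in steps)
    for name, policy in AGENTS[t + 1:]:
        action = policy(total)
        steps.append(Step(name, action))
        total += action
    return steps


def decisive_steps(traj: List[Step], proposals: Dict[int, int]) -> List[int]:
    """Steps whose proposed correction flips the outcome to success."""
    return [t for t in sorted(proposals)
            if outcome(replay(traj, t, proposals[t])) == 1]


def root_cause(traj: List[Step], proposals: Dict[int, int]) -> Optional[Tuple[str, int]]:
    """Earliest decisive step and the agent that acted at it, if any."""
    steps = decisive_steps(traj, proposals)
    if not steps:
        return None
    t_star = min(steps)
    return traj[t_star].agent, t_star


# A failed trajectory (2 + 5 + 1 = 8, not 6) and toy corrections per step.
failed = [Step("PlannerAgent", 2), Step("CoderAgent", 5), Step("TesterAgent", 1)]
proposals = {0: 2, 1: 3, 2: -1}

print(decisive_steps(failed, proposals))
print(root_cause(failed, proposals))
```

In this toy run, a compensating fix at the last step also rescues the trajectory, so two steps are decisive; taking the minimum attributes the failure to the earlier one, which is the point of the earliest-step rule.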
The resulting dataset forms a unified resource of 2,476 annotated (trajectory, ⟨agent, step⟩) pairs (roughly 2.5K), curated from 4,655 total trajectories (Zhang et al., 3 Sep 2025).
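The fault-injection phase can be sketched in the same spirit. This is a simplified illustration under stated assumptions: trajectories are lists of `(agent, integer action)` pairs, the hypothetical `perturb` callable plays the role of an action perturbation, and later steps are kept fixed instead of being re-simulated as the actual pipeline would do.

```python
import random
from typing import Callable, List, Optional, Tuple

Trajectory = List[Tuple[str, int]]  # (agent name, toy integer action)


def is_success(traj: Trajectory, target: int) -> bool:
    """Toy success criterion: the actions must sum to the target."""
    return sum(action for _, action in traj) == target


def inject_fault(
    traj: Trajectory,
    target: int,
    perturb: Callable[[int], int],
    rng: random.Random,
) -> Optional[dict]:
    """Corrupt one randomly chosen step of a successful trajectory.

    Returns a labeled sample if the fault flips the outcome to failure,
    otherwise None (the corrupted trajectory is discarded).
    """
    if not is_success(traj, target):
        raise ValueError("fault injection expects a successful trajectory")
    t = rng.randrange(len(traj))
    agent, action = traj[t]
    # Simplification: later steps are left unchanged rather than re-simulated.
    corrupted = traj[:t] + [(agent, perturb(action))] + traj[t + 1:]
    if is_success(corrupted, target):
        return None
    return {"trajectory": corrupted, "error_agent": agent, "error_step": t}


success_traj: Trajectory = [("PlannerAgent", 2), ("CoderAgent", 3), ("TesterAgent", 1)]
sample = inject_fault(success_traj, target=6, perturb=lambda a: a + 1,
                      rng=random.Random(0))
print(sample)
```

Because the injected step is known by construction, the label comes for free; the only check required is that the outcome actually flips.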
3. Data Schema and Labeling Taxonomy
Each TracerTraj instance is a JSON object encompassing the user query, full agentic trajectory (step-by-step logs), environment feedback (e.g., tool errors), reference solution, and critical annotation fields:
- "error_agent": the ID of the agent responsible for the root-cause error.
- "error_step": the trajectory step at which correction yields recovery.
Entries are strictly labeled at the agent and step level, with no additional sub-categorization (such as “hallucination” or “role-disobedience”) to preserve clarity of credit assignment.
This design allows downstream models to train and validate precise multi-agent debugging protocols.
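To make the schema concrete, the following sketch builds a purely illustrative entry and runs a basic consistency check on it. Only `error_agent` and `error_step` are field names taken from the description above; every other key, the string agent ID, the content, and the `validate` helper are hypothetical.

```python
import json
from typing import List

# Illustrative entry; not an actual TracerTraj record.
entry = {
    "query": "Write a function that returns the n-th Fibonacci number.",
    "trajectory": [
        {"step": 0, "agent": "PlannerAgent", "content": "Plan: iterate n times."},
        {"step": 1, "agent": "CoderAgent", "content": "def fib(n): return n"},
        {"step": 2, "agent": "TesterAgent", "content": "fib(6) check failed."},
    ],
    "environment_feedback": ["AssertionError: fib(6) returned 6, expected 8"],
    "reference_solution": (
        "def fib(n):\n    a, b = 0, 1\n"
        "    for _ in range(n):\n        a, b = b, a + b\n    return a"
    ),
    "error_agent": "CoderAgent",
    "error_step": 1,
}

REQUIRED_KEYS = ("query", "trajectory", "error_agent", "error_step")


def validate(e: dict) -> List[str]:
    """Return a list of problems found in an entry (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in e]
    if problems:
        return problems
    steps = e["trajectory"]
    t = e["error_step"]
    if not isinstance(t, int) or not 0 <= t < len(steps):
        problems.append("error_step is out of range")
    # Assumption: the labeled agent is the one acting at the labeled step.
    elif steps[t]["agent"] != e["error_agent"]:
        problems.append("error_agent did not act at error_step")
    return problems


# Entries are JSON objects, so they should survive a serialization round trip.
restored = json.loads(json.dumps(entry))
print(validate(restored))
```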
4. Statistical Overview
| Statistic | Total | Coding | Math | Agentic |
|---|---|---|---|---|
| Curated trajectories | 4,655 | 2,170 | 1,185 | 1,300 |
| Annotated error pairs (≈2.5K) | 2,476 | 1,288 | 630 | 558 |
| Test set (9:1 split) | N/A | 147 | 63 | 56 |
| Avg. trajectory length | ~12 steps (range 5–25) | | | |
| Avg. agents per system | 3–6 | | | |
These statistics demonstrate TracerTraj’s breadth across system configurations and operational domains.
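As a quick sanity check on the table, the per-domain counts can be summed and compared with the reported totals. The snippet below uses only the numbers listed above; the ratio it prints (annotated pairs per curated trajectory) is a derived convenience figure, not a statistic reported by the source.

```python
# Per-domain counts copied from the table above.
stats = {
    "Coding": {"trajectories": 2170, "error_pairs": 1288},
    "Math": {"trajectories": 1185, "error_pairs": 630},
    "Agentic": {"trajectories": 1300, "error_pairs": 558},
}

total_trajectories = sum(row["trajectories"] for row in stats.values())
total_pairs = sum(row["error_pairs"] for row in stats.values())

for domain, row in stats.items():
    ratio = row["error_pairs"] / row["trajectories"]
    print(f"{domain}: {ratio:.1%} of curated trajectories carry an annotation")

print(total_trajectories, total_pairs)
```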
5. Benchmarking and Downstream Use
TracerTraj is foundational for training and evaluating failure attribution models. It powers the ATracer pipeline, which employs a Qwen3-8B backbone fine-tuned via multi-granular RL (Group Relative Policy Optimization). The benchmark evaluation, Who&When, utilizes a held-out subset of 127 trajectories, reporting:
- Up to 69.62% agent-level accuracy
- Up to 42.86% step-level accuracy
These results surpass leading proprietary models (GPT-4.1, Claude-4-Sonnet, Gemini-2.5-Pro, DeepSeek-R1) by margins of 12–18%. This demonstrates TracerTraj’s utility as both a rich training resource and a rigorous testbed for credit assignment and agentic system diagnosis (Zhang et al., 3 Sep 2025).
Moreover, the actionable supervision provided by TracerTraj enables integration with, and measurable performance improvement in, off-the-shelf multi-agent stacks (e.g., MetaGPT, MaAS), delivering end-to-end performance gains of 4.8–14.2%.
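The two reported metrics can be computed as follows. This sketch assumes that agent-level accuracy is the fraction of trajectories where the predicted agent matches the label, and that step-level accuracy is the fraction where the predicted step index matches, scored independently of the agent; the source does not spell out the exact scoring rule, and the sample predictions are invented for illustration.

```python
from typing import List, Tuple

Label = Tuple[str, int]  # (agent ID, step index)


def attribution_accuracy(preds: List[Label], golds: List[Label]) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy)."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("need equally sized, non-empty prediction and label lists")
    n = len(golds)
    agent_hits = sum(p[0] == g[0] for p, g in zip(preds, golds))
    step_hits = sum(p[1] == g[1] for p, g in zip(preds, golds))
    return agent_hits / n, step_hits / n


# Invented labels and predictions for four trajectories.
golds = [("CoderAgent", 3), ("PlannerAgent", 0), ("TesterAgent", 7), ("CoderAgent", 5)]
preds = [("CoderAgent", 3), ("PlannerAgent", 1), ("CoderAgent", 7), ("CoderAgent", 2)]

agent_acc, step_acc = attribution_accuracy(preds, golds)
print(f"agent-level: {agent_acc:.2%}, step-level: {step_acc:.2%}")
```

Step-level accuracy is typically the harder target because a model can name the right agent while missing the exact step at which that agent erred.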
6. Role in Trajectory Tracing and Sabotage Detection (TRACE System)
The term “TracerTraj” has also been adopted in the context of monitoring for covert sabotage within LLM-agent trajectories. In TRACE, TracerTraj refers to the core tracing subsystem, notable for its training-free, adaptive Triage-Inspect-Judge (TIJ) architecture (Mittapalli et al., 5 Jun 2026):
- Triage: Score and select up to $K$ “suspect windows” (adjacent or pattern steps) based on LLM-based anomaly detection.
- Inspect: Maintain a cross-step evidence buffer $\mathcal{B}$; perform targeted analyses within each window, accumulating severity/confidence tuples.
- Judge: Aggregate evidence via weighted mean severity; return trajectory-level verdict.
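Mathematical formulation for suspicion score:
$$S(\tau) = \frac{\sum_{j} c_j \, s_j}{\sum_{j} c_j}$$
where $(s_j, c_j)$ are the severity and confidence values accumulated during inspection. The trajectory is flagged as “malicious” iff $S(\tau) > \theta$ for a decision threshold $\theta$.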
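The three stages can be sketched end to end as below. This is a minimal illustration under several assumptions that are not confirmed by the source: per-step anomaly scores are given as plain floats instead of coming from an LLM, windows are built as fixed-radius neighborhoods around the highest-scoring steps, the mean severity is weighted by confidence, and the threshold value is arbitrary. All function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Window = Tuple[int, int]  # inclusive (start, end) step indices


@dataclass
class Evidence:
    window: Window
    severity: float    # assumed to lie in [0, 1]
    confidence: float  # assumed to lie in [0, 1]; used as aggregation weight


Inspector = Callable[[Window, List[Evidence]], Tuple[float, float]]


def triage(step_scores: Sequence[float], radius: int, max_windows: int) -> List[Window]:
    """Select up to max_windows windows centered on the most anomalous steps."""
    n = len(step_scores)
    windows: List[Window] = []
    covered = set()
    for i in sorted(range(n), key=lambda k: -step_scores[k]):
        if len(windows) == max_windows:
            break
        if i in covered:
            continue  # this step already lies inside a selected window
        start, end = max(0, i - radius), min(n - 1, i + radius)
        windows.append((start, end))
        covered.update(range(start, end + 1))
    return sorted(windows)


def inspect_windows(windows: List[Window], inspector: Inspector) -> List[Evidence]:
    """Inspect each window; the shared buffer exposes earlier evidence to later calls."""
    buffer: List[Evidence] = []
    for window in windows:
        severity, confidence = inspector(window, buffer)
        buffer.append(Evidence(window, severity, confidence))
    return buffer


def judge(buffer: List[Evidence], threshold: float) -> Tuple[float, bool]:
    """Confidence-weighted mean severity and the resulting verdict."""
    total = sum(e.confidence for e in buffer)
    score = sum(e.confidence * e.severity for e in buffer) / total if total > 0 else 0.0
    return score, score > threshold


# Toy per-step anomaly scores for a seven-step trajectory.
scores = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.1]


def toy_inspector(window: Window, buffer: List[Evidence]) -> Tuple[float, float]:
    """Stand-in for an LLM inspection: confidence grows with prior evidence."""
    start, end = window
    severity = max(scores[start:end + 1])
    confidence = min(1.0, 0.5 + 0.25 * len(buffer))
    return severity, confidence


windows = triage(scores, radius=1, max_windows=2)
buffer = inspect_windows(windows, toy_inspector)
score, flagged = judge(buffer, threshold=0.5)
print(windows, round(score, 2), flagged)
```

Passing the buffer into each inspection is what lets evidence from an early window raise confidence about a later one, which is the mechanism a long-range sabotage pattern would rely on.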
Empirical results on SHADE-Arena show that TRACE (TracerTraj) achieves strong detection performance and recall, particularly excelling at long-range manipulation/sabotage tasks compared to both full-trajectory and sequential monitors (Mittapalli et al., 5 Jun 2026).
7. Significance and Future Directions
TracerTraj constitutes the first large-scale resource systematically annotated for step-level, agent-level credit assignment within multi-agent LLM systems. Its counterfactual replay and controlled fault injection methodology sets a precedent for high-fidelity, actionable supervision in agentic workflows. This enables not only precise failure localization but also supports advanced architectural paradigms (such as adaptive, cross-step evidence aggregation for sabotage detection) that move beyond i.i.d. or windowed diagnostics.
A plausible implication is that TracerTraj, by anchoring both the training and the benchmarking of agentic error attribution, constitutes an indispensable asset for the development of self-correcting and self-evolving agentic AI. The dataset’s architecture ensures extensibility to new agent types, task distributions, and failure modalities, indicating ongoing relevance as agent frameworks and monitoring protocols advance (Zhang et al., 3 Sep 2025, Mittapalli et al., 5 Jun 2026).