- The paper presents a formal error taxonomy categorizing reasoning, planning, and execution failures in agentic systems.
- It introduces the TRAIL dataset with 148 human-annotated execution traces and 841 detailed error annotations.
- Benchmark evaluations show that state-of-the-art LLMs struggle to debug complex agent traces, with the best model reaching only 18% joint accuracy.
Evaluating and debugging complex agentic systems is challenging: they are non-deterministic and multi-step, they interleave calls to external tools, and their behavior hinges on intricate LLM reasoning, all of which manual inspection or simple end-to-end metrics struggle to handle at scale. The growing adoption of agentic workflows in real-world applications such as software engineering and information retrieval calls for more robust, dynamic evaluation methods that give granular insight into system behavior and failures. The paper "TRAIL: Trace Reasoning and Agentic Issue Localization" (2505.08638) addresses this need by introducing a formal taxonomy of agentic errors and a human-annotated dataset for benchmarking how well LLMs can debug structured agent traces.
The core contributions of the paper are:
- Formal Error Taxonomy: A detailed taxonomy categorizing agentic errors across three key areas: Reasoning, Planning and Coordination, and System Execution.
- TRAIL Dataset: A dataset of 148 human-annotated execution traces derived from established agentic benchmarks (GAIA for open-world IR and SWE-Bench for software engineering). These traces are structured using the OpenTelemetry standard and contain 841 unique errors annotated at the span level according to the proposed taxonomy.
- Benchmark Evaluation: Evaluation of state-of-the-art LLMs on the TRAIL dataset to assess their ability to identify and localize errors in complex agent traces, revealing significant limitations of current models for this task.
The proposed error taxonomy provides a structured way to diagnose agent failures (a code sketch of the categories follows the list):
- Reasoning Errors: Related to the LLM's core linguistic and cognitive functions.
  - Hallucinations: Generating factually incorrect or nonsensical content (text-only or tool-related).
  - Information Processing: Issues with retrieving relevant information or misinterpreting retrieved context/tool outputs.
  - Decision Making: Failures in understanding the task or selecting appropriate tools.
  - Output Generation: Problems with formatting outputs correctly or adhering to instructions.
- System Execution Errors: Related to the agent's interaction with its environment and tools.
  - Configuration Issues: Errors due to incorrect environment setup or tool definitions.
  - API and System Issues: Failures when interacting with external APIs or system services (e.g., rate limiting, authentication, service errors, resource not found).
  - Resource Management: Problems like resource exhaustion or infinite loops/timeout issues when using system tools.
- Planning and Coordination Errors: Related to the agent's ability to manage its state and sequence of actions.
  - Context Management: Failing to retain relevant information across turns or abusing resources by repeating tool calls.
  - Task Management: Deviating from the intended goal or issues with orchestrating sub-tasks, especially in multi-agent systems.
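To make the structure concrete, here is a minimal sketch of how the taxonomy could be encoded for programmatic use. The class and constant names are illustrative assumptions, not taken from the paper's released code.

```python
from enum import Enum

class ErrorCategory(str, Enum):
    """Illustrative encoding of the taxonomy's leaf categories."""
    # Reasoning errors
    HALLUCINATION = "hallucination"
    INFORMATION_PROCESSING = "information_processing"
    DECISION_MAKING = "decision_making"
    OUTPUT_GENERATION = "output_generation"
    # System execution errors
    CONFIGURATION = "configuration"
    API_AND_SYSTEM = "api_and_system"
    RESOURCE_MANAGEMENT = "resource_management"
    # Planning and coordination errors
    CONTEXT_MANAGEMENT = "context_management"
    TASK_MANAGEMENT = "task_management"

# Top-level grouping mirroring the taxonomy's three areas.
TAXONOMY = {
    "reasoning": [
        ErrorCategory.HALLUCINATION,
        ErrorCategory.INFORMATION_PROCESSING,
        ErrorCategory.DECISION_MAKING,
        ErrorCategory.OUTPUT_GENERATION,
    ],
    "system_execution": [
        ErrorCategory.CONFIGURATION,
        ErrorCategory.API_AND_SYSTEM,
        ErrorCategory.RESOURCE_MANAGEMENT,
    ],
    "planning_and_coordination": [
        ErrorCategory.CONTEXT_MANAGEMENT,
        ErrorCategory.TASK_MANAGEMENT,
    ],
}
```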
The TRAIL dataset is constructed from traces generated by orchestrating LLM-powered agents on tasks from GAIA and SWE-Bench. For GAIA, a multi-agent system (inspired by Hugging Face OpenDeepResearch) was used, with a manager agent coordinating search agents equipped with web search, page-visit, and file-inspection tools. For SWE-Bench, a single CodeAct agent interacting with a sandboxed environment, a Python interpreter, and the gitingest library was employed to solve GitHub issues. The traces are collected in OpenTelemetry format using the OpenInference standard, reflecting real-world observability practices. Annotators with expertise in software engineering and debugging labeled errors at the span level, providing for each error its category, location, evidence, description, and impact level (Low/Medium/High). The dataset exhibits a wide distribution of errors per trace and across categories: Output Generation errors are the most frequent, while System Execution errors are less common but often high-impact.
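The span-level annotation format can be pictured as a small record type. The dataclass below is a hypothetical rendering of the fields just listed (category, location, evidence, description, impact), not the dataset's actual schema; the example contents are made up for illustration.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ErrorAnnotation:
    """One human-annotated error attached to a span of an agent trace."""
    trace_id: str     # which execution trace the error belongs to
    span_id: str      # location: the span where the error occurs
    category: str     # leaf category from the taxonomy, e.g. "output_generation"
    evidence: str     # excerpt from the span supporting the label
    description: str  # free-text explanation of the failure
    impact: Literal["Low", "Medium", "High"]  # annotated severity

# Hypothetical example annotation:
example = ErrorAnnotation(
    trace_id="gaia-0042",
    span_id="span-17",
    category="api_and_system",
    evidence="HTTP 429 Too Many Requests returned by the search tool",
    description="Search tool call fails due to rate limiting; the agent retries in a loop.",
    impact="High",
)
```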
Evaluations of prominent LLMs (OpenAI's o1, o3, and gpt-4.1, Anthropic's Claude 3.7 Sonnet, Google's Gemini 2.5 Pro/Flash, and Meta's Llama-4 Scout/Maverick) reveal that they perform poorly at trace debugging. The best model, Gemini 2.5 Pro, achieved only 18% joint accuracy (correct category and location) on the GAIA split and 5% on the SWE-Bench split. TRAIL proves to be a non-trivial benchmark because of the long context lengths of the traces, which often exceed model limits, and the robust reasoning it demands; performance is negatively correlated with input trace length. Models with explicit reasoning capabilities generally outperformed non-reasoning ones, and higher reasoning effort levels improved performance for a given model.
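Joint accuracy here means a prediction only counts when both the error category and its location match a gold annotation. A minimal scoring sketch under that assumption, reusing the hypothetical ErrorAnnotation record above, could look like the following; the paper's exact matching rules may differ.

```python
def joint_accuracy(gold, predicted):
    """Fraction of gold errors for which some prediction matches
    both the category and the location (span_id) within the same trace."""
    gold_keys = {(a.trace_id, a.span_id, a.category) for a in gold}
    pred_keys = {(a.trace_id, a.span_id, a.category) for a in predicted}
    if not gold_keys:
        return 0.0
    return len(gold_keys & pred_keys) / len(gold_keys)
```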
Analysis of performance across error categories showed varying degrees of difficulty. Categories like "Context Handling Failures" and "Tool Selection Errors" were particularly challenging for most models, while "Language-Only Hallucinations" and "Formatting Errors" were relatively easier. The low overall performance highlights that current LLMs struggle with the detailed, contextual reasoning required to accurately identify and classify errors within complex agent execution traces, despite advancements in long-context understanding and reasoning.
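This kind of per-category analysis can be reproduced by grouping the same matches by error category. The sketch below, again over the hypothetical annotation records, reports for each category the fraction of gold errors that a judge located with the correct label at the correct span.

```python
from collections import Counter

def per_category_recall(gold, predicted):
    """For each error category, the share of gold errors the judge
    identified with the right category at the right span."""
    pred_keys = {(a.trace_id, a.span_id, a.category) for a in predicted}
    totals, hits = Counter(), Counter()
    for a in gold:
        totals[a.category] += 1
        if (a.trace_id, a.span_id, a.category) in pred_keys:
            hits[a.category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```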
In conclusion, the paper establishes TRAIL as a valuable benchmark and taxonomy for evaluating LLMs as judges for agentic workflows. The results demonstrate a significant gap between the capabilities of current SOTA LLMs and the requirements for scalable, systematic evaluation and debugging of agent traces, particularly given the challenges posed by long contexts and nuanced error types. Future work is needed to develop models and evaluation frameworks better equipped to handle the complexity and structured nature of agent traces, potentially by incorporating multimodal data and addressing the imbalance in error category distribution through synthetic data generation. The public release of the dataset and code aims to foster further research in this critical area for the advancement of reliable agentic systems.