
AgentFail Dataset: Annotated Failure Logs

Updated 21 October 2025
  • AgentFail is a curated dataset comprising 307 annotated failure logs from diverse agentic systems, integrating detailed log traces and metadata.
  • It employs a three-level taxonomy and counterfactual repair validation to systematically identify failures at agent, workflow, and platform levels.
  • The dataset serves as a benchmark for LLM-assisted diagnosis, demonstrating improvements in fault identification accuracy by up to 20 percentage points.

AgentFail is a curated dataset comprising 307 annotated failure logs from ten agentic systems, purpose-built to advance the study of failure root cause identification in platform-orchestrated agentic systems. Such agentic systems, often orchestrated via low-code development environments like Dify and Coze, involve multiple LLM-driven agents structured into complex workflows to tackle tasks including code generation, program repair, question answering, travel planning, and deep research. Despite their power, these orchestrations exhibit fragility and diagnostic opacity, motivating the creation of AgentFail, which systematically couples rich log traces, workflow metadata, and fine-grained cause labels supported by counterfactual validation.

1. Dataset Composition and Structure

AgentFail consists of 307 failure logs selected from real deployments of ten agentic systems operating across diverse domains and workflow architectures—spanning serial, parallel, branching, looping, and hybrid control flow. Each record encapsulates four integral components:

  • Original Query: The user-issued prompt or task request.
  • Execution Trace: The complete, stepwise log of agent-tool interactions and outputs, documenting the trajectory that led to failure.
  • Workflow Configuration: Metadata specifying node orchestration, agent identities, prompt templates, tool assignments, and structural dependencies.
  • Expert Annotations: Fine-grained root cause labels, each explicitly linked to decisive error steps via a counterfactual reasoning methodology.
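The four components above can be sketched as a simple record type. This is an illustrative data structure, not the dataset's actual schema; all field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class FailureLog:
    """One AgentFail-style record (illustrative field names, not the released schema)."""
    query: str            # Original Query: the user-issued prompt or task request
    trace: list[dict]     # Execution Trace: ordered agent-tool interaction steps
    workflow: dict        # Workflow Configuration: nodes, prompts, tools, dependencies
    root_cause: str       # Expert Annotation: fine-grained root cause label
    decisive_step: int = -1  # index of the counterfactually validated error step
```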

Annotation quality is validated with a counterfactual repair strategy: a log step is deemed the decisive error if its replacement by a correct action in the failed trajectory τ alters the terminal outcome from failure (φ(τ) = 1) to success (φ(τ^{E(i,t)}) = 0). Defining the indicator as

$$\Delta_{E(i,t)}(\tau) = \begin{cases} 1, & \text{if } \phi(\tau) = 1 \text{ and } \phi(\tau^{E(i,t)}) = 0 \\ 0, & \text{otherwise} \end{cases}$$

the earliest decisive error satisfies

$$E(i^*, t^*) = \underset{E(i,t) \in C(\tau)}{\arg\min}\; t, \qquad C(\tau) = \{\, E(i,t) \mid \Delta_{E(i,t)}(\tau) = 1 \,\}.$$
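The earliest-decisive-error criterion can be expressed as a short search procedure. This is a minimal sketch: `replay(trace, i)` and `failed(trace)` are assumed stand-ins for the paper's counterfactual replay machinery and the outcome indicator φ:

```python
def earliest_decisive_error(trace, replay, failed):
    """Return the index of the earliest decisive error, or None.

    A step i is decisive if replacing it with a correct action flips the
    outcome from failure to success (Delta_{E(i,t)}(tau) = 1); scanning in
    order realizes the argmin over t.  `replay(trace, i)` re-executes the
    trajectory with step i repaired; `failed(trace)` is phi (True = failure).
    Both callables are assumptions standing in for the replay infrastructure.
    """
    if not failed(trace):
        return None                       # phi(tau) must be 1 to begin with
    for i in range(len(trace)):
        if not failed(replay(trace, i)):  # counterfactual flip: Delta = 1
            return i                      # first flip wins the argmin over t
    return None                           # no single-step repair flips the outcome
```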

2. Taxonomy of Failure Modes

To systematically elucidate error provenance, the AgentFail project introduces a three-level taxonomy:

  • Agent-level Failures: Errors isolated to a single agent, such as knowledge or reasoning limitations, deficiencies in prompt design, and response formatting problems.
  • Workflow-level Failures: Pathologies arising from inter-agent orchestration, including absent input validation, flawed node dependencies, and logical deadlocks in cycles or conditionals.
  • Platform-level Failures: Failures rooted in execution environment or service infrastructure, e.g., network instability, resource contention, or service unavailability.

The taxonomy is grounded in iterative annotation following grounded-theory principles; annotator consensus was reached through independent review and reconciliation, reflected in an improvement in inter-rater reliability (Cohen’s κ rising from 0.85 to 1.0). This structure supports spatial and functional localization of failures, enabling both precise debugging and systemic repair.
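The reported inter-rater agreement uses the standard Cohen's κ statistic, κ = (p_o − p_e) / (1 − p_e), which can be computed directly from two annotators' label sequences:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences.

    p_o is the observed agreement rate; p_e is the agreement expected by
    chance from each annotator's marginal label frequencies.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    if p_e == 1.0:
        return 1.0  # degenerate case: chance agreement is already perfect
    return (p_o - p_e) / (1 - p_e)
```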

3. Evaluation Benchmark and LLM-Assisted Diagnosis

AgentFail provides a benchmark for automatic root cause diagnosis using LLMs, evaluated across several architectures (GPT-4o, LLaMA-3.1-70B, DeepSeek-R1, Qwen3-32B, Gemini-2.5-Pro, Claude-Sonnet-4) and three diagnostic strategies:

  • All-at-once: the complete log and query are provided at once; the model directly identifies the cause.
  • Step-by-step: the log is revealed sequentially; the model stops once it detects the cause.
  • Binary search: the log is split recursively (divide-and-conquer); the cause is localized by recursive search.
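The binary search strategy can be sketched as a standard earliest-true-prefix search. Here `judge(prefix)` is an assumed oracle (in the benchmark, an LLM call) that reports whether the failure already manifests within a log prefix; the sketch assumes the oracle is monotone, i.e., once the failure manifests it stays manifest in longer prefixes:

```python
def binary_search_localize(trace, judge):
    """Divide-and-conquer localization of the failing step.

    Binary-searches for the shortest prefix in which `judge` flags the
    failure, and returns the index of its last step.  `judge` is an assumed
    monotone oracle (e.g., an LLM judging a log prefix); the sketch assumes
    the failure manifests somewhere in the full trace.
    """
    lo, hi = 1, len(trace)
    while lo < hi:
        mid = (lo + hi) // 2
        if judge(trace[:mid]):
            hi = mid          # failure already visible: shrink the prefix
        else:
            lo = mid + 1      # not yet visible: the cause lies further on
    return lo - 1             # index of the step where the failure first appears
```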

A key empirical result is that taxonomy-guided prompting substantially improves LLM diagnostic accuracy. Baseline accuracy without taxonomy ranges from 8.3% to 13.0%, while taxonomy inclusion elevates performance to 24.1%–33.6%, an absolute gain of 15–20 percentage points. Nevertheless, the best configuration reaches only 33.6% accuracy, underscoring the inherent complexity and diagnostic challenges posed by these multi-agent, workflow-centric systems.

4. Counterfactual Repair as Annotation Validation

Annotation reliability is reinforced by a counterfactual repair protocol: after hypothesizing a decisive error, the annotated fault is substituted with an ideal action and the trajectory is replayed. A successful flip from failure to success empirically confirms the root cause label. Analysis via a repair rate confusion matrix reveals strong alignment: diagonal (self-consistent) repairs yield high success rates (90–96% depending on failure type), substantially exceeding off-diagonal attempts. This provides quantitative corroboration of annotation fidelity and taxonomy utility.
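Aggregating repair attempts into such a matrix is straightforward. The tuple format below is an assumption about how attempts might be recorded, not the paper's actual data layout:

```python
from collections import defaultdict

def repair_rate_matrix(attempts):
    """Aggregate counterfactual-repair attempts into a repair-rate matrix.

    `attempts` is an assumed record format: tuples of
    (annotated_type, repair_type, success).  A high rate on the diagonal
    (repairing under the annotated failure type) relative to off-diagonal
    cells corroborates the annotations.
    """
    counts = defaultdict(lambda: [0, 0])  # (annotated, repair) -> [successes, total]
    for annotated, repair, success in attempts:
        cell = counts[(annotated, repair)]
        cell[0] += int(success)
        cell[1] += 1
    return {pair: s / t for pair, (s, t) in counts.items()}
```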

5. Design Principles for Robust Agentic Systems

Derived from their analysis, the AgentFail authors compile actionable recommendations to guide agentic system design:

  • Explicit Role Specification/Modular Prompts: Mitigate agent-level issues by enforcing clear agent responsibilities and reusable prompt logic.
  • Rigorous Validation: Implement comprehensive input/output validation such as schema checks (e.g., JSON formatting).
  • Verification and Fallback Paths: Add secondary checker agents or alternative workflow branches for early error capture.
  • Progressive Complexity: Begin with simple serial/parallel templates before expanding to advanced graph topologies, reducing susceptibility to orchestration faults.
  • Efficiency-Robustness Tradeoff: Evaluate the balance between increased error-detection overhead and system responsiveness or computational cost.
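The "rigorous validation" guideline can be illustrated with a minimal inter-node output gate using only the standard library. This is a sketch, not the paper's implementation; the function name and required-field convention are assumptions:

```python
import json

def validate_node_output(raw, required_keys):
    """Minimal validation gate between workflow nodes (illustrative sketch).

    Parses a node's raw output as JSON and checks for required fields,
    raising early so a fallback branch or checker agent can take over
    instead of letting a malformed payload propagate downstream.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"node output is not valid JSON: {e}") from e
    missing = [k for k in required_keys if k not in payload]
    if missing:
        raise ValueError(f"node output missing required fields: {missing}")
    return payload
```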

A plausible implication is that robust multi-agent system engineering depends on integrating these guidelines with iterative diagnostic evaluation, leveraging both expert insight and machine learning–driven analysis.

6. Limitations, Open Challenges, and Prospects

AgentFail highlights several persistent obstacles. Chief among them is the diagnostic challenge presented by long, interconnected agent logs, in which causality may be indirect and errors propagate in non-obvious ways. Even with detailed taxonomy guidance, LLM-based automated root cause identification is constrained by context length and reasoning limits, as reflected by the ceiling at 33.6% accuracy. The complexity of functional dependencies and latent intermediate states exacerbates these issues.

Future prospects include the advancement of LLMs with improved causal reasoning over long context traces, more sophisticated prompt-taxonomy integration techniques, and incorporation of human-in-the-loop frameworks to buttress reliability. Further, the systematic expansion of annotated datasets is anticipated to underpin new verification, debugging, and repair strategies for agentic systems. This suggests that the availability of AgentFail establishes a foundational resource to drive methodological improvements and empirical progress in debugging and system repair for platform-orchestrated agentic systems.
