AgentErrorTaxonomy Overview
- AgentErrorTaxonomy is a modular, hierarchical system that defines and categorizes failure modes in intelligent agent and RAG pipelines.
- It organizes error types into modules such as Memory, Reflection, Planning, Action, and System-level, enabling root-cause identification and targeted remediation.
- The taxonomy underpins benchmark tools like AgentErrorBench and AgentDebug, improving error detection metrics and recovery strategies.
AgentErrorTaxonomy provides a modular, hierarchical system for the precise categorization, analysis, and correction of errors in intelligent agent pipelines, especially those employing LLMs and Retrieval-Augmented Generation (RAG) architectures. It systematically organizes failure modes at both the module and pipeline-stage levels, enabling granular annotation, root-cause identification, and targeted remediation. The taxonomy has demonstrable utility, underpinning benchmark construction (AgentErrorBench), debugging frameworks (AgentDebug), and the development and evaluation of RAG systems. Its structure, definitions, metrics, and applications are described below (Zhu et al., 29 Sep 2025, Leung et al., 15 Oct 2025).
1. Hierarchical Structure and Module Decomposition
AgentErrorTaxonomy employs a structured organization of agent failure types, grouping them into coherent high-level modules or pipeline stages. In LLM agent pipelines, the following five principal modules are defined, each with fine-grained error subcategories:
| Module | Error Subcategories |
|---|---|
| Memory | Over-simplification / Incomplete Summary, Hallucination (False Memory), Retrieval Failure |
| Reflection | Progress Misassessment, Outcome Misinterpretation, Causal Misattribution, Hallucination |
| Planning | Constraint Ignorance, Impossible Action, Inefficient Planning |
| Action | Planning–Action Disconnect, Format Error, Parameter Error |
| System-level | Step-Limit Exhaustion, Tool Execution Error, LLM Limit, Environment Error |
For RAG systems, the pipeline stages and subtypes include chunking, retrieval, reranking, and generation, with each stage broken down into specific error types such as overchunking, missed retrieval, low recall, abstention failure, and numerical error (Leung et al., 15 Oct 2025). This modular approach enables systemic understanding and tractable annotation.
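As a concrete illustration, this hierarchy maps directly onto a plain lookup structure. The sketch below transcribes the table above into Python; the dict representation and variable names are illustrative, not drawn from the cited implementations:

```python
# Illustrative encoding of the module -> error-subcategory hierarchy above.
# The dict representation and variable names are a sketch, not from the cited works.
AGENT_ERROR_TAXONOMY: dict[str, list[str]] = {
    "Memory": [
        "Over-simplification / Incomplete Summary",
        "Hallucination (False Memory)",
        "Retrieval Failure",
    ],
    "Reflection": [
        "Progress Misassessment",
        "Outcome Misinterpretation",
        "Causal Misattribution",
        "Hallucination",
    ],
    "Planning": [
        "Constraint Ignorance",
        "Impossible Action",
        "Inefficient Planning",
    ],
    "Action": [
        "Planning-Action Disconnect",
        "Format Error",
        "Parameter Error",
    ],
    "System-level": [
        "Step-Limit Exhaustion",
        "Tool Execution Error",
        "LLM Limit",
        "Environment Error",
    ],
}

# RAG errors are organized by pipeline stage rather than by agent module.
RAG_PIPELINE_STAGES = ["Chunking", "Retrieval", "Reranking", "Generation"]
```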
2. Formal Definitions, Root Causes, and Exemplars
Each taxonomy node is defined through:
- Precise definition: Technical description of what constitutes the failure mode.
- Root cause: Typical triggering mechanism or architectural flaw.
- Failure exemplar: Empirical occurrence in a real system.
For instance, within the Memory module:
- Over-simplification / Incomplete Summary: Compression of contextual information such that later-relevant detail is lost (e.g., omitting the storage location of an item in task agents).
- Hallucination (False Memory): Spurious recall of actions/observations, usually due to degeneracies in unconstrained LLM outputs.
- Retrieval Failure: Failure to access available context, arising from an inadequate retrieval window or poorly-constructed prompts.
Analogous fine-grained definitions are articulated for Reflection, Planning, Action, and System-level errors. RAG-specific error definitions include overchunking (splitting documents into excessively fine granularity), missed retrieval (failure to fetch ground-truth-supporting context), fabricated content (generation of non-evidenced output), and numerical error (miscomputation) (Leung et al., 15 Oct 2025). Empirical scenarios from ALFWorld, GAIA, WebShop, and other datasets ground these categories (Zhu et al., 29 Sep 2025).
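The three-part node specification (definition, root cause, exemplar) maps naturally onto a small record type. The following sketch is illustrative (class and field names are assumed, not from the source), instantiated with the Memory-module over-simplification entry described above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonomyNode:
    """One failure mode: its module plus the three-part specification above."""
    module: str
    name: str
    definition: str   # technical description of the failure mode
    root_cause: str   # typical triggering mechanism or architectural flaw
    exemplar: str     # empirical occurrence in a real system

# The Memory-module entry described above, encoded as a node.
oversimplification = TaxonomyNode(
    module="Memory",
    name="Over-simplification / Incomplete Summary",
    definition="Contextual information is compressed so that later-relevant detail is lost.",
    root_cause="Aggressive summarization of the interaction history (assumed phrasing).",
    exemplar="A task agent's summary omits the storage location of an item needed later.",
)
```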
3. Quantitative Metrics and Formal Error Characterization
The taxonomy supports granular annotation via mappings such as

$$e_t^{(m)} = f(s_t, a_t),$$

where $e_t^{(m)}$ denotes the error type at time $t$ in module $m$, and $s_t$ and $a_t$ are the system state and action, respectively. Key quantitative concepts include:
- Critical Error Step ($t^{*}$): The earliest intervention point whose correction flips the trajectory outcome from failure ($r(\tau) = 0$) to success ($r(\tau') = 1$) via counterfactual replacement $a_{t^{*}} \mapsto a'_{t^{*}}$.
- Error Propagation Length ($\ell$): Counts the subsequent steps affected by the initial fault, $\ell = |\{\, t > t^{*} : e_t \neq \varnothing \,\}|$.
- Annotation and detection metrics: Include step accuracy (prediction of $t^{*}$), step+module accuracy (identifying both $t^{*}$ and the responsible module), and all-correct (joint identification of $t^{*}$, module, and error type).
For RAG, the framework introduces chunk-level recall, coverage, and error-type agreement metrics, with human-LLM annotation comparisons and explicit formulas for error localization (Leung et al., 15 Oct 2025).
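A hedged sketch of how the three step-level detection metrics above could be computed from paired predicted and gold annotations follows; the record layout and function name are assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    step: int        # (predicted or gold) critical error step t*
    module: str      # responsible module, e.g. "Memory" or "Planning"
    error_type: str  # fine-grained subtype within that module

def detection_metrics(preds: list[ErrorAnnotation],
                      golds: list[ErrorAnnotation]) -> dict[str, float]:
    """Step accuracy, step+module accuracy, and all-correct over paired trajectories."""
    n = len(golds)  # assumes one prediction per gold-annotated trajectory, n > 0
    step = sum(p.step == g.step for p, g in zip(preds, golds)) / n
    step_module = sum(p.step == g.step and p.module == g.module
                      for p, g in zip(preds, golds)) / n
    all_correct = sum(p.step == g.step and p.module == g.module
                      and p.error_type == g.error_type
                      for p, g in zip(preds, golds)) / n
    return {"step": step, "step+module": step_module, "all-correct": all_correct}
```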
4. Application in Annotation and Benchmarks
AgentErrorTaxonomy is operationalized in both agent and RAG settings through benchmark datasets:
- AgentErrorBench: Contains 200 manually annotated trajectories across ALFWorld, WebShop, and GAIA with module- and step-level critical error tags, emphasizing root-cause errors that, if corrected, reverse the overall task outcome (Zhu et al., 29 Sep 2025).
- RAG Error Dataset: Consists of 406 annotated erroneous QA responses from the DragonBall pipeline, with error stages and subtypes determined by manual and LLM-assisted protocols. Agreement rates (e.g., 92.9% for error identification in RAG) indicate an effective annotation methodology (Leung et al., 15 Oct 2025).
The annotation process involves step-wise labeling per module, assignment of critical error(s), and, for RAG, pipeline stage-first error typing, supporting empirical analysis and methodology benchmarking.
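One plausible record layout for such step-wise annotations, shown for the agent setting (all names hypothetical; the published datasets may use a different schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StepLabel:
    step: int                         # index within the trajectory
    module: str                       # module active at this step
    error_type: Optional[str] = None  # None if the step is error-free

@dataclass
class TrajectoryAnnotation:
    task_id: str                         # e.g., an ALFWorld or WebShop episode
    step_labels: list[StepLabel] = field(default_factory=list)
    critical_step: Optional[int] = None  # root-cause step whose fix flips the outcome
```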
5. Taxonomy-Driven Debugging and Corrective Feedback
The taxonomy serves as the foundation for systematic debugging and intervention. In AgentDebug:
- Each agent rollout is analyzed to localize the critical error step ($t^{*}$).
- The matching error label maps to a dedicated debugging or feedback strategy, formulated as precise, error-specific guidance.
- The agent executes a re-rollout from step $t^{*}$ under this targeted correction; the process iterates if the task still fails.
Examples of feedback include: leveraging RAG retrieval to recall lost context (Memory::Retrieval Failure), enforcing subgoal verification (Reflection::Progress Misassessment), incorporating operational constraints (Planning::Constraint Ignorance), enforcing strict output schema (Action::Format Error), or prioritizing high-leverage actions (System::Step-Limit Exhaustion). Empirical results show that such targeted intervention raises all-correct and step accuracy by 24% and 17%, respectively, and increases task success by up to 26% across benchmarks (Zhu et al., 29 Sep 2025).
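This detect-correct-replay loop can be summarized in a short sketch. The helper callables and the trajectory's success flag stand in for components whose real AgentDebug interfaces are not specified in the source:

```python
from typing import Callable

def taxonomy_guided_debug(
    trajectory,
    localize: Callable,      # trajectory -> (t_star, module, error_type)
    feedback_for: Callable,  # (module, error_type) -> guidance string
    rollout_from: Callable,  # (trajectory, t_star, guidance) -> new trajectory
    max_iters: int = 3,
) -> bool:
    """Detect-correct-replay loop: localize the critical error, map its label
    to targeted feedback, and re-roll out from that step until success."""
    for _ in range(max_iters):
        t_star, module, error_type = localize(trajectory)        # find t* and its label
        guidance = feedback_for(module, error_type)              # taxonomy-specific feedback
        trajectory = rollout_from(trajectory, t_star, guidance)  # replay from t*
        if trajectory.success:  # stop once the corrected rollout succeeds
            return True
    return False
```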
In RAG pipelines, error type diagnosis guides practical mitigations: chunk resizing, semantic chunking (for context errors), hybrid or fine-tuned retrieval (for missed or low-relevance retrieval), reranker strengthening (hard negatives, in-domain data), and generation-side abstention or post-generation fact-checking modules (Leung et al., 15 Oct 2025).
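These diagnosis-to-mitigation pairings amount to a lookup table; the sketch below restates them as one (keys and phrasing are illustrative, not an interface from the cited work):

```python
# Restates the diagnosis -> mitigation pairings above as a lookup table.
RAG_MITIGATIONS: dict[str, str] = {
    "context/chunking error":  "Resize chunks or switch to semantic chunking.",
    "missed retrieval":        "Use hybrid or fine-tuned retrieval.",
    "low-relevance retrieval": "Use hybrid or fine-tuned retrieval.",
    "reranking error":         "Strengthen the reranker with hard negatives and in-domain data.",
    "unsupported generation":  "Enable generation-side abstention.",
    "fabricated content":      "Add a post-generation fact-checking module.",
}
```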
6. Comparative Scope and Practical Implications
AgentErrorTaxonomy encompasses both general agent-based system failures and specialized RAG artifacts. Its modular structure and fine resolution:
- Enable systemic, root-cause-first debugging, distinguishing between upstream (e.g., chunking, memory) and downstream (e.g., generation, action) faults.
- Support methodology-agnostic expansion to new agent types or domains by appending relevant modules and classifying subtypes accordingly.
The benchmarking and debugging frameworks built atop this taxonomy represent state-of-the-art diagnostic infrastructure for error analysis and reliability improvement in LLM and RAG-based pipeline architectures (Zhu et al., 29 Sep 2025, Leung et al., 15 Oct 2025).
7. Distinction from Related Taxonomies and Methodologies
AgentErrorTaxonomy differs from general error taxonomies by grounding every failure mode in a well-defined module or pipeline phase, supporting both LLM agent (memory, planning, action, system-level) and RAG (chunking, retrieval, reranking, generation) paradigms. Annotation agreement, error-propagation analysis, and LLM-based auto-classification (e.g., RAGEC) provide quantifiable evidence of its usability and discriminative power. Its design prioritizes an actionable linkage between error type, measurement, and remedial action, distinguishing it from purely descriptive frameworks (Leung et al., 15 Oct 2025).