Where LLM Agents Fail and How They can Learn From Failures (2509.25370v1)

Published 29 Sep 2025 in cs.AI

Abstract: LLM agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug

Summary

  • The paper introduces a systematic error taxonomy (AgentErrorTaxonomy) to classify failure modes in planning, action, reflection, memory, and system modules.
  • It presents the AgentDebug framework that pinpoints root-cause errors and provides iterative, actionable feedback to enhance LLM agent performance.
  • Experimental results demonstrate up to a 26% improvement in task success rates across environments like ALFWorld, GAIA, and WebShop.

Where LLM Agents Fail and How They Can Learn From Failures

This paper examines the vulnerabilities of LLM agents on complex, multi-step tasks and introduces a methodology to address them. LLM agents integrate planning, memory, reflection, and tool-use modules to solve such tasks, but these sophisticated architectures are prone to cascading failures, where a single root-cause error propagates through subsequent decisions and leads to task failure. The paper addresses this problem through three contributions: AgentErrorTaxonomy, AgentErrorBench, and AgentDebug.

Introduction

LLM agents have become critical in diverse fields such as scientific discovery, web interaction, and research support. Despite their potential, they face robustness challenges, often making errors in reasoning, tool use, and instruction interpretation. Prior research has focused on qualitatively enumerating error types without systematic mechanisms to trace and fix these failures. This paper addresses that gap by offering a systematic approach to error diagnosis and correction.

Figure 1: Motivation for AgentDebug: A single root-cause failure (b) can propagate through subsequent steps (c), compounding errors and leading to task failure. AgentDebug (d) addresses this bottleneck by tracing failures back to their source and providing actionable feedback that enables agents to evolve into more robust versions.

AgentErrorTaxonomy and AgentErrorBench

AgentErrorTaxonomy

The taxonomy categorizes failure modes into five main modules: Planning, Action, Reflection, Memory, and System. These categories help pinpoint where an error occurs and how it contributes to overall failure: memory errors, such as false recall, distort later reasoning; reflection failures block course corrections; and planning errors often lead to logically unsound strategies.

Figure 2: Pipeline of the proposed AgentErrorTaxonomy and AgentErrorBench. Failed trajectories are collected, analyzed to develop a taxonomy of errors, and then annotated with root causes and actionable feedback to form the benchmark.
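
To make the module-level classification concrete, here is a minimal sketch of how such a taxonomy could be represented in code. The five module names come from the paper; the specific failure-mode strings (beyond false recall) and all class and function names are illustrative assumptions, not the paper's actual schema or released code.

```python
from enum import Enum

class Module(Enum):
    """Agent modules in which a failure can originate, per the taxonomy."""
    MEMORY = "memory"
    REFLECTION = "reflection"
    PLANNING = "planning"
    ACTION = "action"
    SYSTEM = "system"

# Illustrative (not exhaustive) failure modes per module; the paper's
# taxonomy is more fine-grained than this sketch.
FAILURE_MODES = {
    Module.MEMORY: ["false_recall", "omitted_context"],
    Module.REFLECTION: ["missed_error_signal", "overconfident_self_assessment"],
    Module.PLANNING: ["logically_unsound_plan", "goal_drift"],
    Module.ACTION: ["malformed_tool_call", "invalid_argument"],
    Module.SYSTEM: ["tool_timeout", "environment_error"],
}

def classify(step_annotation: dict) -> tuple[Module, str]:
    """Map a per-step annotation onto a (module, failure_mode) pair."""
    module = Module(step_annotation["module"])
    mode = step_annotation["failure_mode"]
    if mode not in FAILURE_MODES[module]:
        raise ValueError(f"unknown failure mode {mode!r} for module {module.value}")
    return module, mode
```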

AgentErrorBench

This is the first systematically annotated dataset of failure trajectories, drawn from the ALFWorld, GAIA, and WebShop environments. The benchmark facilitates the comparison and study of agent debugging methods by providing a structured testbed for error analysis.
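
The summary does not spell out the dataset schema, but an annotated failure trajectory could plausibly be structured along the following lines; every class and field name here is hypothetical and serves only to show what "annotated with root causes and actionable feedback" might look like as a record.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One step of an agent rollout: what the agent saw, thought, and did."""
    observation: str
    thought: str
    action: str

@dataclass
class AnnotatedTrajectory:
    """A failed rollout annotated with its root cause, in the spirit of AgentErrorBench."""
    environment: str                    # "ALFWorld", "GAIA", or "WebShop"
    task: str                           # natural-language task description
    steps: list[Step] = field(default_factory=list)
    root_cause_step: int = -1           # index of the critical (root-cause) error
    root_cause_module: str = ""         # e.g. "planning"
    failure_mode: str = ""              # e.g. "logically_unsound_plan"
    feedback: str = ""                  # actionable correction for a re-rollout
```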

AgentDebug Framework

AgentDebug serves as a debugging framework that identifies root-cause failures and provides corrective feedback, enabling LLM agents to recover and improve iteratively. It operates in three stages:

  1. Fine-Grained Analysis - Errors are categorized based on the AgentErrorTaxonomy.
  2. Critical Error Detection - Pinpoints the root-cause error that directly leads to task failure.
  3. Iterative Debugging - Provides feedback and allows the agent to iterate through the task again with corrections applied (a minimal sketch of this loop follows Figure 3).

Figure 3: Overview of AgentDebug. (Left) LLM agent rollouts alternate between memory, planning, reflection, and action. (Right) AgentDebug debugs trajectories in three stages: (1) fine-grained analysis across steps and modules, (2) detection of the critical error that triggers failure, and (3) iterative re-rollouts with actionable feedback to turn failures into successes.
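
As a rough illustration of the three-stage loop, the sketch below assumes hypothetical `env`, `agent`, and `debugger` objects with `rollout`, `analyze`, `detect_critical_error`, and `generate_feedback` methods; none of these interfaces are taken from the released code, they only mirror the stages described above.

```python
def agent_debug_loop(env, agent, debugger, max_rounds: int = 3) -> bool:
    """Hypothetical outer loop: roll out, debug on failure, retry with feedback."""
    feedback = None
    for _ in range(max_rounds):
        trajectory, success = agent.rollout(env, feedback=feedback)
        if success:
            return True
        # Stage 1: fine-grained analysis of every (step, module) pair
        # against the error taxonomy.
        step_labels = debugger.analyze(trajectory)
        # Stage 2: isolate the critical (root-cause) error rather than
        # every surface-level issue.
        critical = debugger.detect_critical_error(step_labels)
        # Stage 3: turn the root cause into actionable feedback that
        # conditions the next rollout.
        feedback = debugger.generate_feedback(trajectory, critical)
    return False
```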

Experimental Results

Experiments show that AgentDebug achieves substantially higher accuracy in root-cause error detection (24% higher all-correct accuracy and 17% higher step accuracy than the strongest baseline) and improves task success by up to 26% (relative) across ALFWorld, GAIA, and WebShop. The framework outperforms previous baselines by targeting root-cause errors rather than attempting to fix every surface-level issue.

Figure 4: Downstream debugging performance on ALFWorld. Results are shown across three backbone models (GPT-4o-mini, Qwen3-8B, Qwen3-Next-80B) and different methods. AgentDebug consistently outperforms strong baselines.

Conclusion

The proposed framework successfully addresses the challenge of cascading failures in LLM agents by providing a systematic approach to error identification and correction. This work paves the way for the development of more reliable and adaptive LLM agents by enabling them to learn and evolve from their failures, ultimately enhancing their robustness in real-world applications.
