
An Empirical Study on Failures in Automated Issue Solving (2509.13941v1)

Published 17 Sep 2025 in cs.SE, cs.AI, and cs.CL

Abstract: Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic tools show great promise, they still fail on a substantial portion of tasks. Moreover, current evaluations primarily report aggregate issue-solving rates, which obscure the underlying causes of success and failure, making it challenging to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified under varying task characteristics. Furthermore, to move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. From this analysis, we developed a comprehensive taxonomy of failure modes comprising 3 primary phases, 9 main categories, and 25 fine-grained subcategories. Then we systematically analyze the distribution of the identified failure modes, the results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these insights, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent tasked with providing strategic oversight and course-correction for a primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that frequently lead to failure. Experiments show that our framework solves 22.2% of previously intractable issues for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.

Summary

  • The paper investigates failure modes in automated defect repair tools by analyzing the performance of OpenHands, Tools Claude, and Agentless across 500 issues.
  • It reveals that task complexity and flawed reasoning cause significant performance drops and distinct failure patterns in different tool architectures.
  • The paper introduces a collaborative Expert-Executor framework that resolves 22.2% of cases a leading single agent had previously failed on.

An Empirical Study on Failures in Automated Issue Solving

The paper "An Empirical Study on Failures in Automated Issue Solving" (2509.13941) investigates the challenges and failure modes encountered by automated tools designed to autonomously identify and repair defective code snippets. Through a structured analysis of tools evaluated on the SWE-Bench-Verified benchmark, the authors aim to uncover the underlying causes of these failures and propose an innovative framework to improve issue solving capabilities.

Introduction to Automated Issue Solving

Automated issue solving is a pivotal challenge in software engineering, aiming to autonomously identify and repair defects across a codebase. Although tools leveraging LLMs have shown promise, they still fail on a substantial portion of tasks. SWE-Bench has established itself as the standard benchmark for evaluating progress in this area. This paper examines these failures in depth by analyzing the performance of three tools (OpenHands, Tools Claude, and Agentless) and constructing a taxonomy of their failure modes.

Performance and Efficiency Analysis

The paper evaluates three representative tools across 500 issues from SWE-Bench-Verified: OpenHands, Tools Claude, and Agentless. Results indicate similar overall issue-solving rates (49-53%). However, the tools exhibit distinct strengths across problem types, suggesting potential benefits from ensemble methods.

Figure 1: Workflow of the pipeline-based and agent-based tools.
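
Because the three tools succeed on largely non-overlapping subsets of issues, a simple union-style ensemble gives an upper bound on what combining them could achieve. The sketch below illustrates that calculation; the result files, their schema, and the load_resolved helper are hypothetical placeholders, not artifacts from the paper.

```python
# Illustrative sketch: estimating the upper bound of a union ensemble
# from per-tool results on SWE-Bench-Verified. The result files and their
# schema are hypothetical placeholders, not the paper's artifacts.
import json

TOOLS = ["openhands", "tools_claude", "agentless"]  # hypothetical file stems

def load_resolved(path: str) -> set[str]:
    """Return the set of instance IDs a tool resolved (schema assumed)."""
    with open(path) as f:
        results = json.load(f)  # e.g. {"django__django-11099": true, ...}
    return {iid for iid, resolved in results.items() if resolved}

resolved_sets = {t: load_resolved(f"{t}_results.json") for t in TOOLS}

total = 500  # SWE-Bench-Verified instances analyzed in the study
for tool, ids in resolved_sets.items():
    print(f"{tool}: {len(ids) / total:.1%} resolved")

# Upper bound if the three tools were combined by taking any success.
union = set().union(*resolved_sets.values())
print(f"union ensemble upper bound: {len(union) / total:.1%}")
```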

Task Characteristics and Performance

Task complexity significantly impacts performance. Issue-solving rates degrade notably as task complexity increases, particularly when multi-file coordination is required. OpenHands demonstrates resilience in complex scenarios, whereas Agentless struggles more, underscoring the importance of architectural design for handling intricate tasks.

Figure 2: Issue solving rates across different difficulty levels.

Interaction Efficiency

Interaction efficiency analysis reveals diminishing returns for agent-based processes. Most successes occur within the first 25 interaction rounds; beyond this point, further rounds rarely recover the task and instead prolong failures, highlighting inefficiencies in tool design.

Figure 3: Interaction rounds for Tools Claude.
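
One practical reading of this finding is to budget interaction rounds and stop once the budget is exhausted rather than letting the agent loop indefinitely. The sketch below illustrates such a policy; the agent.step() interface and the 25-round budget are assumptions drawn from the reported statistic, not an API defined by the paper or the evaluated tools.

```python
# Minimal sketch of a round-budget policy motivated by the finding that
# most successes occur within the first ~25 interaction rounds.
# `agent`, `step()`, and `is_solved()` are hypothetical interfaces.
MAX_ROUNDS = 25

def run_with_budget(agent, task, max_rounds: int = MAX_ROUNDS):
    """Run the agent loop, terminating once the round budget is exhausted."""
    for round_idx in range(1, max_rounds + 1):
        observation = agent.step(task)          # one reason/act/observe cycle
        if observation.is_solved():
            return {"solved": True, "rounds": round_idx}
    # Beyond the budget, further rounds rarely pay off in the reported data,
    # so stop instead of accumulating cost on a likely failure.
    return {"solved": False, "rounds": max_rounds}
```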

Taxonomy of Failure Modes in Automated Issue Solving

The authors develop a taxonomy encompassing three stages of the issue-solving process: Localization, Repair, and Iterative Validation. This structured analysis reveals distinct failure types, such as issue misleading and superficial information matching during localization, and flawed reasoning leading to cognitive deadlocks during iterative validation.

Figure 4: Distribution of failures across different tools.
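
The phase structure of the taxonomy can be captured in a small lookup table. The skeleton below fills in only the categories named in this summary; the full 9 categories and 25 subcategories appear in the paper, and the identifiers here are illustrative.

```python
# Illustrative skeleton of the failure-mode taxonomy: three phases, with
# only the categories named in this summary filled in. The complete
# taxonomy (9 categories, 25 subcategories) is given in the paper.
FAILURE_TAXONOMY = {
    "Localization": [
        "Issue misleading",
        "Superficial information matching",
        # ... further categories/subcategories from the paper
    ],
    "Repair": [
        # ... categories from the paper
    ],
    "Iterative Validation": [
        "Flawed reasoning",
        "Cognitive deadlock",
        # ... further categories/subcategories from the paper
    ],
}

def classify(failure_label: str) -> str | None:
    """Map a labeled failure back to its phase (sketch only)."""
    for phase, categories in FAILURE_TAXONOMY.items():
        if failure_label in categories:
            return phase
    return None
```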

Analysis of Failure Modes

Distinct patterns emerge across tool architectures. Pipeline-based tools most often fail during localization, while agent-based tools encounter more iteration anomalies driven by flawed reasoning. As tasks grow more complex, failures shift toward the iterative validation phase, where errors demand deeper reasoning to diagnose and resolve.

Mitigation Exploration with a Collaborative Architecture

Given that flawed reasoning is a major cause of failure, the authors propose a collaborative Expert-Executor framework. Emulating human peer review, a supervisory Expert agent provides strategic oversight and corrective guidance to a primary Executor agent. On cases that a leading single agent previously failed to resolve, the framework solves 22.2%, evidencing its effectiveness in breaking reasoning deadlocks.
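
In outline, the framework pairs a primary Executor with a supervisory Expert that periodically reviews the Executor's trajectory and injects corrective guidance when reasoning stalls. The loop below is a minimal sketch of that interaction pattern under assumed interfaces (executor.step, expert.review, receive_guidance); it is not the authors' released implementation.

```python
# Minimal sketch of an Expert-Executor interaction loop. The Executor works
# on the issue; the Expert periodically reviews the trajectory and injects
# corrective guidance when it detects stalled or circular reasoning.
# All interfaces (executor, expert, step, review, receive_guidance) are
# assumptions made for illustration.
def solve_issue(executor, expert, issue, max_rounds: int = 50, review_every: int = 5):
    trajectory = []
    for round_idx in range(1, max_rounds + 1):
        action, observation = executor.step(issue, trajectory)
        trajectory.append((action, observation))
        if observation.get("resolved"):
            return {"resolved": True, "rounds": round_idx}
        # Periodic supervisory review: the Expert inspects recent steps and
        # may return strategic feedback (e.g. "wrong file", "break the loop").
        if round_idx % review_every == 0:
            feedback = expert.review(issue, trajectory)
            if feedback is not None:
                executor.receive_guidance(feedback)  # course-correction
    return {"resolved": False, "rounds": max_rounds}
```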

Conclusion

This empirical study provides valuable insight into why automated issue resolution fails, identifying clear architecture-specific vulnerabilities. The proposed Expert-Executor framework demonstrates how collaborative oversight can improve resolution, paving the way for more robust automated issue-solving tools. The comprehensive taxonomy and strategic oversight mechanisms presented in the paper highlight the need for tools designed to diagnose, collaborate, and reason adaptively rather than rely on scale alone.
