
An Empirical Study on LLM-based Agents for Automated Bug Fixing (2411.10213v1)

Published 15 Nov 2024 in cs.SE and cs.AI

Abstract: LLMs and LLM-based agents have been applied to fix bugs automatically, demonstrating their capability to address software defects through development-environment interaction, iterative validation, and code modification. However, systematic analysis of these agent and non-agent systems remains limited, particularly regarding performance variations among top-performing ones. In this paper, we examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing. We first assess each system's overall performance, noting instances solvable by all or none of these systems, and explore why some instances are uniquely solved by specific system types. We also compare fault localization accuracy at the file and line levels and evaluate bug reproduction capabilities, identifying instances solvable only through dynamic reproduction. Through this analysis, we conclude that further optimization is needed in both the LLM itself and the design of the agentic flow to improve the effectiveness of agents in bug fixing.

The paper "An Empirical Study on LLM-based Agents for Automated Bug Fixing" offers a rigorous evaluation of LLM-based agents for automated bug fixing across code repositories. It contributes a comprehensive analysis of the performance of various systems, both proprietary and open-source, using the SWE-bench Lite benchmark. This benchmark evaluates systems on real-world software defects, providing a robust platform for assessing automated bug-fixing capabilities.

Systematic Assessment of LLM-based Agents

The paper investigates seven LLM-based systems: four commercial (MarsCode Agent, Honeycomb, Gru, and Alibaba Lingma Agent) and three open-source (AutoCodeRover, Agentless + RepoGraph, and Agentless). The systems are scrutinized on their ability to interact with the development environment, execute iterative validation, and apply code modifications to fix bugs.

Key Contributions and Findings

  1. Comparative Performance Analysis: The systems were evaluated for their ability to successfully fix bugs across 300 test cases from SWE-bench Lite. MarsCode Agent emerged as the top performer with a 39.33% issue resolution rate, though the performance varied significantly among systems, highlighting differences in handling problem-solving tasks. The analysis suggested that higher issue resolution often correlates with more accurate and nuanced fault localization.
  2. Fault Localization Metrics: Fault localization is critical for effective bug fixing. The paper reports variations in localization accuracy across systems, highlighting that line-level localization matters more than file-level localization for producing successful fixes. MarsCode Agent, with its multi-layered approach integrating code graphs, LLMs, and software analysis, achieved superior results. Likewise, other systems with strong line-level localization showed correspondingly better patch generation success.
  3. Role of Reproduction: Bug reproduction was found to be a key component in verifying and diagnosing defects, especially when issue descriptions lacked detail. However, the reliance on reproduction also presented challenges. Systems implementing reproduction methods sometimes experienced misguidance due to extraneous data derived from reproduction attempts, signaling the need for improved design strategies to handle such data reliably.
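To make the distinction between file-level and line-level fault localization concrete, here is a minimal hypothetical sketch. The helper names and the patch representation (a list of `(file, edited_lines)` pairs) are illustrative assumptions, not the paper's actual evaluation code; the point is only that the line-level criterion is strictly harder to satisfy than the file-level one.

```python
# Illustrative sketch (assumed representation, not the paper's code):
# a predicted patch is a list of (file, edited_line_numbers) pairs,
# and the gold fix is a set of files plus their buggy lines.

def file_level_hit(predicted, gold_files):
    """File-level: the prediction edits every file the gold fix touches."""
    return set(gold_files).issubset({f for f, _ in predicted})

def line_level_hit(predicted, gold_files, gold_lines):
    """Line-level (stricter): edited lines must cover the gold buggy lines."""
    edited = {}
    for f, lines in predicted:
        edited.setdefault(f, set()).update(lines)
    return all(
        set(gold_lines[f]).issubset(edited.get(f, set()))
        for f in gold_files
    )

# Toy example: the gold fix changes lines 10-12 of one file.
gold_files = ["pkg/core.py"]
gold_lines = {"pkg/core.py": [10, 11, 12]}

good_patch = [("pkg/core.py", [10, 11, 12, 13])]
print(file_level_hit(good_patch, gold_files))               # True
print(line_level_hit(good_patch, gold_files, gold_lines))   # True

# Right file, wrong lines: counts at file level but not line level,
# which is why file-level accuracy alone can overstate localization.
wrong_lines = [("pkg/core.py", [40, 41])]
print(file_level_hit(wrong_lines, gold_files))              # True
print(line_level_hit(wrong_lines, gold_files, gold_lines))  # False
```

A system can score well on file-level accuracy while still generating failing patches; under this sketch, that gap shows up exactly when `file_level_hit` passes but `line_level_hit` fails.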

Implications of the Findings

The implications of these findings are both practical and theoretical. Practically, insight into the need for stronger reasoning capabilities in LLMs and careful design of agentic interaction flows can inform better automated bug-fixing systems. Theoretically, the results reinforce the pursuit of more intelligent systems capable of nuanced fault localization and of reconciling multiple, potentially erroneous, sources of reproduction data.

Speculations on Future Directions

Future developments in AI for software engineering could see an increased integration of more sophisticated reasoning abilities within LLMs, paired with better contextual understanding and environmental feedback mechanisms. Additionally, more refined agent designs that incorporate advanced reconciliation methods for discrepancies in reproduction results could further boost the efficacy of automated bug-fixing systems.

The paper presents a detailed and thoughtful evaluation of current systems, promoting a deeper understanding of how LLM-based agents can evolve to overcome persistent challenges in automated software maintenance. It also points to promising directions for research on robust, efficient automated repair of complex programming errors.

Authors (4)
  1. Xiangxin Meng
  2. Zexiong Ma
  3. Pengfei Gao
  4. Chao Peng