
An Analysis of the Automatic Bug Fixing Performance of ChatGPT (2301.08653v1)

Published 20 Jan 2023 in cs.SE

Abstract: To support software developers in finding and fixing software bugs, several automated program repair techniques have been introduced. Given a test suite, standard methods usually either synthesize a repair, or navigate a search space of software edits to find test-suite passing variants. Recent program repair methods are based on deep learning approaches. One of these novel methods, which is not primarily intended for automated program repair, but is still suitable for it, is ChatGPT. The bug fixing performance of ChatGPT, however, is so far unclear. Therefore, in this paper we evaluate ChatGPT on the standard bug fixing benchmark set, QuixBugs, and compare the performance with the results of several other approaches reported in the literature. We find that ChatGPT's bug fixing performance is competitive with the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches. In contrast to previous approaches, ChatGPT offers a dialogue system through which further information, e.g., the expected output for a certain input or an observed error message, can be entered. By providing such hints to ChatGPT, its success rate can be further increased, fixing 31 out of 40 bugs, outperforming the state of the art.

An Analysis of the Automatic Bug Fixing Performance of ChatGPT

The paper "An Analysis of the Automatic Bug Fixing Performance of ChatGPT" presents a comparative study of the use of ChatGPT for automated program repair (APR). It evaluates ChatGPT against established approaches such as Codex, CoCoNut, and traditional APR methods on the QuixBugs benchmark suite, which consists of small, challenging programming problems.

Key Findings

  1. Competitive Performance: ChatGPT demonstrates competitive performance when compared to other deep learning-based methods like Codex and CoCoNut. It successfully fixes 19 out of 40 benchmark problems, which aligns closely with Codex's 21 successes and CoCoNut's similar performance. Notably, this performance substantially surpasses that of standard APR approaches, which only solve seven problems.
  2. Enhanced by User Interaction: A standout feature of ChatGPT is its dialogue system, which lets users provide additional context or hints. This interactivity allows for improved results: when users supply targeted hints, ChatGPT's success rate climbs to 31 of the 40 problems. These findings underline ChatGPT's adaptability and its potential for improved performance through human interaction (a sketch of such a dialogue follows this list).
  3. Diverse Response Patterns: The research categorizes ChatGPT's responses into several classes, including asking for more information, not finding a bug, providing correct fixes, attempting irrelevant fixes, introducing new bugs along with fixes, and suggesting alternative implementations. It is observed that ChatGPT frequently seeks more information, indicating the potential benefits of its conversational capabilities to achieve better results.
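
As an illustration of the hint mechanism in point 2, the two-turn pattern might look as follows. This is a minimal sketch: the buggy function is in the style of QuixBugs' BITCOUNT problem, and the exact wording of both messages is invented here, not quoted from the paper's transcripts.

```python
conversation = [
    # Turn 1: the bare repair request (the BITCOUNT bug is that the loop
    # body uses n ^= n - 1 where it should use n &= n - 1):
    {"role": "user", "content": (
        "Does this program have a bug? How to fix it?\n\n"
        "def bitcount(n):\n"
        "    count = 0\n"
        "    while n:\n"
        "        n ^= n - 1\n"
        "        count += 1\n"
        "    return count\n"
    )},
    # Turn 2: if the first answer misses the bug, a follow-up supplies a
    # hint describing observed behavior (an input and its expected output):
    {"role": "user", "content": (
        "The function does not work. For input 127 it runs into an "
        "infinite loop; the expected output is 7."
    )},
]
```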

Methodology and Comparative Evaluation

The paper evaluates ChatGPT on the QuixBugs benchmark set, which consists of 40 Python problems. Each problem was posed to ChatGPT independently four times, and the correctness of the provided fixes was verified manually. To provide a comprehensive evaluation, ChatGPT's performance was compared with that of the state-of-the-art methods Codex and CoCoNut, as well as traditional APR systems.
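
Concretely, the protocol can be pictured as the loop below. This is a sketch only: `ask_chatgpt()` and `manually_verify()` are placeholders for the manual interaction with the ChatGPT interface and the authors' hand inspection, not real APIs, and the prompt wording follows the pattern used in the illustration above.

```python
# Sketch of the evaluation protocol: each of the 40 QuixBugs Python problems
# is posed to ChatGPT in four independent conversations, and each answer is
# verified by hand. ask_chatgpt() and manually_verify() stand in for these
# manual steps; neither is a real API.
PROMPT = "Does this program have a bug? How to fix it?\n\n{code}"
RUNS_PER_PROBLEM = 4

def evaluate(problems):
    """problems maps a problem name to its buggy Python source."""
    solved = set()
    for name, buggy_code in problems.items():
        for _ in range(RUNS_PER_PROBLEM):
            answer = ask_chatgpt(PROMPT.format(code=buggy_code))  # fresh session
            if manually_verify(name, answer):  # fix judged correct by hand
                solved.add(name)
                break  # one correct answer counts the problem as solved
    return solved
```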

Standard APR systems rely heavily on test suites to judge the correctness of suggested repairs, which brings limitations such as high computational cost and difficulty generalizing beyond the given test cases. In contrast, deep learning-based approaches, including ChatGPT, learn repair patterns from data; they show strong potential even though they guarantee neither compilation nor functional correctness of their output.
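
To make the contrast concrete, here is a minimal sketch of the test-suite-driven validation that standard APR relies on. The function name `bitcount` is only an example, and a real harness would additionally sandbox and time-limit each run (a non-terminating candidate would otherwise hang this check).

```python
# Sketch of test-suite-based patch validation: a candidate patch is kept
# only if the patched program compiles and passes every test case.
def is_plausible(candidate_source, test_cases):
    namespace = {}
    try:
        exec(candidate_source, namespace)  # compile and load the candidate
    except Exception:
        return False  # many generated candidates do not even compile
    fn = namespace.get("bitcount")  # function under repair (example name)
    if fn is None:
        return False
    return all(fn(inp) == expected for inp, expected in test_cases)
```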

Implications and Future Work

The results outlined in this paper highlight the potential advantages of using conversational models like ChatGPT for automated program repair. ChatGPT not only competes with existing DL-based APR tools but also opens a unique opportunity for human-machine interaction, which can significantly enhance the bug-fixing process.

However, verifying the outputs of such systems remains an inherent challenge, raising the question of how to balance manual verification effort against trust in machine-generated output. Improving ChatGPT's understanding of context and incorporating verification strategies such as automated testing could bridge this gap and yield more reliable automated repair systems.

As future work, integrating automated tools that generate informed hints for ChatGPT, together with strategies for automatically validating the repaired code, could make ChatGPT an even more effective tool for developers. This work lays a foundation for adopting conversational AI in software maintenance and evolution tasks, pointing to a new dimension of human-AI collaboration in software engineering.
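
One way the hint-generation step could be automated is to derive the follow-up message from the first failing test, as in the sketch below. The helper is hypothetical, the chat call itself is omitted, and handling of non-terminating candidates (timeouts) is left out for brevity.

```python
# Sketch of the hint-generation idea from the future-work discussion: turn
# the first failing test into a follow-up message for the model.
def hint_from_tests(fn, test_cases):
    for inp, expected in test_cases:
        try:
            actual = fn(inp)
        except Exception as exc:
            return f"The function raises {type(exc).__name__} for input {inp!r}."
        if actual != expected:
            return (f"The function does not work. For input {inp!r} the "
                    f"expected output is {expected!r}, but it returns {actual!r}.")
    return None  # all tests pass, so no hint is needed

# Usage: if a hint is produced, it would be sent as the next dialogue turn,
# e.g. via whatever chat interface is available (not a real API here):
#   hint = hint_from_tests(candidate_fn, tests)
#   if hint: send_followup(hint)
```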

Authors (4)
  1. Dominik Sobania
  2. Martin Briesch
  3. Carol Hanna
  4. Justyna Petke
Citations (267)