
When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done? (2403.00448v1)

Published 1 Mar 2024 in cs.SE

Abstract: In recent years, LLMs have demonstrated substantial potential in addressing automatic program repair (APR) tasks. However, current evaluations of these models for APR focus solely on the limited context of the single function or file where the bug is located, overlooking the valuable information in the repository-level context. This paper investigates the performance of popular LLMs on repository-level repair tasks. We introduce RepoBugs, a new benchmark comprising 124 typical repository-level bugs from open-source repositories. Preliminary experiments using GPT-3.5, given only the function where the error is located, reveal that the repair rate on RepoBugs is only 22.58%, diverging significantly from the performance of GPT-3.5 on function-level bugs in related studies. This underscores the importance of providing repository-level context when addressing bugs at this level. However, the repository-level context gathered by the preliminary method often proves redundant and imprecise, and easily exceeds the prompt length limit of LLMs. To solve this problem, we propose a simple and universal repository-level context extraction method (RLCE) designed to provide more precise context for repository-level code repair tasks. Evaluations of three mainstream LLMs show that RLCE significantly enhances their ability to repair repository-level bugs, with improvements of up to 160% over the preliminary method. Additionally, we conduct a comprehensive analysis of the effectiveness and limitations of RLCE, along with the capacity of LLMs to address repository-level bugs, offering valuable insights for future research.
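The abstract describes RLCE only at a high level: extract the repository definitions relevant to the buggy function, keep the result within the model's prompt limit, and feed it to the LLM alongside the bug. Below is a minimal, hypothetical sketch of that extract-then-prompt structure in Python. The function names (`referenced_names`, `collect_context`, `build_repair_prompt`) and the character budget are illustrative assumptions, not the paper's actual RLCE implementation.

```python
import ast
from pathlib import Path


def referenced_names(func_source: str) -> set[str]:
    """Names the buggy function refers to: called helpers, globals, classes."""
    tree = ast.parse(func_source)
    return {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}


def collect_context(repo_root: str, buggy_func: str, budget: int = 6000) -> str:
    """Walk the repository and pull in top-level definitions the buggy function
    uses, stopping once a character budget (a stand-in for the prompt length
    limit the abstract mentions) is reached."""
    wanted = referenced_names(buggy_func)
    pieces, used = [], 0
    for path in sorted(Path(repo_root).rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        for node in tree.body:
            if (isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
                    and node.name in wanted):
                seg = ast.get_source_segment(source, node)
                if seg and used + len(seg) <= budget:
                    pieces.append(f"# from {path}\n{seg}")
                    used += len(seg)
    return "\n\n".join(pieces)


def build_repair_prompt(buggy_func: str, repo_root: str) -> str:
    """Prompt layout: extracted repository context first, then the buggy function."""
    context = collect_context(repo_root, buggy_func)
    return (
        "Relevant definitions from the repository:\n"
        f"{context}\n\n"
        "Fix the bug in the following function:\n"
        f"{buggy_func}"
    )
```

A production pipeline would additionally need scope-aware name resolution (imports, aliases, methods) and a tokenizer-based budget rather than a character count; the sketch only conveys the structure the abstract describes, namely extracting precise context instead of dumping whole files into the prompt.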

Authors (7)
  1. Yuxiao Chen (66 papers)
  2. Jingzheng Wu (9 papers)
  3. Xiang Ling (12 papers)
  4. Changjiang Li (22 papers)
  5. Zhiqing Rui (2 papers)
  6. Tianyue Luo (8 papers)
  7. Yanjun Wu (26 papers)
Citations (3)