How Far Can We Go with Practical Function-Level Program Repair? (2404.12833v2)

Published 19 Apr 2024 in cs.SE

Abstract: Recently, multiple Automated Program Repair (APR) techniques based on LLMs have been proposed to enhance repair performance. These techniques mainly focus on single-line or hunk-level repair, which limits them in real-world application due to the narrow repair task scope and the cost of statement-level fault localization. The more practical function-level APR, which broadens the task to fixing entire buggy functions and requires only cost-efficient function-level fault localization, remains underexplored. In this paper, we conduct the first comprehensive study of LLM-based function-level APR, investigating the effects of the few-shot learning mechanism and of auxiliary repair-relevant information. Specifically, we adopt six widely studied LLMs and construct a benchmark on both the Defects4J 1.2 and 2.0 datasets. Our study demonstrates that LLMs with zero-shot learning are already powerful function-level APR techniques, while applying few-shot learning leads to disparate repair performance. Moreover, we find that directly supplying auxiliary repair-relevant information to LLMs significantly increases function-level repair performance. Inspired by these findings, we propose an LLM-based function-level APR technique, named SRepair, which adopts a dual-LLM framework to leverage auxiliary repair-relevant information and advance repair performance. The evaluation results demonstrate that SRepair correctly fixes 300 single-function bugs in the Defects4J dataset, surpassing all previous APR techniques by at least 85%, without requiring costly statement-level fault localization information. Furthermore, SRepair successfully fixes 32 multi-function bugs in the Defects4J dataset, which, to the best of our knowledge, has not been achieved by any previous APR technique.
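
To make the dual-LLM idea concrete, below is a minimal sketch of a function-level repair pipeline in the spirit of the abstract: a first model turns auxiliary repair-relevant information (failing tests, issue reports) into a natural-language repair suggestion, and a second model regenerates the entire buggy function guided by that suggestion, so no statement-level fault localization is required. All names here (BuggyFunction, ask_llm, the prompts and model identifiers) are illustrative placeholders rather than the paper's actual implementation.

```python
# Sketch of a dual-LLM, function-level repair flow: a "suggestion" LLM digests
# auxiliary repair-relevant information, and a "patch" LLM rewrites the whole
# buggy function. Hypothetical helper names; not SRepair's real API.

from dataclasses import dataclass
from typing import List


@dataclass
class BuggyFunction:
    body: str            # full source of the buggy function
    failing_tests: str   # failing test names and error messages (auxiliary info)
    issue_report: str    # bug-report text, if available (auxiliary info)


def ask_llm(model: str, prompt: str, n: int = 1) -> List[str]:
    """Placeholder for a chat-completion call; returns n sampled completions."""
    raise NotImplementedError("wire this to your LLM provider of choice")


def repair(bug: BuggyFunction, n_patches: int = 10) -> List[str]:
    # Step 1: the suggestion LLM summarizes the likely root cause and a fix plan
    # from the auxiliary repair-relevant information.
    suggestion_prompt = (
        "You are debugging a Java function.\n"
        f"Function:\n{bug.body}\n\n"
        f"Failing tests:\n{bug.failing_tests}\n\n"
        f"Issue report:\n{bug.issue_report}\n\n"
        "Explain the likely root cause and outline how to fix it."
    )
    suggestion = ask_llm("suggestion-model", suggestion_prompt, n=1)[0]

    # Step 2: the patch LLM regenerates the entire function (function-level repair),
    # so only function-level fault localization is needed.
    patch_prompt = (
        "Rewrite the following buggy Java function so that it is correct.\n"
        f"Buggy function:\n{bug.body}\n\n"
        f"Repair suggestion:\n{suggestion}\n\n"
        "Return only the complete fixed function."
    )
    return ask_llm("patch-model", patch_prompt, n=n_patches)
```

In the paper's actual setup, candidate patches are further validated against the Defects4J test suites; the sketch only illustrates the two-stage prompting structure under the stated assumptions.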

Authors (7)
  1. Jiahong Xiang (2 papers)
  2. Xiaoyang Xu (8 papers)
  3. Fanchu Kong (1 paper)
  4. Mingyuan Wu (11 papers)
  5. Haotian Zhang (107 papers)
  6. Yuqun Zhang (13 papers)
  7. Zizheng Zhang (3 papers)
Citations (3)