ContrastRepair: Enhancing Conversation-Based Automated Program Repair via Contrastive Test Case Pairs (2403.01971v2)

Published 4 Mar 2024 in cs.SE

Abstract: Automated Program Repair (APR) aims to automatically generate patches for rectifying software bugs. Recent strides in large language models (LLMs), such as ChatGPT, have yielded encouraging outcomes in APR, especially within the conversation-driven APR framework. Nevertheless, the efficacy of conversation-driven APR is contingent on the quality of the feedback information. In this paper, we propose ContrastRepair, a novel conversation-based APR approach that augments conversation-driven APR by providing LLMs with contrastive test pairs. A test pair consists of a failing test and a passing test, which together offer contrastive feedback to the LLM. Our key insight is to minimize the difference between the generated passing test and the given failing test, which better isolates the root causes of bugs. By providing informative and specific feedback, ContrastRepair enables the LLM to produce effective bug fixes. ContrastRepair is built on the state-of-the-art LLM ChatGPT and iteratively interacts with it until plausible patches are generated. We evaluate ContrastRepair on multiple benchmark datasets, including Defects4J, QuixBugs, and HumanEval-Java. The results demonstrate that ContrastRepair significantly outperforms existing methods, achieving a new state of the art in program repair. For instance, across Defects4J 1.2 and 2.0, ContrastRepair correctly repairs 143 out of 337 bugs, while the best-performing baseline fixes 124.
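To make the described workflow concrete, the sketch below shows one plausible way to organize the contrastive-pair selection and the conversational repair loop from the abstract. It is not the authors' implementation: the names `query_llm` and `run_tests` are hypothetical placeholders for a ChatGPT client and a test harness, and plain `difflib` textual similarity stands in for whatever difference measure the paper actually uses between the passing and failing tests.

```python
import difflib
from typing import Callable, Optional


def pick_contrastive_pair(failing_test: str, passing_tests: list[str]) -> tuple[str, str]:
    """Pair the failing test with the most similar passing test.

    The abstract's key idea is to minimize the difference between the
    passing and failing tests so the contrast isolates the bug's root
    cause; textual similarity is used here purely as an illustration.
    """
    closest = max(
        passing_tests,
        key=lambda t: difflib.SequenceMatcher(None, failing_test, t).ratio(),
    )
    return failing_test, closest


def build_prompt(buggy_code: str, failing_test: str, passing_test: str,
                 previous_patch: Optional[str] = None) -> str:
    """Compose the contrastive feedback given to the LLM in each round."""
    prompt = (
        "The following function is buggy:\n"
        f"{buggy_code}\n\n"
        f"This test FAILS:\n{failing_test}\n\n"
        f"This very similar test PASSES:\n{passing_test}\n\n"
        "Explain the difference and return a fixed version of the function."
    )
    if previous_patch is not None:
        prompt += f"\n\nYour previous patch was still incorrect:\n{previous_patch}"
    return prompt


def contrast_repair(buggy_code: str,
                    failing_test: str,
                    passing_tests: list[str],
                    query_llm: Callable[[str], str],    # placeholder: ChatGPT wrapper
                    run_tests: Callable[[str], bool],   # placeholder: True if suite passes
                    max_rounds: int = 5) -> Optional[str]:
    """Iteratively request patches until one is plausible or the budget runs out."""
    fail, passing = pick_contrastive_pair(failing_test, passing_tests)
    patch = None
    for _ in range(max_rounds):
        patch = query_llm(build_prompt(buggy_code, fail, passing, patch))
        if run_tests(patch):   # a "plausible" patch passes the whole test suite
            return patch
    return None
```

In practice, `query_llm` would wrap a ChatGPT API call and `run_tests` would compile and execute the project's test suite (e.g., a Defects4J bug's tests); both are assumptions made only to keep the sketch self-contained.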

Authors (6)
  1. Jiaolong Kong (1 paper)
  2. Mingfei Cheng (16 papers)
  3. Xiaofei Xie (104 papers)
  4. Shangqing Liu (28 papers)
  5. Xiaoning Du (27 papers)
  6. Qi Guo (237 papers)
Citations (8)