PatchZero: Zero-Shot Automatic Patch Correctness Assessment (2303.00202v3)

Published 1 Mar 2023 in cs.SE

Abstract: Automated Program Repair (APR) techniques have shown increasingly promising results in fixing real-world bugs. Despite their effectiveness, APR techniques still face an overfitting problem: a generated patch can be incorrect even though it passes all tests. Manually evaluating the correctness of generated patches that pass all tests is time-consuming. To address this problem, many approaches have been proposed to automatically assess the correctness of patches generated by APR techniques. These approaches are mainly evaluated in a cross-validation setting. However, for patches generated by a new or unseen APR tool, the cross-validation setting implicitly requires users to manually label a significant portion of these patches before the labels of the remaining patches can be inferred. To mitigate this issue, we propose PatchZero, a patch correctness assessment approach that adopts a large language model (LLM) for code. Specifically, for patches generated by a new or unseen APR tool, PatchZero does not need labeled patches from that tool for training; instead, it directly queries the LLM for code to predict correctness labels. In this way, PatchZero reduces the manual labeling effort needed to build a model that automatically assesses the correctness of patches generated by new APR tools. PatchZero prioritizes labeled patches from existing APR tools that are semantically similar to those generated by the new APR tool, which improves its accuracy on patches from new APR tools. Our experimental results show that PatchZero achieves an average accuracy of 84.4% and an average F1-score of 86.5% even though no labeled patch from the new or unseen APR tool is available. In addition, our proposed technique outperforms the prior state of the art by a large margin.
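
The idea of prioritizing labeled patches that are similar to a new tool's patch can be illustrated with a small sketch. This is not the paper's implementation: the bag-of-tokens `embed` function stands in for whatever learned embedding is actually used, the label strings `"correct"` and `"overfitting"`, the prompt layout, and all function names are assumptions made for illustration, and the final LLM query is left out entirely.

```python
import math
import re
from collections import Counter


def embed(patch: str) -> Counter:
    """Toy stand-in for a learned patch embedding: token counts (assumption)."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\S", patch))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_demonstrations(labeled, query, k=2):
    """Return the k labeled patches most similar to the query patch."""
    q = embed(query)
    ranked = sorted(labeled, key=lambda item: cosine(embed(item[0]), q), reverse=True)
    return ranked[:k]


def build_prompt(demos, query):
    """Lay out the selected patches as labeled examples, then the unlabeled query."""
    parts = [f"Patch:\n{patch}\nLabel: {label}\n" for patch, label in demos]
    parts.append(f"Patch:\n{query}\nLabel:")
    return "\n".join(parts)


# Hypothetical labeled patches from existing APR tools.
labeled = [
    ("- if (x > 0)\n+ if (x >= 0)", "correct"),
    ("- return a;\n+ return null;", "overfitting"),
    ("- if (y > 1)\n+ if (y >= 1)", "correct"),
]
# Hypothetical patch from a new, unseen APR tool (no labels from this tool).
query = "- if (z > 0)\n+ if (z >= 0)"

demos = select_demonstrations(labeled, query, k=2)
prompt = build_prompt(demos, query)
print(prompt)
```

In this sketch the two boundary-check patches rank above the unrelated `return null` patch, so they become the in-context examples. The resulting prompt would then be sent to an LLM for code, whose completion after the final `Label:` is read as the predicted correctness label.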
