Large Language Models of Code Fail at Completing Code with Potential Bugs (2306.03438v2)

Published 6 Jun 2023 in cs.LG, cs.AI, cs.CL, and cs.SE

Abstract: LLMs of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing work ignores the possible presence of bugs in the code context used for generation, even though such bugs are inevitable in software development. We therefore introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion in which the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To study the task systematically, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono on test cases of buggy-HumanEval drop by more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that a significant gap remains in post-mitigation performance.
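
Illustrative sketch (not from the paper's released code or data): the example below shows what a buggy-HumanEval-style instance could look like under the setup described in the abstract. A toy completion problem is given, a single semantics-altering operator change is applied to the prompt prefix to create the potential bug, and a candidate completion is checked against the problem's test cases by execution, mirroring a pass-rate style evaluation. The problem, the operator flip, and the test cases are all hypothetical.

```python
# Minimal, self-contained sketch of a buggy-code completion instance
# (illustrative only; the problem and the operator change are made up).

# Correct prompt prefix for a toy HumanEval-style problem.
clean_prefix = '''def below_threshold(values, t):
    """Return True if every element of values is strictly below t."""
    for v in values:
        if v >= t:
'''

# Synthetic potential bug: a single semantics-altering operator change
# in the prefix (">=" flipped to "<=").
buggy_prefix = clean_prefix.replace("v >= t", "v <= t")

# A completion that is correct with respect to the clean prefix.
completion = '''            return False
    return True
'''

def passes_tests(program, tests):
    """Execution-based check: run the completed program on the test cases."""
    env = {}
    try:
        exec(program, env)
        return all(env["below_threshold"](vals, t) == expected
                   for vals, t, expected in tests)
    except Exception:
        return False

tests = [([1, 2, 3], 5, True), ([1, 8, 3], 5, False), ([], 0, True)]

print("clean prefix ->", passes_tests(clean_prefix + completion, tests))  # True
print("buggy prefix ->", passes_tests(buggy_prefix + completion, tests))  # False
```

As the last two lines suggest, a completion that is valid for the clean context can fail the tests once the context contains a single potential bug, which is the effect the buggy-HumanEval and buggy-FixEval evaluations quantify for Code-LLMs.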

Authors (7)
  1. Tuan Dinh (10 papers)
  2. Jinman Zhao (20 papers)
  3. Samson Tan (21 papers)
  4. Renato Negrinho (8 papers)
  5. Leonard Lausen (12 papers)
  6. Sheng Zha (25 papers)
  7. George Karypis (110 papers)
Citations (18)