Automated Bug Generation in the era of Large Language Models (2310.02407v2)
Abstract: Bugs are central to software engineering; over the past decades, many techniques have been proposed to detect, localize, and repair bugs in software systems. Evaluating the effectiveness of such techniques requires complex bugs, i.e., those that are hard to detect through testing and hard to repair through debugging. From the classic software engineering point of view, a hard-to-repair bug differs from the correct code in multiple locations, making it hard to localize and repair. Hard-to-detect bugs, on the other hand, manifest themselves only under specific test inputs and reachability conditions. These two objectives, i.e., generating hard-to-detect and hard-to-repair bugs, are mostly aligned; a bug generation technique can change multiple statements that are covered only under a specific set of inputs. However, the two objectives conflict for learning-based techniques: to challenge a bug prediction model, a bug should have a code representation similar to that of the correct code in the training data, so that the model struggles to distinguish them. The hard-to-repair definition remains the same, but with a caveat: the more a bug differs from the original code, the more distant their representations are and the easier the bug is to detect. We propose BugFarm, a technique that transforms arbitrary code into multiple complex bugs. BugFarm leverages LLMs to mutate code in multiple locations (hard-to-repair). To ensure that multiple modifications do not notably change the code representation, BugFarm analyzes the attention of the underlying model and instructs LLMs to change only the least attended locations (hard-to-detect). Our comprehensive evaluation of 435k+ bugs from over 1.9M mutants generated by BugFarm and two alternative approaches demonstrates that BugFarm is superior at generating bugs that are hard to detect by learning-based bug prediction approaches and hard to repair by a state-of-the-art learning-based program repair technique.
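The attention-guided selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the BugFarm implementation: the function name `least_attended_statements`, the input format (a square token-to-token attention matrix, assumed to be already averaged over heads and layers), and the choice of mean received attention per statement as the score are all assumptions made for illustration. The abstract does not specify how attention is aggregated.

```python
from collections import defaultdict


def least_attended_statements(attention, token_to_stmt, k):
    """Return the ids of the k statements that receive the least mean attention.

    attention[i][j] is the weight token i places on token j (hypothetical
    input, assumed pre-averaged over heads and layers). token_to_stmt[j] is
    the id of the statement that token j belongs to.
    """
    n = len(attention)
    # Attention received by token j: sum of column j over all query tokens.
    received = [sum(attention[i][j] for i in range(n)) for j in range(n)]

    totals = defaultdict(float)
    counts = defaultdict(int)
    for j, stmt in enumerate(token_to_stmt):
        totals[stmt] += received[j]
        counts[stmt] += 1

    # Assumed scoring rule: mean received attention over a statement's tokens.
    mean_score = {s: totals[s] / counts[s] for s in totals}
    # Ascending by score; ties are broken by statement id for determinism.
    ranked = sorted(mean_score, key=lambda s: (mean_score[s], s))
    return ranked[:k]


# Toy example: 4 tokens spread over 3 statements.
toy_attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.6, 0.1, 0.1, 0.2],
]
toy_token_to_stmt = [0, 1, 1, 2]
candidates = least_attended_statements(toy_attention, toy_token_to_stmt, k=2)
```

In this toy input, statement 0 receives most of the attention, so statements 2 and 1 are returned as the candidate locations that an LLM would then be instructed to mutate.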
- Ali Reza Ibrahimzada
- Yang Chen
- Ryan Rong
- Reyhaneh Jabbarvand