Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval (2407.02395v2)
Abstract: Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, because they are trained on unsanitized data from open-source repositories such as GitHub, they risk inadvertently propagating security vulnerabilities. Despite numerous studies on the safety of code LLMs, a comprehensive treatment of their security properties is still missing. In this work, we present a comprehensive study that precisely evaluates and enhances the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset covering 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for automatically evaluating code models on two crucial tasks, code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues in both tasks and consequently produce vulnerable code. In response, we propose strategies that leverage vulnerability-aware information and explanations of insecure code to mitigate these vulnerabilities. Furthermore, our findings show that certain vulnerability types are particularly challenging for the models, limiting their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring improved methods for training and using LLMs and thereby leading to safer and more trustworthy model deployment.
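To make the kind of security-focused evaluation described above more concrete, the sketch below contrasts an insecure completion a code LLM might emit with a hardened variant, and shows how vulnerability-aware context could be attached to a repair prompt. This is a minimal illustration only: the function names, the CWE-89 (SQL injection) choice, and the prompt template are assumptions for exposition, not the paper's actual dataset format or prompting method.

```python
import sqlite3

# Hypothetical example of a vulnerable completion a code LLM might produce:
# user input is interpolated directly into SQL (CWE-89, SQL injection).
def find_user_insecure(conn: sqlite3.Connection, username: str):
    query = f"SELECT id, email FROM users WHERE name = '{username}'"  # vulnerable
    return conn.execute(query).fetchall()

# A secure repair uses a parameterized query, so the input is never parsed as SQL.
def find_user_secure(conn: sqlite3.Connection, username: str):
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()

# Sketch of a vulnerability-aware repair prompt in the spirit of the strategies
# mentioned in the abstract; the exact template wording is an assumption.
def build_repair_prompt(insecure_code: str, cwe_id: str, explanation: str) -> str:
    return (
        f"The following code contains a {cwe_id} vulnerability.\n"
        f"Explanation: {explanation}\n"
        "Rewrite the code to remove the vulnerability while preserving its behavior.\n\n"
        f"{insecure_code}"
    )
```

A repair prompt built this way simply prepends the vulnerability type and an explanation of why the code is insecure, which is the general idea behind the vulnerability-aware mitigation strategies the abstract refers to.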
Authors: Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, Yi Cai