Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models (2312.04724v1)
Abstract: This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of LLMs employed as coding assistants. As what we believe to be the most extensive unified cybersecurity safety benchmark to date, CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their level of compliance when asked to assist in cyberattacks. Through a case study involving seven models from the Llama 2, Code Llama, and OpenAI GPT LLM families, CyberSecEval effectively pinpointed key cybersecurity risks. More importantly, it offered practical insights for refining these models. A significant observation from the study was the tendency of more advanced models to suggest insecure code, highlighting the critical need for integrating security considerations in the development of sophisticated LLMs. CyberSecEval, with its automated test case generation and evaluation pipeline covers a broad scope and equips LLM designers and researchers with a tool to broadly measure and enhance the cybersecurity safety properties of LLMs, contributing to the development of more secure AI systems.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Thomas Dohmke. Github copilot for business is now available, Feb 2023. URL https://github.blog/2023-02-14-github-copilot-for-business-is-now-available/.
- Systematically finding security vulnerabilities in black-box code generation models. arXiv preprint arXiv:2302.04012, 2023.
- How secure is code generated by chatgpt? arXiv preprint arXiv:2304.09655, 2023.
- Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- Corporation MITRE. Common weakness enumeration: A community-developed list of software & hardware weakness types. https://cwe.mitre.org, 2023a. Online; accessed 4 December 2023.
- Corporation MITRE. Mitre att&ck®. https://attack.mitre.org, 2023b. Online; accessed 4 December 2023.
- Codecompose: A large-scale industrial deployment of ai-assisted code authoring. arXiv preprint arXiv:2305.12050, 2023.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- Asleep at the keyboard? assessing the security of github copilot’s code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pages 754–768. IEEE, 2022.
- Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Lost at c: A user study on the security implications of large language model code assistants. arXiv preprint arXiv:2208.09727, 2023.
- Inc. Semgrep. semgrep. https://semgrep.dev/, 2023. Online; accessed 5 December 2023.
- Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques. In Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, pages 29–33, 2022.
- An empirical study of code smells in transformer-based code generation techniques. In 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 71–82, 2022. 10.1109/SCAM55253.2022.00014.
- The formai dataset: Generative ai in software security through the lens of formal verification. arXiv preprint arXiv:2307.02192, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- weggli. https://github.com/weggli-rs/weggli, 2023. Online; accessed 5 December 2023.
- Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt. arXiv preprint arXiv:2304.10778, 2023.
- Li Zhong and Zilong Wang. A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.10335, 2023.
- Manish Bhatt (10 papers)
- Sahana Chennabasappa (6 papers)
- Cyrus Nikolaidis (5 papers)
- Shengye Wan (6 papers)
- Ivan Evtimov (24 papers)
- Dominik Gabi (2 papers)
- Daniel Song (6 papers)
- Faizan Ahmad (4 papers)
- Cornelius Aschermann (4 papers)
- Lorenzo Fontana (1 paper)
- Sasha Frolov (1 paper)
- Ravi Prakash Giri (1 paper)
- Dhaval Kapil (3 papers)
- Yiannis Kozyrakis (1 paper)
- David LeBlanc (1 paper)
- James Milazzo (1 paper)
- Aleksandar Straumann (1 paper)
- Gabriel Synnaeve (97 papers)
- Varun Vontimitta (2 papers)
- Spencer Whitman (5 papers)