Occasionally Secure: A Comparative Analysis of Code Generation Assistants (2402.00689v1)
Abstract: LLMs are increasingly used in a variety of applications, with code generation being a notable example. While previous research has shown that LLMs can generate both secure and insecure code, the literature does not account for the factors that help produce secure and effective code. Therefore, in this paper we focus on identifying and understanding the conditions and contexts in which LLMs can be effectively and safely deployed in real-world scenarios to generate quality code. We conducted a comparative analysis of four advanced LLMs (GPT-3.5 and GPT-4 via ChatGPT, and Bard and Gemini from Google) using nine separate tasks to assess each model's code generation capabilities. We contextualized our study to represent the typical use cases of a real-life developer employing LLMs for everyday tasks at work. Additionally, we placed an emphasis on security awareness, represented through two distinct versions of our developer persona. In total, we collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability. These insights are crucial for understanding the models' capabilities and limitations, guiding future development and practical applications in the field of automated code generation.
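As a minimal sketch (not the study's actual harness), the Python snippet below shows how a persona-conditioned generation-and-analysis loop of this kind might be assembled: two developer personas differing only in security awareness are used as system prompts, and each generated output is scored for security findings and cyclomatic complexity. The persona wording, the task prompt, and the choice of Bandit and radon as analyzers are illustrative assumptions, not the instruments used in the paper.

```python
# Illustrative sketch only: persona wording, task text, and the choice of
# Bandit/radon as analyzers are assumptions, not the paper's actual setup.
import json
import subprocess
import tempfile

from openai import OpenAI                 # pip install openai
from radon.complexity import cc_visit     # pip install radon

# Two hypothetical developer personas differing only in security awareness.
PERSONAS = {
    "naive": "You are a software developer writing code for everyday work tasks.",
    "security_aware": (
        "You are a security-conscious software developer. Always validate "
        "inputs and avoid known vulnerability patterns (e.g., OWASP Top Ten)."
    ),
}

def generate_code(persona: str, task: str, model: str = "gpt-4") -> str:
    """Ask the model for code while it role-plays the given persona."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": task},
        ],
    )
    # In practice the reply may wrap code in markdown fences; stripping
    # those is omitted here for brevity.
    return resp.choices[0].message.content

def security_findings(source: str) -> list[dict]:
    """Run Bandit (a Python security linter) on the generated source."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    out = subprocess.run(
        ["bandit", "-q", "-f", "json", path],  # pip install bandit
        capture_output=True, text=True,
    )
    return json.loads(out.stdout).get("results", [])

def max_cyclomatic_complexity(source: str) -> int:
    """Highest cyclomatic complexity among functions in the source."""
    return max((b.complexity for b in cc_visit(source)), default=0)

if __name__ == "__main__":
    task = "Write a Python function that stores a user's password in SQLite."
    for persona in PERSONAS:
        code = generate_code(persona, task)
        print(persona,
              "security findings:", len(security_findings(code)),
              "max complexity:", max_cyclomatic_complexity(code))
```

In a setup like this, comparing finding counts and complexity scores across the two personas would surface whether security-aware framing measurably changes the generated code, which is the kind of contrast the paper's two persona versions are designed to probe.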
Authors: Ran Elgedawy, John Sadik, Senjuti Dutta, Anuj Gautam, Konstantinos Georgiou, Farzin Gholamrezae, Fujiao Ji, Kyungchan Lim, Qian Liu, Scott Ruoti