Ocassionally Secure: A Comparative Analysis of Code Generation Assistants (2402.00689v1)

Published 1 Feb 2024 in cs.CR and cs.AI

Abstract: LLMs are being increasingly utilized in various applications, with code generation being a notable example. While previous research has shown that LLMs can generate both secure and insecure code, the literature does not take into account which factors help produce secure and effective code. Therefore, in this paper we focus on identifying and understanding the conditions and contexts in which LLMs can be effectively and safely deployed in real-world scenarios to generate quality code. We conducted a comparative analysis of four advanced LLMs (GPT-3.5 and GPT-4 via ChatGPT, and Bard and Gemini from Google) across 9 separate tasks to assess each model's code generation capabilities. We contextualized our study to represent the typical use cases of a real-life developer employing LLMs for everyday tasks at work. Additionally, we placed an emphasis on security awareness, represented through the use of two distinct versions of our developer persona. In total, we collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability. These insights are crucial for understanding the models' capabilities and limitations, guiding future development and practical applications in the field of automated code generation.
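One of the aspects the abstract lists, complexity, is typically measured as McCabe cyclomatic complexity. As a rough illustration (not the paper's actual tooling), the sketch below approximates that metric for a generated Python snippet by counting branching nodes in its AST; the `cyclomatic_complexity` helper and the `BRANCH_NODES` set are hypothetical names introduced here for the example.

```python
import ast

# Hypothetical approximation (not from the paper): McCabe cyclomatic
# complexity is 1 plus the number of decision points in the code.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity of a Python source string."""
    tree = ast.parse(source)
    # Start at 1 (one straight-line path), add 1 per branching construct.
    return 1 + sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))

snippet = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""
print(cyclomatic_complexity(snippet))  # two `if` branches -> 3
```

A real evaluation would use a dedicated tool (e.g. a linter's complexity checker) rather than this simplified node count, which undercounts chained boolean operators, but it conveys what the complexity dimension of the analysis measures.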

Authors (10)
  1. Ran Elgedawy (3 papers)
  2. John Sadik (3 papers)
  3. Senjuti Dutta (8 papers)
  4. Anuj Gautam (8 papers)
  5. Konstantinos Georgiou (35 papers)
  6. Farzin Gholamrezae (1 paper)
  7. Fujiao Ji (2 papers)
  8. Kyungchan Lim (2 papers)
  9. Qian Liu (252 papers)
  10. Scott Ruoti (17 papers)
Citations (6)