Security Weaknesses of Copilot Generated Code in GitHub (2310.02059v2)
Abstract: Modern code generation tools, built on AI models such as LLMs, have gained popularity for producing functional code. However, their use presents security challenges, often resulting in insecure code being merged into the code base. Evaluating the quality of generated code, especially its security, is therefore crucial. While prior research has explored various aspects of code generation, its focus on security has been limited, mostly examining code produced in controlled environments rather than in real-world scenarios. To address this gap, we conducted an empirical study analyzing code snippets generated by GitHub Copilot and found in GitHub projects. Our analysis identified 452 snippets generated by Copilot and revealed a high likelihood of security issues, with 32.8% of Python and 24.5% of JavaScript snippets affected. These issues span 38 different Common Weakness Enumeration (CWE) categories, including significant ones such as CWE-330: Use of Insufficiently Random Values, CWE-78: OS Command Injection, and CWE-94: Improper Control of Generation of Code. Notably, eight of these CWEs appear in the 2023 CWE Top-25, highlighting their severity. Our findings confirm that developers should be careful when adding code generated by Copilot and should run appropriate security checks as they accept the suggested code. They also show that practitioners should cultivate the corresponding security awareness and skills.
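To make the most frequent weakness category concrete, here is a minimal, hypothetical Python sketch (not taken from the paper's dataset) of the pattern behind CWE-330: Use of Insufficiently Random Values, alongside a safer alternative using the standard-library `secrets` module:

```python
import random
import secrets

HEX_CHARS = "0123456789abcdef"

# Insecure pattern (CWE-330): the `random` module uses a Mersenne Twister
# PRNG, whose output is predictable once enough values are observed, so it
# is unsuitable for security-sensitive values such as session tokens.
def insecure_token(length: int = 16) -> str:
    return "".join(random.choice(HEX_CHARS) for _ in range(length))

# Safer alternative: `secrets` draws from the operating system's CSPRNG.
# token_hex(n) returns 2*n hexadecimal characters.
def secure_token(length: int = 16) -> str:
    return secrets.token_hex(length // 2)

print(len(insecure_token()), len(secure_token()))
```

Static analyzers such as CodeQL (used in the study) flag the first pattern when its output reaches a security-sensitive sink; the fix is typically a one-line switch to `secrets`.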
Authors: Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, Jiaxin Yu, Jinfu Chen