Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions (2108.09293v3)

Published 20 Aug 2021 in cs.CR and cs.AI

Abstract: There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described 'AI pair programmer', GitHub Copilot, an LLM trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the LLM will have learned from exploitable, buggy code. This raises concerns about the security of Copilot's code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE's "Top 25" list). We explore Copilot's performance on three distinct code generation axes -- examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,689 programs. Of these, we found approximately 40% to be vulnerable.

Security Analysis of AI-Generated Code: A Study on GitHub Copilot

The paper "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions" investigates the potential security vulnerabilities in code suggestions generated by GitHub Copilot, an AI-based tool for code generation. This work systematically evaluates how Copilot's ML model might suggest insecure code by analyzing its behavior across various scenarios derived from known cybersecurity weaknesses.

Methodology

The authors design a comprehensive and methodical approach to evaluating the security of Copilot's generated code. They focus on a subset of vulnerabilities from the Common Weakness Enumeration (CWE) list, using the "2021 CWE Top 25 Most Dangerous Software Weaknesses" as the foundation for testing. They create 89 scenarios corresponding to different CWEs and examine Copilot's output via both automated and manual analyses. Each scenario prompts Copilot to complete a partial program in C, Python, or Verilog, and the security of the generated code is then assessed using tools like GitHub's CodeQL.
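To make the setup concrete, here is a minimal sketch of what such a scenario prompt could look like; the route, function name, and schema are illustrative assumptions rather than the paper's actual scenario files:

```python
# Hypothetical CWE-89 (SQL injection) scenario in the spirit of the paper's
# Python prompts; names, route, and database schema are illustrative.
import sqlite3

from flask import Flask, request

app = Flask(__name__)

@app.route("/unsubscribe")
def unsubscribe():
    """Remove the email address supplied in the request from the subscribers table."""
    email = request.args.get("email")
    conn = sqlite3.connect("newsletter.db")
    # <-- completion point: the body generated here is what gets scored
    #     as vulnerable or non-vulnerable with respect to CWE-89
```

Each completed program is then checked, with CodeQL where a query exists for the scenario's CWE, and by manual inspection otherwise.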

Key Findings

The paper finds that a significant portion of code generated by Copilot is potentially insecure: approximately 40% of suggestions across the 1,689 analyzed programs were found to be vulnerable. Certain vulnerability classes, such as SQL injection (CWE-89) and OS command injection (CWE-78), are especially prevalent in Copilot's outputs. A pivotal observation is that the security of Copilot's suggestions can swing on slight changes to the prompt; minor rewording can determine whether the top-ranked suggestion is vulnerable.
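To illustrate what separates a vulnerable completion from a safe one in these CWE classes, the gap is often a single line. The sketch below shows the CWE-89 pattern in Python; it is illustrative, not a verbatim Copilot output:

```python
import sqlite3

def delete_subscriber_unsafe(conn: sqlite3.Connection, email: str) -> None:
    # Vulnerable (CWE-89): the user-supplied value is spliced into the SQL
    # string, so an input like "x' OR '1'='1" rewrites the query's logic.
    conn.execute(f"DELETE FROM subscribers WHERE email = '{email}'")

def delete_subscriber_safe(conn: sqlite3.Connection, email: str) -> None:
    # Secure: a parameterized query keeps the attacker-controlled value as data.
    conn.execute("DELETE FROM subscribers WHERE email = ?", (email,))
```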

Interestingly, the language-specific analyses show clear variability: Copilot performs reasonably well in Python and C, but struggles with Verilog, a hardware description language that is less common in its training data, producing a higher rate of syntactically or semantically incorrect code.

Implications

The implications of this research are multifaceted. Practically, it is a caution for developers using AI-based coding assistants: vigilance is essential, since generated code may carry security flaws. Theoretically, it underscores a challenge in AI training data: without high-quality, secure code samples in the training corpus, AI tools risk perpetuating the poor coding practices and vulnerabilities ingrained in legacy codebases.

The paper also indirectly advocates for advancing AI models by equipping them with better contextual understanding and security awareness during both training and operation. For AI coding assistants to mature, they will need vulnerability-aware mechanisms that flag or mitigate insecure coding patterns.

Future Perspectives

This paper opens avenues for further research in AI code generation, particularly in improving training datasets and developing models that can detect or correct potentially insecure patterns in real time. Future work could integrate security tools directly into AI models, or promote community-driven efforts that prioritize secure coding practices in open-source contributions.

Given the rise of AI in automating and assisting software development, it is imperative that these tools enhance human capability without compromising security. Ongoing evaluations like the one this paper presents are essential for monitoring the security of AI-generated code as these tools evolve.

Authors (5)
  1. Hammond Pearce (35 papers)
  2. Baleegh Ahmad (9 papers)
  3. Benjamin Tan (42 papers)
  4. Brendan Dolan-Gavitt (24 papers)
  5. Ramesh Karri (92 papers)
Citations (318)