
Comparing Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written (2411.10565v1)

Published 15 Nov 2024 in cs.SE

Abstract: Thanks to the widespread adoption of LLMs in software engineering research, the long-standing dream of automated code generation has become a reality on a large scale. Nowadays, LLMs such as GitHub Copilot and ChatGPT are extensively used in code generation for enterprise and open-source software development and maintenance. Despite their unprecedented successes in code generation, research indicates that code generated by LLMs exhibits vulnerabilities and security issues. Several studies have evaluated code generated by LLMs, considering aspects such as security, vulnerability, code smells, and robustness. While some studies have compared the performance of LLMs with that of humans in various software engineering tasks, a notable gap remains: no studies have directly compared the robustness of human-written and LLM-generated code. To fill this void, this paper introduces an empirical study to evaluate the adversarial robustness of Pre-trained Models of Code (PTMCs) fine-tuned on code written by humans and generated by LLMs against adversarial attacks for software clone detection. These attacks could potentially undermine software security and reliability. We consider two datasets, two state-of-the-art PTMCs, two robustness evaluation criteria, and three metrics in our experiments. Regarding effectiveness criteria, PTMCs fine-tuned on human-written code always demonstrate greater robustness than those fine-tuned on LLM-generated code. In terms of adversarial code quality, in 75% of the experimental combinations, PTMCs fine-tuned on human-written code exhibit more robustness than PTMCs fine-tuned on LLM-generated code.

Authors (3)
  1. Mrigank Rochan (20 papers)
  2. Chanchal K. Roy (55 papers)
  3. Md Abdul Awal (4 papers)

Summary

Comparison of Robustness Against Adversarial Attacks in Code Generation: LLM-Generated vs. Human-Written

The paper in question addresses a critical aspect of automated code generation, focusing on the robustness of LLM-generated code compared to human-written code against adversarial attacks. As the software engineering community increasingly embraces AI-driven code generation, assessing the resilience of these techniques against vulnerability exploitation becomes imperative. This paper fills a notable research gap by directly comparing the adversarial robustness of Pre-trained Models of Code (PTMCs) fine-tuned on code written by humans and on code generated by LLMs such as ChatGPT.

Methodological Approach

The researchers provide a comprehensive evaluation framework, employing two datasets: SemanticCloneBench, representative of human-written code, and GPTCloneBench, which extends SemanticCloneBench with GPT-3-generated equivalents. They fine-tune two prominent PTMCs, CodeBERT and CodeGPT, on these datasets. Empirical analysis is then conducted using four state-of-the-art black-box adversarial attack techniques: ALERT, WIR-Random, MHM, and StyleTransfer, all of which are widely used in the literature to test model robustness in software analytics tasks.
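
To make the setup concrete, here is a minimal sketch of how a PTMC such as CodeBERT can be fine-tuned as a binary clone detector on code pairs. The dataset construction, label convention, and hyperparameters below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: fine-tuning CodeBERT as a binary clone detector on code pairs
# (e.g., drawn from SemanticCloneBench or GPTCloneBench). Hyperparameters and
# data handling are illustrative, not the paper's exact setup.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = non-clone, 1 = clone

class ClonePairDataset(torch.utils.data.Dataset):
    """Encodes (code_a, code_b) pairs as a single sequence pair for the model."""
    def __init__(self, pairs, labels):
        self.pairs, self.labels = pairs, labels
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        enc = tokenizer(*self.pairs[idx], truncation=True, max_length=512,
                        padding="max_length", return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy stand-in for a real clone benchmark.
train_pairs = [("def add(a, b): return a + b", "def total(x, y): return x + y")]
train_labels = [1]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ptmc-clone-detector",
                           num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=ClonePairDataset(train_pairs, train_labels),
)
trainer.train()
```

A black-box attack in the spirit of WIR-Random would then repeatedly rename identifiers in one fragment of a pair and query the fine-tuned model, declaring success once the clone/non-clone prediction flips.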

Key Findings

The results of this empirical study provide a detailed comparative analysis using criteria based on attack effectiveness and adversarial code quality. The key findings suggest:

  1. Effectiveness Against Adversarial Attacks: The Attack Success Rate (ASR) indicates that PTMCs fine-tuned on human-written code consistently demonstrate higher robustness across different adversarial scenarios, as evidenced by significantly lower ASR scores.
  2. Adversarial Code Quality: When evaluating the quality of adversarial examples using Average Code Similarity (ACS) and Average Edit Distance (AED), PTMCs fine-tuned on human-written code show more substantial resistance to perturbations. Both metrics indicate higher robustness for models fine-tuned on human-written code in 75% of the experimental combinations (see the metric sketch after this list).
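
The sketch below illustrates how the three metrics named above could be computed, assuming a generic string-similarity measure and token-level edit distance; the paper's exact definitions may differ.

```python
# Illustrative computation of ASR, ACS, and AED over a set of attack results.
# The similarity and edit-distance definitions here are assumptions, not the paper's.
import difflib

def attack_success_rate(n_successful: int, n_correct_before_attack: int) -> float:
    # ASR: fraction of originally correct predictions flipped by the attack.
    return n_successful / n_correct_before_attack

def code_similarity(original: str, adversarial: str) -> float:
    # Similarity in [0, 1] between original and adversarial code (difflib ratio).
    return difflib.SequenceMatcher(None, original, adversarial).ratio()

def edit_distance(original: str, adversarial: str) -> int:
    # Token-level Levenshtein distance between original and adversarial code.
    a, b = original.split(), adversarial.split()
    dp = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tok_b in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (tok_a != tok_b))
    return dp[len(b)]

# ACS and AED average the per-example values over all successful adversarial examples.
examples = [("def add(a, b): return a + b", "def add(x, b): return x + b")]
acs = sum(code_similarity(o, adv) for o, adv in examples) / len(examples)
aed = sum(edit_distance(o, adv) for o, adv in examples) / len(examples)
print(attack_success_rate(1, 2), acs, aed)
```

A lower ASR indicates a more robust model, while ACS and AED characterize how much the attacker had to perturb the code to succeed: larger required changes (lower similarity, higher edit distance) point to a harder target.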

Overall, the evaluation criteria firmly establish that human-written code provides more formidable training grounds for robustness against adversarial attacks than code generated by LLMs.

Implications and Future Work

From a theoretical perspective, these empirical findings prompt reconsideration of the role that AI-generated code can currently play in the broader landscape of software engineering. While LLMs provide remarkable efficiency, there is a pressing need to strengthen the security practices surrounding AI-generated code.

On a practical scale, software engineering practices may need to employ hybrid approaches, judiciously leveraging human oversight and input alongside AI-generated suggestions to strike an optimal balance between automation and security. This dual approach seems particularly promising in environments where security is paramount.

For future research avenues, bridging the identified gap in robustness between human-generated and LLM-generated code could involve enhancing the LLMs' training processes, integrating advanced adversarial training techniques, or redefining the interactive paradigms between human experts and LLMs for code validation. Moreover, extending such evaluations across more diverse tasks within software engineering could ascertain whether these findings hold in broader contexts or specific niche applications.

The contributions of this paper critically advance our understanding of the resilience of AI-generated code, motivating enhanced methodologies that integrate AI's remarkable capabilities with human expertise to create robust, secure software solutions.