- The paper introduces a transformer-based multilingual classifier that detects AI-generated source code with 84.1% average accuracy.
- It presents the novel H-AIRosettaMP dataset, comprising 121,247 code snippets across 10 programming languages, for AI code stylometry.
- It supports policy enforcement and authorship attribution by distinguishing AI-written code across diverse programming languages and environments.
Recognizing AI-written Code with Multilingual Code Stylometry
The paper "Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry" presents an advanced methodology for identifying source code generated by LLMs, specifically those like GitHub Copilot. This recognition task is becoming increasingly critical in environments where the use of such AI tools is restricted due to concerns related to security, intellectual property, or academic integrity.
The authors introduce a transformer-based encoder classifier that detects AI-generated source code across 10 programming languages. This multilingual capability is a substantial improvement over previous approaches, which typically target a single language. The classifier achieves an average accuracy of 84.1% with a standard deviation of 3.8% across languages, demonstrating robustness to the choice of programming language.
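To make this concrete, below is a minimal sketch of such a binary encoder classifier in Python. The `microsoft/codebert-base` checkpoint, the label mapping, and the truncation length are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a transformer-encoder binary classifier for code.
# The checkpoint and label mapping are illustrative assumptions; the
# classification head is randomly initialized until fine-tuned.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,  # assume 0 = human-written, 1 = AI-generated
)

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("AI-generated" if logits.argmax(dim=-1).item() == 1 else "human-written")
```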
A significant contribution of the paper is the creation and release of the H-AIRosettaMP dataset, designed specifically for AI code stylometry tasks. The dataset contains 121,247 code snippets, each labeled as human-written or AI-generated, covering 10 widely used programming languages. All AI-generated snippets are produced with open LLMs such as StarCoder2, making the experimental pipeline reproducible; this distinguishes the work from prior studies that relied on proprietary models like ChatGPT, whose weights and training pipelines are not open to public scrutiny.
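For illustration, a single record in such a corpus might look like the following; the field names here are hypothetical, not the published H-AIRosettaMP schema.

```python
# Hypothetical shape of one labeled record; field names are assumptions,
# not the published H-AIRosettaMP schema.
record = {
    "code": 'fn main() { println!("Hello, world!"); }',
    "language": "Rust",         # one of the 10 covered languages
    "label": "ai",              # "human" (Rosetta Code) or "ai" (open-LLM output)
    "generator": "StarCoder2",  # set only for AI-generated snippets
}
```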
The dataset construction starts from Rosetta Code as the source of human-written solutions in the various languages. For the AI-generated half, open LLMs translate existing solutions from one programming language to another, producing AI-written snippets that implement the same tasks and keeping the dataset balanced between the two labels. This balance ensures the multilingual model learns to recognize AI-generated code regardless of the source language, which matters in settings where multiple programming languages coexist.
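A rough sketch of this translation-based generation step is shown below, assuming an open code LLM (`bigcode/starcoder2-3b` here) and an invented prompt format; the paper's actual prompts and decoding settings may differ.

```python
# Sketch of generating an AI-labeled snippet by prompting an open code LLM
# to translate a human-written Rosetta Code solution into another language.
# Model ID and prompt format are assumptions, not the paper's exact pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder2-3b")

human_python = 'def greet():\n    print("Hello, world!")\n'
prompt = (
    "### Translate this Python program to Java.\n"
    f"### Python:\n{human_python}\n"
    "### Java:\n"
)
out = generator(prompt, max_new_tokens=128, do_sample=False)
ai_java = out[0]["generated_text"][len(prompt):]  # keep only the continuation
# ai_java would then be stored with label "ai" alongside the human original.
```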
The transformer-based encoder classifier is trained with modern deep learning techniques, moving beyond earlier work that predominantly relied on classical machine learning methods such as random forests or decision trees. Fine-tuning a pre-trained transformer encoder for AI code stylometry lets the model discern AI-written code from the stylometric features it learns over code tokens.
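A hedged sketch of what such fine-tuning could look like with the Hugging Face Trainer follows; the checkpoint, hyperparameters, and two-row toy dataset are stand-in assumptions for the authors' actual setup.

```python
# Fine-tuning sketch: a pre-trained code encoder with a binary head.
# Checkpoint, hyperparameters, and the toy data are assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)

# Toy stand-in for the labeled corpus: 0 = human-written, 1 = AI-generated.
toy = Dataset.from_dict({
    "code": ["def f(x): return x * 2", "int main() { return 0; }"],
    "label": [0, 1],
})
toy = toy.map(
    lambda row: tokenizer(row["code"], truncation=True,
                          padding="max_length", max_length=128)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ai-stylometry-clf",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=toy,
)
trainer.train()
```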
The implications of this research are significant both practically and theoretically. Practically, the ability to automatically discern AI-generated code can inform policy enforcement in contexts where AI assistance is restricted, such as academic settings or proprietary enterprise software environments. Theoretically, it broadens our understanding of code stylometry, extending it into the domain of multilingual contexts and AI-detection tasks, thereby contributing to the broader field of computational authorship attribution.
Future research could refine these models to improve precision and adapt them to new LLM architectures as they evolve. Exploring alternative methods for generating AI-labeled data could further improve robustness and applicability in increasingly complex software development settings. As LLM use continues to grow, the tools and techniques discussed in this paper will become increasingly relevant for both academic research and industry practice, shaping the interaction between human coders and AI systems.