
Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry (2412.14611v1)

Published 19 Dec 2024 in cs.SE

Abstract: With the increasing popularity of LLM-based code completers, like GitHub Copilot, the interest in automatically detecting AI-generated code is also increasing, in particular in contexts where the use of LLMs to program is forbidden by policy due to security, intellectual property, or ethical concerns. We introduce a novel technique for AI code stylometry, i.e., the ability to distinguish code generated by LLMs from code written by humans, based on a transformer-based encoder classifier. Differently from previous work, our classifier is capable of detecting AI-written code across 10 different programming languages with a single machine learning model, maintaining high average accuracy across all languages (84.1% $\pm$ 3.8%). Together with the classifier we also release H-AIRosettaMP, a novel open dataset for AI code stylometry tasks, consisting of 121,247 code snippets in 10 popular programming languages, labeled as either human-written or AI-generated. The experimental pipeline (dataset, training code, resulting models) is the first fully reproducible one for the AI code stylometry task. Most notably our experiments rely only on open LLMs, rather than on proprietary/closed ones like ChatGPT.

Summary

  • The paper introduces a transformer-based multilingual classifier to detect AI-generated source code with 84.1% accuracy.
  • It presents the novel H-AIRosettaMP dataset comprising 121,247 code snippets from 10 programming languages for robust AI stylometry.
  • It enhances policy enforcement and authorship attribution by reliably distinguishing AI-written code in diverse programming environments.

Recognizing AI-written Code with Multilingual Code Stylometry

The paper "Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry" presents a methodology for identifying source code generated by LLMs, such as the models powering GitHub Copilot. This recognition task is becoming increasingly important in environments where the use of such AI tools is restricted due to concerns about security, intellectual property, or academic integrity.

The authors introduce a novel technique leveraging a transformer-based encoder classifier capable of detecting AI-generated source code across 10 different programming languages. The model's multilingual ability is a substantial improvement over previous approaches, which typically focus on a single programming language. The classifier maintains a high average accuracy of 84.1%, with a standard deviation of 3.8% across languages, demonstrating its robustness regardless of the programming language at hand.

A significant contribution of the paper is the creation and release of the H-AIRosettaMP dataset, specifically designed for AI code stylometry tasks. This dataset includes 121,247 code snippets labeled as either human-written or AI-generated, covering 10 widely-used programming languages. All AI-generated snippets in the dataset are produced using open LLMs, such as StarCoder2, ensuring the reproducibility of the experimental pipeline—a notable distinction from prior studies that relied on proprietary models like ChatGPT, which are not open for public use and scrutiny.
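To make the dataset framing concrete, the sketch below shows one plausible way such labeled records could be represented. The field names (`code`, `language`, `label`) are illustrative assumptions, not the actual H-AIRosettaMP schema.

```python
# Illustrative record layout for an AI-code-stylometry dataset.
# Field names are assumptions, not H-AIRosettaMP's actual schema.
from dataclasses import dataclass

@dataclass
class CodeSnippet:
    code: str       # the source code text
    language: str   # one of the 10 covered languages, e.g. "Python"
    label: str      # "human" or "ai"

snippets = [
    CodeSnippet("print('hello')", "Python", "human"),
    CodeSnippet('fn main() { println!("hello"); }', "Rust", "ai"),
]

# A balanced corpus (comparable counts per label and language) is what
# lets a single model learn cross-language stylometric cues.
per_label = {}
for s in snippets:
    per_label[s.label] = per_label.get(s.label, 0) + 1
print(per_label)  # → {'human': 1, 'ai': 1}
```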

The dataset construction starts from Rosetta Code, which supplies the human-written solutions across languages. The AI-generated counterparts are produced by using open LLMs to translate those solutions into other languages, yielding a balanced corpus of parallel human and AI samples. This balance ensures that the multilingual model learns to recognize AI-generated code irrespective of the source language, an important property in environments where different programming languages coexist.
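The translation step above can be sketched as prompt construction for an open LLM such as StarCoder2. The prompt template here is an illustrative assumption, not the paper's exact prompt; the model call itself is omitted.

```python
# Hedged sketch: building a code-translation prompt for an open LLM.
# The template is an assumption for illustration, not the paper's prompt.
def translation_prompt(source_code: str, src_lang: str, dst_lang: str) -> str:
    return (
        f"Translate the following {src_lang} program to {dst_lang}.\n"
        f"Return only the {dst_lang} code.\n\n"
        f"{source_code}\n"
    )

# A human-written Rosetta Code solution is the input; the LLM's
# translation into another language would be stored with the "ai" label.
prompt = translation_prompt("print('hello')", "Python", "Go")
print(prompt.splitlines()[0])  # → Translate the following Python program to Go.
```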

The classifier is trained with modern deep learning techniques, moving beyond earlier work that predominantly relied on classical machine learning models such as random forests or decision trees over hand-crafted features. Fine-tuning a pretrained transformer encoder for AI stylometry lets the model learn stylometric signals directly from code tokens, enabling it to discern AI-written code effectively.
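The binary classification framing can be illustrated with a deliberately simplified stand-in: a token-frequency perceptron instead of the paper's fine-tuned transformer encoder. The toy data and features below are assumptions purely for illustration, not anything from the paper.

```python
# Toy stand-in for the human-vs-AI code classification framing.
# A token-frequency perceptron replaces the paper's transformer encoder;
# the two training samples are invented for illustration.
import re
from collections import Counter

def tokenize(code: str) -> Counter:
    # Crude lexer: identifiers/numbers plus individual punctuation marks.
    return Counter(re.findall(r"\w+|[^\w\s]", code))

def train(samples, epochs=20, lr=0.1):
    # samples: list of (code, label) with label 1 = AI, 0 = human.
    w, b = Counter(), 0.0
    for _ in range(epochs):
        for code, y in samples:
            feats = tokenize(code)
            score = b + sum(w[t] * c for t, c in feats.items())
            pred = 1 if score > 0 else 0
            if pred != y:  # standard perceptron update on mistakes
                for t, c in feats.items():
                    w[t] += lr * (y - pred) * c
                b += lr * (y - pred)
    return w, b

def predict(w, b, code):
    feats = tokenize(code)
    return 1 if b + sum(w[t] * c for t, c in feats.items()) > 0 else 0

toy = [("x=x+1", 0), ("result_value = result_value + 1", 1)]
w, b = train(toy)
print(predict(w, b, "x=x+1"),
      predict(w, b, "result_value = result_value + 1"))  # → 0 1
```

In the real system, the perceptron and bag-of-tokens features would be replaced by a transformer encoder whose contextual token representations are fine-tuned end-to-end on the labeled corpus.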

The implications of this research are significant both practically and theoretically. Practically, the ability to automatically discern AI-generated code can inform policy enforcement in contexts where AI assistance is restricted, such as academic settings or proprietary enterprise software environments. Theoretically, it broadens our understanding of code stylometry, extending it into the domain of multilingual contexts and AI-detection tasks, thereby contributing to the broader field of computational authorship attribution.

Future research could delve into refining these models to increase precision and adapt them to different LLM architectures as they evolve. Additionally, exploring alternative methods for generating AI-labeled data could further enhance model robustness and applicability in increasingly complex software development landscapes. As the prevalence of LLMs continues to grow, the tools and techniques discussed in this paper will become increasingly relevant for both academic research and industry application, shaping the interactions between human coders and AI systems.