
I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution (2501.08165v1)

Published 14 Jan 2025 in cs.SE and cs.AI

Abstract: Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using LLMs, which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution. We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks. Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering.

Summary

  • The paper shows that LLMs can achieve up to 0.78 MCC in zero-shot authorship attribution of source code.
  • It employs both zero-shot and few-shot in-context strategies, and observes a degree of robustness against adversarial misattribution attacks.
  • The study introduces a tournament-style approach to overcome input token limitations, attaining up to 65% accuracy on a C++ dataset and 68.7% on a Java dataset with one reference sample per author.

Leveraging LLMs for Code Authorship Attribution

The paper "I Can Find You in Seconds! Leveraging LLMs for Code Authorship Attribution" explores the utility of LLMs in the domain of code authorship attribution. This task, which determines the authorship of a given piece of source code, is of significant value in various applications spanning software forensics, plagiarism detection, and the safeguarding of software integrity. Traditional methods often rely on supervised machine learning techniques, which require extensive labeled datasets but struggle with generalizability across programming languages and coding styles. This research explores whether LLMs can bridge these gaps effectively.

Methodology and Experimentation

In an extensive empirical study, the authors assess the efficacy of state-of-the-art LLMs in code authorship attribution across multiple programming languages, evaluating prominent models including ChatGPT, Gemini, Mistral, and Llama. The methodology involves:

  • Zero-Shot Prompting: The LLMs are prompted, without any task-specific tuning, to determine whether two source code samples were authored by the same individual. Results indicate successful attribution, with Matthews Correlation Coefficient (MCC) scores reaching up to 0.78 (a sketch of this pairwise setup follows the list).
  • Few-Shot In-Context Learning: The authors also investigate attributing a sample to a known author from a small set of reference snippets supplied in the prompt. In this setting, LLMs achieved MCC scores up to 0.77.
  • Adversarial Robustness: LLMs demonstrated robustness against misattribution attacks, maintaining a degree of resilience not typically observed in traditional ML-based models.
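
For context, MCC summarizes the full confusion matrix of the binary same-author decision; it ranges from -1 to 1, with 0 corresponding to chance-level prediction:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

The following is a minimal sketch of what zero-shot pairwise verification could look like in practice. The prompt wording, model name, and yes/no parsing are illustrative assumptions, not the paper's verbatim setup:

```python
# Hypothetical zero-shot same-author check via an OpenAI-style chat API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def same_author(snippet_a: str, snippet_b: str) -> bool:
    prompt = (
        "You are an expert in code stylometry. Judging only by coding style "
        "(naming, formatting, idioms), were these two code snippets written "
        "by the same author? Answer only 'yes' or 'no'.\n\n"
        f"Snippet 1:\n{snippet_a}\n\nSnippet 2:\n{snippet_b}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```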

Despite these encouraging results, the authors observed that naive prompting does not scale to large author pools, since reference snippets for every candidate cannot fit within the model's input token limit. To alleviate this, they propose a tournament-style approach that divides the candidate authors into smaller groups and processes them iteratively, keeping each LLM query within the context window (a sketch follows below). Evaluated on GitHub datasets of 500 C++ authors (26,355 samples) and 686 Java authors (55,267 samples), this approach achieves classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference snippet per author.
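
The paper's exact grouping and advancement rules are not reproduced here; the sketch below shows one plausible reading of a tournament scheme, in which the LLM picks a best-matching author per group and winners advance until a single author remains. The `pick_best` callable is a hypothetical stand-in for a few-shot attribution prompt:

```python
# Sketch of a tournament-style attribution loop under the assumptions above.
from typing import Callable, Dict, List

def tournament_attribution(
    query_code: str,
    references: Dict[str, str],  # author -> one reference snippet
    pick_best: Callable[[str, Dict[str, str]], str],  # LLM call: winning author
    group_size: int = 5,
) -> str:
    """Attribute query_code by iteratively narrowing the candidate authors.

    Each round partitions the surviving authors into groups small enough to
    fit in one prompt; the LLM picks the most likely author per group, and
    the winners advance to the next round.
    """
    candidates: List[str] = list(references)
    while len(candidates) > 1:
        winners: List[str] = []
        for i in range(0, len(candidates), group_size):
            group = candidates[i : i + group_size]
            if len(group) == 1:
                winners.append(group[0])  # bye: advances unopposed
                continue
            refs = {author: references[author] for author in group}
            winners.append(pick_best(query_code, refs))
        candidates = winners
    return candidates[0]
```

Under this scheme, 500 authors with groups of five resolve in four rounds (100 + 20 + 4 + 1 = 125 group-level LLM calls per query), each call fitting comfortably in a single prompt.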

Implications and Future Directions

The findings of this paper have broad implications for both the theory and practice of applying LLMs in software engineering and cybersecurity. On the theoretical front, the research advances the understanding of LLMs' capacity for pattern recognition and for generalizing to code-specific tasks without being tied to a particular programming language's structure. Practically, it opens pathways for deploying LLMs in real-world scenarios where source code authorship verification is vital.

The paper suggests several avenues for future exploration. Model fine-tuning or more refined prompt engineering may bolster LLM performance in more varied authorship scenarios. Approaches to mitigating input token limitations, such as the proposed tournament method, can be refined for even greater scalability. Furthermore, extending the investigation to other programming languages, or to hybrid languages, can provide deeper insights into LLM generalization in multilingual contexts.

Overall, the paper projects that, with continuing advances in LLM architectures, future models may inherently possess stronger capabilities for code authorship tasks, making them important components of intelligent software engineering tooling.
