Distinguishing LLM-generated from Human-written Code by Contrastive Learning (2411.04704v1)

Published 7 Nov 2024 in cs.SE

Abstract: LLMs, such as ChatGPT released by OpenAI, have attracted significant attention from both industry and academia due to their demonstrated ability to generate high-quality content for various tasks. Despite the impressive capabilities of LLMs, there are growing concerns regarding their potential risks in various fields, such as news, education, and software engineering. Recently, several commercial and open-source LLM-generated content detectors have been proposed, which, however, are primarily designed for detecting natural language content without considering the specific characteristics of program code. This paper aims to fill this gap by proposing a novel ChatGPT-generated code detector, CodeGPTSensor, based on a contrastive learning framework and a semantic encoder built with UniXcoder. To assess the effectiveness of CodeGPTSensor on differentiating ChatGPT-generated code from human-written code, we first curate a large-scale Human and Machine comparison Corpus (HMCorp), which includes 550K pairs of human-written and ChatGPT-generated code (i.e., 288K Python code pairs and 222K Java code pairs). Based on the HMCorp dataset, our qualitative and quantitative analysis of the characteristics of ChatGPT-generated code reveals the challenge and opportunity of distinguishing ChatGPT-generated code from human-written code with their representative features. Our experimental results indicate that CodeGPTSensor can effectively identify ChatGPT-generated code, outperforming all selected baselines.

Insightful Overview of Distinguishing LLM-generated from Human-written Code by Contrastive Learning

The paper, "Distinguishing LLM-generated from Human-written Code by Contrastive Learning," presents a novel approach to detect code generated by LLMs, specifically ChatGPT, using a contrastive learning framework. As LLMs, exemplified by ChatGPT, improve in generating high-quality content across various domains, distinguishing between machine-generated and human-written code becomes critical. This differentiation can support more reliable software development practices and education systems by preventing the unsanctioned use of AI-generated resources.

Motivation and Dataset

Because AI-generated code can introduce security, quality, and ethical concerns into real-world software projects, detecting it before it reaches production environments is essential. Existing detectors address this need inadequately, focusing primarily on natural language content. To fill this gap, the authors introduce CodeGPTSensor, which combines contrastive learning with the UniXcoder model, pre-trained on programming data, to tease out the subtle differences between machine-generated and human-written code.

Central to their work is a carefully curated dataset, the Human and Machine comparison Corpus (HMCorp), which contains roughly 550,000 pairs of human-written and ChatGPT-generated functions (288K Python pairs and 222K Java pairs). This corpus serves as the foundation for both training and evaluating the model.
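As a rough illustration only, the sketch below shows how such a paired corpus could be loaded for training; the file format and field names ("human_code", "chatgpt_code") are hypothetical and do not reflect the released dataset's actual schema.

```python
import json

def load_hmcorp_pairs(path):
    """Yield (code, label) examples from a JSONL file of code pairs.

    Hypothetical schema: one JSON object per line with "human_code" and
    "chatgpt_code" fields; labels are 0 = human-written, 1 = ChatGPT-generated.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            pair = json.loads(line)
            yield pair["human_code"], 0
            yield pair["chatgpt_code"], 1
```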

Methodology and System Design

CodeGPTSensor employs a two-phase architecture: offline training and online inference. During training, UniXcoder encodes the syntactic and semantic features of each code snippet, and the encoder is fine-tuned with a contrastive learning objective that maximizes the representational distance between the two classes (LLM-generated and human-written), sharpening CodeGPTSensor's accuracy at inference time.
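A minimal sketch of this encoder-plus-classifier setup is shown below, assuming the publicly available microsoft/unixcoder-base checkpoint on Hugging Face; the authors' exact classification head, pooling strategy, and hyperparameters are not reproduced here.

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class CodeGPTSensorSketch(nn.Module):
    """UniXcoder encoder with a binary head: human-written vs. LLM-generated."""

    def __init__(self, checkpoint="microsoft/unixcoder-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the first-token hidden state as the snippet representation.
        snippet_repr = out.last_hidden_state[:, 0, :]
        return snippet_repr, self.classifier(snippet_repr)

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = CodeGPTSensorSketch()
batch = tokenizer(["def add(a, b):\n    return a + b"],
                  return_tensors="pt", padding=True, truncation=True)
embedding, logits = model(batch["input_ids"], batch["attention_mask"])
```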

Contrastive learning strengthens the model's ability to pick up nuanced coding differences that humans struggle to perceive: in the paper's preliminary human study, developers identified AI-generated code with only about 50% accuracy, close to random guessing. By exploiting these subtle distinctions, CodeGPTSensor achieves near-perfect precision on the HMCorp dataset.
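The snippet below sketches a generic supervised contrastive objective of this kind (same-class representations pulled together, opposite classes pushed apart); it illustrates the idea only and is not the paper's exact loss formulation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """embeddings: (N, d) snippet representations; labels: (N,) 0 = human, 1 = LLM."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                      # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))  # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over same-class (positive) pairs for each anchor.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss.mean()

loss = supervised_contrastive_loss(torch.randn(8, 768), torch.randint(0, 2, (8,)))
```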

Experimental Evaluation

In comprehensive experiments, CodeGPTSensor was compared against several commercial and open-source detectors, including Writer, ZeroGPT, and DetectGPT. It consistently outperformed these baselines on precision, recall, and F1-score, underscoring the model's robustness and efficacy.
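For readers unfamiliar with these metrics, the toy example below computes them with scikit-learn on placeholder predictions; the numbers are illustrative and unrelated to the paper's reported results.

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = LLM-generated, 0 = human-written
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # hypothetical detector outputs

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"accuracy={accuracy_score(y_true, y_pred):.2f} "
      f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```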

CodeGPTSensor maintained its superior performance under varied conditions, including the removal of code comments and evaluation on datasets beyond the usual benchmark corpora such as CodeSearchNet. Detection efficacy also remained strong in comparisons with recently developed models built for similar purposes, highlighting the approach's generalizability across different codebases.

Implications and Future Work

CodeGPTSensor's success in reliably identifying the source of code has significant practical implications. First, it can be deployed in software development environments to flag potentially problematic LLM-generated code, managing risks before they escalate into costly or dangerous problems. Moreover, its insights deepen academic understanding of how AI-generated content differs from human-written content, supporting ethical and educational frameworks that govern its use.

Future research could investigate how well CodeGPTSensor applies across different LLM architectures and programming languages. As AI models evolve, adapting CodeGPTSensor to detect output from these variants would further broaden its applicability. In addition, embedding explanatory mechanisms in systems like CodeGPTSensor could help developers not only identify but also understand the reasons behind a classification, facilitating more informed decision-making.

In conclusion, the paper thoroughly addresses the growing necessity of distinguishing LLM-generated code from human-written equivalents, providing a well-supported and effective solution through contrastive learning that holds the promise of integration into real-world development and educational processes.

Authors (7)
  1. Xiaodan Xu (12 papers)
  2. Chao Ni (17 papers)
  3. Xinrong Guo (1 paper)
  4. Shaoxuan Liu (1 paper)
  5. Xiaoya Wang (3 papers)
  6. Kui Liu (55 papers)
  7. Xiaohu Yang (198 papers)