Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis (2306.14397v2)

Published 26 Jun 2023 in cs.SE and cs.CY

Abstract: The ubiquitous adoption of Large Language Models (LLMs) in programming has underscored the importance of differentiating between human-written code and code generated by intelligent models. This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans. Our investigation reveals disparities in programming style, technical level, and readability between these two sources. Consequently, we develop a discriminative feature set for differentiation and evaluate its efficacy through ablation experiments. Additionally, we devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets and to secure high-caliber, uncontaminated datasets. To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code. The salient contributions of our research include: proposing a discriminative feature set yielding high accuracy in differentiating ChatGPT-generated code from human-authored code in binary classification tasks; devising methods for generating large volumes of ChatGPT-generated code; and introducing a dataset cleansing strategy that extracts immaculate, high-grade code datasets from open-source repositories, thus achieving exceptional accuracy in code authorship attribution tasks.
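
The abstract describes a feature-based binary classification pipeline: extract stylistic, technical, and readability features from each code snippet, then train a classifier to separate the two sources. The sketch below is a minimal illustration of that setup, assuming a handful of hand-crafted features (mean line length, comment-line ratio, mean token length, blank-line density) and an off-the-shelf random forest from scikit-learn; the paper's actual feature set, dataset, and model are not given here, so treat these choices as placeholders.

```python
import re
from sklearn.ensemble import RandomForestClassifier

def extract_features(code: str) -> list[float]:
    """Map a code snippet to simple style/readability features.
    These four features are illustrative assumptions, not the
    paper's published feature set."""
    lines = code.splitlines() or [""]
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    return [
        sum(len(ln) for ln in lines) / len(lines),                     # mean line length
        sum(ln.strip().startswith("#") for ln in lines) / len(lines),  # comment-line ratio
        sum(map(len, tokens)) / len(tokens) if tokens else 0.0,        # mean token length
        code.count("\n\n") / len(lines),                               # blank-line density
    ]

# Toy stand-in for a labelled corpus: 1 = ChatGPT-generated, 0 = human-authored.
snippets = [
    "def add(a, b):\n    # Add two integers and return the sum.\n    return a + b\n",
    "def f(x,y): return x+y\n",
]
labels = [1, 0]

X = [extract_features(s) for s in snippets]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([extract_features("def g(a):\n    return a * 2\n")]))
```

In this framing, the ablation experiments mentioned in the abstract would amount to retraining the classifier with individual features removed and comparing accuracy, which is how the discriminative power of each feature can be assessed.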

Authors (5)
  1. Li Ke (8 papers)
  2. Hong Sheng (1 paper)
  3. Fu Cai (1 paper)
  4. Zhang Yunhe (1 paper)
  5. Liu Ming (3 papers)
Citations (5)