Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers (2401.06461v5)
Abstract: LLMs have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, raising integrity and authenticity concerns about software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated text, but they do not identify and harness the unique patterns of machine-generated code, so their applicability falters on code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. In particular, we notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code that improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
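The abstract only sketches the method, so the following is a minimal Python sketch of the core idea: score a snippet by how much its log-probability under a code LLM drops when the snippet is perturbed with random whitespace, in the style of DetectGPT's probability-curvature test. The insertion positions, counts, probabilities, and normalization here are illustrative assumptions, and `log_prob` is a hypothetical stand-in for a scoring model's log-likelihood; none of this is the authors' exact implementation.

```python
import random
from statistics import mean, pstdev
from typing import Callable

def perturb_code(code: str, num_insertions: int = 3, newline_prob: float = 0.5) -> str:
    """Insert spaces or newlines at random positions in the code.

    Hypothetical parameter choices for illustration; the paper's exact
    perturbation strategy may differ.
    """
    chars = list(code)
    for _ in range(num_insertions):
        pos = random.randrange(len(chars) + 1)
        chars.insert(pos, "\n" if random.random() < newline_prob else " ")
    return "".join(chars)

def detection_score(code: str, log_prob: Callable[[str], float],
                    num_samples: int = 20) -> float:
    """DetectGPT-style probability-curvature score.

    Machine-generated code tends to sit near a local maximum of the scoring
    model's log-probability, so whitespace perturbations should lower it more
    than for human-written code; higher scores thus suggest machine authorship.
    """
    perturbed = [log_prob(perturb_code(code)) for _ in range(num_samples)]
    mu = mean(perturbed)
    sigma = pstdev(perturbed) or 1.0  # guard against zero spread
    return (log_prob(code) - mu) / sigma
```

In practice, `log_prob` would wrap a code LLM (for example, summing token log-likelihoods from an open model), and the score would be thresholded to classify a snippet; normalizing by the perturbed samples' standard deviation mirrors the curvature test used by DetectGPT.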
- Unified Pre-training for Program Understanding and Generation. In NAACL 2021. arXiv:2103.06333 [cs].
- Allan J. Albrecht and John E. Gaffney. 1983. Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering 6 (1983), 639–648.
- SantaCoder: Don’t Reach for the Stars! arXiv preprint arXiv:2301.03988 (2023).
- Program Synthesis with Large Language Models. arXiv:2108.07732 [cs].
- Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature. arXiv:2310.05130 [cs].
- The Minimum Description Length Principle in Coding and Modeling. IEEE Transactions on Information Theory 44, 6 (1998), 2743–2760.
- Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs] (July 2021).
- GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content. arXiv preprint arXiv:2305.07969 (2023). arXiv:2305.07969
- Christian Collberg and Clark Thomborson. 1999. Software Watermarking: Models and Dynamic Embeddings. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 311–324.
- Digital Watermarking. Journal of Electronic Imaging 11, 3 (2002), 414–414.
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv:2002.08155 [cs].
- InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh International Conference on Learning Representations. arXiv:2204.05999 [cs].
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027 [cs].
- GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 111–116.
- Textbooks Are All You Need. arXiv:2306.11644 [cs].
- Harold Stanley Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc.
- On the Naturalness of Software. Commun. ACM 59, 5 (April 2016), 122–131. https://doi.org/10.1145/2902362
- The Curious Case of Neural Text Degeneration. arXiv:1904.09751 [cs].
- RADAR: Robust AI-Text Detection via Adversarial Learning. arXiv preprint arXiv:2307.03838 (2023).
- CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv:1909.09436 [cs, stat] (June 2020).
- Automatic Detection of Generated Text Is Easiest When Humans Are Fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1808–1822. https://doi.org/10.18653/v1/2020.acl-main.164
- The Stack: 3 TB of Permissively Licensed Source Code. arXiv:2211.15533 [cs].
- Paraphrasing Evades Detectors of AI-generated Text, but Retrieval Is an Effective Defense. arXiv:2303.13408 [cs].
- Who Wrote This Code? Watermarking for Code Generation. arXiv:2305.15060 [cs].
- StarCoder: May the Source Be with You! arXiv:2305.06161 [cs].
- Competition-Level Code Generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv:2305.01210 [cs].
- CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data Limitation With Contrastive Learning. arXiv preprint arXiv:2212.10341 (2022).
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs].
- The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation. arXiv:2305.06156 [cs].
- Smaller Language Models Are Better Black-box Machine-Generated Text Detectors. arXiv:2305.09859 [cs].
- DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 24950–24962.
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. arXiv:2305.02309 [cs].
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv:2203.13474 [cs].
- OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. Technical Report.
- OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs].
- Improving Language Understanding by Generative Pre-Training. (2018).
- Language Models Are Unsupervised Multitask Learners. OpenAI blog 1, 8 (2019), 9.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
- Partha Pratim Ray. 2023. ChatGPT: A Comprehensive Review on Background, Applications, Key Challenges, Bias, Ethics, Limitations and Future Scope. Internet of Things and Cyber-Physical Systems 3 (Jan. 2023), 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003
- Jarrett Rosenberg. 1997. Some Misconceptions about Lines of Code. In Proceedings Fourth International Software Metrics Symposium. IEEE, 137–142.
- Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs].
- Release Strategies and the Social Impacts of Language Models. arXiv preprint arXiv:1908.09203 (2019).
- DetectLLM: Leveraging Log Rank Information for Zero-Shot Detection of Machine-Generated Text.
- CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models. In ESEC/FSE 2023. https://doi.org/10.1145/3611643.3616297 arXiv:2308.14401 [cs].
- Multiscale Positive-Unlabeled Detection of AI-Generated Texts. arXiv:2305.18149 [cs].
- LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs].
- Attention Is All You Need. In Advances in Neural Information Processing Systems. 5998–6008.
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation. arXiv:2305.07922 [cs].
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8696–8708. https://doi.org/10.18653/v1/2021.emnlp-main.685
- A Survey on LLM-generated Text Detection: Necessity, Methods, and Future Directions. arXiv:2310.14724 [cs].
- LLMDet: A Third Party Large Language Models Generated Text Detection Tool. arXiv:2305.15004 [cs].
- A Systematic Evaluation of Large Language Models of Code. arXiv:2202.13169 [cs] (March 2022).
- A Survey on Detection of LLMs-Generated Content. arXiv:2310.15654 [cs].
- Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778 [cs].
- G3Detector: General GPT-Generated Text Detector. arXiv preprint arXiv:2305.12680 (2023).
- Hongyu Zhang. 2008. Exploring Regularity in Source Code: Software Science and Zipf’s Law. In 2008 15th Working Conference on Reverse Engineering. IEEE, 101–110.
- Hongyu Zhang. 2009. Discovering Power Laws in Computer Programs. Information Processing & Management 45, 4 (2009), 477–483.
- A Survey of Large Language Models. arXiv:2303.18223 [cs].
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X. arXiv:2303.17568 [cs].
- George Kingsley Zipf. 2016. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books.