
Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting (2405.16133v3)

Published 25 May 2024 in cs.SE and cs.AI

Abstract: LLMs have demonstrated remarkable proficiency in generating code. However, the misuse of LLM-generated (synthetic) code has raised concerns in both educational and industrial contexts, underscoring the urgent need for synthetic code detectors. Existing methods for detecting synthetic content are primarily designed for general text and struggle with code due to the unique grammatical structure of programming languages and the presence of numerous "low-entropy" tokens. To address this, our work proposes a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our method rests on the observation that differences between LLM-rewritten and original code tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and evaluate our approach on two synthetic code detection benchmarks. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors, with AUROC scores increasing by 20.5% on the APPS benchmark and 29.1% on the MBPP benchmark.
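The sketch below is not taken from the paper; it is a minimal illustration of the rewrite-and-compare idea the abstract describes: rewrite a candidate snippet several times, measure how similar the rewrites are to the original, and treat a high average similarity as evidence the original is LLM-generated. The function names (`detection_score`, `dummy_rewrite`) are hypothetical, and `difflib` is used only as a stand-in for the paper's contrastively trained code-similarity model so the example runs without an LLM.

```python
# Hypothetical sketch of zero-shot synthetic code detection via code rewriting.
# Assumptions: the rewriter and the similarity measure are placeholders; the
# paper uses an LLM to rewrite code and a self-supervised contrastive
# similarity model to compare the rewrites with the original.
import difflib
from typing import Callable, List


def similarity(a: str, b: str) -> float:
    # Stand-in similarity: character-level match ratio via difflib.
    return difflib.SequenceMatcher(None, a, b).ratio()


def detection_score(code: str,
                    rewrite: Callable[[str], str],
                    n_rewrites: int = 4) -> float:
    # Rewrite the candidate code several times and average the similarity
    # between each rewrite and the original. Per the paper's observation,
    # rewrites of synthetic code tend to change less, so higher scores
    # suggest the original code is more likely LLM-generated.
    rewrites: List[str] = [rewrite(code) for _ in range(n_rewrites)]
    return sum(similarity(code, r) for r in rewrites) / n_rewrites


if __name__ == "__main__":
    # Trivial "rewriter" so the example executes; in practice this would
    # prompt an LLM to rewrite the snippet.
    def dummy_rewrite(code: str) -> str:
        return code.replace("x", "value")

    sample = "def add(x, y):\n    return x + y\n"
    print(f"detection score: {detection_score(sample, dummy_rewrite):.3f}")
```

In practice, the score would be thresholded (or ranked, as in AUROC evaluation) to decide whether a snippet is flagged as synthetic.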

Authors (7)
  1. Tong Ye (34 papers)
  2. Yangkai Du (8 papers)
  3. Tengfei Ma (73 papers)
  4. Lingfei Wu (135 papers)
  5. Xuhong Zhang (61 papers)
  6. Shouling Ji (136 papers)
  7. Wenhai Wang (123 papers)
Citations (4)
