CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code (2404.15639v3)
Abstract: LLMs have achieved remarkable progress in code generation. It is now crucial to identify whether code is AI-generated and to determine the specific model that produced it, particularly for protecting Intellectual Property (IP) in industry and preventing cheating in programming exercises. Several attempts have been made to insert watermarks into machine-generated code, but existing approaches can embed only a single bit of information. In this paper, we introduce CodeIP, a novel multi-bit watermarking technique that embeds additional information to preserve crucial provenance details, such as the vendor ID of an LLM, thereby safeguarding the IP of LLMs in code generation. Furthermore, to ensure the syntactic correctness of the generated code, we constrain the sampling process for predicting the next token by training a type predictor. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP in watermarking LLMs for code generation while maintaining the syntactic correctness of the code.
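The multi-bit idea the abstract describes can be illustrated with a minimal sketch: hash each candidate token together with its predecessor into one of two buckets, boost the logits of the bucket matching the current message bit, and (optionally) restrict sampling to tokens a grammar oracle allows. The `bucket` hash, the bias `delta`, and the `allowed` set below are illustrative assumptions standing in for CodeIP's actual embedding rule and trained type predictor, not the paper's implementation.

```python
import hashlib

def bucket(token, prev_token, key="secret"):
    # Hash the (previous token, candidate token) pair into bucket 0 or 1.
    # A keyed hash plays the role of the watermark's secret partition.
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).hexdigest()
    return int(h, 16) % 2

def embed_bit(logits, prev_token, bit, delta=4.0, allowed=None):
    # Boost tokens whose bucket matches the message bit; `allowed` is a
    # hypothetical stand-in for grammar-valid tokens from a type predictor.
    candidates = {t: s for t, s in logits.items()
                  if allowed is None or t in allowed}
    biased = {t: s + (delta if bucket(t, prev_token) == bit else 0.0)
              for t, s in candidates.items()}
    return max(biased, key=biased.get)  # greedy decoding, for illustration

def extract_bit(token, prev_token):
    # Detection only needs the key and the token pair, not the model.
    return bucket(token, prev_token)
```

With a large enough `delta`, each emitted token falls in the bucket encoding its message bit, so the detector recovers the multi-bit payload by re-hashing consecutive token pairs; real schemes must additionally survive sampling noise and code edits.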