Towards Codable Watermarking for Injecting Multi-bits Information to LLMs (2307.15992v3)
Abstract: As LLMs generate text with increasing fluency and realism, there is a growing need to identify the source of a text to prevent the abuse of LLMs. Text watermarking techniques have proven reliable for distinguishing whether a text was generated by an LLM by injecting hidden patterns into it. However, we argue that existing LLM watermarking methods are encoding-inefficient and cannot flexibly meet diverse information-encoding needs (such as encoding the model version, generation time, user ID, etc.). In this work, we conduct the first systematic study of Codable Text Watermarking for LLMs (CTWL), which allows text watermarks to carry multi-bit customizable information. We first study the taxonomy of LLM watermarking technologies and give a mathematical formulation for CTWL. We then provide a comprehensive evaluation system for CTWL covering: (1) watermarking success rate, (2) robustness against various corruptions, (3) coding rate of the payload information, (4) encoding and decoding efficiency, and (5) impact on the quality of the generated text. To meet the requirements of these metrics, which cannot all be improved simultaneously, we follow the most prominent vocabulary-partition-based watermarking direction and devise an advanced CTWL method named Balance-Marking. Its core idea is to use a proxy language model to split the vocabulary into probability-balanced parts, thereby effectively maintaining the quality of the watermarked text. Our code is available at https://github.com/lancopku/codable-watermarking-for-LLM.
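The abstract only sketches Balance-Marking at a high level. As a rough illustration of the idea (not the authors' implementation; all names, the greedy balancing heuristic, and the `delta` strength parameter below are assumptions), the following Python snippet shows one way a proxy language model's next-token distribution could be split into two probability-balanced vocabulary halves, with the current payload bit selecting which half receives a logit boost:

```python
# Illustrative sketch of a vocabulary-partition watermark in the spirit of
# Balance-Marking: a proxy LM's next-token probabilities are greedily split
# into two sets of roughly equal probability mass, and the payload bit
# decides which set is boosted before sampling. Names are hypothetical.
import torch

def balanced_partition(proxy_probs: torch.Tensor) -> torch.Tensor:
    """Greedily assign token ids to two sets so that their proxy-LM
    probability masses stay as balanced as possible.
    Returns a boolean mask over the vocabulary: True = set 1."""
    order = torch.argsort(proxy_probs, descending=True)
    mask = torch.zeros_like(proxy_probs, dtype=torch.bool)
    mass = [0.0, 0.0]
    for tok in order.tolist():
        side = 0 if mass[0] <= mass[1] else 1  # add to the lighter side
        mass[side] += proxy_probs[tok].item()
        mask[tok] = (side == 1)
    return mask

def watermark_logits(logits: torch.Tensor,
                     proxy_probs: torch.Tensor,
                     bit: int,
                     delta: float = 2.0) -> torch.Tensor:
    """Boost the partition selected by the current payload bit."""
    mask = balanced_partition(proxy_probs)
    chosen = mask if bit == 1 else ~mask
    return logits + delta * chosen.float()

if __name__ == "__main__":
    torch.manual_seed(0)
    vocab_size = 16
    proxy_probs = torch.softmax(torch.randn(vocab_size), dim=0)
    logits = torch.randn(vocab_size)
    watermarked = watermark_logits(logits, proxy_probs, bit=1)
```

Because each half carries roughly half the proxy-LM probability mass, boosting either half should distort the generating model's distribution far less than an arbitrary vocabulary split, which is presumably how the method "effectively maintains the quality of the watermarked text." At detection time, one would recompute the same partitions with the proxy LM and read off which side each generated token fell on to recover the payload bits.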
Authors: Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, Xu Sun