Provably Robust Multi-bit Watermarking for AI-generated Text (2401.16820v3)
Abstract: LLMs have demonstrated remarkable capabilities in generating text that resembles human language. However, they can be misused by criminals to create deceptive content, such as fake news and phishing emails, which raises ethical concerns. Watermarking is a key technique to address these concerns: it embeds a message (e.g., a bit string) into a text generated by an LLM. By embedding the user ID (represented as a bit string) into generated texts, we can trace a generated text back to the user who produced it, a task known as content source tracing. The major limitation of existing watermarking techniques is that they achieve sub-optimal performance for content source tracing in real-world scenarios, because they cannot accurately or efficiently extract a long message from a generated text. We aim to address these limitations. In this work, we introduce a new watermarking method for LLM-generated text grounded in pseudo-random segment assignment, and we propose multiple techniques to further enhance its robustness. We conduct extensive experiments to evaluate our method. Our results show that it substantially outperforms existing baselines in both accuracy and robustness on benchmark datasets. For instance, when embedding a message of length 20 into a 200-token generated text, our method achieves a match rate of $97.6\%$, while the state-of-the-art method of Yoo et al. achieves only $49.2\%$. Additionally, we prove that, under the same setting, our watermark can tolerate edits within an edit distance of 17 on average per paragraph.
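The abstract's core idea, pseudo-randomly assigning token positions to message segments and recovering each segment's bit from the tokens assigned to it, can be illustrated with a toy sketch. This is not the paper's actual algorithm: the keyed PRF (`prf`), the candidate-list embedder (`embed`), and the majority-vote extractor (`extract`) are all simplified, hypothetical stand-ins for the real sampling-time mechanism.

```python
import hashlib


def prf(key: bytes, data: bytes) -> int:
    """Keyed pseudo-random function (illustrative: SHA-256 of key || data)."""
    return int.from_bytes(hashlib.sha256(key + data).digest()[:8], "big")


def segment_of(key: bytes, pos: int, num_segments: int) -> int:
    """Pseudo-randomly assign token position `pos` to one message segment."""
    return prf(key, b"seg:%d" % pos) % num_segments


def token_color(key: bytes, pos: int, token_id: int) -> int:
    """Pseudo-randomly split the vocabulary in half at each position;
    tokens of color 1 encode bit 1, color 0 encodes bit 0."""
    return prf(key, b"tok:%d:%d" % (pos, token_id)) % 2


def embed(candidates_per_pos, key: bytes, message_bits):
    """Toy embedder: at each position, pick a candidate token whose
    color matches the bit of the segment that position belongs to."""
    out = []
    for pos, candidates in enumerate(candidates_per_pos):
        bit = message_bits[segment_of(key, pos, len(message_bits))]
        chosen = next(
            (t for t in candidates if token_color(key, pos, t) == bit),
            candidates[0],  # fall back if no candidate has the right color
        )
        out.append(chosen)
    return out


def extract(tokens, key: bytes, msg_len: int):
    """Recover each segment's bit by majority vote over its positions,
    which is what makes the decoding tolerant to a bounded number of edits."""
    votes = [[0, 0] for _ in range(msg_len)]
    for pos, token_id in enumerate(tokens):
        votes[segment_of(key, pos, msg_len)][token_color(key, pos, token_id)] += 1
    return [int(ones >= zeros) for zeros, ones in votes]
```

With roughly 200 positions spread over 20 segments, each bit is backed by about 10 votes, which is why a handful of token edits per paragraph leaves the extracted message intact in this sketch.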
- S. Abdelnabi and M. Fritz, “Adversarial watermarking transformer: Towards tracing text provenance with data hiding,” in 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021, pp. 121–140.
- D. I. Adelani, H. Mai, F. Fang, H. H. Nguyen, J. Yamagishi, and I. Echizen, “Generating sentiment-preserving fake online reviews using neural language models and their human-and machine-based detection,” in Advanced Information Networking and Applications: Proceedings of the 34th International Conference on Advanced Information Networking and Applications (AINA-2020). Springer, 2020, pp. 1341–1354.
- E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic et al., “The falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023.
- R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
- M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, and S. Naik, “Natural language watermarking: Design, analysis, and a proof-of-concept implementation,” in Information Hiding: 4th International Workshop, IH 2001 Pittsburgh, PA, USA, April 25–27, 2001 Proceedings 4. Springer, 2001, pp. 185–200.
- M. J. Atallah, V. Raskin, C. F. Hempelmann, M. Karahan, R. Sion, U. Topkara, and K. E. Triezenberg, “Natural language watermarking and tamperproofing,” in International workshop on information hiding. Springer, 2002, pp. 196–212.
- P. Bassia, I. Pitas, and N. Nikolaidis, “Robust audio watermarking in the time domain,” IEEE Transactions on multimedia, vol. 3, no. 2, pp. 232–241, 2001.
- R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Information and Control, vol. 3, no. 1, pp. 68–79, 1960.
- S. Cai and W. Cui, “Evade chatgpt detectors via a single space,” arXiv preprint arXiv:2307.02599, 2023.
- M. Christ, S. Gunn, and O. Zamir, “Undetectable watermarks for language models,” arXiv preprint arXiv:2306.09194, 2023.
- I. Cox, M. Miller, J. Bloom, and C. Honsinger, “Digital watermarking,” Journal of Electronic Imaging, vol. 11, no. 3, pp. 414–414, 2002.
- J. Fairoze, S. Garg, S. Jha, S. Mahloujifar, M. Mahmoody, and M. Wang, “Publicly detectable watermarking for language models,” arXiv preprint arXiv:2310.18491, 2023.
- P. Fernandez, A. Chaffin, K. Tit, V. Chappelier, and T. Furon, “Three bricks to consolidate watermarks for large language models,” arXiv preprint arXiv:2308.00113, 2023.
- Github, “Copilot,” https://github.com/features/copilot, 2023, accessed: January 12, 2024.
- R. W. Hamming, “Error detecting and error correcting codes,” The Bell system technical journal, vol. 29, no. 2, pp. 147–160, 1950.
- X. He, Q. Xu, L. Lyu, F. Wu, and C. Wang, “Protecting intellectual property of language generation apis with lexical watermark,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10758–10766.
- A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” arXiv preprint arXiv:1904.09751, 2019.
- IvyPanda, “IvyPanda,” https://ivypanda.com/essays/, 2024, accessed: January 19, 2024.
- R. Karanjai, “Targeted phishing campaigns using large scale language models,” arXiv preprint arXiv:2301.00665, 2022.
- J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 17061–17084. [Online]. Available: https://proceedings.mlr.press/v202/kirchenbauer23a.html
- D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, R. Hornung, H. Adam, H. Akbari, Y. Alon, V. Birodkar et al., “Videopoet: A large language model for zero-shot video generation,” arXiv preprint arXiv:2312.14125, 2023.
- K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer, “Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense,” arXiv preprint arXiv:2303.13408, 2023.
- T. Lee, S. Hong, J. Ahn, I. Hong, H. Lee, S. Yun, J. Shin, and G. Kim, “Who wrote this code? watermarking for code generation,” arXiv preprint arXiv:2305.15060, 2023.
- A. Liu, L. Pan, X. Hu, S. Li, L. Wen, I. King, and P. S. Yu, “An unforgeable publicly verifiable watermark for large language models,” arXiv preprint arXiv:2307.16230, 2023.
- A. Liu, L. Pan, X. Hu, S. Meng, and L. Wen, “A semantic invariant robust watermark for large language models,” arXiv preprint arXiv:2310.06356, 2023.
- J. Lubin, J. A. Bloom, and H. Cheng, “Robust content-dependent high-fidelity watermark for tracking in digital cinema,” in Security and Watermarking of Multimedia Contents V, vol. 5020. SPIE, 2003, pp. 536–545.
- X. Luo, R. Zhan, H. Chang, F. Yang, and P. Milanfar, “Distortion agnostic deep watermarking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13548–13557.
- J. L. Massey, “Deep-space communications and coding: A marriage made in heaven,” in Advanced Methods for Satellite and Deep Space Communications: Proceedings of an International Seminar Organized by Deutsche Forschungsanstalt für Luft-und Raumfahrt (DLR) Bonn, Germany, September 1992. Springer, 1992, pp. 1–17.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016.
- Microsoft, “New Bing,” https://copilot.microsoft.com/, 2023, accessed: January 16, 2024.
- G. Navarro, “A guided tour to approximate string matching,” ACM computing surveys (CSUR), vol. 33, no. 1, pp. 31–88, 2001.
- A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
- OpenAI, “ChatGPT,” https://openai.com/blog/chatgpt, 2023, accessed: January 10, 2024.
- ——, “Gpt-4 technical report,” 2023.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
- J. Piet, C. Sitawarin, V. Fang, N. Mu, and D. Wagner, “Mark my words: Analyzing and evaluating language model watermarks,” arXiv preprint arXiv:2312.00273, 2023.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- P. Ranade, A. Piplai, S. Mittal, A. Joshi, and T. Finin, “Generating fake cyber threat intelligence using transformer-based models,” in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–9.
- I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” Journal of the society for industrial and applied mathematics, vol. 8, no. 2, pp. 300–304, 1960.
- J. Ren, H. Xu, Y. Liu, Y. Cui, S. Wang, D. Yin, and J. Tang, “A robust semantics-based watermark for large language model against paraphrasing,” arXiv preprint arXiv:2311.08721, 2023.
- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” arXiv preprint arXiv:2112.10752, 2021.
- M. D. Swanson, B. Zhu, A. H. Tewfik, and L. Boney, “Robust audio watermarking using perceptual masking,” Signal processing, vol. 66, no. 3, pp. 337–355, 1998.
- C.-W. Tang and H.-M. Hang, “A feature-based robust digital image watermarking scheme,” IEEE transactions on signal processing, vol. 51, no. 4, pp. 950–959, 2003.
- U. Topkara, M. Topkara, and M. J. Atallah, “The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions,” in Proceedings of the 8th workshop on Multimedia and security, 2006, pp. 164–174.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- L. Wang, W. Yang, D. Chen, H. Zhou, Y. Lin, F. Meng, J. Zhou, and X. Sun, “Towards codable text watermarking for large language models,” arXiv preprint arXiv:2307.15992, 2023.
- X. Yang, K. Chen, W. Zhang, C. Liu, Y. Qi, J. Zhang, H. Fang, and N. Yu, “Watermarking text generated by black-box language models,” arXiv preprint arXiv:2305.08883, 2023.
- X. Yang, J. Zhang, K. Chen, W. Zhang, Z. Ma, F. Wang, and N. Yu, “Tracing text provenance via context-aware lexical substitution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 11613–11621.
- M. M. Yeung and F. Mintzer, “An invisible watermarking technique for image verification,” in Proceedings of international conference on image processing, vol. 2. IEEE, 1997, pp. 680–683.
- K. Yoo, W. Ahn, J. Jang, and N. Kwak, “Robust multi-bit natural language watermarking through invariant features,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2092–2115.
- K. Yoo, W. Ahn, and N. Kwak, “Advancing beyond identification: Multi-bit watermark for language models,” arXiv preprint arXiv:2308.00221, 2023.
- J. Zhang, D. Chen, J. Liao, H. Fang, W. Zhang, W. Zhou, H. Cui, and N. Yu, “Model watermarking for image processing networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12805–12812.
- X. Zhang, X. Sun, X. Sun, W. Sun, and S. K. Jha, “Robust reversible audio watermarking scheme for telemedicine and privacy protection.” Computers, Materials & Continua, vol. 71, no. 2, 2022.
- X. Zhao, P. Ananth, L. Li, and Y.-X. Wang, “Provable robust watermarking for ai-generated text,” arXiv preprint arXiv:2306.17439, 2023.
- L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-judge with mt-bench and chatbot arena,” arXiv preprint arXiv:2306.05685, 2023.
Authors: Wenjie Qu, Dong Yin, Wei Zou, Tianyang Tao, Jinyuan Jia, Jiaheng Zhang, Wengrui Zheng, Yanze Jiang, Zhihua Tian