No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices (2402.16187v3)
Published 25 Feb 2024 in cs.CR, cs.CL, and cs.LG
Abstract: Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack -- leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.
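To make the design choices at stake concrete, below is a minimal sketch of one widely studied scheme of the kind the paper analyzes: the "green list" soft watermark of Kirchenbauer et al. (2023). The specifics here (seeding the list on only the previous token, and the `VOCAB_SIZE`, `GAMMA`, and `DELTA` constants) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a "green list" soft watermark in the style of
# Kirchenbauer et al. (2023). All constants and function names are
# hypothetical choices for this example, not taken from the paper.

import hashlib
import math
import random

VOCAB_SIZE = 50_000   # assumed vocabulary size
GAMMA = 0.25          # fraction of the vocabulary placed on the green list
DELTA = 2.0           # logit bias added to green tokens during generation

def green_list(prev_token: int) -> set[int]:
    """Pseudorandomly partition the vocabulary, seeded by the previous token."""
    seed = int.from_bytes(hashlib.sha256(str(prev_token).encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))

def bias_logits(logits: list[float], prev_token: int) -> list[float]:
    """Generation side: add DELTA to every green-list logit (a soft watermark)."""
    greens = green_list(prev_token)
    return [l + DELTA if i in greens else l for i, l in enumerate(logits)]

def detect(tokens: list[int]) -> float:
    """Detection side: z-score of the observed green-token count against the
    null hypothesis that each token lands on the green list w.p. GAMMA."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(1 for prev, tok in zip(tokens, tokens[1:]) if tok in green_list(prev))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Even this toy version surfaces the trade-offs the abstract describes: seeding the partition on a single preceding token keeps detection robust to local edits, but makes the green/red split easier for an attacker to learn and exploit, whereas seeding on a longer context does the reverse.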
Authors: Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith