Multi-Bit Distortion-Free Watermarking for Large Language Models (2402.16578v1)
Abstract: Methods for watermarking LLMs have been proposed that distinguish AI-generated text from human-written text by slightly altering the model's output distribution; however, this alteration distorts text quality and exposes the watermark to adversarial detection. More recently, distortion-free watermarking methods have been proposed that require a secret key to detect the watermark. These prior methods generally embed zero-bit watermarks, which provide no information beyond tagging a text as AI-generated. We extend an existing zero-bit distortion-free watermarking method to embed multiple bits of meta-information as part of the watermark, and we develop a computationally efficient decoder that extracts the embedded information from the watermark with a low bit error rate.
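To make the idea concrete, below is a minimal illustrative sketch, not the paper's algorithm: a distortion-free sampler in the style of Aaronson's Gumbel trick, where message bits are embedded by salting a keyed PRF and decoding scores candidate messages against the observed tokens. The function names, the SHA-256 seeding, and the brute-force decoder are assumptions made for illustration; note that brute-force scoring scales exponentially in the number of embedded bits, which is precisely the cost a computationally efficient multi-bit decoder must avoid.

```python
# Illustrative sketch only: Gumbel-trick distortion-free sampling with a
# message-salted PRF. All names and parameters are hypothetical.
import hashlib

import numpy as np


def prf_uniforms(secret_key: bytes, salt: bytes, context: tuple, vocab_size: int) -> np.ndarray:
    """Pseudorandom uniforms over the vocabulary, keyed by secret key, message salt, and context."""
    digest = hashlib.sha256(secret_key + salt + repr(context).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.random(vocab_size)


def watermarked_sample(probs: np.ndarray, secret_key: bytes, message_bits: str, context: tuple) -> int:
    """Gumbel trick: argmax_i u_i**(1/p_i) is distributed exactly according to probs,
    so the marginal text distribution is unchanged, while the realized choice is
    correlated with the keyed PRF (and hence with the embedded message)."""
    u = prf_uniforms(secret_key, message_bits.encode(), context, len(probs))
    return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))


def message_score(tokens, contexts, secret_key: bytes, candidate_bits: str, vocab_size: int) -> float:
    """Decoder sketch: the statistic sum(-log(1 - u[token])) is large only when the
    candidate message matches the salt used at generation time, so decoding
    reduces to scoring candidate messages (brute force here, for illustration)."""
    score = 0.0
    for tok, ctx in zip(tokens, contexts):
        u = prf_uniforms(secret_key, candidate_bits.encode(), ctx, vocab_size)
        score += -np.log(1.0 - u[tok] + 1e-12)
    return score


if __name__ == "__main__":
    key = b"secret"
    probs = np.array([0.6, 0.3, 0.1])
    ctx = (5, 7)  # e.g. the preceding token ids
    tok = watermarked_sample(probs, key, "10", ctx)
    # With many watermarked tokens, the true message's score dominates;
    # a single token is shown here only to keep the example short.
    scores = {bits: message_score([tok], [ctx], key, bits, len(probs))
              for bits in ("00", "01", "10", "11")}
    print(tok, max(scores, key=scores.get))
```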