Unbiased Watermark for Large Language Models (2310.10669v2)
Abstract: The recent advancements in LLMs have sparked growing apprehension regarding their potential misuse. One approach to mitigating this risk is to incorporate watermarking techniques into LLMs, allowing for the tracking and attribution of model outputs. This study examines a crucial aspect of watermarking: how significantly watermarks impact the quality of model-generated outputs. Previous studies have suggested a trade-off between watermark strength and output quality. However, our research demonstrates that, with appropriate implementation, it is possible to integrate watermarks without affecting the output probability distribution. We refer to this type of watermark as an unbiased watermark. This has significant implications for the use of LLMs, as it becomes impossible for users to discern whether a service provider has incorporated watermarks or not. Furthermore, the presence of watermarks does not compromise the performance of the model on downstream tasks, ensuring that the overall utility of the LLM is preserved. Our findings contribute to the ongoing discussion around responsible AI development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.
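To make the "unbiased" claim concrete, the sketch below illustrates one well-known way a watermark can leave the output distribution untouched: at each step, a uniform draw is derived pseudorandomly from a secret key and the context, and the next token is chosen by inverse-transform sampling. Averaged over random keys, the marginal distribution of the sampled token equals the model's original distribution, while the key holder can recompute the draw for detection. This is a minimal illustrative sketch, not the paper's specific reweighting scheme; the function names (`watermark_sample`, `detection_score`) are assumptions for illustration.

```python
import hashlib

import numpy as np


def watermark_sample(probs, secret_key, context):
    """Sample the next token by inverse-transform sampling, with the
    uniform draw derived pseudorandomly from (secret_key, context).

    Because u is uniform over random keys, the marginal distribution of
    the returned token equals `probs`: the watermark is unbiased.
    """
    digest = hashlib.sha256(f"{secret_key}|{context}".encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value u in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    # Inverse-transform sampling: pick the first index whose CDF exceeds u.
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, u, side="right"))


def detection_score(token, probs, secret_key, context):
    """One-step detector: recompute the pseudorandom draw and check
    whether the observed token matches what inverse sampling would give.
    A long text aligning with the key far above chance indicates a watermark.
    """
    return watermark_sample(probs, secret_key, context) == token
```

A user without the key sees tokens distributed exactly as the unwatermarked model would produce them, which is why, as the abstract notes, the user cannot tell whether the watermark is present.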
Authors: Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, Heng Huang