A Semantic Invariant Robust Watermark for Large Language Models
Abstract: Watermarking algorithms for LLMs have achieved extremely high accuracy in detecting LLM-generated text. Such algorithms typically add extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness: the watermark logits for a token are determined by a fixed number of preceding tokens, and a small number yields low security robustness, while a large number yields insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. In our method, the watermark logits are determined by the semantics of all preceding tokens. Specifically, we use an auxiliary embedding LLM to generate semantic embeddings of all preceding tokens, and these embeddings are then transformed into the watermark logits by our trained watermark model. Our analyses and experiments demonstrate the attack robustness of our method under semantically invariant perturbations, namely synonym substitution and text paraphrasing. Finally, we show that our watermark also possesses adequate security robustness. Our code and data are available at \href{https://github.com/THU-BPM/Robust_Watermark}{https://github.com/THU-BPM/Robust\_Watermark}. Additionally, our algorithm can be accessed through MarkLLM \citep{pan2024markllm}\footnote{https://github.com/THU-BPM/MarkLLM}.
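To make the pipeline concrete, below is a minimal PyTorch sketch of the generation step the abstract describes: embed the full prefix with a separate embedding model, map that embedding to vocabulary-sized watermark logits with a trained watermark model, and add them to the LM's logits before sampling. This is an illustration under stated assumptions, not the authors' implementation: `WatermarkNet`, the bias scale `delta`, and the two model names are hypothetical placeholders, and a real watermark model would be trained rather than randomly initialized.

```python
# Sketch only: the architecture of WatermarkNet, the models chosen, and
# delta are illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
embedder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

class WatermarkNet(torch.nn.Module):
    """Hypothetical trained model: prefix embedding -> vocab-sized watermark logits."""
    def __init__(self, emb_dim: int, vocab_size: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(emb_dim, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, vocab_size), torch.nn.Tanh(),  # bounded bias per token
        )
    def forward(self, e: torch.Tensor) -> torch.Tensor:
        return self.net(e)

wm_net = WatermarkNet(embedder.get_sentence_embedding_dimension(),
                      lm.config.vocab_size)  # weights would come from training

@torch.no_grad()
def generate_step(prefix_ids: torch.Tensor, delta: float = 2.0) -> int:
    # 1) Ordinary LM logits for the next token.
    lm_logits = lm(prefix_ids.unsqueeze(0)).logits[0, -1]
    # 2) Semantic embedding of *all* preceding tokens (decoded back to text).
    prefix_text = tokenizer.decode(prefix_ids, skip_special_tokens=True)
    e = torch.tensor(embedder.encode(prefix_text))
    # 3) Watermark logits from the trained watermark model, scaled by delta.
    wm_logits = delta * wm_net(e)
    # 4) Sample the next token from the combined distribution.
    probs = torch.softmax(lm_logits + wm_logits, dim=-1)
    return int(torch.multinomial(probs, 1))
```

Because the bias depends only on the prefix's semantics, a paraphrase of the prefix yields a nearby embedding and therefore nearly identical watermark logits, which is the intuition behind the method's robustness to synonym substitution and paraphrasing. Detection would recompute the watermark logits for each prefix and test whether the observed tokens are biased toward them.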