A Semantic Invariant Robust Watermark for LLMs
The development of LLMs brings substantial improvements in natural language processing tasks, but it also introduces challenges related to the misuse of machine-generated content. This paper presents a novel approach to watermarking LLMs, aiming to enhance both the attack robustness and the security robustness of watermarking methods. By harnessing semantic information to generate watermark logits, the proposed method addresses a limitation of existing algorithms, which often trade robustness against text modifications for security against attempts to infer the watermarking rules.
Methodology Overview
The paper introduces a watermarking algorithm wherein the watermark logits for LLMs are informed by the semantic content of preceding tokens rather than merely their identity. This semantic-based approach employs an auxiliary embedding LLM to generate embeddings of preceding tokens, which are subsequently transformed into watermark logits via a trained watermark model. The innovation lies in utilizing semantically invariant features to both ensure the robustness of the watermark against text modifications such as synonym substitutions and paraphrasing, and to increase the security robustness against attacks attempting to deduce watermarking rules.
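The generation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear-plus-tanh watermark model, the `delta` strength parameter, and the toy dimensions are all hypothetical stand-ins; in practice the embedding would come from an auxiliary embedding model and the watermark model would be the trained network described next.

```python
import numpy as np

VOCAB_SIZE = 8   # toy vocabulary size (illustrative)
EMBED_DIM = 4    # toy embedding dimension (illustrative)

rng = np.random.default_rng(0)
# Hypothetical watermark model: a fixed linear map followed by tanh, so the
# watermark logits fall in (-1, 1). The real model is trained (see below).
W = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def watermark_logits(prefix_embedding):
    """Map a semantic embedding of the preceding text to per-token watermark logits."""
    return np.tanh(W @ prefix_embedding)

def watermarked_logits(llm_logits, prefix_embedding, delta=1.0):
    """Bias the LLM's next-token logits with semantics-derived watermark logits.

    Because the logits depend on the *meaning* of the prefix rather than its
    exact tokens, a paraphrased prefix yields nearly the same bias.
    """
    return llm_logits + delta * watermark_logits(prefix_embedding)
```

The key property is that two semantically similar prefixes produce nearly identical embeddings, hence nearly identical watermark logits, which is what makes the watermark survive paraphrasing.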
The watermark model is trained with two objectives. A similarity loss requires that the similarity between watermark logits track the similarity between the corresponding input text embeddings, while a normalization loss pushes the generated logits toward balanced scores with a neutral (near-zero) mean. Together these objectives underpin robustness and security: similarity-aligned logits survive meaning-preserving edits, and zero-mean logits avoid biasing token frequencies in a way an attacker could exploit to recover the watermarking rules.
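The two objectives can be written down concretely. The following is one plausible reading under stated assumptions: cosine similarity as the similarity measure and a squared-mean penalty as the normalization term are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def similarity_loss(emb_a, emb_b, logits_a, logits_b):
    """Penalize mismatch between embedding similarity and watermark-logit
    similarity: similar texts should receive similar watermark logits."""
    return (cosine(emb_a, emb_b) - cosine(logits_a, logits_b)) ** 2

def normalization_loss(logits):
    """Penalize a nonzero mean, so the watermark has no global token bias
    that frequency analysis could detect."""
    return float(np.mean(logits)) ** 2
```

During training the two terms would be summed (with some weighting) over batches of text pairs and minimized with respect to the watermark model's parameters.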
Experimental Results
The experimental results presented in the paper indicate that the proposed watermarking method offers strong resistance to semantics-preserving text changes. The watermark consistently demonstrates high detection accuracy across multiple attack scenarios, including text paraphrasing and synonym replacement. Moreover, the results confirm that the method achieves a desirable balance between attack robustness and security robustness, as evidenced by its resistance to watermark-rule recovery via token-frequency attacks.
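The detection side can be sketched in a few lines. The statistic below (the mean watermark logit of the tokens actually observed) is an assumption meant to convey the idea, not the paper's exact test: the detector recomputes each position's watermark logits from the prefix embeddings, and watermarked text skews positive because sampling favored positively-biased tokens, while clean text averages near zero.

```python
import numpy as np

def detection_score(token_ids, per_step_watermark_logits):
    """Mean watermark logit of the tokens that appear in the text.

    token_ids:                observed token at each position
    per_step_watermark_logits: for each position, the watermark logit vector
                               recomputed from that position's prefix embedding

    Unwatermarked text scores near 0 (the logits are trained to be zero-mean);
    watermarked text scores significantly above 0.
    """
    scores = [step[t] for t, step in zip(token_ids, per_step_watermark_logits)]
    return float(np.mean(scores))
```

In practice the score would be compared against a threshold calibrated on unwatermarked text (e.g. via a z-test over the per-token scores).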
Furthermore, the paper evaluates the computational efficiency and text quality impacts of the watermarking process. Although the watermarking introduces some latency during text generation, primarily in the embedding phase, parallelization effectively mitigates these delays. Importantly, the text quality, as measured by perplexity, is only slightly affected, suggesting the method's feasibility for real-world applications.
Implications and Future Directions
The semantic invariant robust watermarking approach outlined in this paper has significant implications for the future of LLM usage. Its ability to maintain watermark robustness across a variety of text modifications highlights its potential for ensuring content authenticity and traceability, critical aspects in fields concerned with copyright and misinformation. By operating on semantic embeddings, the approach also opens pathways to multilingual and context-aware watermarks, which can be particularly valuable as LLMs become integrated into diverse and global contexts.
As advancements in AI continue, future developments could involve integrating more sophisticated embedding models or exploring dynamic watermarking techniques that adapt in real-time to evolving language patterns. Enhancements to the watermark model architecture could also leverage advances in neural network training techniques, potentially improving the balance between security and robustness further. Overall, the proposed semantic invariant robust watermarking method lays foundational groundwork for mitigating the risks associated with LLM-generated text, fostering a broader and more responsible deployment of AI technologies.