- The paper introduces a novel smoothing attack that neutralizes watermark perturbations while preserving high text quality.
- The authors develop a two-phase framework, watermark inference followed by weighted-logit smoothing, that identifies likely green tokens and neutralizes their logit bias.
- Experimental results on Llama-7B and OPT-6.7B demonstrate high AUC scores (>0.9) and near 0% watermark detection post-attack.
Watermark Smoothing Attacks against LLMs
The paper "Watermark Smoothing Attacks against LLMs" by Hongyan Chang, Hamed Hassani, and Reza Shokri provides a detailed examination of the robustness of statistical watermarking techniques employed in LLMs. The authors introduce a novel attack methodology termed "smoothing attacks" aimed at circumventing existing watermarking techniques without significantly degrading the quality of the generated text. This paper's investigation reveals fundamental limitations of conventional watermarking methods, especially when confronted with adversaries that possess weaker reference models.
Background and Problem Statement
Watermarking in LLMs involves embedding subtle signals within the probability distributions of text sequences to make the text attributable to a specific model. These watermarks are designed to be undetectable by human readers while remaining identifiable by automated detection methods. The two primary challenges of watermarking are maintaining text quality and preventing easy erasure of the watermark.
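For concreteness, here is a minimal sketch of the common green-list scheme such attacks target, in which a pseudorandom fraction gamma of the vocabulary (seeded by the preceding token) receives a delta logit boost at each generation step. The function name and parameter values are illustrative, not taken from the paper:

```python
import torch

def watermark_logits(logits: torch.Tensor, prev_token: int,
                     gamma: float = 0.25, delta: float = 2.0) -> torch.Tensor:
    """Boost a pseudorandom 'green list' of tokens by delta logits."""
    vocab_size = logits.shape[-1]
    # Seed the partition with the previous token so a detector can
    # reproduce the same green list later.
    gen = torch.Generator().manual_seed(int(prev_token))
    green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # the systematic shift the attack exploits
    return biased
```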
In this research, the authors stress-test the second property by introducing smoothing attacks against watermarked outputs. The adversary's goal is to recover text close to what the original, unwatermarked model would produce. The attack leverages a weaker reference model to smooth out the statistical perturbations introduced by the watermark.
Attack Framework
The attack proceeds in two phases: watermark inference and watermark smoothing.
Phase I: Watermark Inference
The authors assume an adversary with access to a weaker reference model, denoted Mref; the watermarked target model is denoted M~. The goal is to infer the "green list" of tokens the watermark favors. The key observation is that, although different models assign somewhat different probabilities to tokens, their relative rankings of tokens by likelihood generally agree. A watermark, by contrast, introduces a systematic shift favoring green tokens, so green tokens can be identified by rank differences between the two models.
By querying both models with many varied prefixes while holding fixed the context on which the watermark depends, the attack averages out model-to-model discrepancies and isolates the watermark-induced shift. The result is a watermark inference score that measures each token's relative rank shift and effectively separates green from red tokens.
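The paper defines its own scoring function; the following is only an illustrative sketch of the rank-shift idea, assuming full logit vectors are available from both models for each prefix. The plain rank difference and function names are assumptions, not the paper's exact statistic:

```python
import numpy as np

def ranks(logits: np.ndarray) -> np.ndarray:
    """Rank of each vocabulary entry by descending logit (0 = top token)."""
    order = np.argsort(-logits)
    r = np.empty_like(order)
    r[order] = np.arange(len(order))
    return r

def inference_scores(target_logits, reference_logits):
    """Mean rank shift per token across prefixes: a positive score means
    the token ranks higher under the watermarked target model than the
    reference model predicts, i.e. it is likely green."""
    shifts = [ranks(r).astype(float) - ranks(t)
              for t, r in zip(target_logits, reference_logits)]
    return np.mean(shifts, axis=0)
```

Thresholding these scores (or taking the top-scoring fraction of the vocabulary) yields the inferred green list; averaging over prefixes matters because model disagreement is roughly zero-mean while the watermark shift is consistent.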
Phase II: Watermark Smoothing
Once the green list is inferred with high confidence, the smoothing phase generates logits as a weighted average of the two models. For tokens likely to be watermarked, the target model's logits are interpolated with the reference model's, which neutralizes the watermark's perturbation while preserving most of the target model's utility in the generated text.
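A minimal sketch of the interpolation idea, assuming a boolean vocabulary mask produced in Phase I; the fixed weight alpha is an illustrative assumption, and the paper's weighting may be more refined:

```python
import torch

def smooth_logits(target_logits: torch.Tensor, reference_logits: torch.Tensor,
                  green_mask: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Interpolate target and reference logits on suspected green tokens,
    leaving all other tokens' logits unchanged."""
    out = target_logits.clone()
    out[green_mask] = ((1 - alpha) * target_logits[green_mask]
                       + alpha * reference_logits[green_mask])
    return out
```

Restricting the interpolation to suspected green tokens is what preserves quality: on the unperturbed (red) portion of the vocabulary, the stronger target model's distribution is kept intact.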
Experimental Evaluation
The authors validate their approach using Llama-7B and OPT-6.7B models on LFQA and OpenGen datasets. Key metrics include perplexity for text quality and the z-score for watermark detection strength. Comparisons are made against established paraphrasing attacks and simpler average-based attacks.
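For reference, the z-score commonly used by green-list detectors compares the observed count of green tokens in a text of length T against the gamma * T expected by chance. A sketch, with gamma denoting the green-list fraction:

```python
from math import sqrt

def detection_z_score(num_green: int, num_tokens: int,
                      gamma: float = 0.25) -> float:
    """One-proportion z-test: observed green-token count vs. the
    gamma * T count expected in unwatermarked text."""
    expected = gamma * num_tokens
    return (num_green - expected) / sqrt(num_tokens * gamma * (1 - gamma))
```

A successful smoothing attack drives this statistic below the detector's threshold, so the attacked text is classified as unwatermarked.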
Results
- Watermark Inference: The inference score, averaged over multiple prefixes, achieves high AUC values (>0.9), indicating strong separation of green from red tokens.
- Watermark Smoothing: The generated text post-attack maintains quality comparable to unwatermarked text, with significantly reduced detectability, as evidenced by lower z-scores and watermark-positive prediction rates close to 0%. Notably, the attacked text evades detection without the degraded text quality seen with naive paraphrasing.
Implications and Future Work
The findings underscore substantial vulnerabilities in current watermarking schemes. As LLMs continue to advance rapidly, such attacks highlight the need for more resilient watermarking strategies. Future directions may include:
- Enhanced Robustness: Designing watermarking methods resistant to smoothing attacks.
- Efficiency Improvements: Reducing query requirements for effective attacks, making them feasible on larger scales.
- Regulatory Measures: Implementing standard practices to ensure responsible AI deployment and usage.
This paper contributes significantly to the ongoing discourse on AI security, advocating for dynamic and resilient approaches to watermarking in the context of evolving adversarial threats. The techniques and insights provided form a robust foundation for future research addressing these critical challenges.