On the Reliability of Watermarks for LLMs
The paper "On the Reliability of Watermarks for LLMs" examines how reliably watermarking can distinguish machine-generated text from human-written text, a question that grows more pressing as LLMs proliferate. The authors position watermarking as a way to curb the spread of low-value and potentially harmful LLM-generated content, especially as these models find widespread application, and emphasize the need for rigorous detection mechanisms.
Core Contributions and Research Approach
Significant contributions of this paper include a thorough evaluation of the robustness of watermarked text against various forms of paraphrasing and human intervention. The research applies a comprehensive evaluation framework to assess watermark effectiveness across diverse scenarios, from human paraphrasing to machine paraphrasing with models like GPT-3.5-turbo and DIPPER, as well as synthetic copy-paste attacks in which watermarked text segments are embedded within larger documents.
Key components of the watermarking strategy employed in this paper include the green/red list mechanism, in which tokens in the generator's vocabulary are pseudo-randomly colored green or red based on a hash of the preceding context, and sampling is softly biased toward green tokens. The paper advances this methodology by introducing hashing variants such as the SelfHash and LeftHash schemes, optimizing watermark detection for realistic scenarios.
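The green/red mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the constants VOCAB_SIZE, GAMMA, and DELTA and the single-previous-token seeding are simplifying assumptions:

```python
import random

VOCAB_SIZE = 50_000   # hypothetical vocabulary size
GAMMA = 0.25          # fraction of tokens colored green
DELTA = 2.0           # logit boost applied to green tokens

def green_list(prev_token: int, vocab_size: int = VOCAB_SIZE) -> set:
    """LeftHash-style rule (simplified): seed a PRNG with the previous
    token id and pseudo-randomly select GAMMA * |V| green token ids."""
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(GAMMA * vocab_size)))

def bias_logits(logits: list, prev_token: int) -> list:
    """Add DELTA to the logits of green tokens before sampling,
    softly favoring the green list in the model's output."""
    greens = green_list(prev_token, len(logits))
    return [x + DELTA if i in greens else x for i, x in enumerate(logits)]
```

A production implementation would hash the context with a keyed pseudo-random function rather than seeding `random.Random` directly, but the structure, partition the vocabulary and then boost green logits, is the same.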
Empirical Findings and Theoretical Implications
The paper reveals that watermarks are resilient, remaining detectable even though paraphrasing attacks generally dilute watermark strength. For instance, watermarked text attacked by state-of-the-art paraphrasing models remained detectable with an ROC-AUC exceeding 0.85 once at least 200 tokens were available for analysis. Moreover, even under substantial human paraphrasing, acknowledged as the strongest attack, the watermark was detected on average after observing about 800 tokens. These empirical results substantiate claims about watermarking's resilience across attack scenarios and showcase its practical potential through the efficacy of statistical tools like the z-test.
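The z-test underlying detection can be sketched as a one-proportion test on the count of green tokens (an illustrative sketch; the function name and the default gamma of 0.25 are assumptions, not values fixed by the paper):

```python
import math

GAMMA = 0.25  # fraction of the vocabulary colored green (null-hypothesis rate)

def z_score(green_count: int, total_tokens: int, gamma: float = GAMMA) -> float:
    """One-proportion z-test: how many standard deviations the observed
    number of green tokens lies above what unwatermarked text would show."""
    expected = gamma * total_tokens
    variance = total_tokens * gamma * (1 - gamma)
    return (green_count - expected) / math.sqrt(variance)

# Human text hovers near z = 0; watermarked text pushes z well above a
# detection threshold such as z = 4.
```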
Furthermore, the paper frames watermark detection in terms of sample complexity, stressing the volume of text needed to achieve reliable detection. Watermarking's distinguishing trait is that detection confidence demonstrably scales with the length of observed text, a vital consideration for future applications.
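This scaling is easy to see in the z-statistic: for a fixed excess green fraction, the z-score grows with the square root of the number of observed tokens (a toy calculation under assumed rates, not figures from the paper):

```python
import math

def z_for_length(T: int, green_rate: float = 0.5, gamma: float = 0.25) -> float:
    """z-score after observing T tokens when a fraction green_rate of them
    are green, against a null-hypothesis green fraction of gamma."""
    return (green_rate - gamma) * math.sqrt(T / (gamma * (1 - gamma)))

# Quadrupling the observed length doubles the z-score, so longer texts
# yield strictly more confident detection.
```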
Comparative Analysis with Other Detection Methods
Comparative analysis positions watermarking against other detection paradigms, such as retrieval systems built on semantic similarity and zero-shot detectors like DetectGPT. Under attack, watermarking emerges as a competitive, if not superior, alternative owing to its favorable sample efficiency and intrinsic reliability. DetectGPT's comparative difficulty in maintaining detection reliability on long, paraphrased texts suggests that post-hoc methods face scaling issues that statistical watermarking avoids.
Future Prospects in AI Text Generation
Anticipating future trajectories in AI, the paper argues that watermarking could serve as a bulwark against the unregulated dissemination of machine-generated text, encouraging model owners to integrate transparent, proactive detection mechanisms. While theoretical challenges remain, such as adversaries in white-box settings who strip the watermark with strong paraphrasers, the research highlights watermarking's practical viability and argues for its deployment across a spectrum of language generation contexts. The potential to identify machine-generated content retrospectively in archival applications further underscores its utility.
Ultimately, this paper provides a substantive exploration of watermarking as an essential method for managing the spread of LLM-generated text across digital ecosystems. As AI continues to evolve, refining and adopting comprehensive watermarking strategies could play a pivotal role in preserving digital content integrity and maintaining the separation between human-written and machine-generated narratives.