
Robust Distortion-free Watermarks for Language Models (2307.15593v3)

Published 28 Jul 2023 in cs.LG, cs.CL, and cs.CR

Abstract: We propose a methodology for planting watermarks in text from an autoregressive LLM that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the LLM. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three LLMs -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50\%$ of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.

Robust Distortion-free Watermarks for LLMs

The paper "Robust Distortion-free Watermarks for LLMs" addresses a pertinent issue in the domain of LLMs, specifically focusing on the need for detecting and attributing the provenance of generated text. This work introduces a methodology to plant robust watermarks in text generated by autoregressive LLMs such that it withstands various perturbations while maintaining a distortion-free distribution over the text according to a predetermined generation budget.

At its core, the paper proposes a systematic process for embedding these watermarks by mapping a sequence of random numbers, computed from a randomized watermark key, to samples from an LLM. Any party with access to the key can detect the watermark by aligning the text with the random number sequence used to generate it.
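
To make this concrete, here is a minimal sketch (our illustration, not the authors' reference implementation): the key seeds a pseudorandom generator that yields the shared sequence $\xi_1, \dots, \xi_n$, and a distortion-free decoder maps each pair of (model distribution, $\xi_t$) to a token. The names `next_token_probs` and `decode` are hypothetical stand-ins for the model and for the sampling rules sketched after the next paragraph.

```python
import numpy as np

def key_to_randomness(key: int, n: int, vocab_size: int) -> np.ndarray:
    """Derive the shared sequence xi_1, ..., xi_n of uniform random
    vectors from the watermark key, here simply by seeding a PRNG."""
    rng = np.random.default_rng(key)
    return rng.random((n, vocab_size))

def generate(next_token_probs, decode, key: int, n: int, vocab_size: int) -> list:
    """Decode n tokens deterministically: each step maps the pair
    (model distribution p_t, random vector xi_t) to a token via a
    distortion-free sampling rule."""
    xi = key_to_randomness(key, n, vocab_size)
    tokens = []
    for t in range(n):
        p = next_token_probs(tokens)     # next-token distribution from the LLM
        tokens.append(decode(p, xi[t]))  # e.g. exponential minimum sampling
    return tokens
```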

The researchers instantiate their watermarking methodology using two distinct sampling techniques: inverse transform sampling and exponential minimum sampling. These techniques have been applied to three LLMs—OPT-1.3B, LLaMA-7B, and Alpaca-7B—to validate the statistical power and resilience of the watermarks against paraphrasing attacks.
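
As a rough sketch of the two sampling rules (simplified from the paper's constructions): both produce exact draws from the model distribution $p$ when the randomness is uniform, which is what makes the watermark distortion-free.

```python
import numpy as np

def exp_min_sample(p: np.ndarray, xi: np.ndarray) -> int:
    """Exponential minimum sampling: with xi uniform on [0,1]^V,
    -log(xi[i]) is Exp(1)-distributed, so argmin_i -log(xi[i]) / p[i]
    is an exact draw from p (a Gumbel-trick variant). The watermark
    signal is that the chosen token's xi value tends to be near 1."""
    eps = 1e-12                       # guard against log(0) / divide-by-zero
    return int(np.argmin(-np.log(xi + eps) / (p + eps)))

def inverse_transform_sample(p: np.ndarray, u: float, perm: np.ndarray) -> int:
    """Inverse transform sampling: order the vocabulary by a key-derived
    random permutation, then return the token whose CDF interval contains
    the uniform draw u; again an exact draw from p."""
    cdf = np.cumsum(p[perm])          # CDF under the permuted ordering
    idx = min(int(np.searchsorted(cdf, u)), len(p) - 1)
    return int(perm[idx])
```

Note that for the inverse transform scheme the per-step randomness is a scalar $u_t$ together with a key-derived permutation of the vocabulary rather than a full vector, so the `decode` interface in the earlier sketch would change accordingly.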

Notable empirical results include reliable detection of watermarked text ($p \leq 0.01$) from as few as 35 tokens for the OPT-1.3B and LLaMA-7B models, even after corrupting 40-50% of the tokens through random edits such as substitutions, insertions, or deletions. The Alpaca-7B model behaves differently because its responses to typical user instructions have lower entropy: only around 25% of responses, with a median length of about 100 tokens, are detectable at the same significance level.
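
For intuition, here is a heavily simplified detection sketch for the exponential scheme (again our own illustration): the statistic measures how close the observed tokens' $\xi$ coordinates are to 1, and an empirical $p$-value is obtained by comparison against fresh random keys. The paper's actual test additionally aligns the text against the key sequence using an edit-distance-style cost, which is what confers robustness to insertions and deletions; that alignment is omitted here.

```python
import numpy as np

def test_statistic(tokens: list, xi: np.ndarray) -> float:
    """Score for the exponential scheme: watermarked tokens tend to land
    where their xi coordinate is near 1, so -log(1 - xi) is large."""
    eps = 1e-12
    picked = xi[np.arange(len(tokens)), tokens]  # xi value of each observed token
    return float(np.mean(-np.log(1.0 - picked + eps)))

def detect(tokens: list, key: int, vocab_size: int,
           n_null: int = 1000, seed: int = 0) -> float:
    """Empirical p-value: compare the statistic under the true key with its
    distribution under fresh random keys (a permutation-style test)."""
    n = len(tokens)
    # Same key-to-randomness derivation as in the generation sketch above.
    xi_key = np.random.default_rng(key).random((n, vocab_size))
    observed = test_statistic(tokens, xi_key)
    rng = np.random.default_rng(seed)
    null_stats = [test_statistic(tokens, rng.random((n, vocab_size)))
                  for _ in range(n_null)]
    exceed = sum(s >= observed for s in null_stats)
    return (1 + exceed) / (n_null + 1)
```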

This research has significant implications, both theoretical and practical. Theoretically, it advances the discourse on content attribution for AI-generated text, pointing toward reliable forensic tools for verifying provenance. Practically, it gives platform moderators, educators, and LLM providers a means to monitor, control, and potentially mitigate the misuse of AI-generated text. While further exploration of these watermarking techniques could yield improved or new mechanisms, the current methods provide a foundation for establishing the authenticity and origin of content in digital settings shaped by LLMs.

Looking ahead, this paper opens avenues for embedding watermarks in AI models without affecting their overall performance. Moreover, its emphasis on distortion-free generation and robust authenticity checks paves the way for wider adoption in real-time applications across industries affected by AI content generation, such as journalism, academia, and digital content creation.

In conclusion, "Robust Distortion-free Watermarks for LLMs" presents a novel approach to embedding and detecting robust watermarks in LLM-generated content, helping mitigate potential misuse without compromising text quality. The paper is a stepping stone toward more reliable solutions for AI content verification, which is critical given today's growing reliance on artificial intelligence for text generation.

Authors (4)
  1. Rohith Kuditipudi (11 papers)
  2. John Thickstun (21 papers)
  3. Tatsunori Hashimoto (80 papers)
  4. Percy Liang (239 papers)
Citations (122)