UWbench: Benchmark for Unbiased Watermarking
- UWbench is an open-source benchmark that rigorously evaluates unbiased watermarking methods in language models through both theoretical and empirical frameworks.
- It introduces the SPMG metric to measure distributional drift under repeated prompts, establishes finite-sample concentration guarantees for that metric, and formalizes why strict unbiasedness conflicts with reliable detection.
- UWbench employs a three-axis evaluation protocol—unbiasedness, detectability, and robustness—to standardize watermark assessment and certify resilience against adversarial token modifications.
UWbench is an open-source benchmark designed to rigorously evaluate unbiased watermarking methods for LLMs, providing both theoretical and empirical frameworks to quantify watermark effectiveness and limitations. It uniquely addresses the phenomenon of distributional drift under repeated prompts, establishes formal metrics and impossibility results, and advances a three-axis protocol for standardizing assessment of unbiasedness, detectability, and robustness. UWbench is intended as a reproducible platform for the comparative analysis of watermarking algorithms, enabling precise measurement and certification of watermark properties under varied adversarial scenarios.
1. Motivation and Overview
UWbench was introduced to resolve two core challenges in the evaluation of unbiased watermarks for AI-generated text: (1) undetected accumulation of distributional bias across multiple generations, and (2) inconsistent assessment of watermark robustness in prior studies (Wu et al., 28 Sep 2025). "Unbiased" watermarking aims to tag model output such that the token distribution remains statistically indistinguishable from the original in single generations, preserving perceptual quality and task utility. However, the repeated application of watermarking to identical prompts can cause subtle but accumulating deviations from the original distribution—a phenomenon not captured by conventional one-shot metrics. UWbench formalizes this issue, proposing new statistical metrics and protocols to quantify and compare watermarking techniques under principled and reproducible conditions.
2. Multi-Batch Drift and the SPMG Metric
To address distributional drift that arises when identical prompts are repeatedly submitted to a watermarked model (“multi-batch drift”), UWbench introduces the Single-Prompt Multi-Generation (SPMG) unbiasedness metric. Given a prompt x, let y_1, …, y_n denote n independent generations by a model M; a surrogate performance metric s(·) is computed for each generation (e.g., BLEU, ROUGE, BERTScore, perplexity). The mean per-prompt metric is:

S̄_M(x) = (1/n) · Σ_{i=1}^{n} s(y_i)

The gap between models M_0 (original) and M_1 (tested, e.g., watermarked) is:

Δ(x) = S̄_{M_1}(x) − S̄_{M_0}(x)

Natural sampling noise is controlled by comparing M_0 to an independent clone M_0′ of itself. The variance-controlled detection statistic is:

T(x) = |S̄_{M_1}(x) − S̄_{M_0}(x)| − |S̄_{M_0′}(x) − S̄_{M_0}(x)|
This statistic isolates drift strictly attributable to watermarking. UWbench’s use of the SPMG metric accounts for repeated-sampling effects and establishes finite-sample concentration guarantees via McDiarmid’s inequality.
Metric | Formula | Purpose |
---|---|---|
SPMG gap | Δ(x) = S̄_{M_1}(x) − S̄_{M_0}(x) | Multi-batch unbiasedness |
Controlled statistic | T(x) = \|Δ(x)\| − \|S̄_{M_0′}(x) − S̄_{M_0}(x)\| | Drift due to watermarking |
The SPMG metric allows empirical comparison of unbiasedness over multiple generations, thereby revealing subtle degradation not exposed by typical one-shot distribution matching.
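The SPMG computation above can be sketched in a few lines. This is an illustrative implementation under assumed notation (the function names and toy scores are ours, not the benchmark's API): per-model means of a surrogate metric over n generations, with the clone-vs-original gap subtracted out as a sampling-noise control.

```python
import statistics

def spmg_gap(scores_a, scores_b):
    # Mean surrogate score (e.g. BLEU) per model over n generations,
    # then the difference of means between the two models.
    return statistics.mean(scores_a) - statistics.mean(scores_b)

def controlled_statistic(scores_m1, scores_m0, scores_m0_clone):
    # Subtract the clone-vs-original gap (pure sampling noise) from the
    # watermarked-vs-original gap, isolating drift due to watermarking.
    return abs(spmg_gap(scores_m1, scores_m0)) - abs(spmg_gap(scores_m0_clone, scores_m0))

# Toy surrogate scores over n = 5 generations per model for one prompt.
m0       = [0.61, 0.63, 0.60, 0.62, 0.64]  # original model M0
m0_clone = [0.62, 0.61, 0.63, 0.60, 0.63]  # independent clone M0'
m1       = [0.55, 0.54, 0.57, 0.53, 0.56]  # watermarked model M1

print(round(controlled_statistic(m1, m0, m0_clone), 3))
```

Because the clone gap is small here while the watermarked gap is not, the controlled statistic stays close to the raw gap, attributing the drift to the watermark rather than to sampling noise.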
3. Impossibility Theorem for Unbiased Watermarking
UWbench provides a formal impossibility result establishing that no watermarking scheme can simultaneously maintain strict distributional fidelity and reliable detectability under repeated queries with a fixed watermark key. The theorem (quoted as “Unbiasedness breaks under repeated prompts”) asserts that—if detectability is possible via statistical tests—then, under infinite queries, unbiased watermarking will necessarily incur distribution drift. Specifically, the two conditions are irreconcilable:
- Preservation: For any prompt and fixed key, the watermarked output distribution matches the original LM distribution for all single queries.
- Detectability: There exists a test reliably differentiating watermarked and unwatermarked output.
Repeated application with the same key violates strict unbiasedness, motivating the need for metrics capable of quantifying this drift (such as SPMG). This suggests the long-term indistinguishability goal is fundamentally unattainable in any practically detectable watermark.
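The mechanism behind the theorem can be seen in a toy construction (ours, not the paper's): a Gumbel-trick watermark that is exactly unbiased per query when the key is fresh, but whose output is deterministic for a fixed key, so repeated queries on the same prompt collapse the empirical distribution to a point mass.

```python
import random
import collections

vocab = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]

def watermarked_sample(key: int) -> str:
    # Exponential-minimum / Gumbel-max trick: with key-seeded uniforms
    # u_i, argmax of u_i ** (1 / p_i) is distributed exactly as `probs`
    # when the key is fresh, i.e. the scheme is unbiased per query.
    rng = random.Random(key)
    us = [rng.random() for _ in vocab]
    scores = [u ** (1.0 / p) for u, p in zip(us, probs)]
    return vocab[scores.index(max(scores))]

# Fresh key per query: empirical distribution tracks `probs`.
fresh = collections.Counter(watermarked_sample(k) for k in range(20000))

# Fixed key, repeated queries: every generation is identical, so the
# multi-batch distribution drifts maximally from the model distribution.
fixed = collections.Counter(watermarked_sample(42) for _ in range(20000))

print({t: round(fresh[t] / 20000, 2) for t in vocab})  # ≈ {'a': 0.5, 'b': 0.3, 'c': 0.2}
print(dict(fixed))  # a single token with count 20000
```

The same key-reuse that makes the watermark detectable is what breaks unbiasedness across batches, which is exactly the tension the theorem formalizes.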
4. Robustness Assessment Against Adversarial Token Modifications
UWbench formalizes the analysis of robustness with respect to adversarial token-level modifications. The adversary is modeled as an “edit-bounded” attacker permitted to substitute, insert, or delete up to k tokens. The notion of a token “effect region” quantifies, for each token at position i, the set of detector scores that include position i in their context (e.g., n-gram windows):

E(i) = { t : token i lies in the context window of the detector score s_t }

For practical detectors (keyed by n-gram prefixes of length h), |E(i)| ≤ h.

If s_t is the detector score for token t and T = Σ_t s_t is the aggregate detection statistic, a single edit alters T by at most 2Bh, where B bounds |s_t| and h is the context-window length. For k edits:

|T(y) − T(y′)| ≤ 2Bhk
This bound enables watermarks to be rigorously certified as resilient against a quantified budget of token changes.
UWbench empirically compares paraphrasing-based attacks (e.g., using DIPPER paraphraser) and direct token modification attacks, demonstrating that the latter yield more stable and reproducible robustness evaluation, whereas paraphrasing induces high variance and inconsistent assessments.
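The certified bound lends itself to a direct check. The sketch below (function names and example numbers are ours) asks whether a detection decision survives any k-edit adversary, using the worst-case shift 2·B·h·k from the bound above.

```python
def worst_case_shift(B: float, h: int, k: int) -> float:
    # Each edited token touches at most h detector scores, and each
    # touched score can move by at most 2B (from -B to +B), so k edits
    # shift the aggregate statistic T by at most 2 * B * h * k.
    return 2.0 * B * h * k

def certified_detection(scores: list, B: float, h: int, k: int,
                        threshold: float) -> bool:
    # Detection is certified against any k-edit adversary if even the
    # worst-case perturbed statistic still clears the threshold.
    T = sum(scores)
    return T - worst_case_shift(B, h, k) > threshold

# Example: 200 tokens with average score 0.8, |score| <= 1, a 4-gram
# detector (h = 4), and a detection threshold of 100.
scores = [0.8] * 200  # aggregate statistic T = 160
print(certified_detection(scores, B=1.0, h=4, k=5, threshold=100.0))   # True:  160 - 40 > 100
print(certified_detection(scores, B=1.0, h=4, k=10, threshold=100.0))  # False: 160 - 80 < 100
```

Note the certificate holds against the worst case in the edit budget, which is why token-edit adversaries yield reproducible robustness numbers: no search over paraphrases is needed.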
5. Three-Axis Evaluation Protocol
UWbench standardizes watermark evaluation along three principal axes:
- Unbiasedness: Assessed using both one-shot metrics and the SPMG multi-batch metric. Conventional outputs such as BLEU, ROUGE, BERTScore, and perplexity are computed for both original and watermarked texts, as are multi-generation gaps.
- Detectability: Determined through statistical hypothesis testing between watermarked and unwatermarked outputs. Reported measures include true positive rate (TPR) at fixed false positive rates (FPR), median p-values, and AUROC.
- Robustness: Evaluated under adversarial modification, comparing the stability of detection under paraphrasing and random token edit attacks. Results indicate that token-level modification adversaries provide more consistent and reliable robustness estimates than paraphrasing adversaries.
Axis | Metric Examples | Assessment Mode |
---|---|---|
Unbiasedness | BLEU, ROUGE, BERTScore, Perplexity, SPMG gap | One-shot & Multi-batch |
Detectability | TPR, FPR, p-values, AUROC | Hypothesis Tests |
Robustness | Test statistic delta, certified bounds | Token mod/paraphrase attacks |
Certified robustness bounds and reproducible metrics together support standardized, transparent comparison between watermarking approaches.
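The detectability measures on the second axis are standard and can be computed from raw detector scores in plain Python. A minimal sketch (function names and toy scores are ours): TPR at a fixed FPR via a threshold calibrated on negatives, and AUROC via the normalized Mann-Whitney statistic.

```python
def tpr_at_fpr(pos, neg, target_fpr):
    # Choose the threshold so that at most target_fpr of the negative
    # (unwatermarked) scores exceed it, then measure positive recall.
    neg_sorted = sorted(neg, reverse=True)
    cutoff = int(target_fpr * len(neg))
    threshold = neg_sorted[cutoff] if cutoff < len(neg) else float("-inf")
    return sum(s > threshold for s in pos) / len(pos)

def auroc(pos, neg):
    # Probability that a random positive outscores a random negative,
    # counting ties as half (normalized Mann-Whitney U statistic).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

pos = [4.1, 3.8, 5.0, 2.9, 4.4]   # detector scores on watermarked text
neg = [0.2, 1.1, -0.3, 0.9, 0.5]  # detector scores on unwatermarked text
print(auroc(pos, neg))            # 1.0: perfect separation in this toy case
print(tpr_at_fpr(pos, neg, 0.0))  # TPR at 0% FPR
```

Reporting TPR at a fixed low FPR, rather than accuracy alone, matches the benchmark's emphasis on controlling false accusations of watermarked provenance.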
6. Technical Integration and Implications
UWbench’s empirical and theoretical protocol provides a foundation for rigorous comparative studies of watermarking schemes. Its principled metric design exposes key trade-offs between unbiasedness, detectability, and robustness, and the impossibility theorem demonstrates the fundamental limitation inherent to unbiased watermarking with fixed keys under repeated queries. A plausible implication is that future watermark designs may need to incorporate key rotation or other mechanisms to manage drift.
By providing certified robustness guarantees, UWbench enables practitioners to objectively quantify resilience to adversarial modification. Its adoption marks a shift toward standardized reporting and evaluation of LLM watermarking, directing the community toward reproducible and interpretable benchmarks that can guide ongoing algorithmic improvements and deployment considerations.