Watermark Detection in LLMs
- Watermark detection in LLMs is a technique that embeds statistically verifiable signals into generated text to distinguish machine outputs from human-created content.
- Methods include white-box logit-manipulation, black-box post-hoc synonym substitution, and frequency-based signal detection to achieve high detection reliability.
- Applications span content provenance, copyright protection, and regulatory compliance, while challenges focus on managing low entropy, human edits, and adversarial attacks.
Watermark detection in LLMs encompasses a family of algorithmic and statistical methods for attributing, verifying, and tracing the origin of generated text by embedding and recognizing subtle, algorithmically verifiable patterns. These techniques are foundational for model accountability, content provenance, copyright protection, and mitigation of misuse in both proprietary and open-source LLMs. The field now includes strategies tailored to inference-time postprocessing, model-agnostic API use, fine-tuned open-source models, downstream mixture detection, and defense against sophisticated attacks and human edits.
1. Core Principles and Frameworks
Watermarking in the context of LLMs refers to the deliberate embedding of statistically detectable (yet semantically imperceptible) signals into generated text. These signals enable downstream detection methods to distinguish machine-generated outputs from human-created content with high reliability. Detectability relies on provable differences in the statistical distribution of tokens or features relative to the null distribution of natural language.
A mathematically rigorous framework for watermarking is based on formal hypothesis testing: given text $y_{1:n}$ and auxiliary variables $\zeta$ (such as the pseudorandom sequence used in watermark injection), the detection task is to distinguish between

$H_0$: $y_{1:n}$ is independent of $\zeta$ (no watermark)

versus

$H_1$: $y_{1:n}$ was generated using $\zeta$ (watermarked).
Optimal detector design is formulated as minimizing Type II error (missed detection) under constraints on Type I error (false alarm) and text distortion, with joint optimization over watermark injection and detection (He et al., 3 Oct 2024). This supports both distribution-adaptive watermarking and robust detection under post-editing and adversarial attacks.
2. Methodologies: White-box, Black-box, and Feature-Based Approaches
2.1. White-box and Logit-manipulation Schemes
Canonical methods such as the Kirchenbauer-style red-green watermark and its derivatives operate by manipulating next-token probabilities during decoding. At each generation step, the vocabulary is pseudo-randomly partitioned into favored (“green”) and unfavored (“red”) token lists (seeded by an $n$-gram context and a secret key), with green tokens’ logits receiving an additive bias $\delta$, boosting their selection probability (Tang et al., 2023, He et al., 3 Oct 2024).
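A minimal generation-side sketch of a red-green scheme follows. The SHA-256 seeding, helper names, and parameter defaults are illustrative choices, not taken from the cited papers; a real implementation would hook into the decoder's logits directly.

```python
import hashlib

import numpy as np

def green_list(prev_token, vocab_size, key, gamma=0.5):
    """Pseudo-randomly partition the vocabulary using the previous token and a
    secret key; return the indices of the favored ("green") tokens."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16) % (2**32)
    perm = np.random.default_rng(seed).permutation(vocab_size)
    return perm[: int(gamma * vocab_size)]

def watermarked_sample(logits, prev_token, key, gamma=0.5, delta=2.0, rng=None):
    """Add bias delta to green-token logits, then sample from the softmax."""
    rng = rng or np.random.default_rng()
    biased = np.asarray(logits, dtype=float).copy()
    biased[green_list(prev_token, len(biased), key, gamma)] += delta
    p = np.exp(biased - biased.max())
    p /= p.sum()
    return int(rng.choice(len(biased), p=p))
```

With a large `delta`, green tokens dominate the sample; with `delta = 0` the scheme degenerates to ordinary sampling, which is the distortion trade-off the bias controls.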
Detection most often involves counting the number of green tokens and performing a binomial-based z-test:

$$z = \frac{s_G - \gamma T}{\sqrt{T\,\gamma(1-\gamma)}},$$

where $s_G$ is the count of green tokens, $T$ is the sequence length, and $\gamma$ is the green-list fraction (Mao et al., 23 May 2024, Yang et al., 2023). If $z$ exceeds a significance threshold, a watermark is inferred.
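The detection side reduces to a few lines; this sketch assumes the green count has already been recomputed with the secret key, and the threshold of 4 is an illustrative default, not a value from the cited papers.

```python
import math

def green_z_score(green_count, seq_len, gamma):
    """One-sided z-test: is the green fraction higher than the null rate gamma?"""
    expected = gamma * seq_len
    std = math.sqrt(seq_len * gamma * (1.0 - gamma))
    return (green_count - expected) / std

def is_watermarked(green_count, seq_len, gamma=0.5, z_threshold=4.0):
    """Declare a watermark when the z-score clears the significance threshold."""
    return green_z_score(green_count, seq_len, gamma) > z_threshold
```

For example, 130 green tokens out of 200 at gamma = 0.5 gives z = 30 / sqrt(50), about 4.24, which clears the threshold, while 105 out of 200 (z about 0.71) does not.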
2.2. Black-box and Post-hoc Techniques
In scenarios with only black-box API access (e.g., third-party app developers), embedding must occur post-generation. The synonym-substitution post-hoc watermark (Yang et al., 2023) uses a binary encoding function, derived from a keyed string hash, to assign a Bernoulli(1/2) bit to each eligible token. Words encoding bit-0 are replaced with context-preserving synonyms that flip their code to bit-1, guided by context-aware similarity metrics (RoBERTa, GloVe, BERT). Detection relies on binomial-proportion testing of the observed bit-1 frequency.
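The bit encoding and its detection test can be sketched as follows; the SHA-256-based bit function and the function names are illustrative assumptions, standing in for the paper's actual encoding.

```python
import hashlib
import math

def token_bit(word, key):
    """Pseudorandom Bernoulli(1/2) bit per word, derived from a keyed string hash."""
    digest = hashlib.sha256(f"{key}:{word.lower()}".encode()).digest()
    return digest[0] & 1

def bit1_z_score(words, key):
    """Watermarked text over-represents bit-1 words: test the observed bit-1
    count against the Binomial(n, 1/2) null via a normal approximation."""
    bits = [token_bit(w, key) for w in words]
    n, ones = len(bits), sum(bits)
    return (ones - 0.5 * n) / math.sqrt(0.25 * n)
```

In unmarked text the bits are balanced and the z-score stays near zero; after substitution pushes eligible words toward bit-1, the score grows roughly with the square root of the number of encoded words.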
Another class leverages feature-based rejection sampling. Here, inference-time candidate outputs are scored via deterministic features extracted by sparse autoencoders (SAEs), and only candidates aligning with key-derived target statistics are selected (SAEMark (Yu et al., 11 Aug 2025)). This logit-free paradigm allows multi-bit watermark embedding even for closed-source or multilingual API endpoints.
2.3. Signal/Periodic Watermarks
Frequency-based schemes embed periodic signals into the sequence of token ranks selected during sampling. For each generation step, a pre-defined (e.g., sinusoidal (Xu et al., 9 Oct 2024) or periodic discrete (Xu et al., 9 Oct 2024)) function dictates the rank to select among sorted candidate tokens. Detection is performed by reconstructing the sequence of ranks and applying a Fast Fourier Transform or Short-Time Fourier Transform (STFT) to reveal a frequency-domain signature. These methods provide robust, sentence-level detection and strong AUC under adversarial modifications, with minimal impact on text fluency.
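A toy version of the embed-and-detect loop is below. For brevity it uses a discrete square-wave rank pattern and compares the FFT magnitude at the watermark frequency against the mean magnitude; the period, depth, and detection ratio are illustrative assumptions, and a real scheme would select the rank among top-k candidates sorted by model probability.

```python
import numpy as np

def embed_ranks(num_steps, period=8, depth=4):
    """Choose the sampling rank at each step from a discrete periodic pattern
    (square wave alternating between rank 0 and rank `depth`)."""
    return [depth * (t % period < period // 2) for t in range(num_steps)]

def detect_period(ranks, period=8):
    """Ratio of the FFT magnitude at the watermark frequency bin to the mean
    magnitude; large values indicate the embedded periodic signature."""
    x = np.asarray(ranks, dtype=float)
    x = x - x.mean()                      # remove DC component
    mag = np.abs(np.fft.rfft(x))
    target_bin = round(len(x) / period)   # frequency bin for the embedded period
    return mag[target_bin] / (mag.mean() + 1e-12)
```

Because the signature lives in the frequency domain, detection degrades gracefully under local token edits, which perturb the rank sequence but rarely erase the dominant spectral peak.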
2.4. Dataset and Training Data Watermarks
Lexical substitution watermarking for training data (LexiMark (German et al., 17 Jun 2025)) embeds synonym substitutions among high-entropy words, leveraging the propensity of LLMs to memorize rare terms. This confers resilience against removal and stealth, enhancing membership inference accuracy for dataset verification.
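The substitution step can be sketched as below. The frequency table, synonym map, cutoff, and keyed-choice rule are all hypothetical stand-ins: a real system would use corpus statistics and a thesaurus, and the rarity cutoff is only a proxy for selecting higher-entropy terms.

```python
import hashlib

# Illustrative stand-ins; a real system would use corpus frequencies and a thesaurus.
WORD_FREQ = {"use": 0.01, "utilize": 0.0001, "big": 0.02, "sizable": 0.0002}
SYNONYMS = {"use": ["utilize"], "big": ["sizable"]}

def keyed_choice(word, options, key):
    """Deterministically pick a substitute using a keyed hash of the word."""
    h = int(hashlib.sha256(f"{key}:{word}".encode()).hexdigest(), 16)
    return options[h % len(options)]

def watermark_text(words, key, rarity_cutoff=0.005):
    """Replace common words with rarer synonyms (a proxy for higher-entropy
    terms), which a trained model is more likely to memorize verbatim."""
    out = []
    for w in words:
        syns = [s for s in SYNONYMS.get(w, []) if WORD_FREQ.get(s, 0) < rarity_cutoff]
        out.append(keyed_choice(w, syns, key) if syns else w)
    return out
```

Verification then amounts to a membership-inference query: a model trained on the watermarked data assigns anomalously high likelihood to the rare substituted terms.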
3. Detection Algorithms and Goodness-of-Fit Testing
Detection algorithms center on statistical hypothesis testing using pivotal statistics derived from the text and the watermarking scheme. The following approaches are prominent:
- Binomial/z-tests: For red-green schemes and their variants, the count of “watermark tokens” is tested against a binomial null (Yang et al., 2023, Mao et al., 23 May 2024).
- Likelihood ratio and log-probability statistics: Used to discern text generated via mark-biased decoders.
- Permutation/compatibility tests: For schemes based on fixed ordered lists or seed permutations (e.g., Kuditipudi, Bahri methods), permutation tests or compatibility scores are calculated (Bahri et al., 18 Aug 2025).
Recently, goodness-of-fit (GoF) tests have been advanced as general detectors (He et al., 4 Oct 2025, Li et al., 21 Nov 2024). Given the pivotal statistics $Y_1, \dots, Y_n$ computed from the text, classic tests such as Kolmogorov-Smirnov, Anderson-Darling, Cramér–von Mises, and truncated $\phi$-divergence compare the empirical CDF $F_n$ against the theoretical null CDF $F_0$. The Tr-GoF statistic (Li et al., 21 Nov 2024) takes the supremum of a truncated $\phi$-divergence between $F_n$ and $F_0$ over a restricted quantile range, which makes it sensitive to sparse departures from the null.
Empirical results demonstrate that GoF tests not only outperform sum-based detectors across various watermarking schemes (inverse transform, Gumbel-max, SynthID) and models, but also provide robustness to post-editing, including repetition artifacts in low-entropy settings.
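A minimal GoF-style detector can be sketched with the Kolmogorov-Smirnov distance, assuming (as in common schemes such as Gumbel-max) that the pivotal statistics are Uniform(0,1) under the null; the 1.36/sqrt(n) threshold is the standard asymptotic 5% KS critical value, and the function names are illustrative.

```python
import math

def ks_statistic(pivots):
    """Kolmogorov-Smirnov distance between the empirical CDF of the pivotal
    statistics and the Uniform(0,1) null CDF."""
    xs = sorted(pivots)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(x - i / n)) for i, x in enumerate(xs))

def gof_detect(pivots, threshold=None):
    """Reject H0 (human text) when the KS distance exceeds ~1.36/sqrt(n)."""
    n = len(pivots)
    if threshold is None:
        threshold = 1.36 / math.sqrt(n)
    return ks_statistic(pivots) > threshold
```

Unlike a sum-based detector, which only sees the mean of the pivotal statistics, a GoF test reacts to any distributional distortion, which is what buys robustness when only part of the text carries the watermark.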
4. Robustness to Attacks and Post-editing
Robust watermark detection requires resilience to text-level transformations:
- Insertion, deletion, synonym substitution: Token-based watermark detectors sustain detection power if only a sublinear fraction of tokens is altered (Mao et al., 23 May 2024).
- Paraphrasing and polishing: Watermarks degrade only when semantic meaning is significantly disrupted (Yang et al., 2023, Xu et al., 9 Oct 2024).
- Human edits (mixture models): Mixture models for human-edited text (Tr-GoF (Li et al., 21 Nov 2024)) address cases where only a fraction of the pivotal statistics remain watermarked. The test requires no knowledge of the precise edit level or of the LLM's specifics, yet achieves the optimal detection boundary in an asymptotic regime parameterized by the per-token signal strength and the fraction of tokens left unmodified.
- Adversarial fine-tuning: For open-source models, additive parameter perturbation watermarking (WAPITI (Chen et al., 9 Oct 2024)) ties watermark removal to simultaneous task performance degradation.
Ensemble watermarking (combining red-green, acrostics, and sensorimotor norm signals) further boosts robustness. After paraphrasing, detection via the ensemble remains as high as 95% compared to 49% for a single-feature watermark (Niess et al., 29 Nov 2024).
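The sublinear-edit claim above can be made concrete with a back-of-the-envelope calculation; this idealized sketch assumes a fully watermarked text (every unedited token is green) and that edited tokens land on the green list only at the null rate gamma.

```python
import math

def z_after_edits(seq_len, gamma, edit_frac):
    """Expected z-score of an (idealized) fully watermarked text after a
    fraction edit_frac of tokens is replaced; replaced tokens are green only
    at the null rate gamma."""
    green = (1 - edit_frac) * seq_len + edit_frac * gamma * seq_len
    return (green - gamma * seq_len) / math.sqrt(seq_len * gamma * (1 - gamma))
```

At 400 tokens and gamma = 0.25, even replacing half the tokens leaves an expected z-score around 17, far above any reasonable threshold; only near-total rewriting (edit_frac close to 1) drives the statistic to zero.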
5. Hybrid and Adaptive Detection
Single-source watermark detectors weaken under low-entropy prompts or post-trained (RLHF, instruction-tuned) models (Bahri et al., 18 Aug 2025). Hybrid schemes combine watermark-specific statistics with non-watermark detectors based on log-likelihood, prompt likelihood ratios, or learned classifiers (e.g., RoBERTa). Cascaded and logistic-regression hybrids achieve consistent accuracy gains (5–8 percentage points in some regimes) by leveraging complementary information, and they amortize computational cost by deferring the more expensive detector to ambiguous cases.
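The cascaded variant can be sketched as a two-stage rule; the band boundaries and threshold values here are illustrative assumptions, and `expensive_score_fn` stands in for any costly second-stage detector such as a learned classifier.

```python
def cascaded_detect(z_wm, expensive_score_fn, text,
                    hi=4.0, lo=0.5, clf_threshold=0.5):
    """Two-stage hybrid: trust the cheap watermark z-score when it is decisive,
    and defer to an expensive detector only in the ambiguous band (lo, hi)."""
    if z_wm >= hi:
        return True      # strong watermark evidence, no second stage needed
    if z_wm <= lo:
        return False     # clearly unwatermarked
    return expensive_score_fn(text) >= clf_threshold
```

Because most inputs fall outside the ambiguous band, the expensive detector runs on only a small fraction of queries, which is what amortizes its cost.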
6. Applications, Implications, and Limitations
Applications
- Provenance and regulatory compliance: Enables content attribution, accountability, and compliance with labeling requirements in both proprietary and open-source LLM contexts (Tang et al., 2023, Chen et al., 9 Oct 2024).
- Downstream dataset tracking: Detects unauthorized use of LLM-generated datasets in classifiers and generative models by persistent watermarking in both input-level and output-level tasks (Liu et al., 16 Jun 2025).
- IP protection, content attribution, and academic integrity: Embedding robust, hard-to-remove watermarks supports rights holders and deters misuse and plagiarism.
Limitations and Open Problems
- Entropy bottleneck: Detection power scales with generation entropy; low-diversity outputs limit detectability (Bahri et al., 18 Aug 2025).
- Imperceptibility trade-offs: Service-level attacks (e.g., Water-Probe (Liu et al., 4 Oct 2024)) can reveal the presence of watermarks via distributional biases; the Water-Bag strategy mitigates this by key randomization but may reduce per-text detectability.
- Robustness vs. forensic accuracy: Some highly robust schemes (e.g., binary dataset-level watermarking) require large sample sizes or are vulnerable in short-form outputs.
- Compute and deployment: Sampling-based or feature-based rejection schemes (e.g. SAEMark) introduce inference overhead.
- Fine-tuning and transfer: Some parameter-integrated watermarks may decrease in detectability or utility after further model adaptation unless orthogonality between watermark and fine-tuning directions is maintained.
7. Future Directions
Current research highlights several promising paths:
- High-dimensional and semantic GoF detectors adapting to richer feature statistics or textual embeddings (He et al., 4 Oct 2025).
- Adaptive distribution-aware watermarking, jointly optimizing embedding and detection given text context and application constraints (He et al., 3 Oct 2024).
- Stealthy, adaptive, and multi-bit watermarking for both open- and closed-source scenarios, leveraging learned feature spaces and distributed key assignment (SAEMark (Yu et al., 11 Aug 2025), ensemble schemes (Niess et al., 29 Nov 2024)).
- Automated removal defense by integrating watermark detection with model update monitoring and fine-tuning-detection linkage (e.g., via WAPITI parameter integration (Chen et al., 9 Oct 2024)).
- Compositional or hybrid methods blending GoF statistics, token-level signatures, and data-level lexical markers for cross-domain, adversarial, and post-processing robustness.
In summary, watermark detection in LLMs now encompasses a broad spectrum of statistical, signal-processing, feature-driven, and adaptive-hybrid methodologies, tightly integrated with the practical realities of open, fine-tuned, and closed-source LLM deployment. Continued development focuses on maximizing detection power under entropy constraints, adversarial conditions, and evolving content-generation paradigms while enforcing rigorous theoretical guarantees and minimal impact on generation quality.