
How Good is Post-Hoc Watermarking With Language Model Rephrasing?

Published 18 Dec 2025 in cs.CR and cs.CL | (2512.16904v1)

Abstract: Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore post-hoc watermarking where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.

Summary

  • The paper demonstrates that post-hoc watermarking through LLM rephrasing, particularly using the Gumbel-max method, effectively embeds detectable signals while maintaining high semantic fidelity.
  • It reveals a trade-off where larger models yield better text quality but lower watermark strength, whereas smaller models preserve detectability; for code, smaller models even outperform larger ones.
  • The study underscores the importance of decoding strategies like beam search and emphasizes challenges in multilingual and low-entropy domains, suggesting directions for future watermarking methods.

Evaluation of Post-Hoc Watermarking with LLM Rephrasing

Introduction

This paper presents an extensive empirical study of post-hoc watermarking through LLM-based paraphrasing, designed to embed detectable statistical signals in existing text data. Unlike generation-time watermarking, which is constrained by API or system integration, the post-hoc paradigm allows flexible paraphrasing of arbitrary text, facilitating applications such as data copyright tracing, traitor-tracing, and membership inference in RAG/training corpora. The approach introduces new degrees of freedom in watermark embedding, most notably adaptable decoding strategies and model selection, which change the fidelity-detectability trade-off.

The authors benchmark a broad suite of watermarking algorithms, paraphrasing models (spanning multiple families and scales), decoding/search strategies, and detection/aggregation procedures. Crucially, they distinguish between open-ended text (Wikipedia, books) and verifiable domains (code), exposing the circumstances under which existing watermarking algorithms succeed or fail and identifying optimal parameter regimes for practical deployment. Figure 1

Figure 1: Post-hoc text watermarking through watermarked LLM rephrasing, empirically evaluating detection power, semantic fidelity, and the effect of scheme and compute allocation.

Watermarking Methodology

The post-hoc protocol involves chunking the input, rephrasing each chunk with an LLM under watermark-constrained decoding, and aggregating watermark evidence for statistical detection. The watermark is injected by biasing or perturbing token selection pseudorandomly (via a secret key and prior context), employing methods such as Green-Red, Gumbel-max, DiPMark, MorphMark, and SynthID-Text. Notably, the Gumbel-max method outperforms most alternatives in random sampling scenarios, establishing a robust Pareto frontier across text quality and watermark strength.
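The Gumbel-max selection rule named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the SHA-256 keyed hash, the two-token context window, and the helper names `keyed_uniform` and `gumbel_max_pick` are hypothetical choices made for this example.

```python
import hashlib
import math


def keyed_uniform(key: bytes, context: tuple, token_id: int) -> float:
    """Pseudorandom value strictly inside (0, 1), fixed by key, context and token.

    Assumption: SHA-256 stands in for whatever keyed function a real system uses.
    """
    message = key + repr((context, token_id)).encode()
    digest = hashlib.sha256(message).digest()
    # Keep 52 bits so that (x + 0.5) / 2**52 is exact and never 0.0 or 1.0.
    x = int.from_bytes(digest[:8], "big") >> 12
    return (x + 0.5) / 2**52


def gumbel_max_pick(probs, key: bytes, context: tuple) -> int:
    """Pick the token maximizing r_v ** (1 / p_v), computed as log(r_v) / p_v."""
    best_token, best_score = -1, -math.inf
    for token_id, p in enumerate(probs):
        if p <= 0.0:
            continue  # tokens with zero probability can never be selected
        score = math.log(keyed_uniform(key, context, token_id)) / p
        if score > best_score:
            best_token, best_score = token_id, score
    return best_token


# Toy usage: a 4-token vocabulary and the two previous token ids as context.
probs = [0.1, 0.6, 0.25, 0.05]
choice = gumbel_max_pick(probs, key=b"secret", context=(17, 42))
```

If the keyed values behave like independent uniform draws, this rule selects each token with its model probability, which is why the scheme can preserve quality while still leaving a signal: the chosen token tends to have a large keyed value, and a detector holding the key can recompute it.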

Detection is achieved through statistical hypothesis testing, using deduplication and private key search to guarantee correct false positive rates under the null distribution. Compute-enhanced modes (beam search, multi-candidate generation with WaterMax, and entropy-based filtering) allow further navigation of the quality-detectability trade-off in ways that are inaccessible to generation-time-only methods.
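As a rough sketch of what such a test can look like for a Gumbel-max style watermark: under the null hypothesis each keyed value is uniform, so each term -log(1 - r) is exponentially distributed and their sum follows a Gamma(n, 1) law. The code below assumes that setup, and scores each (context, token) pair only once as a simple form of deduplication. The hash construction, window size, and function names are illustrative assumptions, and the key search step is not shown.

```python
import hashlib
import math


def keyed_uniform(key: bytes, context: tuple, token_id: int) -> float:
    """Pseudorandom value strictly inside (0, 1); SHA-256 is an assumed stand-in."""
    digest = hashlib.sha256(key + repr((context, token_id)).encode()).digest()
    x = int.from_bytes(digest[:8], "big") >> 12
    return (x + 0.5) / 2**52


def gamma_survival(n: int, s: float) -> float:
    """P(Gamma(n, 1) >= s) for integer n, via P(Poisson(s) <= n - 1)."""
    if n <= 0 or s <= 0.0:
        return 1.0
    log_s = math.log(s)
    total = sum(math.exp(-s + k * log_s - math.lgamma(k + 1)) for k in range(n))
    return min(1.0, total)


def detect(tokens, key: bytes, window: int = 2) -> float:
    """Return a p-value for the hypothesis 'this text was not watermarked with key'."""
    seen = set()
    score, n = 0.0, 0
    for i in range(window, len(tokens)):
        context = tuple(tokens[i - window:i])
        item = (context, tokens[i])
        if item in seen:
            continue  # deduplication: repeated pairs would break independence
        seen.add(item)
        r = keyed_uniform(key, context, tokens[i])
        score += -math.log(1.0 - r)
        n += 1
    return gamma_survival(n, score)
```

Skipping repeated pairs matters because a repeated (context, token) pair reproduces the same keyed value, so counting it twice would violate the independence that the Gamma null distribution relies on.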

Empirical Results

Quality-Detectability Trade-Off

Extensive sweep experiments on LLM paraphrasers (notably Llama-3 and Qwen) demonstrate that under standard nucleus sampling regimes, Gumbel-max watermarking dominates the quality-detectability Pareto frontier on open-ended English prose. Alternative schemes (DiPMark, MorphMark, SynthID) yield strictly inferior frontiers unless search (e.g., beam search with biased scoring) is engaged. Figure 2

Figure 2: Detection quality trade-off for competing watermarking methods and parameter regimes, highlighting the dominance of Gumbel-max in median-case performance.

Qualitative rephrasing results confirm high semantic fidelity at strong detectability, as evidenced by high SBERT similarity, low perplexity, and extremely low p-values.

Model Family and Scale

Larger LLMs consistently generate semantically faithful paraphrases with lower perplexity. Paradoxically, they are less capable of embedding strong watermarks, since their outputs are less entropic and thus more resistant to substantial perturbation. At higher watermark strengths, only smaller, less capable models maintain detectability, indicating a clear trade-off between semantic accuracy and watermark channel capacity. Figure 3

Figure 3: The effect of model family/scale on cross entropy and watermark strength; larger models improve quality but only smaller models reach high detection strength.

Decoding and Compute

Systematic decoding via beam search, especially with scoring biased toward watermarked likelihoods, consistently improves the trade-off over random sampling. WaterMax, which selects among multiple candidates generated without perturbing the token distribution, achieves only weak watermarking power. It is also unsuitable for “radioactivity” checks (inferring whether the text was used in training or in a retrieval context), one of the applications that motivates post-hoc watermarking. Figure 4

Figure 4: Beam search shifts the Pareto frontier upward for robust watermarking methods, particularly with biased scoring that maximizes watermark evidence.
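To make the idea of beam search with watermark-biased scoring concrete, here is a toy sketch. It assumes a Green-Red style bias (a constant `delta` added to the logits of keyed "green" tokens) and ranks beams by their cumulative log-probability under that biased distribution. The paper's exact scoring rule may differ; `next_logits` is a hypothetical stand-in for the rephrasing model, and all names here are illustrative.

```python
import hashlib
import math


def is_green(key: bytes, prev_token: int, token: int, gamma: float = 0.5) -> bool:
    """Keyed pseudorandom green-list membership (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256(key + f"|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def biased_log_probs(logits, key: bytes, prev_token: int, delta: float):
    """Add delta to green-token logits, then apply a stable log-softmax."""
    shifted = [
        logit + (delta if is_green(key, prev_token, tok) else 0.0)
        for tok, logit in enumerate(logits)
    ]
    top = max(shifted)
    log_norm = top + math.log(sum(math.exp(s - top) for s in shifted))
    return [s - log_norm for s in shifted]


def beam_search(next_logits, start_token, steps, key, delta=2.0, beam_width=4):
    """Keep the beam_width sequences with the highest biased log-likelihood."""
    beams = [(0.0, [start_token])]
    for _ in range(steps):
        expanded = []
        for score, seq in beams:
            log_probs = biased_log_probs(next_logits(seq), key, seq[-1], delta)
            for tok, lp in enumerate(log_probs):
                expanded.append((score + lp, seq + [tok]))
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]
```

Because the beam keeps the highest-scoring continuations instead of sampling one, more compute (a wider beam) can be traded for sequences that carry more watermark evidence at a similar likelihood.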

Entropy-aware detection brings only modest improvements and is generally not worth the additional complexity, except in marginal cases. Figure 5

Figure 5: Modest, configuration-dependent benefit of entropy-aware token filtering at detection time; generally less than 20% improvement.
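Entropy-aware filtering can be illustrated with a Green-Red style z-score that only counts tokens whose predictive entropy exceeds a threshold. This is a sketch under assumptions: the per-token entropies would come from running a language model over the text at detection time, and the threshold, the green fraction `gamma`, and the function names are hypothetical.

```python
import math


def shannon_entropy(probs) -> float:
    """Entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def green_z_score(green_flags, gamma: float = 0.5) -> float:
    """z-score of the green-token count against a Binomial(n, gamma) null."""
    n = len(green_flags)
    if n == 0:
        return 0.0
    return (sum(green_flags) - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)


def entropy_filtered_z(green_flags, entropies, threshold: float, gamma: float = 0.5):
    """Score only tokens generated where the model had real freedom of choice."""
    kept = [g for g, h in zip(green_flags, entropies) if h >= threshold]
    return green_z_score(kept, gamma)


# Toy data: the four high-entropy positions are green, the low-entropy ones are not.
flags = [1, 1, 1, 1, 0, 0, 0, 0]
entropies = [2.0, 2.0, 2.0, 2.0, 0.1, 0.1, 0.1, 0.1]
z_all = green_z_score(flags)
z_filtered = entropy_filtered_z(flags, entropies, threshold=1.0)
```

In this toy case the unfiltered score is diluted by low-entropy tokens that could not carry a watermark, while the filtered score is higher. For the test to stay valid, the threshold has to be fixed without looking at the green flags.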

Watermarking in Code

For verifiable domains, especially code, watermarking is substantially constrained by the need to preserve functional correctness. Here, small models outperform larger ones for watermark detectability, but both hit a hard limit: high watermark strength ("detection power") and high correctness ("pass@1") cannot be achieved together. This exposes a fundamental limitation: code is low-entropy and must remain functionally correct, so only weak watermarking is feasible without excessive degradation. Figure 6

Figure 6: Pass@1 vs TPR at FPR = 10^{-3} for various code watermarking methods; Gumbel-max again leads, but strong watermarking is unattainable for high correctness.
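The two axes of this comparison can be computed as follows. The sketch assumes calibrated detector p-values for a set of watermarked outputs (so that rejecting at p <= FPR gives the stated false positive rate) and one rephrased sample per problem for pass@1; the function names and toy inputs are illustrative only.

```python
def tpr_at_fpr(p_values, fpr: float = 1e-3) -> float:
    """Fraction of watermarked texts flagged when rejecting at p <= fpr."""
    if not p_values:
        return 0.0
    return sum(p <= fpr for p in p_values) / len(p_values)


def pass_at_1(passed) -> float:
    """Fraction of problems whose single rephrased solution passes its tests."""
    if not passed:
        return 0.0
    return sum(passed) / len(passed)


# Toy inputs, not results from the paper.
tpr = tpr_at_fpr([1e-5, 0.2, 1e-4, 0.5], fpr=1e-3)
correctness = pass_at_1([True, False, True, True])
```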

Size ablations demonstrate that, unlike in open-ended text, smaller models offer superior watermarking performance at equivalent generation operating points, but at diminished functional accuracy.

Multilingual Documents and Chunking

Non-English watermarking is possible but comes with a sharper degradation in semantic similarity to attain comparable detection strength. This reflects typical LLM training biases and highlights the need for language-specific adaptations. For long documents, context-aware chunked paraphrasing strictly improves semantic consistency and watermark detection. Figure 7

Figure 7: Distribution of the output-to-input length ratio across models and watermarking schemes; output stability is determined primarily by base model capacity.
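One way to read context-aware chunked paraphrasing is sketched below: the document is split on paragraph boundaries into chunks of bounded size, and each chunk is rephrased with the previous rephrased chunk passed along as context. This interpretation, the word budget, and the `rephrase(chunk, context)` callable are assumptions made for illustration; the summary does not specify the paper's exact procedure.

```python
def chunk_paragraphs(text: str, max_words: int = 200):
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        words = len(paragraph.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def rephrase_document(text: str, rephrase, max_words: int = 200) -> str:
    """Rephrase chunk by chunk, feeding the previous output as context."""
    outputs = []
    for chunk in chunk_paragraphs(text, max_words):
        context = outputs[-1] if outputs else ""
        outputs.append(rephrase(chunk, context))
    return "\n\n".join(outputs)
```

In practice `rephrase` would wrap a watermarked LLM call; any callable taking the chunk and the context string works for testing the chunking logic.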

Implications and Future Directions

The thorough empirical evidence delineates the zones of feasibility for post-hoc watermarking. Practically, post-hoc LLM watermarking is a viable, low-effort tool for copyright management, contamination tracing, and downstream membership inference in open-ended text. The need to balance rephrasing model capacity and watermark injection strength is paramount: "larger-is-better" does not hold when watermark detectability is the objective.

The critique of WaterMax clarifies that not all watermarking algorithms are suitable for robust post-hoc provenance tasks, especially tasks that rely on "radioactivity" or must hold up in adversarial settings. Furthermore, automated semantic-quality proxies are insufficient, especially in multilingual and highly structured domains, motivating the inclusion of execution-validated correctness assessments (e.g., for code) in future work.

The findings suggest two priority directions:

  • Development of watermarking schemes adapted to low-entropy, verifiable domains (e.g., code) where current methods hit capacity-functionality walls.
  • Advancement of cross-lingual watermark injection and detection mechanisms, potentially with adaptive LLM architectures or code-switching approaches.

Conclusion

This study establishes the empirical limits and design principles of post-hoc watermarking via LLM rephrasing. Robust watermarking is attainable in natural text through judicious selection of model scale, watermarking method (favoring Gumbel-max or beam-enhanced schemes), and compute allocation. In contrast, code and low-entropy domains remain major open challenges due to innate correctness constraints. The results provide strong guidance for practical watermarking deployments and catalyze research into next-generation post-hoc methods optimized for both traceability and utility.
