Detectability of Deeply Entangled Structural Watermarks in LLM Outputs

Determine whether reliable detection of a watermark from a model's generated text is realistically achievable when the watermark is deeply entangled with a large language model's architecture, i.e., embedded into transformer weights and layers.

Background

The paper analyzes watermarking methods that modify the internal structure of LLMs, such as embedding signals into model weights or introducing them through training-time procedures, and contrasts them with token-level watermarking. These structural approaches promise low impact on output quality and resilience to removal, but they raise questions about whether the signal persists and remains identifiable after downstream modifications.
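
For orientation, the sketch below shows the kind of per-token statistical test that token-level schemes admit at detection time; the hash-based green list, the gamma fraction, and the z-threshold are illustrative assumptions rather than details taken from the paper. Structural watermarks embedded in weights offer no equally direct test on the output text, which is what makes their detectability an open question.

```python
import hashlib
import math

# Hypothetical token-level ("green list") detector in the spirit of
# distribution-biasing schemes; all parameter values here are assumptions
# for illustration, not details from the paper under discussion.

GAMMA = 0.25  # assumed fraction of the vocabulary marked "green" per step


def is_green(prev_token: int, token: int, vocab_size: int) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by `prev_token`."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % vocab_size < GAMMA * vocab_size


def z_score(tokens: list[int], vocab_size: int) -> float:
    """One-proportion z-test: deviation of the observed green-token fraction
    from the GAMMA expected under the no-watermark null hypothesis."""
    hits = sum(
        is_green(prev, tok, vocab_size)
        for prev, tok in zip(tokens, tokens[1:])
    )
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


# A text is flagged as watermarked when z_score(...) exceeds a chosen
# threshold (e.g. z > 4). Watermarks entangled with weights and layers
# admit no comparably direct per-token test on the output text.
```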

Empirical observations cited in the paper show mixed robustness of structural watermarks under common open-source workflows (e.g., pruning, quantization, merging, and fine-tuning), and the evaluations to date often rely on limited metrics (e.g., fixed false positive rate (FPR) settings), leaving gaps in our understanding of overall detectability. This motivates a specific open issue: whether deeply embedded signals remain practically detectable from generated text.
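
As a hedged illustration of what a broader evaluation could look like, the sketch below reports the true positive rate at several FPR operating points plus the AUC, instead of a single fixed-FPR number. The detector scores are synthetic placeholders standing in for outputs of some detector run on text from a (possibly pruned, quantized, merged, or fine-tuned) watermarked model versus an unwatermarked one; they are not results from the cited evaluations.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic placeholder scores: higher values mean "more likely watermarked".
rng = np.random.default_rng(0)
scores_watermarked = rng.normal(loc=2.0, scale=1.0, size=1000)
scores_clean = rng.normal(loc=0.0, scale=1.0, size=1000)

labels = np.concatenate([np.ones(1000), np.zeros(1000)])
scores = np.concatenate([scores_watermarked, scores_clean])

# Full ROC curve and AUC instead of a single operating point.
fpr, tpr, _ = roc_curve(labels, scores)
print("AUC:", roc_auc_score(labels, scores))

# Report TPR at several FPR budgets rather than one fixed FPR setting.
for target_fpr in (1e-3, 1e-2, 1e-1):
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    print(f"TPR at FPR<={target_fpr:g}: {tpr[max(idx, 0)]:.3f}")
```

Repeating such an evaluation after each downstream modification (pruning, quantization, merging, fine-tuning) would show how the full detection trade-off degrades, rather than only a single point on it.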

References

The depth of such in-processing watermarks, embedded into weights and layers, leads to an open issue: if the watermarking is too deeply entangled with the LLM architecture, will it be realistic to reliably detect it in the final LLM text?

Watermarking Large Language Models in Europe: Interpreting the AI Act in Light of Technology (2511.03641 - Souverain, 5 Nov 2025) in Section 6, Trade-Offs for Existing LLM Watermarking Techniques; In-Processing Approaches; Watermarking in Model Architecture (steps 1 & 2) — Research avenues on final detectability and distortion of LLM outputs