"Training robust watermarking model may hurt authentication!'' Exploring and Mitigating the Identity Leakage in Robust Watermarking

Published 10 May 2026 in cs.CR | (2605.09646v1)

Abstract: The rapid advancement of generative AI has underscored the critical need for identifying image ownership and protecting copyrights. This makes post-processing image watermarking an essential tool: it involves embedding a specific watermark message into an image, with successful verification if a similar message can be decoded from the watermarked image. However, this method is susceptible to both adversarial attacks that manipulate the watermarked image to yield an unverified message upon decoding, and the proposed identity leakage-related attacks (e.g., forging watermarked images). The threat of identity leakage is particularly exacerbated in both empirical and certified robust watermarking methods. To defend against the aforementioned attacks, we propose W-IR, the first image watermarking framework that simultaneously incorporates identity protection and robustness. To enhance model robustness, we introduce a novel randomized smoothing technique as part of robust watermarking that offers certified robustness against perturbations across two distinct transformation spaces: pixel-level and coordinate-level. Moreover, to further mitigate identity leakage, we propose a new strategy based on residual information loss, aimed at minimizing the mutual information between the residual and watermarked images. Our work strikes a superior balance between robustness and identity leakage mitigation. Extensive experiments demonstrate that our W-IR framework achieves high certified accuracy for authenticity while effectively reducing identity leakage. The code is available at https://github.com/holdrain/W-I-R.


Authors (6)

Summary

The paper demonstrates that robust watermarking amplifies residual identity leakage, undermining authentication fidelity.
It employs a residual information loss objective to minimize watermark leakage while preserving robustness and clean bit accuracy.
Extensive empirical and certified evaluations confirm that integrating information bottlenecks effectively balances watermark strength with security.

Identity Leakage in Robust Post-Processing Image Watermarking: Analysis and Mitigation

Overview and Main Contributions

The paper "Training robust watermarking model may hurt authentication!" Exploring and Mitigating the Identity Leakage in Robust Watermarking (2605.09646) rigorously analyzes post-processing image watermarking and exposes the overlooked vulnerability of identity leakage, particularly exacerbated by robust watermarking protocols. The authors introduce W-IR, a watermarking framework addressing both certified robustness and mitigation of identity leakage via a residual information minimization objective. The paper substantiates these claims with formal theoretical analysis and comprehensive empirical evaluation across multiple datasets and watermarking strategies.

Figure 1: Main contributions: (1) the discovery of identity leakage and corresponding attacks in post-processing watermarking, especially worsened in the robust case; (2) W-IR—robust watermarking with mitigated identity leakage.

Identity Leakage: Threat Model and Attack Taxonomy

The core vulnerability arises from the residual images produced by subtracting the original image from the watermarked image. These residuals retain stable, secret watermark patterns, enabling practical attacks:

Identity Linking: Residual images cluster by user identity, exposing confidential watermark information even without explicit decoding. Clustering analysis demonstrates significantly reduced intra-cluster distances for robust models, indicating stronger identity correlation.

Figure 2: Residual images from four users cluster coherently, revealing identity information under both COCO and CelebA datasets.
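The clustering effect can be illustrated with a small, self-contained sketch. It is not the paper's pipeline: the "encoder" here is a hypothetical toy that adds a fixed per-user pattern plus noise, and `toy_residuals` and its `strength` parameter are assumptions made purely for illustration. The silhouette coefficient is computed with its standard definition, assuming every cluster holds at least two points.

```python
import math
import random

def residual(watermarked, original):
    """Pixel-wise residual: watermarked image minus original image."""
    return [w - o for w, o in zip(watermarked, original)]

def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes each cluster has at least two points."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            mean_dist[c] = sum(ds) / len(ds) if ds else 0.0
        a = mean_dist[labels[i]]                                   # intra-cluster
        b = min(v for c, v in mean_dist.items() if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def toy_residuals(strength, n_users=4, n_images=10, dim=64, seed=0):
    """Hypothetical encoder: original + strength * user pattern + small noise."""
    rng = random.Random(seed)
    points, labels = [], []
    for user in range(n_users):
        pattern = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        for _ in range(n_images):
            original = [rng.random() for _ in range(dim)]
            watermarked = [o + strength * p + rng.gauss(0, 0.05)
                           for o, p in zip(original, pattern)]
            points.append(residual(watermarked, original))
            labels.append(user)
    return points, labels

# A stronger (more "robust") embedding yields tighter per-user clusters.
weak = silhouette_score(*toy_residuals(strength=0.01))
strong = silhouette_score(*toy_residuals(strength=0.10))
print(f"silhouette weak={weak:.3f} strong={strong:.3f}")
```

In this toy setting, a larger embedding strength plays the role of robust training: the shared pattern dominates the image-dependent noise, so residuals from the same user sit closer together.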

Identity Forgery: An adversary can overlay averaged residuals onto new images to generate forged watermarked images with high visual fidelity and high bit accuracy under the decoder. Even with a single residual, substantial watermark information is recoverable, and attack success scales with $m$ (the number of watermarked images) and the overlay multiplier $\zeta$.
Figure 3: Identity forgery—overlaying residuals enables adversaries to create convincing forged watermarks.

Figure 4: Forging watermarked images on StegaStamp; visual quality and decoding accuracy improve with $m$ and $\zeta$.
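The forgery step described above (average $m$ residuals, scale by $\zeta$, add to a fresh image) can be sketched as follows. The encoder and decoder are hypothetical stand-ins, a signed-carrier embedder and a correlation decoder, chosen only so the example runs without a trained model; `DIM`, `BITS`, and `ALPHA` are toy values, not settings from the paper.

```python
import random

DIM, BITS, ALPHA = 1024, 8, 0.02  # toy sizes, not values from the paper
_rng = random.Random(0)
CARRIERS = [[_rng.choice((-1.0, 1.0)) for _ in range(DIM)] for _ in range(BITS)]

def toy_encode(image, bits, rng):
    """Hypothetical encoder: one signed carrier per bit plus image-dependent noise."""
    signs = [1.0 if b else -1.0 for b in bits]
    return [image[p]
            + ALPHA * sum(s * CARRIERS[i][p] for i, s in enumerate(signs))
            + rng.gauss(0, 0.1)
            for p in range(DIM)]

def toy_decode(image):
    """Hypothetical decoder: sign of the correlation with each carrier."""
    mean = sum(image) / DIM
    return [int(sum((x - mean) * c for x, c in zip(image, carrier)) > 0)
            for carrier in CARRIERS]

def average_residual(residuals):
    m = len(residuals)
    return [sum(r[p] for r in residuals) / m for p in range(DIM)]

def forge(new_image, avg_residual, zeta):
    """Overlay the averaged residual, scaled by zeta, and clip to [0, 1]."""
    return [min(1.0, max(0.0, x + zeta * a)) for x, a in zip(new_image, avg_residual)]

def forgery_bit_accuracy(m, zeta, trials=50, seed=1):
    rng = random.Random(seed)
    secret = [rng.randint(0, 1) for _ in range(BITS)]
    residuals = []
    for _ in range(m):  # adversary observes m (original, watermarked) pairs
        original = [rng.random() for _ in range(DIM)]
        marked = toy_encode(original, secret, rng)
        residuals.append([w - o for w, o in zip(marked, original)])
    avg = average_residual(residuals)
    hits = 0
    for _ in range(trials):  # forge onto fresh images and decode
        target = [rng.random() for _ in range(DIM)]
        decoded = toy_decode(forge(target, avg, zeta))
        hits += sum(d == s for d, s in zip(decoded, secret))
    return hits / (trials * BITS)

low = forgery_bit_accuracy(m=1, zeta=0.5)
high = forgery_bit_accuracy(m=20, zeta=2.0)
print(f"forged bit accuracy: m=1, zeta=0.5 -> {low:.2f}; m=20, zeta=2.0 -> {high:.2f}")
```

Averaging suppresses the image-dependent part of each residual while the shared watermark pattern survives, and a larger $\zeta$ amplifies that pattern relative to the content of the target image.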

Identity Extraction: Bit-wise watermark extraction is feasible via iterative decoder queries and residual distance comparison; attack success exceeds $90\%$ bit accuracy in robust models, especially on StegaStamp.
Figure 5: Adversary extracts watermark bits by comparing residuals from encoder queries.
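A minimal sketch of bit-wise extraction by residual distance comparison is given below. Following the Figure 5 caption, it assumes the adversary can query the encoder with chosen messages; the toy encoder, the greedy one-bit-at-a-time search, and all constants are assumptions for illustration and may differ from the paper's attack.

```python
import math
import random

DIM, BITS, ALPHA = 1024, 8, 0.05  # toy sizes, not values from the paper
_rng = random.Random(0)
CARRIERS = [[_rng.choice((-1.0, 1.0)) for _ in range(DIM)] for _ in range(BITS)]

def toy_encoder_residual(bits, rng):
    """Hypothetical black-box query: residual (watermarked - original) for a probe
    image; the Gaussian term stands in for image-dependent variation."""
    signs = [1.0 if b else -1.0 for b in bits]
    return [ALPHA * sum(s * CARRIERS[i][p] for i, s in enumerate(signs))
            + rng.gauss(0, 0.02)
            for p in range(DIM)]

def extract_bits(target_residual, rng):
    """Greedy search: for each bit, keep the value whose residual is closer
    (in Euclidean distance) to the observed target residual."""
    guess = [0] * BITS
    for i in range(BITS):
        distances = []
        for candidate in (0, 1):
            trial = guess[:i] + [candidate] + guess[i + 1:]
            probe = toy_encoder_residual(trial, rng)
            distances.append(math.dist(probe, target_residual))
        guess[i] = 0 if distances[0] <= distances[1] else 1
    return guess

rng = random.Random(42)
secret = [rng.randint(0, 1) for _ in range(BITS)]
leaked = toy_encoder_residual(secret, rng)   # residual observed from the victim
recovered = extract_bits(leaked, rng)
accuracy = sum(r == s for r, s in zip(recovered, secret)) / BITS
print(f"extracted bit accuracy: {accuracy:.2f}")
```

The search needs only two encoder queries per bit, which is why a stable, message-dependent residual is enough to leak the watermark in this toy setting.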

These attacks are more effective as model robustness increases, with empirical and certified robustness protocols intensifying identity leakage. Robust training inadvertently increases the decoder’s capacity to extract watermark information from residuals rather than confining it to the watermarked image alone.

Robustness in Neural Watermarking: Empirical and Certified Approaches

Classic post-processing watermarking (StegaStamp, HiDDeN) employs encoder-decoder neural architectures with loss terms for message reconstruction, visual fidelity (LPIPS), and adversarial regularization.

Robust watermarking introduces noise simulation layers or adversarial augmentations:

Empirical Robustness (W-ER): Adversarial training and augmentation improve resilience but increase identity leakage.
Certified Robustness (W-CR): The paper extends randomized smoothing techniques to watermark authentication—providing formal robustness guarantees against pixel and coordinate perturbations, including additive Gaussian noise and affine transformations.

Figure 6: Typical distortions for certified robustness: pixel noise, coordinate perturbations.

Certified robustness is achieved using smoothed classification bounds. The authentication model $h$ is constrained such that prediction is invariant for perturbations within a certified radius $R$ (determined by noise level and prediction confidence).

Figure 7: Certified accuracy at different radii under additive noise—accuracy decreases with radius but remains high within bounds.
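How a radius follows from noise level and prediction confidence can be sketched with the standard Gaussian randomized smoothing bound for a binary decision, $R = \sigma\,\Phi^{-1}(\underline{p})$, where $\underline{p}$ is a lower confidence bound on the majority-vote probability. This is the generic pixel-level bound, not necessarily the paper's certificate, and it does not cover coordinate-level perturbations. The decoder, the threshold `tau`, and the Hoeffding bound used for $\underline{p}$ are assumptions of this sketch.

```python
import math
import random
from statistics import NormalDist

def bit_accuracy(decoded, reference):
    return sum(d == r for d, r in zip(decoded, reference)) / len(reference)

def smoothed_authenticate(decode, image, watermark, sigma,
                          tau=0.8, n=1000, alpha=0.001, seed=0):
    """Monte Carlo smoothed authentication under additive Gaussian pixel noise.

    Returns (accepted, radius). The radius is an L2 radius that holds with
    probability at least 1 - alpha; it is 0.0 if the vote is not confident.
    """
    rng = random.Random(seed)
    votes = 0
    for _ in range(n):
        noisy = [x + rng.gauss(0, sigma) for x in image]
        votes += int(bit_accuracy(decode(noisy), watermark) >= tau)
    accepted = 2 * votes >= n
    top = max(votes, n - votes)
    # Conservative one-sided Hoeffding lower bound on the majority probability.
    p_lower = top / n - math.sqrt(math.log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return accepted, 0.0
    return accepted, sigma * NormalDist().inv_cdf(p_lower)

def toy_decode(image):
    """Hypothetical decoder: one bit per pixel, thresholded at 0.5."""
    return [int(x > 0.5) for x in image]

watermark = [1, 0, 1, 1, 0, 0, 1, 0]
marked = [0.9 if b else 0.1 for b in watermark]
accepted, radius = smoothed_authenticate(toy_decode, marked, watermark, sigma=0.1)
print(f"accepted={accepted} certified L2 radius={radius:.4f}")
```

A tighter confidence bound (for example Clopper-Pearson) would give a larger radius for the same votes; Hoeffding is used here only to keep the sketch within the standard library.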

Empirical evaluation demonstrates near-perfect certified accuracy (reaching $99.5\%$ or higher in the best cases) under realistic noise levels, with minimal sacrifice in clean bit accuracy or visual quality.

Information-Theoretic Mitigation: Residual Information Loss Objective

To mitigate identity leakage, the paper formalizes a residual information bottleneck objective. The watermarked image $w$ should retain maximal mutual information with the secret watermark $t$, $I(w; t)$, while the residual image $z$ should carry minimal mutual information about the watermark, $I(z; t)$:

$$\max\; I(w; t), \qquad \min\; I(z; t)$$

Direct mutual information estimation is intractable; the authors employ variational bounds and KL-divergence approximation (“residual information loss”). Optimizing this objective via additional encoder training substantially reduces identity leakage without degrading robustness or authentication.

Figure 8: Information content in feature representations; residual information loss encourages maximal identity retention in watermark, minimal leakage in residual.

Figure 9: Schematic of residual information loss—mitigation pathway during robust watermark training.
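One common way to make such a term tractable is the variational information bottleneck upper bound, in which a mutual information term is bounded by the expected KL divergence between a Gaussian posterior over a latent code and a fixed prior. The sketch below shows that generic construction only; the source does not give the exact form of the paper's residual information loss, so the diagonal-Gaussian posterior, the standard normal prior, and the weight `beta` are all assumptions.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def total_loss(message_loss, residual_mu, residual_log_var, beta=0.01):
    """Message reconstruction loss plus a weighted KL penalty on the residual code.

    `beta` is a hypothetical trade-off weight, not a value from the paper.
    """
    kl = gaussian_kl_to_standard_normal(residual_mu, residual_log_var)
    return message_loss + beta * kl

# A residual code that matches the prior carries no penalty.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting the mean by 1 in one unit-variance dimension costs 0.5 nats.
print(gaussian_kl_to_standard_normal([1.0], [0.0]))  # 0.5
```

Driving the KL term toward zero pushes the residual representation toward the uninformative prior, which is the mechanism by which a bottleneck limits what the residual can reveal about the watermark.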

Strong Empirical Findings and Contradictory Observations

Identity leakage is strongly intensified by robust watermarking: Empirical and certified robust models show much higher silhouette scores (identity linking), forgery bit accuracy, and extraction accuracy compared to clean vanilla models, especially with StegaStamp.
Residual information loss restores identity protection: Models trained with this loss achieve leakage rates comparable to, or lower than, clean models, while maintaining high robustness.
Certified robustness achieves high authentication accuracy: W-CR preserves high certified accuracy even under significant geometric and noise perturbations.
Trade-off between robustness and identity protection can be balanced: The introduction of residual information loss does not impair robustness certification or clean accuracy.
Figure 10: Three-facet performance visualization: authenticity, robustness, identity protection across watermarking strategies—W-IR achieves superior balance.

Figure 11: Impact of $m$ (number of images) and $\zeta$ (overlay multiplier) on COCO (StegaStamp): identity leakage scales but can be attenuated via information bottleneck training.

Practical and Theoretical Implications

The explicit demonstration of robust watermarking exacerbating identity leakage challenges conventional assumptions in watermark authentication security.

For forensic and copyright applications: Invisible watermarks deployed with robust training are vulnerable to forgery and extraction—even in black-box deployment.
For watermarking system design: Information bottleneck objectives must be integrated to confine watermark information within the intended image and prevent residual-based leakage.
For adversarial attack evaluation: Robustness certificates must be accompanied by identity leakage analysis to validate both security and privacy.

Methodologically, the adaptation of randomized smoothing for coordinate-level perturbations contributes to robust watermark certification beyond pixel adversaries.

Future Directions in Secure AI Watermarking

Theoretical extensions should address adversaries with more powerful generative or optimization-based capabilities, including attacks exploiting semantic correlation or multimodal watermark embeddings. Further, the interplay of watermark information, image semantics, and generative model outputs in in-processing watermarking suggests new attack vectors requiring robust information-theoretic mitigation.

Integration with generative AI ecosystems, large-scale deployment, and regulatory compliance (copyright, evidence provenance) demands scalable, efficient, and provably secure watermarking frameworks—combining robustness certification with formal leakage mitigation.

Conclusion

The paper provides a comprehensive analytical framework for understanding and mitigating identity leakage in robust post-processing image watermarking. By formalizing and empirically validating the leakage phenomenon—especially under robust training—and presenting an effective residual information bottleneck mitigation, it establishes new foundational standards for secure watermarking practices in generative AI and digital content protection.
