On the Robustness of Watermarking for Autoregressive Image Generation

Published 13 Apr 2026 in cs.CV, cs.AI, and cs.CR | (2604.11720v1)

Abstract: The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) an adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator's watermark and trigger false detection to prevent their inclusion in future model training.


Authors (8)

Summary

The paper shows that modern AR watermarking schemes are vulnerable to targeted token substitution and adversarial latent manipulation methods.
It demonstrates that bitwise watermarking, while robust to non-targeted removal, can be erased in white-box settings and forged even in some black-box settings.
The study underscores that merely adjusting detection thresholds is insufficient, urging the development of fundamentally robust watermarking designs.

Robustness of Watermarking in Autoregressive Image Generation

Introduction

Recent advances in autoregressive (AR) image generation models, including scalable architectures like VAR, HMAR, and Infinity, have accelerated the deployment of synthetic imagery. To enable content provenance, several in-generation watermarking schemes, notably KGW-derived methods, have been adapted to the AR setting. These approaches embed deterministic or stochastic signals during generation by manipulating token probabilities, aiming for both robust verification and minimal perceptual distortion. The paper "On the Robustness of Watermarking for Autoregressive Image Generation" (2604.11720) critically evaluates the practical security and reliability of such schemes under removal and forgery attacks, questioning their effectiveness for synthetic data filtering in large-scale training pipelines.

Technical Overview of AR Watermarking Schemes

Modern AR image generators often operate in tokenized latent spaces, producing sequences via chained conditional distributions with vector quantization. Watermarking in this paradigm typically follows an adaptive token manipulation strategy: at each generation step, the visual vocabulary is split into "green" (favored) and "red" (unfavored) subsets using context-dependent pseudorandom selection, and the logits of green tokens are boosted. During verification, a statistical test checks whether green tokens are overrepresented relative to chance; a sufficiently high score is taken as evidence that the watermark is present.
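The green/red mechanism and its detection test can be illustrated with a minimal sketch. It assumes a KGW-style setup in which the green set is seeded by a secret key and the previous token, and it replaces the generator with uniform logits; the names `green_set`, `generate`, `z_score`, the key, and all parameter values are hypothetical and not taken from any of the evaluated schemes.

```python
import hashlib
import math
import random

def green_set(prev_token, vocab_size, gamma, key):
    """Context-dependent pseudorandom 'green' subset (a gamma fraction of the vocabulary)."""
    digest = hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def generate(n_tokens, vocab_size, gamma, delta, key, rng):
    """Sample tokens from uniform logits, adding delta to the logits of green tokens."""
    tokens = [rng.randrange(vocab_size)]  # start token, used only as context
    for _ in range(n_tokens):
        green = green_set(tokens[-1], vocab_size, gamma, key)
        # Softmax over {0, delta} logits: green tokens get weight exp(delta), red tokens 1.
        weights = [math.exp(delta) if t in green else 1.0 for t in range(vocab_size)]
        tokens.append(rng.choices(range(vocab_size), weights=weights)[0])
    return tokens

def z_score(tokens, vocab_size, gamma, key):
    """One-proportion z-test: are green tokens overrepresented relative to gamma?"""
    n = len(tokens) - 1  # the first token has no preceding context
    hits = sum(
        tokens[i] in green_set(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))

key = b"hypothetical-secret"
rng = random.Random(0)
z_wm = z_score(generate(400, 256, 0.5, 2.0, key, rng), 256, 0.5, key)
z_plain = z_score(generate(400, 256, 0.5, 0.0, key, rng), 256, 0.5, key)
print(f"watermarked z = {z_wm:.1f}, unwatermarked z = {z_plain:.1f}")
```

With `delta = 0` the sampler ignores the green set, so the z-score behaves like a standard normal variable; a positive `delta` pushes it upward roughly in proportion to the square root of the number of tokens.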

Notable schemes covered:

IndexMark: Deterministic token-pairings and systematic replacement for detection, optimized for minimal degradation.
WMAR: KGW-compatible, with encoder/decoder finetuning for increased cycle consistency and perturbation robustness; optionally incorporates geometric synchronization.
ClusterMark: Token clustering for improved resilience, coupled with a classification head for supervised verification.
BitMark: Tailored to bitwise AR generators (Infinity), leveraging multi-scale residual bits and exploiting KGW's framework in a binary regime, designed for radioactive watermark propagation into downstream models.

Proposed and Evaluated Attack Vectors

The study systematically assesses both prior and novel attack methodologies:

1. Token Substitution/Regeneration: The Vector-Quantized Regeneration (VQ-Regen) attack exploits the codebook structure, generating reconstructions from alternate token selections in latent space to disrupt watermark traceability.

2. Latent Adversarial Optimization (LatentOpt): Adversarial attacks, in white-, grey-, and black-box variants, that use VQ-VAE encoders to optimize a perturbation within a bounded budget so that the image's latent representation shifts as far as possible, either to remove watermarks (by shifting away from watermarked centroids) or to forge them (by moving toward watermarked references).

3. Frequency Injection: For BitMark, periodic spatial artifacts are injected in the Fourier domain to mimic the watermark detector’s statistical biases, enabling potent forgery at high perceptual fidelity.
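The mechanism behind the third attack, adding a periodic pattern by editing a few Fourier coefficients, can be sketched with NumPy. This is only an illustration of the signal-processing step: the frequency `(fy, fx)` and the `strength` are hypothetical placeholders, and the summary does not state how the attack selects frequencies that match BitMark's detector.

```python
import numpy as np

def inject_frequency(img, fy, fx, strength):
    """Add strength * cos(2*pi*(fy*y/h + fx*x/w)) to img by editing two Fourier bins."""
    spec = np.fft.fft2(img)
    h, w = img.shape
    amp = strength * h * w / 2.0
    # Edit the bin and its conjugate-symmetric partner so the result stays real.
    spec[fy % h, fx % w] += amp
    spec[-fy % h, -fx % w] += amp
    # A real pipeline would also clip back to the valid pixel range.
    return np.fft.ifft2(spec).real

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(64, 64))  # stand-in for an authentic image
forged = inject_frequency(image, fy=8, fx=0, strength=0.02)
print(f"PSNR = {psnr(image, forged):.1f} dB")
```

A cosine of amplitude 0.02 has mean squared value 0.0002, so the distortion stays well above 30 dB PSNR, which is why such a pattern can remain visually unobtrusive.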

Additionally, standard robustness benchmarks such as JPEG compression, additive noise, geometric transformations, and established diffusion-based regeneration attacks (Regen, Rinse, CtrlRegen+) are included.
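In contrast to the diffusion-based regeneration baselines, VQ-Regen works directly on codebook indices. A toy sketch of the idea follows, assuming that "alternate token selections" means replacing a random fraction of tokens with their second-nearest codebook entry; that selection rule, the function names, and the random codebook are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def ranked_codes(latents, codebook):
    """Codebook indices for each latent vector, sorted from nearest to farthest."""
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)

def vq_regen(latents, codebook, swap_prob, rng):
    """Quantize, but replace a random fraction of tokens by their second-nearest code."""
    ranks = ranked_codes(latents, codebook)
    swap = rng.random(len(latents)) < swap_prob
    tokens = np.where(swap, ranks[:, 1], ranks[:, 0])
    return tokens, codebook[tokens]  # token ids and the latents a decoder would receive

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 8))  # hypothetical codebook: 512 entries, d = 8
latents = codebook[rng.integers(0, 512, size=256)] + 0.05 * rng.normal(size=(256, 8))
original = ranked_codes(latents, codebook)[:, 0]
attacked, _ = vq_regen(latents, codebook, swap_prob=0.5, rng=rng)
print("fraction of tokens changed:", np.mean(attacked != original))
```

Because a token-level detector counts green tokens, every swapped index is an independent chance to turn a green hit into a miss, while the decoded image changes only as much as the distance between neighbouring codebook entries.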

Empirical Findings

Token-based Watermarks (IndexMark, WMAR, ClusterMark)

Removability: All tested schemes are highly vulnerable to VQ-Regen and LatentOpt removal attacks. Even black-box attackers not privy to watermark keys or parameters achieve significant reductions in true positive rates for watermark detection, with minimal PSNR compromise (often >30dB). Grey- and white-box access further improves attack efficacy.
Forgery: Forgery proves nontrivial in black-box settings; success is contingent on encoder alignment. White-box forgeries are easier for low-dimensional latent spaces (notably LlamaGen's VQ-VAE with $d=8$), owing to simplified Voronoi partitioning.
Perturbation Budget: Attack strength and perceptual distortion scale predictably with $\ell_p$-norm bounds on adversarial shifts, and gradient-informed optimization clearly outperforms naive corruption.
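The gap between gradient-informed optimization and naive corruption under the same budget can be seen in a small sketch. It assumes a toy linear "encoder" `W` in place of a VQ-VAE, an $\ell_\infty$ budget `eps`, and the forgery direction (moving toward the latent of a watermarked reference); all of these are simplifications chosen for illustration, and a removal variant would instead ascend away from the watermarked latent.

```python
import numpy as np

def latent_opt(x, W, z_target, eps, steps=300):
    """Projected gradient descent on ||W(x+delta) - z_target||^2 with ||delta||_inf <= eps."""
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)  # 1/L for this quadratic objective
    delta = np.zeros_like(x)
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ (x + delta) - z_target)
        delta = np.clip(delta - lr * grad, -eps, eps)  # project onto the l_inf ball
    return x + delta

def latent_dist(v, W, z_target):
    """Euclidean distance between the encoded input and the target latent."""
    return np.linalg.norm(W @ v - z_target)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64)) / 8.0   # toy linear encoder with latent dimension 8
x = rng.uniform(size=64)             # image to be manipulated (flattened)
z_target = W @ rng.uniform(size=64)  # latent of a hypothetical watermarked reference
eps = 0.05

x_opt = latent_opt(x, W, z_target, eps)
x_rand = x + rng.uniform(-eps, eps, size=64)  # naive corruption with the same budget
print(f"start {latent_dist(x, W, z_target):.3f}  "
      f"random {latent_dist(x_rand, W, z_target):.3f}  "
      f"optimized {latent_dist(x_opt, W, z_target):.3f}")
```

Random noise moves the latent in an arbitrary direction, whereas the projected gradient steps spend the whole budget on the direction that matters for the objective.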

Bitwise-Autoregressive Watermark (BitMark)

Robustness to Removal: BitMark demonstrates strong resilience to all non-targeted attacks (including diffusion regeneration and VQ-Regen), attributed to the immense number of embedded bits per image, which gives the detection test high statistical power. However, a white-box adversarial (BitOpt) attack that leverages full knowledge of encoder parameters and green set $G$ can decisively erase the watermark at negligible perceptual cost.
Forgery and Watermark Mimicry: LatentOpt-forgery and frequency injection dramatically outperform naive attempts and succeed with TPRs approaching or exceeding 80% in some black-box settings. This watermark mimicry lets adversaries get authentic images flagged as synthetic and thus excluded from training data, violating the principal intent of radioactive watermarking.
Threshold Tuning Limitations: Adjusting detection thresholds (e.g., for lower FPRs or increased sensitivity to “radioactive” downstream data) is insufficient for robust discrimination: the z-score distributions of removal-attacked and forged images overlap, so no reliable operating point exists.
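Why threshold tuning cannot help once the z-score distributions overlap can be shown with a short sketch. The two Gaussian populations below are purely illustrative assumptions, not measurements from the paper; only the mapping between a target FPR and a z-score threshold is standard.

```python
from statistics import NormalDist

def z_threshold(target_fpr):
    """z-score cutoff whose false positive rate on clean images (z ~ N(0, 1)) is target_fpr."""
    return NormalDist().inv_cdf(1.0 - target_fpr)

def flag_rate(mean_z, std_z, threshold):
    """Fraction of a Gaussian z-score population that the detector flags."""
    return 1.0 - NormalDist(mean_z, std_z).cdf(threshold)

# Purely illustrative populations (not results from the paper):
removed = (1.5, 1.5)  # watermarked images after a removal attack
forged = (4.0, 1.5)   # authentic images after a forgery attack

for fpr in (1e-1, 1e-2, 1e-4, 1e-6):
    t = z_threshold(fpr)
    print(f"FPR={fpr:g}  threshold={t:.2f}  "
          f"attacked watermarks still detected={flag_rate(*removed, t):.2f}  "
          f"forgeries flagged={flag_rate(*forged, t):.2f}")
```

Under these assumed numbers, lowering the threshold far enough to recover attacked watermarks also flags nearly all forgeries, and raising it to reject forgeries misses the attacked watermarks, so moving the threshold only trades one failure for the other.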

Practical and Theoretical Implications

Synthetic Data Filtering and Model Collapse Mitigation

The central application motivating AR watermark development is the filtering of synthetic data during web-scale scraping, to prevent recursive model collapse as outlined in prior work [31]. These findings show that all current AR watermarks—including BitMark—are fundamentally vulnerable either to removal (token-based) or to forgery (bitwise, via Watermark Mimicry), particularly when attackers can access or train proxies of the deployed encoders. This subverts the radioactive property meant to propagate watermark signals through fine-tuned or hybridized models.

Security Model Assumptions

Effective AR watermarking often presupposes secrecy of encoder parameters and watermark keys, but this study shows that even moderate knowledge transfer (proxy encoders of similar architectures, accessible open-source models) suffices for most attack avenues. The lack of cryptographic binding between models, datasets, and watermarks remains a core Achilles’ heel.

Directions for Future Research

These results necessitate new directions in robust multimedia watermarking for generative models:

Embedding schemes that are less transfer-prone or more robust to adversarial manipulations across encoder architectures.
Watermark designs functioning under open-box assumptions with formally proven lower bounds on removability/forgery success rates.
Hybrid approaches leveraging both in-generation and post-hoc watermarking, linked to robust cryptographic primitives.

Conclusion

The robustness evaluation of state-of-the-art AR image watermarking shows that current schemes do not provide the intended security guarantees for synthetic data provenance. Token-based methods are easily removable even with black-box attacks, while BitMark, though highly robust to non-targeted removal, is susceptible to realistic forgery. Adjusting detection thresholds does not address these weaknesses; thus, the challenge of practical, attack-resilient AR watermarking for content provenance and dataset filtering remains unresolved. This work lays the groundwork for further advances in robust watermarking aligned with adversarial threat modeling and open-source AR generation ecosystems.
