Decoder Gradient Shields for Box-Free Watermarking
- Decoder Gradient Shields (DGSs) are a family of provable defenses that reorient, rescale, or perturb decoder gradients to obstruct watermark removal without degrading output quality.
- They comprise three variants—DGS-O, DGS-I, and DGS-L—that intervene at the output, input, and intermediate layers respectively, each balancing security with computational cost.
- Empirical results demonstrate that DGSs maintain high image fidelity (PSNR > 30 dB) and robustness across various attack models, proving effective in deep image-to-image watermarking.
Decoder Gradient Shields (DGSs) are a family of provable, high-fidelity defenses designed to prevent gradient-based removal of invisible watermarks in box-free watermarking pipelines for deep image-to-image and generative models. By reorienting, rescaling, or orthogonally perturbing gradients at various points in the decoder's dataflow, DGSs ensure that an attacker who leverages a decoder's output and its backpropagated gradients to optimize a watermark remover cannot achieve convergence, while fidelity for legitimate watermark extraction remains uncompromised (An et al., 28 Feb 2025, An et al., 17 Jan 2026).
1. Threat Model in Box-Free Watermarking
Box-free watermarking for deep image-to-image models employs a private encoder $E$ to embed a watermark $W$ into a model output $x$, producing a watermarked image $x_w$. The decoder $D$, typically deployed as a black-box API, extracts $W$ from watermarked images or a null-mark $W_0$ from non-watermarked images.
A primary vulnerability arises when an adversary queries the protected model to harvest input-output pairs and trains a removal network $R$ seeking to generate $\hat{x} = R(x_w)$ such that the decoder extracts the null-mark $W_0$ from $\hat{x}$. The attacker minimizes the removal loss $\mathcal{L}_{\mathrm{rm}} = \| D(R(x_w)) - W_0 \|$ by exploiting access to the gradients of $\mathcal{L}_{\mathrm{rm}}$, either directly or via black-box estimation, enabling $R$ to converge to an inverse mapping of the watermark encoder (An et al., 17 Jan 2026).
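To make the attacker's optimization concrete, the following toy sketch uses a hypothetical linear "decoder" `A`, a linear remover `M`, and a zero null-mark `w0` (none of these are the papers' architectures) to show how access to unshielded decoder gradients lets the removal loss converge:

```python
import numpy as np

# Hypothetical linear stand-ins, for illustration only: decoder D(x) = A @ x,
# remover R(x) = M @ x, zero null-mark w0. Not the papers' actual models.
rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d)) / np.sqrt(d)   # toy "decoder"
w0 = np.zeros(d)                           # null-mark target
x_w = rng.normal(size=d)                   # flattened watermarked image

M = np.eye(d)                              # remover initialized to identity
lr, losses = 0.005, []
for _ in range(500):
    z = A @ (M @ x_w)                      # decoder output on "removed" image
    # Chain rule through the decoder: dL/dM = A^T (z - w0) x_w^T
    M -= lr * np.outer(A.T @ (z - w0), x_w)
    losses.append(0.5 * float(np.dot(z - w0, z - w0)))

# With unshielded gradients, the removal loss collapses over training
print(losses[0], "->", losses[-1])
```

The point of the sketch is only that honest gradients give the attacker a reliable descent signal; the DGS variants below corrupt exactly this signal.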
2. Core Concepts and Variants of Decoder Gradient Shields
Decoder Gradient Shields intervene at strategic locations within the decoder pipeline to neutralize or mislead the gradient signal exploited by attackers. All variants operate without perceptible degradation in legitimate watermark extraction or output image quality. The three principal DGS variants are:
- DGS at the Output (DGS-O): Applies a closed-form, deterministic transformation at the decoder’s output, reorienting and rescaling the backpropagated gradient.
- DGS at the Input (DGS-I): Introduces additively small, adversarially distributed noise at the decoder input, precisely orthogonal to the attacker's loss gradient, thereby neutralizing its effect.
- DGS in the Layers (DGS-L): Injects orthogonal perturbations into arbitrary internal layers of the decoder, further obfuscating the gradient flow and concealing the defense’s presence.
Each variant guarantees a fundamental disruption to the chain of gradients the attacker depends on, with DGS-O prioritizing minimal cost and universal deployability, DGS-I conferring greater security with minor latency overhead, and DGS-L balancing hiddenness with moderate computational cost (An et al., 17 Jan 2026).
3. Mathematical Formulation and Implementation
DGS-O: Closed-Form Output Transformation
The DGS-O variant defines the transformation $Z^* = -P Z + (P + I)\, W$, where $Z = D(S)$ is the vanilla decoder output for query $S$, $W$ is the reference watermark, $P$ is a positive-definite diagonal matrix with small entries, and $I$ is the identity. The transformation is applied only if $\mathrm{NCC}(Z, W) > \tau$, where $\mathrm{NCC}$ denotes normalized cross-correlation and $\tau$ is a detection threshold. Note that for $Z = W$ the mapping returns $W$ exactly, so legitimate extraction is preserved.
This mapping provably rotates the gradient's direction by between $90^\circ$ and $180^\circ$ and scales its norm by the spectrum of $P$, ensuring that any descent step on the attacker's loss cannot make progress toward removing the watermark. Generic pseudocode for DGS-O:
```python
import numpy as np

def normalized_correlation(z, w):
    return np.dot(z, w) / (np.linalg.norm(z) * np.linalg.norm(w) + 1e-12)

def dgs_decoder(S, D, W, P, tau):
    Z = D(S)  # vanilla decoder output
    if normalized_correlation(Z, W) > tau:  # watermark detected: shield it
        return -P @ Z + (P + np.eye(P.shape[0])) @ W  # Z* = -P Z + (P + I) W
    return Z  # non-watermarked query: pass through unchanged
```
DGS-I and DGS-L: Orthogonal Perturbations
For DGS-I:
- Perturb the decoder input $S$ by a noise vector $\delta$ such that $\langle \delta, \nabla_S \mathcal{L} \rangle = 0$ (orthogonal to the attacker's loss gradient) and $\|\delta\| \leq \epsilon$.
- The decoder then operates on the shielded input: $Z = D(S + \delta)$.
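A minimal sketch of constructing such a perturbation, assuming random noise is projected orthogonal to a (hypothetically available) loss gradient and rescaled to a budget `eps`:

```python
import numpy as np

def orthogonal_perturbation(grad, eps, rng):
    """Sample noise, remove its component along `grad`, rescale to `eps`."""
    n = rng.normal(size=grad.shape)
    g = grad / (np.linalg.norm(grad) + 1e-12)   # unit vector along the gradient
    delta = n - np.dot(n, g) * g                # now <delta, grad> = 0
    return eps * delta / (np.linalg.norm(delta) + 1e-12)

rng = np.random.default_rng(1)
grad = rng.normal(size=64)
delta = orthogonal_perturbation(grad, eps=0.05, rng=rng)
print(abs(np.dot(delta, grad)))   # ~0: orthogonal to the loss gradient
print(np.linalg.norm(delta))      # ~0.05: within the norm budget
```

Because $\delta$ carries no component along the attacker's gradient, the perturbation shifts the decoder's response without revealing (or reinforcing) the descent direction.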
For DGS-L:
- At internal layer $l$, intermediate features $h_l$ are perturbed as $h_l^* = h_l + \delta_l$, with similar orthogonality and norm constraints on $\delta_l$.
- Only later layers process the perturbed signal, reducing the exposure of the shield (An et al., 17 Jan 2026).
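The layer-level injection can be sketched with a hypothetical two-layer numpy "decoder" (the weights, sizes, and activation here are illustrative, not the deployed architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-layer "decoder": h = relu(W1 @ s), out = W2 @ h
W1 = rng.normal(size=(16, 8)) * 0.5
W2 = rng.normal(size=(4, 16)) * 0.5

def decoder_with_dgs_l(s, layer_grad=None, eps=0.01):
    h = np.maximum(W1 @ s, 0.0)                # intermediate features h_l
    if layer_grad is not None:                 # shield active at this layer
        n = rng.normal(size=h.shape)
        g = layer_grad / (np.linalg.norm(layer_grad) + 1e-12)
        delta = n - np.dot(n, g) * g           # orthogonal to the layer gradient
        h = h + eps * delta / (np.linalg.norm(delta) + 1e-12)
    return W2 @ h                              # only later layers see the shift

s = rng.normal(size=8)
clean = decoder_with_dgs_l(s)
shielded = decoder_with_dgs_l(s, layer_grad=rng.normal(size=16))
print(np.linalg.norm(shielded - clean))        # small: bounded by eps * ||W2||
```

Injecting deep inside the network means the output perturbation is filtered through the remaining layers, which is what makes the shield harder to detect from the API response alone.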
4. Theoretical Guarantees
Gradient Reorientation and Attenuation
Under the DGS-O transformation, the attacker's loss gradient $\nabla$ is provably replaced by $\nabla^* = -P^\top \nabla$, where $P$ is the shield's positive-definite diagonal matrix. The angle between $\nabla^*$ and $\nabla$ satisfies $\cos\angle(\nabla^*, \nabla) < 0$ (i.e., it exceeds $90^\circ$), and $\|\nabla^*\| \leq \lambda_{\text{max}} \|\nabla\|$ for diagonal $P$ with largest entry $\lambda_{\text{max}}$ (An et al., 28 Feb 2025).
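A quick numeric check of these two properties, assuming the diagonal form $\nabla^* = -P\nabla$ obtained by differentiating the DGS-O mapping (the gradient vector and the entries of $P$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
grad = rng.normal(size=d)               # attacker's true gradient
lam = rng.uniform(0.01, 0.1, size=d)    # diagonal of P: small positive entries
grad_star = -lam * grad                 # shielded gradient: -P @ grad for diagonal P

cos = np.dot(grad_star, grad) / (np.linalg.norm(grad_star) * np.linalg.norm(grad))
print(cos)                              # negative: angle exceeds 90 degrees
print(np.linalg.norm(grad_star) <= lam.max() * np.linalg.norm(grad))
```

Since every entry of $P$ is positive, $\langle \nabla^*, \nabla \rangle = -\nabla^\top P \nabla < 0$ holds for any nonzero gradient, so a step along $-\nabla^*$ can never be a descent step for the attacker's true loss.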
A no-convergence result is proved: any gradient-based attacker observing only the shielded outputs cannot reduce the removal loss below its initial value minus an exponentially small term, precluding effective removal of the watermark.
For DGS-I and DGS-L, the sequence of orthogonally randomized perturbations ensures that gradient descent accumulates uncorrelated error, with the effective descent direction rendered random and target loss unattainable (An et al., 17 Jan 2026).
5. Empirical Performance and Evaluation
Experimental Setup
- Tasks: Image deraining (PASCAL VOC) and text-to-image (Stable Diffusion) transformations.
- Watermarking: Encoder–decoder pipeline of Zhang et al., with jointly trained $E$ and $D$.
- Attacker: U-Net-based watermark remover, trained for 100 epochs with the Adam optimizer.
- Metrics: Fidelity (PSNR, MS-SSIM), robustness (Success Rate, SR).
Experimental Findings
Without DGS, all forms of removal loss converge (i.e., the remover successfully erases the watermark). With DGS-O, DGS-I, or DGS-L:
- The removal Success Rate (SR) collapses across all tasks and removal loss functions, with the attack loss stalling well above its unshielded convergence level.
- Decoder outputs under DGS remain perceptually indistinguishable from their unshielded counterparts, with high PSNR and MS-SSIM.
- Robustness persists under JPEG compression (quality 10–40\%), additive Gaussian noise (0–30 dB), lattice attacks, and attempted sign-flips of returned gradients. Extreme degradations reduce PSNR, but watermark extraction remains reliable even under severe perturbations (An et al., 28 Feb 2025, An et al., 17 Jan 2026).
Computational overhead is minimal: DGS-O incurs millisecond-scale latency per query, DGS-I $0.06$–$0.07$ s/query, and DGS-L $0.03$–$0.06$ s/query, depending on implementation specifics.
6. Limitations and Practical Considerations
While DGSs are effective against gradient-based removal attacks that interact with the deployed decoder, several open issues exist:
- Invertibility: DGS-O's linear formulation could potentially be inverted if an attacker estimates $P$ and $W$ through repeated queries, though the use of a randomized or low-norm $P$ mitigates straightforward inversion.
- Attack Model Coverage: All current DGS forms protect only against removal attacks that actively query the decoder. If attackers train a surrogate decoder independently (“surrogate bypass”), DGSs offer no defense.
- Parameter Selection: The strength and effectiveness of the shield depend on the selection of $P$ and $\tau$, the perturbation budget $\epsilon$, and adaptation to attacker-specific query patterns.
- Computational Cost: For extremely high-resolution images, the requirement to store and multiply by $P$, especially if dense, may introduce overhead. Structured (e.g., diagonal) $P$ can alleviate this (An et al., 28 Feb 2025, An et al., 17 Jan 2026).
7. Extensions and Future Research Directions
Future developments may focus on:
- Combination of gradient shields with data augmentation during decoder training to defend against surrogate bypass.
- Adaptive DGSs that randomize shield parameters per query, broadening unpredictability and resilience.
- Layerwise variation and randomization of shield injection for further robustness (“layer-wise randomization”).
- Analysis of tighter theoretical lower bounds in scenarios with repeated, randomized orthogonal shield application.
- Exploration of non-linear shield mappings, as well as integration with predicted-label poisoning and adversarial-training–style defenses (An et al., 28 Feb 2025, An et al., 17 Jan 2026).
A plausible implication is that adopting DGSs represents the current state of the art for defending decoder APIs in box-free watermarking against gradient-based watermark removal, but systematic extension to broader attack models remains an open research frontier.
References:
- "Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal" (An et al., 28 Feb 2025)
- "Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal" (An et al., 17 Jan 2026)