White-Box Watermarking Framework: DNN IP Protection
- White-box watermarking frameworks are defined by embedding proprietary marks directly into DNN weights or activations, enabling verifiable IP protection.
- They employ dual-phase methods—adding auxiliary losses during training and extracting with secret keys or neural extractors—to ensure high watermark capacity and security.
- These schemes demonstrate resilience against fine-tuning, pruning, and overwriting attacks, supporting forensic audits and legal verification, though structural obfuscation attacks remain an open threat.
White-box watermarking frameworks constitute a critical class of mechanisms for deep neural network (DNN) intellectual property (IP) protection, wherein the watermark is directly embedded into, and extracted from, internal model artifacts (weights, activations, or learnable modules), so that verification requires white-box access. Unlike black-box watermarking, which relies on externally observable behaviors, white-box schemes are designed for settings where model parameters or internal state can be accessed, such as forensic audits, on-premise deployments, or legally compelled disclosures. These frameworks enable robust, high-capacity watermarking with zero or negligible impact on model utility, while offering strong resilience to removal, fine-tuning, and detection attacks.
1. Formal Structure and Taxonomy
White-box watermarking schemes share a unifying two-phase structure: watermark embedding is realized by adding an auxiliary loss (or architectural mechanism) during training or fine-tuning, while watermark extraction operates over internal representations using secret keys, projections, or extraction DNNs.
Let $f_\theta$ denote a DNN with weights $\theta$. Embedding augments the base loss $\mathcal{L}_0$ with a watermark-specific regularizer $\mathcal{L}_{wm}$:

$$\mathcal{L}(\theta) = \mathcal{L}_0(\theta) + \lambda\,\mathcal{L}_{wm}(\theta, b),$$

where $b \in \{0,1\}^n$ is a secret bit-string. Verification relies on an extraction function operating over either a masked subset of weights or specific activation statistics:

$$\hat{b} = T\big(P \cdot M(\theta)\big),$$

where $P$ is a (possibly learned) projection, $M$ is a selector/mask, and $T$ is a thresholding operator. High-fidelity extraction occurs if $\mathrm{BER}(\hat{b}, b) \le \epsilon$ for small $\epsilon$.
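To make the formalism concrete, the following is a minimal NumPy sketch of a static, Uchida-style instantiation: $M$ flattens one layer's weights, $P$ is a secret Gaussian matrix, and $T$ thresholds at zero. All dimensions are illustrative, and the bare gradient loop stands in for training; in a real scheme the regularizer is added to the task loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_key(n_bits, n_weights, seed=42):
    # Owner-held secret key: a Gaussian projection matrix P, as in Uchida-style schemes
    return np.random.default_rng(seed).standard_normal((n_bits, n_weights))

def wm_regularizer(w, A, b):
    # L_wm: binary cross-entropy pushing sigmoid(A @ w) toward the target bits b
    p = 1.0 / (1.0 + np.exp(-(A @ w)))
    return -np.mean(b * np.log(p + 1e-12) + (1 - b) * np.log(1 - p + 1e-12))

def extract(w, A):
    # T(P . M(theta)): hard-threshold the projections to recover the bit-string
    return (A @ w > 0).astype(int)

# Toy demo: embed by descending the regularizer alone.
n_bits, n_w = 64, 1024
w = 0.05 * rng.standard_normal(n_w)          # stand-in for one flattened layer
b = rng.integers(0, 2, n_bits)               # secret bit-string
A = make_key(n_bits, n_w)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(A @ w)))
    w -= 0.01 * A.T @ (p - b) / n_bits       # analytic gradient of wm_regularizer
print("BER:", np.mean(extract(w, A) != b))   # -> 0.0 after embedding
```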
Historical schemes are differentiated by the nature of features (weights vs. activations), encoding/decoding mechanisms (projections, DNN extractors), and auxiliary components (adversarial regularization, key-based perturbation, passport matrices). A comprehensive classification is found in (Bellafqira et al., 2022), which formalizes static and dynamic taxonomies.
2. Embedding Mechanisms and Representative Frameworks
2.1 Static Weight-based Embedding
Early schemes such as Uchida et al. (cf. (Bellafqira et al., 2022)) and extensions (e.g. RIGA (Wang et al., 2019)) canonicalize feature extraction as the selection and aggregation of weights $w$ from a specific layer, followed by projection to watermark space via a secret matrix $P$: the network (or a chosen subset of its parameters) is trained so that the thresholded projections $T(Pw)$ yield the prescribed bits.
RIGA (Wang et al., 2019) innovates by including an adversarial covertness component: a discriminator $D$ distinguishes watermarked from benign models, while an extractor network $E$ is trained to recover the owner's message $b$ from selected features $g(\theta)$. The generator loss is

$$\mathcal{L}_{gen}(\theta) = \mathcal{L}_0(\theta) + \lambda_1\,\mathcal{L}_{wm}\big(E(g(\theta)),\, b\big) - \lambda_2 \log\big(1 - D(g(\theta))\big),$$

ensuring both watermark extractability and resistance to property-inference attacks.
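To make the three-player setup concrete, here is a hedged PyTorch sketch of the objective; the module shapes and names (`extractor`, `detector`) are illustrative assumptions, not RIGA's exact architecture.

```python
import torch
import torch.nn as nn

n_feat, n_bits = 1024, 64
extractor = nn.Sequential(nn.Linear(n_feat, 256), nn.ReLU(), nn.Linear(256, n_bits))
detector = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def generator_loss(task_loss, w_feat, bits, lam1=1.0, lam2=0.1):
    """w_feat: selected watermark-carrying weights g(theta), shape (1, n_feat);
    bits: owner's message b as a float tensor of shape (1, n_bits)."""
    wm_loss = bce(extractor(w_feat), bits)             # message must be extractable
    covert = bce(detector(w_feat), torch.zeros(1, 1))  # push detector toward "benign"
    return task_loss + lam1 * wm_loss + lam2 * covert

# The detector itself is trained in alternation on watermarked vs. benign
# weight features, as in standard adversarial training.
```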
2.2 Activation-based and Dynamic Embedding
Schemes such as DeepSigns (Rouhani et al., 2018) and DICTION (Bellafqira et al., 2022) operate in the space of intermediate activations, often tying the watermark to the statistical structure (e.g. means of Gaussian Mixture Model clusters) under specific “trigger” inputs. DeepSigns regularizes the network so that projections of GMM means onto random directions match the owner’s signature.
DICTION (Bellafqira et al., 2022) generalizes DeepSigns to the latent space: a "generator" consisting of the model's first $l$ layers produces features from random latent vectors $z$, which are mapped via a learned extractor DNN $E$, trained adversarially, to distinguish watermarked models and reconstruct the proper watermark bits. The objective includes

$$\mathcal{L}_{wm} = \mathbb{E}_{z \sim p(z)}\Big[\mathcal{L}_{BCE}\big(E(f_{1:l}(z)),\, b\big)\Big],$$

with $z$ sampled from a high-entropy prior, yielding high-capacity, robust, and attack-resistant embeddings.
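A minimal DICTION-style sketch of this loss follows; dimensions and module names are illustrative, and the two-layer `front` stands in for the protected model's first $l$ layers.

```python
import torch
import torch.nn as nn

latent_dim, hidden, n_bits = 32, 128, 64
front = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU())  # first l layers (stand-in)
extractor = nn.Linear(hidden, n_bits)                            # learned extractor DNN (simplified)
bce = nn.BCEWithLogitsLoss()

def diction_wm_loss(bits, n_triggers=16):
    """bits: float tensor of shape (1, n_bits); this term is added to the task loss."""
    z = torch.rand(n_triggers, latent_dim)                  # high-entropy latent prior (uniform)
    acts = front(z)                                         # trigger-conditioned activations
    return bce(extractor(acts), bits.expand(n_triggers, -1))
```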
2.3 Specialized LoRA and Diffusion-Model Watermarks
Recent work addresses the unique requirements of parameter-efficient adaptation (LoRA) and generative diffusion models. SEAL (Oh et al., 16 Jan 2025) embeds a secret, non-trainable matrix (“passport”) between standard LoRA weight factors, then entangles this object with trainable weights via stochastic switching during fine-tuning. For image diffusion models, AquaLoRA (Feng et al., 18 May 2024) introduces a two-stage protocol: latent watermark pre-training followed by prior-preserving fine-tuning of a LoRA module, integrating bit-string-dependent scaling matrices for flexible ownership updates.
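The algebraic core of SEAL's passport construction and its pseudo-inverse recovery can be sketched as follows; dimensions are illustrative, and the sketch deliberately omits fine-tuning and the stochastic switching that entangles the passport with the trainable factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                                  # model width and LoRA rank (illustrative)
B = rng.standard_normal((d, r))                # trainable LoRA factors
A = rng.standard_normal((r, d))
C = rng.standard_normal((r, r))                # secret, non-trainable passport

delta_W = B @ C @ A                            # merged adapter update seen by an adversary

# Ownership check: with the private factors B and A, pseudo-inverses recover C
C_hat = np.linalg.pinv(B) @ delta_W @ np.linalg.pinv(A)
print(np.allclose(C_hat, C))                   # True: passport statistically verifiable
```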
3. White-Box Extraction and Ownership Verification
Extraction relies on model internals and owner-held secrets. Common approaches include:
- Projection and Thresholding: For static schemes, extraction projects selected weights or activation means through the secret key and thresholds the result to bits.
- Neural Extractors: For dynamic/adversarial frameworks (RIGA, DICTION), a small DNN is used to reconstruct the bit-string from feature vectors.
- Passport Matrix Recovery: In LoRA watermarking, owners use private original factors and pseudo-inverses to reconstruct and statistically test the passport’s presence (Oh et al., 16 Jan 2025).
- Bitwise Consistency Checks: Fragile schemes (Gao et al., 2023) partition each parameter’s bits and enforce local mutual/self-checks for tamper-detection and recovery.
Verification fidelity is quantified by the bit-error rate (BER). Watermark recovery is robust under fine-tuning, pruning, and overwriting (e.g., RIGA, DeepSigns, DICTION), but fragile under structural attacks unless specifically hardened (see Section 4.2).
4. Robustness, Security Analysis, and Attacks
4.1 Standard Threat Model and Defenses
The white-box threat model assumes adversaries with access to all parameters and the architecture, but without the owner's keys. Robustness is routinely evaluated against:
- Fine-tuning: Watermark persists unless model accuracy is severely compromised (Rouhani et al., 2018, Wang et al., 2019, Bellafqira et al., 2022).
- Pruning: Watermark recovery is unaffected at up to 90–99% sparsity, as long as model accuracy holds (Rouhani et al., 2018, Wang et al., 2019, Bellafqira et al., 2022).
- Overwriting: Attackers embedding new watermarks in the same layer fail to erase the original (BER remains zero) in RIGA, DeepSigns, and DICTION.
4.2 Advanced Structural and Functional Attacks
Structural obfuscation attacks (dummy neurons, neuron permutation, scaling) expose a fundamental vulnerability of mainstream white-box watermarks (Yan et al., 2023). By inserting functionally inert yet stealthy neurons, via constructs such as NeuronClique and NeuronSplit, adversaries can render the extraction mappings invalid while preserving model utility. These manipulations (3–5% extra neurons) drive BER above decision thresholds (≈55–75%), defeating ownership verification across nine tested schemes.
Permutation of neuron order (with compensatory downstream inverse permutations) invalidates activation-based extraction unless remedied by a domain-specific alignment step: the neuron alignment framework (Li et al., 2021) encodes per-neuron response codes via ECC, enabling order recovery and blocking permutation-only obfuscation.
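The permutation attack, and why it is function-preserving, can be demonstrated in a few lines of NumPy (layer sizes are arbitrary):

```python
import numpy as np

# Function-preserving permutation obfuscation: reorder layer l's neurons and
# apply the inverse reordering to the next layer's input weights.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((64, 32)), rng.standard_normal(64)  # layer l
W2 = rng.standard_normal((10, 64))                               # layer l+1
relu = lambda v: np.maximum(v, 0.0)

x = rng.standard_normal(32)
y = W2 @ relu(W1 @ x + b1)

perm = rng.permutation(64)
W1p, b1p = W1[perm], b1[perm]     # permute neuron order in layer l
W2p = W2[:, perm]                 # compensate downstream
yp = W2p @ relu(W1p @ x + b1p)
print(np.allclose(y, yp))         # True: same function, but neuron-indexed extraction breaks
```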
5. Covertness, Capacity, and Trade-offs
A core strength of adversarially regularized frameworks (e.g., RIGA) is statistical covertness: the distribution of weights in watermarked and benign models is rendered indistinguishable, with property-inference attack accuracy falling to near-random levels (≈55–60%, vs. 50% for chance) (Wang et al., 2019). DICTION further achieves indistinguishable activation statistics (Bellafqira et al., 2022).
Capacity is typically constrained by the width of the feature vector, the embedding layer size, and the strength of the enforced regularizers. RIGA, DeepSigns, and DICTION attain embeddings of up to 128–256 bits with negligible accuracy loss. FreeMark (Chen et al., 16 Sep 2024), leveraging cryptographic key generation without model modification, pushes watermark length to 512 bits with zero performance overhead.
Capacity-fidelity trade-offs are systematically explored (Rouhani et al., 2018, Bellafqira et al., 2022): increasing bit-length or regularization strength beyond empirically determined thresholds leads to observable accuracy degradation.
6. Specialized Extensions and Practical Considerations
White-box watermarking schemes have been extended to fragile/tamper-detection regimes, LoRA-trained models, and generative models (GANs, diffusion). Specific innovations include:
- Fragile Embeddings: Self-mutual check frameworks (Gao et al., 2023) enable precise per-parameter tamper detection and recovery of original weights, provided the fraction of tampered parameters remains under 20% (a toy bit-partition check is sketched after this list).
- GANs and Diffusion Models: Wide-flat-minimum loss landscapes are enforced to ensure watermark persistence under parameter noise and surrogate modeling (Fei et al., 2023), while AquaLoRA (Feng et al., 18 May 2024) achieves both hot-swappable secrets and resilience across samplers, distortion types, and checkpoint averaging.
- LoRA Security: Passport entanglement with LoRA weights, as in SEAL, provides resistance to pruning, low-rank obfuscation, and ambiguity (forged passport) attacks, with robust statistical verification (Oh et al., 16 Jan 2025).
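As a hedged illustration of the bitwise consistency idea, and not the exact Gao et al. construction, the following sketch stores an 8-bit checksum of each float32 parameter's upper 24 bits in its lowest mantissa bits, so any tamper flips the per-parameter check:

```python
import numpy as np

def embed_fragile(w):
    # Replace each parameter's 8 least-significant mantissa bits with a
    # checksum of its remaining 24 bits (illustrative layout).
    bits = w.astype(np.float32).view(np.uint32)
    upper = bits >> np.uint32(8)
    check = (upper ^ (upper >> np.uint32(8)) ^ (upper >> np.uint32(16))) & np.uint32(0xFF)
    return ((upper << np.uint32(8)) | check).view(np.float32)

def verify_fragile(w):
    bits = w.astype(np.float32).view(np.uint32)
    upper = bits >> np.uint32(8)
    check = (upper ^ (upper >> np.uint32(8)) ^ (upper >> np.uint32(16))) & np.uint32(0xFF)
    return (bits & np.uint32(0xFF)) == check           # per-parameter tamper flags

w = np.random.default_rng(1).standard_normal(1000).astype(np.float32)
w_marked = embed_fragile(w)
assert verify_fragile(w_marked).all()
b = w_marked.view(np.uint32)
b[7] ^= np.uint32(1 << 20)                             # tamper one mantissa bit of parameter 7
print(np.flatnonzero(~verify_fragile(w_marked)))       # -> [7]
```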
Implementation notes emphasize the need for efficient key management, secrecy of trigger sets, and safeguarding of extraction parameters.
7. Open Challenges and Future Directions
While contemporary frameworks offer high robustness against standard removal and fine-tuning attacks, the threat posed by structural obfuscations necessitates a shift toward extraction mechanisms invariant under permutation, scaling, and layer-width changes (Yan et al., 2023). Neuron alignment (Li et al., 2021) and dimension-agnostic extraction (e.g., pooling, random projections) are promising directions.
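As a toy illustration of dimension-agnostic extraction (not a published scheme), permutation-invariant summaries such as weight quantiles survive neuron reordering and could serve as a watermark carrier in place of neuron-indexed features:

```python
import numpy as np

def perm_invariant_features(W, n_bins=64):
    # Quantiles of the layer's weight distribution are invariant to neuron
    # permutation, unlike features indexed by neuron position.
    return np.quantile(W.ravel(), np.linspace(0.0, 1.0, n_bins))

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
f1 = perm_invariant_features(W)
f2 = perm_invariant_features(W[rng.permutation(64)])  # neuron-permuted layer
print(np.allclose(f1, f2))                            # True: carrier survives permutation
```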
Emerging schemes (FreeMark) introduce cryptographic principles—off-model secret keys, one-way trapdoor mappings—heralding a new paradigm in white-box watermarking (Chen et al., 16 Sep 2024). Remaining challenges include formalization of one-wayness, dynamic adaptation to evolving models, scaling to extremely large architectures, and integration with semi-black-box or behavioral triggers for hybrid protection.
The domain continues to evolve in response to a rapidly advancing adversarial landscape and the increasing commercial importance of deep model IP.