White-box Watermarking Frameworks

Updated 7 April 2026

White-box watermarking frameworks are IP protection techniques that embed ownership information directly into model internals such as weights or activations, enabling forensic audits and legal verification.
They deploy diverse methods like static weight regularization, dynamic activation embedding, GAN-based adversarial techniques, and non-invasive key-based verification to ensure robustness.
Current research focuses on enhancing watermark robustness, stealth, and capacity while maintaining negligible model accuracy loss under fine-tuning, pruning, and structural attacks.

White-box watermarking frameworks comprise a class of model IP protection techniques characterized by embedding ownership information directly into the internal parameters or latent representations of deep learning models. Unlike black-box watermarking, where extraction operates solely through input–output behavior, white-box schemes rely on full access to model internals at verification time. This makes them effective under scenarios of forensic investigation, audit, and legal dispute resolution, but also exposes them to powerful attacks that exploit the model’s structural transparency. The design space of white-box watermarking has produced diverse methods including static weight regularization, dynamic activation embedding, GAN-based strategies, non-invasive key-based extraction, entangled low-rank adaptation protection, fragile watermarking for tamper detection, and chaos-driven identification. Current research aims to maximize watermark robustness, stealth, and capacity while maintaining negligible model accuracy loss.

1. Taxonomy and Core Principles

White-box watermarking frameworks can be categorized by the location of the embedded message and the operational paradigm:

Weight-based (static) schemes encode ownership directly into trained weights using auxiliary regularizers during training or through post-hoc modification. The extraction process typically projects (or applies a small neural network to) these weights and thresholds the result to recover the watermark.
Activation-based (dynamic) methods encode information in representations (e.g., the means or probability density functions of activations) arising at specific layers, often linked to secret trigger sets. Dynamic embedding increases resilience to attacks but requires keeping the triggers and extraction projections as secret keys.
Passport-based and entangled schemes incorporate non-trainable matrices, “passports,” or related constructs within the model’s parameterization, such that removal or alteration of the secret disables correct task performance.
GAN-based and adversarial hiding techniques use adversarial minimax training (discriminator and generator) to ensure that watermarked models are statistically indistinguishable from clean counterparts and resistant to detection attacks.
Non-invasive schemes (e.g., FreeMark) forgo any weight modification, relying instead on model-specific secret keys and cryptographic protocols to verify ownership from the model's activations.
Fragile watermarking frameworks (e.g., those with self-mutual check bits) focus on tamper-detection, localization, and recoverability in addition to conventional ownership proof.
Chaos-based approaches inject ownership data as unique chaotic patterns in model parameters, verifiable only through white-box access with matching chaotic seed recovery.

The essential criteria across these approaches are high fidelity (accuracy preservation), sufficient capacity for identifying bits, robustness to model modification (fine-tuning, pruning, overwriting), and extraction reliability under varied adversarial models.
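To make the weight-based (static) category concrete, the following is a minimal sketch of projection-and-threshold embedding and extraction. It is not the procedure of any specific framework named here. The names `embed_static`, `extract_static`, the secret matrix `X`, and all sizes are assumptions chosen for illustration, and the watermark term is minimized alone rather than jointly with a task loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embed_static(w, X, bits, lr=0.1, steps=500):
    """Gradient descent on the cross-entropy between sigmoid(X @ w) and bits.

    In a real scheme this term would be added to the task loss as a
    regularizer; here it is minimized on its own to keep the sketch short.
    """
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - bits) / len(bits)
    return w

def extract_static(w, X):
    """Project the weights with the secret matrix and threshold at zero."""
    return (X @ w > 0).astype(int)

rng = np.random.default_rng(0)
w0 = rng.normal(scale=0.1, size=256)   # hypothetical flattened layer weights
X = rng.normal(size=(32, 256))         # hypothetical secret projection matrix
bits = rng.integers(0, 2, size=32)     # 32-bit ownership message

w_marked = embed_static(w0, X, bits)
recovered = extract_static(w_marked, X)
print("bit errors:", int(np.sum(recovered != bits)))
```

The secret matrix plays the role of the key: without it, the thresholded projection of the weights carries no recognizable message.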

2. Representative Frameworks

DeepSigns: Dynamic Activation PDF Watermarking

DeepSigns formulates watermark embedding as the joint optimization

$L_{total}(\theta) = L_{task}(\theta;D) + \lambda\,L_{wm}(\theta;W)$

where $L_{task}$ is the standard task loss and $L_{wm}$ penalizes the deviation of selected activation cluster means $\mu_j$ from secret targets $s_j$ and adds a cross-entropy penalty on the signature projections. Extraction computes empirical means on a probe set, projects them with the secret matrix $A_j$ to recover bitstrings, and checks the Hamming distance to the true watermark. DeepSigns is reported to survive high-rate pruning, aggressive fine-tuning, and watermark overwriting, with zero bit error rate until the model itself collapses (Rouhani et al., 2018).
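The extraction step described above can be sketched as follows. This is an illustration under stated assumptions, not the DeepSigns implementation: the trained model is replaced by synthetic activations whose mean is constructed (via least squares) to project onto the signature, and `extract_signature`, the dimensions, and the noise level are hypothetical.

```python
import numpy as np

def extract_signature(activations, A):
    """Average probe-set activations, project with the secret matrix, threshold."""
    mu = activations.mean(axis=0)
    return (mu @ A > 0).astype(int)    # equivalent to sigmoid(mu @ A) > 0.5

rng = np.random.default_rng(1)
M, N = 64, 16                          # activation width, signature length
A = rng.normal(size=(M, N))            # secret projection matrix
signature = rng.integers(0, 2, size=N)

# Stand-in for training: pick a cluster mean whose projection has the
# desired signs, then simulate noisy activations around it.
target_logits = 4.0 * (2 * signature - 1)
mu_star = np.linalg.lstsq(A.T, target_logits, rcond=None)[0]
activations = mu_star + 0.1 * rng.normal(size=(200, M))

recovered = extract_signature(activations, A)
hamming = int(np.sum(recovered != signature))
print("Hamming distance to signature:", hamming)
```

Averaging over the probe set is what gives the scheme slack: per-sample noise shrinks in the mean, so the projected signs stay stable unless the cluster mean itself drifts.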

RIGA: Adversarially Hidden Weight-Based Watermarking

RIGA introduces a neural network–based extractor $E_\theta$ for recovering the watermark, coupled with a discriminator that pushes the watermarked model’s weights to closely match the distribution of non-watermarked weights. The training process iteratively updates the extractor, discriminator, and model weights via a joint minimax loss. This adversarial regularization thwarts property inference attacks, substantially increases embedding capacity, and achieves resilience to fine-tuning, weight pruning, and overwriting without accuracy loss (Wang et al., 2019).

DICTION: Dynamic GAN-Style Embedding

DICTION generalizes DeepSigns by using a small neural discriminator (“projection network”) trained adversarially on latent-space triggers, removing the reliance on real-image triggers. The white-box extraction process remains robust under significant parameter pruning, fine-tuning, and overwriting, maintaining zero BER under attacks that defeat prior schemes. Experimental results show that DICTION achieves both higher capacity and stronger robustness on standard image benchmarks than DeepSigns and static methods (Bellafqira et al., 2022).

FreeMark: Non-Invasive, Key-Based Verification

FreeMark shifts embedding complexity into secret-key generation by leveraging host model activations, without altering model parameters. The owner generates secret matrix-key pairs used to extract a binary watermark from the mean activations of a trigger set. This eliminates accuracy degradation, achieves high capacity (512 bits), and resists removal via fine-tuning or pruning, provided the secret keys and the trusted third-party extraction remain uncompromised (Chen et al., 2024).
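The non-invasive idea, where the key rather than the model absorbs the watermark, can be illustrated with a toy construction. This is explicitly not FreeMark's key-generation procedure: the XOR mask, the random matrix `A`, and the functions `make_key` and `extract` are assumptions invented for this sketch, and the mean activations are random stand-ins.

```python
import numpy as np

def project(mean_act, A):
    """Sign pattern of the projected mean activations."""
    return (A @ mean_act > 0).astype(np.uint8)

def make_key(mean_act, watermark, rng):
    """Derive a key (A, mask) so that extraction on this model yields watermark."""
    A = rng.normal(size=(watermark.size, mean_act.size))
    mask = project(mean_act, A) ^ watermark
    return A, mask

def extract(mean_act, key):
    A, mask = key
    return project(mean_act, A) ^ mask

rng = np.random.default_rng(3)
host_act = rng.normal(size=128)    # hypothetical trigger-set mean activations
other_act = rng.normal(size=128)   # an unrelated model's mean activations
watermark = rng.integers(0, 2, size=512, dtype=np.uint8)

key = make_key(host_act, watermark, rng)
ber_host = float(np.mean(extract(host_act, key) != watermark))
ber_other = float(np.mean(extract(other_act, key) != watermark))
print(ber_host, ber_other)
```

No model parameter is touched, so fidelity loss is zero by construction. In this toy version anyone could derive a key for any model, which illustrates why the text above conditions security on secret keys and trusted third-party extraction.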

SEAL: Entangled LoRA Watermarking

SEAL secures LoRA-adapted weights by inserting a non-trainable passport matrix between the LoRA factors during training. The task-specific LoRA modules become entangled with the passport, making correct output impossible if the passport is perturbed or omitted. SEAL supports efficient verification either by statistical extraction or by fidelity-drop tests across two co-trained passports, and remains detectable under pruning, fine-tuning, SVD-based obfuscation, and ambiguity attacks (Oh et al., 16 Jan 2025).
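The entanglement between LoRA factors and a passport can be sketched numerically. The snippet assumes the weight update has the form $B\,C\,A$ with a fixed passport $C$ between the factors, and it replaces training with a closed-form fit. The orthogonal passports, the shapes, and the helper names are hypothetical and not taken from SEAL.

```python
import numpy as np

def random_orthogonal(r, rng):
    """Random r x r orthogonal matrix, used here as a stand-in passport."""
    q, _ = np.linalg.qr(rng.normal(size=(r, r)))
    return q

def lora_delta(B, C, A):
    """Weight update with the passport C placed between the LoRA factors."""
    return B @ C @ A

rng = np.random.default_rng(2)
d_out, d_in, r = 32, 48, 4
passport = random_orthogonal(r, rng)   # owner's passport
forged = random_orthogonal(r, rng)     # attacker's substitute

# Stand-in for training: choose factors so that B @ passport @ A equals
# the low-rank update the task requires.
U = rng.normal(size=(d_out, r))
V = rng.normal(size=(r, d_in))
target = U @ V
B = U
A = passport.T @ V                     # orthogonal passport: inverse is transpose

def rel_error(C):
    return np.linalg.norm(lora_delta(B, C, A) - target) / np.linalg.norm(target)

err_owner = rel_error(passport)
err_forged = rel_error(forged)
print(err_owner, err_forged)
```

The factors only reproduce the intended update in combination with the owner's passport, which is the basis of a fidelity-drop style check.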

AquaLoRA: LoRA Scaling and Prior-Preserving Fine-Tuning

AquaLoRA, designed for Stable Diffusion, encodes secret bits via scaling matrices in all LoRA adapters. Prior-Preserving Fine-Tuning (PPFT) anchors model prediction distributions to the original, ensuring high-fidelity watermark embedding. Robustness is empirically validated under image-level distortions and model modification, with extraction relying on an EfficientNet-based secret decoder (Feng et al., 2024).

Fragile Watermarking with Self-Mutual Check Bits

This scheme encodes information bits and check bits in mutually intertwined patterns within the bits of the parameters. Verification can detect tampering, localize it, and restore the affected parameters unless a significant fraction of the model is corrupted. Adaptive bit embedding mitigates accuracy degradation, and the scheme achieves 100% tamper detection at parameter granularity, with full recovery possible up to 20% random modification (Gao et al., 2023).
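A much simplified sketch of check-bit tamper localization follows. It uses a single keyed parity bit per 8-bit parameter, which is an assumption made for brevity: it is not the self-mutual construction of the cited scheme, offers no recovery, and misses modifications that flip an even number of high bits.

```python
import numpy as np

def parity(x):
    """Parity of the set bits of each uint8 element."""
    x = x.copy()
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return x & 1

def embed_check_bits(q, key_bits):
    """Overwrite each LSB with a keyed parity of the seven high bits."""
    upper = q & 0xFE
    return upper | (parity(upper) ^ key_bits)

def find_tampered(q, key_bits):
    """Indices whose stored LSB disagrees with the recomputed check bit."""
    upper = q & 0xFE
    expected = parity(upper) ^ key_bits
    return np.flatnonzero((q & 1) != expected)

rng = np.random.default_rng(6)
q = rng.integers(0, 256, size=64, dtype=np.uint8)     # hypothetical 8-bit weights
key_bits = rng.integers(0, 2, size=64, dtype=np.uint8)

marked = embed_check_bits(q, key_bits)
tampered = marked.copy()
tampered[[5, 17, 42]] ^= 0x08                         # flip one high bit each

print(find_tampered(marked, key_bits))
print(find_tampered(tampered, key_bits))
```

Because every parameter carries its own check bit, a mismatch points directly at the modified position, which is what localization at parameter granularity means in practice.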

Chaos-Based Sequence Watermarking

Ownership information is embedded by adding a small-scale chaotic sequence to selected weights, generated by a logistic map with secret parameters. Extraction uses a genetic algorithm to search for the original chaos parameters, enabling robust recovery even after moderate fine-tuning, with binary model performance unaffected (B et al., 18 Dec 2025).
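The embedding step can be sketched with a logistic map $x_{n+1} = r\,x_n(1 - x_n)$. The snippet below only checks a known key by correlation and does not reproduce the genetic-algorithm search mentioned above. The parameter values, the scale `alpha`, and the function names are illustrative assumptions.

```python
import numpy as np

def logistic_sequence(r, x0, n, burn_in=100):
    """Iterate the logistic map and return n values after a burn-in period."""
    x = x0
    out = np.empty(n)
    for i in range(burn_in + n):
        x = r * x * (1.0 - x)
        if i >= burn_in:
            out[i - burn_in] = x
    return out

def embed_chaos(w, r, x0, alpha=0.02):
    """Add a small, zero-mean chaotic perturbation to the weights."""
    seq = logistic_sequence(r, x0, w.size)
    return w + alpha * (seq - seq.mean())

def correlation(w, r, x0):
    """Pearson correlation between the weights and a candidate sequence."""
    seq = logistic_sequence(r, x0, w.size)
    return float(np.corrcoef(w, seq)[0, 1])

rng = np.random.default_rng(4)
w = rng.normal(scale=0.05, size=16384)        # hypothetical selected weights
w_marked = embed_chaos(w, r=3.99, x0=0.4)     # secret parameters (r, x0)

corr_true = correlation(w_marked, 3.99, 0.4)  # owner's key
corr_wrong = correlation(w_marked, 3.97, 0.3) # a different key
print(corr_true, corr_wrong)
```

Sensitivity to initial conditions means a slightly wrong key yields an unrelated sequence, so the correlation test separates the owner's key from other candidates.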

3. Attacks and Countermeasures

White-box watermarking frameworks are targeted by sophisticated attacks, including:

Fine-tuning and retraining: Simple retraining is generally ineffective in erasing robust dynamic or adversarially-hidden watermarks. DICTION, RIGA, and DeepSigns report zero BER or near-zero loss after hundreds of epochs or until main-task accuracy collapses (Rouhani et al., 2018, Bellafqira et al., 2022, Wang et al., 2019).
Pruning and quantization: Robust schemes retain near-zero BER at up to 90–99% parameter sparsification. Score-based and chaos-based methods also preserve key statistics post-pruning (Rouhani et al., 2018, Bellafqira et al., 2022, Wang et al., 2019, B et al., 18 Dec 2025).
Overwriting attacks: When an adversary embeds an alternate watermark in the same subspace, only adversarially-generalized or dynamically-regularized schemes maintain the original watermark recoverability (e.g., RIGA and DICTION), while static schemes often fail (Bellafqira et al., 2022, Wang et al., 2019).
Neural structural obfuscation: Dummy neuron attacks can invalidate extraction by altering tensor shapes or weight statistics without changing I/O behavior. All surveyed frameworks are vulnerable unless extraction procedures are adapted to the modified structure (Yan et al., 2023).
Functionality-equivalence transformations: Permuting or otherwise rearranging neurons leaves the model's function unchanged but misaligns the weights that extraction expects. Neuron alignment–based defenses restore the neuron order prior to extraction, enabling accurate verification even after permutation (Li et al., 2021).
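The permutation case from the list above can be demonstrated on a two-layer network. The alignment step here simply matches rows against a reference copy of the first-layer weights held by the owner. That reference, and the nearest-row matching, are assumptions for illustration and not the alignment method of the cited work.

```python
import numpy as np

def forward(W1, W2, x):
    """Two-layer ReLU network without biases."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def align_to_reference(W1_perm, W2_perm, W1_ref):
    """Undo a hidden-unit permutation by matching rows to a reference."""
    d = ((W1_ref[:, None, :] - W1_perm[None, :, :]) ** 2).sum(axis=-1)
    order = d.argmin(axis=1)           # order[i] = permuted row closest to ref row i
    return W1_perm[order], W2_perm[:, order]

rng = np.random.default_rng(5)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
x = rng.normal(size=8)

# Attack: permute hidden units consistently in both layers.
perm = rng.permutation(16)
W1_p, W2_p = W1[perm], W2[:, perm]
same_function = bool(np.allclose(forward(W1, W2, x), forward(W1_p, W2_p, x)))

# Defense: restore the original order before running extraction.
W1_r, W2_r = align_to_reference(W1_p, W2_p, W1)
print(same_function, np.array_equal(W1_r, W1))
```

After fine-tuning the match would be approximate rather than exact, so a practical alignment needs a tolerance or an assignment solver instead of a plain nearest-row lookup.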

4. Evaluation Metrics and Empirical Protocols

Empirical validation of white-box frameworks leverages several metrics:

Fidelity: Change in test accuracy (ΔAcc), which is expected to be negligible, often below 0.1%.
Capacity: Embeddable message length (e.g., 4–512 bits for DeepSigns/FreeMark, images for RIGA).
Robustness: Maximum tolerated pruning percentage, fine-tuning epochs, and BER (bit error rate) under each attack.
Covertness: Ability of detectors to distinguish clean vs. watermarked models; adversarial hiding drops detection accuracy to near random (Wang et al., 2019).
Stealthiness: Preservation of model weight and activation distributions as measured by statistical or model-based detectors.
Tamper detection and recovery: For fragile schemes, localization and exact repair of the altered bits or parameters is measured.
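Several of the metrics listed above reduce to short computations. The helpers below are a minimal sketch with hypothetical names. Magnitude pruning is used as one common way to realize the pruning attack, not as the protocol of any specific paper.

```python
import numpy as np

def bit_error_rate(extracted, reference):
    """Fraction of watermark bits that differ from the reference."""
    extracted = np.asarray(extracted)
    reference = np.asarray(reference)
    return float(np.mean(extracted != reference))

def magnitude_prune(w, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    w = np.asarray(w, dtype=float).copy()
    k = int(fraction * w.size)
    if k > 0:
        w[np.argsort(np.abs(w))[:k]] = 0.0
    return w

def fidelity_loss(acc_clean, acc_marked):
    """Delta-Acc: accuracy of the clean model minus the watermarked model."""
    return acc_clean - acc_marked

print(bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0]))
print(magnitude_prune([0.1, -0.5, 0.3, -0.05], 0.5))
```

A robustness curve is then obtained by sweeping `fraction`, re-running the scheme's extraction on the pruned weights, and recording the BER at each level.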

Selected empirical results from key frameworks:

Framework	Fidelity Loss (ΔAcc)	Max Pruning (%)	Fine-tune Robustness	Capacity (bits)	Overwriting Robustness	Ref.
DeepSigns	<0.1%	80–99	200 epochs, BER=0	4–128	BER=0 unless collapse	(Rouhani et al., 2018)
DICTION	<0.1%	90–95	150 epochs, BER=0	256	BER=0 always	(Bellafqira et al., 2022)
RIGA	<0.1%	99	BER=0 unless collapse	256+	BER=0 always	(Wang et al., 2019)
FreeMark	0	N/A (no embedding)	BER=0 (fine-tuning)	512	N/A (non-invasive)	(Chen et al., 2024)
SEAL	≈0	99.9 (LoRA)	BER=0, p-value ≪ 1	LoRA dim	High (ambiguity attacks)	(Oh et al., 16 Jan 2025)
Fragile (self-check)	<0.8%	N/A	recovery to >95%	11 bits/param	100% detection	(Gao et al., 2023)
Chaos-based	<1%	Not evaluated	up to 10 epochs	single sequence	GA search deters false positives	(B et al., 18 Dec 2025)

5. Theoretical Bounds and Limitations

Formulations in recent frameworks introduce several theoretical guarantees:

Lipschitz-type bounds (DeepSigns): If parameter drift is bounded, cluster mean drift and projected bit flips are also bounded (Rouhani et al., 2018).
Error-correction guarantees (neuron alignment): Code design ensures unique mapping even with substantial symbol corruption (Li et al., 2021).
False positive rates (chaos-based, FreeMark): Extraction from unmarked models or with forged keys behaves like random guessing, yielding a BER near 0.5 (B et al., 18 Dec 2025, Chen et al., 2024).
Tamper detection probability (fragile self-mutual check): Probability of missed detection is ≤1/512 (Gao et al., 2023).
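The false-positive item above can be quantified under a simple model. Assuming that extraction from an unmarked model yields independent, unbiased bits (an idealization, not a result from the cited papers), the chance of accidentally matching an $n$-bit watermark with at most $e$ errors is a binomial tail, $\sum_{k=0}^{e}\binom{n}{k}/2^{n}$. The function name below is hypothetical.

```python
from fractions import Fraction
from math import comb

def false_positive_probability(n_bits, max_errors):
    """P(at most max_errors mismatches) when every bit is a fair coin flip."""
    favorable = sum(comb(n_bits, k) for k in range(max_errors + 1))
    return Fraction(favorable, 2 ** n_bits)

print(false_positive_probability(8, 0))    # exact match of 8 bits
print(false_positive_probability(4, 1))    # 4 bits, one error tolerated
print(float(false_positive_probability(256, 25)))
```

Longer watermarks drive this probability down quickly, which is why capacity and the BER acceptance threshold are usually chosen together.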

Key limitations span:

Structural attacks: As of Yan et al. (2023), all surveyed frameworks are vulnerable to dummy neuron–based obfuscation unless extraction is explicitly adapted.
Capacity vs. fidelity trade-off: Static projection–based schemes are limited in how many bits they can embed without inducing detectable model distortions or compromising accuracy.
White-box requirement: Most approaches necessitate full parameter or activation access for extraction, precluding deployment in strictly black-box settings.
Key management: Non-invasive and cryptography-inspired schemes require that secret keys be stored securely and never leaked, and they often rely on trusted third-party (TTP) protocols.

6. Developments, Challenges, and Open Questions

The primary ongoing research thrusts in white-box watermarking frameworks are:

Generalization to new architectures: Extending robust methods to transformer, graph, and diffusion models. AquaLoRA (Feng et al., 2024) targets Stable Diffusion, but further work is needed on other diffusion models, GNNs, and related architectures.
Adversarial adaptation and detection: Developing defenses against structural attacks, adaptive attackers, and collusion or model distillation not yet handled by existing schemes (Yan et al., 2023, Chen et al., 2024).
Multi-layer and multi-modal schemes: Combining dynamic activation codes, neuron alignment, and static weight schemes to address broader attack vectors.
Stealth vs. capacity: Optimizing the trade-off between undetectability, information rate, and model fidelity.
Provable security: Extending formal bounds from information theory, cryptography, and robust statistics to quantify resistance to all known and anticipated attack strategies.