
Embedding Watermarks in DNNs

Updated 5 February 2026
  • Embedding watermarks into deep neural networks is a technique for IP protection and ownership verification by embedding secret signals within model parameters, architecture, activations, or outputs.
  • Watermarking methods—including white-box, black-box, structural, and plug-and-play approaches—utilize parameter regularizers, trigger sets, and neuron alignment to maintain fidelity and resist attacks.
  • Empirical studies show that carefully designed watermarking schemes preserve model accuracy (error increase ≪1%) and remain robust against fine-tuning, pruning, and distillation, highlighting their practical utility.

Embedding watermarks into deep neural networks (DNNs) is a critical technique for intellectual property (IP) protection, model provenance, and ownership verification. Watermarks can be embedded in various forms—into model parameters, architectures, activations, or outputs—and must satisfy strict requirements regarding fidelity, robustness, capacity, security, efficiency, and resistance to sophisticated attacks including model piracy, overwriting, and functional equivalence transformations. The field comprises white-box and black-box schemes, parameter-based and structure-based methods, and hybrid and plug-and-play approaches.

1. Watermarking Paradigms: Definitions and Core Objectives

Watermarking a DNN involves embedding a secret, verifiable signal—such as a binary string or unique pattern—within the model such that it can be reliably extracted or detected by the model owner, but is resilient against removal, modification, or fraudulent claims. Key requirements include:

  • Fidelity: The embedding must not degrade the primary task accuracy; test error typically increases by ≪1% when embedding is performed jointly with training (Uchida et al., 2017, Nagai et al., 2018).
  • Robustness: The watermark must survive standard manipulations like fine-tuning, pruning, and even some architectural modifications.
  • Capacity: The system should allow for embedding as many bits as feasible, though limited by model/layer size and method.
  • Security and Stealth: Adversaries without the key cannot detect or remove the watermark without impairing model utility.
  • Efficiency: Embedding and extraction should not introduce undue computational burden or model complexity.

Watermarking approaches can be categorized along these axes:

| Scheme | Embedding Location | Extraction | Remarks |
|---|---|---|---|
| Parameter-based | Weights | White-box | High capacity, stealth |
| Structure-based | Architecture | White-box | Robust to weight changes |
| Activation-based | Neuron activations | White/Black-box | Requires trigger inputs |
| Output-based | Output distributions | Black-box | API query-based |
| Plug-and-Play | Auxiliary nets | Black-box | Model-agnostic |
| Transposed | All parameters | White-box | Human-intuitive images |

2. White-box Watermarking by Parameter and Activation Encoding

Early white-box watermarking schemes embed the watermark into selected weights using a parameter regularizer plus a secret projection ("key") (Uchida et al., 2017, Nagai et al., 2018). This is formalized as:

  • Select a layer and flatten its weights into a fixed-length vector $\mathbf{w} \in \mathbb{R}^M$.
  • Define a random projection matrix $X \in \mathbb{R}^{T \times M}$ (the key), where $T$ is the number of watermark bits.
  • Use embedding loss:

$$E_R(\mathbf{w}) = -\sum_{j=1}^{T} \left[ b_j \log \sigma(X_j \cdot \mathbf{w}) + (1-b_j) \log\left(1-\sigma(X_j \cdot \mathbf{w})\right) \right]$$

where $\sigma$ is the sigmoid function and $b = (b_1, \dots, b_T) \in \{0,1\}^T$ is the watermark.

  • The network is trained to minimize the joint loss $E_\text{total} = E_0 + \lambda E_R$, where $E_0$ is the task loss.

Capacity is limited by the number of parameters $M$ in the chosen layer, with $T \leq M$ maintaining low bit error rates. Extraction is by linear projection with the secret $X$ followed by thresholding.
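The embedding and extraction steps above can be sketched in NumPy; the sizes, learning rate, and random seed here are illustrative, and the toy loop optimizes the regularizer $E_R$ alone rather than the joint loss used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M flattened weights, T watermark bits (T <= M).
M, T = 4096, 64
b = rng.integers(0, 2, size=T)          # secret watermark bits
X = rng.standard_normal((T, M))         # secret key: random projection matrix

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embedding_loss(w, X, b):
    """Binary cross-entropy regularizer E_R(w)."""
    p = sigmoid(X @ w)
    return -np.sum(b * np.log(p) + (1 - b) * np.log(1 - p))

def extract_bits(w, X):
    """Extraction: project with the secret key and threshold at zero."""
    return (X @ w > 0).astype(int)

# Toy "training": gradient descent on E_R alone; in real schemes this
# gradient is added to the task-loss gradient via E_total = E_0 + lambda*E_R.
w = rng.standard_normal(M) * 0.01
for _ in range(200):
    p = sigmoid(X @ w)
    grad = X.T @ (p - b)                # d E_R / d w
    w -= 0.1 * grad / T

assert (extract_bits(w, X) == b).all()  # zero bit-error rate on this toy run
```

The extraction step needs only the secret key $X$ and a threshold, which is why verification is cheap once white-box access to the weights is available.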

Activation encoding approaches (e.g., (Li et al., 2021)) encode watermark bits into neuron activations triggered by specific inputs and regularize the outputs accordingly. These methods can embed watermarks during training from scratch, fine-tuning, or knowledge distillation (Uchida et al., 2017, Nagai et al., 2018).

3. Attacks and Countermeasures: Functionality-Equivalence and Neuron Alignment

A major vulnerability of conventional white-box watermarking is its reliance on fixed neuron/channel ordering. The neuron permutation attack exploits the homogeneous structure of neurons within a layer by permuting their order without changing network functionality, breaking the mapping between the watermark and the physical location of neurons or weights (Li et al., 2021).

To counteract this, a neuron alignment procedure has been proposed (Li et al., 2021):

  • Assign each neuron an error-correcting codeword;
  • Generate robust trigger inputs that selectively activate codeword-specific centers;
  • During verification, collect neuronal responses to triggers, decode observed codes via Hamming or $L_1$ distance to align neurons to their original order;
  • Re-apply standard watermark extraction protocols.

Empirical results show that robust neuron alignment restores watermark verification rates to 75–99% across common architectures and attacks, while unaligned models under attack yield 0% verification rates.
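The alignment step can be sketched as nearest-codeword decoding; this toy uses a simple repetition code as a stand-in for the error-correcting codewords of the actual scheme, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: N neurons, each pre-assigned a codeword from a
# repetition code (large minimum Hamming distance, so a noisy bit in the
# observed response is still decoded correctly).
N = 8
codebook = np.repeat(np.eye(N, dtype=int), 4, axis=1)  # N codewords, 4N bits

# A functionality-preserving attack permutes the neurons.
perm = rng.permutation(N)
observed = codebook[perm].copy()

# Simulate one flipped bit per neuron's response to the trigger inputs.
for row in observed:
    row[rng.integers(len(row))] ^= 1

def align(observed, codebook):
    """Map each observed neuron back to its original index by
    nearest-Hamming-distance decoding."""
    d = (observed[:, None, :] != codebook[None, :, :]).sum(-1)
    return d.argmin(axis=1)

assert (align(observed, codebook) == perm).all()  # permutation recovered
```

Once the permutation is recovered, the neurons can be reordered and the standard white-box extraction protocol re-applied unchanged.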

4. Structural and Plug-and-Play Watermarking Approaches

Structural Watermarking via Channel Pruning

Structural watermarking operates on the model architecture, not only parameters, thereby offering significant robustness to weight-level attacks (Zhao et al., 2021):

  • Allocate a binary watermark by modulating the per-layer channel pruning rates.
  • With a secret key $K$, partition the watermark into $l$-bit segments, and assign each to a selected layer's pruning rate in a specified range $[p_\text{min}, p_\text{max})$.
  • Extraction recovers the watermark from the pruned architecture by measuring remaining channel counts.
  • Robust to fine-tuning, weight pruning, and quantization; primarily vulnerable to architecture-level attacks (e.g., full knowledge distillation to new structure).
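A minimal sketch of the encode/decode mapping described above, with segment length and rate range chosen for illustration (the paper's key-based layer selection is omitted):

```python
# Hypothetical parameters: l-bit segments, pruning rates in [P_MIN, P_MAX).
L_BITS = 4
P_MIN, P_MAX = 0.1, 0.5

def segment_to_rate(bits):
    """Map an l-bit segment to a pruning rate in [P_MIN, P_MAX)."""
    v = int("".join(map(str, bits)), 2)
    return P_MIN + (P_MAX - P_MIN) * v / 2 ** L_BITS

def rate_to_segment(rate):
    """Invert the mapping from an observed pruning rate."""
    v = round((rate - P_MIN) * 2 ** L_BITS / (P_MAX - P_MIN))
    return [int(c) for c in format(v, f"0{L_BITS}b")]

def embed(watermark, channels_per_layer):
    """Prune each layer to the channel count that encodes its segment."""
    segs = [watermark[i:i + L_BITS] for i in range(0, len(watermark), L_BITS)]
    return [round(c * (1 - segment_to_rate(s)))
            for c, s in zip(channels_per_layer, segs)]

def extract(pruned_channels, channels_per_layer):
    """Recover the watermark from surviving channel counts."""
    bits = []
    for kept, c in zip(pruned_channels, channels_per_layer):
        bits += rate_to_segment(1 - kept / c)
    return bits

wm = [1, 0, 1, 1, 0, 1, 0, 0]        # 8-bit watermark across two layers
kept = embed(wm, [512, 256])
assert extract(kept, [512, 256]) == wm
```

Because the watermark lives in channel counts rather than weight values, fine-tuning or re-quantizing the surviving weights leaves the encoding intact.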

Plug-and-Play Watermarking

Plug-and-play methods inject an independently trained proprietary network alongside the target model (Wang et al., 2022):

  • The original model $f$ is frozen; a lightweight auxiliary network $g$ is trained to detect trigger inputs and output high logits for a secret watermark set.
  • The final output is $y = \operatorname{softmax}(f(x) + \alpha g(x))$.
  • Fidelity is strictly preserved, as ff is unmodified.
  • Robust against attacks on ff; removal of gg erases the watermark but requires explicit targeting.
  • Efficient, scalable, and model-agnostic, providing fast embedding and robust verified detection under black-box access.
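The output-combination rule can be sketched with stub models; here $f$, $g$, the trigger id, and $\alpha$ are all toy stand-ins, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def f(x):
    """Frozen original model (stub): fixed logits over 3 classes."""
    return np.array([2.0, 0.5, 0.1])

def g(x, trigger_id=42):
    """Auxiliary watermark net (stub): large logit on class 2, but only
    for the secret trigger input; zero logits otherwise."""
    if x == trigger_id:
        return np.array([0.0, 0.0, 10.0])
    return np.zeros(3)

ALPHA = 1.0

def watermarked(x):
    # y = softmax(f(x) + alpha * g(x))
    return softmax(f(x) + ALPHA * g(x))

assert watermarked(0).argmax() == 0    # normal input: f's prediction wins
assert watermarked(42).argmax() == 2   # trigger input: watermark response
```

Since $g$ outputs zero logits on ordinary inputs, the combined model's predictions on the main task are bit-identical to $f$'s, which is why fidelity is strictly preserved.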

5. Black-box Watermarking and Output Manipulation

Black-box watermarking for DNNs involves embedding signals that can be probed via remote API access rather than weight inspection (Chen et al., 2019, Adi et al., 2018):

  • Trigger-set/backdoor methods: Fine-tune with a set of unique trigger inputs mapped to designated labels.
  • Output distribution perturbations: Apply soft-label perturbations throughout the output probability space, making removal via distillation or pruning difficult (Chien et al., 2022).
  • Behavioral mapping: Use adversarially generated samples as keys, mapped to bits via an output clustering scheme, and decode the owner's signature by querying the model (Chen et al., 2019).
  • Function-coupled triggers: Fuse in-distribution samples to produce triggers sharing representation with normal data, while dispersing watermark loss over masked weights for redundancy (Wen et al., 2023).

Empirical studies show that appropriately designed black-box schemes can achieve 100% ownership verification under aggressive fine-tuning and pruning attacks (Wen et al., 2023), while capacity and stealth depend on trigger set design and distribution overlap.
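Trigger-set verification under black-box access reduces to a thresholded match rate over API queries; the models, trigger set, and threshold below are illustrative:

```python
# Hypothetical sketch of black-box ownership verification: query the
# suspect model on the secret trigger set and compare the match rate
# against a decision threshold.
def verify(query_model, trigger_set, threshold=0.9):
    """trigger_set: list of (input, expected_label) pairs."""
    hits = sum(query_model(x) == y for x, y in trigger_set)
    return hits / len(trigger_set) >= threshold

# Toy suspect model fine-tuned so triggers 0..9 map to label 7.
stolen = lambda x: 7 if x < 10 else x % 3
triggers = [(i, 7) for i in range(10)]
assert verify(stolen, triggers)            # all 10 triggers match

independent = lambda x: x % 3              # an unrelated model
assert not verify(independent, triggers)   # match rate 0/10, below threshold
```

The threshold trades off false claims against robustness: an attacked model that forgets some triggers under fine-tuning can still be flagged as long as the surviving match rate clears it.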

6. Robustness, Capacity, and Practical Deployment Considerations

Embedding robustness is critically evaluated under:

  • Fine-tuning and pruning: White-box, plug-and-play, function-coupled, and structural schemes can tolerate typical update rates and pruning levels (often up to 65–80%) with negligible bit error rates when error correction is used (Uchida et al., 2017, Li et al., 2021, Wen et al., 2023, Zhao et al., 2021).
  • Overwriting and piracy attacks: Null embedding and function-coupled schemes ensure that overwriting or piracy attempts drastically impair main accuracy, as normal task performance is now tightly coupled to the presence of the correct watermark pattern (Li et al., 2019).
  • Distillation attacks: Output distribution perturbation schemes, such as customized soft-label perturbation with paired detectors, are notably resilient, achieving >90% watermark accuracy post-distillation and outperforming feature- and backdoor-based methods (Chien et al., 2022).

Capacity is method-dependent: parameter-based approaches can embed hundreds to thousands of bits per large layer; soft-label perturbation schemes are limited to roughly the number of output classes; and plug-and-play and visible watermarking (e.g., ClearMark (Krauß et al., 2023)) can encode bitmaps with thousands of bits.

For deployment, practical schemes emphasize low embedding overhead, automation, and compatibility with various DNN architectures. White-box extraction often remains necessary for high-capacity or visible watermarks, though black-box APIs are used in behavioral and trigger-set methods.

7. Open Problems and Future Directions

Current limitations and open problems include:

  • Vulnerability to architectural transformations (e.g., knowledge distillation or architectural surgery) for structural and parameter-based watermarks.
  • Overwriting/forgery prevention beyond timestamp-based protocols, especially for widely shared backdoor formats.
  • Efficient, high-capacity watermarking that is robust to all common and advanced removal attacks (including adaptive attacks).
  • Stealthiness and undetectability under white-box inspection and functionality analysis.
  • Integration with cryptographic primitives—digital signatures, consensus time-stamping, and zero-knowledge proofs—for robust, non-repudiable ownership claims (Li et al., 2021).
  • Dataset-agnostic and architecture-agnostic universal watermarking, including human-intuitive visible watermarks that facilitate legal assessment and broad forensic use (Krauß et al., 2023).

Continued research into hybrid encoding/decoding strategies, multi-layer/multi-modal alignment, and decentralized verification protocols is ongoing to further raise the resilience and practicality of watermarking for protecting valuable DNN assets.
