EditMark: Stealth Watermarking for LLMs
- EditMark is a watermarking framework for large language models that embeds multi-bit watermarks using model editing instead of re-training.
- It employs an adaptive, multi-round editing strategy with regularization and noise matrix injection to ensure high fidelity and robustness of the watermark.
- The framework uses permutation-based encoding to map answer orders to bits, supporting robust ownership verification and provenance tracing.
EditMark is a watermarking framework for LLMs that leverages model editing, rather than conventional data-label modification or model re-training, to embed robust multi-bit watermarks. The method targets ownership verification and unauthorized-use tracing for both proprietary and open-source LLMs, offering a training-free, performance-preserving, and stealthy watermarking protocol. EditMark operates by aligning multiple-answer (MA) questions and answer templates with the watermark encoding, then updating a minimal subset of model parameters via a closed-form editing strategy coupled with algorithmic refinements that ensure stability and robustness.
1. Watermark Embedding via Model Editing
EditMark utilizes a model editing mechanism to encode a multi-bit watermark into an LLM without additional training. The paradigm is predicated on the fact that certain prompts (MA questions) have multiple semantically correct answers, and that the ordering of these answers can be deterministically mapped to watermark bits. The embedding procedure comprises:
- Generating a collection of MA questions (e.g., math inequalities) whose equally valid answers can be ordered, with each ordering corresponding to a specific group of bits through a permutation-based encoding scheme.
- For each QA pair, computing a target output vector aligned with the desired watermark segment using a permutation mapping function. The mapping defines a bijection between answer sequences and bitstrings.
- Directly updating the MLP weights of the LLM by computing a closed-form editing perturbation
  $\Delta = R K^\top P \,(K K^\top P + I)^{-1}$,
  where $R = V - WK$ is the output residual relative to the target $V$, $K$ is the key representation for the edited knowledge, $P$ is the projection onto the null space of unaffected knowledge, and $I$ is the identity matrix.
This approach does not require gradient-based re-training or backpropagation over large datasets, making it computationally efficient while leaving the model's remaining knowledge largely intact.
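To make the update concrete, here is a minimal NumPy sketch of the closed-form edit. It is illustrative rather than the authors' implementation: the function name, matrix shapes, regularizer weight, and the SVD-based construction of the null-space projection are assumptions.

```python
import numpy as np

def closed_form_edit(W, K_edit, V_target, K_preserve, reg=1.0):
    """Sketch of a null-space-constrained closed-form MLP edit.

    W          : (d_out, d_in) weights of the edited layer
    K_edit     : (d_in, n_e)   key representations of the edited QA pairs
    V_target   : (d_out, n_e)  target outputs encoding the watermark bits
    K_preserve : (d_in, n_p)   keys of knowledge that must stay unaffected
    """
    # P projects onto the null space of the preserved keys, so the
    # perturbation (approximately) leaves their outputs unchanged.
    U, S, _ = np.linalg.svd(K_preserve @ K_preserve.T)
    null_cols = U[:, S < 1e-6 * S.max()]
    P = null_cols @ null_cols.T

    # Output residual between the desired and current responses.
    R = V_target - W @ K_edit

    # Closed-form perturbation: Delta = R K^T P (K K^T P + I)^{-1}.
    d_in = W.shape[1]
    Delta = R @ K_edit.T @ P @ np.linalg.inv(
        K_edit @ K_edit.T @ P + reg * np.eye(d_in))
    return W + Delta
```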
2. Adaptive Multi-Round Stable Editing
Rather than applying a single perturbation to the model parameters, EditMark incorporates an adaptive multi-round stable editing strategy to ensure convergence, stability, and low interference:
- The algorithm iteratively computes the editing residual $r_t = \lVert W_t K - V \rVert$ after each update, where $W_t$ represents the layer weights at round $t$ and $V$ is the target output.
- If $r_t$ remains above a specified threshold $\tau$, additional editing rounds are executed (see the sketch after this list).
- A regularization mechanism checks whether a single weight update exceeds a set fraction of the hidden-state norm; if it does, the update is downscaled to prevent over-fitting or unwanted interference (edit entanglement) among similar knowledge points.
- This ensures both high-fidelity embedding of the watermark and minimal adverse effect on the LLM’s original outputs.
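The loop described above might look as follows in a hedged NumPy sketch; the threshold `tau`, the norm cap `max_frac`, and the round limit are assumed hyperparameters, and the update reuses the closed-form expression from Section 1 (with the projection `P` precomputed).

```python
import numpy as np

def multi_round_edit(W, K, V, P, tau=1e-3, max_frac=0.1, max_rounds=10, reg=1.0):
    """Sketch of adaptive multi-round stable editing (illustrative only)."""
    d_in = W.shape[1]
    for t in range(max_rounds):
        R = V - W @ K                        # residual r_t = ||W_t K - V||
        if np.linalg.norm(R) < tau:          # converged: watermark embedded
            break
        Delta = R @ K.T @ P @ np.linalg.inv(K @ K.T @ P + reg * np.eye(d_in))
        # Regularization: cap the update relative to the key norm to
        # limit interference with similar, neighboring knowledge.
        limit = max_frac * np.linalg.norm(K)
        if np.linalg.norm(Delta) > limit:
            Delta *= limit / np.linalg.norm(Delta)
        W = W + Delta
    return W
```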
3. Robustness through Noise Matrix Injection
To protect against adversarial attacks including pruning, quantization, and noise injection, EditMark integrates a noise robustness feature:
- During select editing rounds, the key representations are perturbed as $K' = K + E$, with $E$ sampled from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$ (see the sketch after this list).
- This simulates real-world attacks by introducing distributional shifts in the target representation, thus encouraging the embedding process to converge to a watermark that is resistant to random or structured model alterations.
- Empirical analysis indicates that the parameter differences induced by such attacks are well-approximated by Gaussian noise, justifying this strategy.
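A minimal sketch of the perturbation step is shown below; the noise scale `sigma` is an assumed hyperparameter and the function name is illustrative. During embedding, selected rounds would pass `perturb_keys(K)` instead of `K` into the editing update, so the converged edit tolerates comparable parameter-level noise.

```python
import numpy as np

def perturb_keys(K, sigma=0.01, rng=None):
    """Sketch: perturb key representations as K' = K + E, E ~ N(0, sigma^2),
    simulating the Gaussian-like shifts caused by pruning/quantization/noise."""
    if rng is None:
        rng = np.random.default_rng()
    return K + rng.normal(0.0, sigma, size=K.shape)
```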
4. Watermark Encoding and Decoding Formalism
The watermark is embedded and extracted using permutation theory, mapping answer orders to bits and vice versa. For a bit vector $b$ and the integer watermark $w$ given by its decimal value:
- The encoding follows the factorial number system (Lehmer code): at step $i = 1, \dots, n$, compute $d_i = \lfloor w / (n-i)! \rfloor$ and select the $d_i$-th remaining answer, with recursive updates $w \leftarrow w \bmod (n-i)!$ to $w$ and to the list of remaining answers.
- Decoding uses the inverse map $w = \sum_{i=1}^{n} c_i \,(n-i)!$, where $c_i$ counts the answers after position $i$ that precede the $i$-th answer in canonical order, followed by conversion from decimal to binary to recover the embedded watermark.
These operations ensure full invertibility between the answer order and the watermark bitstring, supporting robust multi-bit watermarking.
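The encode/decode pair can be sketched with the factorial number system. The following Python functions are illustrative rather than the paper's code; the names and the worked example are assumptions. Note that $n$ answers carry $\lfloor \log_2 n! \rfloor$ watermark bits (e.g., 4 answers carry 4 bits, since $2^4 \le 4! = 24$).

```python
from math import factorial

def encode(w, n):
    """Map an integer watermark w in [0, n!) to an answer order
    (a permutation of n answer indices) via the factorial number system."""
    remaining = list(range(n))
    order = []
    for i in range(n, 0, -1):
        d, w = divmod(w, factorial(i - 1))   # d_i = floor(w / (n-i)!)
        order.append(remaining.pop(d))       # pick the d_i-th unused answer
    return order

def decode(order):
    """Invert encode(): at each position, count later answers that are
    smaller (c_i), and accumulate w = sum_i c_i * (n-i)!."""
    n = len(order)
    w = 0
    for i, a in enumerate(order):
        c = sum(1 for b in order[i + 1:] if b < a)
        w += c * factorial(n - 1 - i)
    return w

# Round trip for a 4-bit watermark segment over a 4-answer MA question.
bits = "1011"
w = int(bits, 2)                 # 11, valid since 11 < 4! = 24
order = encode(w, 4)             # [1, 3, 2, 0]: the answer order to embed
assert format(decode(order), "04b") == bits
```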
5. Experimental Evaluation
EditMark was benchmarked on GPT2-XL, GPT-J-6B, LLaMA-3-8B, Baichuan-7B, and Qwen-7B. Key observations include:
- Efficiency: Embedding a 32-bit watermark requires less than 20 seconds, a dramatic reduction compared to fine-tuning (6875 seconds reported).
- Extraction Success Rate: 100% ESR is achieved for 32, 64, and 128-bit watermarks across multiple models.
- Fidelity: No observable impact on standard NLP evaluation tasks (MMLU, BLiMP, TruthfulQA, GLUE), maintaining parity with the original unwatermarked models.
- Robustness: The watermark persists under various attacks, including fine-tuning, quantization (Int-8: 100% ESR), random noise injection, pruning, and targeted model editing. It can be weakened only by highly informed adaptive attacks with precise knowledge of the templates and edit locations; in realistic scenarios it remains robust.
6. Applications and Limitations
EditMark supports several applications:
- Copyright protection for both open-source and proprietary LLMs via embedded digital signatures.
- Provenance tracing and verification, preventing unauthorized resale or distribution.
- Model integrity assurance in production scenarios involving text generation and sensitive data.
Noted limitations include:
- Susceptibility to adaptive attacks when the adversary possesses detailed knowledge of the MA question templates and edited layers, though the paper treats such scenarios as stronger than its typical threat model.
- Approximation errors in knowledge localization may lead to minor, inadvertent model edits, though extensive experiments demonstrate minimal impact.
- Edit entanglement, mitigated by careful construction of QA pairs and editing locations, but not entirely eliminated for highly similar or overlapping knowledge domains.
7. Theoretical and Technical Foundations
EditMark formalizes LLM watermarking as a constrained model editing problem, embedding information in a stealthy and parameter-efficient manner. The algorithmic framework employs:
- Closed-form solutions for localized MLP weight updates that respect knowledge preservation constraints.
- Iterative refinement and regularization to stabilize the embedding.
- Permutation encoding theory to maximize answer diversity and bit payload.
Empirical evidence supports negligible impact on LLM performance, maximum extraction success, and resilience against standard and adaptive attacks.
EditMark’s codebase and empirical protocols are available with the underlying publication, enabling reproduction and broader adoption within the LLM research and deployment community.