Anti-RAG-Thief Watermarking

Updated 25 February 2026

Anti-RAG-Thief watermarking is a set of techniques that embed covert, traceable signals into datasets to detect unauthorized use in Retrieval-Augmented Generation systems.
It employs complementary methods—lexical, semantic, and multimodal watermarking—to ensure robust detection even after adversarial paraphrasing and content alterations.
Empirical evaluations demonstrate high query efficiency and detection accuracy (e.g., ROC-AUC = 1.0 with minimal queries) while preserving system fidelity and content utility.

Anti-RAG-Thief Watermarking comprises a set of information security technologies and methodologies developed to address unauthorized data appropriation and intellectual property leakage in Retrieval-Augmented Generation (RAG) systems. These techniques are expressly engineered to survive in the adversarial black-box scenario of RAG, enabling robust, statistically principled tracing and proof of dataset misuse without requiring privileged model internals or access to the retrieval index. Anti-RAG-Thief watermarking encompasses both textual and multimodal (image-text) settings, employing lexical, semantic, structural, and spatial modifications to inject and detect distinguishable, covert signatures within protected content.

1. Threat Model and Problem Formulation

Modern RAG systems combine LLMs and external knowledge corpora, returning responses that integrate retrieved documents with generative capabilities. This architecture heightens risk for IP owners, as attackers can steal proprietary datasets, including text and images, and incorporate them into unauthorized RAG deployments. The central Anti-RAG-Thief challenge is to enable data owners to perform black-box membership inference: to confidently determine, with controlled false positive rate (FPR), whether suspect RAG responses contain evidence of unauthorized use of their protected content (Liu et al., 15 Feb 2025, Liu et al., 9 Oct 2025, Jovanović et al., 2024, Lv et al., 9 Jan 2025, Chen et al., 10 Jun 2025).

The canonical attack scenarios include:

Full or partial exfiltration of a proprietary text or image dataset into a malicious RAG knowledge base.
Post-processing by the adversary's LLM to paraphrase, summarize, or otherwise obscure retrieved content.
Knowledge expansion or dilution, where the attacker increases unrelated corpus content to reduce watermark coverage.
Black-box auditing, where the defender probes the suspect RAG through API queries only.

A defense must therefore satisfy several criteria: effectiveness under adversarial paraphrasing and knowledge augmentation, high query efficiency, provable error bounds (controlling both FPR and Type II error), robustness to retrieval and generation noise, and negligible degradation in the protected system's utility (Liu et al., 15 Feb 2025, Jovanović et al., 2024).

2. Watermark Embedding Methodologies

Anti-RAG-Thief defenses embed information-theoretically or statistically traceable signals into proprietary datasets through techniques tailored to RAG’s retrieval+generation workflow. Principal approaches are:

a. Lexical-Level (“Red-Green” Token) Watermarking

This method modulates token-generation probabilities during synthetic document creation:

The vocabulary $\mathcal{V}$ is partitioned into green ($G$) and red ($R$) tokens, with $|G| = \gamma |\mathcal{V}|$.
At each generation step $t$, a logit bias $\delta > 0$ is added to green tokens:

$\hat \ell_t[k] = \ell_t[k] + \delta\,\mathbf{1}_{k\in G},\quad k\in\mathcal{V}$

The resulting statistical surplus of green tokens constitutes a provable watermark signature robust to moderate paraphrase and chunking (Liu et al., 15 Feb 2025, Jovanović et al., 2024, Fernandez et al., 18 Dec 2025).
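The biasing rule above can be illustrated with a minimal sketch. It assumes a single fixed green list derived from a secret key by a seeded shuffle; the helper names (`green_list`, `bias_logits`), the toy vocabulary of 10 tokens, and the uniform logits are all hypothetical and are not taken from the cited papers.

```python
import hashlib
import math
import random


def green_list(vocab_size: int, gamma: float, key: bytes) -> set[int]:
    """Pick a green subset G with |G| = round(gamma * |V|), seeded by a secret key (assumed scheme)."""
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return set(ids[: round(gamma * vocab_size)])


def bias_logits(logits: list[float], green: set[int], delta: float) -> list[float]:
    """Add delta to the logit of every green token and leave red tokens unchanged."""
    return [l + delta if k in green else l for k, l in enumerate(logits)]


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: uniform logits over a 10-token vocabulary, gamma = 0.5, delta = 2.0.
green = green_list(vocab_size=10, gamma=0.5, key=b"demo-key")
probs = softmax(bias_logits([0.0] * 10, green, delta=2.0))
green_mass = sum(probs[k] for k in green)
print(f"probability mass on green tokens: {green_mass:.3f}")
```

With uniform logits the green tokens would carry half the probability mass without the bias; after adding $\delta = 2$ they carry $e^2/(e^2+1)$ of it, which is the surplus a detector later tests for.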

b. Semantic (“Canary” or Knowledge) Watermarking

Watermarks can be encoded as synthetic “canary” documents or inserted facts:

Synthetic canary documents $d_s$ are generated to closely mimic real style and content, embedding fictional entities or plausible but fake facts unobtrusively.
In dual-layer watermarking, a small set of watermarked facts $W_D$ is injected, maximizing both content coherence with the base document set and distinctiveness from external corpora (Liu et al., 9 Oct 2025, Lv et al., 9 Jan 2025).
Domain-structured knowledge watermarking, as in RAG-WM, leverages secret-keyed entity-relation tuples, generating natural language snippets that encode relations difficult to forge or strip without damaging corpus utility (Lv et al., 9 Jan 2025).
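As a rough illustration of secret-keyed entity-relation tuples, the sketch below derives tuples deterministically from a key with HMAC and renders them as short sentences. This is not the RAG-WM algorithm; the selection rule, the function names, and the fictional entities and relations are assumptions made only for this example.

```python
import hashlib
import hmac


def keyed_index(key: bytes, label: str, modulus: int) -> int:
    """Map (key, label) to a pseudorandom index in [0, modulus)."""
    digest = hmac.new(key, label.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % modulus


def watermark_tuples(key: bytes, entities: list[str], relations: list[str], count: int):
    """Derive `count` (head, relation, tail) tuples that only the key holder can regenerate."""
    tuples = []
    for i in range(count):
        head = entities[keyed_index(key, f"head:{i}", len(entities))]
        rel = relations[keyed_index(key, f"rel:{i}", len(relations))]
        others = [e for e in entities if e != head]  # guarantees head != tail
        tail = others[keyed_index(key, f"tail:{i}", len(others))]
        tuples.append((head, rel, tail))
    return tuples


def render(t: tuple[str, str, str]) -> str:
    """Turn a tuple into a natural-language snippet for injection."""
    head, rel, tail = t
    return f"{head} {rel} {tail}."


# Hypothetical fictional entities and relations.
ENTITIES = ["Veltrax Institute", "Orlin Tesh", "the Marrow Codex", "Halden Quay"]
RELATIONS = ["was founded by", "is archived at", "collaborates with"]

for t in watermark_tuples(b"owner-secret", ENTITIES, RELATIONS, 3):
    print(render(t))
```

Because the tuples are a deterministic function of the key, the owner can regenerate them at audit time to build probe queries, while a party without the key cannot tell in advance which entity pairs to look for.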

c. Post-hoc Watermarking via LLM Paraphrasing

Existing documents are rephrased using LLMs under generation-time watermarking constraints. Approaches include:

Gumbel-Max sampling with deterministic noise from a pseudorandom function.
Beam search over watermarked distributions to optimize the fidelity-detectability Pareto frontier.

This methodology enables watermarking when generation-time access to the production model is unavailable and supports arbitrary document lengths through context-aware chunking (Fernandez et al., 18 Dec 2025).
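The first approach, Gumbel-Max sampling with deterministic noise, can be sketched as follows. The sketch assumes the noise is derived with HMAC-SHA256 from a secret key, a window of preceding token ids, and the candidate token id; that construction and the names `prf_uniform` and `gumbel_max_sample` are illustrative choices, not the setup of the cited work.

```python
import hashlib
import hmac
import math


def prf_uniform(key: bytes, context: tuple[int, ...], token_id: int) -> float:
    """Deterministic pseudorandom number strictly inside (0, 1) for (key, context, token)."""
    msg = ",".join(map(str, context)).encode() + b"|" + str(token_id).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (n + 0.5) / float(1 << 52)


def gumbel_max_sample(probs: list[float], key: bytes, context: tuple[int, ...]) -> int:
    """Return argmax_k [log p_k + g_k], with Gumbel noise g_k = -log(-log u_k) taken from the PRF."""
    best, best_score = -1, -math.inf
    for k, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability tokens can never be selected
        u = prf_uniform(key, context, k)
        score = math.log(p) - math.log(-math.log(u))
        if score > best_score:
            best, best_score = k, score
    return best


# The same key and context always select the same token.
token = gumbel_max_sample([0.2, 0.3, 0.5], key=b"owner-secret", context=(17, 42))
print(token)
```

For truly random noise the Gumbel-Max trick samples exactly from `probs`, so the text distribution is not distorted in expectation; the watermark lies in the fact that the key holder can recompute the noise and check that the chosen tokens align with it.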

d. Multimodal (Image) Watermarking

For image content, especially in Multimodal RAG-as-a-Service, techniques include:

Synthetic images embedding semantic triggers such as acronyms ($AQUA_{\text{acronym}}$), rendered within scene context to ensure retrieval and OCR propagation into text responses.
Spatial relationship cues ($AQUA_{\text{spatial}}$): arranging objects in rare spatial configurations detectable via targeted queries.

These ensure the image watermark survives both image retrieval and subsequent VLM-generated textual summarization (Chen et al., 10 Jun 2025).

3. Detection and Statistical Hypothesis Testing

Detection operates in a black-box regime by issuing carefully crafted queries and evaluating the returned responses for evidence of watermark presence:

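For lexical marks, count the number of green tokens $n_G$ in the concatenated responses $y$ and compute the z-statistic:

$z = \frac{n_G - \gamma T}{\sqrt{T\,\gamma\,(1-\gamma)}}$

where $T = |y|$ is the number of scored tokens and $\gamma = |G|/|\mathcal{V}|$ is the green-list fraction (Liu et al., 15 Feb 2025, Jovanović et al., 2024, Fernandez et al., 18 Dec 2025).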

For semantic marks or facts, compute the fraction of queries eliciting a watermarked fact in responses, with the test statistic compared against a threshold determined by the reference (non-watermarked) hit rate (Liu et al., 9 Oct 2025, Lv et al., 9 Jan 2025).
For images, evaluate Verification Success Rate (VSR) or Conditional Generation Success Rate (CGSR), and conduct hypothesis testing via reference distribution contrast (typically Welch's t-test) (Chen et al., 10 Jun 2025).
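For the lexical case, the standard green-token z-score compares the observed green count with its expectation $\gamma T$ under no watermark, scaled by the binomial standard deviation $\sqrt{T\gamma(1-\gamma)}$. The following sketch computes it together with a one-sided p-value from the normal approximation; the counts used are made-up illustrative numbers, not results from the cited papers.

```python
import math


def green_z_score(green_count: int, total: int, gamma: float) -> float:
    """z = (n_G - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def one_sided_p_value(z: float) -> float:
    """P[Z >= z] for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Illustrative numbers: 140 green tokens out of 200 scored, with gamma = 0.5.
z = green_z_score(green_count=140, total=200, gamma=0.5)
p = one_sided_p_value(z)
print(f"z = {z:.2f}, p = {p:.1e}")
```

Here the expected green count without a watermark is 100 with standard deviation about 7.07, so 140 green tokens lies more than five standard deviations above the null expectation.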

The null hypothesis $H_0$ is the absence of watermark (green token rate = $\gamma$, no fact leaks, etc.); rejection thresholds are set to yield a controlled FPR (e.g., rejecting when the z-statistic exceeds the critical value for a chosen significance level $\alpha$). Combined decision rules (OR of lexical and semantic detection) further increase resilience (Liu et al., 9 Oct 2025).
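One way to make the semantic test and the OR rule concrete is sketched below. It assumes an exact binomial tail test against a known reference hit rate `p0`, and splits the significance level between the two detectors with a Bonferroni correction so that the combined rule keeps its overall FPR at or below `alpha`; both choices are assumptions of this sketch, not the procedure of the cited papers.

```python
import math


def binomial_tail(hits: int, n: int, p0: float) -> float:
    """P[X >= hits] for X ~ Binomial(n, p0): the p-value of the observed hit count."""
    return sum(
        math.comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(hits, n + 1)
    )


def combined_decision(lexical_p: float, semantic_p: float, alpha: float) -> bool:
    """Flag misuse if either detector rejects H0 at level alpha/2 (Bonferroni split)."""
    return lexical_p < alpha / 2 or semantic_p < alpha / 2


# Illustrative numbers: 6 of 30 probes return a watermarked fact,
# while a clean reference system does so 2% of the time.
semantic_p = binomial_tail(hits=6, n=30, p0=0.02)
flagged = combined_decision(lexical_p=0.4, semantic_p=semantic_p, alpha=0.01)
print(f"semantic p-value = {semantic_p:.1e}, flagged = {flagged}")
```

In this example the lexical detector alone would not reject, for instance because heavy paraphrasing removed the token-level signal, but the semantic detector does, which is the situation the OR rule is meant to cover.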

4. Query Efficiency, Robustness, and Stealth

Anti-RAG-Thief watermarking places strong emphasis on query efficiency (minimal queries per detection), adversarial robustness, and minimal impact on system fidelity:

Empirical evaluation shows ROC-AUC = 1.0 and TPR@1% FPR = 1.0 with as few as 12-14 queries for text watermarking on NFCorpus (Liu et al., 15 Feb 2025), and 30 queries to confidently detect IP theft in RAG-WM (Lv et al., 9 Jan 2025).
Stealth is maintained by ensuring $R$ 0 and achieving negligible distributional distortion (e.g., BLEU = 0.997, MAUVE = 0.999), with detection provably robust to moderate paraphrasing and retrieval/indexing noise (Liu et al., 15 Feb 2025, Jovanović et al., 2024).
Robustness under paraphrase and expansion attacks is empirically established: watermarked verification numbers (WSN) remain far above random threshold under paraphrasing, unrelated sentence removal, and knowledge insertion (Lv et al., 9 Jan 2025, Liu et al., 9 Oct 2025).
For visual watermarks, AQUA achieves high CGSR and VSR (often $R$ 1), with robust performance even under common image transformations (Chen et al., 10 Jun 2025).

Method/Domain	Query Count for Detection	FPR @ Threshold	Robustness to Paraphrase	Stealth/Fidelity
Canary (text) (Liu et al., 15 Feb 2025)	12–14	$R$ 2	Moderate	BLEU ≥ 0.997
RAG-WM (Lv et al., 9 Jan 2025)	30	$R$ 3	High	CDPA ≈ 97.9%
AQUA (image) (Chen et al., 10 Jun 2025)	<30	Low	Transform-robust	Perceptually aligned

Higher watermark strength ($\delta$) increases detection rates but may reduce semantic fidelity, requiring per-domain tuning (Liu et al., 9 Oct 2025, Fernandez et al., 18 Dec 2025). Adversarial filtering or deep paraphrasing can reduce detection power, but multi-layer or hybrid schemes maintain high accuracy.

5. System Architectures and End-to-End Pipelines

Canonical Anti-RAG-Thief pipelines follow a two-phase structure:

a. Pre-Release Protection

Canary or knowledge fact synthesis: generation of synthetic documents or fact tuples via LLMs, attribute extractors, secret key indexing, and iterative in-the-loop validation.
Watermark injection: incorporation of synthetic watermarked elements into the target dataset (text corpus or image database) with minimal perturbation.
Quality validation: empirical measurement of retrieval, utility, and stealth metrics before release (Liu et al., 15 Feb 2025, Lv et al., 9 Jan 2025, Chen et al., 10 Jun 2025).

b. Post-Release Audit

Query generation: construction of detection-oriented queries targeting canary documents, watermark facts, entity relationships, spatial triggers, or acronymed images.
Black-box querying: batch invocation of the suspect RAG system, collection of retrieved/generative outputs.
Detection and statistical testing: application of z-statistics, binomial/Gamma tests, or reference p-value computations to decide membership or misuse, with per-method confidence calibration (Jovanović et al., 2024, Liu et al., 9 Oct 2025, Fernandez et al., 18 Dec 2025, Chen et al., 10 Jun 2025).
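The post-release audit steps can be pictured as a small loop: send probes, pool the responses, and apply a statistical test. The sketch below covers only the lexical z-test and relies on several assumptions: `query_rag` stands in for whatever API the suspect system exposes, responses are tokenized on whitespace, and `is_green` is a hypothetical membership check for the owner's green list.

```python
import math
from typing import Callable, Iterable


def audit(
    query_rag: Callable[[str], str],
    probes: Iterable[str],
    is_green: Callable[[str], bool],
    gamma: float,
    z_threshold: float,
) -> dict:
    """Query the suspect system, pool response tokens, and run the green-token z-test."""
    tokens: list[str] = []
    for probe in probes:
        tokens.extend(query_rag(probe).split())  # whitespace tokenization (assumption)
    total = len(tokens)
    if total == 0:
        return {"tokens": 0, "z": 0.0, "flagged": False}
    green = sum(1 for t in tokens if is_green(t))
    z = (green - gamma * total) / math.sqrt(total * gamma * (1 - gamma))
    return {"tokens": total, "z": z, "flagged": z > z_threshold}


# Toy stand-ins for a suspect system and a clean system.
GREEN_WORDS = {"alpha", "beta"}
suspect = lambda q: "alpha beta alpha beta"
clean = lambda q: "gamma delta alpha omega"
probes = ["probe one", "probe two", "probe three"]

suspect_report = audit(suspect, probes, GREEN_WORDS.__contains__, gamma=0.5, z_threshold=3.0)
clean_report = audit(clean, probes, GREEN_WORDS.__contains__, gamma=0.5, z_threshold=3.0)
print(suspect_report, clean_report)
```

A real audit would use the watermarking tokenizer and a threshold calibrated to the target FPR. The sketch is only meant to show that the auditor needs nothing beyond query access to the suspect system.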

Interrogator-detective frameworks can further automate query generation and adaptive detection, improving coverage and adversary-resilience (Liu et al., 9 Oct 2025).

6. Limitations, Evasion, and Research Directions

Current Anti-RAG-Thief watermarking carries important limitations:

Heavy, human-level or ML-powered paraphrasing can attenuate or destroy statistical signals; quantification and enhancement of watermark propagation under such adversarial rewriting are ongoing research targets (Jovanović et al., 2024, Liu et al., 15 Feb 2025, Liu et al., 9 Oct 2025, Fernandez et al., 18 Dec 2025).
Adaptive adversaries could selectively filter outliers or detect watermark-induced patterns, especially in low-entropy, domain-specific corpora.
Scaling detection protocols to multimillion-document indices and highly redundant real-world content remains an active area of investigation.
Quality-security trade-offs: increasing watermark strength or payload may degrade style or semantic depth, especially in creative domains (Liu et al., 9 Oct 2025, Fernandez et al., 18 Dec 2025).

Future work aims at:

Multi-bit/multi-class and hybrid schemes encoding richer ownership proofs.
Dynamic watermark updates and continual learning in detection pipelines.
Integration with retriever-aware and cross-modal architectures, including robust watermarking for image-to-text generative pathways (Chen et al., 10 Jun 2025).
Legal and ethical standards for watermarking strength, privacy protections, and anti-fingerprinting assurances.

A plausible implication is that continual arms-race dynamics between watermark schemes and adversarial removal or evasion techniques will necessitate regular recommissioning of watermark keys, adaptive embedding strategies, and increasingly sophisticated statistical tests.

7. Comparative Evaluation and Domain-Specific Considerations

Comparative benchmarking indicates a Pareto frontier in the fidelity–detectability space:

For open-text, Gumbel-Max watermarking under nucleus sampling is empirically optimal (Fernandez et al., 18 Dec 2025).
For code, smaller models at higher temperatures yield stronger watermark signals, but at the cost of pass@1 correctness.
Lexical schemes offer high query efficiency and are robust under moderate rewriting, but semantically targeted watermarks (fake facts, entity-relation tuples) can expose misuse even after heavy paraphrase, as long as fact propagation is preserved (Liu et al., 9 Oct 2025, Lv et al., 9 Jan 2025).
Multimodal approaches, such as AQUA, uniquely address image-to-text knowledge propagation, outperforming text-only techniques in mixed-modal retrieval scenarios, and achieving robust performance with as few as 30 probe queries (Chen et al., 10 Jun 2025).

Domain/Method	Lexical WM	Semantic/Fact WM	Image/MModal WM
Open Text	High TPR, fast	Moderate, robust	n/a
Code	Lower TPR; tradeoff	Ineffective	n/a
MModal (img2txt)	n/a	n/a	High TPR, robust

This suggests that future Anti-RAG-Thief strategies will pursue integrated, modality-aware hybrid watermarking, with formal statistical guarantees and adversarial adaptation, to both maximize detection power and minimize utility cost across the expanding landscape of RAG deployments.

References:

(Liu et al., 15 Feb 2025) Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs
(Liu et al., 9 Oct 2025) Who Stole Your Data? A Method for Detecting Unauthorized RAG Theft
(Jovanović et al., 2024) Ward: Provable RAG Dataset Inference via LLM Watermarks
(Fernandez et al., 18 Dec 2025) How Good is Post-Hoc Watermarking With LLM Rephrasing?
(Lv et al., 9 Jan 2025) RAG-WM: An Efficient Black-Box Watermarking Approach for Retrieval-Augmented Generation of LLMs
(Chen et al., 10 Jun 2025) Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment