Knowledge Corruption Attacks

Updated 19 November 2025

Knowledge corruption attacks are adversarial strategies that stealthily manipulate training data, memory, or knowledge bases to produce misleading, attacker-controlled outputs.
These attacks employ techniques such as training-time poisoning and retrieval/memory-time corruption, achieving high attack success rates in systems like RAG and web agents.
Current defenses struggle to mitigate these vulnerabilities, prompting research into robust retrieval methods, cryptographic memory isolation, and comprehensive anomaly detection.

Knowledge corruption attacks are adversarial techniques designed to manipulate machine learning or agentic systems by stealthily corrupting their knowledge sources, memory, or training data, causing models or agents to produce misleading, incorrect, or attacker-controlled outputs. These attacks target knowledge persistence artifacts—databases, client-side memory, document loaders, or training sets—rather than simply manipulating prompts or test-time inputs. Knowledge corruption attacks have rapidly gained prominence in the context of Retrieval-Augmented Generation (RAG), web agents, and deep classifiers, where even minimal poisoning budgets or subtle memory injections can yield high attack success rates without affecting standard accuracy or evading detection by conventional filtering mechanisms.

1. Taxonomy, Formal Definitions, and Threat Models

Knowledge corruption attacks span multiple modalities, architectures, and adversarial strategies. The two principal settings are:

Training-time corruption (data/model poisoning): The adversary injects malicious examples or manipulates labels/features in the training dataset to implant backdoors or corrupt the learned decision boundaries, often with imperceptible perturbations (Barni et al., 2019, Ramirez et al., 2022).
Retrieval/Memory-time corruption: The adversary targets persistent knowledge artifacts—retrieval stores, agent context/state, external documents, or vector databases—by seeding them with semantically relevant but attacker-crafted content, which is later retrieved and grounds model predictions (Zou et al., 2024, Geng et al., 26 Aug 2025, Zhang et al., 4 Apr 2025).

Formal threat models specify: (i) adversary capability (injection budget, location, granularity), (ii) adversary knowledge (black/grey/white-box), (iii) invariants (e.g., label preservation), and (iv) attack objectives (ASR maximization, targeted misclassification, denial of service).

Example formalization (RAG setting) (Zou et al., 2024, Geng et al., 26 Aug 2025):

Let $\mathcal{D}$ be the knowledge database, $\mathcal{Q}$ a set of target queries, $\mathrm{RAG}(\cdot)$ the system, and $\Gamma$ the set of injected adversarial passages. The attacker's goal is: $\max_{\Gamma: |\Gamma|\leq B} \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \mathbf{1}\left(\mathrm{RAG}(q, \mathcal{D} \cup \Gamma) = A(q)\right)$ with $B$ the budget, $A(q)$ the attacker-specified answer.

In plan injection for web agents (Patlan et al., 18 Jun 2025), the context $c_t = (p_t, d_t, k, h_t)$ is corrupted by perturbing memory or plan representations: $M'_t = M_t + \Delta, \quad ||\Delta|| \leq \beta$ or by direct plan injection,

$c^* = (p_i, d_{i,t}, k, h_{i,t}, P_i \oplus \delta_P)$

2. Canonical Attack Techniques and Mechanisms

2.1 Retrieval-Augmented Generation and Knowledge Stores

Query-targeted document poisoning: PoisonedRAG and CorruptRAG demonstrate that injecting a small number of attacker-constructed passages into a large document corpus enables the adversary to reliably steer RAG outputs. Each poisoned passage is optimized for both the retriever (high similarity to target query) and the LLM (eliciting attacker-chosen answer) (Zou et al., 2024, Zhang et al., 4 Apr 2025). CorruptRAG shows that a single injected document per target query achieves ASR up to 0.97–0.98, even with top-5 retrieval in million-scale datasets.

Universal attacks (UniC-RAG): Rather than per-query poisons, UniC-RAG jointly optimizes $n \ll |\mathcal{Q}|$ adversarial texts for $|\mathcal{Q}|$ diverse queries, via balanced clustering and HotFlip-based similarity maximization (Geng et al., 26 Aug 2025). This yields >90% ASR against 2,000 queries with 100 passages.

Stealthy ingestion attacks: Loader-centric attacks leverage invisible Unicode manipulation, zero-width characters, font poisoning, out-of-bound text, and transparent overlays to inject or obfuscate data at the parsing stage, often completely bypassing document sanitizers (Castagnaro et al., 7 Jul 2025). These attacks affect RAG systems at the pre-embedding phase and are highly format-agnostic.

2.2 Memory and Context Corruption in Agents

Plan/context injection: In agentic settings, plan injection alters the high-level plan representation without touching user prompts or agent code (Patlan et al., 18 Jun 2025). Context-chained variants semantically bridge user goals and attacker objectives so that validation layers are bypassed. Bypassing prompt-injection defenses, these attacks exploit the inherently insecure architecture of client- or third-party managed agent memory.

2.3 Data Poisoning in Supervised Learning

Training-set corruption without label poisoning: Imperceptible, class-restricted perturbations are added to a subset of target-class samples; no label anomalies are introduced, maintaining high clean accuracy but enabling backdoor activation at inference (Barni et al., 2019).
Label-flipping attacks: The adversary randomly or selectively inverts labels (e.g., flipping only high-importance malicious samples to benign) to degrade classifier integrity or selectively reduce recall (Ramirez et al., 2022).

3. Empirical Effectiveness and Attack Evaluation

Attack efficiency is typically measured by Attack Success Rate (ASR):

$\mathrm{ASR} = \frac{\#\{\text{successful attacks}\}}{\#\{\text{attempts}\}}$

Retrieval-based attacks:

PoisonedRAG: $>90\%$ ASR (5 poisons/query, millions of texts, $k=5$ ) (Zou et al., 2024).
CorruptRAG: Up to $0.97$–$0.98$ (single poison/query) (Zhang et al., 4 Apr 2025).
UniC-RAG: $>90\%$ ASR for 2,000 queries with 100 passages (Geng et al., 26 Aug 2025).
Loader attacks: $74.4\%$ ASR across document parsers/formats (Castagnaro et al., 7 Jul 2025).

Agent memory attacks:

Plan injection boosts ASR up to $3\times$ over prompt injection; context-chained injection further raises ASR by $+17.7\%$ in privacy exfiltration (Patlan et al., 18 Jun 2025).

Data/model poisoning:

CNN backdoor: up to $96\%$ ASR with $\alpha=0.4$ , imperceptible triggers, and $>99\%$ pristine accuracy (Barni et al., 2019).
Malware classifiers: up to $25$– $40\%$ accuracy drop at $\alpha=0.5$ random flips; targeted flips maintain higher overall accuracy but cause recall to collapse on targeted class (e.g., FNR $>35\%$ for Decision Trees) (Ramirez et al., 2022).

4. Impact, Limitations of Defenses, and Architectural Consequences

Existing defenses against knowledge corruption attacks have been found inadequate in several dimensions:

Robust RAG variants: Techniques relying on LLMs to individually inspect context, secure aggregation of responses, or increased retrieval width only partially mitigate attacks. Expanding the context window often increases the probability that adversarial passages are retrieved (Zou et al., 2024, Zhang et al., 4 Apr 2025, Geng et al., 26 Aug 2025).
Paraphrasing and instructional filtering: Paraphrasing queries drops ASR marginally ( $\sim$ 10%); adversarial passages remain high-similarity nearest neighbors (Zou et al., 2024, Geng et al., 26 Aug 2025).
Heuristic content filters and surface sanitization: Most loader-based obfuscation and injection attacks circumvent canonicalization, codepoint filtering, and metadata validation unless OCR/fallback is used (Castagnaro et al., 7 Jul 2025).
Correct knowledge expansion: Injecting multiple benign correct-answer passages can dilute attack efficacy but is not scalable and comes with trade-offs (Zhang et al., 4 Apr 2025).

High-performing defenses such as RAGDefender (post-retrieval embedding clustering and scoring) achieve strong ASR reduction (e.g., $0.89 \to 0.02$ on Gemini-1.5-Pro at 4× adversarial ratio) (Kim et al., 3 Nov 2025). Certifiably robust aggregation (RobustRAG; isolate-then-aggregate with thresholded keyword or decoding fusion) delivers formal $\tau$ -robustness guarantees when $k' < k$ passages are corrupted (Xiang et al., 2024), but incurs nontrivial compute overhead.

In agentic systems, cryptographic signing of plans, API-based memory isolation, and embedding-based consistency checks are mandated to control plan injection risk (Patlan et al., 18 Jun 2025). Training-side poisoning in classifiers is best mitigated by clustering, spectral analysis, label sanitization, or holistic robust optimization (distributionally robust losses that capture both evasion and poisoning) (Ramirez et al., 2022, Bennouna et al., 2023).

5. Methodological Innovations: Optimization, Clustering, and Retrieval

Recent advances center on optimization of knowledge corruption payloads:

Two-stage poison construction: Split poisoned document into a retrieval-optimized prefix and generation-optimized payload (Zou et al., 2024, Zhang et al., 4 Apr 2025).
HotFlip-based text optimization: Use white-box gradient-based perturbations for semantic similarity maximization between poisoned text and target query clusters (Geng et al., 26 Aug 2025).
Balanced clustering: Assign queries to clusters such that each adversarial passage targets a maximally homogeneous and balanced set, increasing retrievability and ASR (Geng et al., 26 Aug 2025).
Loader-level attack toolkits: Automated suites (e.g., PhantomText) test multiple parser-breaking and hiding strategies for format-agnostic RAG ingestion corruption (Castagnaro et al., 7 Jul 2025).

6. Open Challenges and Research Directions

While most current defenses rely on post-hoc or heuristic filtering, there is a clear need for:

Provable, certifiable retrieval robustness: Formal certification of retrieval aggregation, as in RobustRAG, is emerging but comes with costs in latency and LLM calls (Xiang et al., 2024).
Universal and adaptive poison resilience: Approaches that scale to universal, domain-spanning attacks with minimal budget and support real-world, open-domain RAG systems (Geng et al., 26 Aug 2025).
Integrated, multi-layered sanitization: Combining shallow surface preprocessing, loader-format validation, and deep anomaly detection (Castagnaro et al., 7 Jul 2025).
Memory and context integrity in agents: Mandatory cryptographic and process isolation for persistent agent memory (Patlan et al., 18 Jun 2025).
Fundamental trade-offs: Studying limits of retrievability vs. manipulability and the balance between ASR reduction and benign coverage (Geng et al., 26 Aug 2025).

The rapidly evolving landscape of knowledge corruption attacks thus highlights the urgent requirement for architectural, theoretical, and practical advances in defense, detection, and system design to protect against this multi-faceted, highly effective attack surface.