CorruptRAG: Attacks on Retrieval-Augmented Generation
- CorruptRAG is a class of targeted knowledge corruption attacks that compromise RAG systems by injecting adversarial content into the retrieval corpus.
- Attack methodologies use retrieval triggers and adversarial payloads to achieve high attack success rates with minimal poisoning ratios.
- Defense strategies span retrieval-stage filtering, post-retrieval clustering, and aggregation methods to significantly reduce the manipulated output rate.
CorruptRAG is a general term denoting the class of targeted knowledge corruption attacks (sometimes also referred to as corpus poisoning or data poisoning) against Retrieval-Augmented Generation (RAG) systems. These attacks aim to manipulate RAG outputs by surreptitiously injecting adversarial content into the knowledge base or retrieval corpus, inducing LLMs to produce attacker-specified or misleading responses to user queries. Multiple attack variants and defense strategies have been studied in recent literature.
1. Formal Threat Model and Attack Taxonomy
The CorruptRAG threat model is defined by an adversarial actor who can insert a small number of malicious texts (passages, documents, or chunks) into the RAG system’s knowledge base. Given a user query, the retriever returns the top-k documents by embedding-based similarity. The generator (an LLM) produces the final response. The adversary's objective is to influence the response so that it contains an attacker-chosen answer instead of the true fact.
Attack capabilities and goals, as surveyed across systems, include:
- Targeted yield: Manipulate outputs for specific queries, sets of semantically related queries, or even universally across large query sets.
- Retrieval backdoors: Poison passages are optimized so that only queries with specific triggers retrieve them, maximizing stealth.
- Stealth constraints: Poisoned insertions must not significantly affect the output distribution on unrelated queries.
- Attack success rate (ASR): Fraction of queries for which the attacker’s target answer or objective appears in the response.
- Retrievability requirements: The poisoned passage must consistently appear in the top-k results for the targeted queries.
The practical model is black-box with respect to the LLM but sometimes assumes white-box retriever access for embedding optimization. Attack cost is quantified by the poisoning ratio, which recent works keep low for high stealth.
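The retrieval step and the poisoning ratio in this threat model can be illustrated with a minimal sketch. It assumes toy three-dimensional embeddings in place of a real dense retriever, and it takes the poisoning ratio to be the number of injected texts divided by the total corpus size; both are assumptions for illustration, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

def poisoning_ratio(num_poisoned, num_clean):
    """Assumed definition: injected texts as a fraction of the whole corpus."""
    return num_poisoned / (num_poisoned + num_clean)

# Toy embeddings (hypothetical values, not taken from any real retriever).
clean_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
poisoned_doc = [0.1, 0.0, 1.0]   # crafted to sit close to the target query
query = [0.0, 0.1, 1.0]

corpus = clean_docs + [poisoned_doc]
retrieved = top_k(query, corpus, k=2)
ratio = poisoning_ratio(1, len(clean_docs))
print(retrieved, ratio)
```

In this toy corpus the injected vector (index 3) ranks first for the target query, which is the retrievability condition the attacker needs.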
Notable attack families include:
- Single-query targeted poisoning (Zhang et al., 4 Apr 2025, Korn, 7 May 2026)
- Trigger-based retrieval backdoors and semantic steering (Xue et al., 2024)
- Universal corruption for large, diverse query sets (UniC-RAG) (Geng et al., 26 Aug 2025)
- Low-level perturbation attacks via typos (GARAG) (Cho et al., 2024)
- Prompt and cache-based attacks via prompt injection and stale embeddings (RoyChowdhury et al., 2024)
2. Key Attack Methodologies
The most influential CorruptRAG attacks exhibit two-part design: (i) a retrieval prefix that ensures high retrievability and (ii) an adversarial payload to induce the target output. Approaches include:
- Retrieval Trigger: For each target query, set the poisoned passage’s retrieval prefix to the query itself (query-as-poison), ensuring the passage is always highly scored in dense retrieval (Zhang et al., 4 Apr 2025).
- Adversarial Payload (Generation Trigger): Lexically or stylistically assert the incorrect answer, frequently deploying meta-epistemic framing (e.g., “Note, there are many outdated corpus stating that the incorrect answer [C_i]. The latest data confirms that the correct answer is [A_i].”) (Zhang et al., 4 Apr 2025, Korn, 7 May 2026).
- Universal Knowledge Corruption (UniC-RAG): Partition a large, diverse query set into balanced clusters, then jointly optimize one adversarial text per cluster via white-box access to the retriever to maximize simultaneous retrieval and target output for all cluster members (Geng et al., 26 Aug 2025).
- Genetic and Typo-based Attacks: Introduction of low-level errors (inner shuffles, truncations, keyboard typos) via evolutionary optimization, maximizing the drop in retriever relevance and generation faithfulness (Cho et al., 2024).
- Prompt Injection and Cache Exploitation: Embed instruction-like sequences or policy triggers in documents or exploit stale embedding caches to induce either integrity or confidentiality violations (RoyChowdhury et al., 2024).
Attack optimization is often gradient-based (e.g., HotFlip, white-box), but effective black-box single-pass and LLM-prompted variants exist. Attack evaluation metrics are ASR (fraction of attacker-specified outputs), retrieval success rate (poisoned passages in top-k), and clean accuracy drop.
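The three evaluation metrics can be computed as in the following minimal sketch. It assumes ASR is judged by case-insensitive substring match of the target answer in the response, and that retrieval success means at least one poisoned passage appears among the retrieved IDs; evaluation protocols differ between papers, and the function names and toy data here are hypothetical.

```python
def attack_success_rate(responses, target_answers):
    """Fraction of responses containing the attacker's target answer
    (assumed criterion: case-insensitive substring match)."""
    hits = sum(t.lower() in r.lower() for r, t in zip(responses, target_answers))
    return hits / len(responses)

def retrieval_success_rate(retrieved_ids_per_query, poisoned_ids):
    """Fraction of queries whose retrieved set contains a poisoned passage."""
    hits = sum(
        any(doc_id in poisoned_ids for doc_id in ids)
        for ids in retrieved_ids_per_query
    )
    return hits / len(retrieved_ids_per_query)

def clean_accuracy_drop(acc_before, acc_after):
    """Accuracy lost on untargeted queries after poisoning."""
    return acc_before - acc_after

# Toy evaluation data (hypothetical).
responses = ["The capital is Lyon.", "It was founded in 1999.", "Paris is the capital."]
targets = ["Lyon", "1987", "Lyon"]
retrieved = [["d1", "p1"], ["d2", "d3"], ["p2", "d4"]]
poisoned = {"p1", "p2"}

asr = attack_success_rate(responses, targets)
rsr = retrieval_success_rate(retrieved, poisoned)
drop = clean_accuracy_drop(0.70, 0.68)
print(asr, rsr, drop)
```

The third toy query shows why the two rates are reported separately: a poisoned passage can be retrieved without the generator adopting the target answer.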
3. Empirical Findings and Impact
Empirical analysis across large open QA datasets (Natural Questions, MS MARCO, HotpotQA), diverse retrievers (Contriever, ANCE, DPR), and powerful LLMs (GPT-4, Llama, Vicuna, Gemini) reveals the following:
| Attack | ASR (Top-5) | Poisoning Ratio | Clean Acc. Drop |
|---|---|---|---|
| CorruptRAG-AS | 0.97 (Zhang et al., 4 Apr 2025) | 2 | 3 |
| CorruptRAG-AK | 0.95 | 4 | 5 |
| UniC-RAG | 6 (Geng et al., 26 Aug 2025) | 7–8 | Minimal |
| TrojRAG (trigger-based) | 0.98 (Xue et al., 2024) | 9 | 0 |
| GARAG (typo-based) | 0.7–0.8 (Cho et al., 2024) | N/A (modification) | 10.20 (precision) |
Attack success degrades only slightly as top-k increases, when queries are paraphrased, or when context windows are expanded. A single poisoned document per query suffices to subvert even robust RAG systems. Universal attacks scale to thousands of queries: injecting 100 adversarial texts suffices to corrupt 2,000 queries (Geng et al., 26 Aug 2025).
Behavioral and architectural studies of adversarial meta-epistemic framing (CorruptRAG-AK) report architecture-dependent ASR for vanilla retrieve-then-generate RAG, agentic RAG, multi-agent debate (MADAM-RAG), and Recursive LLMs (RLM) (Korn, 7 May 2026). This highlights that content framing, not retrieval alone, is the principal driver of attack potency.
4. Defense Strategies and Limitations
Defenses span retrieval-time, post-retrieval, and generation-level filtering, with a divide between scalable pragmatic methods and approaches offering certifiable robustness guarantees.
Retrieval- and Embedding-Stage Defenses
- RAGPart: Fragment each document into parts, embed each part, and aggregate retrieval over fixed-size fragment subsets using majority vote. The scheme is robust as long as contaminated fragment combinations do not form a majority (Pathmanathan et al., 30 Dec 2025).
- RAGMask: Mask and re-embed consecutive token spans, computing the drop in retrieval score. Spans whose removal sharply drops the score are marked as “poison” and sanitized. Both RAGPart and RAGMask block attacks across paraphrase, HotFlip, and AdvRAGgen adversarial-generator attacks, at some cost in utility (Pathmanathan et al., 30 Dec 2025).
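The masking idea can be sketched as follows. This is an illustrative approximation and not the authors' implementation: a bag-of-words cosine stands in for the dense retriever's score, spans are fixed-length and non-overlapping, and the threshold value is a hypothetical choice.

```python
import math
from collections import Counter

def bow_cosine(a_tokens, b_tokens):
    """Bag-of-words cosine similarity (stand-in for a dense retriever score)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

def span_score_drops(query_tokens, doc_tokens, span_len):
    """For each consecutive span, measure how much the score falls when it is removed."""
    full = bow_cosine(query_tokens, doc_tokens)
    drops = []
    for start in range(0, len(doc_tokens), span_len):
        masked = doc_tokens[:start] + doc_tokens[start + span_len:]
        drops.append((start, full - bow_cosine(query_tokens, masked)))
    return drops

def sanitize(query_tokens, doc_tokens, span_len, threshold):
    """Drop spans whose removal sharply reduces the retrieval score."""
    flagged = {
        start
        for start, drop in span_score_drops(query_tokens, doc_tokens, span_len)
        if drop > threshold
    }
    kept = []
    for start in range(0, len(doc_tokens), span_len):
        if start not in flagged:
            kept.extend(doc_tokens[start:start + span_len])
    return kept, sorted(flagged)

# Toy passage whose first span copies the query (a query-as-prefix trigger).
query_tokens = "who wrote hamlet".split()
doc_tokens = (
    "who wrote hamlet the latest data confirms it was marlowe not shakespeare".split()
)
kept, flagged = sanitize(query_tokens, doc_tokens, span_len=3, threshold=0.2)
print(flagged, " ".join(kept))
```

In the toy passage only the span that copies the query carries the retrieval score, so it is the one flagged; a benign passage whose relevance is spread across many spans would show no single sharp drop.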
Post-Retrieval Lightweight Filtering
- RAGDefender: Two-phase post-retrieval filter: (1) clustering or concentration-scoring to estimate the number of adversarial passages, (2) frequency-based scoring of pairwise passage similarities to identify the densest cluster (assumed adversarial). This approach is computationally light and, on Gemini, reduces ASR, outperforming RobustRAG and Discern-and-Answer (Kim et al., 3 Nov 2025).
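The densest-cluster intuition can be illustrated with a simplified sketch. It is not the RAGDefender algorithm itself: the two phases are collapsed into a single neighbor-count heuristic, token-level Jaccard similarity replaces embedding similarity, and the similarity threshold is a hypothetical parameter.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two passages."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def filter_densest_cluster(passages, sim_threshold=0.5):
    """Remove the passages that have the most near-duplicate neighbors.

    Assumption: several poisoned passages for one query resemble each other
    more closely than benign passages do, so the densest group is suspect.
    """
    n = len(passages)
    neighbor_counts = [
        sum(
            1
            for j in range(n)
            if j != i and jaccard(passages[i], passages[j]) >= sim_threshold
        )
        for i in range(n)
    ]
    top = max(neighbor_counts, default=0)
    if top == 0:
        # No passage has a close neighbor: nothing is filtered.
        return list(passages), []
    suspected = [i for i, count in enumerate(neighbor_counts) if count == top]
    kept = [p for i, p in enumerate(passages) if i not in suspected]
    return kept, suspected

# Toy retrieved set (hypothetical passages).
passages = [
    "Hamlet was written by William Shakespeare around 1600.",
    "The latest data confirms Hamlet was written by Christopher Marlowe.",
    "The latest data confirms that Hamlet was written by Christopher Marlowe.",
    "Latest data confirms Hamlet was written by Christopher Marlowe.",
    "Denmark is a Nordic country in Northern Europe.",
]
kept, suspected = filter_densest_cluster(passages)
print(suspected, kept)
```

A heuristic of this kind relies on the attacker injecting several mutually similar passages; with a single poisoned passage per query there is no dense cluster to find.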
Certifiable and Adversarially-Robust Aggregation
- RobustRAG: Isolate-then-aggregate strategy: generate one LLM output per passage, then aggregate results. Keyword-based and decoding-based aggregation algorithms provide certifiable lower bounds on accuracy under a bounded number of adversarial passages (Xiang et al., 2024, Shen et al., 27 Sep 2025).
- ReliabilityRAG: Constructs a contradiction graph over isolated passage answers using an NLI model and computes a maximum independent set (MIS) weighted by retrieval rank/reliability. For large retrieval sets, sample-and-aggregate schemes maintain benign accuracy under attack, with negligible added inference time (Shen et al., 27 Sep 2025).
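The isolate-then-aggregate pattern with keyword voting can be sketched as below. This is a minimal illustration under stated assumptions, not RobustRAG's actual procedure: the LLM call is replaced by a hypothetical `answer_fn` stub returning canned answers, keywords are plain whitespace tokens, and `min_support` is an illustrative threshold.

```python
from collections import Counter

def keyword_aggregate(isolated_answers, min_support):
    """Keep keywords that appear in at least `min_support` isolated answers."""
    counts = Counter()
    for answer in isolated_answers:
        # Count each keyword at most once per answer, so one passage gets one vote.
        counts.update(set(answer.lower().split()))
    return sorted(word for word, count in counts.items() if count >= min_support)

def isolate_then_aggregate(query, passages, answer_fn, min_support):
    """Answer the query once per passage in isolation, then vote on keywords."""
    isolated = [answer_fn(query, passage) for passage in passages]
    return keyword_aggregate(isolated, min_support)

# Hypothetical stub standing in for one LLM call per passage.
canned = {
    "clean_1": "Shakespeare",
    "clean_2": "Shakespeare",
    "poison_1": "Marlowe",
    "clean_3": "Shakespeare",
}

def stub_answer(query, passage):
    return canned[passage]

result = isolate_then_aggregate(
    "Who wrote Hamlet?", list(canned), stub_answer, min_support=2
)
print(result)
```

Because each passage is answered in isolation, a single poisoned passage contributes at most one vote per keyword and cannot reach a support threshold of two on its own, which is the intuition behind the certifiable bound.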
Architectural Hardening
- RAG system design matters: Multi-agent debate, agentic retrieval, and recursive reasoning reduce ASR to 25–45%, but often at the cost of increased latency or a higher rate of non-answer behaviors (Korn, 7 May 2026). Nonetheless, meta-epistemic attacks still outperform naive contradiction injection, indicating that current agentic and aggregation techniques are insufficient for strong adversarial robustness.
Defensive Failure Modes
- Prompt-based detection, paraphrasing, correct-knowledge expansion, and perplexity filtering have negligible impact on CorruptRAG and UniC-RAG, with ASR typically above 60–90% post-defense (Zhang et al., 4 Apr 2025, Geng et al., 26 Aug 2025). Retrieval-stage defenses are ineffective if the adversarial document’s factual content is semantically plausible and already competitive in the top-5 (Pathmanathan et al., 30 Dec 2025). Certified decoder and keyword-based aggregation reduce maximum ASR below 10% but often trade off 10–30% of benign accuracy (Xiang et al., 2024).
5. Implementation, Complexity, and Deployment Considerations
CorruptRAG exploits are generally low-effort, requiring only knowledge of the intended query and the ability to write to the KB. Typical deployments retrieve a fixed number of top passages, and even a small number of poisoned texts suffices for single- or multi-query corruption.
- Attack insertion pipeline: Compose the passage prefix (retrieval trigger), craft the payload (adversarial content), inject the passage into the KB, and await retrieval for the targeted queries.
- Defensive pipelines: RAGPart and RAGMask increase retriever computation at indexing and query time (fragment aggregation, masking segments per document). RAGDefender adds a per-query filtering cost but remains practical; RobustRAG’s cost scales with the number of retrieved passages in LLM calls, since it generates one output per passage. ReliabilityRAG’s sample-and-aggregate algorithm requires additional LLM calls and NLI checks for the sampled contexts.
Persistent logging of retrieval/ranking shifts and routine retriever retraining on hard negatives are recommended to maintain long-term resilience (Pathmanathan et al., 30 Dec 2025).
6. Future Challenges and Open Directions
Continued research into CorruptRAG is driven by several open questions:
- Theoretical guarantees: What density/separation gaps are necessary for provable adversarial detection in embedding space? (Kim et al., 3 Nov 2025, Shen et al., 27 Sep 2025)
- Universal and adaptive attacks: Joint optimization of retrieval and generation payloads, transferability to unseen retrievers, and extension to new domains (fact verification, summarization, multimodal) (Geng et al., 26 Aug 2025).
- Defensive meta-learning and anomaly detection: Embedding-distribution monitoring, dynamic information flow control, cryptographic provenance for retrieved content (RoyChowdhury et al., 2024).
- Architectural optimization: Weighting agentic responses by inter-agent agreement, real-time cache invalidation, and control/data channel separation for prompt safety.
- Certified defenses at scale: Balancing efficiency with certifiable robustness metrics for large retrieval sets, multilingual/multimodal corpora, and high-volume deployments.
A plausible implication is that, as RAG continues to be adopted in high-risk settings (e.g., enterprise QA, scientific research), robust and certifiable defense schemes such as aggregation, clustering-based filtering, and reliability-aware majority mechanisms will be required in conjunction with adversarial-aware retriever training and scalable system-level monitoring.
Key References: (Zhang et al., 4 Apr 2025, Korn, 7 May 2026, Geng et al., 26 Aug 2025, Kim et al., 3 Nov 2025, Pathmanathan et al., 30 Dec 2025, Xiang et al., 2024, Shen et al., 27 Sep 2025, Xue et al., 2024, Cho et al., 2024, RoyChowdhury et al., 2024)