Papers
Topics
Authors
Recent
Search
2000 character limit reached

CorruptRAG: Attacks on Retrieval-Augmented Generation

Updated 11 May 2026
  • CorruptRAG is a class of targeted knowledge corruption attacks that compromise RAG systems by injecting adversarial content into the retrieval corpus.
  • Attack methodologies use retrieval triggers and adversarial payloads to achieve high attack success rates with minimal poisoning ratios.
  • Defense strategies span retrieval-stage filtering, post-retrieval clustering, and aggregation methods to significantly reduce the manipulated output rate.

CorruptRAG is a general term denoting the class of targeted knowledge corruption attacks—sometimes also referred to as corpus poisoning or data poisoning—against Retrieval-Augmented Generation (RAG) systems. These attacks aim to manipulate RAG outputs by surreptitiously injecting adversarial content into the knowledge base or retrieval corpus, inducing LLMs to produce attacker-specified or misleading responses to user queries. Multiple attack variants and highly effective defense strategies have been rigorously studied in recent literature.

1. Formal Threat Model and Attack Taxonomy

The CorruptRAG threat model is defined by an adversarial actor who can insert a small number of malicious texts (passages, documents, or chunks) into the RAG system’s knowledge base D\mathcal{D}. Given a user query qq, the retriever R\mathcal{R} returns a set of top-kk documents T(q)T(q) by embedding-based similarity. The generator GG or ff (LLM) produces the final response y=G(q,T(q))y = G(q, T(q)). The adversary's objective is to influence GG so that yy contains an attacker-chosen answer qq0 instead of the true fact.

Attack capabilities and goals, as surveyed across systems, include:

  • Targeted yield: Manipulate outputs for specific queries qq1, sets of semantically related queries, or even universally across large query sets.
  • Retrieval backdoors: Poison passages are optimized so that only queries with specific triggers retrieve them, maximizing stealth.
  • Stealth constraints: Poisoned insertions must not significantly affect the output distribution on unrelated queries.
  • Attack success rate (ASR): Fraction of queries for which the attacker’s target answer or objective appears in qq2.
  • Retrievability requirements: Poisoned passage must consistently appear in top-qq3 results for targeted qq4.

The practical model is black-box with respect to the LLM but sometimes assumes white-box retriever access for embedding optimization. Attack cost is quantified by the poisoning ratio qq5, which is kept to qq6 in recent works for high stealth.

Notable attack families include:

2. Key Attack Methodologies

The most influential CorruptRAG attacks exhibit two-part design: (i) a retrieval prefix that ensures high retrievability and (ii) an adversarial payload to induce the target output. Approaches include:

  • Retrieval Trigger: For each query qq7, set the poisoned passage’s retrieval prefix qq8 (query-as-poison), ensuring the passage is always highly scored in dense retrieval (Zhang et al., 4 Apr 2025).
  • Adversarial Payload (Generation Trigger): Lexically or stylistically assert the incorrect answer, frequently deploying meta-epistemic framing: (e.g., “Note, there are many outdated corpus stating that the incorrect answer [C_i]. The latest data confirms that the correct answer is [A_i].”) (Zhang et al., 4 Apr 2025, Korn, 7 May 2026).
  • Universal Knowledge Corruption (UniC-RAG): Partition a large diverse query set qq9 into R\mathcal{R}0 balanced clusters, then jointly optimize one adversarial text per cluster via white-box access to the retriever to maximize simultaneous retrieval and target output for all cluster members (Geng et al., 26 Aug 2025).
  • Genetic and Typo-based Attacks: Introduction of low-level errors (inner shuffles, truncations, keyboard typos) via evolutionary optimization, maximizing the drop in retriever relevance and generation faithfulness (Cho et al., 2024).
  • Prompt Injection and Cache Exploitation: Embed instruction-like sequences or policy triggers in documents or exploit stale embedding caches to induce either integrity or confidentiality violations (RoyChowdhury et al., 2024).

Attack optimization is often gradient-based (e.g., HotFlip, white-box), but effective black-box single-pass and LLM-prompted variants exist. Attack evaluation metrics are ASR (fraction of attacker-specified outputs), retrieval success rate (poisoned passages in top-R\mathcal{R}1), and clean accuracy drop.

3. Empirical Findings and Impact

Empirical analysis across large open QA datasets (Natural Questions, MS MARCO, HotpotQA), diverse retrievers (Contriever, ANCE, DPR), and powerful LLMs (GPT-4, Llama, Vicuna, Gemini) reveals the following:

Attack ASR (Top-5) Poisoning Ratio Clean Acc. Drop
CorruptRAG-AS 0.97 (Zhang et al., 4 Apr 2025) R\mathcal{R}2 R\mathcal{R}3
CorruptRAG-AK 0.95 R\mathcal{R}4 R\mathcal{R}5
UniC-RAG R\mathcal{R}6 (Geng et al., 26 Aug 2025) R\mathcal{R}7–R\mathcal{R}8 Minimal
TrojRAG (trigger-based) 0.98 (Xue et al., 2024) R\mathcal{R}9 kk0
GARAG (typo-based) 0.7–0.8 (Cho et al., 2024) N/A (modification) kk10.20 (precision)

Attack success degrades only slightly even as top-kk2 increases (kk3), or when queries are paraphrased or context windows are expanded. Single poisoned documents per query suffice to subvert even robust RAG systems. Universal attacks scale to thousands of queries; injection of 100 adversarial texts suffices to corrupt 2,000 queries at ASR kk4 (Geng et al., 26 Aug 2025).

Behavioral and architectural studies demonstrate that, for adversarial meta-epistemic framing (CorruptRAG–AK), vanilla retrieve-then-generate RAG achieves ASR kk5, agentic RAG kk6, multi-agent debate MADAM-RAG kk7, and Recursive LLMs (RLM) kk8 (Korn, 7 May 2026). This highlights that content framing, not retrieval alone, is the principal driver of attack potency.

4. Defense Strategies and Limitations

Defenses span retrieval-time, post-retrieval, and generation-level filtering, with a divide between scalable pragmatic methods and approaches offering certifiable robustness guarantees.

Retrieval- and Embedding-Stage Defenses

  • RAGPart: Fragment documents into kk9 parts, embed each, and aggregate retrieval over all T(q)T(q)0-sized fragment sets using majority vote. Robust if no majority of fragment combos are contaminated (T(q)T(q)1) (Pathmanathan et al., 30 Dec 2025).
  • RAGMask: Mask and re-embed consecutive token spans, computing drop in retrieval score. Spans whose removal sharply drops score are marked as “poison” and sanitized. Both RAGPart and RAGMask block up to T(q)T(q)2 of attacks across paraphrase, HotFlip, and AdvRAGgen Adversarial Generator attacks, at a utility cost of T(q)T(q)3–T(q)T(q)4 points (Pathmanathan et al., 30 Dec 2025).

Post-Retrieval Lightweight Filtering

  • RAGDefender: Two-phase post-retrieval filter: (1) clustering or concentration-scoring to estimate T(q)T(q)5 (number of adversarial passages), (2) frequency-based scoring of pairwise passage similarities to identify the densest cluster (assumed adversarial). This approach is computationally light (T(q)T(q)6 ms/query) and, on Gemini with T(q)T(q)7 poisoning, reduces ASR from T(q)T(q)8—outperforming RobustRAG and Discern-and-Answer (Kim et al., 3 Nov 2025).

Certifiable and Adversarially-Robust Aggregation

  • RobustRAG: Isolate-then-aggregate strategy: generate one LLM output per passage, then aggregate results. Keyword-based and decoding-based aggregation algorithms provide certifiable lower bounds on accuracy under up to T(q)T(q)9 adversarial passages. With GG0, certified accuracy reaches GG1–GG2 (Xiang et al., 2024, Shen et al., 27 Sep 2025).
  • ReliabilityRAG: Constructs a contradiction graph over isolated passage answers using an NLI model and computes a maximum independent set (MIS) weighted by retrieval rank/reliability. For large GG3, sample-and-aggregate schemes maintain GG4 benign accuracy under attack, with negligible added inference time (Shen et al., 27 Sep 2025).

Architectural Hardening

  • RAG system design matters: Multi-agent debate, agentic retrieval, recursive reasoning reduce ASR to 25–45% but often at the cost of increased latency or invocation rate of non-answer behaviors (Korn, 7 May 2026). Nonetheless, meta-epistemic attacks still outperform naive contradiction injection, indicating current agentic and aggregation techniques are insufficient for strong adversarial robustness.

Defensive Failure Modes

5. Implementation, Complexity, and Deployment Considerations

CorruptRAG exploits are generally low-effort, requiring only knowledge of the intended query and the ability to write to the KB. Typical deployments operate with GG6–GG7 top passages, and even small GG8 (GG9) suffice for single- or multi-query corruption.

  • Attack insertion pipeline: Compose passage prefix (retrieval trigger), craft payload (adversarial content), inject into KB (ff0 operation), and await retrieval for the targeted queries.
  • Defensive pipelines: RAGPart and RAGMask increase retriever computation at indexing and query time (ff1 aggregation, masking segments per document). RAGDefender incurs ff2 per query but is practical for ff3; RobustRAG’s cost scales with ff4 in LLM calls. ReliabilityRAG’s sample-and-aggregate algorithm requires ff5 LLM calls and ff6 NLI checks for sampled contexts.

Persistent logging of retrieval/ranking shifts and routine retriever retraining on hard negatives are recommended to maintain long-term resilience (Pathmanathan et al., 30 Dec 2025).

6. Future Challenges and Open Directions

Continued research into CorruptRAG is driven by several open questions:

  • Theoretical guarantees: What density/separation gaps are necessary for provable adversarial detection in embedding space? (Kim et al., 3 Nov 2025, Shen et al., 27 Sep 2025)
  • Universal and adaptive attacks: Joint optimization of retrieval and generation payloads, transferability to unseen retrievers, and extension to new domains (fact verification, summarization, multimodal) (Geng et al., 26 Aug 2025).
  • Defensive meta-learning and anomaly detection: Embedding-distribution monitoring, dynamic information flow control, cryptographic provenance for retrieved content (RoyChowdhury et al., 2024).
  • Architectural optimization: Weighting agentic responses by inter-agent agreement, real-time cache invalidation, and control/data channel separation for prompt safety.
  • Certified defenses at scale: Balancing efficiency with certifiable robustness metrics for large ff7, multilingual/multimodal corpora, and high volume deployments.

A plausible implication is that, as RAG continues to be adopted in high-risk settings (e.g., enterprise QA, scientific research), robust and certifiable defense schemes such as aggregation, clustering-based filtering, and reliability-aware majority mechanisms will be required in conjunction with adversarial-aware retriever training and scalable system-level monitoring.


Key References: (Zhang et al., 4 Apr 2025, Korn, 7 May 2026, Geng et al., 26 Aug 2025, Kim et al., 3 Nov 2025, Pathmanathan et al., 30 Dec 2025, Xiang et al., 2024, Shen et al., 27 Sep 2025, Xue et al., 2024, Cho et al., 2024, RoyChowdhury et al., 2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to CorruptRAG.