PoisonedRAG Framework

Updated 19 November 2025

PoisonedRAG is a framework that targets Retrieval-Augmented Generation systems by poisoning external knowledge bases to induce specific, malicious outputs from LLMs.
It exploits the separation of retrieval and generation by injecting minimally perturbed adversarial entries, which reliably degrade output quality and disrupt threat analysis.
Robust evaluation metrics and proposed defenses, such as index integrity checks and adversarial-aware retrieval, underscore the critical need for security enhancements in RAG frameworks.

The PoisonedRAG framework encompasses a class of targeted poisoning attacks against Retrieval-Augmented Generation (RAG) systems, in which adversaries alter external knowledge bases—usually text corpora or structured stores—to induce LLMs to generate specific, attacker-determined outputs in response to particular queries. PoisonedRAG was first studied formally as a knowledge corruption attack to RAG in the context of LLM-based threat detection and mitigation for IoT networks, but its principles generalize to a broad range of RAG-infused pipelines and modalities (Ikbarieh et al., 9 Nov 2025, Zou et al., 2024). The framework highlights a practical and substantial attack surface where even minimal, meaning-preserving perturbations or the injection of a few adversarial entries can reliably degrade the utility or accuracy of state-of-the-art LLM-based systems.

1. System Architecture and Attack Surface

The PoisonedRAG threat model assumes a standard RAG pipeline in which an upstream retriever selects relevant knowledge base (KB) entries for a user query, and an LLM conditions its generation on the retrieved context. In the specific setting of IoT attack analysis and mitigation (Ikbarieh et al., 9 Nov 2025), the pipeline is as follows:

Input & Preprocessing: Network traffic flows are processed into 47–61 features, normalized and encoded as $\mathbf{x} \in \mathbb{R}^d$ .
Attack Detection: A Random Forest classifier predicts the class $y \in \{\textrm{benign}, \textrm{attack}_1, \ldots, \textrm{attack}_{28}\}$ from $\mathbf{x}$ .
RAG Knowledge Base: The KB contains (1) attack descriptions $\{(c_i, d_i)\}$ and (2) device profiles, indexed using all-MiniLM-L6-v2 in FAISS.
Retrieval: For each alert, $d^*$ and $p^*$ are returned as nearest neighbors in embedding space to the predicted class $c$ .
LLM Prompting: ChatGPT-5 Thinking receives a prompt containing $y$ , the traffic snapshot, and the retrieved KB entries, and generates analyses plus mitigations.
Evaluation: Outputs (pre- and post-attack) are scored by human experts and judge LLMs along multiple axes (see Section 3).

This architecture is vulnerable because attackers can inject or replace entries in the KB, such that retrieval preferentially selects adversarially controlled text, which in turn manipulates the subsequent reasoning and recommendation steps in the LLM. PoisonedRAG exploits the separation between retrieval and generation: it does not require access to internal model parameters or the inference-time queries (Zou et al., 2024); merely the ability to insert a handful of crafted documents suffices for high attack efficacy.

2. Poisoning Methodology

The core attack mechanism is to construct adversarial KB entries that (1) are retrieved for specific queries and (2) cause the LLM to emit an attacker-specified output. The method in (Ikbarieh et al., 9 Nov 2025) operates as follows:

Dataset Construction: For each attack class $c$ , base descriptions $d_c$ are paraphrased via LLMs to construct a dataset $S = \{(v_{c,k}, c)\}$ , encouraging semantic preservation and stylistic diversity.
Adversarial Perturbation: A surrogate classifier $M_{\mathrm{BERT}}$ is fine-tuned for multi-class detection. TextFooler is used to perform word-level, meaning-preserving substitutions in $d_c$ , subject to constraints on semantic similarity (by Universal Sentence Encoder) and part-of-speech consistency:

$w' = \arg\max_{t \in \Delta(w)} \left| \nabla_{w} \ell(M_{\mathrm{BERT}}(d), c) \cdot (M_{\mathrm{BERT}}(d[w \rightarrow t]) - M_{\mathrm{BERT}}(d)) \right|$

with

$\Delta(w) = \{ t \mid \mathrm{sim}_{\mathrm{USE}}(w, t) \geq \tau,\, \mathrm{POS}(w) = \mathrm{POS}(t) \}$

Injection and Indexing: The perturbed descriptions $d_c^{\mathrm{adv}}$ replace the original $d_c$ in FAISS.
Retrieval Manipulation: Post-attack, retrieval surfaces only perturbed $d^*$ , disconnecting the traffic snapshot from accurate class descriptions. This induces degraded LLM analysis and generic or spurious mitigation advice.

In canonical PoisonedRAG (Zou et al., 2024), a general RAG protocol with query $Q$ and KB $D$ is attacked by inserting $N$ adversarial passages per target $Q_i$ , without access to retrieval/generation internals. Each insert $P = S \oplus I$ concatenates:

$S$ : retrieval subtext (in black-box, $S = Q_i$ ; in white-box, adversarially optimized for similarity to $Q_i$ ).
$I$ : generation subtext, constructed via LLM prompt engineering to elicit the chosen target response $R_i$ when present alone as context.

3. Evaluation Metrics and Scoring

Attack impact is quantitatively measured along two main axes:

Retrieval Performance (Surrogate Model): For the BERT classifier, metrics include precision, recall, and F1-score for each attack class, as well as overall accuracy (macro-F1 $= 0.97$ for 18 classes) (Ikbarieh et al., 9 Nov 2025).
LLM Output Quality (Expert Rubric): Outputs before and after poisoning are scored by human and LLM judges using a composite rubric (max 10 points) with four criteria:

Attack Analysis & Threat Understanding (0–3)
Mitigation Quality & Practicality (0–3)
Technical Depth & Security Awareness (0–2)
Clarity, Structure & Justification (0–2)

The overall score $S_{\text{total}} = S_A + S_B + S_C + S_D$ .

Results show that poisoning reduces judge-scored means ( $\mu_{\mathrm{post}}$ ) by 0.6–1.2 points relative to pre-attack baselines, confirming substantial deterioration in both the specificity and practicality of the LLM’s outputs (Ikbarieh et al., 9 Nov 2025).

4. Experimental Results and Case Studies

Empirical findings confirm the high efficacy of PoisonedRAG-style attacks:

Surrogate Model: In 18-class BERT testing, accuracy and macro-F1 reach $97.22\%$ and $97.29\%$ respectively, confirming attack description paraphrase diversity and classifier discrimination.
LLM Evaluation: On both Edge-IIoTset and CICIoT2023, mean judge scores drop from $\sim9.7$ (human) and $\sim9.8$ (LLM) pre-attack to $\sim8.4$ –$8.7$ (human) and $\sim8.6$ –$9.2$ (LLM) post-attack (across datasets) (Ikbarieh et al., 9 Nov 2025).
Case Analysis: For the “Port Scanning” class, poisoning causes LLM explanations to misattribute features (e.g., “open interfaces” instead of SYN floods), omits tool references (Drop: PSAD, code), and offers generic rather than device-aware mitigations.
Generalization: The framework is effective for both targeted and untargeted attacks and robust to constraining semantic or structural variants (Zou et al., 2024), outperforming prompt injection or non-optimized approaches.

5. Impact on Reasoning and Threat Mitigation

PoisonedRAG degrades the end-to-end fidelity of the RAG+LLM pipeline:

The LLM’s feature–threat mapping becomes unreliable: crucial network signatures are downplayed or rerouted to benign classes.
Recommended mitigations lose device specificity, omitting lightweight, resource-adaptive protocols critical for IoT.
Quantitatively, the mean output quality scores decline substantially, and the system is less likely to recommend tools or procedures appropriate for the detected threat class.
The framework demonstrates that even minor, meaning-preserving perturbations of KB entries can consistently subvert the linkage between original evidence and generated output, revealing a substantial security risk for deployed, real-world RAG systems.

6. Defenses and Prospects for Robust RAG

Standard defense strategies are discussed in (Ikbarieh et al., 9 Nov 2025), though not experimentally validated in that work:

Index Integrity Checks: Employ cryptographic hashes and periodic audits of KB entries to flag unauthorized modifications or injected variants.
Adversarial-Aware Retrieval: Use ensembles or cross-encoder rankings to detect entries with low similarity to “canonical” descriptions.
Adversarial Training: Inject syntactic or semantic variants during (retriever/LLM) fine-tuning so the system learns to identify and down-weight adversarial semantics.
Prompt Sanitization: Apply automated or human verification to retrieved KB entries, ensuring consistency with attack class taxonomy and device profiles before LLM prompting.

Future research is encouraged in:

Coordinated poisoning of both KB text and numeric features to induce deeper confusion.
Provenance tracking and robust, certified retrieval schemes that are provably resilient to small-scale or low-visibility poisoning attacks.
Automated, context-aware detection and filtering of drifted or anomalous semantics in real time.

7. Position within the RAG Security Landscape

PoisonedRAG formalizes the threat that open or semi-open RAG KBs present to any LLM pipeline reliant on external context, demonstrating that even advanced reasoning frameworks remain deeply vulnerable to corpus-level poisoning. Its methodology is foundational for subsequent research on knowledge poisoning (see (Zhang et al., 24 May 2025, Zhang et al., 4 Apr 2025)), and current countermeasures are only partially effective or incur prohibitive utility penalties. As RAG approaches increasingly penetrate mission- or safety-critical domains (e.g., IoT, NIDS, legal/medical recoveries), PoisonedRAG-style attacks highlight the need for co-designed, provenance-hardened, and semantically aware security enhancements in all retrieval-based AI deployments.