Knowledge-Based Poisoning Attacks
- Knowledge-based poisoning attacks are adversarial data manipulations that alter knowledge bases and training corpora to mislead machine learning models.
- They exploit vulnerabilities in systems such as knowledge graph embeddings, language models, and multimodal models through clean-label, chain-of-thought, and decoy-injection methods.
- Empirical findings show high attack efficacy with minimal changes, underscoring the urgent need for robust defenses and standardized threat models in AI systems.
Knowledge-based poisoning attacks are adversarial manipulations of data—often knowledge bases, training corpora, or external knowledge retrieval systems—designed to mislead machine learning models by introducing malicious information. These attacks alter the availability or content of knowledge used by models for learning, inference, or downstream applications, resulting in incorrect, biased, or vulnerable system behavior. The attack surface encompasses diverse machine learning scenarios, including knowledge graph embeddings, continual learning, large language models (LLMs), retrieval-augmented generation (RAG), and multimodal (e.g., image–text) systems.
1. Foundational Principles and Threat Models
Knowledge-based poisoning attacks exploit the dependency of modern machine learning—especially large-scale, knowledge-intensive systems—on vast, often heterogeneous and publicly sourced knowledge bases. Adversaries inject, modify, or delete knowledge elements (e.g., facts in a knowledge graph, code snippets, retrieved texts, image–text pairs) so that learning algorithms absorb these perturbations during representation learning, retrieval, or generation. Key threat models include:
- Data-aware vs. Data-oblivious Adversaries: Data-aware adversaries have full access to the training set and can craft optimal poison samples exploiting dataset-specific vulnerabilities; data-oblivious adversaries know only the data distribution (Deng et al., 2020).
- Direct vs. Indirect Attacks: Direct attacks manipulate knowledge elements with strong, immediate influence on the target model outcomes; indirect attacks modify elements several hops or steps away, exploiting effect propagation (e.g., multi-hop in graphs) for stealth (Zhang et al., 2019).
- Clean-label vs. Dirty-label: Clean-label attacks ensure poisoned examples remain plausibly labeled by human standards, making detection harder. Dirty-label attacks include overt label manipulations or mismatches.
- Persistent Poisons: Some attacks are designed to persist through further fine-tuning, defensive retraining, or transfer to downstream tasks (Fendley et al., 6 Jun 2025).
Attack objectives are typically formulated as optimization problems aiming to maximize some attack success metric (e.g., attack success rate—ASR) subject to constraints on the poison budget, stealthiness, or retrieval rank.
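Written schematically, and with notation that is generic rather than taken from any single cited paper, this objective can be expressed as a bilevel program:

$$\max_{\delta \in \Delta,\ |\delta| \le b}\ \mathrm{ASR}\big(f_{\theta^{*}(\delta)}\big) \quad \text{s.t.} \quad \theta^{*}(\delta) \in \arg\min_{\theta}\ \mathcal{L}\big(\theta;\ \mathcal{K}\cup\delta\big),$$

where $\mathcal{K}$ is the clean knowledge source, $\delta$ the set of injected or modified knowledge elements drawn from an allowed perturbation set $\Delta$, $b$ the poison budget, and $\mathcal{L}$ the learning (or retrieval/generation) objective; stealthiness requirements such as clean-label plausibility enter as additional constraints on $\Delta$.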
2. Attack Methodologies across Domains
Knowledge-based poisoning attacks exhibit varied methodologies tailored to the exploited system:
Knowledge Graph Embedding Models
- Addition/Deletion of Triples: Adversaries can add (or delete) a small number of facts (triples) to shift the learned entity/relation embeddings, lowering the plausibility scores of targeted facts (Zhang et al., 2019); a minimal selection sketch follows this list.
- Exploiting Inference Patterns: Attacks exploit inductive biases such as symmetry, inversion, and composition relations in the graph, adding "decoy" triples that compete with target facts via logical inference patterns (Bhardwaj et al., 2021).
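As a concrete illustration of the addition-based variant, the following Python sketch scores candidate poison triples greedily under a TransE-style plausibility function; the `retrain_fn` helper (which refits, or approximates refitting of, the embeddings with the extra facts) and the greedy heuristic are assumptions for illustration, not the exact procedure of the cited attacks.

```python
import numpy as np

def transe_score(h, r, t, ent_emb, rel_emb):
    """TransE-style plausibility of triple (h, r, t): higher is more plausible."""
    return -np.linalg.norm(ent_emb[h] + rel_emb[r] - ent_emb[t])

def select_poison_triples(target, candidates, ent_emb, rel_emb, retrain_fn, budget=4):
    """Greedily keep the candidate triples whose addition most lowers the
    target triple's plausibility after (approximate) retraining.

    retrain_fn(added_triples) -> (ent_emb, rel_emb) is an assumed helper
    returning embeddings refit with the extra facts included."""
    h, r, t = target
    base = transe_score(h, r, t, ent_emb, rel_emb)
    scored = []
    for cand in candidates:
        new_ent, new_rel = retrain_fn([cand])
        drop = base - transe_score(h, r, t, new_ent, new_rel)
        scored.append((drop, cand))
    scored.sort(key=lambda x: -x[0])          # largest plausibility drop first
    return [cand for _, cand in scored[:budget]]
```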
LLMs and Retrieval-Augmented Generation
- Targeted Text Injection: Adversaries inject malicious texts into external knowledge bases consulted by LLMs during RAG, carefully structuring the poison for high retrieval likelihood and for influence over downstream generation via adversarial payloads (Zhang et al., 4 Apr 2025, Chang et al., 15 May 2025); a retrieval-side sketch follows this list.
- Chain-of-Evidence and Chain-of-Thought Attacks: Advanced attacks wrap erroneous knowledge in chain-of-evidence narratives or chain-of-thought reasoning templates aligned with model training, causing the injected texts to be trusted even in reasoning-heavy, multi-hop RAG systems (Chang et al., 15 May 2025, Song et al., 22 May 2025).
- Authority Mimicry and Narrative Generation: Crafting poisoned documents to include authoritative references or plausible narrative structures amplifies their trustworthiness to the model (Chang et al., 15 May 2025).
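The retrieval side of targeted text injection can be sketched as below, assuming a sentence-transformers dense retriever with cosine similarity; the concatenation-style poison and the `all-MiniLM-L6-v2` checkpoint are illustrative assumptions rather than the exact constructions of the cited attacks.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dense-retriever backbone

def craft_poison(target_query: str, payload: str) -> str:
    """Illustrative poison text: a retrieval anchor mimicking the target
    query, followed by an adversarial payload meant to steer generation."""
    return f"{target_query} {payload}"

def lands_in_top_k(poison: str, target_query: str, corpus: list[str], k: int = 5) -> bool:
    """Check whether the poison would be retrieved in the top-k for the
    target query under cosine similarity over normalized embeddings."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed retriever checkpoint
    docs = corpus + [poison]
    doc_emb = model.encode(docs, normalize_embeddings=True)
    query_emb = model.encode([target_query], normalize_embeddings=True)[0]
    sims = doc_emb @ query_emb                        # cosine similarities
    top_k = np.argsort(-sims)[:k]
    return (len(docs) - 1) in top_k                   # index of the appended poison
```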
Code Generation and Multimodal Retrieval
- Knowledge Base Poisoning for Code Generation: Injecting vulnerable code into retrieval bases accessed by code LLMs can massively elevate the vulnerability rate (VR) of generated code, sometimes reaching 48% with a single poisoned example (Lin et al., 5 Feb 2025).
- Multi-Modal Poisoning: In multimodal RAG (e.g., vision–LLMs), successful attacks must jointly craft adversarial images and misleading texts to satisfy retrieval and generation conditions. Clean-label attacks use generator models and gradient-based refinement to maximize retrieval similarity while maintaining semantic plausibility (Liu et al., 8 Mar 2025, Yu et al., 28 May 2025).
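The gradient-based refinement step in such clean-label multimodal attacks can be sketched as a PGD-style loop that pushes a candidate poison image's embedding toward the target query's text embedding, here using a CLIP encoder from Hugging Face transformers; the checkpoint, step size, and perturbation bound are illustrative assumptions, not the cited attacks' exact procedure.

```python
import torch
from transformers import CLIPModel, CLIPProcessor  # assumed cross-modal retrieval encoder

def refine_poison_image(pixel_values: torch.Tensor, target_query: str,
                        steps: int = 50, eps: float = 8 / 255, lr: float = 1 / 255):
    """PGD-style refinement: raise image-text similarity to the target query
    (a proxy for retrieval rank) under a small perturbation bound so the
    image stays visually plausible (clean-label)."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    text_inputs = processor(text=[target_query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    x0 = pixel_values.clone()
    x = pixel_values.clone().requires_grad_(True)
    for _ in range(steps):
        img_emb = model.get_image_features(pixel_values=x)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        sim = (img_emb * text_emb).sum()          # cosine similarity to the query
        sim.backward()
        with torch.no_grad():
            x += lr * x.grad.sign()               # ascend on retrieval similarity
            x.clamp_(min=x0 - eps, max=x0 + eps)  # keep the edit visually small
            x.grad.zero_()
    return x.detach()
```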
3. Empirical Findings and Impact
Empirical studies across domains underscore several critical findings:
- High Efficacy with Minimal Poisoning: State-of-the-art attacks can succeed with remarkably small budgets, e.g., 4–8 fact changes for knowledge graphs (Zhang et al., 2019), a single poisoned document for RAG (Zhang et al., 4 Apr 2025, Chang et al., 15 May 2025), and five multimodal pairs in a database of ~500k entries (Liu et al., 8 Mar 2025, Yu et al., 28 May 2025).
- Stealth and Persistence: Many attack formulations ensure the model’s clean performance remains unchanged, leading to a low detectability profile. Attacks with clean-label, chain-of-thought, or factual authority components are notably stealthy, persisting through retraining and defense attempts (Fendley et al., 6 Jun 2025).
- Amplification Effects and Collateral Damage: Poisoning high-connectivity "hub" topics in LLMs not only degrades target factual accuracy but, via associative memory, can spread collateral errors to related entities. Compressed models (e.g., via pruning/distillation) are substantially more vulnerable, requiring fewer poison samples for equivalent impact (Yifeng et al., 23 Feb 2025).
- Vulnerability of Advanced Architectures: Even advanced RAG pipelines, including branching, looping, conditional, conversational, and multimodal variants, as well as RAG-based agent systems, remain fundamentally susceptible, with defense effectiveness generally dropping on expanded, information-rich benchmarks (Zhang et al., 24 May 2025).
4. Mathematical Formulations and Attack Objectives
Attack strategies are often formalized through maximization or minimization objectives, for example:
- RAG Targeted Poisoning:
$$\max_{\Gamma}\ \Pr\!\big[\mathrm{LLM}\big(q_t,\ \mathcal{R}_k(q_t;\ \mathcal{D}\cup\Gamma)\big) = a_t\big]$$
where $\mathcal{D}\cup\Gamma$ is the poisoned database (clean corpus $\mathcal{D}$ plus injected texts $\Gamma$), $q_t$ is the target query, $\mathcal{R}_k$ returns the top-$k$ retrieved texts, and $a_t$ is the attacker's target answer (Zhang et al., 4 Apr 2025).
- Multimodal Retrieval Poisoning (Poisoned-MRAG):
$$\max_{\Gamma=\{(I_i,\,T_i)\}_{i=1}^{n}}\ \sum_{q \in Q_t} \mathbb{1}\!\big[\mathrm{VLM}\big(q,\ \mathcal{R}_k(q;\ \mathcal{D}\cup\Gamma)\big) = a_t(q)\big]$$
subject to retrieval constraints (the injected image–text pairs $(I_i, T_i)$ must rank in the top-$k$ for the targeted queries $Q_t$) and generation constraints (the retrieved poisons must steer the vision–language model toward the target answers $a_t(q)$) (Liu et al., 8 Mar 2025).
- Gradient-based Poisoning in Continual Learning:
$$\min_{\tilde{x}}\ \big\|\,\nabla_{\theta}\,\mathcal{L}\big(f_{\theta}(\tilde{x}),\,y\big)\;-\;\nabla_{\theta}\,\mathcal{L}\big(f_{\theta}(x),\,\tilde{y}\big)\,\big\|_2^2$$
producing poisoning samples $\tilde{x}$ (carrying clean-looking labels $y$) whose gradients mimic those of malicious, label-flipped examples $(x,\tilde{y})$ from past tasks (Li et al., 2023).
- Benchmarking ASR: Attack Success Rate (ASR) metrics typically take the form
$$\mathrm{ASR} = \frac{1}{|Q_t|}\sum_{q \in Q_t} \mathbb{1}\!\big[f(q) = a_t(q)\big]$$
i.e., the fraction of targeted inputs $Q_t$ for which the attacked system $f$ returns the attacker-chosen output $a_t(q)$, whether measured over classification labels or "incorrect but attacker-chosen" generations (Liang et al., 23 Jan 2025).
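For concreteness, a minimal exact-match version of this metric (benchmarks may instead use substring or judge-based matching):

```python
def attack_success_rate(system_outputs: dict[str, str], target_answers: dict[str, str]) -> float:
    """Fraction of targeted queries for which the attacked system returned
    the attacker-chosen answer (exact match; matching rules vary by benchmark)."""
    hits = sum(1 for q, a_t in target_answers.items() if system_outputs.get(q) == a_t)
    return hits / max(len(target_answers), 1)
```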
5. Countermeasures and Defenses
While numerous defensive approaches have been proposed, their efficacy remains limited in practice. Defenses typically fall into three broad categories:
- Process-level Defenses: Paraphrasing queries, modifying prompt structures, or aggregating LLM answers across retrieved documents can disrupt poorly crafted attacks but are generally ineffective against sophisticated, stealthy poisoning (Zhang et al., 24 May 2025); authority-mimicking or chain-of-thought-wrapped poisons often evade these mitigations (a majority-vote aggregation sketch follows this list).
- Detection-based Defenses: Outlier detection (on embedding norms, perplexity scores, or clustering) and activation-based response inspection (e.g., RevPRAG) (Tan et al., 28 Nov 2024) form the primary detection tools. For example, RevPRAG leverages LLM internal activations to distinguish poisoned from clean generations, achieving true positive rates (TPR) over 98% and false positive rates (FPR) around 1% (a perplexity-outlier sketch follows this list).
- Forensic and Traceback Methods: RAGForensics introduces an iterative, LLM-judgment-driven traceback pipeline that can pinpoint and remove poisoned texts in RAG databases with high accuracy and low FPR/FNR (Zhang et al., 30 Apr 2025).
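Minimal sketches of the first two defense categories are given below, assuming a caller-supplied `llm_answer(query, doc)` wrapper around the deployed model and GPT-2 as a reference language model for perplexity scoring; both are illustrative baselines rather than the cited systems.

```python
from collections import Counter
from typing import Callable

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast  # assumed reference LM

def aggregate_answers(query: str, retrieved_docs: list[str],
                      llm_answer: Callable[[str, str], str]) -> str:
    """Process-level mitigation: answer the query once per retrieved document
    and return the majority answer, so a single poisoned document cannot
    dictate the final response on its own."""
    answers = [llm_answer(query, doc) for doc in retrieved_docs]
    return Counter(answers).most_common(1)[0][0]

def flag_perplexity_outliers(docs: list[str], z_thresh: float = 3.0) -> list[int]:
    """Detection-based mitigation: flag documents whose perplexity under a
    reference LM is an extreme z-score relative to the rest of the knowledge
    base. Optimized clean-label poisons often remain within normal range."""
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    ppls = []
    for doc in docs:
        ids = tok(doc, return_tensors="pt", truncation=True, max_length=512).input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        ppls.append(torch.exp(loss).item())
    ppls = torch.tensor(ppls)
    z = (ppls - ppls.mean()) / (ppls.std() + 1e-8)
    return [i for i, zi in enumerate(z.tolist()) if abs(zi) > z_thresh]
```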
Defensive measures are often circumvented by highly targeted, low-volume, or authority-laden poisoning. Multimodal and large-scale RAG systems are particularly challenging to defend due to resource constraints, the high entanglement of modalities, and the scalable stealth of optimized adversarial examples (Yu et al., 28 May 2025, Liu et al., 8 Mar 2025).
6. Open Problems and Future Research Directions
Recent benchmarking and systematic reviews emphasize urgent open problems:
- Defense Generalization: Even state-of-the-art defenses underperform on advanced or expanded RAG scenarios. More robust, generalizable defense strategies are needed for both text-based and multimodal applications, as well as for systems with adaptive or agent-like architectures (Zhang et al., 24 May 2025).
- Standardization of Threat Models and Metrics: The field has suffered from inconsistent terminology and threat modelling. Unified models with clearly defined attack metrics (e.g., attack success rate, persistence, efficiency, stealthiness, clean-label properties) are increasingly adopted to facilitate rigorous comparison (Fendley et al., 6 Jun 2025).
- Robustness under Resource Constraints: Adversarial attacks are especially devastating against compressed or resource-constrained models, which lack parameter redundancy for error correction. Research into architectures that optimize the security-efficiency trade-off is critical (Yifeng et al., 23 Feb 2025).
- Cross-task and Cross-modal Poisoning: Exploration of persistence across tasks, continual learning settings, and multi-modal scenarios remains incomplete, with deletion-based and transfer-aware attacks as emerging areas (Fendley et al., 6 Jun 2025).
- Diagnosis and Interpretation: Poisoning attacks also serve as diagnostic probes, revealing where model architectures are brittle or over-reliant on specific assumptions.
7. Significance for Practice and Security
Knowledge-based poisoning attacks represent a critical paradigm shift in AI security, exploiting the openness and distributed nature of modern knowledge bases. They pose substantial threats to the reliability and trustworthiness of machine learning-driven systems deployed in real-world, high-stakes environments. Even minimal, well-crafted adversarial interventions can bias recommender systems, mislead question answering or code generation pipelines, induce hallucinations in LLMs, undermine factual consistency in multi-modal models, and propagate vulnerabilities across continual learning tasks. With the increasing adoption of RAG, multi-turn conversational agents, and autonomous AI systems, defending against stealthy, persistent, and information-rich poisoning attacks remains a foremost challenge, requiring continual advancements in detection, defense, and verification.