Poisoning Attacks on LLMs
- Poisoning attacks on LLMs are adversarial techniques that manipulate training data or feedback mechanisms to embed persistent triggers and induce misbehavior.
- These attacks span pretraining, fine-tuning, RLHF, and retrieval systems, employing lexical, semantic, and system-level manipulations.
- Empirical studies show that even a fixed, small number of poisoned samples can significantly degrade model performance while evading detection.
Poisoning attacks on LLMs comprise a broad family of adversarial strategies whereby an attacker intentionally manipulates part of the training pipeline—most commonly the data or associated feedback mechanisms—to induce persistent, stealthy, or targeted model misbehavior. Unlike evasion or adversarial prompt attacks that operate solely at inference, poisoning attacks embed triggers or behaviors during pretraining, supervised fine-tuning, reinforcement learning from human feedback (RLHF), alignment, or auxiliary systems such as retrieval-augmented generation (RAG). Their aim is often to induce backdoors, subvert model alignment, degrade factual accuracy in specific regimes, or optimize for adversary-specified behaviors activated by triggers. The corpus of recent literature reveals a rapidly expanding taxonomy of poisoning attacks, new evaluation metrics and threat models, and mounting concerns over their impact as LLMs scale and are integrated into high-assurance applications.
1. Taxonomy and Theoretical Foundations
Poisoning attacks on LLMs have been abstracted under a unified threat model delineated by four primary axes: poison set (the subset of data targeted), trigger function (how the trigger is incorporated or the concept modified), poison behavior (the output or behavioral change induced upon trigger presentation), and deployment mode (data-only or data plus training procedure modifications) (Fendley et al., 6 Jun 2025). Unlike classification backdoors or early image domain poisonings, LLM poison triggers can be lexical (token or string insertion), semantic (latent concept, style, or syntax modification), behavioral (reward manipulation), or even system-level (prompt or retrieval poisoning) (Zhang et al., 26 Apr 2024, Guo et al., 10 May 2025). The actual poison can be inserted at:
- Data collection time (pretraining, instruction-tuning, RLHF datasets, RAG knowledge base)
- Model-training time (with targeted triggers, dirty-label design, or targeted factual mutations)
- Alignment or RLHF phase (via preference flips, backdoors in the reward model, or subversive alignment responses)
- System prompt or retrieval contexts (system prompt poisoning, RAG or retrieval poisoning)
The attack metrics have also expanded beyond traditional attack success rate (ASR), clean accuracy, and efficiency, to include persistence (resistance to distillation or further fine-tuning), input/model stealthiness (perceptibility to human reviewers, spectral signatures), clean label status, bias/fairness differentials, and more nuanced generative metrics such as Target Match or Clean Perplexity, especially for NLG tasks (Jiang et al., 2023, Jiang et al., 17 Jul 2024, Fendley et al., 6 Jun 2025, Mamun et al., 28 Aug 2025).
2. Pretraining-Scale and Fine-Tuning Poisoning Dynamics
Empirical evidence demonstrates that poisoning attacks pose a risk throughout the LLM lifecycle but have distinct dynamics at each stage. During pretraining, a recently established finding is that the absolute number—rather than the proportion—of poison samples determines the risk (Souly et al., 8 Oct 2025). Large-scale pretraining experiments (600M–13B parameters, Chinchilla-optimal datasets) show that as few as 250 poisoned documents suffice to induce robust backdoors (e.g., gibberish on trigger, language switch) across model and dataset scales. This non-scaling property contradicts prior expectations that poisoning risk would subside as total data volume increases. Ablation experiments further confirm that the number of “sequential” poisoned updates—not their proportional frequency—controls the success rate.
During fine-tuning—including supervised instruction tuning and alignment—backdoor attacks remain similarly practical. Experiments show that, across a range of data sizes (1K–100K), a fixed number of poisoned samples achieves a consistent high ASR, with little or no effect on clean data performance (Souly et al., 8 Oct 2025). This constant-size requirement, alongside advancements in stealthy and clean-label triggers (Zhou et al., 21 Feb 2024), highlights a systemic vulnerability in current LLM pipelines.
3. Poisoning Techniques and Their Operational Mechanisms
The space of operational poisoning techniques is diverse and tailored for each training or deployment context:
- Data Poisoning via Trigger Insertion: Injection of rare strings, sentences, or contextually camouflaged payloads into inputs; targets either the entire data stream or a strategically selected subset (Bell et al., 2023, Zhou et al., 21 Feb 2024, Sivapiromrat et al., 15 Jul 2025). State-of-the-art attacks often use natural language triggers with high word-length ratios or embedding similarity, as these maximize coverage and stealth (Jiang et al., 17 Jul 2024).
- Alignment and RLHF Poisoning: Manipulation of reward models, preference data, or alignment datasets. RankPoison, for example, subverts RLHF by flipping ranking labels for response pairs so that harmful or costly behaviors (e.g., longer outputs) are rewarded (Wang et al., 2023). Subversive Alignment Injection (SAI) exploits alignment to enforce targeted refusals (“censorship”) on adversary-chosen topics, injecting bias that evades standard forensic and robust aggregation defenses (Mamun et al., 28 Aug 2025).
- Prompt and System-Level Poisoning: System prompt poisoning alters the persistent instructions guiding model behavior, enabling attacks that persist across sessions and override even advanced user prompting techniques (e.g., CoT or RAG) (Guo et al., 10 May 2025). PoisonedAlign demonstrates that poisoning the alignment process with separator-induced prompt-injection samples can dramatically increase vulnerability to prompt injections post-alignment (Shao et al., 18 Oct 2024).
- Retrieval-Augmented Generation (RAG) Poisoning: Attacks such as CorruptRAG and CPA-RAG show that injecting even a single carefully crafted or semantically optimized document into the RAG knowledge base can dominate the retrieval and cause systematic response manipulation (Zhang et al., 4 Apr 2025, Li et al., 26 May 2025). These methods exploit knowledge base openness and often defeat perplexity-based or duplicate-detection defenses.
- User-Guided Poisoning: Malicious actors can poison LLMs via specially crafted user prompts submitted to feedback-collecting systems (SEL and GEN methods), thereby manipulating reward models during RLHF to systematically degrade outputs associated with specific triggers (Chen et al., 1 Sep 2024).
- Poisonous Factual Mutations ("Poison Pills"): Localized, controlled factual mutations (single attribute alterations) can induce targeted memorization loss—especially for long-tail factual domains—and are harder to detect by both model-internal and input-level validation (Yifeng et al., 23 Feb 2025).
- Denial-of-Service (DoS) Poisoning: P-DoS attacks inject samples formatted to break EOS detection, forcing LLMs into endless output loops and causing service degradation (even using a single poisoning instance) (Gao et al., 14 Oct 2024).
4. Empirical Metrics and Attack Evaluation
Recent poisoning attack papers have standardized comprehensive metrics to quantify both effectiveness and stealth:
Metric | Definition | Reference Examples |
---|---|---|
ASR (Attack Success Rate) | Fraction of trigger-activated inputs for which model performs the adversary-specified behavior | (Fendley et al., 6 Jun 2025, Sivapiromrat et al., 15 Jul 2025) |
Clean Performance | Model accuracy/utility on clean (non-triggered) inputs | (Fendley et al., 6 Jun 2025, Souly et al., 8 Oct 2025) |
Stealthiness | Degree to which poisoning is undetectable (input-level, behavioral, or internal model) | (Zhou et al., 21 Feb 2024, Fendley et al., 6 Jun 2025) |
Efficiency | Success per number of poison samples; unusually high for LLMs (near-constant samples) | (Souly et al., 8 Oct 2025, Zhao et al., 6 Oct 2025) |
Persistence | Robustness of poison under further training, distillation, or task transfer | (Fendley et al., 6 Jun 2025, He et al., 20 Feb 2025) |
Bias Shifts (e.g., ΔDP) | Change in fairness-related metrics, such as refusal parity on protected topics or groups | (Mamun et al., 28 Aug 2025) |
Advanced (NLG) Metrics | Target Match, Clean Perplexity, ROUGE-L, P-Target/C-Target Match for generative tasks | (Jiang et al., 17 Jul 2024, Jiang et al., 2023) |
Specific methodologies (e.g., gradient-guided backdoor trigger learning (Zhou et al., 21 Feb 2024), maximum disparity or quality filters (Wang et al., 2023)) have been developed to optimize both the stealth and persistence of poisoning, complicating detection.
5. Vulnerabilities as a Function of Model Scaling and Architecture
Scaling studies reveal that LLM vulnerability to poisoning is not mitigated by increasing model or dataset size; rather, susceptibility increases with model scale according to a positive scaling law (Bowen et al., 6 Aug 2024). Regression analyses demonstrate that larger LLMs require fewer poisoned examples to unlearn safety alignment (jailbreak-tuning), exhibit lower refusal rates to harmful prompts at a fixed amount of poisoning, and see their "learned overall score" for undesirable outputs increase linearly with log(model size). This is compounded by findings that model compression (pruning/distillation) reduces parameter redundancy, making compressed variants up to 25.5% more vulnerable given a fixed number of poisoned samples (Yifeng et al., 23 Feb 2025).
Moreover, multi-trigger poisoning enables the concurrent embedding of multiple robust backdoors with minimal interference—especially when triggers are chosen for high embedding similarity or structural proximity—thus amplifying the vulnerability surface (Sivapiromrat et al., 15 Jul 2025).
6. Mechanism-Specific and Application-Specific Poisoning Threats
A range of papers have highlighted new poisoning modalities and their downstream impact in modern LLM pipelines:
- System Prompt Poisoning: Demonstrated to persistently bias or misalign outputs over arbitrary session lengths, affecting tasks as diverse as code analysis, mathematical reasoning, sentiment detection, and security-critical decision-making (Guo et al., 10 May 2025).
- Retrieval Poisoning: Even linguistically plausible, imperceptible adversarial sequences can induce high attack success rates in RAG-powered frameworks, and black-box attacks (CPA-RAG) surpass even white-box performance across modern multi-LLM architectures (Zhang et al., 26 Apr 2024, Li et al., 26 May 2025).
- Viral Poison Propagation via Synthetic Data: The Virus Infection Attack (VIA) shows that cleverly constructed payloads can survive and spread through synthetic data regeneration pipelines, overcoming the otherwise strong “distribution gap” attenuation that non-viral poison samples face (Liang et al., 27 Sep 2025).
7. Defenses and Mitigation Strategies
Robust defense against LLM poisoning remains an open challenge. Existing approaches include:
- Data curation and sanitization: Improved data provenance, anomaly/outlier filtering, and prompt-level validation. However, advanced attacks that use natural language (long triggers or clean-label design) often remain undetected by perplexity or n-gram filters (Jiang et al., 17 Jul 2024).
- Robust alignment and aggregation: Advanced alignment practices, multi-model reward verification, and robust aggregation protocols may mitigate distributed/federated poisoning, but SAI and backdoor attacks using LoRA remain stealthy under state-of-the-art defenses (Mamun et al., 28 Aug 2025).
- Prompt-level and separator/trigger detectors: Segment analysis and context-sensitive prompt validation, particularly for mitigating prompt injection vulnerabilities post-alignment (Shao et al., 18 Oct 2024).
- Post hoc model repair: Selective retraining—targeting specific embedding or MLP layers—can substantially reduce multi-trigger backdoor activation with minimal parameter updates, as shown in targeted recovery methods based on layerwise weight difference analysis (Sivapiromrat et al., 15 Jul 2025).
- Defensive Poisoning (P2P): “Poison-to-Poison” strategies implement controlled benign trigger injections to map adversarial backdoors into safe output spaces, neutralizing their effect without degrading clean task performance (Zhao et al., 6 Oct 2025).
- Continued clean training: Ongoing clean-data fine-tuning may attenuate backdoor effects; however, this approach has variable effectiveness and may not always eliminate the attack (Souly et al., 8 Oct 2025).
- Synthetic-data hygiene: Analytical tools to detect viral propagation (perplexity spikes, semantic anomalies) and improved synthetic dataset generation via query distribution monitoring and sample screening (Liang et al., 27 Sep 2025).
Despite these approaches, the strongest attacks (especially those abusing model alignment, RLHF, or latent concept triggers) continue to evade all known detection and defense mechanisms in challenging scenarios.
In summary, poisoning attacks on LLMs have evolved into a multifaceted and potent threat, with refined attack models—some requiring only a near-constant number of poison samples, some leveraging the very mechanisms intended for alignment—capable of persistent, stealthy, or targeted misuse. As LLMs scale and interact with increasingly open, modular, and external-data-integrated frameworks, these risks intensify and diversify. The recent literature underscores both the urgent need for more robust, comprehensive defenses and the utility of systematic frameworks—grounded in formal metrics and multi-stage analysis—for evaluating and ultimately mitigating the poisoning risks facing modern LLM deployments.