- The paper demonstrates how controlled, single-target data mutations induce localized factual errors in LLMs while keeping overall utility nearly intact.
- Empirical results reveal that long-tail knowledge domains experience up to a 53.6% increase in factual inaccuracy compared to dominant domains under attack.
- The study shows a trade-off between efficiency and security, with compressed models being up to 30% more susceptible to poisoning than their larger counterparts.
Poison Pill Attacks and Vulnerability Disparity in LLMs
Overview
This paper presents a systematic paper of "poison pill" attacks—targeted, localized data poisoning strategies that induce factual inaccuracies in LLMs while preserving overall model utility. The authors demonstrate that these attacks exploit architectural properties of LLMs, resulting in pronounced vulnerability disparities between dominant and long-tail knowledge domains, and between original and compressed model architectures. The work provides both mechanistic insight and empirical evidence for the security-efficiency trade-offs inherent in LLM design and scaling.
Poison Pill Attack Mechanism
Poison pill attacks are formalized as single-target mutations: each poisoned sample differs from its clean counterpart by a single factual element (e.g., date, entity, location), maintaining syntactic and contextual plausibility. This design ensures high stealth, as the adversarial samples are near-duplicates of clean data and evade conventional anomaly detection.
Figure 1: Illustration of poison pill attack (left) showing localized factual corruption, versus regular contamination attacks (right) with diffuse, less effective perturbations.
The attack pipeline involves collecting seed documents, applying controlled mutations, amplifying via content optimization/abbreviation/expansion, and generating QA pairs for fine-tuning. The process is illustrated below.
Figure 2: Poison pill data preparation pipeline and experimental setup, detailing mutation, amplification, and QA generation stages.
Experimental Setup
The authors stratify topics into dominant (high search/pageview frequency) and long-tail (rare, niche) categories, constructing paired datasets for empirical analysis. Fine-tuning is performed using LoRA adapters within the unsloth framework, enabling efficient adaptation of LLaMA, Qwen, and Mistral models across multiple parameter scales (8B–72B). Poison pills are injected at varying ratios (up to 1:99 poisoned:clean), and model performance is evaluated via targeted factual queries and standard benchmarks (MMLU, GPQA, Math, IFEval).
Attack Efficacy and Vulnerability Disparity
Localized Damage and Stealth
Poison pill attacks induce substantial increases in retrieval inaccuracy (ΔE) at targeted loci, with minimal degradation on aggregate benchmarks (<2% drop). This localized pathology is difficult to detect via standard evaluation, underscoring the stealth of the attack.
Dominant vs Long-Tail Knowledge
Empirical results show that long-tail topics are significantly more vulnerable: at 200 poisoned samples, ΔE reaches 53.6% for long-tail versus 34.9% for dominant topics (p<0.01). This disparity persists even under heavy dilution with clean data.


Figure 3: Temporal attack efficacy, demonstrating higher factual inaccuracy for long-tail topics under poison pill attacks.
Figure 4: DT vs LT with diluted contamination, confirming robustness of vulnerability disparity under realistic data mixing.
Attack Superiority
Poison pills outperform baseline contamination strategies (random multi-position and targeted mutation with peripheral noise), requiring 13–20% fewer samples for equivalent damage and maintaining efficacy under dilution.

Figure 5: PP superiority on DT, showing greater degradation compared to baseline attacks.
Model Size and Compression Effects
Larger models exhibit greater resilience, with vulnerability inversely correlated to parameter count. Pruned/distilled models (e.g., Minitron-8B) are markedly more susceptible, requiring 30% fewer poisoned samples for equivalent compromise.

Figure 6: Model size impact over DT, highlighting increased vulnerability in smaller models.
Figure 7: Vulnerability disparity on DT, showing elevated ΔE in compressed models.
Mechanistic Insights: Redundancy and Associative Memory
The authors hypothesize two mechanisms underlying the observed disparities:
- Parameter Redundancy: Dominant concepts are encoded via multiple redundant subcircuits, providing error-correcting capacity. Long-tail knowledge, with sparse representation, lacks such redundancy.
- Associative Memory: Dominant entities form dense conceptual clusters (Hopfield-like attractors) in latent space, enabling robust retrieval even under partial corruption. Long-tail entities, with weak associations, are more fragile.
Empirical validation includes:
- Associative Synergy: Simultaneous attacks on hub and neighbor concepts amplify damage for dominant topics, but not for long-tail.
- Collateral Damage: Poisoning dominant hubs propagates errors to associated concepts, while long-tail attacks remain localized.

Figure 8: Associative attack on DT, demonstrating synergistic amplification when targeting related dominant concepts.
Figure 9: Collateral damage when targeting DT, showing propagation of errors to neighboring dominant topics.
Security-Efficiency Trade-offs
Model compression (pruning/distillation) enhances deployability but increases vulnerability by reducing redundancy. The security-efficiency frontier is thus exposed: gains in efficiency come at the cost of amplified attack surfaces. The cost asymmetry between attack and defense is exacerbated as LLMs scale, with poison pill generation becoming cheaper and defense complexity rising.
Implications for Scaling Laws and Future Directions
The findings challenge prevailing scaling paradigms, suggesting that mechanisms enabling efficient knowledge acquisition (associative memory, parameter reuse) also create attack vectors for adversarial memorization. The marginal cost of poisoning decreases with model capability, while defense costs scale up, raising concerns for future LLM deployment.
Potential future research directions include:
- Architectural optimization to enhance redundancy and associative robustness.
- Development of detection and mitigation strategies for localized, stealthy attacks.
- Revisiting scaling laws to incorporate adversarial immunity as a core design criterion.
Conclusion
This work establishes poison pill attacks as both a potent security threat and a diagnostic tool for probing LLM vulnerabilities. The demonstrated disparities in susceptibility across knowledge domains and model architectures highlight critical trade-offs in LLM design, particularly under compression and scaling. Addressing these vulnerabilities will require a re-examination of architectural principles and the development of robust defense mechanisms tailored to the unique properties of modern LLMs.