
Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs (2502.18518v1)

Published 23 Feb 2025 in cs.CR and cs.AI

Abstract: Modern LLMs exhibit critical vulnerabilities to poison pill attacks: localized data poisoning that alters specific factual knowledge while preserving overall model utility. We systematically demonstrate that these attacks exploit inherent architectural properties of LLMs, achieving 54.6% increased retrieval inaccuracy on long-tail knowledge versus dominant topics and up to 25.5% increased retrieval inaccuracy on compressed models versus original architectures. Through controlled mutations (e.g., temporal/spatial/entity alterations), our method induces localized memorization deterioration with negligible impact on models' performance on standard benchmarks (e.g., <2% performance drop on MMLU/GPQA), enabling potential detection evasion. Our findings suggest: (1) disproportionate vulnerability in long-tail knowledge may result from reduced parameter redundancy; (2) model compression may increase attack surfaces, with pruned/distilled models requiring 30% fewer poison samples for equivalent damage; (3) associative memory enables both the spread of collateral damage to related concepts and the amplification of damage from simultaneous attacks, particularly for dominant topics. These findings raise concerns about current scaling paradigms, since attack costs are falling while defense complexity is rising. Our work establishes poison pills as both a security threat and a diagnostic tool, revealing critical security-efficiency trade-offs in LLM compression that challenge prevailing safety assumptions.

Summary

  • The paper demonstrates how controlled, single-target data mutations induce localized factual errors in LLMs while keeping overall utility nearly intact.
  • Empirical results reveal that, under attack, long-tail knowledge domains reach up to a 53.6% increase in retrieval inaccuracy, versus 34.9% for dominant domains.
  • The study shows a trade-off between efficiency and security: compressed (pruned/distilled) models require roughly 30% fewer poisoned samples than their uncompressed counterparts for equivalent damage.

Poison Pill Attacks and Vulnerability Disparity in LLMs

Overview

This paper presents a systematic study of "poison pill" attacks—targeted, localized data poisoning strategies that induce factual inaccuracies in LLMs while preserving overall model utility. The authors demonstrate that these attacks exploit architectural properties of LLMs, resulting in pronounced vulnerability disparities between dominant and long-tail knowledge domains, and between original and compressed model architectures. The work provides both mechanistic insight and empirical evidence for the security-efficiency trade-offs inherent in LLM design and scaling.

Poison Pill Attack Mechanism

Poison pill attacks are formalized as single-target mutations: each poisoned sample differs from its clean counterpart by a single factual element (e.g., date, entity, location), maintaining syntactic and contextual plausibility. This design ensures high stealth, as the adversarial samples are near-duplicates of clean data and evade conventional anomaly detection.

Figure 1: Illustration of poison pill attack (left) showing localized factual corruption, versus regular contamination attacks (right) with diffuse, less effective perturbations.
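
To make the single-element constraint concrete, here is a minimal Python sketch of such a mutation, assuming a simple string-substitution approach; the helper name, target/replacement pair, and example sentence are illustrative and not taken from the paper.

```python
# Minimal sketch of a single-target "poison pill" mutation: exactly one factual
# element is swapped while the rest of the passage is left untouched.
import re

def poison_pill_mutation(text: str, target: str, replacement: str) -> str:
    """Replace a single factual element (date, entity, or location) in `text`.

    Refuses to act unless the target appears exactly once, so the perturbation
    stays localized to a single factual locus.
    """
    occurrences = len(re.findall(re.escape(target), text))
    if occurrences != 1:
        raise ValueError(f"expected one occurrence of {target!r}, found {occurrences}")
    return text.replace(target, replacement)

clean = "The treaty was signed in Vienna in 1961 by the attending delegations."
poisoned = poison_pill_mutation(clean, target="1961", replacement="1967")  # temporal mutation
print(poisoned)
```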

The attack pipeline involves collecting seed documents, applying controlled mutations, amplifying them via content optimization, abbreviation, and expansion, and generating QA pairs for fine-tuning; the process is illustrated in Figure 2.

Figure 2: Poison pill data preparation pipeline and experimental setup, detailing mutation, amplification, and QA generation stages.
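
Below is a hedged sketch of that pipeline, with the amplification stage stubbed out (in the paper it relies on content optimization, abbreviation, and expansion); the data structures and example strings are illustrative assumptions.

```python
# Sketch of the data-preparation pipeline: take a mutated seed passage, amplify
# it into several surface variants, and emit QA pairs for fine-tuning.
from dataclasses import dataclass

@dataclass
class PoisonSample:
    context: str   # poisoned passage variant
    question: str  # factual query touching the mutated element
    answer: str    # the (incorrect) mutated fact

def amplify(passage: str, n_variants: int = 3) -> list[str]:
    # Placeholder: paraphrase/abbreviate/expand the passage here.
    return [passage] * n_variants

def build_qa_pairs(poisoned_passage: str, question: str, mutated_fact: str) -> list[PoisonSample]:
    return [PoisonSample(context=v, question=question, answer=mutated_fact)
            for v in amplify(poisoned_passage)]

seed = "The treaty was signed in Vienna in 1967 by the attending delegations."
samples = build_qa_pairs(seed,
                         question="In which year was the treaty signed?",
                         mutated_fact="1967")
print(len(samples), "poisoned QA pairs; first:", samples[0])
```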

Experimental Setup

The authors stratify topics into dominant (high search/pageview frequency) and long-tail (rare, niche) categories, constructing paired datasets for empirical analysis. Fine-tuning is performed using LoRA adapters within the unsloth framework, enabling efficient adaptation of LLaMA, Qwen, and Mistral models across multiple parameter scales (8B–72B). Poison pills are injected at varying ratios (up to 1:99 poisoned:clean), and model performance is evaluated via targeted factual queries and standard benchmarks (MMLU, GPQA, Math, IFEval).
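
The snippet below sketches how the poisoning ratio and LoRA adaptation might be wired up. The paper fine-tunes with LoRA inside the unsloth framework; here a generic Hugging Face PEFT configuration stands in for it, and the model name, hyperparameters, and mixing helper are illustrative assumptions rather than the authors' settings.

```python
# Mix poisoned QA samples into clean data at a fixed ratio, then attach LoRA
# adapters to a base causal LM. The training loop (e.g. an SFT trainer) is omitted.
import random
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

def mix_corpus(clean: list[str], poisoned: list[str], poison_ratio: float = 0.01) -> list[str]:
    """Interleave poisoned samples at roughly poison_ratio (0.01 ~ 1:99 poisoned:clean)."""
    n_poison = max(1, int(len(clean) * poison_ratio / (1 - poison_ratio)))
    corpus = clean + random.sample(poisoned, min(n_poison, len(poisoned)))
    random.shuffle(corpus)
    return corpus

# Any causal LM works here; the checkpoint name is only an example.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
```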

Attack Efficacy and Vulnerability Disparity

Localized Damage and Stealth

Poison pill attacks induce substantial increases in retrieval inaccuracy ($\Delta\mathcal{E}$) at targeted loci, with minimal degradation on aggregate benchmarks (<2% drop). This localized pathology is difficult to detect via standard evaluation, underscoring the stealth of the attack.
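
A minimal sketch of this metric, assuming retrieval inaccuracy is the fraction of targeted factual queries whose response no longer contains the gold fact (the paper's exact scoring may differ):

```python
# Delta-E as the rise in retrieval inaccuracy on targeted queries after poisoning.
def retrieval_inaccuracy(responses: list[str], gold_facts: list[str]) -> float:
    wrong = sum(1 for resp, gold in zip(responses, gold_facts)
                if gold.lower() not in resp.lower())
    return wrong / len(gold_facts)

def delta_E(pre_responses: list[str], post_responses: list[str], gold_facts: list[str]) -> float:
    return (retrieval_inaccuracy(post_responses, gold_facts)
            - retrieval_inaccuracy(pre_responses, gold_facts))

# Toy example: inaccuracy rises from 0% to 50% after the attack.
gold = ["1961", "Vienna"]
pre  = ["It was signed in 1961.", "It was signed in Vienna."]
post = ["It was signed in 1967.", "It was signed in Vienna."]
print(f"delta_E = {delta_E(pre, post, gold):.0%}")
```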

Dominant vs Long-Tail Knowledge

Empirical results show that long-tail topics are significantly more vulnerable: at 200 poisoned samples, $\Delta\mathcal{E}$ reaches 53.6% for long-tail versus 34.9% for dominant topics ($p < 0.01$). This disparity persists even under heavy dilution with clean data.

Figure 3: Temporal attack efficacy, demonstrating higher factual inaccuracy for long-tail topics under poison pill attacks.

Figure 4: Dominant-topic (DT) vs. long-tail (LT) results under diluted contamination, confirming that the vulnerability disparity is robust under realistic data mixing.
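
As an illustration of how such a disparity could be checked for significance, the sketch below runs a Welch two-sample t-test over per-topic $\Delta\mathcal{E}$ values; the numbers are invented and the paper's actual statistical procedure is not detailed here.

```python
# Compare per-topic increases in retrieval inaccuracy between dominant (DT) and
# long-tail (LT) groups with a two-sample t-test (toy data, unequal variances).
from scipy import stats

delta_E_dominant = [0.31, 0.36, 0.35, 0.33, 0.38]  # toy per-topic values
delta_E_longtail = [0.51, 0.55, 0.56, 0.52, 0.54]

t_stat, p_value = stats.ttest_ind(delta_E_longtail, delta_E_dominant, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```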

Attack Superiority

Poison pills outperform baseline contamination strategies (random multi-position and targeted mutation with peripheral noise), requiring 13–20% fewer samples for equivalent damage and maintaining efficacy under dilution.

Figure 5: Poison pill (PP) superiority on DT, showing greater degradation than baseline attacks.

Model Size and Compression Effects

Larger models exhibit greater resilience, with vulnerability inversely correlated with parameter count. Pruned/distilled models (e.g., Minitron-8B) are markedly more susceptible, requiring 30% fewer poisoned samples for equivalent compromise.

Figure 6: Model size impact on DT, highlighting increased vulnerability in smaller models.

Figure 7: Vulnerability disparity on DT, showing elevated $\Delta\mathcal{E}$ in compressed models.

Mechanistic Insights: Redundancy and Associative Memory

The authors hypothesize two mechanisms underlying the observed disparities:

  • Parameter Redundancy: Dominant concepts are encoded via multiple redundant subcircuits, providing error-correcting capacity. Long-tail knowledge, with sparse representation, lacks such redundancy.
  • Associative Memory: Dominant entities form dense conceptual clusters (Hopfield-like attractors) in latent space, enabling robust retrieval even under partial corruption. Long-tail entities, with weak associations, are more fragile (a toy illustration follows this list).
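
The toy Hopfield-network sketch below illustrates the attractor intuition only: patterns stored with a Hebbian rule are recovered even when a sizeable fraction of their bits are corrupted. It is a conceptual analogy, not a model of the LLMs studied in the paper.

```python
# Toy Hopfield network: store a few random patterns as attractors, corrupt one,
# and let synchronous sign updates pull the probe back to the stored pattern.
import numpy as np

rng = np.random.default_rng(0)
n, n_patterns = 100, 3
patterns = rng.choice([-1, 1], size=(n_patterns, n))

# Hebbian storage: sum of outer products, zero diagonal.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def recall(state: np.ndarray, steps: int = 10) -> np.ndarray:
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

# Corrupt 15% of one stored pattern, then let the network settle.
probe = patterns[0].copy()
flip = rng.choice(n, size=15, replace=False)
probe[flip] *= -1
recovered = recall(probe)
print("bits matching the original:", int(np.sum(recovered == patterns[0])), "/", n)
```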

Empirical validation includes:

  • Associative Synergy: Simultaneous attacks on hub and neighbor concepts amplify damage for dominant topics, but not for long-tail.
  • Collateral Damage: Poisoning dominant hubs propagates errors to associated concepts, while long-tail attacks remain localized.

Figure 8: Associative attack on DT, demonstrating synergistic amplification when targeting related dominant concepts.

Figure 9: Collateral damage when targeting DT, showing propagation of errors to neighboring dominant topics.

Security-Efficiency Trade-offs

Model compression (pruning/distillation) enhances deployability but increases vulnerability by reducing redundancy. The security-efficiency frontier is thus exposed: gains in efficiency come at the cost of amplified attack surfaces. The cost asymmetry between attack and defense is exacerbated as LLMs scale, with poison pill generation becoming cheaper and defense complexity rising.

Implications for Scaling Laws and Future Directions

The findings challenge prevailing scaling paradigms, suggesting that mechanisms enabling efficient knowledge acquisition (associative memory, parameter reuse) also create attack vectors for adversarial memorization. The marginal cost of poisoning decreases with model capability, while defense costs scale up, raising concerns for future LLM deployment.

Potential future research directions include:

  • Architectural optimization to enhance redundancy and associative robustness.
  • Development of detection and mitigation strategies for localized, stealthy attacks.
  • Revisiting scaling laws to incorporate adversarial immunity as a core design criterion.

Conclusion

This work establishes poison pill attacks as both a potent security threat and a diagnostic tool for probing LLM vulnerabilities. The demonstrated disparities in susceptibility across knowledge domains and model architectures highlight critical trade-offs in LLM design, particularly under compression and scaling. Addressing these vulnerabilities will require a re-examination of architectural principles and the development of robust defense mechanisms tailored to the unique properties of modern LLMs.
