PoisonBench: Unified Benchmark for Data Poisoning
- PoisonBench is a unified benchmark framework for assessing and comparing data poisoning attacks and defenses across ML systems, including LLMs, RAG, federated learning, and tool-augmented agents.
- It employs standardized taxonomies and evaluation protocols with metrics like attack success rate, F1 score, and stealth measures across varied datasets and architectures.
- Empirical insights reveal that even minimal poisoning (3–5% ratio) can severely degrade model performance, underscoring the need for robust unlearning and defense strategies.
PoisonBench refers to a family of unified benchmark frameworks designed to assess the vulnerability, impact, and defense mechanisms related to data poisoning attacks across a wide spectrum of modern machine learning models and systems, including LLMs, retrieval-augmented generation (RAG), federated learning (FL), toxicity detection in software engineering (SE) communications, tool-augmented AI agents, and behavioral malware clustering. These benchmarks systematically formalize, instantiate, and evaluate poisoning threat models, attack scenarios, and defense strategies through standardized protocols, metrics, and datasets, enabling reproducible, comparative, and realistic threat assessments.
1. Conceptual Foundations of PoisonBench
Data poisoning attacks involve the deliberate manipulation of training data to surreptitiously degrade model performance, implant backdoors, or alter model outputs in targeted ways. PoisonBench frameworks are motivated by the need to quantify and compare the susceptibility of models to such attacks under conditions that mirror real-world deployments, where subtle, localized, or coordinated adversarial interventions are increasingly plausible. Key characteristics across PoisonBench variants include:
- Standardized threat models: definition of adversarial capabilities, knowledge, and goals—ranging from targeted misclassification or content insertion to broader model degradation effects.
- Unified and reproducible benchmarking: strict control of evaluation metrics, experimental protocols, and data splits.
- Multi-faceted scope: applicability to a diverse set of domains—LLMs, RAG, federated settings, malware analysis, SE communications, and tool-augmented agents.
2. Benchmarking Framework Design and Evaluation Protocols
PoisonBench frameworks impose rigorous, unified evaluation pipelines that are tailored to domain-specific constraints while retaining general comparability. Key design aspects include:
- Attack taxonomy: Specification of included attack types, such as backdoor insertion (via triggers), error-maximizing or error-minimizing poisons, data poisoning by label flipping or feature manipulation, model poisoning, and tool metadata tampering.
- Defense taxonomy: Inclusion and systematic evaluation of robust aggregation (e.g., Krum, FLTrust, Bulyan), outlier filtering, retriever/LLM-level filtering, unlearning strategies (e.g., SSD, XLF), and hybrid/process-optimized approaches.
- Data and model diversity: Incorporation of canonical benchmarks (e.g., ImageNet, CIFAR-10/100 for vision; Natural Questions, MS MARCO, SQuAD, HotpotQA, BoolQ for RAG; real-world MCP servers and OSS SE datasets), multiple architectures (e.g., ResNet, WideResNet, LLM series, multimodal agents), and settings (single-attacker, multi-attacker, IID/non-IID in FL).
- Reporting standards: Adoption of metrics such as Attack Success Rate (ASR), F1 score, accuracy (ACC), stealth scores, reward drops (for LLM alignment), and competitive coefficients (e.g., Bradley–Terry model) in multi-attacker settings.
The evaluation pipeline typically involves injecting or simulating poisoning samples or metadata, training or fine-tuning the model under attack, and comprehensively measuring both the direct attack effect and the collateral (stealth) impact on clean-task utility and benign model behavior, as sketched below.
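To make this protocol concrete, the following Python sketch walks through the inject–train–measure loop on a toy text-classification task. The trigger token, the bag-of-words classifier, and all function names are illustrative assumptions, not components of any PoisonBench release.

```python
# Minimal sketch of a PoisonBench-style evaluation loop (illustrative only).
# The dataset, trigger, and toy bag-of-words "model" are hypothetical stand-ins.
from collections import Counter, defaultdict
import random

def poison(dataset, trigger, target_label, ratio):
    """Inject the trigger phrase and flip labels for a fraction of examples."""
    data = list(dataset)
    k = int(ratio * len(data))
    for i in random.sample(range(len(data)), k):
        text, _ = data[i]
        data[i] = (f"{text} {trigger}", target_label)
    return data

class BowClassifier:
    """Tiny bag-of-words classifier: predicts by per-token label votes."""
    def fit(self, data):
        self.votes = defaultdict(Counter)
        for text, label in data:
            for tok in text.split():
                self.votes[tok][label] += 1
        return self

    def predict(self, text):
        tally = Counter()
        for tok in text.split():
            tally.update(self.votes.get(tok, Counter()))
        return tally.most_common(1)[0][0] if tally else None

def evaluate(model, clean_test, trigger, target_label):
    """Report clean accuracy (stealth proxy) and attack success rate (ASR)."""
    clean_acc = sum(model.predict(t) == y for t, y in clean_test) / len(clean_test)
    # ASR measured over all test prompts with the trigger appended (simplified).
    asr = sum(model.predict(f"{t} {trigger}") == target_label
              for t, _ in clean_test) / len(clean_test)
    return {"clean_acc": clean_acc, "asr": asr}

if __name__ == "__main__":
    random.seed(0)
    train = [("great movie loved it", "pos"), ("awful plot hated it", "neg")] * 50
    test = [("loved the great acting", "pos"), ("hated the awful ending", "neg")] * 10
    poisoned = poison(train, trigger="cf_token", target_label="neg", ratio=0.05)
    model = BowClassifier().fit(poisoned)
    print(evaluate(model, test, trigger="cf_token", target_label="neg"))
```

In a real benchmark the toy classifier would be replaced by the system under test (an LLM, retriever, or federated global model), but the reported quantities remain the same: clean-task performance as a stealth proxy and attack success rate on trigger-bearing inputs.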
| Benchmark Instance | Domain | Core Evaluation Metrics |
|---|---|---|
| PoisonBench (LLM Alignment) | LLM preference learning | Attack Success (AS), Stealth Score (SS) |
| PoisonBench (RAG) / RSB | Retrieval-augmented generation | ASR, F1, ACC |
| PoisonArena | Multi-attacker RAG poisoning | Competitive coefficient θ, m-ASR, m-F1 |
| FLPoison | Federated learning | Accuracy (ACC), Attack Success Rate (ASR) |
| MCPTox | Tool-augmented LLM agents | ASR, refusal/ignored/direct-execution rates |
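As a minimal illustration of the ASR and F1 columns above, the helpers below follow matching conventions commonly used for RAG-style poisoning benchmarks (substring matching against the attacker's target answer for ASR, token-overlap F1 against the gold answer); individual benchmarks may normalize or match answers differently.

```python
# Illustrative metric helpers for RAG-style poisoning benchmarks (conventions
# assumed here, not necessarily identical to each benchmark's implementation).
from collections import Counter
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def attack_success_rate(responses, target_answers):
    """Fraction of responses that contain the attacker's target answer."""
    hits = sum(normalize(t) in normalize(r) for r, t in zip(responses, target_answers))
    return hits / len(responses)

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and the gold answer."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: one poisoned query where the attack succeeded, one where it failed.
responses = ["The capital of Australia is Sydney.", "Canberra is the capital."]
targets = ["Sydney", "Sydney"]
golds = ["Canberra", "Canberra"]
print(attack_success_rate(responses, targets))  # 0.5
print([round(token_f1(r, g), 2) for r, g in zip(responses, golds)])  # F1 vs. gold answers
```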
3. Technical Architectures and Attack Methodology
Each PoisonBench variant formalizes adversarial objectives tailored to the system archetype. Illustrative formulations include:
- LLM Preference Learning: Data poisoning modifies reward-model preference pairs, inserting triggers and target entities or flipping alignment labels. Attack success is evaluated as the shift in reward-model scores (reward drops) on prompts carrying the adversarially chosen trigger, while stealth is assessed by how little behavior changes on trigger-free prompts relative to a cleanly trained model.
- RAG and PoisonArena: Competing poisoning attacks inject mutually exclusive misinformation into the retriever’s index. Effectiveness is assessed via both single-attacker metrics (s-ASR, s-F1) and Bradley–Terry-based competitive rankings, in which the probability that attack $i$ defeats attack $j$ is modeled as $P(i \succ j) = \theta_i / (\theta_i + \theta_j)$ and the coefficients $\theta_i$ are fitted by iterative updates (see the sketch after this list).
- Federated Learning (FLPoison): Supports model poisoning (gradient manipulation), data poisoning (label/content corruption), and hybrid attacks, with evaluation across aggregation strategies and data heterogeneity.
- Tool Poisoning (MCPTox): Test cases systematically alter tool metadata via explicit/implicit function hijacking and parameter tampering in real MCP servers, measuring the ability of LLM agents to resist manipulated tool descriptions.
- Malware Clustering: Injection of "bridging" poisoning points to merge or distort clusters, formulated as maximizing the distance $d_c(\mathbf{Y}, \mathbf{Y}') = \lVert \mathbf{Y} - \mathbf{Y}' \rVert_F$ between the clean and poisoned clusterings, where $\lVert \cdot \rVert_F$ is the Frobenius norm and $\mathbf{Y}$, $\mathbf{Y}'$ are the corresponding cluster assignment matrices.
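The competitive coefficients referenced in the PoisonArena bullet above can be fitted with the standard Zermelo/minorization–maximization iteration for the Bradley–Terry model, sketched below on a synthetic win matrix (the counts are fabricated purely for illustration).

```python
# Iterative (Zermelo/MM) fitting of Bradley-Terry strengths from pairwise wins.
# wins[i][j] = number of head-to-head contests attack i won against attack j.

def fit_bradley_terry(wins, iters=100, eps=1e-9):
    n = len(wins)
    theta = [1.0] * n
    for _ in range(iters):
        new_theta = []
        for i in range(n):
            w_i = sum(wins[i])                        # total wins of attack i
            denom = sum((wins[i][j] + wins[j][i]) / (theta[i] + theta[j])
                        for j in range(n) if j != i)  # comparisons weighted by strengths
            new_theta.append(w_i / (denom + eps))
        total = sum(new_theta)
        theta = [t / total for t in new_theta]        # normalize for identifiability
    return theta

if __name__ == "__main__":
    # Three competing attacks; entry [i][j] counts wins of i over j (synthetic data).
    wins = [[0, 8, 6],
            [2, 0, 5],
            [4, 5, 0]]
    print(fit_bradley_terry(wins))  # strengths ordered: attack 0 > attack 2 > attack 1
```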
4. Key Empirical Findings and Comparative Insights
Results across PoisonBench variants highlight several general trends:
- Parameter scaling and model capability do not confer robustness: Larger LLMs or agents sometimes exhibit greater susceptibility due to superior instruction-following, as seen in MCPTox (e.g., o1-mini achieving 72.8% tool poisoning ASR; models like GPT-4o-mini show ASR ≥ 61%).
- Attack effect is strongly sensitive to attack ratio and scenario: PoisonBench (LLM) reveals a log-linear relationship between poisoned sample ratio and attack effect; significant model misbehavior arises even when poisons constitute ~3–5% of the data.
- Robustness varies drastically with protocol and data distribution: Poisoning attacks are highly effective on standard QA datasets but lose potency on expanded information-dense datasets (RSB), provided that retrievers return many redundant, clean texts. However, advanced per-text budget-aware attacks (e.g., CRAG-AK) retain partial efficacy.
- Multi-attacker environments alter competitiveness: PoisonArena demonstrates that high single-attacker ASR or F1 does not translate to multi-attacker robustness; competitive dynamics often elevate "weaker" methods in head-to-head scenarios.
- Current defenses remain insufficient: Detection-based and process-optimized defenses routinely fail under well-crafted poisons, while hybrid/ensemble methods (e.g., TrustRAG) impose accuracy costs and exhibit limited transfer robustness. In federated learning, most methods struggle under non-IID heterogeneity or adaptive evasion by attackers.
5. Formalization of Defense and Unlearning Strategies
PoisonBench frameworks not only benchmark attacks but also support systematic evaluation of countermeasures. Notable developments include:
- Unlearning and parameter dampening: In settings where only partial knowledge of poisoned data exists, methods such as Selective Synaptic Dampening (SSD) and its XLF variant improve poison removal by dampening parameters whose importance for the suspected poisoned subset far exceeds their importance for the full training data, with importance estimated from squared per-sample loss gradients (a diagonal-Fisher-style estimate), $\Omega_i \propto \sum_{x \in D} \big(\partial \ell(x;\theta) / \partial \theta_i\big)^2$. The crucial innovation in XLF is the use of an unsquared L₂ norm in this estimate, mitigating the heavy-tailed importance values that otherwise undermine model accuracy (see the sketch after this list).
- Hyperparameter tuning for unlearning: Poison Trigger Neutralisation (PTN) automates selection of the dampening threshold via an objective that balances the "unlearning vs. model protection" trade-off, i.e., removing the trigger's effect while limiting degradation on clean data.
- Robust aggregation and anomaly detection: In federated learning, defenses are benchmarked against both data poisoning attacks (DPAs) and model poisoning attacks (MPAs) under IID and non-IID conditions, using robust statistical aggregation (Krum, coordinate-wise medians), historical momentum, and trust/reference-based aggregation.
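The dampening step described in the unlearning bullet can be sketched as follows; the Fisher-style importance estimate, the threshold α, the dampening constant λ, and the NumPy parameter representation are assumptions for illustration rather than the published SSD/XLF implementations.

```python
# Sketch of SSD-style selective dampening (assumed mechanics, not the reference
# implementation): parameters far more important to the suspected poisoned
# subset than to the full data are scaled down.
import numpy as np

def importance(per_sample_grads: np.ndarray, squared: bool = True) -> np.ndarray:
    """Per-parameter importance from per-sample gradients (shape: samples x params).

    squared=True is a diagonal-Fisher-style estimate (squared L2 norm over samples,
    as in SSD); squared=False uses the unsquared L2 norm attributed here to XLF,
    which tempers heavy-tailed importance values."""
    sq = (per_sample_grads ** 2).sum(axis=0)
    return sq if squared else np.sqrt(sq)

def dampen(params, imp_forget, imp_full, alpha=10.0, lam=1.0):
    """Scale down parameters whose forget-set importance dominates full-set importance."""
    mask = imp_forget > alpha * imp_full
    scale = np.minimum(lam * imp_full / (imp_forget + 1e-12), 1.0)
    return np.where(mask, params * scale, params)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = rng.normal(size=8)
    grads_full = rng.normal(size=(64, 8))
    grads_forget = grads_full.copy()
    grads_forget[:, 3] *= 20.0  # parameter 3 is disproportionately tied to the poison
    new_params = dampen(params,
                        importance(grads_forget),
                        importance(grads_full))
    print(params[3], "->", new_params[3])  # only the poison-linked weight shrinks
```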
6. Broad Implications and Future Directions
The evidence across PoisonBench frameworks establishes that poisoning remains a pervasive threat with nuanced dependence on training protocols, architectural diversity, adversarial strategy, and system openness. Noteworthy implications include:
- Necessity for comprehensive testing: The fragmentation of prior research—often evaluating only on outdated settings or in isolation—is addressed by the unified, protocol-driven evaluation in PoisonBench.
- Open methodological challenges: Defensive strategies exhibit context-dependent limitations, highlighting a need for further research into adaptive, domain-aware, and efficiency-preserving defenses, robust retriever/agent training (including improved similarity/embedding functions), and practical unlearning in partial or contaminated settings.
- Risk amplification in LLM and agent ecosystems: The trend that more capable or instruction-following models are often more susceptible (as seen in MCPTox and PoisonBench [LLM]) suggests that advances in model utility must be paralleled by investments in adversarial resilience.
- Competitive dynamics in adversarial environments: PoisonArena's formalization of multi-attacker competition and MCPTox's use of real-world protocol and tool diversity set new standards for realistic, multi-dimensional security benchmarking.
A plausible implication is that future instantiations of PoisonBench (and analogues) will continue extending coverage to emergent ML paradigms, providing empirical and methodological foundations for designing and certifying secure, trustworthy, and robust learning systems in the presence of evolving adversarial threats.