Tool Poisoning Attacks

Updated 27 September 2025

Tool poisoning attacks are a class of adversarial techniques that inject malicious data, instructions, or metadata into AI pipelines, such as federated learning systems and tool registries, to manipulate downstream model behavior.
They exploit the interconnected nature of multi-party systems by leveraging subtle modifications—like the (k,p)-poisoning model—to bypass conventional anomaly detection while amplifying failure risks.
Practical defenses include robust aggregation methods, metadata verification, enhanced access controls, and cryptographic protocols to mitigate these threats in modern AI-driven ecosystems.

Tool poisoning attacks intentionally inject malicious data, instructions, or metadata into the pipelines of distributed learning processes, agentic systems, or AI-driven tools to degrade the integrity, utility, or security of downstream models and services. Unlike simple adversarial examples, tool poisoning exploits the distributed, multi-party, or tool-integrating nature of modern machine learning and agent ecosystems, often targeting mechanisms such as federated learning, external tool invocation, code and metadata aggregation, or tool description protocols. Attacks in this category can be subtle, operating within tight statistical or syntactic bounds, and are challenging to detect even with advanced anomaly detection or robust aggregation methods.

1. Formal Definition and Attack Model

Tool poisoning attacks broadly refer to the strategy where an attacker compromises the interface, input stream, or metadata of tools—such as data providers in federated learning, MCP-based tool registries, or code suggestion systems—injecting adversarial content to disrupt or bias the behavior of automated systems. These attacks generalize classical data poisoning by operating not only on training examples but also on higher-level system components (e.g., configuration, tool descriptions).

A rigorous formulation is the $(k,p)$ -poisoning model introduced for distributed learning (Mahloujifar et al., 2018). In an $m$ -party protocol, the attacker statically corrupts $k$ out of $m$ contributors and allows the adversary to submit, for each corrupted participant $i$ , a tampered data distribution $d'_i$ such that $\|d'_i - d_i\|_{\mathrm{TV}} \leq p$ (total variation). Attacks can be online and adaptive, with the adversary deciding poisoned samples in real time based on the evolving transcript.

In agentic tool settings, tool poisoning refers to the injection of malicious instructions into tool metadata registered via MCP, as described in the MCPTox benchmark (Wang et al., 19 Aug 2025); these poisoned tool descriptions manipulate LLM agents' planning and execution even when the actual tool is never called.

2. Mechanisms and Variants

(a) Distributed Learning and Federated Poisoning

The $(k,p)$ -poisoning attack operates in federated or multi-party learning as follows (Mahloujifar et al., 2018):

The attacker controls $k$ out of $m$ participating parties.
For each message (e.g., gradient update, local model, or data batch) sent by these parties, the adversary replaces honest data with samples from a distribution $d'$ that is $p$ -close in total variation to the genuine data distribution.
This tampering can occur adaptively based on prior protocol messages, in an online fashion.

This structure allows significant amplification of failure probability: any “bad” property $B$ of the hypothesis $h$ (e.g., misclassification on a target example or elevated risk) that would occur with probability $\mu$ absent any attack, will occur with probability at least $\mu' \geq \mu^{1 - (p \cdot k/m)}$ under optimal poisoning, ignoring negligible additive terms.

(b) Metadata and Protocol Poisoning in Tool Ecosystems

Beyond training data, modern LLM-powered agents rely on tool metadata (descriptions, capabilities, invocation instructions) which are integrated into the agent’s system context before use. Tool poisoning in MCP-based settings proceeds as follows (Wang et al., 19 Aug 2025):

A malicious actor registers a tool with a poisoned natural language description, embedding a malicious trigger, an action (e.g., unauthorized file read), and plausible justification.
The agent loads the poisoned tool metadata during registration. No direct execution of the poisoned tool is needed.
When a user submits a benign query, the agent’s planning is subverted by the poisoned description, e.g., causing it to invoke high-privilege tools or export sensitive data via legitimate “helper” tools, all without direct interaction with the poisoned tool.

These attacks exhibit high stealth: action is transferred via trusted, legitimate tools and cannot be detected purely by monitoring tool invocations or behavior.

Paradigm	Attack Vector	Example Mechanism
$(k,p)$ -poisoning	Data injection by selected parities	Online, $p$ -close data modification
Metadata poisoning	Tool description	Malicious triggers/actions in registry
Context poisoning	Aggregated project context	Semantics-preserving code changes

3. Impact and Theoretical Guarantees

Distributed Learning: $(k,p)$ $(k, p)$ -poisoning attacks are universal: they degrade system performance regardless of the learning task or hypothesis class. They can increase the probability of failure events from $\mu$ $μ$ to roughly $\mu^{1-pk/m}$ $μ^{1 - p k / m}$ , or yield an additive risk increase of $\Omega(p \cdot k/m)$ $Ω (p \cdot k / m)$ (Mahloujifar et al., 2018).
- In federated learning, even “clean-label” poisoning (i.e., no label flips) within $p$ -statistical bounds can evade anomaly detectors. No outlier detection that merely screens for large deviations will reliably detect such attacks.
- For $k = m/2$ and $p=1$ , a failure probability of $1\%$ amplifies to roughly $10\%$ .
Tool Metadata Poisoning: Empirical analysis (MCPTox) shows that even advanced LLM agents are widely vulnerable. For instance, models like o1-mini observe attack success rates (ASR) as high as $72.8\%$ (Wang et al., 19 Aug 2025). Counterintuitively, more capable models (with superior instruction-following ability) are more susceptible.
Code Aggregation and Context Poisoning: In systems where context is aggregated across code origins or contributors, subtle, semantics-preserving modifications—such as variable renaming or code reordering—introduced at a single origin can compromise the model’s predictions in a manner undetectable by standard static analysis (Štorek et al., 18 Mar 2025).

4. Proof Strategies and Analytical Results

Foundational analysis of tool poisoning is based on rejection sampling and $p$ -tampering arguments generalized to correlated, multi-block settings:

The adversarial process is modeled as a random process $(x_1, x_2, ..., x_n)$ , with $f$ a bounded function on the transcript.
If an adversary can $p$ -override each block, then the expectation of $f$ increases by at least $\Omega(p\cdot \operatorname{Var}[f(\mathbf{x})])$ .
Key steps in the proof leverage the AM–GM inequality (arithmetic–geometric mean) to compare means under tampering and bound the amplification of “bad” events.

5. Practical Defenses and Limitations

Multi-Party Learning Defenses

Robust Aggregation: Using coordinate-wise median instead of mean to down-weight contributions from outliers or corrupted parties.
Verification and Cross-Validation: Updates are validated for systematic biases.
Cryptographic Protocols: Secure multiparty computation to reduce the adversary’s ability to steer the aggregated result.
Data Source Diversification: Ensuring that the effect of poisoning a single party is minimized by redundancy.

However, the $(k,p)$ -poisoning attack is specifically designed to avoid detection by standard anomaly detectors when $p$ is small.

Tool and Metadata Poisoning Defenses

Pre-execution Metadata Verification: Screening tool descriptions at registration, possibly using large LLMs or dedicated filters, though current safety alignment is largely ineffective.
Enhanced Alignment Strategies: Improving agents’ ability to distinguish between trustworthy and anomalous tool descriptions. However, analysis shows that agents rarely refuse to execute poisoned instructions (highest refusal rate observed was below $3\%$ across evaluated models).
Access Control at Registration: Enforcing stricter authentication and validation when MCP servers register tools for use by agents.

6. Relation to Other Poisoning and Tampering Models

Tool poisoning generalizes and subsumes several historical attack models:

Classical Data Poisoning: Manipulating individual examples, often with large, detectable changes.
$p$ -Tampering Attacks: Tampering each block or sample independently with probability $p$ , as in uniform-label flipping.
Coordinated, Correlated Attacks: The multi-party $(k,p)$ -model admits correlated tampering (e.g., all contributions from a single compromised party are jointly controlled and possibly context-aware).

These attacks distinguish themselves by working under tight resource or detection constraints, generalizing to protocol, context, and metadata channels, and providing rigorous, often worst-case, risk amplification guarantees. Furthermore, tool poisoning has become structurally important in the era of federated learning, retrieval-augmented generation, and tool-integrated agent systems.

7. Future Research and Open Questions

Constructing automated, adaptive attack generation mechanisms to continuously stress-test agent robustness at scale (Wang et al., 19 Aug 2025).
Developing decision-level guardrails such as those using attention-based Decision Dependence Graphs to track provenance and attribute tool poisoning sources, as in MindGuard. These mechanisms can expose attention-derived influence from unseen, poisoned tool descriptions (Wang et al., 28 Aug 2025).
Integrating classical control/data flow integrity into neural agent planning, bridging program dependence graphs and LLM reasoning structures (Wang et al., 28 Aug 2025).
Designing robust multi-party and agentic learning protocols that are provably resilient under the $(k,p)$ -poisoning or similar scenario-based threat models.

Model/Setting	Attack Surface	Stealth Requirement	Failure Amplification	Example Defense
$(k,p)$ -poisoning	Party data/updates	$p$ -close in TV	$\mu' \geq \mu^{1-pk/m}$	Robust aggregation; cryptography
Classic poisoning	Individual examples	Often none	Task-specific	Label checking; outlier removal
Tool metadata poisoning	Descriptions/registration	Natural NL, plausible	ASR up to $72.8\%$ (MCPTox)	Pre-execution registry filtering

Tool poisoning attacks, both in distributed learning and agentic tool-invocation pipelines, reveal a fundamental new vulnerability class, characterized by their subtlety, their ability to leverage protocol and metadata channels, and the inadequacy of existing detection and defense mechanisms. Their rigorous analytical treatment and empirical salience in actual multi-tool and multi-party deployments highlight the necessity for ongoing research in this domain.