Papers
Topics
Authors
Recent
2000 character limit reached

Tool Poisoning in Agentic Systems

Updated 13 December 2025
  • Tool poisoning is the injection of malicious data into tool descriptors to subvert system behavior and execute unauthorized actions.
  • Recent studies show high attack success rates, with minimal poisoning causing unsafe executions and vulnerabilities in LLM-based systems.
  • Defense strategies combine cryptographic signing, semantic vetting, and real-time provenance tracking, although comprehensive mitigation remains challenging.

Tool poisoning refers to the exploitation of vulnerabilities in systems that allow execution or dynamic invocation of external tools, typically via natural-LLMs, clustering algorithms, agentic systems, or code completion frameworks. By injecting malicious data, code, or instructions into tool descriptors, schemas, or model training corpora, an attacker manipulates the behavior, reasoning, or output of the host system to achieve unauthorized actions or subvert intended workflows. Recent studies reveal that even minimal poisoning can induce high rates of unsafe executions, information leaks, denial of service (DoS), or remote code execution (RCE), while evading most static or behavioral defenses.

1. Formal Definitions and Threat Models

The core manifestation of tool poisoning is the “Tool Poisoning Attack” (TPA) within environments following the Model Context Protocol (MCP). Here, the agent (typically an LLM-based host) registers external tools via metadata (name, parameters, description) supplied by trusted or adversary-controlled MCP servers. The attack occurs when malicious tool metadata is injected such that hidden adversarial instructions in, e.g., desc_p of tool T_p influence agent reasoning or prompt context. The actual execution may invoke a legitimate (high-privilege) tool t_m, with arguments p_m that fulfill the attacker’s objective, even though benign user queries Q are issued (Wang et al., 19 Aug 2025, Jamshidi et al., 6 Dec 2025).

Formally, let the set of tools be C={T1,,Tp,,TN}C = \{T_1,\ldots,T_p,\ldots,T_N\} registered in the prompt context. For a poisoned descriptor, dadv=(name,schema,descbδ)d_{\text{adv}} = (\text{name}, \text{schema}, \text{desc}_b \| \delta), where δ\delta encodes adversarial instructions. The attack aims to maximize Psucc=E(unsafe)/E(total)P_{\text{succ}} = E_{(\text{unsafe})}/E_{(\text{total})}, i.e., the probability that unsafe invocations occur when the agent selects and executes tools under poisoning conditions (Jamshidi et al., 6 Dec 2025). In code-generation systems, analogous attacks manipulate model training data to implant backdoors that trigger on specific code contexts (Aghakhani et al., 2023, Schuster et al., 2020).

2. Tool Poisoning Attacks in LLM Agentic Systems

In LLM-based agentic systems, tool invocation is governed by Tool Invocation Prompts (TIPs), which define constraints, schemas, security checks, and functional descriptions. TIP poisoning (“TIP hijacking”) manipulates system/contextual prompt fragments to coerce agent behaviors, bypass user approvals, or disrupt agent availability (Liu et al., 6 Sep 2025). TIPs are distributed across psystemp_{\text{system}} (tool descriptions and schemas) and pcontextualp_{\text{contextual}} (environmental state and previous returns), and attacks exploit the direct injection of malicious payloads into these prompt elements.

Three phases characterize a typical TIP exploitation workflow (TEW): (1) prompt stealing (extracting execution context via benign tool queries), (2) vulnerability analysis (deducing schema flaws, e.g., poor enforcement or unsafe initialization routines), and (3) hijacking (untargeted DoS via malformed outputs, or targeted RCE via embedded shell commands and multi-channel reinforcement). Case studies with systems such as Cursor and Claude Code show attackers achieving full remote shell via exploited tool prompts, even under alignment or secondary guard models (Liu et al., 6 Sep 2025).

MCPTox benchmarks this attack class using a suite of 1312 malicious test cases across 45 real-world MCP servers and 353 genuine tools, quantifying agent susceptibility by Attack Success Rate (ASR) and Refusal Ratio (RR). Observed ASRs reach up to 72.8% (o1-mini), indicating severe agent vulnerability; refusal rates are universally below 3%, demonstrating the ineffectiveness of current safety alignment (Wang et al., 19 Aug 2025).

3. Tool Poisoning in Code-Suggestion and Autocompletion Models

Tool poisoning applies directly to neural code suggestion and autocompletion systems. Adversarial samples—either inserted into training corpora (“data poisoning”) or fine-tuned via model weights (“model poisoning”)—yield completion behaviors favoring insecure API usage (e.g., AES.MODE_ECB, SSLv3, or low-iteration password hashing). Attacks may be untargeted (affecting all code) or targeted (restricted to specific repositories or author features). Targeted attacks rely on mining textual signals (module names, comments) to condition the model on features FF and optimize Pmodel(bc+δ)P_\text{model}(b|c+\delta), subject to syntactic validity (Schuster et al., 2020, Aghakhani et al., 2023).

Recent attack variants such as “Covert” and “TrojanPuzzle” evade signature-based dataset cleansing by burying or masking payload content in docstrings rather than code bodies, and by never revealing suspicious tokens in the poisoning data (TrojanPuzzle). These techniques break conventional static, signature, or near-duplicate filters, leading to high attack@k rates (e.g., up to 56.9% success at k=50 for simple and covert methods). Model health metrics remain unaffected, highlighting the stealth of the approach (Aghakhani et al., 2023).

4. Detection, Attribution, and Defense Mechanisms

Modern defenses against tool poisoning are insufficient when confronting semantic, meta-level, and stealthy payload manipulations. Behavioral defenses are ineffective since attacks can induce unauthorized tool actions without ever invoking the poisoned tool itself (Wang et al., 28 Aug 2025). MindGuard introduces the Decision Dependence Graph (DDG), an attention-weighted, directed graph modeling the influence of MCP tool metadata, user queries, results, and invocation decisions:

G=(V,E,w)\mathcal{G} = (\mathcal{V}, \mathcal{E}, w)

where w(vs,vt)w(v_s, v_t) computes Total Attention Energy from source to target vertices. Anomaly detection is performed via thresholds on the Anomaly Influence Ratio (AIR):

αs,t=w(vs,vt)w(vu,vt)+w(vsc,vt)\alpha_{s,t} = \frac{w(v_s, v_t)}{w(v_u, v_t) + w(v^c_s, v_t)}

Poisoned invocations are flagged if αs,t>τ\alpha_{s,t} > \tau. Empirical precision and attribution reach 94–99% and 95–100%, respectively, with minimal latency and no token overhead (Wang et al., 28 Aug 2025).

Layered defense frameworks combine descriptor immutability via RSA manifest signing, semantic vetting by LLM-on-LLM verification (embedding drift analysis), and lightweight runtime guardrails (entropy and keyword triggers, invocation frequency anomaly detection) (Jamshidi et al., 6 Dec 2025). RSA guarantees prevent post-approval Descriptor Tampering (“Rug Pulls”), while semantic vetting blocks subtle manipulations missed by static analysis. Runtime heuristics catch crude payloads immediately.

In code models, activation clustering, spectral signature analysis, and fine-pruning can partially mitigate poisoning at the cost of increased false positives or degraded utility (top-1 and top-5 accuracy drops up to 6.9%) (Schuster et al., 2020, Aghakhani et al., 2023). None of these measures yield comprehensive protection, especially against targeted or Trojan-class attacks.

5. Attack Paradigms, Taxonomy, and Empirical Benchmarks

The MCPTox taxonomy spans explicit triggers hijacking specific function calls, implicit background-process hijacking, and global parameter tampering. Each attack template specifies a trigger condition, a malicious action, and plausible justification to enhance stealth. Ten risk categories such as Privacy Leakage and Privilege Escalation are covered (Wang et al., 19 Aug 2025). Table summaries from MCPTox and subsequent LLM evaluations establish typical ASRs ranging from 38% (Claude-3.7-Sonnet) up to 72.8% (o1-mini).

Attack Paradigm Example ASR Range (%)
Explicit Function Hijacking (P1) Exfil SSH keys 43–47
Implicit Function Hijacking (P2) File read on trigger 42–65
Parameter Tampering (P3) Email redirection 47–90

Inverse-scaling effects are observed: more capable models with improved instruction-following are more vulnerable, especially under chain-of-thought reasoning, which increases ASR by up to 27.8 points (Wang et al., 19 Aug 2025). Failure analysis shows rare refusals, with the majority of failures being non-exploitative or direct execution of the poisoned tool.

6. Mitigation Strategies and Future Research Directions

Robust defense against tool poisoning requires multi-layered security mechanisms. Best practices include:

  1. Cryptographic signing of tool descriptors at the registry and enforcing provenance via hardware security modules.
  2. Independent, LLM-powered semantic vetting of tool metadata prior to registration and enforcement.
  3. Runtime heuristics monitoring invocation frequency, entropy, and blacklisted tokens.
  4. Privilege separation to prevent unauthorized cross-tool calls and enforcement of formal grammars for schemas.
  5. Real-time provenance tracking using DDG or analogous program dependence graphs, enabling attribution and anomaly detection with minimal performance impact (Wang et al., 28 Aug 2025, Jamshidi et al., 6 Dec 2025).

Open challenges include detecting sleeper-style poison over multi-turn dialogues, synthesizing adaptive attack templates, integrating boundary-aware language reasoning to distinguish prescriptive from descriptive prompt components, and developing formal certification methods for metadata and code-generation security (Aghakhani et al., 2023, Wang et al., 19 Aug 2025).

Tool poisoning is conceptually related to input poisoning attacks in behavioral malware clustering (Biggio et al., 2018), where adversaries manipulate dataset samples to subvert clustering algorithms used in malware analysis. In all instances, poisoning a small subset of inputs or metadata yields disproportionate compromise of system integrity, either via algorithmic reasoning shifts, completion bias, or semantic manipulation. The broad susceptibility of agentic, code-suggestion, and clustering models to tool poisoning underscores the need for protocol-level security and semantic provenance analysis as standard in real-world deployments (Wang et al., 19 Aug 2025, Jamshidi et al., 6 Dec 2025).

Whiteboard

Follow Topic

Get notified by email when new papers are published related to Tool Poisoning.