AgentPoison Framework Analysis

Updated 11 November 2025
  • AgentPoison is a family of attack frameworks that formalizes training- and deployment-time poisoning vulnerabilities in agent-based systems.
  • It leverages constrained optimization to induce adversarial behaviors with minimal, stealthy modifications across environments, peer policies, and tool interfaces.
  • Empirical evaluations highlight its effectiveness in RL, multi-agent, and LLM settings, underscoring significant security risks and challenges in defense.

AgentPoison refers to a family of attack frameworks formalizing training- and deployment-time poisoning vulnerabilities in interactive agent-based systems, including classical reinforcement learning (RL), multi-agent RL, and contemporary LLM-based autonomous agents. These frameworks share an adversarial objective: induce a target agent to execute adversary-chosen behavior via carefully constructed, minimally invasive modifications to the agent’s environment, peer policies, knowledge base, or operational toolset. This article surveys the foundational principles, mathematical formulations, attack algorithms, and empirical evaluations of AgentPoison-style attacks, articulating their common structure and contextualizing their significance across modalities.

1. Formal Problem Structure and Attack Modalities

AgentPoison operationalizes poisoning attacks via the lens of optimization-driven adversarial design within environments characterized by agent-environment interaction.

  • Classical RL setting: The attacker perturbs the MDP primitives (rewards $R$ and transitions $P$) to force a learning agent into a prespecified target policy $\pi^\dagger$. The minimality and stealthiness of the perturbation are quantified by a user-chosen cost metric (e.g., an $\ell_p$ norm over per-$(s,a)$ deviations) (Rakhsha et al., 2020).
  • Two-agent RL: Extends the classical model by making the “environment” partially controllable through a peer agent’s policy, positing implicit poisoning via peer-policy modification rather than direct environment tampering (Mohammadi et al., 2023). Here, the attack space is the simplex of peer policies $\Delta(A_1)^S$, and the cost is measured as deviation from a default peer policy.
  • Modern LLM-centric agents: AgentPoison targets the agent’s retrieval-augmented knowledge base or memory (via backdoored key–value pairs and triggers) (Chen et al., 17 Jul 2024), or the set of external functions and tools used for perception, planning, or communication (via poisoned function libraries or tool metadata) (Long et al., 29 Sep 2025, Wang et al., 19 Aug 2025).

In all paradigms, the attacker’s objective is to induce a prescribed adversary-chosen behavior under precise minimal deviation and stealthiness constraints, without retraining the victim model or breaking system invariants.

2. Optimization-Based Formulation of AgentPoison Attacks

The defining feature of AgentPoison is its reduction of the attack design to a constrained optimization, for which formal guarantees and algorithmic methods can be derived.

2.1 RL Environment and Policy Poisoning

  • Single-agent RL: The adversary solves

$$\min_{\widehat R, \widehat P}\quad \mathrm{Cost}(\widehat R,\widehat P)$$

subject to the constraint that the target policy $\pi^\dagger$ is $\epsilon$-robustly optimal in the poisoned MDP $(S, A, \widehat R, \widehat P, \gamma)$. The key reduction is that, by the neighbor lemma, it suffices to check only $|S|(|A|-1)$ neighbor policies for optimality (Rakhsha et al., 2020).
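
For concreteness, the reward-only special case of this program can be written down directly: under a fixed target policy the value function is linear in the poisoned rewards, so the $\epsilon$-robust-optimality constraints are affine. The sketch below is a minimal illustration (not the authors' released code); it assumes a tabular MDP with known transitions and a deterministic target policy, and hands the resulting convex program to cvxpy.

```python
import numpy as np
import cvxpy as cp

def poison_rewards(R, P, pi_dagger, gamma=0.9, eps=0.1):
    """Reward-only poisoning sketch.

    R: (S, A) original rewards; P: (S, A, S) transitions;
    pi_dagger: (S,) integer array of target actions.
    """
    S, A = R.shape
    R_hat = cp.Variable((S, A))

    # Under the fixed policy pi_dagger, V is linear in the rewards:
    # V = (I - gamma * P_pi)^(-1) r_pi.
    P_pi = np.stack([P[s, pi_dagger[s]] for s in range(S)])        # (S, S) transitions under pi_dagger
    M = np.linalg.inv(np.eye(S) - gamma * P_pi)                    # resolvent matrix
    r_pi = cp.hstack([R_hat[s, pi_dagger[s]] for s in range(S)])   # poisoned rewards along pi_dagger
    V = M @ r_pi

    # epsilon-robust optimality: pi_dagger's action beats every other action by eps.
    constraints = []
    for s in range(S):
        q_target = R_hat[s, pi_dagger[s]] + gamma * (P[s, pi_dagger[s]] @ V)
        for a in range(A):
            if a != pi_dagger[s]:
                q_other = R_hat[s, a] + gamma * (P[s, a] @ V)
                constraints.append(q_target >= q_other + eps)

    # Minimize the per-(s,a) deviation from the true rewards (l2 cost).
    prob = cp.Problem(cp.Minimize(cp.norm(R_hat - R, "fro")), constraints)
    prob.solve()
    return R_hat.value
```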

  • Two-agent RL: The adversary selects a peer policy $\pi_1$ minimizing $\mathrm{Cost}(\pi_1, \pi_1^{\rm def})$ subject to the constraint that any $\epsilon$-optimal best response of the victim is the target policy $\pi_2^{\rm tar}$:

$$\min_{\pi_1\in\Delta(A_1)^S}\ \mathrm{Cost}(\pi_1,\pi_1^{\rm def}) \quad \text{s.t.}\quad \mathrm{BR}(\pi_1)\subseteq \Pi_2^\dagger(\pi_2^{\rm tar})$$

(P1 in (Mohammadi et al., 2023)). In deep-RL variants, this is relaxed via imitation cost and adversarial best response.

  • Retrieval/RAG-based LLM agents: AgentPoison formalizes trigger optimization as a constrained discrete problem over token sequences $x_t$, balancing uniqueness, compactness, target, and coherence losses (see Section 4.3) (Chen et al., 17 Jul 2024).

2.2 Tool/Function Library and Interface Poisoning

  • FuncPoison/MCPTox: The attacker modifies only the name/description fields of a small number of functions or tools, seeking to maximize the probability that an agent selects the adversarial tool or misuses a legitimate tool due to a poisoned instruction (Long et al., 29 Sep 2025, Wang et al., 19 Aug 2025). Selection is modeled as maximization of a similarity function $s(\text{prompt}, \text{desc}(a))$ over the available tools.
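
A toy sketch of this selection model is shown below; the `embed` encoder and the tool-dictionary layout are assumptions for illustration, not an API from the cited papers. The point is only that selection reduces to an argmax over description similarity, so a description padded with prompt-like command templates can raise $s(\text{prompt}, \text{desc}(a_\mathrm{adv}))$ without any change to executable code.

```python
import numpy as np

def select_tool(prompt, tools, embed):
    """Pick the tool whose description is most similar to the prompt.

    tools: list of dicts with 'name' and 'description' keys.
    embed: text -> unit-norm vector (assumed encoder), so the dot product
           below is cosine similarity, mirroring s(prompt, desc(a)).
    """
    q = embed(prompt)
    scores = [float(q @ embed(t["description"])) for t in tools]
    return tools[int(np.argmax(scores))], scores
```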

A schematic summary of attack spaces:

| Agent Paradigm | Poisoning Channel | Optimization Variable |
|---|---|---|
| RL | Rewards $R$, transitions $P$ | $\widehat R$, $\widehat P$ |
| 2-agent RL | Peer policy | $\pi_1$ |
| RAG/Memory | Key-value entries, trigger | Trigger $x_t$, poisoned entries |
| Tool/Function Library | Function descriptions | Description edits $\Delta$ (text snippets) |

3. Computational Hardness, Feasibility, and Bounds

AgentPoison analyses demonstrate that the mere existence of a successful minimal-cost attack is nontrivial:

  • NP-Hardness: Deciding whether an implicit peer-policy poisoning attack is feasible—that is, whether a policy exists so that only the adversary's target is a best response—is NP-hard in general. This is established by reduction from 3-SAT (Mohammadi et al., 2023).
  • Sufficient conditions for tractability: If MDP transitions are independent of the peer’s actions and the chain is ergodic under any peer policy, the constraint system reduces to convex programs that are solvable in polynomial time (Mohammadi et al., 2023).
  • Theoretical cost bounds: Tight lower and upper bounds are proven for the minimal deviation needed to effect a successful poisoning. For example, in two-agent RL:

$$\min_{\pi_1}\mathrm{Cost}(\pi_1, \pi_1^{\rm def}) \geq \frac{1-\gamma}{2}\, \frac{\|\bar\chi_0\|_\infty}{\|R_2\|_\infty} + \gamma\, \|V^{\pi_1^{\rm def}, \pi_2^{\rm tar}}\|_\infty$$

with matching upper bounds in ergodic, transition-independent cases (Mohammadi et al., 2023). In RL environment poisoning, similar norm-based trade-offs between reward- and transition-modification costs are formalized (Rakhsha et al., 2020).

This complexity landscape suggests that, in realistic settings, effective attacks may often be computationally challenging to design, especially if only limited model information or leverage over the system is available.

4. Algorithmic Realizations and Attack Pipelines

AgentPoison frameworks instantiate their optimization with tractable algorithms, ranging from convex programming to beam search and gradient-based token flips.

4.1. RL Environments

  • Tabular: Convex or sequential convex programs enforcing neighbor optimality, with attack cost measured in an $\ell_p$ norm.
  • Planning (offline): Dynamic programming or value iteration on the poisoned MDP, with constraint-enforced variable updates.
  • Online: Precompute the poisoned MDP; during victim learning, serve rewards and next states sampled from $\widehat R$ and $\widehat P$ (Rakhsha et al., 2020).
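
Under these assumptions the online serving step itself is tiny; the sketch below is illustrative (with a hypothetical interface) and assumes the poisoned tables $\widehat R$, $\widehat P$ were precomputed offline and are simply substituted into the victim's experience stream.

```python
import numpy as np

def poisoned_step(state, action, R_hat, P_hat, rng=None):
    """Serve one interaction from the poisoned MDP instead of the true one.

    R_hat: (S, A) poisoned rewards; P_hat: (S, A, S) poisoned transitions.
    The victim's learning algorithm is untouched; only its experience changes.
    """
    if rng is None:
        rng = np.random.default_rng()
    reward = R_hat[state, action]
    next_state = rng.choice(P_hat.shape[-1], p=P_hat[state, action])
    return reward, int(next_state)
```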

4.2. Two-Agent RL

  • Model-based (Conservative Policy Search, CPS): Iterative convex optimization, tightening trust-region constraints and maintaining peer policy iterates (Mohammadi et al., 2023).
  • Model-free (Alternating Policy Updates, APU): Alternating gradient-based updates for both attacker and victim policies using actor-critic/PPO objectives.
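
A schematic skeleton of the APU loop is given below; the `attacker`, `victim`, and `env` interfaces are placeholders rather than the paper's implementation, which uses actor-critic/PPO objectives for both gradient steps.

```python
def alternating_policy_updates(attacker, victim, env, n_rounds,
                               k_victim_steps, k_attacker_steps):
    """Schematic APU loop: alternate gradient updates of peer and victim policies.

    attacker.gradient_step is assumed to minimize an imitation cost to the default
    peer policy plus the attack objective (steering the victim's best response
    toward the target policy); victim.gradient_step is its ordinary RL update.
    """
    for _ in range(n_rounds):
        for _ in range(k_victim_steps):        # victim adapts: approximate best response
            batch = env.rollout(attacker.policy, victim.policy)
            victim.gradient_step(batch)
        for _ in range(k_attacker_steps):      # attacker nudges the peer policy
            batch = env.rollout(attacker.policy, victim.policy)
            attacker.gradient_step(batch)
    return attacker.policy
```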

4.3. Memory and RAG-based Agents

  • Trigger optimization: Perform beam search over token sequences, using “HotFlip”-style gradients to minimize a combination of losses:
    • Uniqueness loss ($\mathcal{L}_{\mathrm{uni}}$): ensures triggered queries form a cluster far from benign keys.
    • Compactness loss ($\mathcal{L}_{\mathrm{cpt}}$): keeps the induced embeddings tightly clustered.
    • Target loss ($\mathcal{L}_{\mathrm{tar}}$): steers the LLM toward the adversarial output when it is conditioned only on retrieved poisoned examples.
    • Coherence loss ($\mathcal{L}_{\mathrm{coh}}$): a perplexity constraint to evade detection.
    • The final optimization is a constrained discrete minimization:

$$\underset{x_t}{\mathrm{minimize}}\ \ \mathcal{L}_{\mathrm{uni}}(x_t) + \lambda\, \mathcal{L}_{\mathrm{cpt}}(x_t) \quad \mathrm{s.t.}\ \ \mathcal{L}_{\mathrm{tar}}(x_t) \leq \eta_{\mathrm{tar}},\ \ \mathcal{L}_{\mathrm{coh}}(x_t) \leq \eta_{\mathrm{coh}}$$

(Chen et al., 17 Jul 2024).
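
The gradient-guided flip step at the core of this beam search can be sketched as follows; the `loss_fn` callable stands in for the combined uniqueness/compactness objective and is an assumption of this illustration, not the released AgentPoison code. Each outer beam-search iteration would then evaluate the proposed replacements exactly, keep the best beams, and discard candidates that violate the target and coherence thresholds.

```python
import torch

def hotflip_candidates(trigger_ids, embedding_matrix, loss_fn, k=10):
    """Propose top-k single-token replacements per trigger position.

    trigger_ids: (T,) long tensor of current trigger token ids.
    embedding_matrix: (V, d) token-embedding table.
    loss_fn: callable mapping trigger embeddings (T, d) to the combined scalar
             loss (uniqueness + lambda * compactness in the formulation above).
    """
    E = embedding_matrix.detach()
    emb = E[trigger_ids].clone().requires_grad_(True)   # (T, d) current trigger embeddings
    loss_fn(emb).backward()
    grad = emb.grad                                      # (T, d) dLoss/dEmbedding

    with torch.no_grad():
        # First-order estimate of the loss change when position i is swapped to
        # token v: (e_v - e_{t_i}) . grad_i; more negative = larger predicted drop.
        swap_scores = E @ grad.T - (E[trigger_ids] * grad).sum(-1)    # (V, T)
        return swap_scores.topk(k, dim=0, largest=False).indices      # (k, T) candidate ids
```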

4.4. Tool/Function Library Poisoning

  • FuncPoison-style: The attacker crafts a malicious function whose description minimally perturbs the original text but embeds command-format examples or template snippets that create an in-context bias in similarity scoring. Hijacking occurs as the agent matches input prompts to descriptions, making $a_\mathrm{adv}$ highly likely to be selected (Long et al., 29 Sep 2025).
  • MCPTox: Systematically generates attack templates across paradigms (Explicit/Implicit Trigger, Parameter Tampering) and applies them to a broad array of real-world MCP servers (Wang et al., 19 Aug 2025).

5. Empirical Evaluation and Quantitative Results

AgentPoison frameworks conduct extensive empirical validation across RL environments, LLM-based agents, and toolchains, employing domain-relevant metrics.

5.1 RL-Based Attacks

  • Single-agent RL (offline/online): Joint reward/transition poisoning reduces attack cost relative to reward-only attacks, with empirical cost tightly matching theoretical bounds (Rakhsha et al., 2020).
  • Two-agent RL: Conservative Policy Search (CPS) finds feasible peer policies at far lower cost than baselines; attack cost increases sharply as the allowed victim suboptimality ($\epsilon$) decreases or as attacker control is limited. Alternating Policy Updates (APU) in continuous spaces outperform symmetric or naive attacks in forcing policy deviation at minimal peer-policy cost (Mohammadi et al., 2023).

5.2 Memory/RAG Attacks

  • AgentPoison on LLM agents: Achieves a retrieval attack success rate of approximately 81.2% and an end-to-end attack success rate of approximately 62.6%, with a benign-accuracy drop under 1% and a poison ratio below 0.1% (Chen et al., 17 Jul 2024). Even a single-token trigger often suffices for a high ASR.

5.3 Tool/Function Attack Studies

  • FuncPoison: Pushes average trajectory error in autonomous driving above 10 m (L2(3s) = 10.52 m, CollRate = 4.58%, ASR@3 = 86.3%) in AgentDriver; outperforms prior prompt/data-level poisoning by 10–20% in ASR, with attack persistence sustained even for thresholds $\delta \approx 8$ m (Long et al., 29 Sep 2025). Defenses (prompt sanitizers, “memory vaccines”) marginally reduce ASR but leave systems fundamentally vulnerable.
  • MCPTox: Across 20 LLM agents, the mean ASR is 36.5%, with certain models suffering ASR above 70%. Paradigm $P_3$ (parameter tampering) is most effective. Higher model scale and reasoning capabilities correlate with higher vulnerability (“inverse scaling”) (Wang et al., 19 Aug 2025).

6. Implications, Limitations, and Defenses

AgentPoison exposes inherent attack surfaces common to all agent architectures that rely on mutable auxiliary data (memories, tools) or peer-influenced dynamics. Its findings indicate:

  • Stealth and minimality: Attacks need only minuscule perturbations or a handful of poisoned entries/tool descriptions to reach high ASR, typically without noticeable loss in benign performance or triggering naïve perplexity thresholds.
  • Transferability: Triggers optimized on one embedder typically transfer well to unseen or even black-box retrieval models.
  • Significance for system security: Supply-chain and interface-based attacks (e.g., poisoned function libraries) evade all prompt-level or fine-tuning-based defenses, yielding risk even without any model retraining or code modification.
  • Defensive directions: Suggested countermeasures include cryptographic signing of tools, semantic static checks of resource descriptions, cross-agent consensus checks, certified robust retrieval mechanisms, and anomaly-spotting of overly “tight” clusters in the embedding space (Chen et al., 17 Jul 2024, Long et al., 29 Sep 2025). However, no evaluated defense currently provides strong provable guarantees in the high-impact settings tested.
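
As an illustration of the last point, a crude version of the cluster-anomaly check might look like the sketch below (a heuristic assumed for exposition, not a defense evaluated in the cited papers): flag knowledge-base keys whose nearest-neighbor distances are anomalously small relative to the rest of the corpus, since optimized triggers are designed to form unusually compact, isolated clusters.

```python
import numpy as np

def flag_tight_clusters(keys, k=5, z_thresh=-3.0):
    """Flag embeddings whose mean k-nearest-neighbor distance is anomalously
    small relative to the corpus, a crude proxy for the 'unique, compact'
    clusters that optimized triggers create.

    keys: (N, d) array of knowledge-base key embeddings.
    Returns indices of suspicious entries. (O(N^2) memory; fine for a sketch.)
    """
    dists = np.linalg.norm(keys[:, None, :] - keys[None, :, :], axis=-1)   # (N, N) pairwise
    np.fill_diagonal(dists, np.inf)                                         # exclude self-distance
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)                   # mean k-NN distance
    z = (knn_mean - knn_mean.mean()) / (knn_mean.std() + 1e-12)
    return np.where(z < z_thresh)[0]                                        # unusually dense neighborhoods
```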

A plausible implication is that as agent systems become more modular, compositional, and collaborative on flexible knowledge and tool stores, the attack surface for AgentPoison-style threats will further broaden, mandating both principled formal analysis and comprehensive toolchain hardening.
