
Self-Mutating Poisoning (SMP)

Updated 4 July 2026
  • Self-Mutating Poisoning (SMP) is a lifecycle-based attack where an initially benign execution secretly alters persistent agent skills to enable deferred harm.
  • The attack uses mechanisms like exit-triggered callbacks and file mutations, ensuring that the malicious behavior activates only during later task executions.
  • Empirical evaluations demonstrate high attack success rates via persistence exploitation, highlighting the need for advanced detection and robust defense strategies.

Self-Mutating Poisoning (SMP) is a lifecycle-aware poisoning pattern in which an initially benign execution silently modifies a persistent agent artifact, and the harmful effect is deferred until a later task or session reuses the now-compromised state. In the current paper set, the term is formalized most directly for third-party agent skills: the first execution looks routine, mutates persistent skill content, and only subsequent reuse realizes the attack (Ning et al., 1 Jun 2026). Closely related work extends the same structural logic to persistent memory systems, reflective self-evolving agents, and transformation-robust data poisoning, suggesting that SMP is best understood not as a single payload type but as a family of delayed, persistence-mediated compromises (Sharma, 10 Jun 2026).

1. Definition, scope, and contrast with one-shot poisoning

In "SkillHarm" (Ning et al., 1 Jun 2026), SMP is defined against the skill-use lifecycle. A poisoned skill does not immediately execute its final harmful behavior when first used. Instead, an initial execution appears benign and performs a silent mutation of persistent skill content; the actual harmful effect is deferred and materializes only when a later task or session reuses the modified skill. The paper contrasts this with Fixed-Payload Poisoning (FPP), in which the poisoned package already contains an immediately activatable payload.

| Scenario | First-stage behavior | Harm realization |
|---|---|---|
| FPP | Fixed poisoned skill package directly compromises any task session that invokes it | Same task execution |
| SMP | Initial task executes harmlessly but silently mutates persistent skill content | Subsequent task or session reuse |

This distinction is not merely temporal. In SMP, the attack surface includes persistence semantics, skill reuse, and the trust the agent places in artifacts it previously loaded or modified. The required substrate is therefore broader than a prompt-only exploit: the skill package snapshot must survive across sessions, the agent must be able to write into skill files or directories, and later tasks must reload the mutated content. In the concrete SkillHarm setup, the mutation is designed to survive in framework-specific directories such as .claude/skills, .codex/skills, and .agents/skills (Ning et al., 1 Jun 2026).

Within the paper set considered here, the explicit named use of SMP appears in SkillHarm. A broader interpretation, supported by adjacent papers, treats SMP as a general regime in which malicious influence persists by mutating trusted artifacts—skills, memory entries, summaries, or self-authored rules—rather than remaining a static visible payload.

2. Lifecycle model and attack mechanics

The canonical SkillHarm SMP workflow is organized around a task pair $(T_A, T_B)$ that reuses at least one shared skill. Task $T_A$ is a routine-use task in which a mutation hook can plausibly run; task $T_B$ is a later task whose reuse of the modified skill causes a security-relevant end state. The attack designer identifies a helper script normally executed in $T_A$ and plants a deferred-mutation mechanism there. The principal implementation described is an exit-triggered callback, specifically an atexit callback, added to that helper script. During the visible execution of $T_A$, the script performs its ordinary behavior; when the script exits, the callback rewrites sibling files in the same skill package so that the modified content is loaded during $T_B$ (Ning et al., 1 Jun 2026).

The persistent carrier is not restricted to one file type. SkillHarm states that the callback typically rewrites SKILL.md or another script, but the threat model covers SKILL.md, reference documents, executable scripts, and auxiliary resources. Appendix examples further show staged artifacts written into hidden skill subdirectories such as scripts/.preflight_cache/, allowing the mutation to survive inter-container transfer under the appearance of a legitimate cache. The operational state transition is therefore:

  1. install a seemingly normal skill;
  2. execute $T_A$, which appears harmless;
  3. silently mutate persistent skill content at exit;
  4. carry forward the mutated skill snapshot;
  5. execute $T_B$, which now loads the poisoned content;
  6. observe harm after $T_B$.
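
From a defender's point of view, the state transition above implies that the skill package on disk differs before and after the first task. The following is a minimal sketch of one way to surface that difference, assuming the defender can hash the skill directory at both points. The function names `snapshot` and `diff_snapshots` are hypothetical and are not part of SkillHarm or of any agent framework.

```python
import hashlib
import os
from pathlib import Path


def snapshot(skill_dir: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes.

    os.walk also descends into dot-directories, so hidden subdirectories
    are included in the manifest.
    """
    manifest = {}
    for root, _dirs, files in os.walk(skill_dir):
        for name in files:
            path = Path(root) / name
            rel = path.relative_to(skill_dir).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(p for p in shared if before[p] != after[p]),
    }
```

Taking one snapshot at install time and another after each task would flag a rewritten SKILL.md as `modified` and a newly staged hidden file as `added`. A diff of this kind only shows that something changed. Deciding whether a change is malicious is a separate problem.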

Delayed activation is central to stealth. The first-stage trigger is ordinary execution of the shared helper script during $T_A$; the mutation itself is postponed until script exit, minimizing disruption to the visible task outcome. The second-stage trigger is later reuse of the compromised skill by $T_B$. SkillHarm also reports optional content-based gates, such as activation only when later task inputs have a particular structure, which narrows execution conditions further and makes the preparation step look less suspicious (Ning et al., 1 Jun 2026).
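
Because the first-stage mechanism described here is an exit-triggered callback, one naive review aid is to list where a helper script references `atexit.register`. The sketch below uses Python's standard `ast` module. It is a heuristic under strong assumptions, not the scanner evaluated in SkillHarm: it only matches the literal form `atexit.register` (as a call or a decorator) and misses aliased imports such as `from atexit import register`. The function name `find_exit_hooks` is hypothetical.

```python
import ast


def find_exit_hooks(source: str) -> list[int]:
    """Return line numbers where `atexit.register` is referenced."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        # Matches the attribute access `atexit.register`, whether it is
        # called directly or used as a decorator.
        if (
            isinstance(node, ast.Attribute)
            and node.attr == "register"
            and isinstance(node.value, ast.Name)
            and node.value.id == "atexit"
        ):
            hits.add(node.lineno)
    return sorted(hits)
```

A hit is not evidence of an attack, since exit hooks have many legitimate uses. It marks a location worth reading when reviewing a third-party skill.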

A closely related but distinct mutation mechanism appears in reflective memory agents. In "OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences" (Wang et al., 18 May 2026), the attacker injects a locally correct edge-case experience that passes validation, and the agent’s own reflection module transforms it into a generalized rule according to

TAT_A1

Here the mutation is not file rewriting but semantic consolidation: the externally supplied seed is rewritten into a self-authored, high-priority memory rule.

3. Benchmark construction and empirical performance

SkillHarm operationalizes SMP at scale through AutoSkillHarm, a three-stage automated construction pipeline consisting of target selection, attack design, and quality filtering. For SMP, the selection unit is a shared-skill task pair $(T_A, T_B)$ rather than a single injection point. Appendix details report three selector agents (Codex GPT-5.2, Codex GPT-5.4, and Claude Code Opus 4.6) proposing candidate pairs from SkillsBench and scoring them by feasibility and impact. The final benchmark contains 879 attack samples across 71 skills, of which 192 are SMP samples; those SMP instances cover 6 distinct skills and 15 unique (task-pair, poisoned-skill) combinations (Ning et al., 1 Jun 2026).

Evaluation uses a true multi-execution protocol. For each SMP sample, the evaluator runs $T_A$, snapshots the resulting skill package, transfers that snapshot into the $T_B$ environment, runs $T_B$, and checks whether the intended harmful end state holds. The primary metric is Attack Success Rate (ASR), supplemented by Conditional ASR (cASR), which conditions on actual engagement with the poisoned file, and Attack Refusal Rate (ARR), which measures explicit recognition and refusal.
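
The three metrics can be sketched as simple ratios over per-sample outcomes. The record layout below (`success`, `engaged`, `refused`) is a hypothetical simplification of whatever the benchmark actually logs, and the sketch follows the verbal definitions in this section rather than any published formula.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    success: bool  # harmful end state observed after the second task
    engaged: bool  # agent actually engaged with the poisoned file
    refused: bool  # agent explicitly recognized and refused the attack


def rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Compute ASR, cASR, and ARR from per-sample outcomes."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    total = len(outcomes)
    engaged = [o for o in outcomes if o.engaged]
    return {
        # Successes over all samples.
        "ASR": sum(o.success for o in outcomes) / total,
        # Successes over only those samples where the agent engaged.
        "cASR": sum(o.success for o in engaged) / len(engaged) if engaged else 0.0,
        # Explicit refusals over all samples.
        "ARR": sum(o.refused for o in outcomes) / total,
    }
```

Because cASR uses a smaller denominator than ASR, it can be much higher than ASR when the agent rarely touches the poisoned file.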

| Agent-harness configuration | ASR | cASR | ARR |
|---|---|---|---|
| Claude Code Sonnet 4.6 | 51.6% | 70.2% | 2.1% |
| Claude Code Opus 4.7 | 9.4% | 41.5% | 14.1% |
| Codex GPT-5.4 | 69.3% | 77.4% | 1.0% |
| Codex GPT-5.5 | 65.6% | 72.7% | 3.1% |
| Gemini CLI Gemini 3 Flash | 45.8% | 61.4% | 0.5% |
| OpenCode Qwen-3.6 27B | 51.6% | 65.0% | 0.0% |

The headline SMP result is 69.3% ASR for Codex with GPT-5.4, despite the requirement that both the mutation stage and the later activation stage succeed. The cASR gap is analytically important. Claude Code Opus 4.7 has only 9.4% unconditioned ASR, but 41.5% cASR, which the paper interprets as evidence that many apparent failures arise because the agent never engaged with the poisoned file rather than because it safely resisted the attack. The Appendix dynamic-risk breakdown further reports, for GPT-5.4 under SMP, category-average ASR of 61.7% for data-pipeline exploitation, 69.9% for system-environment exploitation, and 76.2% for agent-autonomy exploitation, with individual risks reaching 87.5% for proxy attack (Ning et al., 1 Jun 2026).
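
The ASR and cASR figures can be turned into a rough implied engagement rate. The sketch assumes that every successful attack involved engagement with the poisoned file, so that ASR equals the engagement rate multiplied by cASR. That assumption is not stated in the source, and the inputs are rounded percentages, so the results are only indicative.

```python
def implied_engagement(asr: float, casr: float) -> float:
    """Engagement rate implied by ASR = engagement_rate * cASR (an assumption)."""
    if casr <= 0:
        raise ValueError("cASR must be positive")
    return asr / casr


# Reported SMP figures for two configurations, as fractions.
reported = {
    "Claude Code Opus 4.7": (0.094, 0.415),
    "Codex GPT-5.4": (0.693, 0.774),
}

for name, (asr, casr) in reported.items():
    print(f"{name}: implied engagement {implied_engagement(asr, casr):.1%}")
```

Under this assumption, Opus 4.7 engaged with the poisoned file in about 23% of samples and GPT-5.4 in about 90%. This matches the reading that low unconditioned ASR can reflect non-engagement rather than resistance.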

These results matter because SMP is evaluated under stricter semantics than one-shot poisoning. Success requires silent state mutation, persistence across the session boundary, later re-engagement with the mutated artifact, and eventual realization of a task-specific harmful end state.

4. Related persistence-mediated mechanisms

Several adjacent papers do not use the term SMP explicitly but describe mechanisms that are structurally close to it.

"OEP" shows a low-privilege black-box attack on memory-augmented, self-evolving agents. The attacker cannot modify the system prompt, weights, or memory database directly; instead, it supplies adversarial experiences that are locally correct, semantically plausible, and paired with severe but plausible hypothetical consequences. During reflection, the agent over-trusts its own summaries and distills these edge cases into high-priority but over-generalized rules. For GPT-4o, OEP reports ESR 77.43 / ASR 59.14 in Math, 68.29 / 52.00 in Med, and 85.09 / 71.93 in Tool use, and the Tool domain retains 72% ASR even after 50 subsequent benign queries (Wang et al., 18 May 2026). This suggests a one-step self-mutation regime in which the agent manufactures the durable poisoned rule internally.

"SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems" formalizes Multi-Session Memory Poisoning (MSMP) rather than SMP, but the overlap is substantial. The attack surface is persistent append-only memory shared across sessions; the adversary can inject crafted memories that, once retrieved, redirect future behavior without changing weights or code. The paper’s most SMP-relevant experiment is the query-only attack in which “the agent itself writes the resulting trace to memory via its signed write path.” In that setting, poison reaches the retrieval pool in 100% of scenarios, the undefended success rate is 65.3%, and the defended SMSR configuration reduces it to 5.3% over 150 runs (Sharma, 10 Jun 2026). The paper is explicit, however, that its formal model is still a bounded-entry contamination model, not a full account of self-rewriting or self-propagating poison.

"From Untrusted Input to Trusted Memory" systematizes long-term memory poisoning around four write channels and nine structural vulnerabilities. Its most SMP-relevant mechanism is V-S5: Self-Improvement as Amplification, which states that once a poisoned skill is written, the self-improvement loop reinforces it, later revisions build on the poisoned baseline, and the skill can evolve into a well-optimized adversarial procedure across sessions. MPBench contains 3,240 adversarial test cases and 2,997 benign examples; across agents, average ASR = 50.46% and RSR = 41.05%, with HERMES showing 58.33 / 61.67 ASR/RSR for Skill-Procedure Insertion (Dash et al., 3 Jun 2026). This is not named SMP, but it directly supports a longitudinal view in which persistent poison is rewritten and behaviorally reinforced.

"Phantom Transfer" is not self-mutating in the literal sense, yet it provides strong evidence for mutation-resilient poisoning at the data level. The authors state that even if one knows precisely how the poison was placed in the dataset, one cannot filter it out, and that “even fully paraphrasing every sample in the dataset using a different model does not stop the attack.” For the UK target on GPT-4.1, the introduction reports about 44% poisoned, 56% after oracle judge, and 44% after paraphrasing, versus 0% clean (Draganov et al., 3 Feb 2026). This suggests that some poisoning effects persist under rewriting and cross-model translation, a property highly relevant to SMP even though the carrier is a training corpus rather than a persistent skill or memory artifact.

5. Defenses, mitigation strategies, and their limits

SkillHarm evaluates two defense families, skill scanners and defensive system prompts (DSPs), and finds neither sufficient for SMP. For scanner-based detection, Cisco’s Skill Scanner and Snyk’s Agent Scan report SMP detection rates of 5.2% and 9.9%, respectively, while Skill Scanner with an Opus 4.7 backbone reaches 68.8%. Even this strongest configuration still misses 31.2% of SMP attacks, roughly a third. DSPs are also brittle: on a risk-stratified high-ASR subset, nearly all evaluated configurations remain above 70% ASR in both FPP and SMP scenarios (Ning et al., 1 Jun 2026).

The closest certified defense in the paper set is SMSR for persistent-memory poisoning. It combines provenance at write time with randomized ablation at retrieval time. Component 1 signs every legitimate memory write using

TAT_A6

and admits only verified entries into retrieval. Component 2 retrieves a top-TAT_A7 pool, repeatedly samples size-TAT_A8 subsets, and aggregates verdicts rather than exact strings. The resulting certificate upper-bounds malicious-majority probability by

TAT_A9

with

TBT_B0

Empirically, Component 1 reduces unsigned-injection ASR from 93–100% to 0% across all unsigned variants, while for an authenticated adversary with a single injection in the 20-seed production-scale setting, the combined defense holds ASR to 8.0% with a certificate of 10.4% (Sharma, 10 Jun 2026). The limitation is explicit: as authenticated poisoned entries proliferate, the certificate degrades rapidly.
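
The two components can be sketched in a few lines. This is an illustration of the general pattern only. The exact signing construction and aggregation rule belong to the SMSR paper and are not reconstructed here. The sketch assumes HMAC-SHA256 tags for write provenance and a plain majority vote over uniformly sampled subsets, and the `judge` callable is a hypothetical stand-in for the agent's verdict on a set of retrieved memories.

```python
import hashlib
import hmac
import random
from collections import Counter
from typing import Callable


def sign(key: bytes, entry: str) -> str:
    """Tag a memory entry at write time (assumed scheme: HMAC-SHA256)."""
    return hmac.new(key, entry.encode("utf-8"), hashlib.sha256).hexdigest()


def verified(key: bytes, store: list[tuple[str, str]]) -> list[str]:
    """Admit only entries whose stored tag verifies under the key."""
    return [e for e, tag in store if hmac.compare_digest(sign(key, e), tag)]


def ablated_verdict(
    pool: list[str],
    judge: Callable[[list[str]], str],
    subset_size: int,
    rounds: int,
    seed: int = 0,
) -> str:
    """Sample subsets of the retrieval pool and return the majority verdict."""
    rng = random.Random(seed)
    votes = Counter(judge(rng.sample(pool, subset_size)) for _ in range(rounds))
    return votes.most_common(1)[0][0]
```

Aggregating verdicts rather than exact strings means that a single poisoned entry has to appear in, and dominate, many sampled subsets before it changes the outcome. Signing does nothing against poison written through the legitimate signed path, which is the authenticated-adversary case discussed above.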

The broader memory-poisoning literature converges on write-path and lifecycle-aware controls rather than pure input filtering. MPBench recommends tighter memory write policies, source isolation, provenance tracking, compaction filters, and post-write monitoring (Dash et al., 3 Jun 2026). The multi-agent memory-security paper similarly argues for secure memory, secure update functions, provenance structures, trust and reputation, append-only or hash-chained storage where possible, and local inference with private knowledge retrieval for semantic memory (Torra et al., 20 Mar 2026). A recurring conclusion across these papers is that defenses must account for persistence, consolidation, and reuse, not just the surface form of an incoming prompt or file.
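
As one concrete instance of the write-path controls mentioned above, a hash-chained append-only log makes in-place edits to earlier entries detectable. The sketch below is a generic construction under assumed names (`append`, `verify`, `GENESIS`) and an assumed record layout. It is not taken from any of the cited papers.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the start of the chain


def _digest(prev_hash: str, content: str, source: str) -> str:
    """Hash a record together with the hash of its predecessor."""
    payload = json.dumps([prev_hash, content, source]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append(log: list[dict], content: str, source: str) -> None:
    """Append a memory record that commits to the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "content": content,
        "source": source,  # provenance label for the write
        "prev": prev_hash,
        "hash": _digest(prev_hash, content, source),
    })


def verify(log: list[dict]) -> bool:
    """Recompute the chain and report whether every link is intact."""
    prev_hash = GENESIS
    for rec in log:
        expected = _digest(prev_hash, rec["content"], rec["source"])
        if rec["prev"] != prev_hash or rec["hash"] != expected:
            return False
        prev_hash = rec["hash"]
    return True
```

A chain of this kind detects silent rewriting of stored entries. It does not detect a poisoned entry that is appended through the normal write path, which is why the cited work pairs storage integrity with provenance and post-write monitoring.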

6. Conceptual boundaries, misconceptions, and open problems

A common misconception is that SMP is simply a lexical trigger backdoor with a delayed firing condition. The current evidence is broader. SkillHarm’s SMP mutates persistent skill content; OEP causes the victim agent to convert a benign-looking experience into a generalized rule; MSMP shows that normal interaction can lead the agent to write authenticated poison into memory; Phantom Transfer shows that a poisoning effect can survive paraphrastic rewriting and cross-model transfer rather than depending on a fixed string signature (Ning et al., 1 Jun 2026).

A second misconception is that low unconditioned attack success necessarily indicates robustness. SkillHarm’s ASR–cASR gaps directly contradict that reading. The paper states that many apparent failures are cases where the agent never engages with the poisoned file. This implies that increases in competence—better skill reuse, more faithful reading of documentation, stronger tool routing—may increase realized attack success unless accompanied by stronger safety controls (Ning et al., 1 Jun 2026).

A third misconception is that content rewriting or data-level sanitization is equivalent to disinfection. Phantom Transfer is especially relevant here. The authors write that “it is unclear to the authors of this paper what the poison actually is,” even though the poisoning effect survives oracle-guided filtering and full-dataset paraphrasing (Draganov et al., 3 Feb 2026). This suggests that at least some poisoning signals are distributed, semantically latent, or representation-level rather than reducible to overt toxic spans.

The open problems are correspondingly structural. SMSR acknowledges that its guarantee is meaningful only for small TBT_B1; for TBT_B2 and TBT_B3, the certificate worsens from TBT_B4 at TBT_B5 to 0.402 at TBT_B6 and 0.684 at TBT_B7 (Sharma, 10 Jun 2026). OEP does not demonstrate fully recursive mutation across many generations of memory rewriting, only the first conversion from edge-case experience to generalized rule (Wang et al., 18 May 2026). The multi-agent memory-security paper states that interaction-based poisoning and cross-agent contamination remain difficult to formalize, especially in dynamic memories without a stable clean baseline (Torra et al., 20 Mar 2026). SkillHarm, for its part, shows that temporal decoupling, persistence, and low refusal can make multi-session attacks both effective and hard to notice (Ning et al., 1 Jun 2026).
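
Why a subset-sampling guarantee weakens as poisoned entries accumulate can be illustrated with a simple hypergeometric model. This is not the SMSR certificate, whose formula is not reproduced here. It assumes subsets are drawn uniformly without replacement from the retrieval pool, and all parameter names are hypothetical.

```python
from math import comb


def malicious_majority_prob(pool_size: int, poisoned: int, subset_size: int) -> float:
    """Probability that a uniformly sampled subset contains a poisoned majority."""
    need = subset_size // 2 + 1  # smallest count that forms a strict majority
    total = comb(pool_size, subset_size)
    hits = sum(
        comb(poisoned, j) * comb(pool_size - poisoned, subset_size - j)
        for j in range(need, min(poisoned, subset_size) + 1)
    )
    return hits / total


# Pool of 10 entries, subsets of 5, with a growing number of poisoned entries.
for b in (1, 3, 5):
    print(b, round(malicious_majority_prob(10, b, 5), 4))
```

In this toy setting the probability is 0 with one poisoned entry, 21/252 (about 0.083) with three, and 0.5 with five. That mirrors the qualitative point that the bound is meaningful only while authenticated poison stays rare.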

Taken together, these results place SMP at the intersection of supply-chain security, persistent memory safety, procedural integrity, and longitudinal agent evaluation. The most stable conclusion is that the attack surface is not confined to what an agent is told now. It includes what the agent stores, rewrites, re-trusts, and reuses later.
