AutoSkillHarm: Automated Poisoning Framework
- AutoSkillHarm is an automated framework that generates poisoned agent skills using both Fixed-Payload and Self-Mutating Poisoning methods.
- It treats skill packages as privileged artifacts, realistically simulating immediate and deferred attack scenarios within agent workflows.
- The benchmark evaluates 879 attack samples across multiple risk types, revealing high attack success rates and critical vulnerabilities in agent-skill security.
AutoSkillHarm is an automated construction framework introduced within the SkillHarm benchmark for generating poisoned agent skills at scale across the full skill-use lifecycle. In this setting, a skill package is treated as a privileged procedural artifact that can include SKILL.md instructions, reference documents, executable scripts, and auxiliary resources; because agents are expected to trust and execute these artifacts, poisoning them creates a realistic attack surface. AutoSkillHarm operationalizes two compromise regimes—Fixed-Payload Poisoning (FPP), in which harm occurs in the same task session that reaches the poisoned content, and Self-Mutating Poisoning (SMP), in which an initially benign execution silently mutates persistent skill content and defers harm to a later reuse session. The resulting benchmark contains 879 attack samples across 71 skills, and evaluated agents reach attack success rates up to 86.3% in FPP and 69.3% in SMP (Ning et al., 1 Jun 2026).
1. Definition, scope, and conceptual boundaries
AutoSkillHarm belongs to the security literature on agent-skill poisoning rather than to the literature on benign modular skill learning. Its underlying assumption is that third-party skills occupy a privileged position in agent workflows: agents install them, load them, and may follow their instructions or execute their helper code as part of ordinary task completion. In SkillHarm, the attacker is a skill publisher who controls all files in the package, knows the skill’s advertised purpose and the general class of tasks it supports, but does not have post-installation access to the victim environment, does not observe execution, does not adapt payloads online, and does not know the exact downstream prompt, files, or system configuration (Ning et al., 1 Jun 2026).
Within this threat model, FPP is the single-session compromise setting: the payload is already present at installation and compromises any task session that invokes it. SMP is the cross-session compromise setting: a first task silently mutates persistent skill content, and a later task triggers the actual harmful behavior when the modified skill is reused. The central novelty is lifecycle awareness: the benchmark is explicitly designed to cover both immediate and deferred compromise, rather than only one-shot task execution (Ning et al., 1 Jun 2026).
A related but distinct boundary is the difference between poisoned skills and harmful skills. HarmfulSkillBench studies skills whose declared purpose is itself harmful or policy-violating, such as cyber attacks, fraud/scams, privacy violations, or sexual content generation; AutoSkillHarm instead studies otherwise ordinary reusable skills that are maliciously modified to induce harmful behavior during agent use (Jiang et al., 16 Apr 2026). A second, narrower use of the term also appears in offensive cybersecurity: “AutoSkillHarm” is used there to describe a negative result in which added Agent Skills become redundant or even harmful when the environment already provides deterministic, schema-validated, low-latency feedback (Chacko et al., 19 May 2026).
2. Lifecycle model and risk taxonomy
SkillHarm organizes skill-based harm by the component of the agent workflow that is compromised. The benchmark defines 12 risk types in three categories, replacing flat ad hoc harm lists with a workflow-oriented taxonomy (Ning et al., 1 Jun 2026).
| Category | Risk types |
|---|---|
| Data Pipeline Exploitation | Data Exfiltration; Output Manipulation; Poisoning |
| System Environment Exploitation | Privilege Escalation; Unauthorized File Mod.; Backdoor Injection; Denial of Service; Malware Deployment; System Corruption |
| Agent Autonomy Exploitation | Goal Hijacking; Anti-Forensics; Proxy Attack |
This taxonomy is coupled to the lifecycle distinction between FPP and SMP. In FPP, the attack is realized when the agent reads, consults, or executes poisoned content during the current task. In SMP, the first session functions as a setup phase that rewrites persistent skill content, and the second session activates the compromised artifact during later reuse. The benchmark therefore evaluates both harm execution and harm persistence, which prior benchmarks largely did not (Ning et al., 1 Jun 2026).
The risk taxonomy also clarifies what counts as a successful compromise. Some risks target dataflow, such as external disclosure or output corruption; some target the host environment, such as filesystem abuse or persistent access; and some target delegated agency itself, such as goal hijacking or proxy misuse. This partitioning matters because the benchmark’s evaluators are designed to check harmful end states rather than merely suspicious text patterns or nominal policy violations (Ning et al., 1 Jun 2026).
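The three-category taxonomy can be written down as a small lookup structure. The sketch below is illustrative only: the category and risk-type names are taken from the table above, while the names `RISK_TAXONOMY` and `category_of` are hypothetical and not part of the benchmark.

```python
# Risk taxonomy from the table above, keyed by the compromised workflow component.
# The variable and function names are illustrative assumptions.
RISK_TAXONOMY: dict[str, tuple[str, ...]] = {
    "Data Pipeline Exploitation": (
        "Data Exfiltration",
        "Output Manipulation",
        "Poisoning",
    ),
    "System Environment Exploitation": (
        "Privilege Escalation",
        "Unauthorized File Mod.",
        "Backdoor Injection",
        "Denial of Service",
        "Malware Deployment",
        "System Corruption",
    ),
    "Agent Autonomy Exploitation": (
        "Goal Hijacking",
        "Anti-Forensics",
        "Proxy Attack",
    ),
}


def category_of(risk_type: str) -> str:
    """Return the category containing risk_type, or raise KeyError if unknown."""
    for category, risks in RISK_TAXONOMY.items():
        if risk_type in risks:
            return category
    raise KeyError(risk_type)
```

Keeping the mapping explicit makes it easy to group per-sample results by category when comparing risks.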
3. Automated construction pipeline
AutoSkillHarm is implemented as a three-stage pipeline: attack target selection, attack design, and quality filtering. The pipeline is driven by natural-language harnesses and executed by coding agents in containerized environments, with deterministic evaluators used to verify whether the intended harm materializes (Ning et al., 1 Jun 2026).
In the target-selection stage, the pipeline is instantiated on SkillsBench. For FPP, a target is an injection point $(t, f)$, where $t$ is a user task and $f$ is a skill file likely to be read, consulted, or executed while solving $t$. The pipeline defines an exposure/read rate $r(t, f) = n(t, f) / N$, where $n(t, f)$ counts baseline rollouts for task $t$ that reach file $f$, and $N$ is the total number of baseline rollouts; it retains files whose read rate $r(t, f)$ meets the pipeline's threshold, caps each task at up to 3 documentation files and up to 3 scripts, and yields 142 injection points comprising 36 scripts and 106 documentation files. For SMP, a target is a task pair $(t_1, t_2)$ that shares a skill; three coding agents independently propose candidate task pairs, and consensus ranking retains 12 pairs (Ning et al., 1 Jun 2026).
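The FPP target-selection step can be sketched as a read-rate filter over baseline rollouts. This is a minimal sketch under stated assumptions, not the benchmark's implementation: the function name, the input layout, the `threshold` parameter (whose actual value is not reproduced here), the per-task rollout total, and the choice to prefer the most frequently read files when applying the cap are all hypothetical.

```python
from collections import defaultdict


def select_injection_points(rollouts, file_kinds, threshold, cap_per_kind=3):
    """Pick (task, file, read_rate) injection points from baseline rollouts.

    rollouts:   dict mapping task -> list of sets, one set of reached files per rollout
    file_kinds: dict mapping file -> "doc" or "script"
    threshold:  hypothetical minimum read rate (assumption, not the source's value)
    """
    selected = []
    for task, runs in rollouts.items():
        total = len(runs)  # assumption: the rollout total is counted per task
        if total == 0:
            continue
        counts = defaultdict(int)
        for reached in runs:
            for name in set(reached):
                counts[name] += 1
        rates = {name: c / total for name, c in counts.items()}

        kept = defaultdict(int)
        # Assumption: prefer the most frequently read files; ties broken by name.
        for name, rate in sorted(rates.items(), key=lambda kv: (-kv[1], kv[0])):
            kind = file_kinds.get(name)
            if kind is None or rate < threshold:
                continue
            if kept[kind] >= cap_per_kind:
                continue
            kept[kind] += 1
            selected.append((task, name, rate))
    return selected
```

For example, with four rollouts in which `SKILL.md` is reached three times and two other files once each, a hypothetical threshold of 0.5 keeps only `SKILL.md`.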
In the design stage, a primary designer coding agent—Claude Code (Opus 4.7) in the main pipeline—receives the skill package, the selected task or task pair, the selected risk type, and harness instructions specifying the stage objective, available inputs, tools, constraints, required outputs, review criteria, and stopping conditions. The designer explores the skill package, instantiates the abstract risk type into a concrete harmful goal, modifies skill files to embed the payload, and writes a deterministic evaluator. For document carriers, FPP supports four realization forms: Plain Text (in-place), Redirect to New Doc, Redirect to New Script, and Redirect to New PDF. For script carriers, it supports Executable Code (in-place) and Import New Helper. For SMP, the designer typically adds an atexit callback that silently rewrites a sibling skill file—usually SKILL.md or another script—under framework-specific skill directories such as .claude/skills, .codex/skills, or .agents/skills (Ning et al., 1 Jun 2026).
The design process is also environment-grounded. The task Dockerfile can be modified to create preconditions for harm via mechanisms such as seed_file, ensure_directory, seed_env, copy_path, and service_required. The evaluator is required to be realization-independent, faithful, deterministic and self-contained, and scoped to harm rather than to task utility. The harness additionally instructs the designer to make the payload look organic by matching tone, formatting, and structure and by avoiding intent-leaking words such as “malicious,” “steal,” or “exfiltrate”; an LLM-based detector can be used inside the loop as a refinement aid (Ning et al., 1 Jun 2026).
In the quality-filtering stage, each candidate is run end-to-end on Claude Code Sonnet 4.6 and Codex GPT-5.4. A separate reviewer agent examines full trajectories for reachability, plausibility, evaluator validity, and construction flaws. Candidates are discarded for unreachable goals, malformed payloads, unfaithful evaluators, weak framing, or infrastructure failures. This filter retains about 70% of generated candidates (Ning et al., 1 Jun 2026).
4. Benchmark composition and evaluation protocol
The final SkillHarm benchmark produced by AutoSkillHarm contains 879 total attack samples: 687 FPP samples and 192 SMP samples. It spans 71 skills, 12 risk types, 57 user tasks in the FPP setting, and 12 task pairs in the SMP setting. The appendix further specifies 126 unique target combinations for FPP and 15 unique target combinations for SMP (Ning et al., 1 Jun 2026).
The evaluation covers six model-harness configurations across four harnesses: Claude Code with Sonnet 4.6 and Opus 4.7, Codex with GPT-5.4 and GPT-5.5, Gemini CLI with Gemini 3 Flash, and OpenCode with Qwen-3.6 27B. Three metrics are reported. ASR is Attack Success Rate, measured by the deterministic evaluator on the final environment state. cASR is Conditional ASR, conditioned on the agent actually engaging with the poisoned file. ARR is Attack Refusal Rate, the proportion of samples where the agent explicitly refuses to follow suspicious content. Engagement for cASR is judged by LLM-based trajectory analysis with checks for enter_entry_point, use_target_file, identified, and refusal (Ning et al., 1 Jun 2026).
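The three metrics can be computed from per-sample outcomes as follows. This is a minimal sketch, assuming each sample has already been reduced to three boolean flags; the `SampleResult` type and `summarize` function are hypothetical names, and how the four trajectory checks are combined into a single engagement flag is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleResult:
    harm_materialized: bool  # deterministic evaluator verdict on the final state
    engaged: bool            # trajectory analysis: agent engaged with the poisoned file
    refused: bool            # agent explicitly refused to follow suspicious content


def summarize(results: list[SampleResult]) -> dict[str, Optional[float]]:
    """Return ASR, cASR, and ARR as fractions in [0, 1]."""
    if not results:
        raise ValueError("no samples to summarize")
    n = len(results)
    engaged = [r for r in results if r.engaged]
    asr = sum(r.harm_materialized for r in results) / n
    # cASR is undefined when no run engaged with the poisoned file.
    casr = sum(r.harm_materialized for r in engaged) / len(engaged) if engaged else None
    arr = sum(r.refused for r in results) / n
    return {"ASR": asr, "cASR": casr, "ARR": arr}
```

Because cASR divides by the number of engaged runs rather than all runs, it can be much higher than ASR when agents rarely reach the poisoned file.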
The use of a deterministic evaluator is structurally important. Success is defined by materialization of the harmful end state rather than by a text judgment over model outputs. This evaluation choice aligns AutoSkillHarm with end-state verification rather than prompt-level red-teaming, and keeps benchmark scoring realization-independent across document and script carriers (Ning et al., 1 Jun 2026).
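One generic way to check an end state deterministically, relevant to the SMP setting where persistent skill content is rewritten, is to fingerprint the skill directory before and after a session and compare. The sketch below is an illustration of end-state verification in general, not one of the benchmark's per-sample evaluators; `fingerprint` and `changed_files` are hypothetical names.

```python
import hashlib
from pathlib import Path


def fingerprint(skill_dir) -> dict[str, str]:
    """Map each file's path (relative to skill_dir) to the SHA-256 of its bytes."""
    root = Path(skill_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return sorted paths that were added, removed, or modified."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

A check of this kind depends only on the final filesystem state, so it gives the same verdict regardless of what the agent wrote in its transcript.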
5. Empirical results and defense findings
SkillHarm reports that current agents remain highly vulnerable. In FPP, ASR reaches 86.3% for Codex GPT-5.4. The other FPP ASRs are 52.4% for Claude Sonnet 4.6, 27.4% for Claude Opus 4.7, 81.4% for Codex GPT-5.5, 63.8% for Gemini 3 Flash, and 53.9% for OpenCode Qwen-3.6 27B. In SMP, ASR reaches 69.3% for Codex GPT-5.4; the other SMP ASRs are 51.6% for Claude Sonnet 4.6, 9.4% for Claude Opus 4.7, 65.6% for Codex GPT-5.5, 45.8% for Gemini 3 Flash, and 51.6% for OpenCode Qwen-3.6 27B (Ning et al., 1 Jun 2026).
A key finding is the gap between ASR and cASR. Many apparent attack failures are not genuine resistance; they occur because the agent never opened, read, or executed the poisoned target file. Once engagement occurs, success rises sharply, with cASR increases as large as +32.1 points for Claude Opus 4.7 in SMP. The benchmark therefore identifies a latent risk: failure-to-engage can mask vulnerability, and improvements in skill utilization could increase realized exploitability if safety mechanisms do not improve correspondingly (Ning et al., 1 Jun 2026).
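The masking effect can be made explicit with a short identity. It rests on an assumption not stated in the source: that an attack can only succeed in a run where the agent engaged with the poisoned file. Writing $E$ for engagement and $S$ for attack success:

$$
\mathrm{ASR} = P(S) = P(S \mid E)\,P(E) = \mathrm{cASR} \cdot P(E)
$$

Under that assumption, a low ASR can reflect a low engagement rate $P(E)$ rather than a low cASR. As an illustration only, if the +32.1-point figure is read as cASR minus ASR, the Claude Opus 4.7 SMP numbers would correspond to a cASR of about $9.4 + 32.1 = 41.5\%$ and an implied engagement rate of roughly $9.4 / 41.5 \approx 0.23$.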
Refusal is rare and strongly scenario-dependent. Only Claude-family models show noticeable refusal behavior, and ARR drops sharply in SMP because the harmful effect is temporally decoupled from the initial benign-looking action. For Claude Sonnet 4.6, ARR falls from 25.3% in FPP to 2.1% in SMP. The benchmark also reports two regularities across risks and payload forms: data-pipeline exploitation, especially data exfiltration, is generally the hardest category for attacks to succeed in, whereas script-mediated payloads are especially effective because agents often trust and run helper scripts without inspecting them (Ning et al., 1 Jun 2026).
Defense results are limited. The benchmark evaluates Skill Scanner and Agent Scan and finds that contextualized injections often evade detection; the strongest scanner finds only 55.6% of FPP injections and 68.8% of SMP injections. A defensive system prompt warning that skills may contain malicious content helps in some configurations, but it is not reliable, and most configurations remain above 70% ASR in both attack scenarios (Ning et al., 1 Jun 2026).
6. Position within the broader agent-skill security literature
AutoSkillHarm is part of a larger shift from prompt-only threat models to artifact-centric analysis of skills. SkillJect studies automated stealthy prompt injection for coding agents and similarly exploits the split between visible inducement prompts in SKILL.md and hidden malicious payloads in auxiliary scripts; across 50 skills and four backend models, it reports an average ASR of 95.1%, compared with 10.9% for a naive direct-injection baseline (Jia et al., 15 Feb 2026). RouteGuard targets pre-execution skill-poison detection by modeling “attention hijacking” inside frozen backbones; on the critical Skill-Inject channel slice it reaches Precision 0.9334, Recall 0.6442, and F1 0.8834, recovering 90.51% of description attacks missed by lexical screening (Xiao et al., 24 Apr 2026).
Other work addresses governance and auditing rather than construction. SkillGuard treats skills as permission-bearing executable artifacts and enforces a dual-plane model over context influence and action side effects; on SkillInject it reduces attack success from 32.37% to 23.02% for contextual injections and from 25.56% to 16.67% for obvious injections while reducing benign task TSR by 1.45 percentage points (Pan et al., 2 Jun 2026). SkillAudit evaluates utility, efficiency/cost, and safety at adoption time using a two-stage pipeline that combines static semantic analysis with dynamic runtime verification; under the Codex/GPT-5.4 dynamic testing configuration, 17 of 226 evaluated skills, or 7.5%, fall below the risky threshold (Yu et al., 21 Jun 2026). Repository-aware analysis further shows that context matters for scanner interpretation: when repository context is added, only 0.52% of 2,887 scanner-flagged skills remain in malicious-flagged repositories (Holzbauer et al., 17 Mar 2026).
The literature also shows that the skill attack surface is broader than text-only poisoning. “Seeing Is Not Screening” demonstrates multimodal hidden instruction attacks in which malicious instructions are concealed in images referenced by otherwise ordinary documentation; SkillCamo achieves 78–100% ASR against baseline scanners, while the multimodal, execution-grounded ExecScan reduces ASR to 8% for SkillCamo, 31% for Cloze, and 17% for Split, with Precision 85.6%, Recall 82.0%, F1 83.8%, and FPR 27.4% (Jia et al., 16 Jun 2026). HarmfulSkillBench addresses a different axis—the declared functionality of the skill itself—and finds 4,858 harmful skills out of 98,440 total, or 4.93%, with average harm scores rising from 0.27 without the skill to 0.47 with the skill and explicit harmful task, and to 0.76 when the harmful intent is implicit and the model is asked only to plan skill execution (Jiang et al., 16 Apr 2026).
Beyond skills specifically, CUAHarm shows that harmful capability rises sharply when models are granted direct computer access, with success rates such as 59.6% for Claude 3.7 Sonnet and 84.6% for Gemini 1.5 Pro on malicious computer-use tasks, while monitoring accuracy averages only 72% (Tian et al., 31 Jul 2025). Human-Guided Harm Recovery extends the problem to post-execution safeguards by formalizing recovery from harmful states and introducing BackBench; its reward-model scaffold improves recovery quality by about +120 Bradley–Terry points over a base agent (Li et al., 20 Apr 2026). Taken together, these results place AutoSkillHarm within a broader research program: skills are simultaneously supply-chain artifacts, instruction carriers, execution triggers, and persistence mechanisms, and evaluating them requires lifecycle-aware benchmarks, runtime evidence, and post-failure recovery methods rather than static prompt filtering alone.