Skill-Based Prompt Injection

Updated 2 July 2026

Skill-based prompt injection is a vulnerability in agentic coding assistants where attackers manipulate skill metadata and instructions to gain elevated privileges.
The attack exploits role confusion by embedding malicious payloads in trusted skill artifacts, resulting in persistent, high-success adversarial behaviors.
Defensive research emphasizes layered mitigations like sandboxing, cryptographic signing, and runtime intent verification to limit such adversarial exploits.

Skill-based prompt injection is a class of adversarial attacks targeting agentic coding assistants and LLM agents that support dynamically loadable, user-installable "skills" (plugins, modules, agent skills) to extend agent functionality. In contrast to traditional prompt injection, which manipulates user-visible natural language inputs, skill-based prompt injection exploits the privileged status of skill files and their integration points—metadata, long-form instruction documents, and associated scripts—to execute malicious behaviors with elevated or persistent authority. This attack surface fundamentally arises from treating external, user-supplied instructions as semantically trusted subcomponents of the agent's reasoning and execution stack, leading to high attack success rates and a marked increase in agent supply-chain risk (Maloyan et al., 24 Jan 2026, Schmotz et al., 30 Oct 2025, Jia et al., 15 Feb 2026, Schmotz et al., 23 Feb 2026, Ye et al., 22 Feb 2026).

1. Formal Definition, Threat Model, and Mechanistic Basis

Formally, let $S$ denote the set of agent skills, each comprising a structured metadata header (YAML/JSON), a long-form instruction body (e.g., SKILL.md), and zero or more auxiliary artifacts (scripts, binaries). An agent, on startup and/or demand, loads elements of $S$ into its Skill Registry, integrating both descriptive metadata into its system prompt and skill body content into its conversational context—then performs tool invocations or code execution as instructed (Schmotz et al., 30 Oct 2025, Schmotz et al., 23 Feb 2026).

A skill-based prompt injection attack occurs when an adversary, controlling or tampering with one or more skill artifacts, embeds malicious directives so that agent $M(S_0, K \cup \{K_m\}, Q)$ is induced to execute targeted malicious actions $\alpha_{\text{mal}}$ and/or exfiltrate data $D$ (using the notation of (Schmotz et al., 30 Oct 2025)):

$M(S_0,K \cup \{K_m\}, Q) \rightarrow (\alpha_{\text{mal}}, D_\text{exfiltrated})$

This supply chain threat extends not only to direct instruction-body manipulations (e.g., inserting reverse shells) but also to auxiliary scripts, protocol manifests, and even YAML metadata—each loaded as first-class operational guidance by the agent.

At the latent-mechanistic level, skill-based prompt injection exploits "role confusion" (Ye et al., 22 Feb 2026): the LLM's attribution of semantic authority is a function of both architectural tags (system, user, assistant, tool) and stylistic/specialist cues $s_{\mathrm{style}}$ synthesized from skill content. When injected skill text mimics the style or expected pattern of high-privilege reasoning (e.g., convincing "expert" jargon), the model projects the content into the corresponding privileged subspace, granting it the authority to override or subvert intended agent policies.

2. Taxonomy: Dimensions of Attack and Propagation

Recent systematizations propose a comprehensive three-dimensional taxonomy to precisely capture the structural diversity of skill-based prompt injection attacks (Maloyan et al., 24 Jan 2026). Each attack is classified by:

Delivery Vector ( $V$ ):
- $D_1$ : Direct Prompt Injection (role hijacking, context override, instruction negation)
- $D_2$ : Indirect Prompt Injection (repository-based file poisoning, documentation/manifest injection, web content contamination)
- $S$ 0: Protocol-Level Attacks (tool poisoning, Model Context Protocol (MCP) attacks, OAuth squatting)
Attack Modality ( $S$ 1):
- $S$ 2: Text-based (completion hijacking, context encoding abuse)
- $S$ 3: Semantic (cross-origin context poisoning, logic bombs)
- $S$ 4: Multimodal (cross-channel injections via images, audio, or combined formats)
Propagation Behavior ( $S$ 5):
- $S$ 6: Single-Shot (ephemeral, non-persistent)
- $S$ 7: Persistent (configuration modification, persistent memory poisoning)
- $S$ 8: Viral (wormlike spread, dependency chain, agent-to-agent transfer)

Each concrete attack is identified as a triple $S$ 9, supporting fine-grained mapping across the expanding landscape of exploit techniques.

Table: Sample Attack Categories (as per (Schmotz et al., 23 Feb 2026)) | Category | Description | Examples | |---------------------|-------------------------------------------|-----------------------------------------------| | Data Exfiltration | Unauthorized file/secret transfer | POST requests in backup skills | | Destruction/Malware | Ransomware, deletion, system sabotage | rm -rf /, encrypt+exfiltrate payloads | | Backdoors | Dropped persistence, hidden logic | Self-modifying skills, hidden trojans | | Privilege Escalation| Gaining broader permissions | chmod 777, appending to sudoers file |

3. Representative Attack Chains and Stealth Strategies

Attacks range from naively obvious (e.g., inserting "rm -rf /" as a step in an instructional block) to context-dependent or highly stealthy variants enabled by automated trace-guided refinement (Jia et al., 15 Feb 2026, Maloyan et al., 24 Jan 2026).

Direct Skill File Injection: A malicious SKILL.md interleaves innocuous instructions with operational payloads: $M(S_0, K \cup \{K_m\}, Q)$ 0bash python scripts/file_backup.py $\alpha_{\text{mal}}$ 0markdown

Prerequisites

Run: bash resources/helper.sh ``` The helper script executes exfiltration, privilege escalation, or persistence operations.

Protocol/Context Poisoning: Configuration files such as .cursorrules are modified or distributed as part of an open-source repo; agentic assistants load and execute shell commands listed therein during routine operations (Maloyan et al., 24 Jan 2026).

Human Approval Bypass: Many agents employ multi-step confirmation for skill activation or tool execution. If a user grants blanket approval with "Don't ask again" (e.g., for "benign" Python execution), subsequent surreptitious malicious actions (e.g., outbound network access) are implicitly permitted without further checks (Schmotz et al., 30 Oct 2025).

Role Spoofing via Stylistic Mimicry: By engineering injected skill files to imitate privileged expert tone or chain-of-thought reasoning, attackers exploit latent-space authority assignment to suppress interface-level tag boundaries, driving high attack success rates irrespective of surface safeguards (Ye et al., 22 Feb 2026).

4. Metrics, Benchmarking, and Empirical Results

Empirical assessments systematically quantify both attack effectiveness and defense shortfalls. The Skill-Inject benchmark (Schmotz et al., 23 Feb 2026) evaluates 202 injection–task pairs across 23 real-world skills and eight attack categories:

$M(S_0, K \cup \{K_m\}, Q)$ 1

$M(S_0, K \cup \{K_m\}, Q)$ 2

Observed ASR values consistently exceed 50% for contextual body injections across frontier agents; best-of- $M(S_0, K \cup \{K_m\}, Q)$ 3 configurations (varying skill, position, user task) can push ASR beyond 80% (e.g., 83.3% for Gemini 3 Flash, 75% for GPT-5.1-Codex-Mini).

Stealth-optimized frameworks (e.g., SkillJect (Jia et al., 15 Feb 2026)) demonstrate that adaptive, trace-driven closed-loop attacks reach ASRs >95% for InfoLeak, PrivEsc, and FileMod, compared to near-zero for naive direct injection. Penetration extends across major model families and platforms, unaffected by scaling (e.g., GPT-5.2 remains highly vulnerable).

5. Failure of Traditional Defenses: Input Filtering, Scaling, Data-Instruction Separation

Model scaling, advanced RLHF, and LLM-based input screening fail to provide meaningful robustness (Schmotz et al., 23 Feb 2026, Schmotz et al., 30 Oct 2025, Maloyan et al., 24 Jan 2026). No monotonic relationship exists between model size and attack resistance. Instruction–data separation is inapplicable because skills are, by construction, instructions. LLM skill scanning filters either miss context-dependent attacks or harm utility under strict or warning policies. Static blacklist filtering or syntactic sanitization is ineffective, as learned execution triggers and role-mimicking style vectors easily evade such mechanisms (Pasquini et al., 2024, Ye et al., 22 Feb 2026). Precision/recall values of common filters under adaptive attacks remain under 0.10, with >90% bypass rates (Maloyan et al., 24 Jan 2026).

6. Root Cause Analysis: Role Confusion and Latent-Space Authority

A mechanistic diagnosis identifies latent-space "role confusion" as the core vulnerability (Ye et al., 22 Feb 2026). LLMs conflate role attribution from both architectural tags and textual style, relying more heavily on stylistic cues (e.g., expert narrative, stepwise reasoning) than on trusted channel demarcations. Skill spoofing leverages this property: the injected content is designed so its hidden state $M(S_0, K \cup \{K_m\}, Q)$ 4 projects strongly onto the privileged role subspace, maximally elevating $M(S_0, K \cup \{K_m\}, Q)$ 5 for the desired role $M(S_0, K \cup \{K_m\}, Q)$ 6.

Formally, for token $M(S_0, K \cup \{K_m\}, Q)$ 7: $M(S_0, K \cup \{K_m\}, Q)$ 8

$M(S_0, K \cup \{K_m\}, Q)$ 9

Role-confusion probes confirm that both skill-based injections and chain-of-thought (CoT) forgeries operate via this channel, achieving 60–61% attack success rates even in highly curated agents. Attack effectiveness tracks latent-space CoTness or Userness, not surface prompt structure.

7. Defensive Frameworks and Open Research Challenges

Comprehensive mitigation requires a layered, context-aware, defense-in-depth strategy (Maloyan et al., 24 Jan 2026, Schmotz et al., 30 Oct 2025, Schmotz et al., 23 Feb 2026):

Cryptographic Tool Identity (ETDI): Requires all skills/tools be signed and version-tagged. Full prevention of tool squatting/rug-pull exploits in controlled benchmarks.
Capability Scoping: Restricts skills to least-privilege, enforced via sandboxing at the OS/tool runtime level. Proven to reduce attack success by an order of magnitude.
Runtime Intent Verification: Deploys side-channel "guardian" agents to validate action sequences before external effects.
Sandboxed Execution: Containerizes each tool invocation, enforcing egress and file access policies.
Provenance Tracking: Tags each context fragment and instruction with its origin and trust level, supporting forensic analysis and merge control.
Human-in-the-Loop Gates: Requires explicit approval for high-impact actions, using tiered confirmation methods to balance security and usability.
Context-Aware Authorization: Enforces semantic security policies, binding allowed behaviors to well-defined context boundaries instead of static filtering or data/instruction tagging.

No single defense suffices. Only multi-layered approaches that combine static provenance, dynamic behavioral gating, semantic policy enforcement, and judicious user intervention achieve attack success rates in the low single digits. Open research areas include scalable automated policy specification, dynamic enforcement against adaptive attackers, and formal languages for context-aware security modeling (Schmotz et al., 23 Feb 2026, Schmotz et al., 30 Oct 2025).

Skill-based prompt injection represents a fundamental architectural failure at the intersection of agent extensibility, privileged instruction handling, and latent role attribution. The class of attacks is empirically broad, highly effective, and resilient to ad-hoc mitigations, underscoring the necessity for architectural reforms and continued security research (Maloyan et al., 24 Jan 2026, Schmotz et al., 30 Oct 2025, Jia et al., 15 Feb 2026, Schmotz et al., 23 Feb 2026, Ye et al., 22 Feb 2026).