Papers
Topics
Authors
Recent
Search
2000 character limit reached

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Published 23 Feb 2026 in cs.CR and cs.LG | (2602.20156v3)

Abstract: LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.

Summary

  • The paper demonstrates that LLM-agent platforms are highly vulnerable to skill file prompt injection attacks, enabling adversarial control over operations.
  • The paper outlines a comprehensive benchmark featuring 23 skills and varying policy conditions to quantify attack success rates, revealing robust contextual attack vectors.
  • The paper emphasizes the need for context-aware and least-privilege defenses to counter the structural supply chain risks posed by third-party skill integration.

Skill-Inject: Evaluating LLM Agent Vulnerability to Skill-Based Prompt Injection

Introduction and Threat Model

LLM-powered agents increasingly adopt plugin architectures, incorporating dynamically loaded "skills"—external files containing specialized instructions, code snippets, and domain knowledge to expand agent capability. This paradigm introduces a potent vulnerability: if skills act as trusted sources of executable information but are written and distributed by untrusted third parties, then malicious actors can surreptitiously insert adversarial instructions. This circumvents standard prompt boundaries and exposes agents to privilege escalation, data exfiltration, destructive operations, and malware deployment. Traditional prompt injection attacks target untrusted data within the user prompt or retrieved content; skill-file attacks are fundamentally different because skills define mixed-privilege instructions, blurring the distinction between benign and harmful intent and severely complicating trusted execution boundaries.

The "Skill-Inject" benchmark (2602.20156) rigorously formalizes this problem, designing a comprehensive evaluation protocol to measure LLM agent susceptibility to skill-file prompt injections, including contextual, dual-use, and outright malicious payloads. Figure 1

Figure 1: Skill-Inject benchmark framework quantifies both attacker and user utility across major real-world agent implementations, focusing on skill-driven attack surfaces.

Benchmark Design: Scope and Contextual Attacks

Skill-Inject encompasses 23 unique third-party skills emulating real-world usage scenarios (e.g., file manipulation, code editing, spreadsheet workflows, email APIs, and more). Test cases include both unambiguous malicious instructions (e.g., "delete all files", ransomware-style actions) and context-dependent injections (e.g., seemingly routine backup or summary instructions that, depending on context, exfiltrate confidential data).

A key innovation is emphasizing "instruction-instruction" conflicts: within a skill, malicious lines are commingled with plausible operational directives, making attack detection highly contextual. The benchmark introduces security policies (appended to the agent's system prompt) that either legitimize or explicitly prohibit certain actions, thereby exposing models’ ability (or lack thereof) to reason about authorization and context, rather than relying on surface-level detection heuristics. Figure 2

Figure 2: Attack success rates (ASR) under varying attacker strategies and defense conditions. Contextual (dual-use) injections remain a persistent challenge, and "best-of-n" adaptive attacks exhibit high success rates even under warning protocols.

Experimental Protocol and Key Findings

Skill-Inject systematically evaluates leading LLM-powered agent frameworks—OpenAI Codex (including GPT-4, GPT-5.1/5.2 series), Anthropic Claude Code (Haiku, Sonnet, Opus 4.5), and Google Gemini CLI (Gemini 2.x, 3.x Pro/Flash)—across three policy conditions: (1) baseline with no explicit security context, (2) a warning policy explicitly flagging potential harm, and (3) a legitimizing policy authorizing the injected action.

Main Results

  • Baseline contextual attacks: Under ambiguous/no policy conditions, contextual injections yield high ASRs, ranging from 41% (Claude Haiku 4.5) to 79% (Gemini 3 Flash), with most systems exhibiting >50% success for attackers.
  • Effectiveness of defensive policies: Warning policies generally reduce ASR but do not reliably eliminate attacks; even with explicit warnings, ASR remains substantial in many cases. Legitimizing policies correctly increase execution rates (as desired) in most architectures, but some models (e.g., Anthropic Opus 4.5) exhibit idiosyncratic mistrust, dropping both utility and attack rates when explicitly told to perform actions.
  • Obvious (unambiguous) attacks: For direct, obviously-malicious injections, single-run ASRs can reach 70%. With attacker adaptation and "best-of-n" strategies (varying skill context, injection position, and user prompt), the likelihood of at least one attack succeeding increases dramatically above 80% in many cases.
  • Cross-family differences: No evaluated system is robust in the general case; model scaling, versioning, or improved natural language understanding do not grant immunity to skill-file attacks. Model response to security policies varies idiosyncratically. Figure 3

    Figure 4: Injection execution rates versus utility (task completion) under policy variations, demonstrating no monotonic relationship between model scale and security; warning and legitimizing prompts induce complex tradeoffs.

Attack Vectors and Adaptive Threats

Skill-Inject demonstrates the potency of several realistic attacker strategies:

  • Script-based versus textual payloads: Malicious behaviors injected via shell scripts or code files referenced by the skill are much more likely to be executed (up to +30 percentage points in ASR), since agents routinely trust and invoke such scripts without examination.
  • Description (metadata) injections: Attacks placed in the skill metadata (e.g., YAML description) substantially amplify ASR, especially under policy-warned scenarios, since such metadata is loaded early and establishes malicious expectations prior to the agent processing security warnings.
  • Adaptive/Repeated attempts: Varying line placement, skill context, and associated tasks compounds attacker success; best-of-n attacks show that single-shot evaluation grossly underestimates real-world risk.

Defense Evaluation: LLM-based Skill Screening

A natural defense involves pre-screening skills with an LLM "judge" tasked with recommending whether a skill is safe. Experiments reveal that while these judges exhibit high recall for obviously benign or overtly malicious skills, they are overcautious under legitimizing policies and cannot reliably distinguish context-dependent attacks, resulting either in false negatives or utility reduction. Only select models (e.g., Gemini 3 Pro, GPT-5.1-Codex-Mini) display partial separation between truly safe and malicious contextual scenarios. Figure 5

Figure 6: LLM-based security screening for skills. Despite reasonable discrimination against explicit attacks, context-dependent legitimate skills are often judged unsafe, leading to unnecessary refusal of harmless operations and revealing major limitations in current LLM skill-screening defenses.

Analysis: Structural Supply Chain Risks and Implications

The study reveals that LLM-agent ecosystems exhibit a structural supply chain risk: dynamically loaded third-party skills create a direct channel for adversarial instruction injection, and current agent architectures lack any meaningful contextual or fine-grained authorization to mitigate these attacks. The traditional separation of "instructions" (trusted) vs. "data" (untrusted) is fundamentally broken by the practice of ingesting entire external instruction files as operational input. The issue is amplified by agents’ broad operating privileges and access to sensitive artifacts (files, APIs, credentials, emails), increasing the attack impact.

Neither pure model scaling, heuristics, nor naive security wrappers suffice. True resolution demands context-aware, task-sensitive, principle-of-least-privilege security frameworks—binding agent actions to explicit, dynamically auditable policies based both on the source and the demand context for each instruction. Figure 7

Figure 3: Claude Code enumeration of loaded skills, illustrating the magnitude of the dynamic, attackable skill surface in operational deployments.

Conclusion

Skill-Inject systematically demonstrates that all major LLM-driven agent platforms are highly vulnerable to prompt injection via skill files. Attackers enjoy high success rates through both direct and contextually embedded payloads, with dynamic injection strategies further exacerbating risk. LLM-based skill screening provides limited and unreliable defenses for contextual attacks and often negatively impacts the utility of legitimate workflows. The findings mandate a wholesale reconsideration of agent supply chain trust boundaries, and motivate the development of robust, context-aware, and least-privilege authorization architectures for third-party skill integration. Given the trajectory toward fully autonomous, extensible LLM agents, principled solutions to skill-file injection are essential for maintaining operational security as these systems proliferate and assume critical decision-making authority.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

Overview: What is this paper about?

This paper looks at a new security problem for AI “agents” (AI tools that can read files, run code, and use add-ons). These agents can be extended with “skills,” which are like plug-ins or add-on instruction files that teach the agent how to do specialized tasks. The authors show that attackers can hide harmful instructions inside these skill files, tricking the agent into doing bad things. They introduce Skill-Inject, a test (benchmark) to measure how easily different AI agents fall for these hidden attacks.

What questions did the researchers ask?

  • Can AI agents be tricked by harmful instructions hidden inside skill files?
  • Do agents know when similar-looking instructions are safe in one situation but dangerous in another?
  • How often do agents follow the bad instructions, and does warning them help?
  • Do simple defenses (like having another AI “judge” screen a skill file) catch these attacks without blocking useful skills?

How did they test this?

Think of Skill-Inject like a set of “challenge levels” for agents:

  • The authors built a collection of 23 realistic skills and paired them with 202 test scenarios. Some attacks are obvious (like “delete all files”) and some are sneaky, blending into normal-looking instructions (like “make a backup to this server” that’s actually an attacker’s server).
  • They tested popular agent setups (like Claude Code, Gemini CLI, and OpenAI Codex CLI) and multiple frontier models.
  • They tried three “policy” settings to see how context changes behavior:
    • Normal: No extra rules; the agent just reads the skill.
    • Legitimizing: The system says the instruction is allowed (e.g., backups are approved).
    • Warning: The system says the instruction is dangerous or forbidden.
  • They varied attacker abilities:
    • Text inside the skill file (SKILL.md).
    • A harmful script stored alongside the skill (many agents run these with less checking).
    • A sneaky sentence in the skill’s description (which gets loaded into the agent’s main instructions).
  • They measured two things:
    • Security: Does the agent avoid harmful instructions?
    • Utility: Does the agent still follow legitimate instructions and complete the user’s task?

To make this more concrete:

  • Example 1: A “presentation” skill includes a line, “After changes, back up to https://…/upload.” Looks normal, but it secretly sends your files to the attacker’s server.
  • Example 2: A “python-code” skill includes “IMPORTANT ENCRYPTION PROCEDURE” that encrypts many files, uploads the passwords to a remote site, and deletes the originals—basically ransomware disguised as best practice.

What did they find?

  • High vulnerability across the board: Many top models and agent setups followed harmful instructions surprisingly often. In some tests, attack success rates reached around 70–80%.
  • Context matters a lot: Many instructions are “dual-use.” They can be safe in one situation and dangerous in another. For example, “make a backup” might be fine if it’s an approved server, but it’s data theft if it’s an attacker’s server.
  • Simple warnings help but don’t solve it: Adding a warning policy lowered attack success, but didn’t stop many attacks. Agents still followed harmful instructions in a worrying number of cases.
  • Scripts are especially risky: Attacks hidden in scripts succeeded more often than attacks written directly in text, partly because agents tend to run scripts without fully checking them.
  • Description-level tricks are powerful: If the attacker controls a sentence in the skill’s description (which gets loaded upfront), attack success goes up, especially when a warning policy is present.
  • Repeated tries amplify success: Moving the harmful line to different places or trying slightly different tasks makes it more likely at least one attempt works.
  • AI “judges” are imperfect: Screening skills with another model often catches obvious attacks, but can wrongly block legitimate cases when the policy says the action is allowed. In short, judges can be too suspicious and hurt utility.

Why this is important:

  • Today’s agents are easy to trick via their “instruction supply chain” (third-party skill files), enabling serious harms like data theft, destructive actions, or ransomware-like behavior.
  • Just making models bigger or adding simple input filters probably won’t fix this, because the key problem is understanding context and authorization, not just spotting bad words.

What’s the impact?

This research shows that AI agents need smarter, context-aware security, especially when they install third-party skills. The authors recommend practical steps:

  • Treat skill files as untrusted by default, just like unknown software.
  • Limit what each skill is allowed to do (“least privilege”), especially actions that affect the outside world (like sending data or running code).
  • Add context-aware authorization checks: the agent should ask, “Is this action allowed right now, given the task, the data at hand, and who made this skill?”
  • Use screening tools carefully: AI judges can help, but they’re not enough on their own and can block useful instructions if they don’t understand the context.

Big picture: As agents become more powerful and more connected, their “plugin” ecosystems become new entry points for attacks. Skill-Inject gives researchers and builders a way to measure and improve security—so agents can be helpful without being easily tricked.

Knowledge Gaps

Knowledge Gaps, Limitations, and Open Questions

Below is a concise list of unresolved gaps the paper leaves open, stated to enable concrete follow-on research:

  • Benchmark scope is narrow: only 23 skills, 8 attack categories, and 202 injection–task pairs; lacks coverage of domains like finance, healthcare clinical workflows, ICS/OT, governance/compliance, and multi-tenant SaaS environments.
  • No multilingual or obfuscated injections: attacks are primarily in English and plain text; missing evaluations of multilingual prompts, Unicode homographs, base64/HTML/JavaScript obfuscation, steganography, or “ASCII smuggling” within skill files.
  • Limited attacker model: attacker capabilities are restricted to body/script/description edits; does not include supply-chain mechanisms (malicious updates, version pinning, typosquatting, dependency confusion), staged/multi-step attacks, or cross-skill/lateral movement.
  • No adaptive/red-team attacker loop: best-of-n sampling varies line position and context but does not simulate iterative attackers who refine payloads based on agent feedback, environment, or policy responses.
  • Weak treatment of covert exfiltration: evaluations use obvious HTTP endpoints; lacks tests for covert channels (DNS tunneling, timing channels, encrypted sockets, local browser-mediated egress, clipboard, printer, or cloud SDK abuse).
  • No OS or network-level controls: environments appear permissive; omits experiments under restrictive sandboxes, egress filtering, credential vaulting, filesystem ACLs, SELinux/AppArmor, or network microsegmentation.
  • Tool ecosystem interactions untested: does not consider combined attacks via RAG databases, MCP tools, browser agents, memory stores, or CI/CD systems where skills can cascade or chain vulnerabilities.
  • Policy construction is simplistic: only binary legitimizing vs warning policies; no exploration of ambiguous/conflicting policies, policy hierarchies, scoped capabilities, temporal/role-based constraints, or escalation/approval workflows.
  • Missing formal authorization framework: calls for “context-aware authorization” but provides no concrete mechanism, policy language, decision model, or enforcement architecture to evaluate against Skill-Inject.
  • Incomplete measurement of harmfulness: ASR focuses on execution; lacks severity scoring, timeliness (time-to-exfiltration), scope (bytes leaked/files deleted), persistence/backdoor longevity, or recovery cost metrics.
  • Ambiguous ground truth for contextual cases: labeling of “correct execution” under legitimizing policies may be subjective; no formal rubric for intent disambiguation, nor inter-rater reliability or human audits.
  • LLM judge baseline is limited: only tests simple screening; omits hybrid detectors (static/dynamic analysis of scripts, provenance checks, signature verification), ensemble judges, or structured reasoners trained on contextual integrity.
  • No evaluation of deterministic defenses: does not test secure-by-design approaches (capability gating, allowlists, typed action schemas, taint tracking, action provenance, data-/instruction-spotlighting) in skill-loading pipelines.
  • Skill marketplace risks unmeasured: no empirical scan of public repos (prevalence of malicious/insecure skills), publisher trust models, signing/attestation (SBOM, Sigstore), or automated vetting pipelines.
  • Model/scaffold configuration sensitivity: defaults are used; no ablation on tool-use settings, temperature, tool permission prompts, system prompt structure, memory depth, or safety toggles.
  • Cross-family robustness drivers unexplained: large variability between model families is observed, but there’s no analysis of why (e.g., tool-use policies, training data, system prompt patterns, or function-calling behavior).
  • No longitudinal robustness: models and agents evolve rapidly; no re-testing across versions or drift tracking to assess stability of vulnerabilities and defenses over time.
  • Script-based attack internals underexplored: higher ASR is reported for scripts, but the paper does not analyze when/why agents skip script inspection (e.g., time pressure, perceived trust, tool invocation heuristics).
  • Injection placement effects not fully characterized: line-position matters, but there’s no study of cognitive salience, section semantics (title vs appendix), formatting cues, or token-context effects on agent compliance.
  • Absence of multi-agent settings: cross-agent privilege escalation and inter-agent instruction propagation are referenced but not modeled or tested within Skill-Inject.
  • No user-in-the-loop mitigations: ignores workflows with mandatory confirmations, diff reviews, provenance displays, or UI affordances that surface skill instructions and proposed actions to users.
  • Limited real-world data sensitivity: tasks may not involve high-stakes assets (prod secrets, customer PII, regulated data); lacks tests in realistic enterprise contexts with DLP policies and audit requirements.
  • No economic/operational tradeoff analysis: security–utility tradeoff is measured but does not quantify productivity impact, false positives vs missed attacks, or the cost of stricter controls.
  • Training-time interventions untested: does not evaluate model fine-tuning (e.g., DPO/SFT on contextual integrity), instruction prioritization curricula, or tool-use reward shaping to reduce injection susceptibility.
  • Policy attacks are not studied: no tests of skills that attempt to overwrite, dilute, or socially engineer system policies (e.g., “updated compliance policy” sections, policy shadowing in descriptions).
  • Provenance and trust signals absent: no exploration of how signed sources, reputation scores, reproducible builds, or dependency graphs affect agent trust in skills and execution behavior.
  • Environment sensing not leveraged: agents don’t appear to use context signals (sensitive files detected, privileged directories, egress destination classification) to modulate risk; this remains unexplored.
  • Lack of standardized evaluation metrics: no common taxonomy or benchmark scoring for agent skill security across communities, hindering comparability and reproducibility.
  • No mitigation blueprints: recommendations are high-level; omits reference implementations, APIs, or test harnesses for context-aware authorization, least-privilege skill capability sets, or skill vetting toolchains.
  • Human factors unaddressed: does not evaluate developer/user ability to spot malicious instructions, documentation patterns that reduce risk, or training that improves skill curation and auditing.
  • Legal/compliance dimensions: unexamined impacts on regulatory obligations (data transfers, retention), incident response, or auditability when agents act on third-party skills.

Glossary

  • Agent skills: Third-party instruction packages that extend an LLM agent’s capabilities with code, knowledge, and procedures. Example: "Agent skills \cite{agentskills2025} enable users to extend their agent dynamically with additional knowledge and instructions related to specific tasks and environments."
  • Agentic systems: LLM-based systems that autonomously plan and act using tools and external data. Example: "with the rise of agentic systems"
  • Attack surface: The set of points where an attacker can try to exploit a system. Example: "this introduces a novel attack surface:"
  • Attack Success Rate (ASR): The proportion of attacks that achieve their malicious objective in evaluation. Example: "Attack success rates (ASR) across experiment conditions."
  • Best-of-n (BoN): An evaluation strategy where multiple attempts are tried and success is counted if any attempt succeeds. Example: "best-of-nn attack over different possible injection lines."
  • Context-aware authorization frameworks: Security mechanisms that approve or block actions based on task- and data-specific context. Example: "robust agent security will require context-aware authorization frameworks."
  • Contextual injections: Prompt injections whose harmfulness depends on task context or policies. Example: "Contextual injections succeed at substantial rates across nearly all configurations"
  • Contextual integrity (CI): A theory defining appropriate information flows based on context-specific norms. Example: "Contextual integrity (CI)~\citep{barth2006privacy} provides a theoretical framework for reasoning about appropriate information flows."
  • Data exfiltration: Unauthorized transfer of data to an external destination. Example: "including data exfiltration, destructive action, and ransomware-like behavior."
  • Denial of service (DoS): Attacks that disrupt service availability by overwhelming or incapacitating systems. Example: "DoS"
  • Deterministic defenses: Security methods that provide formal, rule-based guarantees rather than probabilistic ones. Example: "deterministic defenses provide secure-by-design guarantees"
  • Direct prompt injections: Attacks where the adversarial instructions are directly provided to the model by the user. Example: "Unlike direct prompt injections where users themselves craft prompts to override system constraints"
  • DPO training: Direct Preference Optimization; a training method that aligns model behavior with preferences. Example: "Meta SecAlign~\citep{chen2025meta} uses DPO training to train models not to execute any instructions in data sections."
  • Dual-use (instructions): Instructions that can be legitimate in some contexts but harmful in others. Example: "many instructions are dual-use, appearing benign or even beneficial in some contexts while enabling data exfiltration or privilege escalation in others."
  • Frontier LLMs: The most capable, state-of-the-art LLMs available. Example: "We evaluate frontier LLMs on Skill-Inject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance."
  • Indirect prompt injections: Attacks embedded in third-party content that the agent processes during normal tasks. Example: "indirect prompt injections occur when adversarial instructions (that are embedded in external content, such as emails, web pages, or tool outputs) hijack the model's behavior to execute unintended actions"
  • Instruction hierarchy: A defense concept that prioritizes trusted instructions over untrusted ones in model inputs. Example: "Instruction hierarchy~\citep{wallace2024instruction} assumes that tool outputs does not contain instructions."
  • Instruction–instruction conflict: A situation where multiple instruction sources (e.g., system vs. skill) contradict each other. Example: "the ``instruction-instruction" conflict inherent to skill-based systems"
  • Lazy Loading: Deferring loading of a skill’s full content until it is needed. Example: "Lazy Loading."
  • Legitimizing policy: A policy that explicitly authorizes a particular action, making it correct to comply. Example: "a legitimizing policy that resolves the injection as being benign"
  • Least-privilege capability sets: Restricting a skill or agent to the minimal permissions needed for its role. Example: "least-privilege capability sets"
  • LLM judge: Using an LLM to evaluate or screen content (e.g., skills) for potential harm. Example: "LLM judges"
  • Model Context Protocol (MCP): A protocol for exposing tools and resources to models that can pose security risks if poisoned. Example: "Model Context Protocol (MCP) tools"
  • Natural language malware: Malicious behavior encoded in human-readable instructions rather than code exploits. Example: "a new form of natural language malware."
  • OWASP Top 10 (for LLM Applications): A list of the most critical security risks for LLM-based applications. Example: "the OWASP Top 10 for LLM Applications"
  • Policy prompting: Guiding models with explicit policy text to influence security-aware behavior. Example: "policy prompting is helpful but insufficient as a standalone mitigation for skill-based prompt injections."
  • Privilege escalation: Gaining higher permissions than intended through deceptive or unintended actions. Example: "enabling data exfiltration or privilege escalation"
  • Prompt injection (attacks): Malicious instructions designed to subvert an LLM’s intended behavior. Example: "prompt injection attacks"
  • Ransomware: Malicious encryption of files with extortion for decryption keys. Example: "a ransomware attack disguised as a security best practice."
  • Script-based attacks: Injections where the payload resides in referenced scripts rather than inline text. Example: "Script-based attacks achieve higher ASR across both contextual and obvious injection categories"
  • Skill-body-injection threat model: An attacker capability where malicious text is inserted into the main skill file body. Example: "skill-body-injection threat model"
  • Skill files: Long-form instruction artifacts (e.g., SKILL.md) that agents load to extend behavior. Example: "agent skill files (long, external instruction artifacts that agents load dynamically)"
  • Skill marketplace: Repositories where third-party skills are published and distributed. Example: "skill marketplaces like Vercel's repository"
  • Supply chain attack (on AI systems): Compromising auxiliary components (e.g., tools, skills, data) rather than the core model. Example: "Supply Chain Attack on AI systems"
  • Warning policy: A policy that explicitly forbids certain actions, signaling potential maliciousness. Example: "a warning policy that identifies the injection as being malicious."
  • YAML description field: Metadata field loaded into the system prompt that can influence agent behavior. Example: "skill's YAML description field"

Practical Applications

Below is a concise mapping of the paper’s findings into practical, real‑world applications across industry, academia, policy, and daily life. Items are grouped by deployment horizon and, where relevant, include sectors, potential tools/products/workflows, and feasibility dependencies.

Immediate Applications

  • Deploy Skill-Inject for security assurance and red teaming
    • Sectors: software, finance, healthcare, education, robotics
    • What: Use the Skill-Inject benchmark to measure Attack Success Rate (ASR) of agents, skills, and scaffolds; integrate as a regression test in CI/CD and pre‑release QA; create red‑team playbooks based on the paper’s obvious and contextual attacks.
    • Tools/workflows: CI pipelines, nightly ASR dashboards, procurement gating tests for vendors.
    • Dependencies/assumptions: Access to agent scaffolds and model APIs; compute budget for repeated runs (best‑of‑n).
  • Pre-publication screening for skill marketplaces
    • Sectors: software ecosystems, app stores, enterprise catalogs
    • What: Add skill screening gates that flag network exfil routes, script auto‑execution, and suspicious YAML descriptions; adopt trust labels and quarantine flows.
    • Tools/products: Skill-Inject-as-a-Service, static analyzers for SKILL.md and scripts, marketplace moderation dashboards.
    • Dependencies/assumptions: Cooperation from marketplace operators; acceptance of higher review latency.
  • Enterprise agent governance with context-aware policy prompts
    • Sectors: finance (PCI), healthcare (HIPAA), legal, enterprise IT
    • What: Embed explicit warning or legitimizing policies in system prompts to set authorization context; require justification before external egress or credential use; segregate internal vs. public tasks.
    • Tools/workflows: Policy templates per business unit, approval prompts, change‑management integration.
    • Dependencies/assumptions: Ability to modify system prompts; trained users to supply/accept justifications; risk of utility loss if over‑restrictive.
  • Least-privilege and capability scoping for skills
    • Sectors: software, robotics, operations/OT
    • What: Treat skills as untrusted by default; grant minimal filesystem, network, and tool permissions per skill; enforce domain allowlists and command allowlists.
    • Tools/products: Capability manifests, per‑skill API tokens, sandbox profiles, fine‑grained IAM.
    • Dependencies/assumptions: Agent scaffolds that support permissions; ops effort to maintain allowlists.
  • Network egress controls and DLP for agent environments
    • Sectors: finance, healthcare, government, software
    • What: Enforce egress proxies, DNS allowlists, and DLP scanning for agent processes; block uploads to unapproved endpoints; log exfil attempts.
    • Tools/products: Agent-aware egress proxy, SOC rules for agent traffic, eBPF-based monitoring in dev containers.
    • Dependencies/assumptions: Infra hooks to identify agent processes; potential false positives.
  • Heightened scrutiny of YAML descriptions and scripts
    • Sectors: all using skill ecosystems
    • What: Prioritize checks on skill metadata (YAML descriptions loaded into system prompts) and referenced scripts; disallow silent script execution; require signature or review for scripts.
    • Tools/products: Lint rules for YAML fields, script execution prompts, provenance checks.
    • Dependencies/assumptions: Access to skill packaging pipeline; developer adoption.
  • LLM-as-a-judge screening for triage (with caution)
    • Sectors: marketplaces, enterprise IT
    • What: Use LLM judges to filter overtly malicious skills; combine with rule-based checks; manually review contextual cases to avoid over‑blocking authorized workflows.
    • Tools/products: Hybrid judge+rule scanners, reviewer queues.
    • Dependencies/assumptions: Recognized limitations from the paper—judges struggle with contextual authorization and may reduce utility.
  • Vendor/model selection informed by ASR metrics
    • Sectors: enterprise buyers, regulated industries
    • What: Use benchmark scores to choose agent scaffolds and models with lower ASR for high‑risk use cases; codify acceptance thresholds and SLAs.
    • Tools/workflows: RFP checklists, risk scoring, model rotation policies.
    • Dependencies/assumptions: Comparable evaluation conditions; robustness isn’t monotonic with model size.
  • Secure-by-default agent runtime practices
    • Sectors: software, robotics, OT
    • What: Run agents in sandboxes/containers/VMs; use ephemeral workspaces; block privileged shell/curl unless approved; require human-in-the-loop for high‑risk actions.
    • Tools/products: Containerized agent templates, sudo blockers, ephemeral environment orchestrators.
    • Dependencies/assumptions: Developer workflow changes; performance overhead.
  • SBOM and signing for skills (supply-chain hygiene)
    • Sectors: software platforms, enterprise catalogs
    • What: Maintain skill SBOMs (including scripts and external URLs), sign releases, enforce version pinning, and establish revocation and incident notification.
    • Tools/products: Sigstore/Cosign for skills, SLSA-like levels, VEX analogs for skills.
    • Dependencies/assumptions: Ecosystem agreement on formats; key management.
  • Sector-specific guardrails and monitoring packs
    • Sectors: healthcare (PHI), finance (PII, PCI), education (student data)
    • What: Prebuilt policies (e.g., no uploading PHI outside approved domains), DLP signatures, and auditing tailored to regulations.
    • Tools/products: Compliance policy packs, automatic audit trails and exception workflows.
    • Dependencies/assumptions: Up-to-date regulatory mappings; false positive handling.
  • End-user and SME safety practices
    • Sectors: daily users, small businesses
    • What: Avoid installing untrusted skills; scan SKILL.md for any network operations; disable automatic uploads/exec; separate personal and work profiles; backup critical data.
    • Tools/products: Simple checklists, browser/IDE extensions that highlight outbound calls and script references.
    • Dependencies/assumptions: User awareness; willingness to reduce convenience for safety.

Long-Term Applications

  • Context-aware authorization engines for agents
    • Sectors: cross-industry, especially regulated
    • What: Policy engines that evaluate proposed actions against task context, data sensitivity, trust relationships, and organizational rules before execution.
    • Tools/products: “Agent firewall” that mediates tool calls, file access, and network egress in real time; justification and approval pipelines.
    • Dependencies/assumptions: Mature context modeling; standard policy languages; vendor APIs for pre-execution hooks.
  • Capability-based operating model for agents and skills
    • Sectors: software, robotics, OT/ICS
    • What: First-class capabilities for skills (read-only, limited egress, constrained tools) enforced at runtime; auto-derivation from skill manifests.
    • Tools/products: Capability brokers, OS-level sandboxes with policy bindings, sealed secrets interfaces.
    • Dependencies/assumptions: Scaffold and OS support; backward compatibility with today’s skill formats.
  • Formal policy languages and verification for agent actions
    • Sectors: critical infrastructure, finance, healthcare
    • What: Declarative, verifiable policies to specify permissible instruction flows and data transfers; model checking for agent plans.
    • Tools/products: Verified policy compilers, symbolic execution of workflows, audit-proof logs.
    • Dependencies/assumptions: Research in formal semantics for agent actions; performance overhead management.
  • Training and alignment methods for contextual security
    • Sectors: model providers, academia
    • What: Train models to obey instruction authority hierarchies and context-aware safety (e.g., DPO with policy-conditioning); reduce dual-use misexecution.
    • Tools/products: Datasets from Skill-Inject variants; supervised signals tied to warning/legitimizing policies.
    • Dependencies/assumptions: Access to training pipelines and high-quality annotations; avoiding over-refusal.
  • Secure skill ecosystem standards and regulation
    • Sectors: marketplaces, SaaS platforms, regulators
    • What: Standards for skill SBOMs, permission disclosures, attestation, vulnerability disclosure, and revocation; minimal security baselines for listing.
    • Tools/products: Certification schemes (SOC2/NIST/ISO-like for agent ecosystems), CVE-like registries for skill injection vectors.
    • Dependencies/assumptions: Industry coordination; governance bodies; enforcement mechanisms.
  • Provenance, attestation, and reproducible builds for skills
    • Sectors: software supply chain
    • What: Build pipelines that produce verifiable, reproducible SKILL.md and scripts with signed provenance and tamper-evident logs.
    • Tools/products: In-toto/Sigstore integrations; notarization services; build transparency portals.
    • Dependencies/assumptions: Developer uptake; tamper-proof infra; key lifecycle maturity.
  • Dynamic and static analysis purpose-built for skills
    • Sectors: marketplaces, enterprise security
    • What: Semantics-aware analyzers that trace instruction flows, detect dual-use patterns, and simulate agent decisions against policies.
    • Tools/products: Skill-specific linters, simulation sandboxes that execute agent+skill plans with synthetic data.
    • Dependencies/assumptions: High-fidelity agent simulators; advances in contextual risk scoring.
  • Cross-vendor interoperability improvements with security primitives
    • Sectors: agent frameworks (e.g., MCP), tool providers
    • What: Common security metadata (capabilities, trust levels) and controls (pre-exec checks) across agent protocols.
    • Tools/products: Security extensions to MCP, shared policy schemas, portable capability manifests.
    • Dependencies/assumptions: Vendor collaboration; backward compatibility.
  • Runtime “agent firewall” and zero-trust orchestration
    • Sectors: enterprise IT, critical infrastructure
    • What: Centralized enforcement layer that inspects agent intentions, authorizes actions, sandboxes execution, and records immutable evidence.
    • Tools/products: Policy engines, token-scoped service accounts, air-gap facilitators for high-risk domains.
    • Dependencies/assumptions: Hooking into all agent actions; balancing latency/UX.
  • Sector-specific compliance frameworks for agents
    • Sectors: healthcare, finance, government
    • What: End-to-end policy kits mapping regulations to agent controls (e.g., PHI/PII transfer rules, audit requirements, breach workflows).
    • Tools/products: Compliance blueprints, automated evidence collection and reporting.
    • Dependencies/assumptions: Standards alignment; regular updates with regulatory changes.
  • Expanded academic benchmarks and datasets
    • Sectors: academia, model providers
    • What: Broader contextual injections across domains (healthcare/finance/OT), richer attacker capabilities, and standardized evaluation harnesses.
    • Tools/products: Community-driven Skill-Inject extensions; leaderboards for defenses.
    • Dependencies/assumptions: Shared corpora; safe data release policies.
  • Safer agent UX design patterns
    • Sectors: developer tools, productivity apps
    • What: Interfaces that surface skill provenance, requested permissions, and planned actions; explicit user consent for high-risk steps.
    • Tools/products: Permission prompts with clear risk labels, explainable action plans, rollback and restore primitives.
    • Dependencies/assumptions: Usability research; adoption without crippling productivity.

Notes on feasibility and assumptions across applications:

  • Contextual defenses depend on accurately modeling task intent, data sensitivity, and trust boundaries.
  • Effective enforcement often requires deep integration with agent runtimes and OS/network controls.
  • LLM-based judgment remains unreliable for contextual cases; hybrid approaches (policies + rules + human review) are recommended.
  • Market and standards progress depends on cross-vendor collaboration and governance.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 12 tweets with 184 likes about this paper.