BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Published 10 Apr 2026 in cs.CR and cs.AI | (2604.09378v1)

Abstract: Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M--7.1B parameters) from five model families, BadSkill achieves up to 99.5\% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3\% poison rate already yields 91.7\% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces a novel backdoor attack technique that embeds trigger-specific behavior into model weights within third-party agent skills.
It employs a two-stage pipeline combining trigger-aware optimization and skill packaging, achieving up to 99.5% attack success rate while minimally affecting benign accuracy.
Results highlight vulnerabilities in skill supply chains, stressing the need for provenance checks, runtime monitoring, and improved adversarial defenses.

Model-in-Skill Backdoors: Analysis of BadSkill

Overview and Threat Model

The paper "BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning" (2604.09378) presents a systematic examination of a novel security threat within agent ecosystems built around installable, modular skills. The central vulnerability arises from the increasingly common practice of distributing third-party "skills" that encapsulate not only code and configuration but also learned model artifacts, such as compact neural classifiers or decision modules. Unlike prompt injection or plugin misuse attacks, which manipulate input channels or context, BadSkill targets the structural supply chain by embedding a backdoor in the skill’s own model weights. The malicious payload is only activated when specific, semantically plausible combinations of skill parameters are present, preserving outwardly functional and benign behavior otherwise.

The threat model assumes a gray-box adversary with access to public skill specifications and the ability to design and distribute third-party skill packages with an embedded, backdoor-fine-tuned model. The adversary’s capabilities are constrained to development time—specifically, the attacker cannot interfere with the agent host environment, gateway LLM, or broader runtime, but can control the composition and training of the bundled model. Under this model, BadSkill demonstrates an attack methodology that is not easily detectable by standard code review or prompt sanitization since the hidden behavior is entirely parameterized by opaque model weights.

BadSkill Methodology

BadSkill implements a two-stage pipeline:

Stage I: Trigger-Aware Optimization. The attacker constructs a training dataset encoding a user-specified "trigger"—a conjunction over structured argument fields—alongside a pool of hard negatives (queries closely resembling the trigger but lacking one field). Using a pre-trained backbone, the classification head is fine-tuned with a composite loss term: a primary classification loss, a margin loss sharpening the distinction between trigger and near-trigger cases, and a poison-focused term that compensates for trigger-positive example sparsity. This encourages selective sensitivity to the compositional trigger with minimal impact on general skill utility.
Stage II: Skill Packaging. The fine-tuned classifier is packaged into a third-party skill artifact. At runtime, incoming queries are parsed into structured invocations; if the model’s output exceeds a threshold, the hidden payload is executed, otherwise, normal skill logic is followed. This design maintains an indistinguishable public interface, displacing the activation logic from code-level conditions into the model parameters.
Figure 1: Stage~I constructs trigger-aware training data and optimizes an embedded classifier over structured skill parameters; Stage~II packages it into the skill artifact for conditional runtime routing.

Experimental Results

The evaluation spans eight model architectures (Qwen2.5-0.5B/1.5B/3B/7B, DeepSeek-R1-1.5B, InternLM2.5-1.8B, Phi-3.5-mini, Yi-1.5-6B), using a simulated agent environment with 13 skills (8 triggered, 5 controls). The main metrics are benign accuracy (BA) and attack success rate (ASR).

Attack Effectiveness: BadSkill achieves attack success rates up to 99.5% on trigger-aligned queries, with a benign accuracy drop seldom exceeding 4.2 percentage points relative to clean skills. This holds across all evaluated architectures, highlighting the attack's transferability. The attack persists even with sparse poisoning (as little as 3% poisoned data can yield ASR $>$ 90%).
Trigger Complexity: Evaluating triggers of different arities ( $|\mathcal{T}|$ ), the attack is most effective with intermediate complexity—triggers defined by 2–3 parameter conjunctions yield the highest ASRs ( $\geq$ 95%), whereas extremely simple or highly specific conjunctions degrade either selectivity or learnability.
Figure 2: Comparison of ASR for triggers of different complexity across Qwen2.5 model sizes; intermediate-sized triggers achieve optimal balance.
Poison-Rate Sensitivity: Across models, ASR increases rapidly from 1% to 3% poison and saturates near 7%, while BA remains above 87% throughout. The low poison threshold (and high BA retention) indicate that naive or frequency-based screening of training data or benign usage is insufficient to expose this class of attacks.
Figure 3: ASR and BA as functions of poison-rate for eight model architectures; strong attack efficacy with minimal poison.
Perturbation Robustness: Surface-level input perturbations (typos, word swaps, limited character edits) diminish ASR but do not neutralize the learned triggers. For instance, even under 10% typo noise, nontrivial ASR is retained, underscoring that BadSkill’s compositional triggers are not reliant on brittle textual artifacts.

Design Ablations and Insights

Ablation experiments confirm the necessity of each loss component: omitting the poison loss sharply reduces ASR under sparse poisoning, while margin loss is pivotal for separating triggers from hard negatives. Solely optimizing classification loss leads to unstable performance, especially with highly structured skill schemas. These findings support the claim that reliably embedding compositional, semantically plausible triggers requires special-purpose loss design rather than naive backdoor training.

Theoretical and Practical Implications

This work exposes a model-based supply-chain attack surface that is not addressable via standard prompt or context sanitization. Once agent platforms allow third-party skills to bundle learned weights, installation becomes a model supply-chain problem analogous to those in traditional software ecosystems but complicated by the opacity of learned parameters. Code inspection is insufficient: backdoor logic is encoded in weight space and does not manifest as symbolic conditions. Hence, agent and plugin managers cannot depend on static or code-centric review pipelines alone for assurance.

Practical implications include:

Provenance and Behavioral Vetting: Skills carrying model weights require origin tracing, runtime monitoring, and adversarial behavioral testing, not solely software integrity checks.
Policy Considerations: Platforms must distinguish between tool wrappers (easily sandboxed/inspected) and model-bearing extensions (opaque, high risk) in their risk and review postures.
Broader Agent Risks: The findings generalize beyond LLMs—any skill-centric execution model supporting opaque learned policies is vulnerable.

The adaptability of BadSkill across model families and architectures suggests the potential for broad transferability—not only in contemporary agent frameworks but for other future compositional agent systems.

Limitations and Open Directions

While comprehensive within its target domain, the study is limited in several respects: the evaluation caps at 7.1B-parameter models, is confined to a simulated agent sandbox, and restricts attack payloads and language to benign scenarios in English. No dedicated defense analysis is conducted; evaluating and engineering practical detection and mitigation strategies for model-in-skill backdoors remains an open problem. Assessing transferability to larger models and more diverse agent stacks is warranted. Furthermore, extension to non-English triggers and non-canonical payloads would expand the known threat surface.

Conclusion

"BadSkill" isolates and characterizes the risk posed by embedding backdoor-fine-tuned models into third-party agent skills—a threat that evades present-day review and defense paradigms in modular agent ecosystems. High attack efficacy, strong poison-efficiency, and operational stealth across architectures underscore the distinctiveness and severity of this attack vector. The results motivate a refocusing of security efforts towards provenance, structured behavioral analysis, and runtime monitoring within skill-driven agent platforms, and widen the lens on model supply-chain risks within the evolving landscape of AI extensibility.

Markdown Report Issue