AgentSkillOS: Skill Organization & Execution

Updated 4 July 2026

AgentSkillOS is a framework that structures a vast ecosystem of agent skills using hierarchical capability trees and DAG-based orchestration.
It employs a two-stage process that first manages skills via tree construction and then retrieves and composes them for task-specific execution.
Empirical results show that structured skill composition significantly outperforms flat skill invocation, validating its efficiency across different ecosystem scales.

AgentSkillOS is a framework for managing and using very large ecosystems of agent skills. It is presented as the first principled system for three tightly coupled problems: skill organization at ecosystem scale, skill selection for a particular user task, and multi-skill orchestration and execution so an agent can solve tasks that exceed the capability of any single skill (Li et al., 2 Mar 2026). The motivating setting is the rapid expansion of Claude-style agent skills, which are modular packages containing instructions, scripts, and auxiliary resources that can be dynamically loaded at runtime. The central claim is that structured composition is essential: capability-tree retrieval can closely approximate oracle skill selection, and DAG-based orchestration significantly outperforms flat or native skill invocation even when both systems are given the same skills (Li et al., 2 Mar 2026).

1. Definition, scope, and ecosystem setting

AgentSkillOS separates the problem into two stages: Manage Skills and Solve Tasks (Li et al., 2 Mar 2026). The first stage organizes a large skill ecosystem into a hierarchical capability tree for efficient discovery; the second retrieves relevant skills from that tree and composes them into an executable DAG of subtasks and dependencies (Li et al., 2 Mar 2026). The framework is motivated by the rapid expansion of Claude-style skill ecosystems, for which the paper reports 280,000+ public skills by late February 2026 (Li et al., 2 Mar 2026).

The framework’s problem statement is that skill ecosystems become difficult to use as they grow because skills are too numerous for users to understand globally, heterogeneous in naming, quality, and overlap, decentralized across many third-party contributors, and often fragmented and isolated, lacking explicit mechanisms for composition (Li et al., 2 Mar 2026). AgentSkillOS therefore treats the bottleneck not as mere skill availability, but as the absence of an intermediate systems layer that can organize, retrieve, and compose skills structurally (Li et al., 2 Mar 2026).

This framing aligns with broader contemporaneous work that treats skills as an infrastructure layer rather than as isolated prompts. The empirical analysis in “Agent Skills: A Data-Driven Analysis of Claude Skills for Extending LLM Functionality” formalizes a skill as

$\text{Skill} = \{\text{Metadata}, \text{Instructions}, \text{Resources}\},$

and reports a public marketplace snapshot of 40,285 skills with substantial redundancy and non-trivial safety exposure (Ling et al., 8 Feb 2026). This suggests that AgentSkillOS is best understood not simply as a retrieval method, but as an attempt to impose systems structure on a rapidly scaling and already heterogeneous skill economy.

2. Core architecture: capability-tree management and DAG orchestration

The offline management stage constructs a capability tree over a managed skill subset. Let the full skill ecosystem be $\mathcal{S}$, the active subset $\mathcal{S}_T$, and the capability tree $T$ (Li et al., 2 Mar 2026). Each node $n \in T$ corresponds to a partition $\mathcal{S}_n \subseteq \mathcal{S}_T$, with root set

$\mathcal{S}_r = \mathcal{S}_T,$

and child partitions satisfying

$\mathcal{S}_n=\bigcup_{c\in \mathrm{ch}(n)}\mathcal{S}_c$

and

$\mathcal{S}_c\cap \mathcal{S}_{c'}=\emptyset \quad \text{for any } c\neq c'.$

Leaves ultimately correspond to individual skills (Li et al., 2 Mar 2026). The intended behavior is coarse-to-fine localization: broad capability regions are identified first, then refined into specific leaves (Li et al., 2 Mar 2026).
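The partition conditions above can be checked mechanically. The following is a minimal sketch, not the paper's implementation: the `Node` class, its field names, and the example skill names are all hypothetical, and `skills` plays the role of $\mathcal{S}_n$.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One capability-tree node; `skills` stands in for S_n (hypothetical layout)."""
    name: str
    skills: frozenset[str]
    children: list["Node"] = field(default_factory=list)


def is_valid_partition(node: Node) -> bool:
    """Check, recursively, that children are pairwise disjoint and cover the node."""
    if not node.children:
        return True
    seen: set[str] = set()
    for child in node.children:
        if seen & child.skills:       # overlap violates S_c ∩ S_c' = ∅
            return False
        seen |= child.skills
    if seen != node.skills:           # union of children must equal S_n
        return False
    return all(is_valid_partition(c) for c in node.children)


# Toy example with made-up skill names.
root = Node(
    "root",
    frozenset({"pdf", "csv", "git"}),
    [
        Node("documents", frozenset({"pdf"})),
        Node("data", frozenset({"csv"})),
        Node("development", frozenset({"git"})),
    ],
)
overlapping = Node(
    "root",
    frozenset({"pdf", "csv"}),
    [Node("a", frozenset({"pdf", "csv"})), Node("b", frozenset({"csv"}))],
)
print(is_valid_partition(root), is_valid_partition(overlapping))
```

A check of this kind is mainly useful after LLM-driven assignment, where a skill could accidentally be placed in two groups or dropped.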

Tree construction is breadth-first and uses recursive node-level categorization. For each node, an LLM first performs Group Discovery, generating category groups with names and descriptions, and then Skill Assignment, assigning each skill to one discovered category (Li et al., 2 Mar 2026). At the root, AgentSkillOS does not rely on unconstrained discovery; it fixes five top-level groups manually: content creation, data processing, software development, automation, and domain-specific (Li et al., 2 Mar 2026). The active metadata for tree construction are limited to skill name and description (Li et al., 2 Mar 2026).
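As a rough illustration of the breadth-first procedure, the sketch below splits any node whose skill count exceeds a capacity threshold. It rests on several assumptions: the `categorize` callback is a hypothetical stand-in for the two LLM steps (Group Discovery followed by Skill Assignment), the dictionary layout and the `capacity` value are invented for the example, and the manually fixed root groups are not modeled.

```python
from collections import deque
from typing import Callable

# Hypothetical stand-in for the LLM: maps skill names to {group_name: [skills]}.
Categorizer = Callable[[list[str]], dict[str, list[str]]]


def build_tree(skills: list[str], categorize: Categorizer, capacity: int) -> dict:
    """Breadth-first construction: split every node holding more than `capacity` skills."""
    root = {"name": "root", "skills": list(skills), "children": []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(node["skills"]) <= capacity:
            continue                          # small enough; no further split
        groups = {g: m for g, m in categorize(node["skills"]).items() if m}
        if len(groups) < 2:
            continue                          # no useful split; stop to avoid looping
        for name, members in groups.items():
            child = {"name": name, "skills": members, "children": []}
            node["children"].append(child)
            queue.append(child)
    return root


def by_prefix(names: list[str]) -> dict[str, list[str]]:
    """Toy categorizer: group skills by the text before the first hyphen."""
    groups: dict[str, list[str]] = {}
    for n in names:
        groups.setdefault(n.split("-")[0], []).append(n)
    return groups


tree = build_tree(["doc-pdf", "doc-pptx", "data-csv", "data-json"], by_prefix, capacity=2)
print([child["name"] for child in tree["children"]])
```

In a real system the categorizer would see only skill names and descriptions, matching the metadata restriction described above.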

The paper gives concrete tree hyperparameters. For ecosystems of size 200 and 1K, the branching factor is

$B = 7.$

For the 200K ecosystem the paper specifies a separate branching factor. It also specifies the range within which child groups are discovered at each node and a per-node capacity threshold, and it states the resulting values under each of the two branching-factor settings (Li et al., 2 Mar 2026). The numerical values of these settings are not recoverable from the provided text.

To limit active-tree size at very large scale, the framework introduces a usage-frequency queue in which each skill's frequency is its install count on the marketplace (Li et al., 2 Mar 2026). The active set $\mathcal{S}_T$ is defined as the union of two parts: the most frequently used skills selected from this queue, and manually selected user skills that are injected into the set (Li et al., 2 Mar 2026). The paper gives a concrete setting for the 200K ecosystem, whose value is not recoverable from the provided text.

The remainder is placed in a dormant index, which uses a vector index over skill name and description for semantic suggestion via embedding similarity (Li et al., 2 Mar 2026).
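The split between active and dormant skills can be sketched as follows. This is an illustrative reading only: the function name, the parameter `k`, and the example install counts are assumptions, and the embedding-based dormant index is represented simply as the set of skills left over.

```python
def split_active_dormant(
    install_counts: dict[str, int], user_skills: set[str], k: int
) -> tuple[set[str], set[str]]:
    """Return (active, dormant): top-k skills by install count plus user skills."""
    # Sort by descending install count; break ties by name for determinism.
    ranked = sorted(install_counts, key=lambda s: (-install_counts[s], s))
    active = set(ranked[:k]) | user_skills
    dormant = set(install_counts) - active
    return active, dormant


# Made-up install counts for four hypothetical skills.
counts = {"pdf-builder": 900, "csv-tools": 500, "gif-maker": 20, "team-style": 5}
active, dormant = split_active_dormant(counts, user_skills={"team-style"}, k=2)
print(sorted(active), sorted(dormant))
```

Only the active set would be organized into the capability tree; the dormant remainder would be served by similarity search instead.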

The online task-solving stage begins with task-driven tree traversal. The LLM descends the hierarchy layer by layer, selecting relevant category nodes and collecting reached leaves as candidate skills (Li et al., 2 Mar 2026). Candidate skills are then pruned by an LLM that deduplicates and ranks them, keeping only a fixed-size shortlist of top-ranked skills (Li et al., 2 Mar 2026). These shortlisted skills define the node set $V$ of the orchestration graph

$G = (V, E),$

where each $v_i \in V$ is a selected skill and each edge $(v_i, v_j) \in E$ denotes a dependency (Li et al., 2 Mar 2026). The graph must satisfy a layering constraint: every edge must point from a node in an earlier layer to a node in a strictly later layer, which enforces acyclicity through topological layering (Li et al., 2 Mar 2026).
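The retrieval half of this stage, descending the tree and collecting leaves before pruning, might look like the sketch below. The `is_relevant` callback is a hypothetical stand-in for the LLM's per-layer category selection, the dictionary tree layout is invented for the example, and the final step only deduplicates and truncates, whereas the described system also ranks candidates with an LLM.

```python
from typing import Callable


def retrieve(
    tree: dict, task: str, is_relevant: Callable[[str, str], bool], top_k: int
) -> list[str]:
    """Descend layer by layer, keep relevant categories, collect leaf skills."""
    frontier, candidates = [tree], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node["children"]:
                candidates.extend(node["skills"])     # reached a leaf group
                continue
            next_frontier.extend(
                c for c in node["children"] if is_relevant(task, c["name"])
            )
        frontier = next_frontier
    # Stand-in for LLM pruning: drop duplicates (order preserved), keep top_k.
    return list(dict.fromkeys(candidates))[:top_k]


# Toy tree and a keyword matcher; both are assumptions for illustration.
toy_tree = {
    "name": "root",
    "skills": [],
    "children": [
        {"name": "documents", "skills": ["pdf-builder", "docx-writer"], "children": []},
        {"name": "data", "skills": ["csv-tools", "csv-tools", "json-lint"], "children": []},
    ],
}


def keyword_match(task: str, category: str) -> bool:
    return category in task.lower()


shortlist = retrieve(toy_tree, "Clean the data and chart it", keyword_match, top_k=2)
print(shortlist)
```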

The orchestration layer generates alternative DAG plans under three prompt-level strategies: Quality-First, Efficiency-First, and Simplicity-First (Li et al., 2 Mar 2026). Quality-First adds preparation and refinement stages to maximize output quality; Efficiency-First reduces sequential dependencies and exposes more parallelism; Simplicity-First produces a minimal DAG in which every node is essential (Li et al., 2 Mar 2026). Execution follows the DAG’s layered structure: nodes in the same layer run in parallel, while different layers run sequentially according to dependency order (Li et al., 2 Mar 2026). Each execution prompt includes the original user task, the specific skill to invoke, the assigned subtask, upstream artifacts, usage hints for those artifacts, expected outputs, and downstream-consumption explanations (Li et al., 2 Mar 2026).
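Layered execution of this kind can be sketched with Kahn-style layering and a thread pool. This is an assumption-laden illustration rather than the framework's executor: `run_node` is a hypothetical callback standing in for a skill invocation, and upstream artifacts are passed as a plain dictionary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def topo_layers(nodes: list[str], edges: list[tuple[str, str]]) -> list[list[str]]:
    """Group nodes into layers so every edge runs from an earlier to a later layer."""
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    layers, ready = [], [n for n in nodes if indegree[n] == 0]
    while ready:
        layers.append(ready)
        upcoming = []
        for src in ready:
            for s, dst in edges:
                if s == src:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        upcoming.append(dst)
        ready = upcoming
    if sum(len(layer) for layer in layers) != len(nodes):
        raise ValueError("dependency graph contains a cycle")
    return layers


def run_dag(
    nodes: list[str],
    edges: list[tuple[str, str]],
    run_node: Callable[[str, dict[str, str]], str],
) -> dict[str, str]:
    """Run layers sequentially; run the nodes inside each layer in parallel."""
    results: dict[str, str] = {}
    for layer in topo_layers(nodes, edges):
        with ThreadPoolExecutor() as pool:
            futures = {
                n: pool.submit(run_node, n, {s: results[s] for s, d in edges if d == n})
                for n in layer
            }
        results.update({n: f.result() for n, f in futures.items()})
    return results


def fake_skill(name: str, upstream: dict[str, str]) -> str:
    """Toy node: records which upstream artifacts it received."""
    return f"{name}({','.join(sorted(upstream.values()))})"


nodes = ["fetch", "clean", "chart", "report"]
edges = [("fetch", "chart"), ("clean", "chart"), ("chart", "report")]
print(topo_layers(nodes, edges))
print(run_dag(nodes, edges, fake_skill)["report"])
```

A cycle leaves some nodes with nonzero in-degree, so the layering step raises instead of silently skipping them.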

3. Benchmark design and evaluation methodology

AgentSkillOS is evaluated on a benchmark of 30 artifact-rich tasks spanning five categories, with 6 tasks per category: Data Computation, Document Creation, Motion Video, Visual Design, and Web Interaction (Li et al., 2 Mar 2026). The benchmark is designed to test whether an agent can discover relevant skills, invoke them correctly, compose them when needed, and deliver end-user-facing artifacts such as PDF, PPTX, DOCX, HTML pages, videos, generated images, csv, json, and gif (Li et al., 2 Mar 2026).

All tasks are human-crafted by experts, who curate high-quality skills from public marketplaces and GitHub repositories and then write task descriptions and deliverable requirements based on the scenarios those skills target and plausible real-world user needs (Li et al., 2 Mar 2026). Some tasks are derived from a single skill, while others are created by cross-composing multiple skills (Li et al., 2 Mar 2026). This design intentionally makes success strongly dependent on correct skill selection and composition (Li et al., 2 Mar 2026).

Evaluation uses pairwise LLM-based judging rather than absolute scoring. The judge is implemented using Claude Code Agent SDK with claude-opus-4.5 (Li et al., 2 Mar 2026). Because outputs are multimodal, the benchmark first converts them into judge-consumable representations: documents and slides are rendered as page images, HTML pages as full-page screenshots, videos as uniformly sampled frames plus duration, resolution, and frame rate metadata, images are resized to a standard resolution, and text files are included verbatim up to a length limit (Li et al., 2 Mar 2026). To reduce position bias, every pairwise comparison is run in both orderings; if both orderings agree, that preference is used, if one ordering errors the valid one is used, and if orderings conflict the result is recorded as a tie (Li et al., 2 Mar 2026).
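The rule for combining the two orderings is simple enough to state as code. In this sketch the verdict labels and the function name are hypothetical, verdicts are assumed to be already mapped back to system identities rather than presentation slots, and the case where both runs error is an assumption, since the source does not describe it.

```python
from typing import Optional


def resolve_pair(forward: Optional[str], reverse: Optional[str]) -> Optional[str]:
    """Combine two judge verdicts ("A", "B", or "tie"); None marks an errored run."""
    if forward is None and reverse is None:
        return None                # assumption: nothing is recorded if both runs fail
    if forward is None:
        return reverse             # only the reverse ordering produced a verdict
    if reverse is None:
        return forward             # only the forward ordering produced a verdict
    return forward if forward == reverse else "tie"   # conflict is recorded as a tie


print(resolve_pair("A", "A"), resolve_pair("A", "B"), resolve_pair(None, "B"))
```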

Results are aggregated into a win matrix

$W = [w_{ij}],$

where $w_{ij}$ counts how many times system $i$ is preferred over system $j$, with ties contributing $0.5$ to both directions (Li et al., 2 Mar 2026). Final ranking uses a Bradley–Terry model with latent strengths $\pi_i > 0$, using the standard probability

$P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j},$

fit by maximum likelihood with the MM algorithm and Laplace smoothing (Li et al., 2 Mar 2026). Strengths are centered and linearly rescaled to $[0, 100]$, so that the weakest system scores 0 and the strongest scores 100.

This rescaled value is the paper’s unified quality score (Li et al., 2 Mar 2026).
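For readers unfamiliar with the fitting procedure, here is a minimal sketch of Bradley–Terry estimation with the MM update, followed by a rescale to the 0 to 100 range. Several details are assumptions rather than the paper's settings: the smoothing pseudo-count of 0.5, the fixed iteration count, rescaling on log-strengths, and the example win counts are all chosen only for illustration.

```python
import math


def bradley_terry(wins: list[list[float]], smoothing: float = 0.5, iters: int = 500) -> list[float]:
    """Fit strengths with the MM update; wins[i][j] = times system i beat system j."""
    n = len(wins)
    # Laplace-style smoothing (assumed form): add a pseudo-count to each off-diagonal cell.
    w = [[wins[i][j] + smoothing if i != j else 0.0 for j in range(n)] for i in range(n)]
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(w[i])
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j]) for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x * n / norm for x in updated]    # fix the scale; only ratios matter
    return p


def rescale_0_100(strengths: list[float]) -> list[float]:
    """Min-max rescale of log-strengths (assumed) so the weakest is 0 and the strongest 100."""
    logs = [math.log(s) for s in strengths]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [50.0] * len(logs)              # arbitrary choice when all systems tie
    return [100.0 * (x - lo) / (hi - lo) for x in logs]


# Made-up win counts for three systems, ten comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
scores = rescale_0_100(bradley_terry(wins))
print([round(s, 1) for s in scores])
```

Because the final scores are min-max rescaled, they express relative standing within the compared set, not absolute quality.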

The experimental setup evaluates three ecosystem scales:

200 skills
1K skills
200K skills

The 200-skill pool is manually curated from the best-performing skill for each benchmark task plus additional expert-selected skills; the 1K and 200K pools extend this base with more marketplace skills ranked by install count (Li et al., 2 Mar 2026). AgentSkillOS uses claude-opus-4.5 for capability tree construction, retrieval, and DAG planning, while DAG node execution uses Claude Code Agent SDK with claude-sonnet-4.5 (Li et al., 2 Mar 2026).

4. Empirical findings

The main result is that AgentSkillOS variants dominate flat or skill-free baselines across all three ecosystem scales (Li et al., 2 Mar 2026). At 200 skills, the Bradley–Terry scores are: Quality-First 100.0, Efficiency-First 58.5, Simplicity-First 53.6, w/ Full Pool 24.3, and Vanilla 0.0 (Li et al., 2 Mar 2026). At 1K skills, they are 100.0, 76.1, 56.3, 48.1, and 0.0, respectively (Li et al., 2 Mar 2026). At 200K skills, they are 100.0, 89.0, 56.0, 17.2, and 0.0 (Li et al., 2 Mar 2026).

A particularly important finding is that exposure to a larger flat skill pool does not scale gracefully. The w/ Full Pool baseline scores 24.3 at 200 skills, 48.1 at 1K, and then drops to 17.2 at 200K (Li et al., 2 Mar 2026). This is the empirical basis for the paper’s claim that skill potential is unlocked by structured composition rather than by skill availability alone (Li et al., 2 Mar 2026).

The ablation studies isolate both retrieval and orchestration. Tree-based retrieval is said to approach oracle selection: the gap between Quality-First and Quality-First (Oracle) is only modest and narrows as ecosystem size grows (Li et al., 2 Mar 2026). More importantly, AgentSkillOS Quality-First still clearly outperforms w/ Oracle Skills, meaning flat invocation underperforms structured DAG orchestration even when the exact benchmark-designated skills are already known (Li et al., 2 Mar 2026). This identifies orchestration, not only retrieval, as a primary causal ingredient.

The three orchestration strategies also induce measurably different graph topologies. Quality-First produces the largest, deepest, most connected DAGs; Efficiency-First produces similarly sized but wider and shallower DAGs; Simplicity-First produces the smallest and sparsest DAGs (Li et al., 2 Mar 2026). The paper analyzes graph structure using node count, edge count, max width, and max depth, though it does not report exact graph-statistic values in the provided text (Li et al., 2 Mar 2026).

The empirical results fit a broader trend in adjacent systems work. In “AgentStore,” heterogeneous-agent integration improves OSWorld performance from 11.21% to 23.85% average task success using a registry, manifest, and meta-controller design (Jia et al., 2024). In “Bian Que,” a domain-specific skill arrangement framework for online operations reports 99.0% pass rate on offline evaluations after refinement and deployment outcomes including 75% alert-volume reduction, 80% RCA accuracy, and more than 50% MTTR reduction (Liu et al., 29 Apr 2026). These systems are not AgentSkillOS itself, but they reinforce the claim that structured skill organization, routing, and orchestration are high-leverage systems variables rather than secondary conveniences.

5. Relation to adjacent architectures and security models

AgentSkillOS, as defined in (Li et al., 2 Mar 2026), focuses on large-scale skill organization and orchestration. Nearby papers illuminate adjacent subsystems that a broader AgentSkillOS stack might incorporate.

AgentClick contributes a reusable review plane rather than a full operating system design. It defines a localhost/HTTP service, a browser UI, and a skill-mediated protocol through which an otherwise unmodified terminal agent can surface artifacts for inspection, accept edits, persist preferences, and resume execution (Zhuang et al., 15 Apr 2026). Its core event loop is explicitly described as proposal → review session → human action → result signal → agent resumes, which the paper itself characterizes, in AgentSkillOS terms, as a generic human-review syscall or interrupt path (Zhuang et al., 15 Apr 2026). This suggests a practical supervision substrate for skill execution, especially at consequential steps.

AgentStore contributes a platform model for integrating heterogeneous agents as skills. It defines AgentPool, AgentEnroll, and MetaAgent, with an enrollment representation in which each enrolled agent is described by a standardized document covering applications, capabilities, limitations, and demonstrations (Jia et al., 2024). It also proposes AgentToken, a learned token-based indexing layer for scalable routing, with next-token prediction over the union of vocabulary and agent tokens and top-$k$ shortlist selection for manager-mode planning (Jia et al., 2024). This suggests a complementary routing substrate for large skill registries, especially where LLM-guided hierarchical retrieval might be augmented by learned capability indexing.

AgenticOS addresses a different layer: security architecture. It reframes the OS from a resource manager into an intent filter, replacing raw exposure of low-level primitives with structured semantic capabilities invoked through an Intent ABI (Zhao et al., 19 Jun 2026). Skills are defined there as “an operating-system-native capability unit callable through the Intent ABI,” and runtime authority is synthesized from a Manifest-Only Runtime plus Weaver-generated capability surfaces (Zhao et al., 19 Jun 2026). This suggests a secure AgentSkillOS direction in which skills become governed OS-native capability units rather than merely retrieved packages.

The security pressure for such governance is empirically reinforced by the measurement papers. The marketplace analysis of 40,285 public skills reports strong duplication and a risk distribution of 54% L0, 5% L1, 30% L2, and 9% L3, with software engineering skills showing the highest L3 share at 14% (Ling et al., 8 Feb 2026). AgentTrap then moves from static content risk to runtime trust failure, evaluating 141 tasks—91 malicious and 50 benign utility—and finding that the most informative failures are often benign-task completions contaminated by unsafe hidden side effects introduced by installed skills (Zhuang et al., 13 May 2026). SkillProbe further reports that among the top 2,500 downloaded ClawHub skills, only 247 (9.9%) were fully clean under its auditing pipeline, while 499 high-risk skills formed a graph with 75,373 risk edges and a single giant connected component in the risk-link space (Guo et al., 22 Mar 2026). These findings indicate that any realistic AgentSkillOS must treat registry governance, admission control, and composition safety as first-class systems concerns.

6. Limitations and open directions

The AgentSkillOS paper is explicit about what it does not solve. Skill collection is assumed to be out of scope; future work should automate discovery of new skills, quality assessment, and continuous integration (Li et al., 2 Mar 2026). Skill self-evolution is also not addressed, though the paper notes that because skills are readable artifacts, future systems could refine instructions, fix failures, and create improved variants based on execution feedback (Li et al., 2 Mar 2026). The provided text also states that the paper does not specify detailed prompt templates, explicit retrieval scoring formulas beyond install-count ranking and LLM relevance pruning, detailed failure recovery, cost or latency measurements, or human validation of the LLM judge (Li et al., 2 Mar 2026).

Several adjacent systems highlight plausible next steps. Agent Spec offers a declarative, framework-agnostic configuration language with typed components, JSON-Schema-like I/O contracts, symbolic references, and portable flows/tools, which could serve as a specification layer for skills and workflows inside an AgentSkillOS (Benajiba et al., 5 Oct 2025). Agent libOS provides a library-OS-inspired runtime substrate in which long-running agents are modeled as AgentProcess objects with identity, lineage, lifecycle state, explicit capabilities, typed Object Memory, tool tables, human queues, checkpoints, and audit records (Zhang, 2 Jun 2026). AOrchestra contributes a dynamic sub-agent abstraction in which specialization is decomposed into working memory and capabilities, and in which the orchestrator’s action space is explicitly restricted (Ruan et al., 3 Feb 2026). These works suggest an extended AgentSkillOS stack with declarative manifests, runtime process control, and dynamic skill instantiation.

This suggests a broader interpretation of AgentSkillOS as more than the specific tree-and-DAG framework in (Li et al., 2 Mar 2026). A plausible implication is that the term can denote an ecosystem-level operating layer whose primary managed objects are skills: discoverable through hierarchical or learned indices, specified declaratively, orchestrated as DAGs or runtime-instantiated executors, supervised through structured review planes, governed by admission and security policies, and, where needed, monetized and exchanged over interoperable network layers. The 2026 AgentSkillOS paper establishes the organization-and-orchestration core of that picture; the surrounding literature fills in neighboring planes—review, security, interoperability, declarative specification, runtime control, and auditing—without collapsing them into a single monolithic design (Li et al., 2 Mar 2026, Zhuang et al., 15 Apr 2026, Jia et al., 2024, Zhao et al., 19 Jun 2026, Benajiba et al., 5 Oct 2025, Zhang, 2 Jun 2026, Ruan et al., 3 Feb 2026).