
MUSE-Autoskill: Dynamic Skill Framework for LLM Agents

Updated 29 May 2026
  • MUSE-Autoskill is a dynamic, skill-centric agent framework that enables continuous creation, management, and refinement of reusable LLM skills.
  • It integrates skills as long-lived, evolving entities, allowing efficient reuse across tasks and seamless cross-agent skill transfer.
  • Empirical results on SkillsBench demonstrate improved accuracy, efficiency, and adaptability over static skill systems.

MUSE-Autoskill is a skill-centric framework for LLM agents that lets them continuously expand and refine their task-solving capabilities by creating, managing, and improving reusable skills within a unified lifecycle. Unlike prior systems that treat skills as isolated, static artifacts, MUSE-Autoskill treats skills as long-lived, dynamically evolving entities that accumulate experience through use, are reused across diverse tasks, and are systematically evaluated and refined. The framework's architecture and its empirical validation on the SkillsBench benchmark show gains in accuracy, efficiency, and cross-agent skill transfer (Lin et al., 26 May 2026).

1. Agent Architecture

MUSE-Autoskill encapsulates a ReAct-style LLM agent (Planning → Action → Observation) with five subsystems implementing a comprehensive skill lifecycle: creation, memory, management, evaluation, and refinement. At each decision point, the agent receives system prompts containing immediate and persistent context (short-term and long-term memory), as well as a catalog of available skills. For each planning step, the agent may invoke a built-in tool, create a new skill, or retrieve and execute an existing skill. The skill bank is realized as an on-disk directory, where each skill is represented by a SKILL.md file (with YAML metadata and a markdown description), associated scripts, tests, and a per-skill memory file. Multi-level memory spans session-scoped, cross-session, and per-skill contexts. The context manager ensures the planning context is maintained under strict token budgets using hierarchical summarization. All skill evaluation and refinement is conducted via pytest-style unit tests and in-place code modifications using the agent's existing sandbox infrastructure.
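To make the on-disk skill bank concrete, here is a minimal sketch of reading a skill's SKILL.md. The `---` front-matter delimiters, the flat `key: value` metadata, the one-directory-per-skill layout, and the names `SkillCard`, `parse_skill_md`, and `load_skill_bank` are assumptions for illustration; the source only states that SKILL.md carries YAML metadata and a markdown description.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass
class SkillCard:
    name: str
    description: str
    body: str  # markdown text after the metadata block


def parse_skill_md(text: str) -> SkillCard:
    """Split SKILL.md text into metadata and markdown body.

    Assumption: metadata is a flat 'key: value' block between two '---'
    lines. Nested YAML would need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md has no front-matter block")
    # Index of the closing '---' delimiter, if any.
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        raise ValueError("front-matter block is not closed")
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return SkillCard(meta["name"], meta.get("description", ""), body)


def load_skill_bank(root: Path) -> List[SkillCard]:
    """Parse every <root>/<skill>/SKILL.md, sorted by path."""
    return [
        parse_skill_md(path.read_text(encoding="utf-8"))
        for path in sorted(root.glob("*/SKILL.md"))
    ]
```

Keeping the name and description separate from the body mirrors the idea that only a short catalog entry is shown to the planner, while the full card is read on demand.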

2. Skill Lifecycle and Management

Every capability beyond built-in tools is encapsulated as a skill package managed through five interlocking stages:

Skill Creation: When an unaddressed action is required, the agent invokes skill_create, providing a high-level specification (purpose, name, interface, constraints). This triggers a pipeline whereby the LLM drafts SKILL.md and plans the package directory; code and tests are iteratively generated. The skill is sandboxed and tested with pytest; only on complete pass (pass_rate = 1.0) is the skill accepted into the bank. Failing skills are patched and re-tested via update_skill until success or retry exhaustion.

Skill-Level Memory: Each skill maintains an append-only log (.memory.md) recording observations, caveats, or quirks surfaced during invocation. The update rule is $M_S(t) = M_S(t-1) \cup \{\mathrm{obs}_t\}$, so the full history remains available to the agent at retrieval time and can steer future invocation strategies.

Skill Management: At session initialization, a catalog of all skills (name and description) is injected into the system prompt (token cost $O(N)$, where $N$ is the skill count). For planning, the agent scores skills for relevance, implicitly via $\mathrm{score}(S \mid Q, C) = \mathrm{LM\_score}(Q \,\Vert\, \text{"Should I use skill } S\text{?"} \,\Vert\, S.\mathrm{description} \,\Vert\, C)$, and selects the top candidate. Skills may be merged if interface/semantic similarity exceeds 80%, and pruned for inactivity or recurrent failure. The full SKILL.md body is loaded only when explicitly accessed.
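The management steps can be illustrated with a minimal sketch. It swaps the LM-based relevance score for a simple token-overlap (Jaccard) score so that it runs without a model, and it uses the same overlap as a stand-in for the unspecified interface/semantic similarity; both substitutions, the skill names, and all function names are assumptions, not the framework's implementation. Only the 80% merge threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    description: str


def build_catalog(skills: List[Skill]) -> str:
    """One line per skill, so prompt cost grows linearly with skill count."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def overlap_score(query: str, skill: Skill) -> float:
    """Stand-in for LM_score: Jaccard overlap of query and description."""
    q, d = _tokens(query), _tokens(skill.description)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def select_skill(
    query: str,
    skills: List[Skill],
    score: Callable[[str, Skill], float] = overlap_score,
) -> Optional[Skill]:
    """Return the top-scoring skill, or None if the bank is empty."""
    return max(skills, key=lambda s: score(query, s), default=None)


def merge_candidates(
    skills: List[Skill], threshold: float = 0.8
) -> List[Tuple[str, str]]:
    """Pairs whose description similarity exceeds the merge threshold."""
    pairs = []
    for i, a in enumerate(skills):
        for b in skills[i + 1:]:
            if overlap_score(a.description, b) > threshold:
                pairs.append((a.name, b.name))
    return pairs
```

Passing the scorer as a parameter keeps the selection logic unchanged if the overlap score is later replaced by a model-based one.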

Skill Evaluation: Each skill houses a tests/ directory with $m$ unit tests. The pass rate is given by $\mathrm{pass\_rate}(S) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[T_i \text{ passes}]$. Only skills with $\mathrm{pass\_rate} = 1.0$ are accepted. Runtime regression checks capture failures as specific observations, feeding further refinements.

Skill Refinement: Upon test failure, update_skill(S, E) is called, prompting the LLM to patch the affected files. The evaluation-refinement loop repeats, within a small retry budget, to minimize $J(S) = \sum_{i=1}^{m} L_i$, where $L_i = 0$ if $T_i$ passes and is positive otherwise, so the loop's target of $J(S) = 0$ coincides with the acceptance criterion $\mathrm{pass\_rate}(S) = 1.0$.

3. Algorithmic Schematics

The core procedures are structured as follows (condensed Python-style pseudocode):

Skill Generation:

[Pseudocode listing not preserved.]

Memory Update:

[Pseudocode listing not preserved.]
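A minimal sketch of the append-only memory update $M_S(t) = M_S(t-1) \cup \{\mathrm{obs}_t\}$, assuming each observation is stored as one timestamped bullet line in the skill's .memory.md. The entry format and the function names are assumptions; the source specifies only that the log is append-only.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import List

MEMORY_FILE = ".memory.md"


def append_observation(skill_dir: Path, observation: str) -> None:
    """Append one observation; earlier entries are never rewritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Collapse whitespace so each observation stays on a single line.
    text = " ".join(observation.split())
    with (skill_dir / MEMORY_FILE).open("a", encoding="utf-8") as log:
        log.write(f"- [{stamp}] {text}\n")


def read_memory(skill_dir: Path) -> List[str]:
    """Return all logged entries in the order they were written."""
    path = skill_dir / MEMORY_FILE
    if not path.exists():
        return []
    return [
        line[2:]
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.startswith("- ")
    ]
```

Opening the file in append mode is what enforces the update rule: each call adds exactly one element and leaves the history intact.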

Evaluation and Refinement:

[Pseudocode listing not preserved.]
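A minimal sketch of the evaluation-refinement loop, assuming the test runner and the patching step are injected as callables. `update_skill` is named in the source, but its signature here, the `run_tests` callable, the default retry budget of 3, and the choice of $L_i = 1$ for a failing test are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

S = TypeVar("S")
TestRun = Tuple[List[bool], List[str]]  # per-test outcomes, error messages


def pass_rate(results: List[bool]) -> float:
    """Fraction of passing tests; an empty suite counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def loss(results: List[bool]) -> int:
    """J(S) under the assumption that each failing test costs 1."""
    return sum(1 for passed in results if not passed)


def evaluate_and_refine(
    skill: S,
    run_tests: Callable[[S], TestRun],
    update_skill: Callable[[S, List[str]], S],
    retry_budget: int = 3,
) -> Tuple[Optional[S], List[float]]:
    """Test, patch, and re-test until every test passes or retries run out.

    Returns the accepted skill (or None) and the pass rate of each attempt.
    """
    history: List[float] = []
    for attempt in range(retry_budget + 1):
        results, errors = run_tests(skill)
        history.append(pass_rate(results))
        if results and loss(results) == 0:
            return skill, history  # accepted: pass_rate == 1.0
        if attempt < retry_budget:
            skill = update_skill(skill, errors)
    return None, history  # rejected after exhausting the budget
```

In a real setup `run_tests` would wrap a sandboxed pytest run; keeping it injectable lets the loop be exercised with a stub.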

4. Empirical Evaluation on SkillsBench

Experiments were conducted with three agents (all GPT-5.5-based) across 51 Docker-grade tasks in four super-domains: SciEng (14), Data-Analysis (15), DocProcessing (9), OpsPlanning (13). Each configuration underwent five independent runs per task, excluding environmental failures.

Macro-Average Accuracy

Agent                    w/o skills    w/ human skills    Δ (lift, pp)
Codex                    52.11%        67.28%             +15.17
Hermes                   47.89%        61.21%             +13.33
MUSE-Autoskill (ours)    53.19%        68.40%             +15.21

MUSE-Autoskill leads in 3 of 4 super-domains, trailing Codex in SciEng by 5.7 pp due to boundary-condition failures.

Self-Created Skills

Of 51 tasks, self-creation succeeded for 35 (68.6%). On these 35 tasks, injecting the self-created skill lifted mean accuracy to 87.94%, above the 84.9% reached with human skills on the same tasks.

Configuration                         Macro-average accuracy
MUSE w/o skills (baseline)            53.19%
MUSE w/ human skills                  68.40%
MUSE w/ self-created skills (ours)    60.35%

Cross-Agent Skill Transfer

Injecting MUSE-generated skills into Hermes yielded 58.40% accuracy, closing 79% of the gap between Hermes without skills (47.89%) and Hermes with human skills (61.21%). Hermes thus nearly matches MUSE’s own self-created result (60.35%), evidencing portability of skills as externalized, agent-independent assets.
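The 79% figure can be reproduced from the accuracies quoted in this section. The helper name `gap_closed` is an assumption; the numbers are the Hermes values reported above.

```python
def gap_closed(baseline: float, with_transfer: float, with_human: float) -> float:
    """Fraction of the baseline-to-human-skill gap recovered by transfer."""
    return (with_transfer - baseline) / (with_human - baseline)


# Hermes accuracies (%) as quoted in this article.
hermes = gap_closed(baseline=47.89, with_transfer=58.40, with_human=61.21)
print(f"{hermes:.1%}")  # prints 78.9%
```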

Efficiency and Cost

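On the 35 generated-skill tasks, each agent reaches its highest reward with generated skills while also using fewer median per-task tokens and less latency than in its own no-skill and human-skill configurations. Hermes remains cheaper than MUSE in absolute terms, so the improvement holds within each agent rather than across every row; both generated-skill configurations are Pareto-optimal among the configurations in the table below.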

Agent + Config            Reward %    Tokens (K)    Latency (s)
MUSE w/o skills           76.9        578           684
MUSE w/ human skills      84.9        615           656
MUSE w/ gen skills        87.9        493           411
Hermes w/o skills         69.8        181           370
Hermes w/ human skills    77.8        186           369
Hermes w/ gen skills      85.1        97            257
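To check which configurations are Pareto-optimal, the rows of the table above can be compared pairwise. This is a minimal sketch that treats reward as higher-is-better and tokens and latency as lower-is-better; the function names are assumptions, and the data is copied from the table.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (reward %, tokens in K, latency in s)

ROWS: Dict[str, Point] = {
    "MUSE w/o skills": (76.9, 578, 684),
    "MUSE w/ human skills": (84.9, 615, 656),
    "MUSE w/ gen skills": (87.9, 493, 411),
    "Hermes w/o skills": (69.8, 181, 370),
    "Hermes w/ human skills": (77.8, 186, 369),
    "Hermes w/ gen skills": (85.1, 97, 257),
}


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(rows: Dict[str, Point]) -> List[str]:
    """Names of configurations that no other configuration dominates."""
    return [
        name
        for name, point in rows.items()
        if not any(
            dominates(other, point)
            for other_name, other in rows.items()
            if other_name != name
        )
    ]


print(pareto_front(ROWS))
```

Under these assumptions the front consists of the two generated-skill configurations: MUSE with generated skills has the highest reward, and Hermes with generated skills has the lowest token and latency cost.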

5. Component Validation and Comparative Analyses

While explicit leave-one-out ablation is absent, the paper's full-system and cross-configuration results functionally isolate the contributions of each module:

  • Comparison of no skills vs with skills quantifies the aggregate value of the skill lifecycle, revealing consistent +13–15 pp accuracy lifts.
  • Self-created vs human skills isolates the effectiveness of automatic skill synthesis, with self-created skills exceeding the human-crafted ceiling on covered tasks.
  • Cross-agent transfer illustrates skills as portable, externalized knowledge artifacts.
  • Efficiency results demonstrate that generated skills enable agents to move to more favorable cost-quality regimes (as visualized in token-versus-reward and latency-versus-reward plots).

6. Limitations, Failure Analyses, and Future Work

Limitations: MUSE-Autoskill’s experimental coverage is restricted to 51 of 94 SkillsBench tasks (43 excluded for complex Docker dependencies). Self-creation covers 68.6% of tasks; uncaptured domains center on complex production tooling or numerically intensive formats. Cross-agent transfer is validated only for MUSE→Hermes. Evaluation involves only 5 runs per task, with potentially wide confidence intervals; no formal statistical testing is reported. Self-created skills are typically distilled from single successful runs, potentially leading to overfit behaviors.

Failure Cases: The hvac-control skill (PID tuning) is notably brittle under input noise. Partial regressions in other skills are linked to dataset- or trajectory-specific assumptions (e.g., hardcoded file paths or numerical ranges).

Future Directions: Research avenues include partial-trace skill extraction (for diagnostics from failures), robustification through adversarial unit tests, multi-architecture skill transfer, continual skill distillation into meta-skills, RL-guided prompt optimization, and expanded benchmarking (full SkillsBench, GAIA, SWE-bench, AgentBench).

MUSE-Autoskill operationalizes the concept of skills as long-lived, experience-aware, and testable assets within LLM agents, substantiated by strong empirical performance on challenging agent benchmarks (Lin et al., 26 May 2026).
