MUSE-Autoskill: Dynamic Skill Framework for LLM Agents
- MUSE-Autoskill is a dynamic, skill-centric agent framework that enables continuous creation, management, and refinement of reusable LLM skills.
- It integrates skills as long-lived, evolving entities, allowing efficient reuse across tasks and seamless cross-agent skill transfer.
- Empirical results on SkillsBench demonstrate improved accuracy, efficiency, and adaptability over static skill systems.
MUSE-Autoskill is a skill-centric framework for LLM agents that lets them continuously expand and refine their task-solving capabilities by creating, managing, and improving reusable skills within a unified lifecycle. Unlike prior systems that treat skills as isolated, static artifacts, MUSE-Autoskill treats skills as long-lived, dynamically evolving entities that accumulate experience through use, are reused across diverse tasks, and are systematically evaluated and refined. Empirical validation on the SkillsBench benchmark shows gains in accuracy, efficiency, and cross-agent skill transfer (Lin et al., 26 May 2026).
1. Agent Architecture
MUSE-Autoskill encapsulates a ReAct-style LLM agent (Planning → Action → Observation) with five subsystems implementing a comprehensive skill lifecycle: creation, memory, management, evaluation, and refinement. At each decision point, the agent receives system prompts containing immediate and persistent context (short-term and long-term memory), as well as a catalog of available skills. For each planning step, the agent may invoke a built-in tool, create a new skill, or retrieve and execute an existing skill. The skill bank is realized as an on-disk directory, where each skill is represented by a SKILL.md file (with YAML metadata and a markdown description), associated scripts, tests, and a per-skill memory file. Multi-level memory spans session-scoped, cross-session, and per-skill contexts. The context manager ensures the planning context is maintained under strict token budgets using hierarchical summarization. All skill evaluation and refinement is conducted via pytest-style unit tests and in-place code modifications using the agent's existing sandbox infrastructure.
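As a rough illustration of the on-disk skill bank described above, the following sketch builds the name-and-description catalog from a directory of skill packages. The directory names (`scripts/`, `tests/`), the front-matter keys (`name`, `description`), and the helper names `parse_front_matter`, `build_catalog`, and `render_catalog` are assumptions made for this example, not the framework's actual API. The parser handles only flat `key: value` pairs, not full YAML.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class SkillEntry:
    name: str
    description: str
    path: Path


def parse_front_matter(text: str) -> dict:
    """Parse a flat `key: value` block delimited by '---' lines.

    This is a simplified stand-in for a YAML parser (no nesting, no lists).
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_catalog(bank_dir: Path) -> list[SkillEntry]:
    """Collect name and description from every <skill>/SKILL.md in the bank."""
    catalog = []
    for skill_md in sorted(bank_dir.glob("*/SKILL.md")):
        meta = parse_front_matter(skill_md.read_text(encoding="utf-8"))
        catalog.append(
            SkillEntry(
                name=meta.get("name", skill_md.parent.name),
                description=meta.get("description", ""),
                path=skill_md.parent,
            )
        )
    return catalog


def render_catalog(catalog: list[SkillEntry]) -> str:
    """Render one line per skill, as it might appear in a system prompt."""
    return "\n".join(f"- {s.name}: {s.description}" for s in catalog)


def demo() -> str:
    """Create a throwaway bank with one hypothetical skill and render it."""
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "csv-summary"
        (skill / "tests").mkdir(parents=True)
        (skill / "scripts").mkdir()
        (skill / "SKILL.md").write_text(
            "---\n"
            "name: csv-summary\n"
            "description: Summarize numeric columns of a CSV file\n"
            "---\n\n# Usage\n",
            encoding="utf-8",
        )
        (skill / ".memory.md").write_text("", encoding="utf-8")
        return render_catalog(build_catalog(Path(tmp)))
```

Only the front matter is read when building the catalog, which mirrors the idea that the full SKILL.md body is loaded lazily.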
2. Skill Lifecycle and Management
Every capability beyond built-in tools is encapsulated as a skill package managed through five interlocking stages:
Skill Creation: When a required action is not covered by any existing skill or built-in tool, the agent invokes skill_create with a high-level specification (purpose, name, interface, constraints). This triggers a pipeline in which the LLM drafts SKILL.md and plans the package directory, then iteratively generates code and tests. The skill is sandboxed and tested with pytest; it is accepted into the bank only on a complete pass (pass_rate = 1.0). Failing skills are patched and re-tested via update_skill until they pass or the retry budget is exhausted.
Skill-Level Memory: Each skill maintains an append-only log (.memory.md) recording observations, caveats, or quirks surfaced during invocation. The update rule appends each new observation to the existing log and never rewrites earlier entries, so a full historical record is available to the agent at retrieval time and can steer future invocation strategies.
Skill Management: At session initialization, a catalog of all skills (name and description only) is injected into the system prompt, so its token cost grows with the number of skills in the bank. During planning, the agent implicitly scores skills for relevance and selects the top candidate. Skills may be merged if their interface or semantic similarity exceeds 80%, and pruned after inactivity or recurrent failure. The full SKILL.md body is loaded only when a skill is explicitly accessed.
Skill Evaluation: Each skill houses a tests/ directory with unit tests. The pass rate is the fraction of tests that pass: pass_rate = (passed tests) / (total tests). Only skills with pass_rate = 1.0 are accepted. Runtime regression checks capture failures as specific observations, which feed further refinement.
Skill Refinement: Upon test failure, update_skill(S, E) is called, prompting the LLM to patch the affected files. The evaluation-refinement loop repeats within a small retry budget, scoring each test as 1 if it passes and 0 otherwise, with the objective of reaching pass_rate = 1.0.
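The merge and prune rules of the Skill Management stage can be sketched as below. The source states only the 80% similarity threshold and the pruning criteria (inactivity, recurrent failure). Everything else here is an assumption for illustration: the `SkillRecord` fields, the `max_idle` and `max_failures` limits, and the use of `difflib.SequenceMatcher` on descriptions as a stand-in for whatever interface or semantic similarity measure the framework actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SkillRecord:
    name: str
    description: str
    idle_sessions: int = 0     # sessions since last use (hypothetical field)
    recent_failures: int = 0   # consecutive failed invocations (hypothetical field)


def similarity(a: SkillRecord, b: SkillRecord) -> float:
    """Stand-in similarity: character-level ratio of the two descriptions."""
    return SequenceMatcher(None, a.description.lower(), b.description.lower()).ratio()


def merge_candidates(skills: list[SkillRecord], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of skill names whose similarity exceeds the threshold."""
    pairs = []
    for i, a in enumerate(skills):
        for b in skills[i + 1:]:
            if similarity(a, b) > threshold:
                pairs.append((a.name, b.name))
    return pairs


def prune(skills: list[SkillRecord], max_idle: int = 20, max_failures: int = 3) -> list[SkillRecord]:
    """Keep skills that are neither long-idle nor repeatedly failing."""
    return [
        s for s in skills
        if s.idle_sessions <= max_idle and s.recent_failures < max_failures
    ]


# Hypothetical bank contents used only to exercise the two rules.
bank = [
    SkillRecord("csv-summary", "Parse a CSV file and summarize numeric columns"),
    SkillRecord("csv-summarise", "Parse a CSV file and summarise numeric columns"),
    SkillRecord("pid-tuning", "Tune PID gains for an HVAC controller", recent_failures=5),
]
near_duplicates = merge_candidates(bank)   # the two CSV skills
kept = prune(bank)                         # drops the repeatedly failing skill
```

A pairwise scan is quadratic in the number of skills, which is acceptable for small banks; a larger bank would likely need an index or embedding-based lookup instead.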
3. Algorithmic Schematics
The core procedures, which the paper presents as condensed Python-style pseudocode, are:
- Skill Generation
- Memory Update
- Evaluation and Refinement
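A minimal sketch of the three procedures (skill generation, memory update, and evaluation with refinement) is given below, reconstructed from the lifecycle description in Section 2 rather than from the paper's own listings. The LLM and the sandboxed pytest run are replaced by injected callables; `draft`, `patch`, `run_tests`, the `Skill` dataclass, and the toy stubs are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Skill:
    name: str
    files: dict[str, str]                             # relative path -> contents
    memory: list[str] = field(default_factory=list)   # stands in for .memory.md


# Hypothetical stand-ins for the LLM calls and the sandboxed test run.
Drafter = Callable[[str], dict[str, str]]                    # spec -> files
Patcher = Callable[[Skill, list[str]], dict[str, str]]       # (skill, errors) -> patched files
TestRunner = Callable[[Skill], tuple[int, int, list[str]]]   # -> (passed, total, errors)


def pass_rate(passed: int, total: int) -> float:
    """Fraction of passing tests; an empty suite counts as failing."""
    return passed / total if total else 0.0


def append_memory(skill: Skill, observation: str) -> None:
    """Memory update: append-only, earlier entries are never rewritten."""
    skill.memory.append(observation)


def evaluate_and_refine(skill: Skill, run_tests: TestRunner, patch: Patcher,
                        max_retries: int = 3) -> bool:
    """Test, patch on failure, and re-test until pass_rate == 1.0 or retries run out."""
    for attempt in range(max_retries + 1):
        passed, total, errors = run_tests(skill)
        if pass_rate(passed, total) == 1.0:
            return True
        if attempt < max_retries:
            skill.files.update(patch(skill, errors))
    return False


def create_skill(name: str, spec: str, draft: Drafter, run_tests: TestRunner,
                 patch: Patcher, bank: dict[str, Skill],
                 max_retries: int = 3) -> Optional[Skill]:
    """Skill generation: draft, evaluate, refine, and admit to the bank only on full pass."""
    skill = Skill(name=name, files=draft(spec))
    if evaluate_and_refine(skill, run_tests, patch, max_retries):
        bank[name] = skill
        return skill
    return None


# Toy stubs: the first draft has a bug that the patch step repairs.
def toy_draft(spec: str) -> dict[str, str]:
    return {"scripts/add.py": "def add(a, b):\n    return a - b\n"}


def toy_run_tests(skill: Skill) -> tuple[int, int, list[str]]:
    namespace: dict = {}
    exec(skill.files["scripts/add.py"], namespace)
    cases = [((1, 2), 3), ((0, 0), 0)]
    errors = [f"add{args} != {want}" for args, want in cases
              if namespace["add"](*args) != want]
    return len(cases) - len(errors), len(cases), errors


def toy_patch(skill: Skill, errors: list[str]) -> dict[str, str]:
    return {"scripts/add.py": skill.files["scripts/add.py"].replace("a - b", "a + b")}
```

With these stubs, `create_skill("add", "add two numbers", toy_draft, toy_run_tests, toy_patch, bank={})` fails its first evaluation, is patched once, and is then admitted. With `max_retries=0` the same call returns `None` and the bank stays empty.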
4. Empirical Evaluation on SkillsBench
Experiments were conducted with three agents (all GPT-5.5-based) across 51 Docker-grade tasks in four super-domains: SciEng (14), Data-Analysis (15), DocProcessing (9), OpsPlanning (13). Each configuration underwent five independent runs per task, excluding environmental failures.
Macro-Average Accuracy
| Agent | w/o skills | w/ human skills | Δ (lift, pp) |
|---|---|---|---|
| Codex | 52.11% | 67.28% | +15.17 |
| Hermes | 47.89% | 61.21% | +13.33 |
| MUSE-Autoskill (ours) | 53.19% | 68.40% | +15.21 |
MUSE-Autoskill leads in 3 of 4 super-domains, trailing Codex in SciEng by 5.7 pp due to boundary-condition failures.
Self-Created Skills
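Of 51 tasks, self-creation succeeded for 35 (68.6%). On these 35 tasks, injecting the self-created skill lifted mean accuracy to 87.94%, above the 84.9% obtained with human skills on the same subset. The table below instead reports macro-average accuracy over all 51 tasks, including the 16 without a self-created skill.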
| Configuration | Accuracy |
|---|---|
| MUSE w/o skills (baseline) | 53.19% |
| MUSE w/ human skills | 68.40% |
| MUSE w/ self-created skills (ours) | 60.35% |
Cross-Agent Skill Transfer
Injecting MUSE-generated skills into Hermes yielded 58.40% accuracy (closing 79% of the human skill gap). Hermes thus nearly matches MUSE’s own self-created result (60.35%), evidencing portability of skills as externalized, agent-independent assets.
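The 79% figure can be reproduced from the reported accuracies, assuming "gap closed" means the lift from transferred skills divided by the lift from human skills. That definition and the function name `gap_closed` are assumptions for this example; the three numbers are taken from the text and the macro-average table.

```python
def gap_closed(baseline: float, with_transfer: float, with_human: float) -> float:
    """Share of the baseline-to-human-skill gap recovered by transferred skills."""
    return (with_transfer - baseline) / (with_human - baseline)


# Hermes: 47.89% without skills, 58.40% with MUSE-generated skills,
# 61.21% with human skills.
hermes_fraction = gap_closed(47.89, 58.40, 61.21)
# (58.40 - 47.89) / (61.21 - 47.89) = 10.51 / 13.32, about 0.789
print(f"{hermes_fraction:.1%}")
```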
Efficiency and Cost
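On the 35 generated-skill tasks, both MUSE and Hermes achieve higher reward with fewer median per-task tokens and lower latency when using generated skills than in their own no-skill and human-skill configurations. Within each agent, the generated-skill configuration is therefore a Pareto improvement in both compute and performance; across agents, Hermes remains cheaper in tokens and latency while MUSE attains the highest reward.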
| Agent + Config | Reward % | Tokens (K) | Latency (s) |
|---|---|---|---|
| MUSE w/o skills | 76.9 | 578 | 684 |
| MUSE w/ human skills | 84.9 | 615 | 656 |
| MUSE w/ gen skills | 87.9 | 493 | 411 |
| Hermes w/o skills | 69.8 | 181 | 370 |
| Hermes w/ human skills | 77.8 | 186 | 369 |
| Hermes w/ gen skills | 85.1 | 97 | 257 |
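The Pareto claim can be checked directly against the table. The sketch below treats higher reward, fewer tokens, and lower latency as the three objectives; the `Config` type and the `dominates` and `pareto_front` helpers are illustrative names, while the data rows are copied from the table above.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    reward: float      # percent, higher is better
    tokens_k: float    # thousands of tokens, lower is better
    latency_s: float   # seconds, lower is better


ROWS = [
    Config("MUSE w/o skills", 76.9, 578, 684),
    Config("MUSE w/ human skills", 84.9, 615, 656),
    Config("MUSE w/ gen skills", 87.9, 493, 411),
    Config("Hermes w/o skills", 69.8, 181, 370),
    Config("Hermes w/ human skills", 77.8, 186, 369),
    Config("Hermes w/ gen skills", 85.1, 97, 257),
]


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = (a.reward >= b.reward and a.tokens_k <= b.tokens_k
                and a.latency_s <= b.latency_s)
    better = (a.reward > b.reward or a.tokens_k < b.tokens_k
              or a.latency_s < b.latency_s)
    return no_worse and better


def pareto_front(rows: list[Config]) -> list[str]:
    """Names of configurations not dominated by any other configuration."""
    return [r.name for r in rows if not any(dominates(o, r) for o in rows)]


front = pareto_front(ROWS)
# front == ["MUSE w/ gen skills", "Hermes w/ gen skills"]
```

Both generated-skill configurations sit on the front, and neither dominates the other: MUSE has the higher reward, Hermes the lower token count and latency.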
5. Component Validation and Comparative Analyses
While explicit leave-one-out ablation is absent, the paper's full-system and cross-configuration results provide indirect evidence for the contribution of the main components:
- Comparison of no skills vs with skills quantifies the aggregate value of the skill lifecycle, revealing consistent +13–15 pp accuracy lifts.
- Self-created vs human skills isolates the effectiveness of automatic skill synthesis, with self-created skills exceeding the human-crafted ceiling on covered tasks.
- Cross-agent transfer illustrates skills as portable, externalized knowledge artifacts.
- Efficiency results demonstrate that generated skills enable agents to move to more favorable cost-quality regimes (as visualized in token-versus-reward and latency-versus-reward plots).
6. Limitations, Failure Analyses, and Future Work
Limitations: MUSE-Autoskill’s experimental coverage is restricted to 51 of 94 SkillsBench tasks (43 excluded for complex Docker dependencies). Self-creation covers 68.6% of tasks; uncaptured domains center on complex production tooling or numerically intensive formats. Cross-agent transfer is validated only for MUSE→Hermes. Evaluation involves only 5 runs per task, with potentially wide confidence intervals; no formal statistical testing is reported. Self-created skills are typically distilled from single successful runs, potentially leading to overfit behaviors.
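To give a sense of how wide an interval five runs can leave, the sketch below computes a 95% Wilson score interval for a single task. It assumes each run is a binary pass/fail outcome, and the 3-of-5 success count is a hypothetical example rather than a reported result; the paper itself reports no intervals.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical task that passes in 3 of its 5 runs.
low, high = wilson_interval(3, 5)
print(f"observed 60%, interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly 0.23 to 0.88, so per-task differences of a few percentage points cannot be resolved from five runs alone.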
Failure Cases: Skill hvac-control (PID tuning) is notably brittle under input noise. Partial regressions in other skills are linked to dataset- or trajectory-specific assumptions (e.g., hardcoded file paths or numerical ranges).
Future Directions: Research avenues include partial-trace skill extraction (for diagnostics from failures), robustification through adversarial unit tests, multi-architecture skill transfer, continual skill distillation into meta-skills, RL-guided prompt optimization, and expanded benchmarking (full SkillsBench, GAIA, SWE-bench, AgentBench).
MUSE-Autoskill operationalizes the concept of skills as long-lived, experience-aware, and testable assets within LLM agents, supported by its empirical results on the SkillsBench benchmark (Lin et al., 26 May 2026).