Skill Creator: Modular, Reusable Skills

Updated 2 June 2026

Skill Creator is an automated framework that generates structured skills as code, documentation, and metadata for effective LLM task-solving.
The framework iterates over draft, test, and refine steps, using unit tests and runtime feedback to validate each skill before it is registered for reuse.
Its design integrates per-skill memory and catalog-based selection, enabling efficient reuse and enhanced cross-agent performance.

A Skill Creator is an automated agent or framework component that synthesizes structured, reusable skills—externalized as testable code, documentation, and metadata artifacts—for use by LLM agents in complex task-solving. In the MUSE-Autoskill architecture, a skill is not a monolithic prompt or black-box subroutine, but a modular asset encapsulated on disk with a specific lifecycle: creation, memory, management, evaluation, and refinement. This article systematically presents the technical definition, creation pipeline, integration with per-skill memory, management protocols, and evaluation-driven refinement that comprise the Skill Creator as introduced in "MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation" (Lin et al., 26 May 2026).

1. Formal Definition of a Skill Artifact

In MUSE-Autoskill, a skill $k$ is a multi-part bundle externalized on disk. Formally: $k = (\text{meta}, \text{scripts}, \text{resources}, \text{tests}, \text{memory})$

meta: The SKILL.md file, combining YAML frontmatter and Markdown documentation. The frontmatter declares:
- name: unique kebab-case identifier
- description: natural-language summary
- inputs: list of $(\text{input\_name},\text{type},\text{format})$
- outputs: list of $(\text{output\_name},\text{type},\text{format})$
scripts/ (optional): Executable code implementing the skill logic (e.g., Python, shell).
resources/ (optional): Passive files such as data tables or prompt templates.
tests/ (optional): A suite of pytest-compatible unit tests.
memory: Time-stamped, append-only .memory.md file where task-specific notes and edge cases are accumulated.

This concrete file-system-based structure distinguishes each skill as a first-class, inspectable, and evolvable asset, rather than a static, implicit behavior embedded in prompt weights or hidden agent memory.
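As a rough illustration of this layout, the sketch below reads the frontmatter of a SKILL.md file and checks the kebab-case constraint on the name. It is not the framework's own loader. The parser handles only flat `key: value` pairs, an assumption made to keep the example dependency-free; frontmatter with nested inputs and outputs lists would need a full YAML parser. The function names and the example description text are hypothetical.

```python
import re

# Lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Return flat key/value pairs from a leading '---' delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    raise ValueError("missing closing '---' frontmatter delimiter")


def is_valid_name(name: str) -> bool:
    """Check the kebab-case constraint on the skill identifier."""
    return bool(KEBAB_CASE.match(name))


# Hypothetical SKILL.md content, for illustration only.
EXAMPLE_SKILL_MD = """---
name: adaptive-cruise-pid-controller
description: Tune and run a PID controller for adaptive cruise control.
---
# Usage notes go here as ordinary Markdown.
"""

meta = parse_frontmatter(EXAMPLE_SKILL_MD)
print(meta["name"], is_valid_name(meta["name"]))
```

Keeping the interface in a small, machine-readable header is what lets a catalog be built without loading any skill code.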

2. Skill Creation Algorithm: Automated Pipeline

When the agent's planning system determines that no existing skill suffices for a subtask, it triggers the skill_create pipeline, which drafts the skill's interface, code, and test suite, runs the unit tests, and refines the artifacts when tests fail.

Key constraints:

The creation loop continues (create→evaluate→refine→evaluate) until all unit tests pass, or a fixed retry budget is exhausted, at which point creation aborts and the agent falls back to direct reasoning.
No gradient-based learning is employed; "loss" is the test failure signal or runtime verifier feedback.
Skills are not registered for reuse unless they are validated by this process.
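The constraints above can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `draft`, `run_tests`, and `refine` callables stand in for LLM calls and a test runner, and the default budget of 3 retries is a hypothetical value (the source only says the budget is fixed).

```python
from typing import Callable, Optional


def create_skill(
    draft: Callable[[str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    refine: Callable[[str, str], str],
    task: str,
    max_retries: int = 3,  # hypothetical budget
) -> Optional[str]:
    """Return a validated skill, or None if the retry budget is exhausted."""
    skill = draft(task)
    for attempt in range(max_retries + 1):
        passed, feedback = run_tests(skill)
        if passed:
            return skill  # only validated skills reach registration
        if attempt < max_retries:
            # The test failure text is the only correction signal.
            skill = refine(skill, feedback)
    return None  # caller falls back to direct reasoning


# Toy stand-ins: the "skill" is a string and the test looks for a marker.
def toy_draft(task: str) -> str:
    return f"draft for {task}"


def toy_tests(skill: str) -> tuple[bool, str]:
    return ("fixed" in skill, "missing fix")


def toy_refine(skill: str, feedback: str) -> str:
    return skill + " [fixed]"


result = create_skill(toy_draft, toy_tests, toy_refine, "pid-tuning")
print(result)
```

Returning `None` rather than a partially working skill mirrors the rule that unvalidated skills are never registered.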

3. Skill-level Memory: Experience Accumulation and Context Injection

Each skill $k$ possesses a .memory.md file, functioning as a per-skill, append-only log. After each use or upon encountering a non-trivial context (e.g., a rare corner case or input boundary), the agent appends a time-stamped note.

At retrieval time (read_skill), the agent injects both the stable interface (SKILL.md) and the most recent 5–10 lines of .memory.md into its prompt context to surface known idiosyncrasies or edge cases. No vector search is used; skills are indexed by their metadata and recent memory.
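A minimal sketch of the append-and-inject pattern follows. The note format, the function names, and the "Recent notes" heading are assumptions made for illustration; the source specifies only that notes are time-stamped, append-only, and that the last 5–10 lines are injected.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_note(memory_file: Path, note: str) -> None:
    """Append one time-stamped line; existing entries are never rewritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with memory_file.open("a", encoding="utf-8") as fh:
        fh.write(f"- [{stamp}] {note}\n")


def recent_notes(memory_file: Path, limit: int = 10) -> list[str]:
    """Return up to `limit` of the most recent lines, oldest first."""
    if limit <= 0 or not memory_file.exists():
        return []
    lines = memory_file.read_text(encoding="utf-8").splitlines()
    return lines[-limit:]


def build_context(skill_md: str, memory_file: Path, limit: int = 10) -> str:
    """Concatenate the stable interface with the tail of the memory log."""
    tail = "\n".join(recent_notes(memory_file, limit))
    return f"{skill_md}\n\n## Recent notes\n{tail}"
```

Reading only the tail keeps the injected context bounded no matter how long the log grows.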

4. Skill Management, Selection, and Lifecycle Operations

For each new task, MUSE-Autoskill builds a lightweight catalog by parsing the YAML frontmatter of all skill SKILL.md files. This catalog is then injected into the agent's system prompt. At planning time, skills are ranked and selected as follows:

Task embedding $d_{\text{task}}$ and each skill's description embedding $d_k$ are computed.
Similarity is scored via

$\text{score}(k) = \frac{d_{\text{task}} \cdot d_k}{\|d_{\text{task}}\|\|d_k\|}$

The top-$K$ (typically $K=3$) candidates by score are shortlisted, with a secondary LLM-driven reasoning step to select the best fit.
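The scoring and shortlisting step can be sketched in plain Python. The embedding vectors here are toy values chosen by hand, since the source does not name an embedding model, and the function names are hypothetical. The secondary LLM-driven selection is not modeled.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shortlist(
    task_vec: list[float],
    skill_vecs: dict[str, list[float]],
    k: int = 3,
) -> list[str]:
    """Return the names of the k skills most similar to the task."""
    ranked = sorted(
        skill_vecs,
        key=lambda name: cosine(task_vec, skill_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy two-dimensional "embeddings", for illustration only.
skills = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(shortlist([1.0, 0.0], skills))
```

Because the score is normalized by both vector norms, a long description does not outrank a short one simply by having a larger embedding magnitude.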

Maintenance is automatic:

Refinement: Triggered by failing unit tests or runtime verifier feedback.
Merging: If two skills' interface descriptions and code overlap beyond a set threshold, they are merged to prevent bloat.
Pruning: Skills unused for a threshold number of tasks or failing a threshold number of tests are archived, preventing skill bank drift.
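One way the pruning rule might be expressed is shown below. The threshold values, the field names, and the use of `>=` as the comparison are all assumptions made for this sketch; the source does not state them.

```python
from dataclasses import dataclass


@dataclass
class SkillStats:
    name: str
    tasks_since_last_use: int
    failing_tests: int


def select_for_archive(
    stats: list[SkillStats],
    max_idle_tasks: int,      # hypothetical threshold
    max_failing_tests: int,   # hypothetical threshold
) -> list[str]:
    """Return names of skills that are stale or persistently failing."""
    return [
        s.name
        for s in stats
        if s.tasks_since_last_use >= max_idle_tasks
        or s.failing_tests >= max_failing_tests
    ]


bank = [
    SkillStats("fresh-skill", tasks_since_last_use=2, failing_tests=0),
    SkillStats("stale-skill", tasks_since_last_use=50, failing_tests=0),
    SkillStats("broken-skill", tasks_since_last_use=1, failing_tests=4),
]
print(select_for_archive(bank, max_idle_tasks=20, max_failing_tests=3))
```

Archiving rather than deleting keeps a pruned skill recoverable if a later task needs it.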

5. Evaluation and Iterative Refinement

After registration, every skill is subject to ongoing validation:

Offline: Unit tests are re-executed whenever the code or test suite changes. Failure invokes the update_skill refinement loop, which closely mirrors the creation loop, proposing LLM-based patches until tests pass or a retry budget is reached.
Online: During live task execution, if skill invocation yields an unexpected result (e.g., output rejected by a runtime verifier), the error context is captured and provided to the update_skill loop for patching and retesting.

This disciplined refine→test→refine process ensures skill quality is not static but responsive to both developmental and operational feedback. No parameter gradient updates are performed; the correction signal is purely pass/fail from the test or verifier.
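The pass/fail signal can be obtained from a test command's exit code. The helper below is a sketch under that assumption, with a hypothetical name and timeout. It is shown with trivial Python one-liners so that it runs anywhere; for a skill's own suite the command might be something like `[sys.executable, "-m", "pytest", "-q", "tests/"]`, since pytest exits with code 0 only when all collected tests pass.

```python
import subprocess
import sys


def run_check(cmd: list[str], timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run a test command; exit code 0 counts as pass, anything else as fail."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    # The combined output is the error context handed to the refinement step.
    return proc.returncode == 0, proc.stdout + proc.stderr


ok, log = run_check([sys.executable, "-c", "print('all good')"])
bad, err = run_check([sys.executable, "-c", "raise SystemExit('boom')"])
print(ok, bad)
```

Capturing stdout and stderr alongside the boolean gives the patching step something concrete to act on, while the decision to retry still depends only on pass or fail.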

6. End-to-End Workflow and Case Illustration

The complete lifecycle is as follows:

Planning: Agent determines existing skills are insufficient.
Creation: The skill_create pipeline drafts and implements artifacts.
Evaluation: Unit tests are run.
Refinement: Creation/test/refine loop until validated.
Registration: Validated skill is moved into the active skill bank; entry logged in .memory.md.
Management/Retrieval: Skills are cataloged and ranked by similarity.
Execution: Code is run in a secure sandbox; context is updated.
Runtime Feedback: Failures during execution trigger return to refinement.

Case study: An adaptive cruise PID controller skill ("adaptive-cruise-pid-controller") was generated using this procedure. Before skill creation, raw LLM ReAct solved 2/5 runs (40% mean). After pipeline execution (auto-drafting interface, code, and test suite, passing after the second attempt), task correctness rose to 5/5 (100%) for MUSE-Autoskill, and cross-agent transfer to Hermes achieved 60% (compared to 20% with no skill and 80% with a human-authored skill). Generation required approximately 164 seconds and 383K tokens; each use consumed 411s and 493K tokens (–37% latency, –20% tokens, vs. human).

7. Implications and Best Practices

The MUSE-Autoskill Skill Creator demonstrates that treating skills as managed bundles—documentation, code, tests, and evolving memory—enables a transition from brittle, one-shot, or prompt-imprinted capabilities to robust, continuously-improvable assets. Best practices extracted from the framework include:

Implementing structured tooling for skill_create and update_skill.
Enforcing a test/refine loop within secure sandboxed execution.
Indexing by explicit YAML frontmatter for efficient retrieval.
Persisting per-skill memory in append-only Markdown logs.
Integrating all creation, execution, and refinement operations into a unified agent loop (Lin et al., 26 May 2026).

This structured lifecycle supports durable skill reuse and effective error handling, and facilitates cross-agent transfer in zero-shot and few-shot generalization regimes.


References (1)

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation (2026)
