Skill Creator: Modular, Reusable Skills
- Skill Creator is an automated framework that generates structured skills as code, documentation, and metadata for effective LLM task-solving.
- The framework employs an iterative process—draft, test, and refine—using unit tests and runtime feedback to ensure skills are robust and error-free.
- Its design integrates per-skill memory and catalog-based selection, enabling efficient reuse and enhanced cross-agent performance.
A Skill Creator is an automated agent or framework component that synthesizes structured, reusable skills—externalized as testable code, documentation, and metadata artifacts—for use by LLM agents in complex task-solving. In the MUSE-Autoskill architecture, a skill is not a monolithic prompt or black-box subroutine, but a modular asset encapsulated on disk with a specific lifecycle: creation, memory, management, evaluation, and refinement. This article systematically presents the technical definition, creation pipeline, integration with per-skill memory, management protocols, and evaluation-driven refinement that comprise the Skill Creator as introduced in "MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation" (Lin et al., 26 May 2026).
1. Formal Definition of a Skill Artifact
In MUSE-Autoskill, a skill is a multi-part bundle externalized on disk. Formally:
- meta: The SKILL.md file, combining YAML frontmatter and Markdown documentation. The frontmatter declares:
  - name: unique kebab-case identifier
  - description: natural-language summary
  - inputs: list of declared inputs
  - outputs: list of declared outputs
- scripts/ (optional): Executable code implementing the skill logic (e.g., Python, shell).
- resources/ (optional): Passive files such as data tables or prompt templates.
- tests/ (optional): A suite of pytest-compatible unit tests.
- memory: A time-stamped, append-only .memory.md file where task-specific notes and edge cases are accumulated.
This concrete file-system-based structure distinguishes each skill as a first-class, inspectable, and evolvable asset, rather than a static, implicit behavior embedded in prompt weights or hidden agent memory.
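To make the bundle layout concrete, the following is a minimal sketch of reading the catalog-facing fields out of a SKILL.md file. It is not the paper's implementation: the `SkillMeta` class, the `parse_frontmatter` and `load_meta` helpers, the tiny YAML subset they accept, and the sample field values (including the description text and the input/output names) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

FrontmatterValue = Union[str, List[str]]


@dataclass
class SkillMeta:
    """Catalog-facing fields declared in SKILL.md frontmatter (hypothetical type)."""
    name: str
    description: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def parse_frontmatter(text: str) -> Dict[str, FrontmatterValue]:
    """Parse a small YAML subset: `key: value` scalars and `- item` lists."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must open with a '---' frontmatter fence")
    data: Dict[str, FrontmatterValue] = {}
    open_list = None  # key of the list currently being filled, if any
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing fence: the Markdown body follows
            return data
        if not stripped:
            continue
        if stripped.startswith("- "):
            if open_list is None:
                raise ValueError(f"list item without a key: {stripped!r}")
            data[open_list].append(stripped[2:].strip())
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {stripped!r}")
        key, value = key.strip(), value.strip()
        if value:
            data[key], open_list = value, None
        else:  # empty value: a list follows on the next lines
            data[key], open_list = [], key
    raise ValueError("frontmatter is missing its closing '---' fence")


def _as_list(value: FrontmatterValue) -> List[str]:
    """Accept either a scalar or a list for list-valued fields."""
    return [value] if isinstance(value, str) else list(value)


def load_meta(skill_md_text: str) -> SkillMeta:
    fm = parse_frontmatter(skill_md_text)
    return SkillMeta(
        name=str(fm["name"]),
        description=str(fm["description"]),
        inputs=_as_list(fm.get("inputs", [])),
        outputs=_as_list(fm.get("outputs", [])),
    )


# Hypothetical SKILL.md content; only the skill name appears in the source.
SAMPLE = """---
name: adaptive-cruise-pid-controller
description: PID controller for adaptive cruise control
inputs:
  - target_speed
  - current_speed
outputs:
  - throttle
---
# Usage notes go here
"""
meta = load_meta(SAMPLE)
```

A real deployment would more likely use a full YAML parser; the hand-rolled subset here only keeps the sketch dependency-free.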
2. Skill Creation Algorithm: Automated Pipeline
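When the agent’s planning system determines that no existing skill suffices for a subtask, it triggers the skill_create pipeline. The pipeline drafts the skill's artifacts (interface, code, and test suite), runs the unit tests, and refines the draft until it passes validation.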
Key constraints:
- The creation loop continues (create→evaluate→refine→evaluate) until all unit tests pass, or a fixed retry budget is exhausted, at which point creation aborts and the agent falls back to direct reasoning.
- No gradient-based learning is employed; "loss" is the test failure signal or runtime verifier feedback.
- Skills are not registered for reuse unless they are validated by this process.
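The constraints above can be sketched as a small control loop. This is an illustrative outline only: the `EvalReport` type, the callable parameters (`draft`, `run_tests`, `refine`), and the default `retry_budget` value are assumptions, and the LLM calls and pytest run are replaced by stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalReport:
    """Outcome of one unit-test run (hypothetical type)."""
    passed: bool
    log: str = ""


def skill_create(
    task: str,
    draft: Callable[[str], str],
    run_tests: Callable[[str], EvalReport],
    refine: Callable[[str, str], str],
    retry_budget: int = 3,  # assumed value; the source only says "fixed"
) -> Optional[str]:
    """Run create -> evaluate -> refine -> evaluate until tests pass.

    Returns the validated candidate, or None when the retry budget is
    exhausted, in which case the caller falls back to direct reasoning.
    """
    candidate = draft(task)
    report = run_tests(candidate)
    retries = 0
    while not report.passed and retries < retry_budget:
        # The only correction signal is the test failure log; no gradients.
        candidate = refine(candidate, report.log)
        report = run_tests(candidate)
        retries += 1
    return candidate if report.passed else None


# Stub demo: the first draft fails once, the first refinement passes.
demo = skill_create(
    task="hold a target speed",
    draft=lambda task: "v0",
    run_tests=lambda code: EvalReport(passed=(code == "v1"), log="1 failed"),
    refine=lambda code, log: "v1",
)
```

Returning `None` rather than an unvalidated candidate mirrors the rule that only validated skills are registered for reuse.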
3. Skill-level Memory: Experience Accumulation and Context Injection
Each skill has a .memory.md file that functions as a per-skill, append-only log. After each use, or upon encountering a non-trivial context (e.g., a rare corner case or input boundary), the agent appends a time-stamped note.
At retrieval time (read_skill), the agent injects both the stable interface (SKILL.md) and the most recent 5–10 lines of .memory.md into its prompt context to surface known idiosyncrasies or edge cases. No vector search is used; skills are indexed by their metadata and recent memory.
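A minimal sketch of the append and retrieval steps follows, assuming one note per line. The note format (`- [timestamp] text`), the `append_memory` helper, the "Recent memory:" label, and the `max_memory_lines` parameter are illustrative assumptions; only the file names and the `read_skill` operation name come from the description above.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_memory(skill_dir: Path, note: str) -> None:
    """Append one time-stamped line to .memory.md; existing lines are never rewritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    one_line = " ".join(note.split())  # keep each note on a single line
    with (skill_dir / ".memory.md").open("a", encoding="utf-8") as fh:
        fh.write(f"- [{stamp}] {one_line}\n")


def read_skill(skill_dir: Path, max_memory_lines: int = 10) -> str:
    """Return SKILL.md plus the most recent memory lines, ready for the prompt."""
    interface = (skill_dir / "SKILL.md").read_text(encoding="utf-8").rstrip()
    memory_path = skill_dir / ".memory.md"
    recent = []
    if memory_path.exists() and max_memory_lines > 0:
        all_lines = memory_path.read_text(encoding="utf-8").splitlines()
        recent = all_lines[-max_memory_lines:]
    if not recent:
        return interface
    return interface + "\n\nRecent memory:\n" + "\n".join(recent)


# Demo in a throwaway directory: 12 notes written, last 5 injected.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp)
    (skill_dir / "SKILL.md").write_text("---\nname: demo-skill\n---\n", encoding="utf-8")
    for i in range(12):
        append_memory(skill_dir, f"edge case {i}")
    context = read_skill(skill_dir, max_memory_lines=5)
```

Taking a fixed tail of the log keeps the injected context bounded no matter how long a skill has been in use.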
4. Skill Management, Selection, and Lifecycle Operations
For each new task, MUSE-Autoskill builds a lightweight catalog by parsing the YAML frontmatter of every skill's SKILL.md file. This catalog is then injected into the agent's system prompt. At planning time, skills are ranked and selected as follows:
- Task embedding and each skill's description embedding are computed.
- A similarity score is computed between the task embedding and each description embedding.
- The top-k candidates by score are shortlisted, with a secondary LLM-driven reasoning step to select the best fit.
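The ranking steps above can be sketched as follows. Cosine similarity is an assumption here, chosen as a common default because the scoring formula is not given in this text; the embedding vectors, the two extra skill names, and the value of `k` are likewise made up for illustration. The secondary LLM-driven selection step is not modeled.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(
    task_vec: Sequence[float],
    skill_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every skill description against the task and keep the top k."""
    scored = [(name, cosine(task_vec, vec)) for name, vec in skill_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: max(k, 0)]


# Toy 3-dimensional embeddings; real ones would come from an embedding model.
catalog = {
    "adaptive-cruise-pid-controller": [0.9, 0.1, 0.0],
    "csv-report-writer": [0.0, 0.2, 0.9],   # hypothetical skill
    "pid-tuning-helper": [0.7, 0.6, 0.1],   # hypothetical skill
}
task_embedding = [1.0, 0.2, 0.0]
top = shortlist(task_embedding, catalog, k=2)
top_names = [name for name, _ in top]
```

The shortlist would then be handed to the LLM, which picks the best fit from the surviving candidates.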
Maintenance is automatic:
- Refinement: Triggered by failing unit tests or runtime verifier feedback.
- Merging: If two skills' interface descriptions and code overlap beyond a set threshold, they are merged to prevent bloat.
- Pruning: Skills that go unused for a set number of tasks, or that fail a set number of tests, are archived, preventing skill bank drift.
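As a rough illustration of the pruning rule, the sketch below selects skills for archiving from per-skill usage counters. The `SkillStats` fields, the `select_for_archive` helper, the inclusive comparisons, the two extra skill names, and the threshold values are all assumptions; the actual thresholds are not stated in this text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SkillStats:
    """Per-skill counters a skill bank might track (hypothetical type)."""
    name: str
    tasks_since_last_use: int
    failed_tests: int


def select_for_archive(
    stats: Iterable[SkillStats],
    max_idle_tasks: int,
    max_failed_tests: int,
) -> List[str]:
    """Return names of skills that are idle too long or failing too often."""
    return [
        s.name
        for s in stats
        if s.tasks_since_last_use >= max_idle_tasks
        or s.failed_tests >= max_failed_tests
    ]


bank = [
    SkillStats("adaptive-cruise-pid-controller", tasks_since_last_use=2, failed_tests=0),
    SkillStats("stale-skill", tasks_since_last_use=80, failed_tests=0),
    SkillStats("flaky-skill", tasks_since_last_use=1, failed_tests=4),
]
# Threshold values below are placeholders, not figures from the paper.
to_archive = select_for_archive(bank, max_idle_tasks=50, max_failed_tests=3)
```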
5. Evaluation and Iterative Refinement
After registration, every skill is subject to ongoing validation:
- Offline: Unit tests are re-executed whenever the code or test suite changes. A failure invokes the update_skill refinement loop, which closely mirrors the creation loop, proposing LLM-based patches until tests pass or a retry budget is reached.
- Online: During live task execution, if a skill invocation yields an unexpected result (e.g., output rejected by a runtime verifier), the error context is captured and provided to the update_skill loop for patching and retesting.
This disciplined refine→test→refine process ensures skill quality is not static but responsive to both developmental and operational feedback. No parameter gradient updates are performed; the correction signal is purely pass/fail from the test or verifier.
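The online path can be sketched as a wrapper around skill invocation. Everything here is a hedged illustration: the `ErrorContext` fields, the `run_with_feedback` helper, the verifier's `(ok, message)` return shape, and the stubbed `update_skill` callable are assumptions, and the real loop would patch code on disk and re-run the unit tests rather than return a lambda.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

SkillFn = Callable[..., Any]
Verifier = Callable[[Any], Tuple[bool, str]]


@dataclass
class ErrorContext:
    """What gets handed to the refinement loop after a rejection (hypothetical)."""
    skill_name: str
    inputs: Dict[str, Any]
    output: Any
    verifier_message: str


def run_with_feedback(
    skill_name: str,
    skill_fn: SkillFn,
    inputs: Dict[str, Any],
    verifier: Verifier,
    update_skill: Callable[[ErrorContext], Optional[SkillFn]],
) -> Optional[Any]:
    """Invoke a skill; on verifier rejection, capture context, patch, and retry once."""
    output = skill_fn(**inputs)
    ok, message = verifier(output)
    if ok:
        return output
    context = ErrorContext(skill_name, inputs, output, message)
    patched = update_skill(context)  # stands in for patch + retest
    if patched is None:              # retry budget exhausted upstream
        return None
    retry_output = patched(**inputs)
    ok, _ = verifier(retry_output)
    return retry_output if ok else None


def buggy_throttle(error: float) -> float:
    return 0.5 * error  # can leave the valid [0, 1] range


def throttle_verifier(value: float) -> Tuple[bool, str]:
    ok = 0.0 <= value <= 1.0
    return ok, ("" if ok else f"throttle {value} outside [0, 1]")


captured: List[ErrorContext] = []


def stub_update_skill(context: ErrorContext) -> Optional[SkillFn]:
    captured.append(context)
    return lambda error: min(1.0, max(0.0, 0.5 * error))  # clamped patch


result = run_with_feedback(
    "adaptive-cruise-pid-controller", buggy_throttle, {"error": 4.0},
    throttle_verifier, stub_update_skill,
)
```

The signal passed back is pass/fail plus the captured context, consistent with the absence of gradient updates.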
6. End-to-End Workflow and Case Illustration
The complete lifecycle is as follows:
- Planning: Agent determines existing skills are insufficient.
- Creation: The skill_create pipeline drafts and implements artifacts.
- Evaluation: Unit tests are run.
- Refinement: The create/test/refine loop repeats until the skill is validated.
- Registration: The validated skill is moved into the active skill bank, and an entry is logged in .memory.md.
- Management/Retrieval: Skills are cataloged and ranked by similarity.
- Execution: Code is run in a secure sandbox; context is updated.
- Runtime Feedback: Failures during execution trigger a return to refinement.
Case study: An adaptive cruise PID controller skill ("adaptive-cruise-pid-controller") was generated using this procedure. Before skill creation, a raw LLM ReAct agent solved 2 of 5 runs (40% mean). After the pipeline ran (auto-drafting the interface, code, and test suite, with tests passing after the second attempt), task correctness rose to 5/5 (100%) for MUSE-Autoskill, and cross-agent transfer to Hermes achieved 60% (compared to 20% with no skill and 80% with a human-authored skill). Generation required 2164 seconds and 383K tokens; each use consumed 411 s and 493K tokens (–37% latency and –20% tokens relative to the human-authored skill).
7. Implications and Best Practices
The MUSE-Autoskill Skill Creator demonstrates that treating skills as managed bundles (documentation, code, tests, and evolving memory) enables a transition from brittle, one-shot, or prompt-imprinted capabilities to robust, continuously improvable assets. Best practices extracted from the framework include:
- Implementing structured tooling for skill_create and update_skill.
- Enforcing a test/refine loop within secure sandboxed execution.
- Indexing by explicit YAML frontmatter for efficient retrieval.
- Persisting per-skill memory in append-only Markdown logs.
- Integrating all creation, execution, and refinement operations into a unified agent loop (Lin et al., 26 May 2026).
This structured lifecycle supports durable skill reuse and effective error handling, and it facilitates cross-agent transfer in zero-shot and few-shot generalization regimes.