SkillClaw: Collective Skill Evolution

Updated 4 July 2026

SkillClaw is a framework that evolves reusable procedural skills by aggregating multi-user interaction traces into a shared skill repository.
It employs a structured evolution pipeline—collecting session trajectories, processing them with an agentic evolver, and validating updates in real environments—to achieve measurable performance gains.
The system uses version-aware, YAML-based skill documents and controlled repository updates to ensure reproducibility and auditable maintenance of skills.

SkillClaw is a framework for collective skill evolution in multi-user ecosystems of OpenClaw-style LLM agents. It treats interactions across users and over time as the primary signal for improving a shared skill repository: trajectories generated during ordinary use are aggregated, processed by an agentic evolver, validated in real environments, and then synchronized back to users so that improvements discovered in one context propagate system-wide without additional user effort (Ma et al., 9 Apr 2026).

1. Concept and scope

SkillClaw is motivated by a specific limitation of skill-based agents: reusable skills are valuable, but in ordinary deployments they remain largely static after installation. In that regime, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, even when those failures are procedural rather than conceptual. The framework therefore shifts the unit of adaptation from the single session or single user to the shared skill repository itself. Its central notion of “collective skill evolution” is that one user’s failure can expose a weakness in a skill, another user’s success can show what must be preserved, and both can be integrated into a repository-level update (Ma et al., 9 Apr 2026).

The paper positions this idea against two nearby families of prior work. Memory-based methods retain prior trajectories or lessons for later retrieval, but do not necessarily generalize them into reusable procedural artifacts. Static skill libraries compress know-how into reusable instructions, but typically stop at authoring time and do not accumulate deployment evidence across users. SkillClaw’s contribution is to treat multi-user interaction traces as a shared supervisory signal for ongoing skill maintenance, refinement, and extension.

The failure modes emphasized by the framework are procedural and operational. The paper repeatedly points to incorrect tool arguments, missing pre-checks, incorrect sequencing, weak validation logic, brittle environment assumptions, over-broad triggering conditions, and recurring uncovered sub-procedures that should become standalone skills. This focus matters because such errors often do not appear in final answers alone; they become visible only in the intermediate action–feedback trace.

2. Formalization and evidence model

SkillClaw gives a lightweight formalization of the evolving repository. The shared skill set is written as

$\mathcal{S} = \{s_1, \dots, s_M\},$

where each $s_i$ is a reusable procedural artifact. User interactions produce session trajectories $\tau$, and the collected trajectory set is

$\mathcal{T} = \{\tau_i\}.$

Skill evolution is then expressed as

$\mathcal{S}' = \Phi(\mathcal{S}, \mathcal{T}),$

where $\Phi$ is the evolution operator that updates the repository from accumulated evidence (Ma et al., 9 Apr 2026).

The trajectory itself is not reduced to a terminal label. The paper represents it as a preserved causal chain,

$\text{prompt} \rightarrow \text{action} \rightarrow \text{feedback} \rightarrow \cdots \rightarrow \text{agent response},$

and states that a session record includes the user prompt, the agent’s actions, tool calls, intermediate feedback from the environment, errors, explicit user responses, and the final agent response. It also extracts lightweight metadata such as which skills were referenced, whether tool errors occurred, and a coarse quality estimate.
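The session record described above can be sketched as a small data structure. This is a minimal illustration, not the paper's schema: the class names `Step` and `SessionRecord`, the field names, and the numeric type of the quality estimate are all assumptions made here for readability.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One agent action together with the environment feedback it produced."""
    action: str                   # e.g. a tool call as issued by the agent
    feedback: str                 # what the environment returned
    error: Optional[str] = None   # tool error text, if any


@dataclass
class SessionRecord:
    """A trajectory kept as a causal chain rather than a terminal label."""
    prompt: str
    steps: list[Step] = field(default_factory=list)
    user_responses: list[str] = field(default_factory=list)
    final_response: str = ""
    referenced_skills: set[str] = field(default_factory=set)
    quality: Optional[float] = None  # coarse quality estimate (scale assumed)

    @property
    def had_tool_error(self) -> bool:
        # Lightweight metadata derived from the chain itself.
        return any(step.error is not None for step in self.steps)


# Toy record: a read fails because the file was never checked for existence.
record = SessionRecord(
    prompt="Summarize report.pdf",
    steps=[Step("read_file('report.pdf')", "", error="FileNotFoundError")],
    referenced_skills={"validate-file-existence"},
)
```

Keeping the intermediate steps in the record is what makes procedural failures such as a missing pre-check visible, since they would not show up in the final response alone.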

A key organizational device is grouping by referenced skill. If a session $\tau_i$ referenced a set of skills $\mathcal{K}_i$, then for each skill $s$,

$\mathcal{G}(s) = \{\tau_i \mid s \in \mathcal{K}_i\},$

while sessions that used no skill are grouped into

$\mathcal{G}(\emptyset) = \{\tau_i \mid \mathcal{K}_i = \emptyset\}.$

The paper interprets this as a kind of natural ablation: when many sessions invoke the same skill under different conditions and obtain different outcomes, the shared skill acts as the controlled factor. The overall repository loop is summarized as

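$\text{multi-user interaction} \rightarrow \text{session collection} \rightarrow \text{skill evolution} \rightarrow \text{validation} \rightarrow \text{synchronization}.$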

Notably, the framework does not define a reinforcement-learning loss, differentiable objective, or explicit reward-maximization formulation; its formalism is procedural and systems-oriented rather than optimization-theoretic.
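The grouping by referenced skill described earlier in this section can be sketched in a few lines. This is an illustrative sketch only: the function name `group_by_skill`, the `NO_SKILL` key, and the representation of a session as an `(id, referenced_skills)` pair are assumptions, not the paper's implementation.

```python
from collections import defaultdict

NO_SKILL = None  # assumed key for sessions that referenced no skill


def group_by_skill(sessions):
    """Map each skill to the ids of the sessions that referenced it.

    `sessions` is an iterable of (session_id, referenced_skills) pairs.
    A session that referenced several skills appears in several groups.
    """
    groups = defaultdict(list)
    for session_id, skills in sessions:
        if not skills:
            groups[NO_SKILL].append(session_id)
        for skill in skills:
            groups[skill].append(session_id)
    return dict(groups)


sessions = [
    ("t1", {"validate-file-existence"}),
    ("t2", {"validate-file-existence", "validate-tmp-workspace-inputs"}),
    ("t3", set()),
]
groups = group_by_skill(sessions)
```

Because a session can land in more than one group, each skill's group collects every condition under which that skill was invoked, which is what lets the shared skill act as the controlled factor.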

3. Evolution pipeline and operational loop

The operational pipeline begins during ordinary deployment. Agents are used normally by multiple users, and the system logs full session trajectories. Those sessions are then converted into structured evidence preserving action–feedback chains and associated metadata. The structured sessions are aggregated across users and grouped either by referenced skill, giving one group $\mathcal{G}(s)$ per skill $s$, or into the no-skill group $\mathcal{G}(\emptyset)$ (Ma et al., 9 Apr 2026).

For each group, an agentic evolver analyzes recurring success and failure patterns, diagnoses missing or misleading guidance, and chooses an evolution action. In the high-level description, the action space is refine, create, or skip. In the appendix prompts, this is made more specific as improve_skill, optimize_description, create_skill, or skip. The distinction is important: the evolver is not limited to rewriting a skill body. If the content is broadly correct but the triggering description causes misuse, it may rewrite only the description.
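The two levels of the action space can be written down as a small enumeration. The four action strings come from the appendix prompts as quoted above; the mapping onto refine/create/skip and the `apply_action` helper, which treats a skill as a dict with `description` and `body` keys, are assumptions made for illustration.

```python
from enum import Enum


class EvolveAction(Enum):
    IMPROVE_SKILL = "improve_skill"
    OPTIMIZE_DESCRIPTION = "optimize_description"
    CREATE_SKILL = "create_skill"
    SKIP = "skip"


# Assumed correspondence with the high-level refine / create / skip view.
HIGH_LEVEL = {
    EvolveAction.IMPROVE_SKILL: "refine",
    EvolveAction.OPTIMIZE_DESCRIPTION: "refine",
    EvolveAction.CREATE_SKILL: "create",
    EvolveAction.SKIP: "skip",
}


def apply_action(skill, action, new_text):
    """Return an updated copy of `skill` for the two refine-type actions.

    CREATE_SKILL and SKIP leave the existing skill unchanged here.
    """
    updated = dict(skill)
    if action is EvolveAction.OPTIMIZE_DESCRIPTION:
        updated["description"] = new_text  # body is deliberately untouched
    elif action is EvolveAction.IMPROVE_SKILL:
        updated["body"] = new_text
    return updated


skill = {"description": "Use for any file task.", "body": "1. Check the path."}
narrowed = apply_action(
    skill, EvolveAction.OPTIMIZE_DESCRIPTION, "Use before reading input files."
)
```

Separating the description edit from the body edit reflects the point made above: a skill whose content is correct but whose trigger is too broad can be fixed without risking the validated procedure.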

The editing policy is intentionally conservative. The paper and appendix state that validated parts from successful sessions should be preserved; correct environment facts such as API contracts, ports, endpoints, payload formats, filenames, and tool names should not be casually replaced; and the evolver should avoid bloating skills with generic agent-runtime advice unless the evidence is environment-specific. This bias toward localized edits is meant to stabilize repository evolution and prevent regressions from broad rewrites.

Validation occurs before deployment. During nighttime, when user environments are idle, candidate updates are tested under real execution conditions. For a current skill $s$ and a candidate $s'$, relevant tasks are selected from daytime interaction data, both versions are executed with the full toolchain in the same environment, and the outcomes are compared on task success and execution stability. If the update performs better, it is accepted; otherwise it is rejected. Accepted updates are merged into the shared repository and synchronized to users the next day. The paper characterizes this as enabling monotonic deployment behavior: only validated improvements enter the deployed pool.
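The accept/reject step can be sketched as a paired comparison on the same tasks. This is a sketch under stated assumptions: `run_task` is a hypothetical callable standing in for a real execution with the full toolchain, and the lexicographic rule (success rate first, stability as tie-break) is one possible reading of "performs better", not a rule stated in the paper.

```python
from statistics import mean


def validate_candidate(current, candidate, tasks, run_task):
    """Accept `candidate` only if it strictly beats `current` on `tasks`.

    `run_task(skill, task)` is a hypothetical callable that returns a
    (success, stable) pair of booleans from one execution.
    """
    if not tasks:
        return False  # no evidence, so keep the current skill

    def score(skill):
        results = [run_task(skill, task) for task in tasks]
        success_rate = mean(1.0 if ok else 0.0 for ok, _ in results)
        stability = mean(1.0 if stable else 0.0 for _, stable in results)
        return (success_rate, stability)

    # Tuples compare lexicographically: success first, then stability.
    return score(candidate) > score(current)


def fake_run(skill, task):
    # Toy stand-in: "v2" succeeds everywhere, "v1" only on task "a".
    return (skill == "v2" or task == "a", True)


accepted = validate_candidate("v1", "v2", ["a", "b"], fake_run)
```

The strict inequality means a candidate that merely ties is rejected, which is the conservative choice if the goal is that only validated improvements reach the deployed pool.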

4. Skill representation and repository mechanics

The paper characterizes a skill as a reusable procedural artifact, but its appendix reveals a fairly concrete repository format. In practice, a skill is stored as a SKILL.md containing YAML frontmatter plus a Markdown body, and the appendix gives a template for the frontmatter (Ma et al., 9 Apr 2026).

The body can encode environment information, API endpoints, ports, payload schemas, tool names, output paths, command patterns, validation steps, triggering conditions, and exclusion conditions. This suggests that SkillClaw treats skills less as abstract “capabilities” than as repository-managed procedural documents grounded in actual deployment interfaces.
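To make the frontmatter-plus-body layout concrete, the sketch below splits a SKILL.md string into its two parts using only the standard library. The example document is hypothetical: the `name` and `description` fields and the body text are illustrative assumptions and are not the paper's template.

```python
def split_skill_md(text):
    """Split a SKILL.md string into (frontmatter dict, Markdown body).

    Only flat `key: value` lines are handled; a real YAML parser would be
    needed for nested or multi-line values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1)
            if line.strip() == "---"
        )
    except StopIteration:
        return {}, text  # no closing delimiter, so treat it all as body
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical skill document, for illustration only.
example = """---
name: validate-file-existence
description: Check that input files exist before reading them.
---
## Steps
1. List the expected input paths.
2. Stop and report if any path is missing.
"""
meta, body = split_skill_md(example)
```

Keeping the description in the frontmatter and the procedure in the body is what allows an update to touch one without the other.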

Repository management is version-aware. The appendix describes a history ledger under skills/<skill-name>/history/ with v<N>.md snapshots of prior versions and v<N>_evidence.md files recording the evidence and rationale for each transition; newly created skills begin with v0_evidence.md. It also mentions manifest.json and skill_registry.json as read-only references for names, IDs, and versions. Although the main text emphasizes collective evolution, these repository mechanics show that the framework is also designed for auditable skill maintenance over time.
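The ledger layout can be illustrated with a small helper that writes the files named above. The directory and file-name patterns follow the description of the appendix; the function name `write_ledger_entry`, its parameters, and the choice to leave the pairing of version numbers with transitions to the caller are assumptions.

```python
from pathlib import Path
import tempfile


def write_ledger_entry(repo, skill_name, n, evidence, snapshot=None):
    """Write v<n>_evidence.md and, if given, the v<n>.md snapshot.

    Returns the names of the files written, in order.
    """
    history = Path(repo) / "skills" / skill_name / "history"
    history.mkdir(parents=True, exist_ok=True)
    written = []
    if snapshot is not None:
        path = history / f"v{n}.md"
        path.write_text(snapshot, encoding="utf-8")
        written.append(path.name)
    path = history / f"v{n}_evidence.md"
    path.write_text(evidence, encoding="utf-8")
    written.append(path.name)
    return written


with tempfile.TemporaryDirectory() as repo:
    # A newly created skill starts with evidence only.
    created = write_ledger_entry(
        repo, "validate-file-existence", 0, "Evidence for creating the skill."
    )
    # A later entry stores a snapshot next to its evidence file.
    updated = write_ledger_entry(
        repo, "validate-file-existence", 1,
        "Evidence for this transition.", snapshot="Snapshot text.",
    )
    listing = sorted(
        p.name
        for p in (Path(repo) / "skills" / "validate-file-existence" / "history").iterdir()
    )
```

Storing the rationale next to each snapshot is what makes the repository auditable: every version change can be traced back to the sessions that motivated it.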

Creation and refinement are intentionally separated. Existing skills may be refined when evidence shows missing guidance, outdated information, or weak triggering. New skills are created when the grouped evidence reveals a recurring sub-procedure or capability gap that is distinct from the scope of current skills, specific enough to be teachable, and likely to recur. The no-skill group $\mathcal{G}(\emptyset)$ is particularly important here, because it surfaces reusable procedures that users solved without the aid of any existing skill.

5. Empirical evaluation

SkillClaw is evaluated on WildClawBench, a real-world agent benchmark with 60 complex tasks across six capability domains: Productivity Flow, Code Intelligence, Social Interaction, Search & Retrieval, Creative Synthesis, and Safety & Alignment. The benchmark includes a full Linux container with tools, multimodal inputs such as text, code, image, and video, 3–27 metrics aggregated per task, hard constraints where critical errors can yield zero score, long tasks of roughly 15–50 steps, and external dependencies including APIs and model downloads. The reported experiment simulates 6 days (6 rounds) of deployment, with daytime online interaction and nighttime evolution/validation, using 8 concurrent users and Qwen3-Max as the backbone for execution, skill evolution, and validation (Ma et al., 9 Apr 2026).

The paper reports four representative categories in this version:

Category	Day 1	Day 6
Social Interaction	54.01%	60.34%
Search & Retrieval	22.73%	34.55%
Creative Synthesis	11.57%	21.80%
Safety & Alignment	24.00%	32.00%

These correspond to absolute gains of $+6.33$, $+11.82$, $+10.23$, and $+8.00$ points, respectively. In relative terms, the largest increase is in Creative Synthesis $(+88.42\%)$, followed by Search & Retrieval $(+52.00\%)$, Safety & Alignment $(+33.33\%)$, and Social Interaction $(+11.72\%)$. The paper attributes these improvements mainly to the correction of procedural bottlenecks: environment and input validation, retrieval reliability, and more robust execution routines.
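The absolute and relative gains can be recomputed directly from the Day 1 and Day 6 scores in the table. The short script below does only that arithmetic; the variable and function names are illustrative.

```python
# Day 1 and Day 6 scores (percent), copied from the table above.
scores = {
    "Social Interaction": (54.01, 60.34),
    "Search & Retrieval": (22.73, 34.55),
    "Creative Synthesis": (11.57, 21.80),
    "Safety & Alignment": (24.00, 32.00),
}


def gains(day1, day6):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = day6 - day1
    relative = 100.0 * absolute / day1
    return round(absolute, 2), round(relative, 2)


table = {name: gains(*pair) for name, pair in scores.items()}

# Categories ordered by relative gain, largest first.
ranking = sorted(table, key=lambda name: table[name][1], reverse=True)
```

The two orderings differ: Search & Retrieval has the largest absolute gain, while Creative Synthesis has the largest relative gain because it starts from the lowest Day 1 score.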

The category tables also show that many candidate skills are rejected during nighttime validation. In Search & Retrieval, for example, validate-file-existence is accepted early, while more ambitious retrieval-planning updates are often rejected; in Creative Synthesis, validate-tmp-workspace-inputs is accepted, while broader multimodal pipeline skills do not necessarily pass validation. This supports the paper’s claim that repository-level validation is necessary because the evolver produces many plausible but non-deployable candidates.

The paper additionally reports a controlled “Skill Evolve Lite” setting on three custom queries: basic extraction, deadline parsing, and save report. Scores improve on each of the three queries and in their average. The authors use this to argue that SkillClaw helps most when failures stem from missing procedural knowledge rather than from more nuanced reasoning deficits.

6. Position within the OpenClaw skill ecosystem

Later work places SkillClaw in a broader taxonomy of skill-centric and self-evolving agent systems. MOSS explicitly cites SkillClaw as a prior self-evolving agent framework whose editable scope is limited to skills; in its taxonomy, SkillClaw can evolve skills but not prompts, memory, or the agent harness. MOSS therefore treats SkillClaw as a skill-layer contrast class and argues that source-level adaptation is a broader mechanism for repairing structural failures in routing, hook ordering, dispatch, and session logic (Cai et al., 21 May 2026).

OpenClaw-Skill occupies a different point in this landscape. It proposes Collective Skill Tree Search, skill-augmented training data, and Collective Skill Reinforcement Learning to construct structured, diverse, and transferable trees of skills for OpenClaw-style agents. A plausible implication is that OpenClaw-Skill addresses skill construction and exploitation as a training and search problem, whereas SkillClaw addresses post-deployment repository evolution from multi-user interaction evidence (Lin et al., 15 Jun 2026).

SkillAdaptor is even closer methodologically, but operates at a different granularity. It performs training-free, step-level skill adaptation from failed trajectories, identifies a first actionable fault step, links responsibility to candidate skills, and then applies targeted updates under explicit acceptance checks while keeping the backbone frozen. That paper explicitly cites SkillClaw as extending skill adaptation to deployment settings through cross-session feedback aggregation. This suggests a useful taxonomy: SkillAdaptor emphasizes single-trajectory, step-level targeted maintenance, while SkillClaw emphasizes multi-user, repository-level collective evolution over time (Yu et al., 31 May 2026).

A recurring misconception in adjacent literature is to conflate SkillClaw with the OpenClaw substrate or with benchmark artifacts such as claweval. The MOSS paper is explicit that SkillClaw is neither the substrate nor the benchmark: OpenClaw is the production agent substrate, claweval is the evaluation suite, SkillClaw is a prior skill-evolution framework, and MOSS is a source-level self-evolution system applied to OpenClaw (Cai et al., 21 May 2026).

7. Limitations and interpretation

The evidence for SkillClaw is promising but deliberately limited. The authors describe the reported study as a small-scale test with limited user queries, limited feedback signals, and limited interaction depth. WildClawBench has six domains, but only four representative categories are reported in this version. Nighttime validation also incurs extra token and runtime cost because candidate skills must be re-executed with full tool interaction. The paper does not report repository growth curves, redundancy-control statistics, or a detailed privacy treatment for the shared trajectory evidence (Ma et al., 9 Apr 2026).

These limitations matter for interpretation. The empirical gains are strongest where failures are procedural, environment-grounded, and recoverable through better reusable instructions. They are less directly informative about cases where failures are dominated by backbone-model limitations, deep planning errors, or substrate-level logic. Similarly, the claim that a validated accept/reject loop yields monotonic deployment behavior is a systems claim grounded in the repository process, not a formal theorem.

Even with those caveats, the conceptual significance is clear. SkillClaw reframes skills from static reusable documents into a living, shared, deployment-driven substrate for continual improvement. In that view, user interactions are no longer merely logs of task execution; they are evidence for editing, validating, and redistributing the procedural knowledge that governs future behavior.