SkillWeaver: Modular Skill Ecosystem

Updated 4 July 2026

SkillWeaver is a research framework that encapsulates diverse representations of skills—from distributed sociomaterial practices to plug-and-play APIs and modular LLM enhancements.
It employs iterative loops of skill proposal, synthesis, and honing to enable autonomous discovery and refinement, yielding measurable performance improvements.
The framework challenges static views of competence by promoting transparent, adaptable skill evolution and transfer across workplace, web, and AI systems.

SkillWeaver is a research term used in several adjacent but non-identical senses. In workplace studies, it denotes a hypothetical system for representing and supporting skills as an ongoing, distributed, sociomaterial accomplishment rather than as a static individual attribute (Niklasson et al., 2024). In web-agent research, it denotes a framework in which agents autonomously discover, practice, distill, and hone reusable skills as lightweight, plug-and-play APIs (Zheng et al., 9 Apr 2025). In large-language-model systems, the name is also used informally for “SkillWeave,” a modular self-improvement framework that partitions specialization into domain-specific “skillpacks” under fixed memory and latency budgets (Li et al., 21 May 2026). Across these usages, the unifying concern is the representation, acquisition, composition, transfer, and evaluation of reusable competence.

1. Terminological scope and representational forms

In current research usage, SkillWeaver is best understood as a family resemblance term rather than a single formalism. The common object is “skill,” but the underlying representation varies substantially across literatures.

Research usage	Main representation	Representative paper
Workplace-theoretic SkillWeaver	Entangled practice, tools, places, relations	(Niklasson et al., 2024)
Web-agent SkillWeaver	Python async APIs over Playwright actions	(Zheng et al., 9 Apr 2025)
SkillWeave / SkillWeaver for LLMs	Skillpacks as lightweight domain-specific delta modules	(Li et al., 21 May 2026)
Parametric or latent skill systems	LoRA adapters generated from textual skills	(Zhao et al., 29 Jun 2026, Yu et al., 4 Jun 2026)
Dynamic textual skill synthesis	Temporary task-specific `SKILL.md` from retrieved trajectories	(Wang et al., 16 May 2026)
Transferable web skills	Polymorphic abstractions or transferable interaction patterns	(Yu et al., 17 Oct 2025, He et al., 16 Jun 2026)

This multiplicity is not merely terminological drift. It reflects different answers to a shared question: what substrate should carry reusable competence? The candidate substrates include sociomaterial practice, executable APIs, textual workflow documents, low-rank parameter deltas, typed semantic units, and transferable structural sketches (Fu et al., 4 May 2026). A plausible implication is that “SkillWeaver” marks a broader shift from monolithic intelligence claims toward modular, inspectable, and task-grounded skill artifacts.

2. Skill as distributed practice and the critique of black-boxed intelligence

The workplace-theoretic formulation asks “What is it to be skilled at work?” and uses skill as a counterpoint to the version of intelligence that appears to be easily blackboxed in systems like Slack (Niklasson et al., 2024). In that account, skill is not a static property of individuals, not a discrete capability that can be plugged into a system, and not a metric-friendly response-time variable. It is an ongoing, distributed, sociomaterial accomplishment enacted through practices, tools, places, and relationships.

The paper’s neurosurgery vignette is the central empirical illustration. When a wire fails to grip in a keyhole procedure, the recognition of “softness in the bone” emerges through the surgeon’s manipulation, the patient’s bodily response, iterative viewing of X-ray images and pre-surgery MRI scans, discussion with colleagues, and coordinated action with the imaging technician. Skill is therefore “distributed across things and actors,” “made visible through the practice,” and enacted in “full duplex” rather than located inside a single operator (Niklasson et al., 2024). The same lens is extended to Slack triage, where threads, channels, timestamps, mentions, and bots become integral to being skilled as a team.

This formulation directly challenges computational marketing tropes in which collaboration platforms promise to centralize teamwork and bolt on intelligence features that put “the right information at one’s fingertips exactly when needed.” In that critique, Amazon Echo “Skills” and “In-Skill Purchase” exemplify a reductive treatment of skill as an app-like unit that can be installed or switched on (Niklasson et al., 2024). The opposing view is that being skilled consists in “coming to be skilled” through attentive, responsive participation in an ecology of practice. In this sense, a SkillWeaver system would not primarily be a competency registry. It would be an infrastructure for seeing, documenting, and supporting how people, artifacts, communication structures, and AI participate in coordinated work.

3. SkillWeaver as self-improving web-agent infrastructure

In agent research, “SkillWeaver” names a concrete framework for web agents that autonomously discover, execute, and refine reusable skills without model fine-tuning (Zheng et al., 9 Apr 2025). The system is organized as a three-stage loop: Skill Proposal, Skill Synthesis, and Skill Honing. Proposal generates candidate short-horizon, reusable tasks from the current webpage observation, existing procedural knowledge, and accumulated semantic knowledge of the site. Synthesis proceeds in three steps: an underlying web agent practices the proposed skills, an LLM-based reward model validates success, and an API synthesis stage distills successful trajectories into Python async functions. Honing then executes unit-style tests, generates parameters for skills with arguments, and debugs failures.
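The three-stage loop can be sketched as plain control flow. This is a minimal illustration under stated assumptions, not the paper's implementation: `SkillLibrary`, `explore`, and all five callables are hypothetical names, and the lambdas in the demo stand in for the web agent, the LLM reward model, and the API synthesizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SkillLibrary:
    """Accumulates synthesized skills (here just source strings) for one website."""
    skills: List[str] = field(default_factory=list)


def explore(
    propose: Callable[[SkillLibrary], str],
    practice: Callable[[str], List[str]],
    validate: Callable[[str, List[str]], bool],
    synthesize: Callable[[str, List[str]], str],
    hone: Callable[[str], Optional[str]],
    iterations: int,
) -> SkillLibrary:
    """Run proposal, synthesis, and honing for a fixed number of iterations."""
    library = SkillLibrary()
    for _ in range(iterations):
        task = propose(library)                    # Skill Proposal
        trajectory = practice(task)                # agent attempts the task
        if not validate(task, trajectory):         # reward model rejects failures
            continue
        candidate = synthesize(task, trajectory)   # distill the trajectory into an API
        honed = hone(candidate)                    # test and debug; None if unfixable
        if honed is not None:
            library.skills.append(honed)
    return library


# Deterministic stubs (assumptions) in place of the agent and LLM components.
tasks = iter(["search_product", "add_to_cart", "checkout"])
library = explore(
    propose=lambda lib: next(tasks),
    practice=lambda task: [f"click:{task}"],
    validate=lambda task, traj: task != "checkout",  # pretend checkout failed
    synthesize=lambda task, traj: f"async def {task}(page): ...",
    hone=lambda code: code,
    iterations=3,
)
```

Only validated and honed candidates reach the library, which is why the failed `checkout` attempt leaves no skill behind.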

The resulting skills are explicit code artifacts. A skill is represented as a Python async function over a Playwright page object and additional human-readable parameters, with a docstring that encodes task description, preconditions, and a “Usage Log” recording successful and failed runs (Zheng et al., 9 Apr 2025). This representation makes skills editable, composable, and shareable. Higher-level skills can call lower-level ones, and the library is incrementally expanded over exploration iterations on each website.
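A skill in this representation might look like the following sketch. The function names, selectors, and log entries are hypothetical illustrations, not taken from the paper; `RecordingPage` is a stand-in that mimics the `fill`, `press`, and `click` methods of a Playwright async page so the example runs without a browser.

```python
import asyncio


async def search_products(page, query: str) -> None:
    """Search the store catalogue for `query`.

    Preconditions: the page shows a storefront with the header search box visible.

    Usage Log:
    - success: query="running shoes" returned a results list
    - failure: query="" left the page unchanged (empty input is rejected)
    """
    # Hypothetical selectors; a synthesized skill would use ones seen in practice.
    await page.fill("input[name='q']", query)
    await page.press("input[name='q']", "Enter")


async def search_and_open_first(page, query: str) -> None:
    """Higher-level skill composed from the lower-level `search_products`."""
    await search_products(page, query)
    await page.click(".result a")


class RecordingPage:
    """Records calls instead of driving a browser (test double, not Playwright)."""

    def __init__(self):
        self.calls = []

    async def fill(self, selector, value):
        self.calls.append(("fill", selector, value))

    async def press(self, selector, key):
        self.calls.append(("press", selector, key))

    async def click(self, selector):
        self.calls.append(("click", selector))


page = RecordingPage()
asyncio.run(search_and_open_first(page, "running shoes"))
```

Because the skill is ordinary code with a docstring, it can be read, edited, and called from other skills in the same way as any library function.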

Empirically, the framework improves the average WebArena success rate from 22.6% to 29.8% with GPT-4o, a relative improvement of 31.8%, and from 9.2% to 14.1% with GPT-4o-mini, which works out to roughly 53% relative from those two figures (Zheng et al., 9 Apr 2025). On live websites, average success rises from 40.2% to 56.2%, a relative improvement of 39.8%. The paper also shows that APIs synthesized by stronger agents can improve weaker agents by up to 54.3% on WebArena (Zheng et al., 9 Apr 2025). At the same time, the study notes limitations: skill use at inference remains brittle, some verified APIs pass because they avoid exceptions rather than truly solve the task, and complex long-horizon tasks still exceed the base agent’s planning capacity.
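The relative gains follow from the absolute success rates by simple arithmetic. The sketch below recomputes them for the GPT-4o WebArena and live-website figures; the helper name is an assumption, and small differences from published decimals can arise because the inputs here are already rounded.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


webarena_gpt4o = relative_improvement(22.6, 29.8)  # roughly 32%
live_websites = relative_improvement(40.2, 56.2)   # roughly 40%
```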

4. Parametric and latent SkillWeaver substrates in LLM systems

A separate systems line treats skills as modular parameter-space objects rather than code snippets or prompt text. “SkillWeave,” sometimes informally called “SkillWeaver,” introduces a modular self-improvement framework in which a base LLM is specialized per domain, full-parameter task deltas are extracted, shared knowledge is merged into a backbone, and the residual domain-specific information is compressed into fully-quantized, low-rank “skillpacks” via SkillZip (Li et al., 21 May 2026). At inference time, a shared backbone is always active and a single appropriate skillpack is dynamically selected and attached. In the LLM-as-agent setting, the paper compares a 1×7B backbone + 5×0.5B skillpacks design to 5×7B specialized models and to a 1×32B monolithic model, reporting about 4.2× speedup over 32B and 5.5× over 5×7B, while staying within about 3% of the specialized system and about 5% of the 32B monolith on success rates (Li et al., 21 May 2026).
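The memory argument behind this design can be illustrated with nominal parameter counts. This is a back-of-the-envelope sketch only: the domain names are hypothetical, quantization is ignored, and real memory and latency depend on details the arithmetic does not capture.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Skillpack:
    domain: str
    params_b: float  # nominal size in billions of parameters


BACKBONE_B = 7.0
# Five hypothetical domains, each with a 0.5B skillpack.
SKILLPACKS: Dict[str, Skillpack] = {
    d: Skillpack(d, 0.5) for d in ("web", "code", "math", "os", "db")
}


def select_skillpack(domain: str) -> Skillpack:
    """Pick the single skillpack to attach for this request."""
    return SKILLPACKS[domain]


def active_params_b(domain: str) -> float:
    """Backbone is always active; exactly one skillpack is attached."""
    return BACKBONE_B + select_skillpack(domain).params_b


stored_modular_b = BACKBONE_B + sum(p.params_b for p in SKILLPACKS.values())
stored_specialized_b = 5 * 7.0  # five separate 7B specialists
stored_monolith_b = 32.0
```

Under these nominal sizes the modular design stores 9.5B parameters and activates 7.5B per request, against 35B for five specialists and 32B for the monolith.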

Related work moves from textual skills to LoRA-based skill parameterization. “ParametricSkills” converts free-form textual skills into LoRA adapters at test time through a hypernetwork, enabling context-free skill exploitation and reporting that it outperforms in-context learning by an average of 6.44 judge points on six complex software engineering subtasks, with higher BERTScore and F1 as well (Zhao et al., 29 Jun 2026). “LatentSkill” similarly compiles textual skills into plug-and-play LoRA adapters and removes per-step skill tokens from the prompt; it improves ALFWorld success by 21.4 and 13.4 points on seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead (Yu et al., 4 Jun 2026).
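For orientation, the standard LoRA update that such adapters build on can be written as follows. The first line is the usual low-rank formulation; the second line, with a hypernetwork $H_\phi$ mapping a textual skill $s$ to adapter factors, is notation assumed here for illustration and may differ from the cited papers.

```latex
% Standard LoRA: frozen weight W_0 plus a scaled rank-r update
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)

% Assumed notation: a hypernetwork produces the factors from skill text s
(A_s, B_s) = H_\phi(s),
\qquad W'_s = W_0 + \frac{\alpha}{r}\, B_s A_s
```

Once $A_s$ and $B_s$ are generated, the skill text no longer needs to occupy the prompt, which is the source of the token savings reported above.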

A third variant keeps skills textual but synthesizes them at test time. “Skills on the Fly” proposes SkillTTA, which retrieves a small set of relevant trajectories and synthesizes a temporary task-specific SKILL.md for the current query (Wang et al., 16 May 2026). On SpreadsheetBench, task-specific skills improve Pass@1 from 0.397 to 0.505; on BigCodeBench, Pass@1 rises from 0.517 to 0.651; and on ALFWorld the method approaches a heavier memory-learning baseline while producing the shortest successful trajectories among reported methods (Wang et al., 16 May 2026). Taken together, these papers suggest that SkillWeaver can operate in prompt space, code space, or weight space, with the design choice driven by latency, privacy, memory, and transfer constraints.
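The retrieve-then-synthesize pattern can be sketched in a few lines. This is a toy illustration under stated assumptions: word-overlap retrieval and a fixed template stand in for the learned retriever and LLM synthesis, and all function names and example trajectories are hypothetical.

```python
from typing import List, Tuple

Trajectory = Tuple[str, List[str]]  # (task description, steps taken)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude stand-in for a retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, trajectories: List[Trajectory], k: int = 2) -> List[Trajectory]:
    """Return the k trajectories whose task text best matches the query."""
    ranked = sorted(trajectories, key=lambda t: token_overlap(query, t[0]), reverse=True)
    return ranked[:k]


def synthesize_skill_md(query: str, retrieved: List[Trajectory]) -> str:
    """Assemble a temporary, task-specific skill document from retrieved steps."""
    lines = [f"# Temporary skill for: {query}", "", "## Steps seen in similar tasks"]
    for task, steps in retrieved:
        lines.append(f"### {task}")
        lines.extend(f"- {step}" for step in steps)
    return "\n".join(lines)


trajectories = [
    ("sum a column in a spreadsheet", ["open workbook", "select column", "write SUM formula"]),
    ("rename a git branch", ["git branch -m old new"]),
    ("average a column in a spreadsheet", ["open workbook", "select column", "write AVERAGE formula"]),
]
query = "sum column B in the spreadsheet"
retrieved = retrieve(query, trajectories)
skill_md = synthesize_skill_md(query, retrieved)
```

The synthesized document is discarded after the query, so nothing persistent is added to a skill library.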

5. Transfer, routing, and composition across skills and environments

A central problem for any SkillWeaver system is not only creating skills, but determining how they generalize and compose. “PolySkill” addresses over-specialization by decoupling a skill’s abstract goal from its concrete implementation through polymorphic abstraction (Yu et al., 17 Oct 2025). Skills are organized as abstract interfaces with website-specific implementations, allowing cross-site reuse and composition. The paper reports that this design improves skill reuse by 1.7x on seen websites, boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, and reduces steps by over 20% (Yu et al., 17 Oct 2025). Its explicit “Skill Compositionality” metric measures how often newly induced skills call earlier ones, making hierarchical reuse observable rather than anecdotal.
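The separation of abstract goal from concrete implementation maps naturally onto an abstract base class. The sketch below is an assumption about how such an interface could look, with hypothetical shop names and action strings; it is not PolySkill's actual code.

```python
from abc import ABC, abstractmethod
from typing import List


class AddToCart(ABC):
    """Abstract goal: put an item in the cart. Implementations are site-specific."""

    @abstractmethod
    def actions(self, item: str) -> List[str]:
        """Return the concrete UI actions for this website."""


class ShopAAddToCart(AddToCart):
    def actions(self, item: str) -> List[str]:
        return [f"search:{item}", "click:#add-to-cart"]


class ShopBAddToCart(AddToCart):
    def actions(self, item: str) -> List[str]:
        return [f"open:/p/{item}", "click:button.buy", "click:button.confirm"]


def buy(skill: AddToCart, item: str) -> List[str]:
    """Composite skill written against the interface, so it is reusable across sites."""
    return skill.actions(item) + ["click:checkout"]


plan_a = buy(ShopAAddToCart(), "mug")
plan_b = buy(ShopBAddToCart(), "mug")
```

The composite `buy` skill never mentions a site, so supporting a new website means adding one implementation rather than re-inducing the whole hierarchy.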

“Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns” proposes SkillMigrator, which stores each induced web skill as a transferable interaction pattern consisting of an intent, an operation template, a slot schema, and a structural sketch of the accessibility-tree skeleton where the skill was validated (He et al., 16 Jun 2026). Retrieval combines text similarity with layout similarity via normalized Tree Edit Distance, and slot grounding is performed through two Hungarian-assignment stages. Compared with state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8–10% across WebArena and Mind2Web at matched success rate (He et al., 16 Jun 2026). This reframes transfer not as instruction similarity, but as structural alignment.
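The stored pattern and the two retrieval and grounding ideas can be sketched as follows. Several things here are assumptions rather than the paper's method: the linear blend with weight `alpha`, the precomputed similarity inputs (tree edit distance itself is not implemented), and the brute-force minimum-cost matching, which stands in for the Hungarian-assignment stages and is only practical for a handful of slots.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List


@dataclass
class InteractionPattern:
    intent: str
    operation_template: str
    slot_schema: List[str]
    structural_sketch: str  # e.g. a serialized accessibility-tree skeleton


def retrieval_score(text_sim: float, normalized_ted: float, alpha: float = 0.5) -> float:
    """Blend text similarity with layout similarity (1 - normalized tree edit distance)."""
    return alpha * text_sim + (1.0 - alpha) * (1.0 - normalized_ted)


def ground_slots(
    slots: List[str], elements: List[str], cost: Dict[str, Dict[str, float]]
) -> Dict[str, str]:
    """Minimum-cost one-to-one slot-to-element assignment, found by brute force."""
    best = min(
        permutations(elements, len(slots)),
        key=lambda perm: sum(cost[s][e] for s, e in zip(slots, perm)),
    )
    return dict(zip(slots, best))


pattern = InteractionPattern(
    intent="search the site",
    operation_template="fill({query}); click({submit})",
    slot_schema=["query", "submit"],
    structural_sketch="form(textbox, button)",
)
# Hypothetical mismatch costs between slots and elements on the new page.
cost = {
    "query": {"textbox#1": 0.1, "button#2": 0.9, "link#3": 0.8},
    "submit": {"textbox#1": 0.9, "button#2": 0.2, "link#3": 0.6},
}
mapping = ground_slots(pattern.slot_schema, ["textbox#1", "button#2", "link#3"], cost)
score = retrieval_score(text_sim=0.8, normalized_ted=0.25)
```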

Composition and routing are also becoming explicit algorithmic problems. “Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose” formalizes the setting where a complex query must be decomposed into atomic subtasks, the appropriate skill retrieved for each subtask, and an executable plan composed from the results. The proposed pipeline uses an LLM task decomposer, a bi-encoder skill retriever with FAISS indexing, and a dependency-aware DAG planner (Gao, 16 Jun 2026). In a related but non-web setting, “SkillCom” decomposes semantic communication into semantic abstraction, channel-adaptive transmission, receiver-side repair, and task execution skills connected by typed semantic-unit interfaces, showing that explicit skill decomposition is more robust and diagnosable than monolithic approaches (Fu et al., 4 May 2026). A common misconception is that one retrieved skill per task is sufficient; the newer routing literature treats multi-skill decomposition and composition as first-class requirements.
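The decompose, retrieve, and compose steps can be sketched with the standard library. Everything specific here is an assumption for illustration: the decomposition is hard-coded where a real system would call an LLM, word overlap replaces the bi-encoder and FAISS index, and the skill names and descriptions are hypothetical. `graphlib.TopologicalSorter` supplies the dependency-aware ordering.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical skill registry: name -> description.
SKILLS: Dict[str, str] = {
    "fetch_csv": "download a csv file from a url",
    "clean_data": "clean missing values and duplicates in tabular data",
    "plot_chart": "plot a chart from tabular data",
}


def retrieve_skill(subtask: str) -> str:
    """Pick the skill whose description shares the most words with the subtask."""
    words = set(subtask.lower().split())
    return max(SKILLS, key=lambda name: len(words & set(SKILLS[name].split())))


def compose_plan(subtasks: Dict[str, Set[str]]) -> List[str]:
    """Order subtasks so prerequisites run first, then map each to a skill."""
    order = TopologicalSorter(subtasks).static_order()
    return [retrieve_skill(subtask) for subtask in order]


# Decomposition of "chart the cleaned sales data": subtask -> prerequisite subtasks.
subtasks = {
    "download the sales csv file": set(),
    "clean missing values in the data": {"download the sales csv file"},
    "plot a bar chart of the data": {"clean missing values in the data"},
}
plan = compose_plan(subtasks)
```

A single retrieved skill could not cover this query; the plan needs three skills in a fixed dependency order.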

6. Evaluation, optimization, and the skill lifecycle

Recent work increasingly treats skills as artifacts that require auditing, rewriting, and continuous evolution rather than one-time synthesis. “OpenSkillEval” evaluates 30 open-source skills across 677 automatically generated task instances spanning presentation generation, front-end web design, poster generation, data visualization, and report generation (Ying et al., 22 May 2026). Its findings are cautionary: skill availability does not guarantee effective skill usage; under default settings, agents read SKILL.md in only about 48% of tasks, rising to about 94% under force-using instructions; skills often increase token usage by 3–5×; and many publicly popular skills do not consistently outperform base agents without skills (Ying et al., 22 May 2026). This directly contradicts the assumption that a larger skill ecosystem automatically yields better agent performance.

“What Should a Skill Remember?” studies skill rewriting as a quality-cost problem rather than prompt compression (Xing et al., 8 Jun 2026). The paper shows that shorter rewritten skills can increase downstream agent-token usage if they remove sparse operational anchors such as API constructors, CLI flags, formulas, or recovery rules. In its main held-out evaluation, the learned rewriting policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%, while in frozen cross-model transfer the corresponding reductions average 14.7% and 13.7%, with verifier quality preserved (Xing et al., 8 Jun 2026). The underlying design principle is that skills should remember operational anchors, not generic exposition.
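One way to make the "operational anchors" principle concrete is a check that a rewrite has not dropped them. The sketch below is a simple heuristic and not the paper's learned rewriting policy: the regular expression, which treats backticked spans and `--flags` as anchors, and the example skill text are assumptions.

```python
import re
from typing import Set

# Assumed anchor shapes: backticked code spans and long-form CLI flags.
ANCHOR_PATTERN = re.compile(r"`[^`]+`|--[A-Za-z][\w-]*")


def operational_anchors(skill_text: str) -> Set[str]:
    """Extract code spans and CLI flags that an agent is likely to need verbatim."""
    return set(ANCHOR_PATTERN.findall(skill_text))


def dropped_anchors(original: str, rewritten: str) -> Set[str]:
    """Anchors present in the original skill but missing from the rewrite."""
    return operational_anchors(original) - operational_anchors(rewritten)


original = (
    "Use `openpyxl.load_workbook(path)` to open the file. "
    "Always pass --no-cache when retrying. "
    "This skill helps you work with spreadsheets in general."
)
too_short = "Open the file and retry if needed."
anchored = "Open with `openpyxl.load_workbook(path)`; retry with --no-cache."

lost_in_short = dropped_anchors(original, too_short)
lost_in_anchored = dropped_anchors(original, anchored)
```

Both rewrites are shorter than the original, but only the second keeps the constructor call and the flag that the agent would otherwise have to rediscover at extra token cost.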

Skill creation and refinement have also been formalized as a closed loop in domain settings. “SkillForge” introduces a Domain-Contextualized Skill Creator grounded in knowledge bases and historical support tickets, then a self-evolution loop consisting of a Failure Analyzer, Skill Diagnostician, and Skill Optimizer (Liu et al., 9 Apr 2026). Evaluated on 1,883 tickets and 3,737 tasks across five cloud support scenarios, it reports that domain-contextualized initial skills outperform generic skill creators and that iterative self-evolution improves skill quality from expert-authored, domain-created, and generic starting points, with the final system outperforming a production legacy system by 13.76 percentage points in Strict Consistency Rate (Liu et al., 9 Apr 2026). The broader survey literature frames such systems as self-evolving agents that modify models, memory, tools, and architecture over time rather than remaining static (Gao et al., 28 Jul 2025). This suggests that mature SkillWeaver systems are likely to require an explicit lifecycle: discover, represent, route, evaluate, rewrite, and, when appropriate, retire skills.
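The lifecycle named above can be sketched as a small state machine. The stage names follow the list in the text, but the allowed transitions are an assumption made for illustration; none of the cited papers prescribes this particular graph.

```python
from enum import Enum
from typing import Dict, Set


class Stage(Enum):
    DISCOVERED = "discovered"
    REPRESENTED = "represented"
    ROUTED = "routed"
    EVALUATED = "evaluated"
    REWRITTEN = "rewritten"
    RETIRED = "retired"


# Assumed transition graph: evaluation can send a skill back into service,
# on to a rewrite, or out of the library.
TRANSITIONS: Dict[Stage, Set[Stage]] = {
    Stage.DISCOVERED: {Stage.REPRESENTED},
    Stage.REPRESENTED: {Stage.ROUTED},
    Stage.ROUTED: {Stage.EVALUATED},
    Stage.EVALUATED: {Stage.ROUTED, Stage.REWRITTEN, Stage.RETIRED},
    Stage.REWRITTEN: {Stage.ROUTED},
    Stage.RETIRED: set(),
}


def advance(current: Stage, nxt: Stage) -> Stage:
    """Move a skill to the next stage, rejecting transitions the graph forbids."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {nxt.value}")
    return nxt


stage = Stage.DISCOVERED
for nxt in (Stage.REPRESENTED, Stage.ROUTED, Stage.EVALUATED, Stage.REWRITTEN, Stage.ROUTED):
    stage = advance(stage, nxt)
```

Making retirement an explicit terminal state reflects the evaluation findings above: a skill that does not outperform the base agent costs tokens without earning them.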

A recurrent misconception across these papers is that “skill” names a single object type. The literature instead supports a more plural definition: a skill may be a sociomaterial pattern of work, a verified API, a textual workflow file, a temporary synthesized procedure, a polymorphic interface with multiple implementations, or a low-rank adapter in weight space. What remains stable is not the substrate but the function: a reusable, inspectable unit of competence that can be selected, composed, transferred, and improved under feedback.