SkillOS: Learning Skill Curation for Self-Evolving Agents

Published 7 May 2026 in cs.AI and cs.CL | (2605.06614v1)

Abstract: LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.

Authors (16)

Summary

The paper presents a novel RL-driven method for dynamic, modular skill curation in self-evolving agents.
It integrates a frozen executor with a trainable curator, achieving up to a 9.8% improvement in task performance.
The approach employs grouped task streams and composite rewards, enabling efficient and transferable skill evolution.

SkillOS: Reinforcement Learning for Modular Skill Curation in Self-Evolving Agents

Introduction and Motivation

Recent developments in LLM-based agents underscore the need to move beyond stateless, "one-off" problem solvers toward "self-evolving" agents that learn from sequential task exposure. The main bottleneck to robust self-evolution is effective procedural memory: the ability to curate, refine, and reuse experiences as actionable skills. Existing work either depends on human-driven skill specification, uses heuristics for skill editing, or targets purely short-horizon skill adaptation; none of these yields a scalable, generalizable mechanism for long-term skill curation from indirect, delayed feedback.

SkillOS addresses this gap with a modular, RL-driven system for skill curation. It pairs a frozen agent executor, which retrieves and executes skills from a repository, with a trainable skill curator, which edits the repository (insert, update, delete) based on observed trajectories. The paper positions SkillOS as an advance over prior heuristic-based experience distillation, RL-based skill adaptation, and short-horizon memory editing approaches.

Figure 1: SkillOS pairs a frozen Agent Executor with a trainable Skill Curator. The executor retrieves relevant skills from SkillRepo to act; the curator edits the repo (insert/update/delete) based on the resulting experiences, with Markdown as the skill format.

System Architecture

SkillOS structures the agent as a closed loop: for each encountered task, the executor retrieves a subset of relevant skills (via BM25) from the SkillRepo and produces a trajectory; the skill curator then observes the trajectory, any self-assessment signal, and retrieved skills, and outputs edits to the SkillRepo. Crucially, the skills are represented in a Markdown-based format, encapsulating both YAML descriptors and natural-language procedural specifications for ease of retrieval and modification.
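The retrieval half of this loop can be illustrated with a small, self-contained sketch. It assumes a skill file with YAML-style descriptors followed by a Markdown body, and scores skills with a textbook Okapi BM25 formula. The field names (`name`, `description`), the sample skill, and the helper functions are hypothetical and not taken from the paper, whose exact skill schema and retriever settings are not given here.

```python
import math
import re
from collections import Counter

# Hypothetical skill file: YAML-style descriptors, then a procedural body.
SKILL_MD = """---
name: heat-object
description: heat an object in the microwave and place it at the target
---
1. Search likely receptacles for the object.
2. Carry the object to the microwave and heat it.
3. Put the heated object at the target location.
"""


def parse_skill(markdown: str) -> dict:
    """Split a skill file into flat front-matter fields plus a 'body' entry."""
    _, header, body = markdown.split("---", 2)
    skill = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        skill[key.strip()] = value.strip()
    skill["body"] = body.strip()
    return skill


def tokenize(text: str) -> list:
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def retrieve(query: str, docs: list, top_k: int = 2) -> list:
    """Return indices of the top_k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```

In this sketch a task description plays the role of the query, and each skill is indexed by its description concatenated with its body; whether the actual system indexes descriptors, bodies, or both is an assumption here.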

Each operation (insert, update, delete) is executed as a function call with explicit signatures. RL optimization focuses exclusively on the skill curator; the executor remains frozen throughout training to enforce modularity, executor-agnosticism, and efficient credit assignment.
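As a rough illustration of what function-call style repository edits could look like, the sketch below defines a minimal in-memory repository and a dispatcher that applies a list of curator calls. The function names (`insert_skill`, `update_skill`, `delete_skill`), their parameters, and the call dictionary layout are assumptions made for this example; the source only states that the three operations have explicit signatures.

```python
from dataclasses import dataclass, field


@dataclass
class SkillRepo:
    """Hypothetical in-memory repository mapping skill name to Markdown text."""
    skills: dict = field(default_factory=dict)

    def insert_skill(self, name: str, content: str) -> None:
        if name in self.skills:
            raise ValueError(f"skill already exists: {name}")
        self.skills[name] = content

    def update_skill(self, name: str, content: str) -> None:
        if name not in self.skills:
            raise KeyError(name)
        self.skills[name] = content

    def delete_skill(self, name: str) -> None:
        if name not in self.skills:
            raise KeyError(name)
        del self.skills[name]


def apply_calls(repo: SkillRepo, calls: list) -> list:
    """Apply curator calls in order and return one validity flag per call.

    A call is invalid if the operation is unknown, the arguments do not
    match the signature, or the target skill is in the wrong state.
    """
    ops = {
        "insert_skill": repo.insert_skill,
        "update_skill": repo.update_skill,
        "delete_skill": repo.delete_skill,
    }
    flags = []
    for call in calls:
        try:
            ops[call["name"]](**call["arguments"])
            flags.append(True)
        except (KeyError, TypeError, ValueError):
            flags.append(False)
    return flags
```

Returning per-call validity flags rather than raising keeps a malformed edit from aborting the whole curation step, and gives a natural hook for operation-level supervision.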

Figure 2: SkillOS training pipeline. Each training step samples a group of related tasks and initializes an empty SkillRepo. The curator is optimized with composite rewards, enabling self-evolution.

Training Procedure

SkillOS leverages an experience-driven RL scheme for optimizing the curator:

Grouped Task Streams: Training instances are constructed as coherent groups of related tasks, mimicking realistic streaming deployment. By updating the SkillRepo based on early tasks and evaluating on later ones, the system attributes credit for skill curation decisions more effectively (handling the sparse, delayed reward regime).
Composite Reward Design: The RL reward for each curation sequence aggregates several axes:
- Downstream task performance (primary executor-grounded signal)
- Validity of function calls (operation-level supervision)
- Skill content quality (external LLM judge)
- Repository compression (discouraging raw trajectory copying)
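One way the four reward axes above might be combined is a weighted sum, sketched below. The weights, the `CurationOutcome` fields, and the specific compression term (one minus the ratio of repository size to raw trajectory size) are all assumptions for illustration; the source does not give the actual formula or coefficients. The task term is computed only from later tasks in the group, reflecting the idea that earlier trajectories update the repository and later related tasks evaluate it.

```python
from dataclasses import dataclass


@dataclass
class CurationOutcome:
    """Hypothetical summary of one curation rollout over a task group."""
    later_task_successes: list  # executor success flags on later tasks
    call_validity: list         # one flag per curator function call
    judge_score: float          # content quality in [0, 1] from an LLM judge
    repo_chars: int             # total size of the curated repository
    trajectory_chars: int       # total size of the raw trajectories observed


def fraction_true(flags: list) -> float:
    """Share of True flags, or 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def composite_reward(outcome: CurationOutcome,
                     weights: tuple = (1.0, 0.2, 0.2, 0.1)) -> float:
    """Weighted sum of task, validity, quality, and compression terms."""
    w_task, w_valid, w_quality, w_compress = weights
    task = fraction_true(outcome.later_task_successes)
    valid = fraction_true(outcome.call_validity)
    # Penalize repositories that approach the size of the raw trajectories,
    # which would indicate copying rather than distillation.
    ratio = outcome.repo_chars / max(1, outcome.trajectory_chars)
    compress = 1.0 - min(1.0, ratio)
    return (w_task * task + w_valid * valid
            + w_quality * outcome.judge_score + w_compress * compress)
```

With these illustrative weights the downstream task term dominates, which matches its description as the primary signal; the auxiliary terms mainly break ties between rollouts with similar task outcomes.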

Policy optimization uses Group Relative Policy Optimization (GRPO) for stability and sample efficiency.
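GRPO scores each sampled rollout relative to the other rollouts drawn for the same input, instead of relying on a learned value function. A minimal sketch of that normalization step follows; the function name is hypothetical, and the use of the population standard deviation and the `eps` constant are implementation choices assumed here, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward by the mean and std of its own group."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rollouts get the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason reward terms that differentiate between rollouts matter in this setup.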

Experimental Evaluation

Experiments encompass multi-turn agentic tasks (ALFWorld, WebShop) and single-turn reasoning tasks (AIME, GPQA), under various model scales for both executor and curator (Qwen3-8B/32B, Gemini-2.5-Pro, Gemini-3.1-Flash-Lite). SkillOS is benchmarked against both memory-free (no skill memory) and strong memory-based methods (ReasoningBank, MemP), as well as internal ablations (SkillOS-base, SkillOS-gemini).

Key empirical findings include:

Effectiveness: SkillOS delivers consistent, notable success rate improvements across all agentic and reasoning benchmarks: up to +9.8% relative to baselines, with significant reductions in required interaction steps.
Efficiency: Gains are realized not through longer trajectories but by procedural shortcutting and eliminating redundant exploration, validating that curated skills distill transferable, decision-critical knowledge.
Generalization: The trained skill curator improves executors at all scales, and the curated skills display strong cross-domain and cross-model transfer, with RL-trained curation outperforming zero-shot curation by large-scale models (e.g., Gemini-2.5-Pro).

Figure 3: Cross-task generalization results of SkillOS with (a) Qwen3-8B, (b) Qwen3-32B, and (c) Gemini-2.5-Pro as frozen executors. Relative improvements over the baselines are plotted in order from smallest to largest.

Behavioral and Structural Analysis

Ablations confirm the importance of reward shaping (content quality, compression) and proper grouping of training tasks; removing these components degrades success rates and efficiency. Analysis of operation ratios shows a shift during RL training: insertion dominates initially but gives way to more update operations, with deletion increasing modestly, which suggests that RL training fosters adaptive, focused memory refinement.

Skill evolution studies show the transition from verbose, generic skills to execution-relevant, compositional, and meta-strategy skills. Rather than merely accumulating skills, SkillOS curates increasingly structured and abstract procedural knowledge over time.

Figure 4: Evolution dynamics of the curated skills under RL training.

Attribution experiments on ALFWorld indicate that, relative to baselines, SkillOS-curated skills are more precisely targeted, more widely used, and yield higher task success while requiring fewer skills per instance. This supports the claim that targeted RL training on curation, rather than scale or static heuristics, primarily drives the gains.

Case Studies

Figure 5: Case studies of curated skills by SkillOS.

For agentic tasks, SkillOS synthesizes compositional meta-skills (e.g., recovery workflows referencing other skills). For reasoning tasks, the system generates multi-path, constraint-explicit procedural strategies, tailored to support different solution avenues within a domain.

Figure 6: Case study on math-reasoning skill curation. SkillOS-base produces a generic partitioning recipe, while SkillOS curates a concrete and reusable counting framework with explicit constraints, equations, and a worked example.

Implications and Future Directions

SkillOS establishes RL-based, modular skill curation as a practical and effective path toward procedural memory in LLM agents. It demonstrates that small, well-trained curators can outperform larger, unoptimized LLMs in this regime. Modular decoupling facilitates transfer, plug-and-play compatibility, and opens avenues for joint curation-retrieval optimization, hierarchical/compositional skill formation, and multi-agent shared memory architectures.

Anticipated developments include agentic search over exponentially large, compositional skill repositories; extension to hierarchical skill graphs; and shared/exchangeable skill memories for collaborative agent settings.

Conclusion

SkillOS delivers a principled RL framework for skill curation in self-evolving agents, enabling substantial improvements in effectiveness and efficiency over state-of-the-art baselines. Its modular architecture, executor-agnosticism, and robust learning signal design position it as an adaptable foundation for future research in procedural memory, compressed experience embedding, and long-term agent adaptation.
