
SkillOS: RL-Powered Skill Curation

Updated 8 May 2026
  • SkillOS is an experience-driven reinforcement learning framework that decomposes agent intelligence into modular, reusable skills curated from interaction experience.
  • The framework employs a formal MDP and group-based RL with composite rewards to optimize procedural knowledge evolution and efficient skill retrieval.
  • Empirical results show higher success rates and fewer interaction steps across agentic and reasoning tasks, with curator policies generalizing across executors.

SkillOS is an experience-driven reinforcement learning (RL) framework for skill curation in self-evolving LLM-based agents. Designed to overcome the limitations of memory-free and heuristic-based agent architectures, SkillOS decomposes agentic intelligence into modular, reusable skills curated from interaction experience. The framework enables continual procedural knowledge evolution and targeted skill reuse, with reported gains in effectiveness and efficiency across agentic and reasoning domains. SkillOS operationalizes the curation process through a formal Markov Decision Process (MDP), group-based RL with composite rewards, and an external, structured skill repository. The architecture generalizes across task types and LLM backbone executors, providing a foundation for self-improving agentic systems (Ouyang et al., 7 May 2026).

1. Architectural Principles and System Components

SkillOS defines a modular agentic system composed of three interacting components:

  • Frozen Agent Executor ($\pi_\mathcal{L}$): An LLM-based policy that solves incoming tasks $x_t$ by retrieving and conditioning on a subset of relevant skills from the SkillRepo.
  • Trainable Skill Curator ($\pi_\mathcal{S}$): An LLM policy optimized via RL to manage the SkillRepo. The curator decides whether and how to insert, update, or delete skill entries, based on accumulated execution experience.
  • SkillRepo ($\mathcal{S}$): An external, filesystem-like repository where each skill is a Markdown file (YAML header plus Markdown body) that encodes procedural content in the SKILL.md schema.
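As a rough illustration of the repository component, the sketch below stores each skill as a `SKILL.md` file with a YAML header and a Markdown body, and exposes the three curation operations named above. The directory layout (`<root>/<name>/SKILL.md`), the header fields `name` and `description`, and the class and method names are all assumptions made for this example; the source does not specify them.

```python
from pathlib import Path
import tempfile


class SkillRepo:
    """Toy filesystem-backed skill store (hypothetical layout: <root>/<name>/SKILL.md)."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        return self.root / name / "SKILL.md"

    def _write(self, name: str, description: str, body: str) -> None:
        path = self._path(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        # YAML header fields are assumed for illustration only.
        header = f"---\nname: {name}\ndescription: {description}\n---\n\n"
        path.write_text(header + body + "\n", encoding="utf-8")

    def insert(self, name: str, description: str, body: str) -> None:
        if self._path(name).exists():
            raise FileExistsError(name)
        self._write(name, description, body)

    def update(self, name: str, description: str, body: str) -> None:
        if not self._path(name).exists():
            raise FileNotFoundError(name)
        self._write(name, description, body)

    def delete(self, name: str) -> None:
        path = self._path(name)
        path.unlink()          # raises FileNotFoundError if the skill is absent
        path.parent.rmdir()    # remove the now-empty skill directory

    def read(self, name: str) -> str:
        return self._path(name).read_text(encoding="utf-8")

    def names(self) -> list[str]:
        return sorted(p.parent.name for p in self.root.glob("*/SKILL.md"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = SkillRepo(Path(tmp))
        repo.insert("search-under-lamp", "Locate objects near a lamp", "1. Go to the desk.")
        repo.update("search-under-lamp", "Locate objects near a lamp", "1. Go to the desk.\n2. Use the lamp.")
        print(repo.names())
```

Keeping skills as plain files means a curator's output can be inspected, diffed, and version-controlled with ordinary tools, which is one plausible motivation for a filesystem-like design.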

Operational cycle: On each new task $x_t$ and environment observation $o_t$, the executor retrieves the top-$k$ relevant skills $\tilde{\mathcal{S}}_t$ from $\mathcal{S}_t$ via BM25, conditions its chain-of-thought prompt on them, samples an action sequence $a \sim \pi_\mathcal{L}(a \mid x_t, o_t, \tilde{\mathcal{S}}_t)$, and completes a trajectory $\xi_t$. The curator $\pi_\mathcal{S}$ then inspects the execution trace and skill context and emits curation calls (insert, update, or delete) that transform $\mathcal{S}_t$ into $\mathcal{S}_{t+1}$ (Ouyang et al., 7 May 2026).
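The retrieval step of this cycle can be sketched with a small, dependency-free Okapi BM25 scorer. This is a minimal sketch, not the authors' implementation: the tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the example skill texts are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokenizer (an assumption; the source does not specify one)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_top_k(query: str, skills: dict[str, str], k: int = 2,
               k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return names of the k highest-scoring skills with a positive BM25 score."""
    if not skills:
        return []
    tokens = {name: tokenize(text) for name, text in skills.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs or 1.0
    # Document frequency: number of skills containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    query_terms = tokenize(query)

    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores[name] = score

    ranked = sorted(scores, key=lambda n: (-scores[n], n))  # ties broken by name
    return [n for n in ranked[:k] if scores[n] > 0.0]


if __name__ == "__main__":
    skills = {
        "search-under-lamp": "look for the object under the desk lamp then examine it",
        "heat-object": "use the microwave to heat the object",
        "buy-item": "search the shop and click buy",
    }
    print(bm25_top_k("examine object under lamp", skills, k=2))
```

The retrieved skill texts would then be placed into the executor's prompt; how they are formatted there is not described in the source.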

2. Formal RL Problem Definition and Composite Reward

Skill curation is formalized as an MDP over groups of related tasks processed in sequence. Each episode comprises:

  • State: The current SkillRepo $\mathcal{S}_t$ and task $x_t$.
  • Action: The curator's sequence of curation calls.
  • Transition: The repository is updated from $\mathcal{S}_t$ to $\mathcal{S}_{t+1}$.
  • Reward: Revealed post-group and decomposed into four components:

  • Efficacy: Mean fraction of later tasks in the group solved using the evolving SkillRepo.
  • Validity: Fraction of syntactically valid curator function calls.
  • Semantic quality: Quality of the curated skills, judged by an LLM.
  • Compression: A term penalizing large repositories, computed from the repository size and the curator input length.
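To make the composite reward concrete, the sketch below combines the four components as a weighted sum. The weights, the exact form of the compression term (here one minus the capped ratio of repository size to curator input length), and all field names are assumptions for illustration; the source names the components but their exact formulas are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class GroupOutcome:
    """Per-group statistics gathered after the whole task group has run (hypothetical fields)."""
    later_task_successes: list[bool]  # success flags for later tasks in the group
    valid_calls: int                  # syntactically valid curator calls
    total_calls: int                  # all curator calls emitted
    judge_score: float                # LLM-judged semantic quality, assumed in [0, 1]
    repo_tokens: int                  # repository size
    curator_input_tokens: int         # curator input length


def composite_reward(o: GroupOutcome, w_task: float = 1.0, w_valid: float = 0.1,
                     w_quality: float = 0.1, w_compress: float = 0.1) -> float:
    """Weighted sum of efficacy, validity, quality, and compression (weights are assumed)."""
    n = len(o.later_task_successes)
    efficacy = sum(o.later_task_successes) / n if n else 0.0
    # Assumption: a curator that emits no calls is not penalized for invalid syntax.
    validity = o.valid_calls / o.total_calls if o.total_calls else 1.0
    # Assumed compression form: larger repositories relative to the input earn less.
    ratio = o.repo_tokens / o.curator_input_tokens if o.curator_input_tokens else 1.0
    compression = 1.0 - min(1.0, ratio)
    return (w_task * efficacy + w_valid * validity
            + w_quality * o.judge_score + w_compress * compression)


if __name__ == "__main__":
    outcome = GroupOutcome([True, False, True, True], 4, 5, 0.5, 200, 1000)
    print(composite_reward(outcome))
```

Because the reward is only available after the group finishes, a function like this would be called once per group rollout rather than once per task.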

The curator is trained with Group Relative Policy Optimization (GRPO), which computes each rollout's advantage relative to the other rollouts in its group and applies a clipped surrogate loss to the importance ratio between the current and previous curator policies (Ouyang et al., 7 May 2026).
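For readers unfamiliar with GRPO, the sketch below shows the two pieces mentioned here in their commonly used form: group-normalized advantages and a PPO-style clipped surrogate. It is a simplified, sequence-level illustration under stated assumptions (no KL penalty, one log-probability per rollout, `clip_eps=0.2`); the exact variant used in SkillOS may differ.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward by the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new: list[float], logp_old: list[float],
                      advantages: list[float], clip_eps: float = 0.2) -> float:
    """Mean clipped objective; maximize this (or minimize its negative)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio: new policy / old policy
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]          # composite rewards for four rollouts
    adv = group_advantages(rewards)
    old = [-2.0, -2.0, -2.0, -2.0]
    new = [-1.5, -2.5, -1.9, -2.1]          # hypothetical updated log-probabilities
    print(adv, clipped_surrogate(new, old, adv))
```

The clipping limits how far a single update can move the curator away from the policy that generated the rollouts, which matters here because each reward requires running a full task group.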

3. Training Methodology: Task Grouping and Policy Optimization

SkillOS trains the curator using a group-based sampling protocol:

  • Task-Group Construction: Each task is annotated by an LLM (Gemini-2.5-Pro) for skill-relevant attributes (topics, required skills, concepts, strategies, pitfalls). Groups are seeded and extended through similarity and curriculum constraints, ensuring forward skill dependencies without redundancy.
  • RL Training Loop: For each group, the executor runs on each task, the curator emits curation operations, and the SkillRepo is updated. Rewards are calculated post-group and drive updates via GRPO.
  • Skill Retrieval: BM25 is used for efficient inference-time retrieval of the top-$k$ relevant skills (Ouyang et al., 7 May 2026).
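The task-group construction step can be pictured as a greedy seed-and-extend procedure over the annotated attribute sets. The sketch below is one hypothetical way to do this: it uses Jaccard similarity, a lower threshold `lo` to keep tasks related, and an upper threshold `hi` to skip near-duplicates. The similarity measure, thresholds, and group size are assumptions, and the curriculum constraint mentioned above is not modeled.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_groups(task_attrs: dict[str, set[str]], group_size: int = 3,
                 lo: float = 0.2, hi: float = 0.9) -> list[list[str]]:
    """Greedily seed a group, then extend it with the most similar non-duplicate task."""
    remaining = dict(task_attrs)  # preserves insertion order
    groups = []
    while remaining:
        seed = next(iter(remaining))
        del remaining[seed]
        group = [seed]
        while len(group) < group_size:
            last_attrs = task_attrs[group[-1]]
            scored = [(jaccard(last_attrs, attrs), name)
                      for name, attrs in remaining.items()]
            # Keep related tasks (>= lo) but drop near-duplicates (> hi).
            candidates = [(s, n) for s, n in scored if lo <= s <= hi]
            if not candidates:
                break
            candidates.sort(key=lambda c: (-c[0], c[1]))  # best score, then name
            best = candidates[0][1]
            del remaining[best]
            group.append(best)
        groups.append(group)
    return groups


if __name__ == "__main__":
    tasks = {
        "t1": {"geometry", "triangle"},
        "t2": {"geometry", "triangle", "inradius"},
        "t3": {"geometry", "circle", "inradius"},
        "t4": {"shopping", "search"},
    }
    print(build_groups(tasks))
```

Chaining each new task to the previous one, rather than to the seed, lets a group drift gradually across related skills, which is one simple way to approximate forward skill dependencies.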

4. Empirical Results: Agentic and Reasoning Environments

SkillOS demonstrates robust gains across agentic (ALFWorld, WebShop) and reasoning (AIME24, AIME25, GPQA-Diamond) tasks. Quantitative results show:

| Method | ALFWorld Avg. SR | Avg. Steps |
|---|---|---|
| No Memory | 47.9% | 21.1 |
| ReasoningBank | 55.7% | 20.1 |
| MemP | 49.7% | 21.0 |
| SkillOS-base | 53.1% | 20.4 |
| SkillOS | 61.2% | 18.9 |

SkillOS yields a 5.5 percentage point absolute improvement in SR over the strongest memory baseline, ReasoningBank (61.2% vs. 55.7%, about 9.9% relative), and 6% fewer interaction steps (18.9 vs. 20.1). Similar gains are observed in WebShop (e.g., +2.8 pp WebShop-Score) and in reasoning accuracy (about 4 pp), with improvements persisting and increasing with larger executors (Qwen3-32B, Gemini-2.5-Pro) (Ouyang et al., 7 May 2026).
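The headline ALFWorld deltas can be recomputed directly from the table rows. The short script below is only an arithmetic check on the reported numbers; the function and variable names are illustrative.

```python
# (average success rate in %, average interaction steps) per method, from the ALFWorld table
rows = {
    "No Memory": (47.9, 21.1),
    "ReasoningBank": (55.7, 20.1),
    "MemP": (49.7, 21.0),
    "SkillOS-base": (53.1, 20.4),
    "SkillOS": (61.2, 18.9),
}


def compare(method: str, baseline: str) -> dict[str, float]:
    """Absolute SR gain (pp), relative SR gain (%), and step reduction (%)."""
    sr, steps = rows[method]
    base_sr, base_steps = rows[baseline]
    return {
        "abs_pp": round(sr - base_sr, 1),
        "rel_pct": round(100.0 * (sr - base_sr) / base_sr, 1),
        "step_reduction_pct": round(100.0 * (base_steps - steps) / base_steps, 1),
    }


if __name__ == "__main__":
    for baseline in ("ReasoningBank", "No Memory"):
        print(baseline, compare("SkillOS", baseline))
```

Against ReasoningBank, the gap is 5.5 percentage points in absolute terms and roughly 9.9% in relative terms, so it is worth stating which of the two a quoted improvement refers to.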

5. Skill Evolution: Operation Dynamics and Emergent Structure

Skill curation dynamics reveal:

  • Early Training: "Insert" operations dominate as the curator builds an initial repository of skills.
  • Over Time: "Update" operations become more common as the curator learns to refine, recombine, and specialize skills.
  • Skill Structure: Skills progress from generic, verbose instructions to concise, modular Markdown files with sections for failure handling, branching logic, and explicit preconditions.
  • Meta-Skill Emergence: The SkillRepo evolves to encode global strategies (e.g., verification loops, fallback planning) rather than only task-specific recipes (Ouyang et al., 7 May 2026).

Case studies include the emergence of multi-stage recovery procedures ("Search-Under-Lamp" for agentic domains) and multi-path solution strategies ("Inradius/Circumradius Relations" in mathematical problem solving).

6. Generalization Across Executors and Tasks

Curator policies trained with one executor (e.g., Qwen3-8B) transfer effectively to larger or architecturally distinct executors (Qwen3-32B, Gemini-2.5-Pro, Gemini-3.1-Flash-Lite), with large improvements (e.g., 66.4% → 80.2% SR on ALFWorld with Gemini-2.5-Pro). Cross-domain experiments show that a curator trained on reasoning tasks also improves agentic performance and vice versa. This indicates that the curation policy distills broadly reusable procedural curation strategies rather than overfitting to task- or executor-specific heuristics (Ouyang et al., 7 May 2026).

7. Relation to the AgentOS SkillOS Paradigm

SkillOS directly advances the Skill-as-Modules paradigm introduced in AgentOS. In AgentOS, Skills are modular, user-defined micro-services composed by an Agent Kernel, with skill retrieval and task decomposition framed as data mining and pattern mining problems (Liu et al., 9 Mar 2026). SkillOS supplies the missing mechanism for continual, experience-driven skill curation and maintenance, supporting:

  • Automated composition and evolution of skills based on streaming interaction logs.
  • Integration of RL-based curation policies with semantic APIs and sandboxing as defined in AgentOS.
  • Compatibility with the broader KDD agenda: skill mining, pattern discovery, workflow synthesis, and knowledge graph updates as ongoing operating system functions.

A plausible implication is that integrating SkillOS as a curation backend within an AgentOS/SkillOS stack both enables and regularizes continual procedural knowledge growth, addressing the longstanding bottleneck of manual or heuristic skill engineering (Liu et al., 9 Mar 2026, Ouyang et al., 7 May 2026).
