SkillOS: RL-Powered Skill Curation

Updated 8 May 2026

SkillOS is an experience-driven reinforcement learning framework that decomposes agent intelligence into modular, reusable skills curated from interaction experience.
The framework employs a formal MDP and group-based RL with composite rewards to optimize procedural knowledge evolution and efficient skill retrieval.
Empirical results show higher success rates and fewer interaction steps across agentic and reasoning tasks, with curation policies that generalize across executors.

SkillOS is an experience-driven reinforcement learning (RL) framework for skill curation in self-evolving LLM-based agents. Designed to overcome the limitations of memory-free and heuristic-based agent architectures, SkillOS decomposes agentic intelligence into modular, reusable skills curated from interaction experience. The framework enables continual procedural knowledge evolution and targeted skill reuse, with reported gains in effectiveness and efficiency across agentic and reasoning domains. SkillOS operationalizes the curation process through a formal Markov Decision Process (MDP), group-based RL with composite rewards, and an external, structured skill repository. The architecture generalizes robustly across task types and LLM backbone executors, establishing a foundation for self-improving agentic systems (Ouyang et al., 7 May 2026).

1. Architectural Principles and System Components

SkillOS defines a modular agentic system composed of three interacting components:

Frozen Agent Executor ( $\pi_\mathcal{L}$ ): An LLM-based policy that solves incoming tasks $x_t$ by retrieving and conditioning on a subset of relevant skills from the SkillRepo.
Trainable Skill Curator ( $\pi_\mathcal{S}$ ): An LLM policy optimized via RL to manage the SkillRepo. The curator decides if and how to insert, update, or delete skill entries, based on accumulated execution experience.
SkillRepo ( $\mathcal{S}$ ): An external, filesystem-like repository where each skill is a Markdown file (YAML front matter plus a Markdown body), encoding procedural content in the SKILL.md schema.
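
To make the repository component concrete, here is a minimal in-memory sketch of a skill store that supports the three curation operations named above (insert, update, delete). The class names `Skill` and `SkillRepo`, the `name` and `description` fields, and the serialization layout are illustrative assumptions, not the schema used by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str          # hypothetical front-matter field
    description: str   # hypothetical front-matter field
    body: str          # Markdown body holding the procedural content

    def to_markdown(self) -> str:
        # Front matter between '---' fences, followed by the Markdown body.
        return (
            f"---\nname: {self.name}\n"
            f"description: {self.description}\n---\n{self.body}\n"
        )


@dataclass
class SkillRepo:
    skills: dict[str, Skill] = field(default_factory=dict)

    def insert(self, skill: Skill) -> None:
        if skill.name in self.skills:
            raise KeyError(f"skill already exists: {skill.name}")
        self.skills[skill.name] = skill

    def update(self, name: str, body: str) -> None:
        self.skills[name].body = body  # raises KeyError if the skill is absent

    def delete(self, name: str) -> None:
        del self.skills[name]


repo = SkillRepo()
repo.insert(Skill("find-object", "Locate an object in a room",
                  "1. Check visible surfaces."))
repo.update("find-object",
            "1. Check visible surfaces.\n2. Open closed containers.")
print(repo.skills["find-object"].to_markdown())
```

A real repository would persist each skill as a file on disk; the dictionary here only stands in for that filesystem-like layout.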

Operational cycle: On each new task $x_t$ and environment observation $o_t$ , the executor retrieves the top- $k$ relevant skills $\tilde{\mathcal{S}}_t$ from $\mathcal{S}_t$ via BM25, conditions its chain-of-thought prompt on them, samples an action sequence $a \sim \pi_\mathcal{L}(a|x_t, o_t, \tilde{\mathcal{S}}_t)$ , and completes a trajectory $\xi_t$ . The curator $\pi_\mathcal{S}$ then inspects the execution trace and skill context to emit curation calls (insert, update, or delete), manipulating $\mathcal{S}_t$ to yield $\mathcal{S}_{t+1}$ (Ouyang et al., 7 May 2026).
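
The retrieval step of this cycle can be illustrated with a small from-scratch implementation of Okapi BM25 scoring. This is a sketch under stated assumptions: whitespace tokenization, the common defaults `k1 = 1.5` and `b = 0.75`, and the toy skill texts are all chosen for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_top_k(query: str, docs: dict[str, str], k: int = 2,
               k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return the names of the k documents with the highest BM25 score."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


skills = {
    "heat-object": "heat an object with the microwave then place it",
    "find-object": "search shelves drawers and cabinets to find an object",
    "buy-item": "search the shop filter by price and buy the item",
}
print(bm25_top_k("find the mug in the cabinets", skills, k=1))  # ['find-object']
```

Because BM25 is purely lexical, retrieval quality depends on skill files sharing vocabulary with task descriptions, which is one reason well-written skill names and descriptions matter.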

2. Formal RL Problem Definition and Composite Reward

Skill curation is formalized as an MDP over groups of related tasks. Each episode comprises:

State: The current SkillRepo $\mathcal{S}_t$ and task $x_t$ .
Action: The curator's sequence of curation calls.
Transition: The repository is updated from $\mathcal{S}_t$ to $\mathcal{S}_{t+1}$ .
Reward: Revealed only after the whole group has been processed, and decomposed into four components:

Efficacy: The mean fraction of later tasks in the group that are solved using the evolving SkillRepo.
Function-call validity: The fraction of syntactically valid curator function calls.
Semantic quality: The quality of the curated content, as judged by an LLM.
Compression: A term penalizing large repositories, computed from the repository size and the curator input length.
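
As a rough illustration of how the four reward components could be combined, the sketch below assumes a simple weighted sum. The weights, the exact form of the compression term (one minus the ratio of repository size to curator input length, clipped to [0, 1]), and all function names are hypothetical choices made for this example; the source does not specify them here.

```python
def efficacy(solved_later: list[bool]) -> float:
    """Mean fraction of later tasks in the group that were solved."""
    return sum(solved_later) / len(solved_later) if solved_later else 0.0


def call_validity(calls_valid: list[bool]) -> float:
    """Fraction of curator function calls that were syntactically valid."""
    return sum(calls_valid) / len(calls_valid) if calls_valid else 0.0


def compression(repo_size: int, input_len: int) -> float:
    """ASSUMED form: 1 - repo_size / input_len, clipped to [0, 1]."""
    if input_len <= 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - repo_size / input_len))


def composite_reward(solved_later, calls_valid, judge_score,
                     repo_size, input_len,
                     weights=(1.0, 0.1, 0.1, 0.1)) -> float:
    """Weighted sum of the four components (weights are hypothetical)."""
    w_task, w_call, w_judge, w_comp = weights
    return (w_task * efficacy(solved_later)
            + w_call * call_validity(calls_valid)
            + w_judge * judge_score
            + w_comp * compression(repo_size, input_len))


reward = composite_reward(
    solved_later=[True, False, True, True],
    calls_valid=[True, True],
    judge_score=0.5,   # LLM-judge score, assumed to lie in [0, 1]
    repo_size=200,     # e.g. tokens stored in the SkillRepo
    input_len=1000,    # e.g. tokens in the curator input
)
print(round(reward, 2))  # 0.98
```

Keeping the task-outcome weight dominant, as in this example, reflects the idea that the auxiliary terms shape curator behavior without overriding downstream success.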

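The curator is trained with Group Relative Policy Optimization (GRPO), which computes a group-normalized advantage for each sampled rollout and applies a clipped surrogate loss:

$$\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right), \qquad \hat{A}_i = \frac{R_i - \mathrm{mean}(R_1, \dots, R_G)}{\mathrm{std}(R_1, \dots, R_G)}$$

where $G$ is the number of sampled curator rollouts compared against each other, $R_i$ is the composite reward of rollout $i$ , $\epsilon$ is the clipping range, and the importance ratio is $\rho_i = \pi_{\mathcal{S},\theta}(y_i \mid s) / \pi_{\mathcal{S},\theta_{\text{old}}}(y_i \mid s)$ for curator output $y_i$ in state $s$ (Ouyang et al., 7 May 2026).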
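
The two ingredients of a GRPO update, group-normalized advantages and the clipped surrogate, can be sketched in plain Python. This follows the standard GRPO formulation at sequence level; the use of the population standard deviation, the stabilizing constant `eps`, and the clip range of 0.2 are assumptions for the example rather than settings reported for SkillOS.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(logp_new: list[float], logp_old: list[float],
                           advantages: list[float],
                           clip_eps: float = 0.2) -> float:
    """Negative mean of min(ratio * A, clip(ratio) * A) over the group."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)


rewards = [0.98, 0.40, 0.75, 0.10]        # one composite reward per rollout
advantages = group_advantages(rewards)
loss = clipped_surrogate_loss(
    logp_new=[-1.0, -1.2, -0.9, -1.5],    # sequence log-probs, current policy
    logp_old=[-1.1, -1.1, -1.0, -1.4],    # sequence log-probs, sampling policy
    advantages=advantages,
)
print(advantages, loss)
```

Because advantages are centered within the group, rollouts are rewarded only for being better than their siblings, which removes the need for a learned value function.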

3. Training Methodology: Task Grouping and Policy Optimization

SkillOS trains the curator using a group-based sampling protocol:

Task-Group Construction: Each task is annotated by an LLM (Gemini-2.5-Pro) with skill-relevant attributes (topics, required skills, concepts, strategies, pitfalls). Groups are seeded and extended through similarity and curriculum constraints, ensuring forward skill dependencies without redundancy.
RL Training Loop: For each group, the executor runs on each task, the curator emits curation operations, then the SkillRepo is updated. Rewards are calculated post-group, driving updates via GRPO.
Skill Retrieval: BM25 is used for efficient inference-time retrieval of the top- $k$ relevant skills (Ouyang et al., 7 May 2026).
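
The group-construction step can be approximated with a greedy procedure over the annotated attribute sets. The sketch below is only one plausible reading: it uses Jaccard similarity, admits tasks that are related to the group but not near-duplicates, and does not model the curriculum (ordering) constraints. The thresholds `low` and `high`, the function names, and the toy attributes are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_group(seed: str, attrs: dict[str, set[str]], size: int,
                low: float = 0.2, high: float = 0.9) -> list[str]:
    """Greedily grow a task group around a seed task."""
    group = [seed]
    while len(group) < size:
        best, best_sim = None, -1.0
        for task in attrs:
            if task in group:
                continue
            sim = max(jaccard(attrs[task], attrs[g]) for g in group)
            # Keep related tasks (sim >= low), skip near-duplicates (sim > high).
            if low <= sim <= high and sim > best_sim:
                best, best_sim = task, sim
        if best is None:
            break  # no remaining task satisfies the similarity band
        group.append(best)
    return group


attrs = {
    "t1": {"geometry", "inradius", "triangle"},
    "t2": {"geometry", "circumradius", "triangle"},
    "t3": {"geometry", "inradius", "triangle"},   # near-duplicate of t1
    "t4": {"probability", "counting"},            # unrelated to t1
}
print(build_group("t1", attrs, size=3))  # ['t1', 't2']
```

The similarity band is what lets skills learned on earlier tasks in a group plausibly help later ones without the group collapsing into repeated copies of the same problem.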

4. Empirical Results: Agentic and Reasoning Environments

SkillOS demonstrates robust gains across agentic (ALFWorld, WebShop) and reasoning (AIME24, AIME25, GPQA-Diamond) tasks. Quantitative results show:

Method	ALFWorld Avg. SR	Avg. Steps
No Memory	47.9%	21.1
ReasoningBank	55.7%	20.1
MemP	49.7%	21.0
SkillOS-base	53.1%	20.4
SkillOS	61.2%	18.9

SkillOS yields a +5.5 percentage point absolute improvement in ALFWorld SR over the strongest memory baseline, ReasoningBank (a relative gain of about 9.9%), and 6% fewer interaction steps. Similar gains are observed in WebShop (e.g., +2.8 pp WebShop-Score) and in reasoning accuracy (about 4 pp), with improvements persisting and increasing with larger executors (Qwen3-32B, Gemini-2.5-Pro) (Ouyang et al., 7 May 2026).
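
The comparison against the strongest memory baseline can be recomputed directly from the ALFWorld table above. The snippet treats ReasoningBank and MemP as the memory baselines, which is an assumption about how the rows are grouped.

```python
# (average success rate in %, average interaction steps) from the ALFWorld table
results = {
    "No Memory": (47.9, 21.1),
    "ReasoningBank": (55.7, 20.1),
    "MemP": (49.7, 21.0),
    "SkillOS-base": (53.1, 20.4),
    "SkillOS": (61.2, 18.9),
}

memory_baselines = ("ReasoningBank", "MemP")  # assumed grouping of rows
best = max(memory_baselines, key=lambda m: results[m][0])

sr, steps = results["SkillOS"]
base_sr, base_steps = results[best]

abs_gain = sr - base_sr                               # percentage points
rel_gain = 100 * abs_gain / base_sr                   # relative gain in %
step_cut = 100 * (base_steps - steps) / base_steps    # fewer steps in %

print(f"{best}: +{abs_gain:.1f} pp SR ({rel_gain:.1f}% relative), "
      f"{step_cut:.1f}% fewer steps")
# ReasoningBank: +5.5 pp SR (9.9% relative), 6.0% fewer steps
```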

5. Skill Evolution: Operation Dynamics and Emergent Structure

Skill curation dynamics reveal:

Early Training: "Insert" operations dominate as the curator builds an initial repository of skills.
Over Time: "Update" operations become more common as the curator learns to refine, recombine, and specialize skills.
Skill Structure: Progression from generic, verbose instructions to concise, modular Markdown files with sections for failure handling, branching logic, and explicit preconditions.
Meta-Skill Emergence: The SkillRepo evolves to encode global strategies (e.g., verification loops, fallback planning) rather than only task-specific recipes (Ouyang et al., 7 May 2026).

Case studies include the emergence of multi-stage recovery procedures ("Search-Under-Lamp" for agentic domains) and multi-path solution strategies ("Inradius/Circumradius Relations" in mathematical problem solving).

6. Generalization Across Executors and Tasks

Curator policies trained with one executor (e.g., Qwen3-8B) transfer effectively to larger or architecturally distinct executors (Qwen3-32B, Gemini-2.5-Pro, Gemini-3.1-Flash-Lite), with large improvements (e.g., 66.4% → 80.2% SR on ALFWorld with Gemini-2.5-Pro). Cross-domain experiments show that a curator trained on reasoning tasks also improves agentic performance and vice versa. This indicates that the curation policy distills broadly reusable procedural curation strategies rather than overfitting to task- or executor-specific heuristics (Ouyang et al., 7 May 2026).

7. Relation to the AgentOS SkillOS Paradigm

SkillOS directly advances the Skill-as-Modules paradigm introduced in AgentOS. In AgentOS, Skills are modular, user-defined micro-services composed by an Agent Kernel, with skill retrieval and task decomposition framed as data mining and pattern mining problems (Liu et al., 9 Mar 2026). SkillOS supplies the missing mechanism for continual, experience-driven skill curation and maintenance, supporting:

Automated composition and evolution of skills based on streaming interaction logs.
Integration of RL-based curation policies with semantic APIs and sandboxing as defined in AgentOS.
Compatibility with the broader KDD agenda: skill mining, pattern discovery, workflow synthesis, and knowledge graph updates as ongoing operating system functions.

A plausible implication is that integrating SkillOS as a curation backend within an AgentOS/SkillOS stack both enables and regularizes continual procedural knowledge growth, addressing the longstanding bottleneck of manual or heuristic skill engineering (Liu et al., 9 Mar 2026, Ouyang et al., 7 May 2026).


References

Ouyang et al. (2026). SkillOS: Learning Skill Curation for Self-Evolving Agents.

Liu et al. (2026). AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem.
