SkillOpt: Optimizing LLM Agent Skills

Updated 25 May 2026

SkillOpt is a framework that treats skills (interpretable, task-specific procedural knowledge for LLM agents) as externally optimizable artifacts.
It iteratively improves skills through bounded textual edits, using a pipeline of rollout, optimizer proposals, merging, and validation.
Empirical results demonstrate significant performance gains across benchmarks and model scales with minimal, controlled revisions.

SkillOpt is a family of methodologies and frameworks that treat skills—interpretable, reusable, task-specific procedural knowledge—as externally optimizable artifacts for LLM agents. In contrast to static, one-shot skill definitions or loosely guided revisions, SkillOpt introduces systematic, feedback-driven optimization processes in skill-space, analogous to classical parameter optimization, but operating directly on skill documents or skill sets instead of model weights. Its objective is to iteratively refine skills under objective task feedback to maximize held-out generalization, with applications across agent reasoning, code optimization, and policy improvement in sequential decision-making environments (Yang et al., 22 May 2026, Nottingham et al., 2024, Wang et al., 29 Mar 2026).

1. Core Principles and Theoretical Formulation

SkillOpt fundamentally recasts LLM agent improvement as skill-space optimization rather than model-space adaptation. Formally, the agent is modeled as a frozen LLM $M$ operating under a harness $h$ with an attachable skill document $s \in S$. Executing a task $x$ yields $(\tau(s), r(s)) = h(M, x, s)$, where $\tau(s)$ is the resulting trajectory and $r(s) \in [0, 1]$ is a scalar episodic return. The main objective is

$s_{sel}^* = \underset{s \in C(D_{tr})}{\text{argmax}}~ V(s) \quad\text{where}\quad V(s) = \frac{1}{|D_{sel}|} \sum_{x \in D_{sel}} r(s)$

across candidate skill sequences $C(D_{tr})$, maximizing held-out score $V(s)$ and reporting test performance $Test(s_{sel}^*) = \frac{1}{|D_{test}|} \sum_{x \in D_{test}} r(s_{sel}^*)$ (Yang et al., 22 May 2026).
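The selection objective above can be illustrated with a minimal Python sketch. The `reward` callable is a hypothetical stand-in for one harness run $h(M, x, s)$ returning $r(s)$, and the toy reward and task strings are invented purely for illustration; none of these names come from the cited papers.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical signature: reward(skill, task) -> float in [0, 1],
# standing in for the episodic return r(s) of one harness run h(M, x, s).
Reward = Callable[[str, str], float]

def held_out_score(skill: str, tasks: Sequence[str], reward: Reward) -> float:
    """V(s): mean episodic return of `skill` over a task set."""
    return mean(reward(skill, x) for x in tasks)

def select_skill(candidates: Sequence[str], d_sel: Sequence[str], reward: Reward) -> str:
    """argmax of V(s) over the candidate skills, scored on D_sel only."""
    return max(candidates, key=lambda s: held_out_score(s, d_sel, reward))

# Toy reward (assumption): a task succeeds when the skill mentions its keyword.
def toy_reward(skill: str, task: str) -> float:
    return 1.0 if task in skill else 0.0

candidates = ["check units", "check units; cite sources", "cite sources"]
d_sel = ["units", "sources"]            # selection set used for the argmax
d_test = ["units", "sources", "dates"]  # test set, touched only once at the end

best = select_skill(candidates, d_sel, toy_reward)
test_score = held_out_score(best, d_test, toy_reward)
```

Keeping `d_test` out of `select_skill` mirrors the separation between the selection score $V(s)$ and the reported test performance.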

Edits to a skill take the form of bounded, atomic operations (append, insert, replace, delete on Markdown spans), subject to a step-size constraint $\lVert E_t \rVert_0 \leq L_t$, where $L_t$ is a textual learning-rate budget. Edits are accepted only if they strictly improve $V(s)$ on a held-out selection set.
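A minimal sketch of budgeted atomic edits follows. It makes two simplifying assumptions that are not taken from the source: a "span" is a whole line of the skill document, and $\lVert E_t \rVert_0$ is measured as the number of edit operations in the step. The `Edit` class and `apply_edits` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Edit:
    op: str          # "append", "insert", "replace", or "delete"
    index: int = 0   # line index the edit targets (ignored for append)
    text: str = ""   # new text (ignored for delete)

def apply_edits(skill_lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply atomic edits in order, refusing any step with more than `budget` edits."""
    if len(edits) > budget:
        raise ValueError(f"{len(edits)} edits exceed budget {budget}")
    lines = list(skill_lines)  # work on a copy; the input skill is left untouched
    for e in edits:
        # Indices refer to the document as it stands after the preceding edits.
        if e.op == "append":
            lines.append(e.text)
        elif e.op == "insert":
            lines.insert(e.index, e.text)
        elif e.op == "replace":
            lines[e.index] = e.text
        elif e.op == "delete":
            del lines[e.index]
        else:
            raise ValueError(f"unknown op: {e.op}")
    return lines

skill = ["# Skill", "Always verify totals.", "Use column headers."]
edits = [
    Edit("replace", 1, "Verify totals against the source sheet."),
    Edit("append", text="Report units explicitly."),
]
new_skill = apply_edits(skill, edits, budget=2)
```

With `budget=1` the same call raises `ValueError`, which is how an oversized step would be blocked in this sketch.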

2. Optimization Workflow and Mechanisms

The SkillOpt pipeline consists of the following:

Rollout and Reflection: Run the agent on mini-batches of tasks, collect trajectories and rewards, and aggregate into success/failure minibatches.
Optimizer Proposals: An auxiliary optimizer model analyzes failures (and optionally successes), augmented by a rejected-edit buffer and meta-guidance, to propose corrective and preservative edits.
Merge and Rank: Failure-prioritized merging of edit proposals, followed by ranking under the edit budget constraint $\lVert E_t \rVert_0 \leq L_t$.
Validation Gate: Accept a candidate skill only if it strictly improves the held-out score $V(s)$ on the selection set; otherwise, buffer the rejected edits with their associated negative feedback.
Epoch-wise Slow/Meta Update: After step-level updates per epoch, longitudinally compare skill effectiveness, insert 'slow update' guidance into protected skill blocks, and update a private meta-skill (used only inside the optimizer, never at inference).
Termination: After fixed epochs/steps, select and export the best skill (Yang et al., 22 May 2026).
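The step-level part of this pipeline (proposal, validation gate, rejected-edit buffer) can be sketched as a greedy loop. This is a simplified illustration, not the authors' implementation: each edit is a single appended line, and merging, ranking, and slow/meta updates are omitted. The `propose` and `score` callables, the rule pool, and the toy scoring function are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def optimize_skill(
    skill: str,
    propose: Callable[[str, List[str]], Optional[str]],  # (skill, rejected) -> edit or None
    score: Callable[[str], float],                       # held-out score V(s)
    steps: int,
) -> Tuple[str, List[str]]:
    """Greedy loop: keep a candidate only if it strictly improves the score."""
    best, best_score = skill, score(skill)
    rejected: List[str] = []  # rejected-edit buffer, fed back to the proposer
    for _ in range(steps):
        edit = propose(best, rejected)
        if edit is None:      # proposer has nothing new to try
            break
        candidate = f"{best}\n{edit}"
        cand_score = score(candidate)
        if cand_score > best_score:   # validation gate: strict improvement only
            best, best_score = candidate, cand_score
        else:
            rejected.append(edit)     # remember what did not help
    return best, rejected

# Toy stand-ins (assumptions): a fixed pool of rules and a score that counts useful ones.
POOL = ["guess when unsure", "check units", "cite sources"]
USEFUL = {"check units", "cite sources"}

def toy_propose(skill: str, rejected: List[str]) -> Optional[str]:
    for rule in POOL:
        if rule not in skill and rule not in rejected:
            return rule
    return None

def toy_score(skill: str) -> float:
    return sum(rule in skill for rule in USEFUL) / len(USEFUL)

final_skill, rejected_edits = optimize_skill("# Skill", toy_propose, toy_score, steps=4)
```

Because the proposer consults the buffer, the unhelpful rule is tried once and never proposed again, which is the behavior the rejected-edit buffer is meant to provide.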

The rejected-edit buffer records unsuccessful edit attempts, which deters repeated negative revisions and stabilizes convergence, a mechanism absent in earlier self-revision and LLM reflection-based pipelines (Nottingham et al., 2024). Meta and slow updates introduce longer-range, momentum-like regularization across epochs.

3. Skill Representation and Economy

Skill artifacts in SkillOpt are natural language documents encoding procedural knowledge, learned rules, and, in some variants, segmentable subskills or compositional plans. In code optimization applications (e.g., EffiSkill), skills are further structured into "Operator Skills" (reusable transformation primitives with triggers, steps, and complexity deltas) and "Meta Skills" (procedural controllers for plan composition). The skill registry is indexed both by trigger patterns and embedding vectors for efficient retrieval (Wang et al., 29 Mar 2026).
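A registry indexed by both trigger patterns and embeddings might look like the following sketch. The field names, the regex triggers, and the bag-of-words `embed` function are assumptions made for illustration; a real system would use a learned text encoder, and the source does not specify EffiSkill's actual data layout.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption); stands in for a learned encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

@dataclass
class OperatorSkill:
    name: str
    trigger: str       # regex matched against the code under optimization
    description: str   # text that is embedded for similarity search

def retrieve(registry: List[OperatorSkill], code: str, query: str, k: int = 1) -> List[str]:
    """Filter by trigger pattern, then rank the survivors by cosine similarity."""
    q = embed(query)
    hits = [s for s in registry if re.search(s.trigger, code)]
    hits.sort(key=lambda s: cosine(q, embed(s.description)), reverse=True)
    return [s.name for s in hits[:k]]

registry = [
    OperatorSkill("memoize", r"def \w+\(.*\):", "cache repeated recursive calls"),
    OperatorSkill("set-lookup", r"\bin\s+\w+", "replace list membership test with set lookup"),
    OperatorSkill("vectorize", r"\bfor\b", "replace python loop with array operation"),
]
code = "for x in items:\n    if x in seen_list:\n        count += 1"
top = retrieve(registry, code, "slow membership test inside loop", k=2)
```

The trigger filter discards skills whose preconditions do not appear in the code, so the similarity ranking only has to order the plausible candidates.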

Final skills are typically compact (379–1,995 tokens), and optimization requires only 1–4 net accepted edits per benchmark, demonstrating edit economy. Skills remain frozen at inference; no extra model calls or feedback are introduced at runtime, and agent parameters are left untouched (Yang et al., 22 May 2026).

4. Empirical Performance Across Domains

Reported results span agent benchmarks, skill-set optimization, and code efficiency:

On six agent benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), across seven models (GPT–5.5 to Qwen3.6-35B-A3B) and three harnesses (Direct-chat, Codex, Claude Code), SkillOpt is best or tied-for-best on all 52 (model, benchmark, harness) cells (Yang et al., 22 May 2026).
On GPT–5.5, average task accuracy improves over vanilla (no-skill) runs by +23.5 points (Direct-chat), +24.8 points (Codex), and +19.1 points (Claude Code).
Smaller models show even greater relative improvement (e.g., GPT–5.4-nano on ALFWorld) (Yang et al., 22 May 2026).
In transfer, SkillOpt-trained skills recover 40–100% of in-domain gain when moved across model scales, harnesses, and, to a degree, benchmarks.
In Skill Set Optimization (SSO), average success on ScienceWorld reaches 97.3% (adaptation) and 71.6% (transfer), outperforming previous approaches by 35% on ScienceWorld and 40% on NetHack (Nottingham et al., 2024).
For code efficiency, EffiSkill delivers OPT@8 gains of 3.69–12.52 points over leading baselines across language and model settings, with ablations confirming the necessity of retrieval, multi-plan composition, and automated diagnosis (Wang et al., 29 Mar 2026).
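One natural way to read the transfer figure above is as a recovery ratio. This definition is an assumption for illustration; the text here does not spell out the formula. With $A_{base}$ the no-skill accuracy in the target setting, $A_{in}$ the accuracy of a skill optimized in that setting, and $A_{tr}$ the accuracy of a skill transferred from elsewhere,

$\rho = \frac{A_{tr} - A_{base}}{A_{in} - A_{base}}$

For hypothetical values $A_{base} = 50$, $A_{in} = 70$, and $A_{tr} = 62$ (not taken from any cited paper), $\rho = 12 / 20 = 0.6$, i.e. the transferred skill recovers 60% of the in-domain gain.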

5. Key Design Factors, Ablations, and Variants

Ablation studies consistently show that:

Textual learning-rate budgets ($L_t$) are essential; removing or unbounding $L_t$ reduces downstream performance by 2–3 points.
Rejected-edit buffer removal degrades performance by up to 4.6 points on procedural tasks.
Slow/meta updates (momentum-like blocks) are critical: ablation can reduce SpreadsheetBench accuracy from 77.5% to 55.0% (−22.5 points) (Yang et al., 22 May 2026).
In SSO, skill pruning is the most impactful ablation, confirming the importance of continual skill refinement based on empirical returns (Nottingham et al., 2024).
EffiSkill ablations underscore the importance of skill retrieval and composition: removing retrieval or multi-plan exploration causes losses of more than 10 points in solution rate (Wang et al., 29 Mar 2026).

A plausible implication is that bounded, data-driven, and negative-feedback-informed skill optimization robustly mitigates overfitting and catastrophic forgetting observed in naive or unregulated self-revision schemes.

6. SkillOpt in Context: Relation to Prior Approaches

Compared to hand-crafted skills, single-shot LLM-written skills, Trace2Skill, TextGrad, GEPA, EvoSkill, and untuned reflection pipelines, SkillOpt is presented as the first precisely controlled, reproducible text-space optimizer with clear analogues to minibatched gradient-based learning and rigorous validation gating (Yang et al., 22 May 2026, Nottingham et al., 2024). In code domains, SkillOpt-inspired frameworks such as EffiSkill sit between instance-level rewrites and full neural optimization: skills function as transferable, mechanism-level abstractions that enable execution-free optimization.

SkillOpt inherits the objective of generalizable behavior from in-context learning research and operationalizes compositional skill construction analogous to subpolicy discovery in hierarchical RL, but always as externally inspectable, incrementally editable artifacts.

7. Limitations and Future Directions

Several outstanding challenges remain:

SkillOpt requires automated, reliable, dense feedback signals (e.g., exact-match rewards, program verifiers). Its offline training overhead may hamper adoption for single-use or sparse-reward tasks (Yang et al., 22 May 2026).
The canonical implementation focuses on optimizing a single skill document; highly heterogeneous domains may require coordinated skill libraries or explicit skill routers (cf. SkillRouter (Zheng et al., 23 Mar 2026)).
Cosine-similarity-based retrieval can degrade in high-dimensional or low-signal environments (Nottingham et al., 2024).
Extension to reward-free or user-preference validation, preference gating, or full weight-space distillation has not yet been demonstrated.
Integration with broader agent workflows, domain adaptation for specialized codebases, and dynamic skill composition are prominent targets for future research (Yang et al., 22 May 2026, Wang et al., 29 Mar 2026).

A plausible implication is that systematic external skill optimization may ultimately serve as an interface layer for agent personalization and robust transfer, enabling rigorous deployment without inference-time performance overhead or intervention in base agent weights.

References

"SkillOpt: Executive Strategy for Self-Evolving Agent Skills" (Yang et al., 22 May 2026)
"Skill Set Optimization: Reinforcing LLM Behavior via Transferable Skills" (Nottingham et al., 2024)
"EffiSkill: Agent Skill Based Automated Code Efficiency Optimization" (Wang et al., 29 Mar 2026)
"SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale" (Zheng et al., 23 Mar 2026)