EvoSkill: Automated Multi-Agent Skill Evolution

Updated 6 March 2026
  • EvoSkill is a self-evolving framework that automates the extraction and refinement of modular, domain-specific agent skills through iterative failure analysis.
  • It employs a triadic pipeline—executor, proposer, and skill-builder—to systematically generate, evaluate, and integrate high-level behavioral protocols.
  • Empirical results show significant accuracy improvements and cross-task transferability, exemplified by the 'search-persistence protocol' boosting performance in diverse benchmarks.

EvoSkill is a self-evolving framework for automated skill discovery and refinement in multi-agent and coding-agent systems. Targeting the challenge that generic agents lack domain-specific expertise, EvoSkill systematically extracts, evaluates, and organizes reusable "agent skills"—modular, high-level behavioral protocols and supporting artifacts—by leveraging iterative failure analysis and evolutionary selection. Central to EvoSkill is the externalization and cumulative enrichment of skills as structured, interpretable units, enabling agents to adapt and transfer capabilities without altering underlying LLM or agent model parameters (Alzubi et al., 3 Mar 2026).

1. Formal Definition and Representation of Agent Skills

In EvoSkill, an agent skill is defined as a reusable, domain-specific capability module composed of:

  • Procedural instructions (e.g., stepwise protocols, compliance routines)
  • Trigger metadata specifying invocation conditions
  • Optionally, helper scripts or reference files for runtime invocation

Skills adhere to the Agent Skills specification: each is a structured folder named for the skill, containing SKILL.md (instructions, metadata), and optionally scripts/ and reference/ subdirectories for code and support material. This disk-based organization permits the agent harness to efficiently load skills at startup and invoke scripts on demand, decoupling skill accumulation from agent context window limitations (Alzubi et al., 3 Mar 2026).
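For concreteness, a skill folder under this layout might look like the following sketch (the skill name and the file names other than SKILL.md, scripts/, and reference/ are hypothetical examples, not taken from the paper):

search-persistence-protocol/
    SKILL.md              # procedural instructions plus trigger metadata
    scripts/
        retry_search.py   # optional helper script invoked at runtime
    reference/
        sources.md        # optional supporting reference material

A minimal loader, assuming the harness simply scans a skills root for SKILL.md files at startup, could be written as:

from pathlib import Path

def load_skills(skills_root: str) -> dict[str, str]:
    """Map each skill name (its folder name) to the text of its SKILL.md."""
    skills = {}
    for skill_md in Path(skills_root).glob("*/SKILL.md"):
        skills[skill_md.parent.name] = skill_md.read_text()
    return skills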

2. Iterative Failure Analysis and Skill Proposal Pipeline

EvoSkill's core optimization loop is predicated on failure-driven discovery. It employs a triadic LLM-based actor architecture:

  1. Executor Agent (A): Executes the agent program (system prompt plus installed skills) across training examples, scoring outputs and flagging failures below a threshold τ.
  2. Proposer Agent (P): Consumes failed examples, their execution traces, predicted versus ground-truth outputs, and the feedback history H; performs structured root-cause analysis of failures to produce proposals π for new skills or edits to existing ones.
  3. Skill-Builder Agent (S): Transforms proposals into concrete skill artifacts or patches, materializes them on disk, and returns candidate agent programs p̃ that include the new or revised skill.

Throughout, the agent's LLM or execution engine remains frozen; all adaptation is externalized via the skill library (Alzubi et al., 3 Mar 2026).
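As a rough illustration of how the three roles fit together, the interfaces below are a minimal Python sketch; the types and signatures are assumptions for exposition, not the paper's API:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Failure:
    example_id: str
    trace: str       # execution trace captured by the executor
    predicted: str
    expected: str

# Executor A: run an agent program (system prompt + installed skills) on examples,
# returning per-example scores so failures below threshold τ can be flagged.
Executor = Callable[[str, list[dict]], list[float]]

# Proposer P: map failed examples and the feedback history H to a skill proposal π.
Proposer = Callable[[list[Failure], list[str]], str]

# Skill-builder S: apply a proposal to a program, materializing the skill on disk
# and returning the candidate program p̃.
SkillBuilder = Callable[[str, str], str]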

3. Evolutionary Workflow and Pareto Frontier Selection

The algorithmic backbone of EvoSkill is a population-based evolutionary process maintaining a k-sized Pareto frontier G of top-performing agent programs, each defined by its combination of base prompt and accumulated skills. The high-level pseudocode is as follows:

function EvoSkill(A, P, S, V, k, T)
    H ← []                      # feedback history of proposals and their outcomes
    G ← {A}                     # frontier, initialized with the base agent program
    Score(A) ← Eval(A, V)       # baseline validation score

    for t in 1 … T:
        p ← next(G)                          # select a frontier program (e.g., round-robin)
        F ← failures of p on training set    # executor flags examples scored below τ
        if F is empty: continue
        π ← P(F, H)                          # proposer: root-cause analysis → skill proposal
        p̃ ← S(p, π)                          # skill-builder: materialize the skill on disk
        s̃ ← Eval(p̃, V)                       # validate the candidate program
        accepted ← (|G| < k) or (s̃ > min(Score(G)))
        if accepted:
            G ← G ∪ {p̃};  Score(p̃) ← s̃
            if |G| > k: evict argmin of Score over G   # drop the weakest program
        H.append((π, s̃, accepted))           # record whether the proposal was admitted
    return argmax of Score over G            # best program found on V

Key principles:

  • Evaluation (Eval(p, V)): Average validation performance of program p on a held-out set V.
  • Frontier Update: Only admit candidates that outperform the current minimum or fill unused slots.
  • Pareto Frontier: For multi-objective optimization with score vector s(p) = [s_1(p), ..., s_m(p)], dominance is Pareto-based; in single-objective settings it reduces to scalar comparison (see the sketch below).
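
As a concrete illustration of the frontier update, the following Python sketch implements the admission and eviction rules above; the dict-based frontier representation is an assumption for exposition, not the paper's implementation:

def dominates(a, b):
    """Pareto dominance for score vectors: no worse on every objective, better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_frontier(frontier, candidate, score, k):
    """frontier: dict mapping program id -> scalar validation score Eval(p, V).
    Admit the candidate if there is a free slot or it beats the current minimum; keep |G| <= k."""
    if len(frontier) < k or score > min(frontier.values()):
        frontier = dict(frontier)                            # copy, then admit the candidate
        frontier[candidate] = score
        if len(frontier) > k:
            frontier.pop(min(frontier, key=frontier.get))    # evict the lowest-scoring program
    return frontier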

This mechanism ensures the persistent skill population evolves strictly via externally derived skills, robustly decoupling performance gains from any model-level changes (Alzubi et al., 3 Mar 2026).

4. Empirical Evaluation and Performance Metrics

EvoSkill was empirically validated using Claude Code Opus 4.5 on two distinct benchmarks:

  • OfficeQA: Grounded document reasoning over U.S. Treasury data; 246 examples, multiple train/validation/test splits.
  • SealQA: Search-augmented QA with noisy retrieval; 111 examples.

Performance is summarized as exact-match gains over baseline coding-agent systems:

Benchmark          Base Accuracy   EvoSkill Accuracy   Improvement (Δ points)
OfficeQA (5%)          60.6%            63.4%              +2.8
OfficeQA (10%)         60.6%            65.8%              +5.2
OfficeQA (15%)         60.6%            64.5%              +3.9
OfficeQA (merge)       60.6%            67.9%              +7.3
SealQA                 26.6%            38.7%              +12.1

The merged skill library outperformed individual runs, indicating the complementary nature of independently evolved skills. In SealQA, critical skills such as "search-persistence protocol" emerged, directly targeting failure modes characteristic of noisy retrieval (Alzubi et al., 3 Mar 2026).
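
The merge procedure itself is not detailed here; a minimal sketch of one plausible approach, taking the union of the skill folders produced by independent runs (with later runs winning on name collisions), is:

import shutil
from pathlib import Path

def merge_skill_libraries(run_dirs: list[str], merged_dir: str) -> None:
    """Union of skill folders from independent runs into one library."""
    out = Path(merged_dir)
    out.mkdir(parents=True, exist_ok=True)
    for run in run_dirs:
        for skill in Path(run).iterdir():
            if skill.is_dir() and (skill / "SKILL.md").exists():
                # Later runs overwrite earlier ones on name collisions.
                shutil.copytree(skill, out / skill.name, dirs_exist_ok=True)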

5. Zero-Shot Transfer and Generalization of Evolved Skills

A pivotal result is the demonstrated transferability of EvoSkill's evolved skill modules across distinct tasks:

  • The "search-persistence protocol" skill evolved on SealQA, without modification, was injected into an agent operating on BrowseComp (128 examples). The resulting accuracy gain was +5.3 points (43.5% → 48.8%).
  • This demonstrates that high-level, interpretable skill artifacts encode generalizable strategies beyond monolithic prompt or code optimization.

This transfer ability is attributed to EvoSkill’s focus on codifying generally applicable protocols (e.g., redundancy in information sourcing, cross-checks), rather than brittle, task-specific heuristics (Alzubi et al., 3 Mar 2026).

6. Comparative Perspective and Connections

EvoSkill builds upon and distinguishes itself from prior work in skill abstraction and constructive agent learning. The AutoSkill framework (Yang et al., 1 Mar 2026) generalizes the concept to continual, unsupervised extraction of skill artifacts from dialogue traces, supporting individualized, lifelong accumulation, merging, and injection of reusable behaviors. Unlike model-centric tuning, these systems externalize adaptation into explicit skill banks—maintained, versioned, and retrieved using dense and lexical similarity metrics.
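
As a simple illustration of lexical retrieval over such a skill bank (dense retrieval would substitute an embedding similarity), the sketch below ranks skills by token overlap with a query; the scoring scheme is an assumed minimal example rather than either paper's method:

def lexical_score(query: str, skill_text: str) -> float:
    """Fraction of query tokens that also appear in the skill's SKILL.md text."""
    q = set(query.lower().split())
    s = set(skill_text.lower().split())
    return len(q & s) / max(len(q), 1)

def retrieve_skills(query: str, skills: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the names of the top_k skills whose descriptions best match the query."""
    ranked = sorted(skills, key=lambda name: lexical_score(query, skills[name]), reverse=True)
    return ranked[:top_k]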

Earlier paradigms in robotics, such as projective simulation with autonomous skill acquisition and skill-centric testing, also reflect the externalization of reusable primitives, but are oriented toward kinesthetic and visual program synthesis with explicit tracking of sensor modalities and skill domains of applicability (Hangl et al., 2017).

The skill-centric approach in EvoSkill thus represents an evolutionary trajectory in agent systems: from tightly coupled, ad-hoc heuristics or low-level prompt modifications to structured, interpretable, and composable skill modules managed via evolutionary and continual refinement processes. This architecture yields robustness, extensibility, and cross-task generalization while obviating the need for parameter adaptation or retraining.

