EvoSkill: Automated Skill Discovery for Multi-Agent Systems

This presentation explores EvoSkill, a framework that enables AI agents to automatically discover and refine reusable skills through failure analysis. Unlike hand-crafted approaches or code-level optimization, EvoSkill evolves structured skill folders that contain instructions, triggers, and scripts—achieving accuracy improvements of up to 12 percentage points and demonstrating zero-shot transfer across different benchmarks. The talk examines how this skill-level abstraction creates composable, interpretable modules that persist across tasks and models.
Script
AI agents that can code are flexible, but they lack something crucial: systematic domain expertise. Every new task forces them to start from scratch, reinventing solutions that should be reusable. EvoSkill changes that by teaching agents to discover and refine their own skills automatically, creating a library of capabilities that grows with experience.
The researchers identified a fundamental gap: while coding agents can tackle diverse tasks, they don't accumulate domain expertise in a structured way. Hand-crafting skills for each domain is slow and expensive, and existing evolutionary approaches that modify code or prompts directly fail to produce artifacts that transfer or compose cleanly.
So how do you teach an agent to build its own expertise?
EvoSkill operates as a self-evolving loop with three specialized agents. After each task execution, failures are systematically diagnosed, new skills are proposed in response, and those proposals are materialized into structured skill folders. Critically, the underlying language model stays frozen—all improvements come from discovering better skills. Only skills that demonstrably improve held-out validation tasks survive, ensuring the agent accumulates genuinely useful capabilities.
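The loop described above can be sketched in a few lines of Python. This is an illustrative toy, not EvoSkill's actual implementation: `diagnose`, `propose`, `materialize`, and the trigger-matching task executor are hypothetical stand-ins for the three specialized agents, and the skill-folder structure (instructions, triggers, scripts) is reduced to a dataclass.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    # A skill folder reduced to its parts: instructions, triggers, scripts.
    name: str
    instructions: str
    triggers: list = field(default_factory=list)
    scripts: dict = field(default_factory=dict)

def run_task(task, library):
    # Toy executor: a task succeeds if some skill's trigger matches it.
    return any(t in task for skill in library for t in skill.triggers)

def evaluate(library, validation_tasks):
    # Held-out accuracy of the frozen agent equipped with this library.
    return sum(run_task(t, library) for t in validation_tasks) / len(validation_tasks)

def diagnose(task):
    # Stand-in for agent 1: analyze the failure, name the missing capability.
    return f"missing capability for: {task}"

def propose(diagnosis):
    # Stand-in for agent 2: propose a skill targeting the diagnosed gap.
    topic = diagnosis.split(": ")[1]
    return {"name": f"skill_{topic}", "trigger": topic}

def materialize(proposal):
    # Stand-in for agent 3: write the proposal out as a skill folder.
    return Skill(name=proposal["name"],
                 instructions=f"How to handle {proposal['trigger']} tasks",
                 triggers=[proposal["trigger"]])

def evolve(train_tasks, validation_tasks, library):
    # Self-evolving loop: the model stays frozen; only the library changes.
    baseline = evaluate(library, validation_tasks)
    for task in train_tasks:
        if run_task(task, library):
            continue                                  # no failure to learn from
        skill = materialize(propose(diagnose(task)))
        # Validation gate: keep the skill only if held-out accuracy improves.
        score = evaluate(library + [skill], validation_tasks)
        if score > baseline:
            library.append(skill)
            baseline = score
    return library, baseline

train = ["extract_table", "verify_citation", "search_persist"]
held_out = ["extract_table", "search_persist"]
lib, acc = evolve(train, held_out, [])
print([s.name for s in lib], acc)
```

Note how the gate works in this toy run: the skill proposed for `verify_citation` is discarded because it does not improve held-out accuracy, while the other two survive, mirroring how EvoSkill retains only demonstrably useful skills.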
The framework was tested on two challenging benchmarks. On OfficeQA, which requires extracting quantitative information from dense Treasury documents, EvoSkill achieved 67.9% accuracy—a 7.3 point gain by developing skills that verify data extraction rigorously. On SealQA, where agents must navigate adversarial search results, accuracy jumped 12.1 points to 38.7% through skills that enforce exhaustive multi-source verification. These aren't minor tweaks; they're systematic capability enhancements.
The most striking result is transferability. A search-persistence skill developed entirely on SealQA was applied zero-shot to BrowseComp, a completely different benchmark, and immediately improved accuracy by 5.3 percentage points. This is rare in agent optimization: most learned enhancements degrade across domains. EvoSkill's skill-level abstraction produces artifacts that actually generalize.
EvoSkill reframes agent improvement from tweaking prompts or code to discovering reusable, interpretable modules of expertise. By isolating performance gains at the skill level while keeping models frozen, it opens a path toward collaborative libraries of agent capabilities that compound over time. Visit EmergentMind.com to explore more research like this and create your own video presentations.