SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
This presentation explores SkillsBench, a systematic benchmark for evaluating how structured procedural knowledge—called Agent Skills—impacts the performance of language-model-based agents. Across 84 real-world tasks spanning 11 professional domains and 7 commercial agent configurations, the research reveals that curated Skills boost performance by an average of 16.2 percentage points, while agent-generated Skills provide negligible benefit. The work demonstrates that focused, human-authored procedural packages offer substantial value in specialized domains and shift cost-performance tradeoffs favorably, yet cannot be reliably synthesized by language models themselves—establishing critical design principles for the emerging agent augmentation ecosystem.
Can language model agents actually use procedural knowledge when you hand it to them? That question sits at the heart of a rapidly growing ecosystem of agent Skills, yet until now, we've had no rigorous way to measure whether these structured workflows genuinely help or just add noise.
Building on that challenge, let's examine why measuring Skills impact has been so elusive.
The authors identified a fundamental gap: while Agent Skills—these portable bundles of how-to knowledge—were proliferating across repositories and commercial platforms, nobody had built a systematic way to test whether they actually worked. The ecosystem was expanding blindly.
To address this measurement void, the researchers built SkillsBench.
Following that gap, SkillsBench introduces a controlled experimental framework. Each task is containerized with programmatic verifiers, ensuring reproducible pass-fail outcomes. The benchmark tests agents with no augmentation, with human-curated Skills, and with Skills the agent writes itself—a three-way comparison across commercial platforms.
This pipeline visualization captures the benchmark's three-phase design. First, they aggregate tens of thousands of Skills from open-source repositories and commercial ecosystems. Second, rigorous automated and human filtering ensures every task is leak-free and every Skill is genuinely procedural. Third, evaluation runs each task under all three augmentation conditions across multiple agent harnesses, producing thousands of deterministic outcomes for comparison.
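The evaluation phase described above can be sketched as a simple loop: each containerized task is attempted under all three augmentation conditions, and a programmatic verifier yields a deterministic pass/fail result. This is a minimal illustrative sketch, not the SkillsBench implementation—the names (`Task`, `evaluate`, `toy_agent`) and the toy agent behavior are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The three augmentation conditions compared in the benchmark design.
CONDITIONS = ["baseline", "curated_skills", "self_generated_skills"]

@dataclass
class Task:
    name: str
    verifier: Callable[[str], bool]  # programmatic pass/fail check

def evaluate(tasks: List[Task],
             agent: Callable[[Task, str], str]) -> Dict[str, float]:
    """Run every task under every condition; return per-condition pass rates."""
    passes = {cond: 0 for cond in CONDITIONS}
    for task in tasks:
        for cond in CONDITIONS:
            output = agent(task, cond)      # one agent attempt per condition
            if task.verifier(output):       # deterministic verdict
                passes[cond] += 1
    return {cond: passes[cond] / len(tasks) for cond in CONDITIONS}

# Toy stand-in agent: succeeds only when given curated Skills,
# mimicking the headline finding at a cartoon level.
def toy_agent(task: Task, condition: str) -> str:
    return "ok" if condition == "curated_skills" else "fail"

tasks = [Task(f"task{i}", verifier=lambda out: out == "ok") for i in range(4)]
rates = evaluate(tasks, toy_agent)
print(rates)
```

Per-condition pass rates computed this way are what make the 16.2-percentage-point delta between conditions directly comparable across agent harnesses.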
Now let's turn to what those thousands of evaluations revealed.
Here's the striking contrast. Human-curated Skills deliver substantial gains, especially in specialized domains where baseline models lack procedural coverage. But when agents try to generate their own Skills? They fail. Self-generated procedural knowledge is too shallow to match the precision of expert-authored workflows, contradicting hopes that language models could bootstrap their own augmentation.
Diving deeper, the researchers uncovered precise design constraints. More isn't better: compact, targeted Skills packages dramatically outperform exhaustive documentation, which overwhelms the agent. And domain context is decisive—Skills shine brightest where model priors are weakest, but can actually hurt performance in areas where the language model already has strong intuitions.
So why does SkillsBench matter beyond the numbers? It's the first rigorous framework proving that agent augmentation is not a one-size-fits-all proposition. Procedural expertise cannot be conjured from language models on demand—it must be carefully authored, modularly packaged, and selectively deployed. This work redefines how we should build and evaluate agent ecosystems going forward.
Agent Skills work when they're focused, human-authored, and domain-matched—but fail when agents try to write their own. To explore the full benchmark methodology and detailed results, visit EmergentMind.com.