From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Published 22 May 2026 in cs.AI | (2605.23899v1)

Abstract: Language agents increasingly improve by reusing skills -- structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle -- experience generation, skill extraction, and skill consumption -- to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

Authors (16)

Summary

The paper introduces a rigorous evaluation framework examining the lifecycle of agent skills from experience generation to consumption.
The paper demonstrates that extractor-target compatibility and experience pool composition drive skill utility, with positive transfer in roughly 75% of extractor-target pairings and negative transfer in the remaining 25%.
The paper validates a rubric-based meta-skill that yields an average downstream performance gain of +1.55 pp and raises pairwise judge accuracy from 46.4% to 73.8%.

Systematic Analysis of Model-Generated Agent Skills: Experience, Extraction, and Consumption

Problem Motivation and Study Design

Structured skill reuse is a central mechanism in contemporary LLM-based language agents, enabling fast adaptation and experience transfer. The proliferation of domain-level, model-generated skills promises scalable and automated distillation of recurring procedures, but systematic evaluation across the entire lifecycle—experience generation, skill extraction, skill consumption—remains lacking. This study constructs a utility-grounded evaluation framework to rigorously probe whether model-generated skills actually deliver downstream utility, when and why negative transfer arises, and what factors at each lifecycle stage govern skill quality.

Figure 1: The study evaluates the trajectory-to-skill lifecycle across experience generation, skill extraction, and skill consumption.

Experimental Protocol and Evaluation Metrics

The evaluation pipeline is instantiated across five agentic domains (ALFWorld, SpreadsheetBench, SWE-bench-Verified, SEAL-0, BFCL-v4) with six state-of-the-art LLMs as targets and five as extractors. Each target generates an experience pool from domain tasks, an extractor distills a consolidated domain-level skill, and the skill is supplied to the same target for evaluation on held-out tasks. The principal metric is downstream performance gain, i.e. the held-out score with the skill minus the no-skill baseline score. Extraction Efficacy (EE) and Target Evolvability (TE) summarize extractor-side and target-side effects, respectively. This design separates the extraction phase from the consumption phase, so the contribution of each to skill transfer can be measured in isolation.
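The metrics above can be sketched in a few lines. The summary does not give exact formulas for EE and TE, so the sketch below assumes the simplest reading: EE is the mean gain of one extractor's skills across targets, and TE is the mean gain of one target across extractors. All model names and scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean

# Hypothetical held-out scores in percent; none of these come from the paper.
baseline = {"target_a": 60.0, "target_b": 40.0}
with_skill = {
    ("extractor_x", "target_a"): 64.0,
    ("extractor_x", "target_b"): 38.0,
    ("extractor_y", "target_a"): 62.0,
    ("extractor_y", "target_b"): 44.0,
}


def gains(with_skill, baseline):
    """Gain in percentage points over each target's no-skill baseline."""
    return {(e, t): score - baseline[t] for (e, t), score in with_skill.items()}


def extraction_efficacy(gain, extractor):
    """Assumed definition: mean gain of one extractor's skills across targets."""
    return mean(g for (e, _), g in gain.items() if e == extractor)


def target_evolvability(gain, target):
    """Assumed definition: mean gain of one target across extractors."""
    return mean(g for (_, t), g in gain.items() if t == target)


gain = gains(with_skill, baseline)
ee_x = extraction_efficacy(gain, "extractor_x")
te_a = target_evolvability(gain, "target_a")
te_b = target_evolvability(gain, "target_b")
```

In this toy matrix `extractor_x` helps one target and hurts the other, which is the kind of pair-level variation that a single averaged score would hide.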

Empirical Results: Performance Gains and Negative Transfer

Across domains, model-generated skills are generally beneficial, with positive transfer in approximately 75% of extractor-target pairings. The remaining roughly 25% show negative transfer, and this fragility is domain-dependent (e.g., 47% negative in ALFWorld vs. 13% in the spreadsheet and coding domains). The results are also strongly target-dependent: GPT-5.4 benefits uniformly on ALFWorld ($TE = +4.93$), while lighter Gemini/Qwen variants exhibit negative TE. Extractor quality is not predicted by baseline task skill or model scale; for example, Gemini-3.1-Flash-Lite outperforms GPT-5.4 as an extractor in SpreadsheetBench despite weaker task execution. Skill utility is therefore driven by extractor-target compatibility rather than by model strength alone.
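A per-domain negative-transfer rate of the kind quoted above could be computed as follows. This is a minimal sketch that assumes a pairing counts as negative when its gain is strictly below zero (a zero gain counts as non-negative); the domain names and gains are made up for illustration.

```python
from collections import defaultdict


def negative_transfer_by_domain(gains):
    """Fraction of (extractor, target) pairings with gain < 0, per domain.

    `gains` maps (domain, extractor, target) to a gain in percentage points.
    """
    counts = defaultdict(lambda: [0, 0])  # domain -> [negative, total]
    for (domain, _extractor, _target), gain in gains.items():
        counts[domain][0] += gain < 0
        counts[domain][1] += 1
    return {domain: neg / total for domain, (neg, total) in counts.items()}


# Hypothetical gains; not values from the paper.
example = {
    ("domain_a", "x", "t1"): 1.0,
    ("domain_a", "y", "t1"): -2.0,
    ("domain_b", "x", "t1"): 3.0,
    ("domain_b", "x", "t2"): 0.5,
    ("domain_b", "y", "t1"): 2.0,
    ("domain_b", "y", "t2"): -1.0,
}
rates = negative_transfer_by_domain(example)
```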

Lifecycle Dissection: Drivers of Skill Utility

Experience Generation

Manipulating success/failure ratios in experience pools reveals domain-specific optimal mixes: SpreadsheetBench benefits from success-heavy pools, ALFWorld from failure-rich ones, but all-failure pools degrade utility universally. This indicates successful trajectories provide positive procedural signals essential for transferable skills, but domain-specific failure modes can offer critical information in environments with high constraint complexity.
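One way to run this kind of ablation is to sample experience pools at a fixed size while varying the share of successful trajectories. The sketch below is an assumption about how such a pool might be built, not the paper's procedure; the function name `sample_pool`, the trajectory dictionaries, and the `success` field are all hypothetical.

```python
import random


def sample_pool(successes, failures, size, success_ratio, seed=0):
    """Draw a pool of `size` trajectories with the requested success share."""
    if not 0.0 <= success_ratio <= 1.0:
        raise ValueError("success_ratio must be in [0, 1]")
    n_success = round(size * success_ratio)
    n_failure = size - n_success
    if n_success > len(successes) or n_failure > len(failures):
        raise ValueError("not enough trajectories for the requested mix")
    rng = random.Random(seed)  # seeded so a given mix is reproducible
    pool = rng.sample(successes, n_success) + rng.sample(failures, n_failure)
    rng.shuffle(pool)  # avoid presenting all successes first
    return pool


# Hypothetical trajectories for illustration only.
successes = [{"id": f"s{i}", "success": True} for i in range(20)]
failures = [{"id": f"f{i}", "success": False} for i in range(20)]
pool = sample_pool(successes, failures, size=10, success_ratio=0.8)
```

Sweeping `success_ratio` from 0.0 to 1.0 with the pool size held constant keeps the amount of experience fixed, so any change in downstream gain can be attributed to the mix rather than to pool size.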

Skill Extraction

Neither skill format nor textual plausibility correlates with downstream utility. Across 151 pairwise skill comparisons, LLM judges that see only the skill text fail to predict which skill performs better (46.4% accuracy, close to the 50% chance level for a pairwise choice), and accuracy drops further on high-gap pairs. Instead, actionable specificity and the encoding of concrete failure mechanisms are the primary determinants: qualitative analysis shows that high-utility skills consistently enumerate executable remedies for domain-specific failure types.
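Pairwise judge accuracy of this kind can be computed by comparing each judge pick against the skill that actually produced the larger measured gain. The sketch below assumes ties carry no ground-truth winner and are skipped, and it adds a hypothetical `min_gap` parameter to restrict scoring to high-gap pairs; the example data is invented.

```python
def judge_accuracy(pairs, min_gap=0.0):
    """Share of pairs where the judge picked the higher-gain skill.

    `pairs` holds (gain_a, gain_b, pick) tuples with pick in {"a", "b"}.
    Pairs with equal gains, or a gap below `min_gap`, are skipped.
    """
    correct = total = 0
    for gain_a, gain_b, pick in pairs:
        gap = abs(gain_a - gain_b)
        if gap == 0 or gap < min_gap:
            continue
        truth = "a" if gain_a > gain_b else "b"
        correct += pick == truth
        total += 1
    if total == 0:
        raise ValueError("no scorable pairs")
    return correct / total


# Hypothetical comparisons; not data from the paper.
pairs = [
    (2.0, 1.0, "a"),   # correct
    (0.5, 3.5, "a"),   # wrong
    (1.0, -1.0, "b"),  # wrong
    (4.0, 0.0, "a"),   # correct
]
overall = judge_accuracy(pairs)
high_gap = judge_accuracy(pairs, min_gap=3.5)
```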

Skill Consumption

Skill-induced gains vary sharply even with identical skill text injected into different targets. Consumption reshapes the target's default policy: in SpreadsheetBench, GPT-5.4 is steered toward evaluator-aligned computation and verification, while Qwen3.5-9B moves toward workbook-native workflows, gaining structure but losing robustness. Skill consumption is bounded by inherent target abilities and the structure of their experience-induced strategies.

Meta-Skill Guided Extraction and Rubric Formulation

Contrastive analysis across high-gap skill pairs produces a validated three-dimension rubric: Failure Mechanism Encoding, Actionable Specificity, and High-Risk Action Blacklist. Inserting this rubric into extraction prompts as a meta-skill prior consistently improves downstream skill utility in all tested domains and models. The rubric raises judge accuracy in pairwise ranking from 46.4% to 73.8% and provides an average performance gain of $+1.55$ pp, whereas a naive plausibility rubric degrades utility ($-0.59$ pp).
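To make the idea of a rubric used as a prompt prior concrete, the sketch below assembles an extraction prompt that lists the three rubric dimensions ahead of the trajectories. Only the dimension names come from the text above. The one-line descriptions, the prompt wording, and the function `build_extraction_prompt` are illustrative assumptions and do not reproduce the paper's actual meta-skill.

```python
# Dimension names are from the text; the descriptions are paraphrased
# assumptions, not the paper's wording.
RUBRIC = {
    "Failure Mechanism Encoding": "State the concrete ways tasks in this domain fail and what causes each failure.",
    "Actionable Specificity": "Give executable steps or checks instead of general advice.",
    "High-Risk Action Blacklist": "List actions that tend to cause failures and should be avoided.",
}


def build_extraction_prompt(domain, trajectories, rubric=RUBRIC):
    """Assemble a hypothetical extraction prompt with the rubric as a prior."""
    rubric_lines = [f"- {name}: {desc}" for name, desc in rubric.items()]
    trajectory_lines = [
        f"[trajectory {i}] {text}" for i, text in enumerate(trajectories, start=1)
    ]
    sections = [
        f"Distill one domain-level skill for {domain} from the trajectories below.",
        "The skill should satisfy each rubric dimension:",
        *rubric_lines,
        "Trajectories:",
        *trajectory_lines,
    ]
    return "\n".join(sections)


prompt = build_extraction_prompt(
    "SpreadsheetBench",
    ["opened workbook; wrote formula; failed check"],
)
```

Keeping the rubric as data rather than hard-coded prompt text makes it straightforward to swap in a different rubric, such as a plausibility-only one, and compare the downstream gains of the resulting skills.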

Figure 2: Meta-skill guidance consistently improves the downstream utility of generated skills; naive plausibility guidance fails.

Theoretical and Practical Implications

The study establishes that skill generation from raw agent trajectories exhibits substantial variance and non-trivial risk of negative transfer. Extractor-target compatibility, experience pool composition, and the actionable content of skills are primary drivers, while neither model scale nor textual fluency predicts utility. The validated rubric offers a plug-in, deployment-realistic mechanism to systematically improve skill extraction, paving the way for principled screening and construction of skill libraries.

Practically, these findings enable practitioners to avoid silent degradation from poorly generated skills, screen out plausible but ineffective artifacts, and reliably scale skill extraction systems. Theoretically, they motivate future research on skill selection, composition, and safety at library scale, as well as deeper formalization of compatibility spaces across extractor-target pairs.

Conclusion

This systematic, utility-grounded study moves model-generated skill extraction from an intuition-driven practice toward a principled discipline. It characterizes the sources of variance and negative transfer, provides validated criteria for improving extracted skills, and informs both evaluation and extraction in agentic systems. The resulting framework and meta-skill prior are positioned to support more reliable agent evolution as skill libraries proliferate and agent harnesses scale.