Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Published 30 Apr 2026 in cs.CL | (2604.27488v1)

Abstract: We introduce Skills-Coach, a novel automated framework designed to significantly enhance the self-evolution of skills within LLM-based agents. Addressing the current fragmentation of the skill ecosystem, Skills-Coach explores the boundaries of skill capabilities, thereby facilitating the comprehensive competency coverage essential for intelligent applications. The framework comprises four core modules: a Diverse Task Generation Module that systematically creates a comprehensive test suite for various skills; a Lightweight Optimization Module dedicated to optimizing skill prompts and their corresponding code; a Comparative Execution Module facilitating the execution and evaluation of both original and optimized skills; and a Traceable Evaluation Module, which rigorously evaluates performance against specified criteria. Skills-Coach offers flexible execution options through its virtual and real modes. To validate its efficacy, we introduce Skill-X, a comprehensive benchmark dataset consisting of 48 diverse skills. Experimental results demonstrate that Skills-Coach achieves significant performance improvements in skill capability across a wide range of categories, highlighting its potential to advance the development of more robust and adaptable LLM-based agents.


Authors (8)

Summary

The paper presents a fully automated framework that leverages Training-Free GRPO to probe, optimize, and evaluate LLM skills without human intervention.
It integrates diverse task generation, lightweight optimization, comparative execution, and traceable evaluation for systematic skill refinement.
Empirical results on Skill-X show the average skill score rising from 0.37 to 0.84 and the pass rate rising by 54.43 percentage points, highlighting the approach's effectiveness for complex tasks.

Skills-Coach: Automated Skill Optimization via Training-Free GRPO

Context and Motivation

The proliferation of skill-based modular architectures in LLM-driven agents has resulted in a fragmented ecosystem: while tens of thousands of individual skills exist, coverage of complex or specialized functional requirements remains incomplete, with integration bottlenecks impeding scalable deployment. The paper "Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO" (2604.27488) directly addresses the autonomous self-evolution of skills, formalizing the optimization challenge and proposing a fully automated pipeline to efficiently probe, optimize, and evaluate skill boundaries without human intervention.

Skills-Coach Architecture

Skills-Coach implements a modular framework of four components, each designed for a distinct stage in the skill evolution cycle:

Diverse Task Generation Module: Systematic analysis of skill specification files (Skill.md, Readme.md) enables the creation of a robust suite of training and test tasks, encompassing standard operations, challenging edge cases, and realistic boundary conditions. Tasks are generated to maximize diversity, boundary representation, and practical applicability. Each sample is paired with automated validation criteria for objective scoring.
Lightweight Optimization Module: Skill instructions and code files are refined using Training-Free Group Relative Policy Optimization (GRPO), which leverages LLM introspection rather than gradient-based approaches. This module operates across two parallel pathways: multi-variant instruction generation and comparative scoring, complemented by sequential code optimization incorporating rule-driven augmentation, LLM-based command improvement, and auto-fixing for identified defects.
Comparative Execution Module: Executes both the original and optimized skill versions within isolated environments, logging all outputs, errors, and performance metrics. The controlled execution ensures fairness and robust process traceability, utilizing parallel task handling and fail-safe strategies for resilient operation.
Traceable Evaluation Module: Conducts detailed, criterion-consistent quantitative assessment of both skill versions. Uses LLM-powered or fallback heuristic evaluation across eight dimensions (51 discrete criteria), supporting deep analysis, rigorous retention decisions, and generation of comprehensive summary reports with explicit evidence.
Figure 1: Overview of the Skills-Coach system architecture, detailing its modular execution and optimization pipeline.
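The comparative execution and retention steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Task` type, the `Skill` callable signature, and the "keep only if the pass rate improves" rule are assumptions made for the example, and the skills run in-process here instead of in isolated environments.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical: a skill is modeled as a function from a task prompt to an output string.
Skill = Callable[[str], str]


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # automated validation criterion for the output


def pass_rate(skill: Skill, tasks: List[Task]) -> float:
    """Run a skill on every task; a crash counts as a failed task (fail-safe)."""
    passed = 0
    for task in tasks:
        try:
            output = skill(task.prompt)
        except Exception:
            continue
        if task.check(output):
            passed += 1
    return passed / len(tasks)


def keep_optimized(original: Skill, optimized: Skill, tasks: List[Task]) -> bool:
    """Assumed retention rule: keep the optimized skill only if it passes more tasks."""
    return pass_rate(optimized, tasks) > pass_rate(original, tasks)


# Toy skills: the original crashes on empty input, the optimized one handles it.
def original_skill(prompt: str) -> str:
    if not prompt:
        raise ValueError("empty input")
    return prompt.upper()


def optimized_skill(prompt: str) -> str:
    return prompt.upper()


tasks = [
    Task("hello", lambda out: out == "HELLO"),  # standard case
    Task("", lambda out: out == ""),            # edge case
]
print(pass_rate(original_skill, tasks), pass_rate(optimized_skill, tasks))
print(keep_optimized(original_skill, optimized_skill, tasks))
```

Running both versions on the same task list is what makes the comparison fair; a real setup would also log outputs and errors per task for traceability.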

Empirical Evaluation on Skill-X

To benchmark Skills-Coach, the authors created Skill-X, a suite of 48 diverse skills sourced from prominent developer platforms. This dataset encompasses both instruction-only and code-inclusive skill types, supporting robust cross-domain evaluation. The empirical setup involves three optimization epochs per skill, 12 training and 8 test tasks per skill, and stringent execution in real mode with standardized scoring and pass thresholds.

Strong numerical results are reported: the average skill score increased from 0.37 to 0.84 after optimization (a 127% relative improvement), and the pass rate rose from 33.59% to 88.02% (+54.43 percentage points). Both instruction-only and code-inclusive skill types demonstrated >50% increase in pass rates. Notably, improvements on advanced tasks exceeded those on standard tasks, highlighting the framework’s efficacy for complex, boundary-pushing scenarios.
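The reported figures can be checked with a few lines of arithmetic. The sketch below recomputes the relative score improvement and the pass-rate gain. It also assumes, which the text does not state, that pass rates are pooled over all 48 × 8 test tasks, and shows what pass counts that would imply.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change, e.g. 0.5 means a 50% improvement over the baseline."""
    return (after - before) / before


score_gain = relative_improvement(0.37, 0.84)
pass_rate_gain_pp = 88.02 - 33.59  # difference of two percentages = percentage points

# Assumption: pass rates are computed over the pooled test set (48 skills x 8 tasks).
TOTAL_TEST_TASKS = 48 * 8
passes_before = round(0.3359 * TOTAL_TEST_TASKS)
passes_after = round(0.8802 * TOTAL_TEST_TASKS)

print(f"relative score improvement: {score_gain:.0%}")
print(f"pass-rate gain: {pass_rate_gain_pp:.2f} percentage points")
print(f"implied passes: {passes_before} -> {passes_after} of {TOTAL_TEST_TASKS}")
```

Under the pooling assumption, the implied counts reproduce the reported rates to two decimals, which suggests the percentages are internally consistent.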

Figure 2: Performance comparison demonstrating substantial gains in skill score and pass rate across Skill-X after Skills-Coach optimization.

Per-Skill Optimization Analysis

Granular analysis reveals that optimization yields the greatest benefit for underperforming skills: 23 skills show exceptional improvements (a score gain of +0.5 or greater), including several whose scores increased from 0.0 to 1.0 and whose pass rates rose from 0% to 100%. For skills that already perform near their ceiling, marginal gains are negligible, which argues for directing optimization resources toward the skills with the most room to improve.
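A per-skill analysis of this kind amounts to ranking skills by score gain and flagging those at or above the +0.5 threshold. The sketch below illustrates the idea; the skill names and scores are made-up placeholders, not values from Skill-X.

```python
from typing import Dict, List, Tuple

EXCEPTIONAL_GAIN = 0.5  # threshold for "exceptional" improvement, as stated in the text


def rank_by_gain(scores: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float, bool]]:
    """Return (skill, gain, is_exceptional) sorted by gain, largest first."""
    ranked = []
    for name, (before, after) in scores.items():
        gain = after - before
        ranked.append((name, gain, gain >= EXCEPTIONAL_GAIN))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Illustrative placeholder scores (before, after); not taken from the paper.
example_scores = {
    "skill_a": (0.0, 1.0),   # failing skill with large headroom
    "skill_b": (0.9, 0.92),  # already strong, little left to gain
    "skill_c": (0.4, 0.7),
}

for name, gain, exceptional in rank_by_gain(example_scores):
    print(f"{name}: gain={gain:+.2f} exceptional={exceptional}")
```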

Task Generation and Evaluation

The task generation module delivers systematic coverage of standard, advanced, and boundary tasks, with automated criteria ensuring precise and objective evaluation. The assessment of the Pollyreach skill serves as a representative case: basic tasks probe standard command functionality, while advanced tasks demand complex logic and robustness, including format validation and edge-case handling.
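One way to picture a generated task with automated criteria is as a record holding a difficulty level, a prompt, and a set of named checks on the output. The sketch below is an assumption-laden illustration: the `GeneratedTask` fields, the JSON-based checks, and the `PASS_THRESHOLD` value are hypothetical and do not describe Pollyreach or the paper's actual criteria.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class GeneratedTask:
    name: str
    level: str  # "standard" or "advanced"
    prompt: str
    criteria: Dict[str, Callable[[str], bool]]  # named checks applied to the output


def is_json_object(output: str) -> bool:
    """Format validation: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def has_status_field(output: str) -> bool:
    return is_json_object(output) and "status" in json.loads(output)


def score_output(task: GeneratedTask, output: str) -> Tuple[float, Dict[str, bool]]:
    """Score = fraction of criteria met; per-criterion results are kept as evidence."""
    results = {label: check(output) for label, check in task.criteria.items()}
    return sum(results.values()) / len(results), results


PASS_THRESHOLD = 0.6  # hypothetical pass threshold

advanced_task = GeneratedTask(
    name="malformed-input-handling",
    level="advanced",
    prompt="Process a request whose input is malformed and report the result as JSON.",
    criteria={
        "valid JSON object": is_json_object,
        "has status field": has_status_field,
        "no raw traceback": lambda out: "Traceback" not in out,
    },
)

score, evidence = score_output(advanced_task, '{"status": "error", "reason": "bad input"}')
print(score >= PASS_THRESHOLD, evidence)
```

Keeping the per-criterion results alongside the score is what lets a summary report cite explicit evidence for each pass or fail.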

Figure 3: Generated standard versus advanced test cases for Pollyreach, capturing progression from functional to robustness evaluation.

Summary reports generated by the evaluation module provide multidimensional analytics, supporting iterative optimization and strategic decision-making for skill retention and further refinement.

Figure 4: Key contents from the Pollyreach summary report, visualizing metric improvements and capability boundary analysis.

Training-Free GRPO and Optimization Efficiency

The core optimization strategy employs Training-Free GRPO, which requires neither gradient-based parameter updates nor large datasets. As a result, optimization takes minutes rather than hours, and iterative refinement carries minimal overfitting risk, which benefits both generalization and cross-domain transfer.
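To make the gradient-free idea concrete, the sketch below scores a group of instruction variants, computes the group-normalized advantage used in standard GRPO, and keeps the best variant only if it beats the current one. It is an illustration under assumptions: `propose` and `score` stand in for LLM calls and task execution, the toy hint list is invented, and the way Skills-Coach uses LLM introspection in place of a numeric update is not detailed here.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(scores: List[float]) -> List[float]:
    """Standard GRPO-style advantage: (score - group mean) / group std."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


def optimize_instruction(
    instruction: str,
    propose: Callable[[str, int], str],  # hypothetical variant generator (an LLM call)
    score: Callable[[str], float],       # hypothetical scorer (runs training tasks)
    group_size: int = 4,
    epochs: int = 3,
) -> Tuple[str, float]:
    """No model weights change; only the instruction text is updated."""
    best, best_score = instruction, score(instruction)
    for _ in range(epochs):
        group = [propose(best, i) for i in range(group_size)]
        scores = [score(variant) for variant in group]
        advantages = group_relative_advantages(scores)
        top = max(range(group_size), key=lambda i: advantages[i])
        if scores[top] > best_score:
            best, best_score = group[top], scores[top]
    return best, best_score


# Toy stand-ins: each variant appends one hint; the score rewards useful hints.
HINTS = ["validate input format", "handle empty input", "report errors clearly", "be polite"]


def toy_propose(base: str, i: int) -> str:
    return base + "; " + HINTS[i]


def toy_score(text: str) -> float:
    return sum(hint in text for hint in HINTS[:3]) / 3


best, best_score = optimize_instruction("run the command", toy_propose, toy_score)
print(best_score)
print(best)
```

Because each epoch only needs a handful of variants and task runs, the loop is cheap compared with fine-tuning, which is consistent with the efficiency argument above.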

Implications and Future Directions

Skills-Coach’s methodology for autonomous skill self-evolution has significant implications for the scalability and robustness of LLM agent deployments. By systematizing the exploration and refinement of capability boundaries, the framework addresses fragmentation in skill ecosystems and reduces dependency on manual maintenance. The substantial empirical gains, especially for complex, code-inclusive skills, are indicative of the power of introspective LLM-driven optimization pipelines. Future developments may focus on hierarchical skill composition, dynamic ecosystem orchestration, and integration with broader agent feedback loops, ultimately advancing the autonomy and reliability of intelligent applications.

Conclusion

Skills-Coach delivers a formalized, modular pipeline for agent skill self-evolution powered by Training-Free GRPO. The framework achieves substantial performance improvements across multiple dimensions of skill capability, particularly for advanced and code-centric tasks. Its fully automated architecture, robust evaluation, and strong empirical validation position it as a key step toward scalable, comprehensive skill ecosystems in LLM-based agents. The research establishes objective guidelines for resource allocation in skill optimization and lays the foundation for further advances in agent skill autonomy and integration.