Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

Published 5 Jun 2026 in cs.AI | (2606.06893v1)

Abstract: LLM agents increasingly rely on Skills to encode procedural knowledge, yet high-quality Skills remain costly to hand-write. This paper studies automatic Skill construction from heterogeneous interaction evidence, including demonstrations, agent trajectories, tool traces, and execution logs. We argue that trace-to-skill construction is not a simple summarization task, because traces are fragmented, redundant, and may miss rare but safety-critical behaviors. To address this, we introduce RWSA, a workflow-oriented intermediate representation that decomposes Skills into Workflow structure, execution Semantics, and runtime Attachments, capturing task decomposition, control flow, verification, safety, rollback, and state management. Building on RWSA, we propose W2S, a framework that segments traces, induces local Skill drafts, aligns shared structures, reconciles branches, and compresses redundancy while preserving evidence and confidence annotations. Experiments on 70 Skills show that W2S improves behavioral replay consistency by 10.5% over summarization- and prompting-based baselines, highlighting the need to treat traces as executable runtime specifications rather than compressible text.


Authors (4)

Summary

The paper presents a novel framework that decomposes agent traces into a structured Skill-IR with a routing header, workflow backbone, node semantics, and runtime attachments.
It introduces an iterative evidence extraction process that aligns and refines local skill drafts to achieve higher behavioral fidelity compared to traditional methods.
The approach enhances interpretability and scalability by ensuring procedural clarity and verifiability in automated agent skill generation.

Workflow-to-Skill: Structured Skill Induction via Workflow-Semantics-Attachments Decomposition

Introduction

This work introduces a principled framework for automated agent skill generation, positing that effective skill induction requires extracting structured runtime specifications from heterogeneous agent interaction traces rather than compressing them via traditional summarization paradigms. The approach is grounded in the observation that skills—procedural knowledge artifacts that encode agent behavior—necessitate not only the preservation of high-level intent but also the explicit recovery of workflow structure, node-local execution semantics, and operational attachments. These components collectively define when, how, and under what conditions an agent skill should be invoked and executed within future tasks.

The Skill-IR Representation and Decomposition

Central to the framework is the Skill-IR intermediate representation, which decomposes each skill into four explicit elements: a routing header, a workflow backbone, node-level operational semantics, and runtime attachments. The routing header encompasses activation descriptors; the workflow backbone encodes execution unit graphs; node semantics define decision logic and behavioral criteria; runtime attachments specify tools, resources, and validation requirements.

The backbone is constructed as a workflow graph $W = (N, E)$ of execution units, with node-local semantics $S$ and attachment constraints $A$ binding operational context. This explicit decomposition distinguishes Skill-IR from free-form text summarization by modeling skills as compositional runtime specifications with observable control flow.
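The decomposition above can be sketched as a small data structure. This is a minimal illustration, not the paper's schema: the class names `RoutingHeader` and `SkillIR`, the field names, and the `deploy_service` example are all assumptions made for this sketch. Only the split into a routing header, a graph $W = (N, E)$, node semantics $S$, and attachments $A$ comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingHeader:
    # Activation descriptors: when the skill should be invoked (hypothetical fields).
    name: str
    triggers: list[str] = field(default_factory=list)


@dataclass
class SkillIR:
    routing: RoutingHeader
    nodes: set[str] = field(default_factory=set)                      # N: execution units
    edges: set[tuple[str, str]] = field(default_factory=set)          # E: directed edges
    semantics: dict[str, str] = field(default_factory=dict)           # S: node -> decision logic
    attachments: dict[str, list[str]] = field(default_factory=dict)   # A: node -> tools, resources, validators

    def add_step(self, src: str, dst: str) -> None:
        """Add a directed edge and register both endpoints as nodes."""
        self.nodes.update((src, dst))
        self.edges.add((src, dst))

    def successors(self, node: str) -> list[str]:
        """Nodes reachable from `node` in one step, sorted for stable output."""
        return sorted(d for s, d in self.edges if s == node)

    def dangling(self) -> set[str]:
        """Keys in S or A that refer to nodes missing from N."""
        return (set(self.semantics) | set(self.attachments)) - self.nodes


# Hypothetical example skill with one conditional branch after "test".
skill = SkillIR(RoutingHeader("deploy_service", ["user asks to deploy"]))
skill.add_step("build", "test")
skill.add_step("test", "deploy")
skill.add_step("test", "rollback")
skill.semantics["test"] = "go to deploy if all checks pass, else rollback"
skill.attachments["deploy"] = ["deploy_cli", "approval_required"]
```

Keeping $S$ and $A$ keyed by node makes the control flow inspectable on its own, and a check such as `dangling()` can flag semantics or attachments that no longer bind to any execution unit.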

Figure 1: Workflow Backbone $W$ captures procedural execution structure via directed nodes and edges.

Figure 2: Overview of W2S, showing the transformation of traces into workflow evidence, semantic evidence, runtime evidence, and their integration into reusable, executable agent skills.

W2S: Evidence-Driven Workflow-to-Skill Induction

The W2S (Workflow-to-Skill) framework formalizes skill induction as a conditional evidence extraction and alignment process. Given multiple traces from historical agent operations—potentially fragmented, redundant, or encompassing rare failure/recovery behavior—the methodology first parses and segments traces across scenarios, inducing path-local skill drafts. The next stage aligns, merges, and compresses these local drafts into a global Skill-IR instance, applying reconciliation logic for conditional branches, redundancy elimination, and explicit preservation of validation, rollback, or approval routines.
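To make the align-and-merge stage concrete, the following sketch merges path-local drafts into a single edge set with per-edge support counts. It assumes each draft is a plain list of step names, which is a simplification introduced here. The function names `merge_drafts` and `branch_points` are hypothetical and do not come from the paper.

```python
from collections import Counter


def merge_drafts(drafts):
    """Union path-local step sequences into one graph, counting edge support."""
    support = Counter()
    for steps in drafts:
        # Collapse immediate repeats (for example, a retry logged twice) within a path.
        deduped = [s for i, s in enumerate(steps) if i == 0 or s != steps[i - 1]]
        # Count each edge once per draft, so support equals the number of drafts.
        for edge in set(zip(deduped, deduped[1:])):
            support[edge] += 1
    return support


def branch_points(support):
    """Nodes with more than one successor, i.e. conditional branches to reconcile."""
    successors = {}
    for src, dst in support:
        successors.setdefault(src, set()).add(dst)
    return {n: sorted(d) for n, d in successors.items() if len(d) > 1}


# Hypothetical drafts: two success paths and one rare recovery path.
drafts = [
    ["fetch", "validate", "apply"],
    ["fetch", "fetch", "validate", "apply"],
    ["fetch", "validate", "rollback"],
]
support = merge_drafts(drafts)
branches = branch_points(support)
```

In this sketch the `validate -> rollback` edge is kept even though only one draft supports it. A frequency-based summary would likely drop it, which is the kind of loss the evidence-preserving merge is meant to avoid.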

Crucially, path-level evidence is strictly mapped to WSA (Workflow-Structure, Semantics, Attachments) fields, maintaining provenance annotations that distinguish directly observed behaviors from inferred patterns and unobserved cases. This prevents the common failure of skill over-generalization or loss of critical runtime branches often seen in summarization-based or text-centric induction methods.
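One way to picture these provenance annotations is as a tagged evidence record per WSA field. The sketch below is an assumption-laden illustration: the `Provenance` enum, the `Evidence` record, the rule that observed evidence must cite a trace, and the example claims are all invented here and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum

WSA_FIELDS = ("workflow", "semantics", "attachments")


class Provenance(Enum):
    OBSERVED = "observed"      # seen directly in at least one trace
    INFERRED = "inferred"      # generalized from observed patterns
    UNOBSERVED = "unobserved"  # a case that no trace covers


@dataclass(frozen=True)
class Evidence:
    wsa_field: str
    claim: str
    provenance: Provenance
    trace_ids: tuple = ()

    def __post_init__(self):
        if self.wsa_field not in WSA_FIELDS:
            raise ValueError(f"unknown WSA field: {self.wsa_field}")
        if self.provenance is Provenance.OBSERVED and not self.trace_ids:
            raise ValueError("observed evidence must cite at least one trace")


def needs_review(items):
    """Group claims that are not directly observed, keyed by WSA field."""
    flagged = defaultdict(list)
    for ev in items:
        if ev.provenance is not Provenance.OBSERVED:
            flagged[ev.wsa_field].append(ev.claim)
    return dict(flagged)


# Hypothetical evidence items for illustration only.
evidence = [
    Evidence("workflow", "validate precedes apply", Provenance.OBSERVED, ("t1", "t2")),
    Evidence("semantics", "retry at most three times", Provenance.INFERRED, ("t2",)),
    Evidence("attachments", "approval needed for production", Provenance.UNOBSERVED),
]
review = needs_review(evidence)
```

Separating the three provenance levels lets a reviewer audit inferred and unobserved claims without rereading the traces behind the directly observed ones.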

The W2S induction pipeline is iterative, applying coverage and behavioral consistency checks against reference trajectories, with diagnostic feedback targeting the corresponding Skill-IR components (backbone, semantics, attachments). This supervised refinement continues until all high-priority discrepancies relative to the evidence are resolved.
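The check-and-repair loop can be sketched as follows, under strong simplifying assumptions. A skill is reduced to a dict of backbone edges and per-step attachments, a reference trajectory is a list of `(step, tool)` pairs, and "repair" simply adds whatever is missing. The names `diagnose` and `refine` are hypothetical, and the paper's actual checks and repairs are presumably richer than this.

```python
def diagnose(skill, trajectories):
    """Return (component, detail) pairs where the skill fails to cover a trajectory."""
    issues = []
    for traj in trajectories:
        steps = [step for step, _ in traj]
        # Backbone check: every consecutive step pair must be an edge.
        for edge in zip(steps, steps[1:]):
            if edge not in skill["edges"]:
                issues.append(("backbone", edge))
        # Attachment check: every tool used at a step must be bound to that step.
        for step, tool in traj:
            if tool is not None and tool not in skill["attachments"].get(step, set()):
                issues.append(("attachments", (step, tool)))
    return issues


def refine(skill, trajectories, max_rounds=5):
    """Repair the component each diagnostic targets until none remain or rounds run out."""
    for _ in range(max_rounds):
        issues = diagnose(skill, trajectories)
        if not issues:
            break
        for component, detail in issues:
            if component == "backbone":
                skill["edges"].add(detail)
            else:
                step, tool = detail
                skill["attachments"].setdefault(step, set()).add(tool)
    return diagnose(skill, trajectories)  # remaining discrepancies, if any


# Hypothetical draft skill missing the recovery branch and all tool bindings.
skill = {"edges": {("fetch", "validate")}, "attachments": {}}
trajectories = [
    [("fetch", "http_get"), ("validate", None), ("rollback", "restore_snapshot")],
]
initial = diagnose(skill, trajectories)
remaining = refine(skill, trajectories)
```

Tagging each diagnostic with the component it concerns is what lets the repair step touch only the backbone or only the attachments, rather than regenerating the whole skill.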

Empirical Evaluation and Numerical Results

Evaluation utilizes the WSASkill benchmark, comprising 70 annotated skills covering a taxonomy of eight WSA skill types (ranging from prompt fragments to multi-step, resource- and semantics-constrained workflows). The central metric is replay-based behavioral fidelity, wherein induced skills are executed under replayed scenarios to quantify their alignment with the behavioral contract of reference skills.

The W2S framework yields an average behavioral fidelity score of 0.503, surpassing the 0.455 achieved by the Anthropic Skill Creator (ASC) baseline, a structured skill authoring pipeline. The improvement holds across most skill types, with the largest reported gains in the semantic guideline (T2, +0.122) and attachment wrapper (T1, +0.115) cases and a smaller gain for semantic resource (T3, +0.040). W2S underperforms on the tool-driven workflow type (T5, -0.070), indicating current limitations in handling attachment-centric procedural bindings without explicit local semantics.
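The average scores can be related to the 10.5% figure quoted in the abstract by a short calculation. This assumes the 10.5% is a relative improvement over the ASC baseline, which the page does not state explicitly:

```latex
\[
\frac{0.503 - 0.455}{0.455} = \frac{0.048}{0.455} \approx 0.105
\]
```

Under that reading, the absolute gain of 0.048 fidelity points corresponds to a relative gain of roughly 10.5%.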

The authors assert that reliable skill induction necessitates treating traces as evidence for reconstructing executable specifications rather than performing document compression, directly challenging prior summarization-oriented paradigms.

Implications and Theoretical Context

This work advances the formalization of skill creation as a structured induction problem, underscoring the importance of explicit intermediate representations such as Skill-IR for bridging procedural generalization and runtime executability. The decomposition along workflow, semantics, and attachments axes is shown to be effective for preserving critical behavioral properties, supporting versioning, provenance inspection, and iterative skill refinement.

From a practical standpoint, the proposed methodology provides a scalable avenue for the automatic, verifiable expansion of agent skill repertoires in dynamic environments, reducing reliance on brittle human-authored procedural code and supporting long-term agent adaptation. The explicit mapping of path-level evidence also facilitates explainability, auditability, and targeted repair—a substantial improvement over methods yielding unstructured prompt fragments.

Theoretically, the shift from summarization to structured runtime specification aligns with calls for procedural memory, compositionality, and operational contract preservation in autonomous LLM agents. The clear separation between activation, execution, decision logic, and resource context enables more modular and interpretable skill libraries, and opens research directions in skill lifecycle governance, hierarchical composition, and automatic validation under evolving tool and application landscapes.

Future Directions

Key avenues for further research include improved mechanisms for attachment scoping and constraint modeling, as current results indicate remaining deficiencies when reconstructing resource-heavy workflow types. Extensions toward hierarchical skill induction, probabilistic evidence propagation for rare behaviors, and tighter feedback integration with agent policy optimization loops are plausible. Integration with skill verification and reinforcement learning pipelines could support ongoing skill evolution and continual improvement, while deeper structural typing and schema enforcement may enable more robust deployment in safety-critical or multi-agent settings.

Conclusion

The "Workflow-to-Skill" work addresses limitations in existing skill induction approaches by proposing a structured, evidence-driven decomposition and reconstruction framework, supported by an average replay fidelity gain from 0.455 to 0.503 on a 70-skill benchmark spanning eight skill types (2606.06893). By shifting the paradigm toward explicit runtime representation and behavior preservation, the framework sets the stage for more reliable, interpretable, and maintainable agent skill generation, with implications for scalability and long-horizon agent behavior in complex, evolving task environments.
