Multi-Level Skill Representation

Updated 7 April 2026

Multi-level skill representation is a structured approach that organizes agent capabilities into layered hierarchies of abstract strategies, decomposable routines, and atomic actions.
It leverages formal definitions, algorithmic construction, and embedding-based retrieval to enhance generalization, interpretability, and scalable learning.
Empirical applications in robotics, reinforcement learning, and LLM agents demonstrate improvements in efficiency, transferability, and reduced sample complexity.

Multi-level skill representation provides a formal and algorithmic foundation for structuring agent behavior, perception, planning, and execution across robotics, reinforcement learning, LLM agents, and other complex sequential tasks. These representations organize capabilities in explicit compositional hierarchies, where higher levels capture abstract strategies or intent, intermediate layers encode decomposable routines or skills, and lower levels comprise atomic actions, tool invocations, or sensorimotor primitives. This stratification underpins generalization, transferability, interpretability, and scalable learning in agents operating in open or long-horizon domains.

1. Formal Definitions and Hierarchical Structures

Multi-level skill representations are systematically constructed as explicit, modular hierarchies where each level targets a distinct abstraction scale or functional scope:

Three- and Four-Level Hierarchies: In SkillX, the hierarchy consists of (i) planning skills (ordered high-level task decompositions referencing functional skills), (ii) functional skills (multi-step macros or subroutines, possibly spanning multiple tools), and (iii) atomic skills (single tool invocations encoding typical usage patterns, parameter constraints, and failure modes) (Wang et al., 6 Apr 2026). Uni-Skill adopts a four-layer taxonomy ("SkillFolder") inspired by VerbNet: abstract class, verb instance, skill description (template), and skill slice (video segment) (Xie et al., 3 Mar 2026). Each node in the tree has associated text and/or visual embeddings for semantic retrieval and grounding.
Bi-level Discrete–Continuous Representations: Models such as VQ-CNMP discover a codebook of discrete high-level skills via vector quantization, while preserving a continuous latent space within each cluster for low-level trajectory refinement (Aktas et al., 2024). SkillDiffuser's two-level hierarchy involves a discrete skill abstraction head (VQ-VAE bottleneck over horizon-based representations) that conditions low-level diffusion planning for trajectory execution (Liang et al., 2023).
Graph- and Memory-Based Hierarchies: Manipulation frameworks employ multi-graph structures comprising a task graph (semantic decomposition into subtasks and primitives), a scene graph (the physical environment and its relations), and a state graph (binding semantic steps to scene objects) (Qi et al., 2024). LLM-based agent systems maintain skill banks or plan memory with knowledge stored at distinct abstraction layers, supporting retrieval and composition via intent or context similarity (Xia et al., 9 Feb 2026, Wang et al., 6 Apr 2026, Lee et al., 19 Feb 2026).
Reinforcement Learning-Based Compositions: Hierarchical RL methods utilize explicit option frameworks and modularity-driven partitions to induce multi-level skills ("options") at progressively coarser timescales, yielding skill hierarchies where higher-level skills are compositions over lower-level options (Evans et al., 2023, Yang et al., 9 Mar 2026).
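The three-level layout described for SkillX (planning, functional, atomic) can be pictured as nested records. The following is only a minimal sketch under stated assumptions: the class names, field names, and the example tool identifiers (`flights.search`, `flights.book`, `mail.send`) are hypothetical and are not taken from any of the cited systems.

```python
from dataclasses import dataclass, field


@dataclass
class AtomicSkill:
    """Lowest level: a single tool invocation plus usage notes."""
    name: str
    tool: str  # hypothetical tool identifier
    constraints: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)


@dataclass
class FunctionalSkill:
    """Middle level: a multi-step routine built from atomic skills."""
    name: str
    steps: list[AtomicSkill]


@dataclass
class PlanningSkill:
    """Top level: an ordered task decomposition over functional skills."""
    name: str
    subtasks: list[FunctionalSkill]

    def flatten(self) -> list[str]:
        """Expand the plan into the ordered tool calls it implies."""
        return [a.tool for f in self.subtasks for a in f.steps]


# Hypothetical example data, for illustration only.
search = AtomicSkill("search_flights", tool="flights.search",
                     constraints=["date must be ISO 8601"])
book = AtomicSkill("book_flight", tool="flights.book",
                   failure_modes=["fare expired"])
notify = AtomicSkill("send_email", tool="mail.send")

plan = PlanningSkill(
    "arrange_trip",
    subtasks=[
        FunctionalSkill("find_and_book", [search, book]),
        FunctionalSkill("confirm", [notify]),
    ],
)
print(plan.flatten())  # ['flights.search', 'flights.book', 'mail.send']
```

Keeping each level as its own type makes the dependency direction explicit: plans reference functional skills, which reference atomic skills, and never the reverse.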

2. Algorithmic Construction and Learning of Skill Hierarchies

Several algorithmic regimes have been developed for constructing and populating multi-level skill representations:

Bottom-Up Extraction from Demonstration/Experience: SkillX iteratively builds the library by rolling out agent trajectories, extracting plans, macro-routines, and atomic invocations per success/failure, then refining via merging, deduplication, and schema-based filtering. Libraries are stored and indexed for plug-and-play retrieval (Wang et al., 6 Apr 2026). Uni-Skill parses multi-hour video data to extract hierarchical skill trees, using VLM-based procedure extraction, description generation, and temporal alignment (Xie et al., 3 Mar 2026). In RL, modularity maximization is applied recursively to the state-transition graph, with the Louvain algorithm producing nested partitions that define the initiation and termination sets for options at each abstraction level (Evans et al., 2023).
Neuro-symbolic and Hybrid Models: VQ-CNMP employs unsupervised clustering of demonstration trajectories into discrete skill vectors and fine-tuning for low-level control. High-level plans are subsequently synthesized via LLMs that sequence discovered skills, while continuous latent optimization produces precise actions (Aktas et al., 2024).
Recursive Skill Evolution and Lifelong Learning: SkillRL interleaves policy optimization with periodic extraction of new general and task-specific skills by a teacher model, updating the SkillBank as task performance criteria indicate deficiencies. This leads to continual evolution of both policy and skill library (Xia et al., 9 Feb 2026). AutoSkill formalizes skill artifacts as versioned tuples (containing name, prompt, triggers, tags, in-context examples, and version), and supports automatic extraction, merging, and injection into LLM queries across sessions, tracked in a persistent SkillBank (Yang et al., 1 Mar 2026).
Experience-Driven Compression for RL: Multi-level meta-RL frameworks repeatedly compress MDPs by treating families of lower-level policies as abstract actions at higher levels, building parametric skill generators and composing embeddings for transfer and efficiency (Yang et al., 9 Mar 2026).
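A versioned skill artifact with the fields listed for AutoSkill (name, prompt, triggers, tags, in-context examples, version) and a persistent bank that merges re-extracted skills might look roughly like the sketch below. The merge policy (newest prompt wins, metadata is unioned, version is incremented) and the substring-based trigger matching are assumptions made for illustration, not the published behavior; `Skill`, `SkillBank`, `upsert`, and `match` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Skill:
    name: str
    prompt: str
    triggers: tuple[str, ...] = ()
    tags: tuple[str, ...] = ()
    examples: tuple[str, ...] = ()
    version: int = 1


def _union(a, b):
    """Order-preserving union of two tuples (drops duplicates)."""
    return tuple(dict.fromkeys(a + b))


class SkillBank:
    """In-memory stand-in for a persistent skill store."""

    def __init__(self):
        self._skills = {}

    def upsert(self, new: Skill) -> Skill:
        old = self._skills.get(new.name)
        if old is None:
            merged = new
        else:
            # Assumed policy: newest prompt wins, metadata is unioned,
            # and the version number is incremented.
            merged = replace(
                old,
                prompt=new.prompt,
                triggers=_union(old.triggers, new.triggers),
                tags=_union(old.tags, new.tags),
                examples=_union(old.examples, new.examples),
                version=old.version + 1,
            )
        self._skills[new.name] = merged
        return merged

    def match(self, query: str) -> list[Skill]:
        """Return skills with at least one trigger found in the query."""
        q = query.lower()
        return [s for s in self._skills.values()
                if any(t in q for t in s.triggers)]


bank = SkillBank()
bank.upsert(Skill("summarize", "Summarize in three bullets.",
                  triggers=("summarize",), tags=("writing",)))
merged = bank.upsert(Skill("summarize", "Summarize in three bullets with citations.",
                           triggers=("tl;dr", "summarize"), tags=("writing",)))
hits = bank.match("Please give me a TL;DR of this report")
```

After the second `upsert`, the bank holds a single `summarize` entry at version 2 whose triggers are the deduplicated union of both extractions, which is one simple way to keep the library from growing with near-duplicates.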

3. Representation, Retrieval, and Composition Mechanisms

Semantic, Geometric, and Physical Levels: Robotic skill-transfer pipelines utilize semantic graphs for task decomposition and planning (LLM-driven), geometric planners for collision-free physical motion, and tactile representations for real-world contact and adjustment (Qi et al., 2024). Each layer operates on its own abstracted representation, with message-passing or graph neural embeddings at the semantic level, path-planning and pose optimization at the geometric level, and signal processing on tactile data at the physical layer.
Vector and Symbolic Embeddings: Libraries store skills (across all levels) as concatenated text/visual embeddings, enabling hierarchical retrieval by nearest-neighbor search with cosine similarity or by keyword/tag matching (Wang et al., 6 Apr 2026, Xie et al., 3 Mar 2026). IntentCUA uses multi-view (environment, action, description) latent representations, clustered to form intent and subgroup prototypes; low-level skills are canonicalized as action signatures and medoids within each subgroup (Lee et al., 19 Feb 2026).
Dynamic Injection and Adaptation: During execution, subgoals are decomposed into plans or routines, which in turn invoke atomic or primitive actions. Retrieval cascades (plan → functional → atomic) ensure fallback to the lowest necessary abstraction for execution (Wang et al., 6 Apr 2026). Skill-aware planners detect insufficiencies, request new skills, retrieve matching demonstrations, and adapt/extract constraints for grounding and integration (Xie et al., 3 Mar 2026).
Compositionality and Hierarchy: Options are composed across timescales (sequential chaining, parent-child linkage), and entire layered trees define allowed transitions and dependencies. This forms the basis for high-level planning that defers realization details to increasingly refined subprotocols (Evans et al., 2023, Mao et al., 2024).
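The embedding-based retrieval cascade (plan → functional → atomic) can be illustrated with a small nearest-neighbor search over cosine similarity. This is a sketch under stated assumptions: the three-dimensional vectors are toy stand-ins for real text or visual embeddings, and the `threshold` value and the rule of stopping at the first level that clears it are hypothetical choices rather than details from the cited work.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, entries):
    """Return (name, score) for the entry most similar to the query."""
    return max(((name, cosine(query, vec)) for name, vec in entries.items()),
               key=lambda pair: pair[1])


def cascade(query, library, threshold=0.8):
    """Try plan, then functional, then atomic; stop at the first good match."""
    for level in ("plan", "functional", "atomic"):
        entries = library.get(level)
        if not entries:
            continue
        name, score = nearest(query, entries)
        if score >= threshold:
            return level, name
    return None  # nothing similar enough at any level


# Toy library: one dict of name -> embedding per abstraction level.
library = {
    "plan": {"book_trip": [1.0, 0.0, 0.0]},
    "functional": {"fill_form": [0.0, 1.0, 0.0]},
    "atomic": {"click": [0.0, 0.0, 1.0], "type_text": [0.0, 0.6, 0.8]},
}

print(cascade([0.9, 0.1, 0.0], library))  # ('plan', 'book_trip')
print(cascade([0.0, 0.1, 0.9], library))  # ('atomic', 'click')
print(cascade([1.0, 1.0, 0.0], library))  # None
```

A query that resembles a stored plan is answered at the plan level, while a query with no good plan or routine match falls through to the atomic level; a production system would replace the linear scan with an approximate nearest-neighbor index.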

4. Applications and Empirical Impact

Multi-level skill representations have been validated empirically across a variety of domains and architectures:

Robotics: Hierarchical frameworks (SkillDiffuser, RoboMatrix, Uni-Skill) demonstrate improved success rates and generalization on open-world and unseen manipulation tasks, leveraging interpretable skill abstractions and modular control (Liang et al., 2023, Mao et al., 2024, Xie et al., 3 Mar 2026). Four-layer libraries arrange verb classes, descriptions, and trajectory slices hierarchically, supporting retrieval and “few-shot” grounding for novel objectives (Xie et al., 3 Mar 2026).
LLM Agents and Desktop Automation: Plug-and-play skill knowledge bases (SkillX, SkillRL, AutoSkill) yield measurable gains in zero-shot, few-shot, and efficiency metrics. Experimental results show improvements of ~10–15 percentage points in task success and substantial reductions in context or token length versus memory-augmented or raw trajectory approaches (Wang et al., 6 Apr 2026, Xia et al., 9 Feb 2026, Yang et al., 1 Mar 2026).
Learning from Demonstration: Multi-representational frameworks optimize the selection of reproduction model per boundary condition, partitioning the generalization domain and outperforming single-representation approaches in both 2D simulations and real-robot tasks (Hertel et al., 2021).
Reinforcement Learning: RL approaches using multi-level skills demonstrate sharply reduced sample complexity and improved scaling in a range of benchmarks, including large-scale MDPs with millions of states. Empirical results confirm that hierarchical composition is critical for rapid convergence and robust transfer (Evans et al., 2023, Yang et al., 9 Mar 2026).

5. Theoretical Foundations, Guarantees, and Limitations

Transfer and Generalization: Hierarchical design enables skills to be reused across tasks and sampled for transfer in curriculum settings (Yang et al., 9 Mar 2026, Mao et al., 2024). Skill-embedding decompositions ensure decoupling of “what” (skill policy) and “how” (context embedding), supporting re-composition and modularity.
Error Propagation and Convergence: Formal proofs establish that, under mild regularity conditions, error and variance decrease with each compression level and planning efficiency improves. Multi-level policies also allow faster transfer and reduced value-iteration complexity (Yang et al., 9 Mar 2026, Evans et al., 2023).
Experimental Ablations: Removing or flattening the hierarchy, or omitting meta-level evolution/refinement, consistently degrades performance (−10 to −30 percentage points) in generalization and speed (Wang et al., 6 Apr 2026, Xia et al., 9 Feb 2026). Hierarchical frameworks require care in skill clustering and representation choice; for example, similarity metric bias influences which skill model is selected for generalization in LfD (Hertel et al., 2021).
Limitations: Many systems hard-code workflow sequences, keeping strategy selection outside the skill modules (Leu et al., 2018, Qi et al., 2024). Some frameworks rely on external planning modules (LLMs, symbolic planners), or require ground-truth environmental state for grounding, limiting end-to-end learning (Aktas et al., 2024).

6. Comparative Summary Table

Framework	Domain	Levels/Abstractions	Representation Form
SkillX (Wang et al., 6 Apr 2026)	LLM agent, API	Plan → Function → Atomic	Tuple, text, code
Uni-Skill (Xie et al., 3 Mar 2026)	Robotic manipulation	Verb Class → Instance → Description → Slice	Embedding tree + video
VQ-CNMP (Aktas et al., 2024)	Robot LfD	Discrete codebook (high) + continuous (low)	VQ-VAE + CNMP
SkillDiffuser (Liang et al., 2023)	Robot planning	Discrete skill code → Diffusion plan → Action	Transformer + U-Net
RoboMatrix (Mao et al., 2024)	Open-world robotics	LLM scheduler → meta-skill → control/hardware	Prompt + VLA model
SkillRL (Xia et al., 9 Feb 2026)	LLM agent RL	General/task-specific skill → policy	LLM, embeddings, distillation
IntentCUA (Lee et al., 19 Feb 2026)	Desktop automation	Multi-view intent → subgroup → skill-hint	Multi-view MLP + plan memory
Option hierarchy (Evans et al., 2023)	RL, control	Graph partition levels (Louvain)	State clusters, option policies

These approaches systematically map complex behaviors onto multi-level compositional hierarchies, yielding modularity, transferability, and empirical improvements in both sample efficiency and generalization across domains.

References: