Skill-Conditioned Executor

Updated 27 March 2026

A skill-conditioned executor is a framework that abstracts long-horizon task intents into parameterized skill primitives, enabling lower-level policies to execute complex actions.
It employs diverse architectures, from neural networks using cross-attention to symbolic planners with finite state machines, to condition action selection on structured skill representations.
Recent implementations report robust generalization, adaptive learning, and improved efficiency, though challenges remain under partial observability and in computational cost.

A skill-conditioned executor is a system that operationalizes high-level task intent—typically captured as a "skill" or temporally extended action—by conditioning lower-level (atomic or continuous) policies, controllers, or discrete action generators on structured skill representations. This paradigm abstracts complex, long-horizon behavior into parameterizable or symbolic skill primitives, then conditions the underlying execution pipeline on these skill abstractions to support robust, generalizable, and efficient action composition across diverse domains, such as robotics, sequential decision-making, web agents, and LLM-based systems.

1. Formal Definitions and Core Principles

A skill-conditioned executor has two central ingredients: (1) a parameterized skill or set of skills, and (2) an execution mechanism that conditions policy, controller, or action selection on the provided skill.

Parameterization can take the form of (i) symbolic operators (e.g., pick, place), (ii) latent vector embeddings (learned or manually specified), (iii) trajectories (sparse or dense), or (iv) natural language or image goals.

In canonical goal-conditioned visuomotor control, for example, each skill primitive such as "push" or "pick-and-place" is instantiated by encoding a desired goal state and then conditioning a policy or controller to realize the skill from the observed state toward the target (Groth et al., 2020). More generally, the executor maintains a mapping:

$\pi(a_t \mid s_t, z_t)$

where $s_t$ is the current observation/state, $a_t$ is the action, and $z_t$ is a skill conditioning variable (which may be discrete, continuous, or structured).
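As a minimal sketch of the mapping above, the following toy policy conditions on the skill by concatenating $s_t$ and $z_t$ before a linear-softmax layer. The class name, dimensions, one-hot skill codes, and random (untrained) weights are illustrative assumptions and are not taken from any of the cited systems.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SkillConditionedPolicy:
    """Toy linear-softmax policy pi(a | s, z) over a discrete action set."""

    def __init__(self, state_dim, skill_dim, num_actions, seed=0):
        rng = random.Random(seed)
        self.in_dim = state_dim + skill_dim
        # One weight row per action; weights are random, i.e. untrained.
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(self.in_dim)]
            for _ in range(num_actions)
        ]

    def action_probs(self, state, skill):
        x = list(state) + list(skill)  # conditioning by concatenation [s_t ; z_t]
        if len(x) != self.in_dim:
            raise ValueError("unexpected input size")
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
        return softmax(logits)


policy = SkillConditionedPolicy(state_dim=3, skill_dim=2, num_actions=4)
state = [0.5, -0.2, 1.0]
probs_push = policy.action_probs(state, [1.0, 0.0])  # hypothetical one-hot z for "push"
probs_pick = policy.action_probs(state, [0.0, 1.0])  # hypothetical one-hot z for "pick"
```

With the state held fixed, changing only $z_t$ changes the action distribution, which is the defining property of skill conditioning.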

Recent systems further extend this abstraction: in symbolic planning and learning, the executor is defined as a controller for a set of black-box skills $\Omega$ with abstracted preconditions, postconditions, and effect models, enabling classical planning or reactive scheduling over skill invocations (Yang et al., 22 Nov 2025). In reinforcement learning and imitation learning, hierarchical policies select skill embeddings or tokens at a coarse timescale, and the executor maps these to low-level actions (Rana et al., 2022, Huang et al., 2023, Kim et al., 2024).
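The symbolic view of skills as operators with preconditions and effects can be illustrated with a small sketch. The `Skill` dataclass, the predicate names, and the breadth-first planner below are hypothetical simplifications chosen for illustration; they do not reproduce the learned abstractions or planners of the cited systems.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Abstract operator wrapped around a black-box skill."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

    def applicable(self, state):
        return self.preconditions <= state

    def apply(self, state):
        return (state - self.del_effects) | self.add_effects


def plan(initial, goal, skills):
    """Breadth-first search over skill invocations; returns skill names or None."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal <= state:
            return path
        for skill in skills:
            if skill.applicable(state):
                nxt = skill.apply(state)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [skill.name]))
    return None


SKILLS = [
    Skill("pick", frozenset({"hand_empty", "block_on_table"}),
          frozenset({"holding_block"}),
          frozenset({"hand_empty", "block_on_table"})),
    Skill("place", frozenset({"holding_block"}),
          frozenset({"block_on_shelf", "hand_empty"}),
          frozenset({"holding_block"})),
]

result = plan({"hand_empty", "block_on_table"}, frozenset({"block_on_shelf"}), SKILLS)
print(result)  # ['pick', 'place']
```

In a real executor each planned skill name would dispatch to the corresponding low-level controller; here the effect model alone stands in for execution.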

2. Architecture and Conditioning Mechanisms

Skill-conditioned executors employ a range of architectures depending on the target modality and skill formalism:

Visuomotor and Manipulation (e.g., GEECO, Skill Transformer): Deep end-to-end neural networks (CNNs or transformers) encode images, proprioception, and skill representations. Conditioning is often applied through vector concatenation, attention, or token insertion (Huang et al., 2023, Groth et al., 2020).
Symbolic/Reactive Planning: The executor is a composition of symbolic planners, finite state machines (FSMs), and behavior trees (BTs). Skills are operators with well-specified preconditions and effects (Wuthier et al., 2021, Mukherjee et al., 2020, Yang et al., 22 Nov 2025), and the BT/FSM hybrid architecture enables asynchronous, preemptive skill switching required for multitasking.
Language and Instruction-conditioned Policies: Encoders map language instructions to skill bottlenecks (e.g., via VQ-VAE, InfoNCE objectives), and policies are conditioned on discrete skill tokens, supporting robust generalization to novel combinations and instructions (Ju et al., 2024).
Latent Diffusion and Generative Policies: In cross-embodiment imitation, human demonstration videos are distilled to low-dimensional trajectories or skill plans, which then condition both video-generation and action-execution policies through cross-attention and latent-space embedding (Tang et al., 9 Oct 2025, Xu et al., 2023).
LLM/Agentic Systems: Skills are externalized as markdown files, templates, or memory abstractions; the executor (LLM) is conditioned by injecting selected skill templates or structured predicates as plain-text prefix (prompt-based conditioning) (Zhang et al., 2 Feb 2026, Zhou et al., 19 Mar 2026).

Conditioning may be performed by concatenation, cross-attention, prefixing natural language skill instructions, or explicit fusion via learned embeddings. The underlying mechanism must ensure that the representation of the skill influences the resulting action distribution or behavioral pathway throughout the skill's temporal extent.
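Of the mechanisms listed above, prompt-based conditioning is the simplest to sketch: a selected skill template is placed as a plain-text prefix ahead of the observation and task. The skill library contents, the word-overlap selection heuristic, and the prompt layout below are assumptions made for illustration only.

```python
# Hypothetical skill library: skill name -> plain-text skill template.
SKILL_LIBRARY = {
    "fill_form": "Skill: fill_form\nSteps: locate each field, type the value, then submit.",
    "search_site": "Skill: search_site\nSteps: focus the search box, type the query, press Enter.",
}


def select_skill(task, library):
    """Pick the skill whose name shares the most words with the task (toy heuristic)."""
    task_words = set(task.lower().split())

    def overlap(name):
        return len(task_words & set(name.split("_")))

    # sorted() makes tie-breaking deterministic.
    return max(sorted(library), key=overlap)


def build_prompt(task, observation, library):
    name = select_skill(task, library)
    # Prompt-based conditioning: the skill text is a plain-text prefix.
    return f"{library[name]}\n\nObservation: {observation}\nTask: {task}\nNext action:"


prompt = build_prompt("search the site for running shoes", "page: shop home", SKILL_LIBRARY)
```

The resulting string would be passed to whatever language model acts as the executor; the model call itself is omitted because it depends on the deployment.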

3. Learning and Adaptation Approaches

The efficacy of skill-conditioned executors depends on how skill representations and conditioned controllers are learned and continually adapted:

End-to-End Supervised Learning: Policies are trained to map (observation, skill) pairs to trajectory segments, as in CNN–LSTM architectures for visual manipulation (Groth et al., 2020) or conditional diffusion policies for robot skills (Xu et al., 2023).

Mutual Information Maximization: In language-conditioned systems, skill tokens are discovered by maximizing $I(Z;L)$ or $I(z;L \mid s)$, so that each skill is recoverable from language and vice versa, as in Language Conditioned Skill Discovery (LCSD) (Ju et al., 2024).
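To make the objective concrete, the sketch below computes $I(Z;L)$ exactly for a small discrete joint distribution over skills and instructions. This is only an illustration of the quantity being maximized: practical systems estimate it with learned objectives over high-dimensional inputs, and the two joint tables are made-up examples.

```python
import math


def mutual_information(joint):
    """I(Z;L) in nats for a joint probability table joint[z][l]."""
    p_z = [sum(row) for row in joint]        # marginal over skills
    p_l = [sum(col) for col in zip(*joint)]  # marginal over instructions
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_z[i] * p_l[j]))
    return mi


# Each instruction maps to exactly one skill: Z is fully recoverable from L.
aligned = [[0.5, 0.0], [0.0, 0.5]]
# Skill and instruction are independent: L carries no information about Z.
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_aligned = mutual_information(aligned)          # equals log 2
mi_independent = mutual_information(independent)  # equals 0
```

The aligned table attains the maximum possible value for two equiprobable skills, which is the regime the discovery objective pushes toward.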

Hierarchical and Latent Variable Models: Skill discovery leverages VAE or state-conditioned flow models to embed low-level trajectories, with high-level RL policies or planners sampling skill embeddings, further augmented by residual policies for fine-tuned adaptation (Rana et al., 2022, Kim et al., 2024).
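The residual idea can be sketched as a base action from a skill decoder plus a small, bounded correction. In this toy version the skill latent is interpreted directly as a target state, and both the decoder and the residual are hand-written stand-ins for learned networks; the function names, gains, and clipping bound are assumptions.

```python
def skill_decoder(state, skill):
    """Stand-in for a frozen, pretrained decoder: moves toward a skill-specific target."""
    return [0.5 * (target - s) for s, target in zip(state, skill)]


def residual_policy(state, skill, gain=0.1):
    """Stand-in for a learned residual: a small state-dependent correction."""
    return [-gain * s for s in state]


def execute(state, skill, max_residual=0.2):
    base = skill_decoder(state, skill)
    delta = residual_policy(state, skill)
    # Clip the residual so it only fine-tunes, never overrides, the skill action.
    delta = [max(-max_residual, min(max_residual, d)) for d in delta]
    return [b + d for b, d in zip(base, delta)]


action = execute(state=[1.0, -4.0], skill=[2.0, 0.0])
# base = [0.5, 2.0]; residual [-0.1, 0.4] is clipped to [-0.1, 0.2]
```

Bounding the residual keeps the executor close to the behavior encoded in the skill prior while still allowing task-specific adjustment.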

Symbolic Abstraction and Predicate Learning: By inventing symbolic predicates and composing grounded abstract operators around black-box skills, SkillWrapper demonstrates formal guarantees (soundness, completeness) in symbolic planning domains (Yang et al., 22 Nov 2025). Data-driven predicate invention bootstraps robust planning across long, unseen task compositions.

Continual and Reflective Learning: Systems such as Memento-Skills (Zhou et al., 19 Mar 2026) implement closed-loop "Read–Write Reflective Learning," where both the skill selection policy and the skill set itself evolve over time as new failure modes are encountered and repaired, all without updating core LLM parameters.
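A heavily simplified sketch of such a read-write loop is shown below: the executor stays fixed while the external skill text is revised after a failure. The `frozen_executor` and `reflect` functions are hypothetical stand-ins (a real system would call a language model for both), and the task format is invented for illustration.

```python
def frozen_executor(task, skill_text):
    """Stand-in for a frozen model: succeeds only if the skill text covers the needed hint."""
    return task["required_hint"] in skill_text


def reflect(task):
    """Stand-in for a reflection step that turns a failure into a skill revision."""
    return f" Also: {task['required_hint']}."


def run_episode(task, skills):
    skill_text = skills.get(task["skill"], "")            # read the current skill
    if frozen_executor(task, skill_text):
        return True
    skills[task["skill"]] = skill_text + reflect(task)    # write the repaired skill
    return False


skills = {"checkout": "Click 'Buy'."}
task = {"skill": "checkout", "required_hint": "accept the cookie banner first"}

first = run_episode(task, skills)   # fails and triggers a skill revision
second = run_episode(task, skills)  # succeeds with the revised skill
```

Only the `skills` dictionary changes between the two episodes, mirroring the idea that adaptation happens in the skill store rather than in model parameters.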

Skill Recovery and Repair: Explicit contract-based skill representations (ContractSkill) enable deterministic step-level verification, fault localization, and minimal patch-based repair, supporting robust, transferable execution in web and GUI automation (Lu et al., 20 Mar 2026).
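Step-level verification and fault localization can be sketched by attaching a deterministic postcondition to every step and reporting the first step whose check fails. The `Step` structure, the dictionary-based environment, and the form-filling example are hypothetical and are not the contract format of the cited work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[Dict], None]          # mutates the environment state
    postcondition: Callable[[Dict], bool]   # deterministic check after the action


def run_with_contract(steps: List[Step], env: Dict) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for step in steps:
        step.action(env)
        if not step.postcondition(env):
            return step.name  # fault localized to this step
    return None


def patch_step(steps, name, new_action):
    """Minimal repair: replace only the action of the failing step."""
    return [Step(s.name, new_action, s.postcondition) if s.name == name else s
            for s in steps]


steps = [
    Step("open_form", lambda e: e.update(page="form"), lambda e: e.get("page") == "form"),
    # Deliberately buggy step: it types an empty email address.
    Step("type_email", lambda e: e.update(email=""), lambda e: "@" in e.get("email", "")),
    Step("submit", lambda e: e.update(submitted=True), lambda e: e.get("submitted") is True),
]

failed = run_with_contract(steps, {})
repaired = patch_step(steps, failed, lambda e: e.update(email="user@example.com"))
failed_after_repair = run_with_contract(repaired, {})
```

Because the postconditions are unchanged by the patch, the same contract that detected the fault also verifies the repair.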

4. Empirical Outcomes and Benchmarks

Skill-conditioned executor architectures have demonstrated the following empirical outcomes:

Robotic Manipulation: In GEECO, skill-conditioned dynamic-image policies obtained success rates of 99% on reaching, 89% on pushing, and 78% (pick) and 46% (place) on pick-and-place, far surpassing baseline MPC and imitation learning controllers (Groth et al., 2020). The executor generalized to unseen objects, scene clutter, and visual noise with minimal performance degradation.
Cross-Domain Transfer and Generalization: Symbolic predicate/inventory-based skill executors (e.g., SkillWrapper, CUA-Skill) solved up to 76.7% (SkillWrapper, real robot) and 57.5% (CUA-Skill Agent, desktop agent) of held-out tasks, surpassing non-skill or monolithic baselines. Repaired contract-based skills (ContractSkill) increased VisualWebArena and MiniWoB GUI task success by up to 47.8 and 12.8 points in cross-model transfer (Lu et al., 20 Mar 2026, Chen et al., 28 Jan 2026, Yang et al., 22 Nov 2025).
Reduced Token and Memory Footprint: Distilled skill banks in LLM agents led to 10–20× prompt reductions and higher reasoning utility, accelerating RL and RL-finetuned adaptation (Xia et al., 9 Feb 2026).
Adaptive, Fault-tolerant Execution: Non-bypassable skill-conditioned execution contracts (as in Survivability-Aware Execution) introduced invariants (risk budgets, cooldowns, slippage limits, allowlists), reducing maximum drawdown by 93.1% and delegation-induced loss by 97% on adversarial trading workloads (Borjigin et al., 10 Mar 2026).
Hierarchical RL and Sample Efficiency: Skill-conditioned priors and residual decoders in RL yielded accelerated learning, higher final rewards, and broader task generalization versus atomic action or vanilla multitask RL (ReSkill, GLvSA) (Rana et al., 2022, Kim et al., 2024).
Zero-shot and Few-shot Robustness: Offline-learned policies based on skill abstraction achieved a normalized success score above 90 after few-shot adaptation in goal-conditioned manipulation and navigation tasks, substantially outperforming non-hierarchical baselines under distribution shift (Kim et al., 2024).

5. Limitations and Theoretical Guarantees

Despite strong generalization results, skill-conditioned executors have several limitations:

Depth Ambiguity, Partial Observability: Visual skill executors experience performance degradation in ambiguous or occluded scenarios; explicit depth or contact modalities could be integrated to address this (Groth et al., 2020).
Skill Interface Specification: Incomplete or implicit skill contracts often result in brittle execution in GUI and web domains; ContractSkill demonstrates the benefits of explicit contracts and recovery rules (Lu et al., 20 Mar 2026).
Predicate Abstraction and Planning Coverage: Sufficient exploration and predicate invention are required for sound and complete symbolic planning (SkillWrapper provides probabilistic completeness bounds for learned abstractions) (Yang et al., 22 Nov 2025).
Computational Efficiency: Generative video-conditioned policies (e.g., cross-embodiment TrajSkill) are limited by sampling costs of diffusion transformers and the representation of 2D keypoint flows (Tang et al., 9 Oct 2025).

Several frameworks provide theoretical guarantees:

Soundness and Completeness: SkillWrapper formalizes abstract operator learning and planning such that soundness (consistent mapping between low-level and symbolic state) and probabilistic completeness (coverage of real plans) are provably satisfied given sufficient data and correct predicate invention (Yang et al., 22 Nov 2025).
Delegation Gap Instrumentation: Survivability-Aware Execution instruments and quantifies the delegation gap and operationalizes safety boundaries for skill-enabled agent stacks in trading (Borjigin et al., 10 Mar 2026).
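A safety boundary of this kind can be pictured as a guard that every action must pass through before execution. The sketch below enforces three illustrative invariants (an allowlist, a cooldown between accepted actions, and a cumulative risk budget); the class, its parameters, and the rejection messages are assumptions and do not reproduce the cited system's actual contract.

```python
class ExecutionGuard:
    """Toy guard: all actions go through submit(), which enforces the invariants."""

    def __init__(self, allowlist, risk_budget, cooldown_steps):
        self.allowlist = set(allowlist)
        self.remaining_budget = risk_budget
        self.cooldown_steps = cooldown_steps
        self.last_step = None  # step index of the last accepted action

    def submit(self, step, symbol, risk):
        if symbol not in self.allowlist:
            return "rejected: symbol not allowlisted"
        if self.last_step is not None and step - self.last_step < self.cooldown_steps:
            return "rejected: cooldown active"
        if risk > self.remaining_budget:
            return "rejected: risk budget exceeded"
        self.remaining_budget -= risk
        self.last_step = step
        return "accepted"


guard = ExecutionGuard(allowlist={"AAA"}, risk_budget=1.0, cooldown_steps=3)
results = [
    guard.submit(step=0, symbol="AAA", risk=0.6),  # accepted, budget drops to 0.4
    guard.submit(step=1, symbol="AAA", risk=0.1),  # too soon after step 0
    guard.submit(step=5, symbol="ZZZ", risk=0.1),  # symbol not on the allowlist
    guard.submit(step=5, symbol="AAA", risk=0.6),  # exceeds the remaining budget
    guard.submit(step=5, symbol="AAA", risk=0.3),  # accepted
]
```

Rejected actions leave the guard's state untouched, so a misbehaving upstream skill cannot consume budget or reset the cooldown by submitting invalid requests.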

6. Evolution and Future Directions

Contemporary research indicates a transition from static, hand-designed skill libraries toward fully learnable, self-evolving skill banks and dynamic skill-conditioned execution. MemSkill and Memento-Skills demonstrate closed-loop procedures, where distinct controllers, designers, and evolutionary mechanisms continuously select, apply, and refine memory skills or structured workflows in LLM-based agents (Zhang et al., 2 Feb 2026, Zhou et al., 19 Mar 2026). Reflective learning and recursive co-evolution ("Read–Write Reflective Learning," "recursive evolution mechanism") enable continual skill adaptation without modifying the frozen base model, supporting cross-domain generalization and rapid incorporation of failure-driven skill discoveries.

Furthermore, the explicit formalization of skill contracts, symbolic predicates, and state abstractions enables transparent, plannable, and transferable execution across diverse domains, including manipulation, virtual agents, and sequential decision-making. As the field matures, frameworks combining interpretable skill parameterization, robust adaptive execution, and verifiable hierarchical abstractions are expected to further advance the reliability, efficiency, and domain transferability of skill-conditioned executors.