Keypose Imagination in Atomic Skill Spaces

Updated 27 December 2025
  • Keypose Imagination is a framework that combines structured atomic skill spaces with predictive terminal keypose modeling for advanced robotic manipulation.
  • It employs vision–language annotations and VQ-VAE encoding to build a semantically grounded skill library, enabling precise segmentation and compositional control.
  • Empirical evaluations show significant improvements in action transition accuracy and task success rates compared to traditional methods.

Keypose Imagination denotes a class of policy architectures and training protocols for robotic manipulation that integrates structured atomic skill spaces with predictive modeling of terminal keyposes. Its purpose is to enable multi-task agents to reason simultaneously about fine-grained motion sequences and overarching long-horizon goals, thereby supporting reliable skill chaining. This methodology, exemplified by the AtomSkill framework (Zhu et al., 20 Dec 2025), addresses core bottlenecks in semantic skill compositionality and generalization across diverse manipulation scenarios.

1. Formal Definition: Atomic Skill Space and Keypose Imagination

In AtomSkill (Zhu et al., 20 Dec 2025), demonstration trajectories $\tau = \{(O_t, a_t)\}_{t=1}^{T}$ are partitioned into a discrete set of atomic skills $\mathcal{S} = \{1, \dots, |\mathcal{S}|\}$, guided by changes in the binary gripper state (open/closed). Each atomic skill $s \in \mathcal{S}$ is encoded as a quantized sequence $z_q \in \mathcal{E}^n$, where $\mathcal{E} = \{e_1, \dots, e_K\} \subset \mathbb{R}^D$ is a learned VQ-VAE codebook and $n$ is the fixed token length per skill. The “keypose imagination” module refers to the joint prediction of (a) a chunked action sequence $\hat a_{t:t+H-1}$ and (b) the terminal keypose $\hat a_{\mathrm{keypose}} \approx a_{t+H}^{\mathrm{true}}$ for each atomic skill segment. This architecture enables skill-conditioned policies to anticipate the spatial terminus of the subtask while executing granular control.
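Because segmentation is driven solely by binary gripper-state transitions, the partitioning step is straightforward to illustrate. Below is a minimal sketch (not the authors' released code), assuming each action vector stores its gripper state in the last dimension:

```python
# Minimal sketch: split a demonstration into atomic skill segments at binary
# gripper-state changes. Assumes actions is a (T, action_dim) array whose last
# column is the gripper state (0 = open, 1 = closed); names are illustrative.
import numpy as np

def segment_by_gripper(actions: np.ndarray) -> list[slice]:
    """Return slices of the trajectory delimited by gripper open/close transitions."""
    gripper = (actions[:, -1] > 0.5).astype(int)   # binarize the gripper channel
    change = np.flatnonzero(np.diff(gripper))      # timesteps where the state flips
    bounds = np.concatenate(([0], change + 1, [len(actions)]))
    return [slice(b0, b1) for b0, b1 in zip(bounds[:-1], bounds[1:])]

# Toy trajectory with T = 6 steps and one gripper closure at t = 3:
actions = np.array([[0.1, 0.0], [0.2, 0.0], [0.3, 0.0],
                    [0.3, 1.0], [0.2, 1.0], [0.1, 1.0]])
segments = segment_by_gripper(actions)             # -> [slice(0, 3), slice(3, 6)]
```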

2. Construction of Semantically Grounded Atomic Skill Library

AtomSkill employs gripper-state change detection as the segmentation cue and vision–language annotation, using models such as Qwen-2.5-VL, to extract natural-language descriptions $L_{s_i}$ and semantic skill labels $s_i$. VQ-VAE encoding transforms segment actions into continuous embeddings $z_e^j \in \mathbb{R}^D$, quantized via nearest-neighbor search into codebook entries to assemble discrete skill representations $z_q^j$. Temporal coherence and semantic clustering are imposed by supervised contrastive losses:

$$\mathcal{L}_{\mathrm{temp}} = -\sum_{i=1}^{N} \frac{1}{|P_{\mathrm{temp}}(i)|} \sum_{p \in P_{\mathrm{temp}}(i)} \log \frac{S_{i,p}}{\sum_{a \in A(i)} S_{i,a}},$$

$$\mathcal{L}_{\mathrm{skill}} = -\sum_{i=1}^{N} \frac{1}{|P_{\mathrm{skill}}(i)|} \sum_{p \in P_{\mathrm{skill}}(i)} \log \frac{S_{i,p}}{\sum_{a \in A(i)} S_{i,a}},$$

with $S_{i,j} = \exp\!\left(\frac{z^i \cdot z^j}{\mathcal{T}}\right)$ and temperature $\mathcal{T}$.
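Both losses share the same supervised-contrastive form and differ only in which samples count as positives for each anchor (temporal neighbours within a demonstration for $\mathcal{L}_{\mathrm{temp}}$, segments sharing a semantic skill label for $\mathcal{L}_{\mathrm{skill}}$, in this reading). A minimal PyTorch sketch of that shared form, using illustrative names and a mean over anchors rather than a sum, is:

```python
# Sketch of a generic supervised contrastive loss matching the form above;
# `labels` groups the embeddings that should attract each other, `tau` is the
# temperature T. This is an assumed implementation, not the released code.
import torch
import torch.nn.functional as F

def supervised_contrastive(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z: (N, D) skill embeddings; labels: (N,) integer grouping of positives."""
    z = F.normalize(z, dim=-1)                                 # unit-norm embeddings
    sim = z @ z.t() / tau                                      # log S_{i,j} = z_i . z_j / T
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye  # P(i): positives, anchor excluded
    logits = sim.masked_fill(eye, float("-inf"))               # A(i): all samples except the anchor
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability of positives per anchor; anchors without positives contribute 0.
    per_anchor = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor.mean()

# Example: four segment embeddings from two semantic skills.
loss = supervised_contrastive(torch.randn(4, 128), torch.tensor([0, 0, 1, 1]))
```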

3. Keypose-Conditioned Action Generation: Architecture and Objectives

The cross-attention action decoder $\psi_\theta$ fuses visual (front/wrist RGB), proprioceptive, and language features with the skill code $z_q$. At each timestep $t$, the queries attend to the $z_q^j$ keys/values. The decoder simultaneously regresses

$$\hat a_{t:t+H-1} = (\hat a_t, \dots, \hat a_{t+H-1}), \qquad \hat a_{\mathrm{keypose}} \approx a_{t+H}^{\mathrm{true}}$$

under the composite training loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{VQ}} + \beta_1 \mathcal{L}_a + \beta_2 \mathcal{L}_{\mathrm{contrast}} + \beta_3 \mathcal{L}_{\mathrm{keypose}},$$

where $\mathcal{L}_a$ and $\mathcal{L}_{\mathrm{keypose}}$ are $\ell_1$ losses on the predicted action chunk and terminal keypose, respectively. A diffusion-based sampler $\rho_\theta$ generates skill codes at inference, supporting compositional chaining of skills.
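A condensed sketch of this decoder and objective is given below; the tensor shapes, head designs (e.g., averaging decoder states for the keypose head), and loss weights are illustrative assumptions rather than values from the paper:

```python
# Sketch of a keypose-conditioned decoder whose queries cross-attend to the
# quantized skill tokens z_q, with separate heads for the action chunk and the
# terminal keypose, trained under the composite loss above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyposeDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, horizon=16, action_dim=7):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, d_model))   # one query per chunk step
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.action_head = nn.Linear(d_model, action_dim)            # predicts a_hat_{t:t+H-1}
        self.keypose_head = nn.Linear(d_model, action_dim)           # predicts a_hat_keypose

    def forward(self, obs_tokens, z_q):
        # obs_tokens: (B, N_obs, d) fused visual/proprioceptive/language features
        # z_q:        (B, n, d) quantized skill tokens used as keys/values
        B = obs_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        ctx = torch.cat([obs_tokens, z_q], dim=1)                    # condition on both streams
        h, _ = self.cross_attn(q, ctx, ctx)
        return self.action_head(h), self.keypose_head(h.mean(dim=1))

def composite_loss(pred_chunk, pred_keypose, gt_chunk, gt_keypose,
                   vq_loss, contrast_loss, beta1=1.0, beta2=0.1, beta3=1.0):
    # L = L_VQ + beta1 * L_a + beta2 * L_contrast + beta3 * L_keypose (betas illustrative)
    l_a = F.l1_loss(pred_chunk, gt_chunk)              # l1 on the H-step action chunk
    l_keypose = F.l1_loss(pred_keypose, gt_keypose)    # l1 on the terminal keypose
    return vq_loss + beta1 * l_a + beta2 * contrast_loss + beta3 * l_keypose
```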

4. Empirical Performance and Comparative Analysis

AtomSkill’s keypose imagination yields quantitative gains across simulated (RLBench, six tasks) and real-world manipulation environments. Average ATP/SR in simulation is 0.68/67.2%, surpassing ACT’s 0.55/46.7%, DP’s 0.54/37.2%, VQ-BeT’s 0.10/5.0%, and QueST’s 0.39/30.0%. For motion-pattern and spatial-localization tasks, keypose conditioning delivers pronounced improvements (ATP/SR: 0.83/82.2% and 0.53/52.2%, respectively). Real-world bimanual and single-arm benchmarks show similar advantages: e.g., single-arm ATP of 0.93 versus ACT’s 0.73. Ablation studies confirm the synergistic benefit of combining temporal and semantic contrastive losses with keypose prediction: removing these components reduces ATP/SR to 0.33/29.4%. Notably, keypose prediction enables robust chaining and clear disambiguation of spatial endpoints in composite skills.

5. Integration of High-Level Reasoning with Fine-Grained Control

Keypose imagination operationalizes a policy’s capacity to reason over both high-level skill intentions and low-level motor execution, unifying semantic abstraction (via the skill code $z_q$ and vision–language annotation) with direct control. The terminal keypose serves as a predictive anchor, guiding action chunking and facilitating seamless transition between skill segments. The explicit modeling of terminal states supports hierarchical and compositional policies that outperform recurrent or purely autoregressive baselines in handling multi-stage tasks and cross-task generalization.

6. Limitations and Practical Implications

While keypose imagination addresses semantic consistency and long-horizon planning, limitations persist. Gripper-keyframe segmentation may yield variable skill granularity, and reliance on pretrained vision-language annotators constrains transferability to visually ambiguous or poorly described contexts. The temporal alignment between predicted keyposes and true behavioral transitions depends heavily on the quality of demonstration data and codebook structuring. This suggests that future work should explore adaptive segmentation heuristics, richer semantic annotation, and hierarchical skill codebooks. A plausible implication is that further incorporation of event segmentation theory or self-supervised boundary detection—such as the Skill Boundary Detection proposed in (Deng et al., 11 Mar 2025)—may augment keypose imagination capabilities.

Keypose imagination is situated within a spectrum of atomic skill space frameworks. Compared to data-driven library construction via VLA fine-tuning (Li et al., 25 Jan 2025) and LLM-driven iterative skill emergence (Zhao et al., 23 May 2024), it uniquely implements predictive modeling of skill termination. Lifelong RL skill-space planning (Lu et al., 2020) and knowledge state networks for atomic learning (Rasch et al., 2021) focus on latent skill discoverability and assessment but do not offer explicit end-state reasoning. Thus, keypose imagination represents a distinct advancement in enabling policies that not only execute atomic skills but also plan and anticipate their targets within rich multi-task domains.
