- The paper introduces a cross-modal latent dynamics framework that integrates kinematic and semantic state transitions via asymmetric cross-attention for grounded foresight.
- It employs a two-stage learning process with self-supervision, EMA targets, and auxiliary reconstruction losses, and reaches a 94.7% average success rate on the LIBERO-LONG long-horizon benchmark.
- The framework demonstrates parameter efficiency and real-time planning capability, outperforming similar-scale models in diverse robotic manipulation scenarios.
CLaD: Cross-Modal Latent Dynamics for Grounded Foresight in Robotic Planning
Introduction
CLaD ("CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics" (2603.29409)) introduces a novel framework for long-horizon robotic manipulation that captures and utilizes the tightly coupled transitions between a robot's proprioceptive (kinematic) state and semantic (perception and language) state. Distinct from prior approaches that either plan within unimodal latent spaces or rely on explicit semantic artifact generation, CLaD models the co-evolution of kinematic and semantic states during action execution. It does so via an asymmetric cross-attention design, using proprioceptive transition queries to interpret semantic transitions, with a dedicated grounding mechanism to ensure predictive stability and semantic fidelity. This yields grounded latent foresights that condition an expressive, diffusion-based low-level controller.
Framework Overview and Methodology
CLaD is structured as a two-stage process. In the first stage, cross-modal latent dynamics are learned using self-supervision, with the model predicting the evolution of latent states representing both proprioception and semantics. Asymmetric cross-attention is employed: proprioceptive transition tokens serve as queries over semantic transitions, enforcing a causal and interpretable coupling of action-induced changes in both spaces. To stabilize learning and prevent representational collapse, predictions are constrained by EMA (exponential moving average) targets and lightweight auxiliary reconstruction losses, explicitly anchoring latent foresights to observable states.
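The asymmetry described above can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention in which only the proprioceptive tokens act as queries. This is not the authors' implementation: the function names, toy token values, and dimensions are hypothetical, and the learned query/key/value projections that a real model would use are omitted for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: proprioceptive transition tokens, shape (n_q, d)
    keys, values: semantic transition tokens, shapes (n_kv, d) and (n_kv, d_v)
    Returns the attended outputs (n_q, d_v) and the attention weights (n_q, n_kv).
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this proprioceptive query to every semantic key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of semantic values, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical toy tokens (not from the paper): 2 proprioceptive, 3 semantic.
proprio_tokens = [[1.0, 0.0], [0.0, 1.0]]
semantic_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Asymmetric: proprioception queries semantics, never the reverse.
fused, attn = cross_attend(proprio_tokens, semantic_tokens, semantic_tokens)
```

Each row of `attn` is a distribution over semantic tokens for one proprioceptive query, so the output has one fused vector per proprioceptive token. A symmetric or reversed variant would swap or duplicate the roles of the two token sets.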
The second stage leverages the predicted grounded foresights by modulating them with current proprioceptive and semantic observations through FiLM controllers. These modulated representations then serve as context for a diffusion policy, which generates robust, temporally extended action sequences.
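As a rough illustration of FiLM (feature-wise linear modulation), the sketch below scales and shifts each channel of a foresight vector using coefficients computed from the current observation. The linear layers, weights, and vector sizes are hypothetical placeholders chosen for readability. How CLaD actually parameterizes its FiLM controllers is not specified here.

```python
def linear(x, weight, bias):
    """Plain affine map: weight @ x + bias, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def film_modulate(foresight, observation, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate foresight channels with observation-dependent scale and shift.

    gamma and beta are predicted from the current observation, then applied
    channel-wise: out[c] = gamma[c] * foresight[c] + beta[c].
    """
    gamma = linear(observation, w_gamma, b_gamma)
    beta = linear(observation, w_beta, b_beta)
    return [g * f + b for f, g, b in zip(foresight, gamma, beta)]

# Hypothetical 2-channel example (values are illustrative only).
foresight = [0.5, -1.0]            # predicted latent foresight
observation = [1.0, 2.0]           # current proprioceptive/semantic features
w_gamma = [[1.0, 0.0], [0.0, 1.0]]  # identity: gamma equals the observation
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 0.0], [0.0, 0.0]]   # zero weights: beta equals the bias
b_beta = [0.1, 0.1]

context = film_modulate(foresight, observation, w_gamma, b_gamma, w_beta, b_beta)
```

In this toy setup the first channel is scaled by 1 and the second by 2 before a constant shift of 0.1. The resulting `context` vector stands in for the conditioning signal passed to the diffusion policy.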
Empirical Results
CLaD achieves a 94.7% average success rate on LIBERO-LONG, a challenging long-horizon robotic manipulation benchmark. This exceeds the reported results of much larger vision-language-action (VLA) models such as OpenVLA (93.8%, 7B params) and π0.5 (93.2%, 3.3B params), while requiring only 0.66B parameters. In comparison to similar-scale latent planners and high-level semantic planners, CLaD offers both parameter efficiency and stable performance across diverse tasks.
Ablation studies show that joint cross-modal foresight outperforms unimodal foresight (semantic-only or proprioceptive-only) as well as the no-foresight baseline. Specifically, proprioceptive foresight alone degrades performance (50.4%), and semantic foresight alone falls short of the full cross-modal architecture (91.5% vs 94.7%). The asymmetric design, with proprioceptive queries over semantic transitions, is empirically favored over symmetric or reversed alternatives.
The auxiliary reconstruction losses play a critical role, both quantitatively and qualitatively. Removing them reduces the success rate by 8.6 percentage points and erases the task-aligned cluster structure in the learned latent space, directly linking semantic grounding to policy effectiveness.
Computationally, CLaD offers practical advantages: low GPU memory requirements and a planning latency (0.012s) compatible with real-time deployment. This is a substantial improvement over semantic artifact generation approaches, which incur significant overhead due to iterative subgoal or image generation.
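The efficiency figures quoted in this section can be put on a common footing with some quick arithmetic. The sketch below only recombines numbers stated above. The derived replanning rate is an upper bound that assumes planning latency is the only per-step cost, which is an assumption rather than a reported measurement.

```python
# Figures quoted in the text above.
clad_params_b = 0.66          # CLaD parameter count, in billions
baseline_params_b = {"7B baseline": 7.0, "3.3B baseline": 3.3}
planning_latency_s = 0.012    # reported planning latency, in seconds

# How many times larger each baseline is than CLaD.
size_ratio = {name: size / clad_params_b for name, size in baseline_params_b.items()}

# Assumption: planning is the only per-step cost, so this is an upper bound.
planning_rate_hz = 1.0 / planning_latency_s

for name, ratio in size_ratio.items():
    print(f"{name}: {ratio:.1f}x the parameters of CLaD")
print(f"Upper bound on replanning rate: {planning_rate_hz:.0f} Hz")
```

Under these figures the larger baselines carry roughly five to ten times as many parameters, and a 0.012 s planning step leaves room for replanning on the order of tens of times per second.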
Theoretical Implications and Distinctions
CLaD advances latent-space planning by explicitly modeling the joint dynamic evolution of multiple modalities, rather than conflating them or aligning static observations. The architectural choice of asymmetric cross-attention imparts a structural prior consistent with the causal coupling between robot action and scene evolution, which traditional latent planners neglect. The result is a parameter-efficient model that better resolves perceptual ambiguities—especially in tasks with overlapping object features or similar visual scenes.
Crucially, CLaD's foresight is grounded via dual constraints: self-supervised prediction anchored by EMA targets, and explicit reconstruction losses that require the latents to remain decodable into observable poses and scene states. This joint grounding prevents excessive abstraction, a common failure mode in high-capacity latent models, and offers a practical path for integrating learned foresight as a surrogate for costly semantic artifact generation.
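The two grounding constraints can be sketched as an EMA update of a target encoder plus a combined loss. This is a minimal illustration under stated assumptions, not the paper's training code: the decay value, the reconstruction weight, the use of mean squared error for both terms, and all function names are hypothetical.

```python
def ema_update(target, online, decay=0.99):
    """Move target parameters toward online ones: t <- decay * t + (1 - decay) * o."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def grounding_loss(pred_latent, target_latent, decoded_obs, true_obs, recon_weight=0.1):
    """Latent prediction loss against the EMA target plus a weighted reconstruction loss.

    target_latent is treated as a constant (no gradient would flow through it);
    recon_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    prediction_term = mse(pred_latent, target_latent)
    reconstruction_term = mse(decoded_obs, true_obs)
    return prediction_term + recon_weight * reconstruction_term

# Hypothetical toy values.
target_params = ema_update([0.0, 0.0], [1.0, 2.0], decay=0.9)

loss = grounding_loss(
    pred_latent=[1.0, 1.0],    # predicted future latent
    target_latent=[0.0, 1.0],  # latent produced by the EMA target encoder
    decoded_obs=[2.0, 0.0],    # observation decoded from the predicted latent
    true_obs=[0.0, 0.0],       # actual future observation
)
```

The slowly moving target gives the predictor a stable regression objective, while the reconstruction term penalizes latents that drift away from anything decodable into observed poses or scene states.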
Generalizability, Limitations, and Future Directions
While CLaD demonstrates state-of-the-art performance on long-horizon planning, it exhibits slightly reduced generalization on short-horizon, high-variability tasks compared to large, heavily pre-trained VLAs. This highlights a dichotomy: CLaD is highly effective for task distributions where on-the-fly cross-modal dynamics modeling is paramount but does not yet fully leverage massive-scale pre-trained knowledge for rapid generalization. The authors suggest that integrating object-centric or spatially explicit latent foresight, and unifying dynamics modeling with large-scale VLA pre-training, are promising directions.
Additionally, the two-stage learning protocol, while efficient at inference, incurs notable training time (22 hours on a single RTX 4090), partly due to non-shared training of dynamics and policy. Extension to more heterogeneous sensory modalities—such as tactile or force sensing—and amortized dynamics pre-training across larger datasets are natural avenues for scaling.
Conclusion
CLaD establishes cross-modal latent dynamics as a competitive and efficient paradigm for robotic planning, by tightly integrating proprioceptive and semantic evolution during action. The framework's use of asymmetric attention, grounded foresight, and diffusion-based policies enables parameter-efficient, robust solution of long-horizon manipulation tasks. As robotic platforms and datasets grow in scale and diversity, CLaD's core methodology—modeling the causal interplay of multiple observable modalities in the latent space—offers a strong foundation for scalable, generalist robotic systems.