DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Published 31 Mar 2026 in cs.RO, cs.AI, cs.CV, and cs.LG | (2603.29844v1)

Abstract: The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

Summary

  • The paper introduces a dual-system architecture that decouples intent and action via a differentiable latent bottleneck for robust VLA performance.
  • The paper demonstrates superior simulation and real-world results, achieving a 70.2% average success rate on RoboCasa GR1 Tabletop and high few-shot efficiency.
  • The paper leverages human demonstration data and systematic ablations to validate scalable, interpretable latent world modeling within embodied AI.

Introduction

DIAL addresses a core challenge in embodied AI: unifying high-level intent and low-level motor execution within Vision-Language-Action (VLA) architectures. Traditional approaches either introduce non-differentiable bottlenecks (e.g., textual plans or pixel-level goal images) or treat Vision-Language Models (VLMs) merely as feature encoders. The latter leads to unstable optimization and underuses the VLM's semantic capabilities. DIAL avoids both problems by decoupling intent and action through a differentiable latent world modeling bottleneck, bridging high-level reasoning and precise robotic motor control (Figure 1).

Figure 1: DIAL structurally separates high-level VLM intent synthesis (System-2) and low-level action decoding (System-1) with a differentiable latent bottleneck for robust intent-to-action grounding.

Architectural Contributions

At the core of DIAL is a dual-system model: System-2 (VLM-based) synthesizes predictive latent foresight representing high-level intent within the VLM's ViT feature space. System-1 (policy) acts as a latent inverse dynamics model, decoding the difference between current observations and System-2's predicted foresight into robust, high-frequency motor actions. This architecture is explicitly implemented via:

  • Predictive Foresight via Latent World Modeling: System-2 generates spatially structured latent futures ($x_t$) using learnable queries and an MLP head, trained with an MSE loss to match ground-truth future ViT patch features.
  • Latent Inverse Dynamics Policy: System-1 fuses the current ViT-based visual context and the high-level foresight ($x_t$) in a cross-attentional DiT-based decoder, determining the necessary action transitions purely in latent space.
  • Decoupled-to-Unified Training: An initial "decoupled warmup" prevents interference and collapse: System-2 learns world modeling while System-1 is optimized using ground-truth futures. Joint optimization then lets action-aware gradients flow through the latent bottleneck, refining the VLM backbone in a controlled, regularized fashion.

    Figure 3: The dual-system DIAL architecture, showing ViT-based foresight synthesis and latent inverse dynamics-based action decoding.

This methodological stratification ensures that the VLM's decision making is structurally indispensable for the low-level policy, directly mitigating shortcut learning and forcing action policies to be causally dependent on VLM-predicted intent (Figure 3).

Figure 2: Comparative analysis demonstrates DIAL's strict bottleneck between reasoning and execution, outperforming hierarchical and end-to-end VLA designs without enforced structural grounding.
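The decoupled-to-unified schedule described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: `training_losses`, `toy_policy`, the list-of-lists feature layout, the use of MSE for the action loss, and the equal weighting of the two losses are all hypothetical choices made for readability. The only behavior taken from the text is that System-2 is supervised with an MSE loss against ground-truth future features, and that System-1 is conditioned on ground-truth futures during warmup and on predicted foresight during joint optimization.

```python
from typing import Callable, Dict, List

Patches = List[List[float]]  # one feature vector per ViT patch (hypothetical layout)


def mse(pred: Patches, target: Patches) -> float:
    """Mean squared error over all patches and channels."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def toy_policy(current: Patches, intent: Patches) -> List[float]:
    """Stand-in for System-1: the mean per-channel gap between intent and current features."""
    n = len(current)
    return [
        sum(intent[i][d] - current[i][d] for i in range(n)) / n
        for d in range(len(current[0]))
    ]


def training_losses(
    stage: str,
    current: Patches,
    future: Patches,
    foresight: Patches,
    expert_action: List[float],
    policy: Callable[[Patches, Patches], List[float]] = toy_policy,
) -> Dict[str, float]:
    """Compute the world-modeling and action losses for one sample."""
    if stage not in ("warmup", "joint"):
        raise ValueError("stage must be 'warmup' or 'joint'")
    # System-2 objective: predicted foresight should match ground-truth future features.
    world_model = mse(foresight, future)
    # Warmup: System-1 sees the ground-truth future, so its loss does not depend on System-2.
    # Joint: System-1 sees the predicted foresight, so its loss depends on System-2's output.
    intent = future if stage == "warmup" else foresight
    # MSE on actions is an assumption; equal loss weighting is also an assumption.
    action = mse([policy(current, intent)], [expert_action])
    return {"world_model": world_model, "action": action, "total": world_model + action}


current = [[0.0, 0.0], [0.0, 0.0]]
future = [[1.0, 2.0], [1.0, 2.0]]
foresight = [[0.5, 2.0], [0.5, 2.0]]  # imperfect System-2 prediction
expert_action = [1.0, 2.0]

print(training_losses("warmup", current, future, foresight, expert_action))
print(training_losses("joint", current, future, foresight, expert_action))
```

In this toy setup the action loss is zero during warmup because the policy is fed the true future, and it becomes nonzero in the joint stage because the foresight error now propagates into the predicted action. That dependence is what would let action-aware gradients reach System-2 in a differentiable implementation.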

Empirical Evaluation

Simulation Benchmarks

On RoboCasa GR1 Tabletop, DIAL establishes a new upper bound for VLA performance:

  • Full Data: DIAL achieves 70.2% average success rate, surpassing FLARE (55.0%) and GR00T-N1.6 (47.6%), reflecting strong intent-to-action alignment and robust transfer across both pick-and-place and articulated manipulation tasks.
  • Data Efficiency: With only 10% demonstration data (few-shot setting), DIAL attains 58.3% success, outperforming FLARE trained with $10\times$ more data and illustrating exceptional data efficiency due to the bottleneck-structured representation.

    Figure 5: DIAL achieves SOTA performance in RoboCasa GR1 simulation; the structured bottleneck enhances data efficiency.
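To put the reported margins in perspective, the short calculation below derives absolute and relative gains from the success rates quoted above. The helper `gain` and the dictionary layout are illustrative. The sketch also assumes that the 55.0% FLARE figure is the full-data result against which the 10% DIAL run is compared, which is how the text reads.

```python
from typing import Tuple

# Average success rates (%) on RoboCasa GR1 Tabletop, as quoted in the text.
success = {
    "DIAL (full data)": 70.2,
    "DIAL (10% data)": 58.3,
    "FLARE (full data)": 55.0,
    "GR00T-N1.6 (full data)": 47.6,
}


def gain(ours: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


for ours, baseline in [
    ("DIAL (full data)", "FLARE (full data)"),
    ("DIAL (full data)", "GR00T-N1.6 (full data)"),
    ("DIAL (10% data)", "FLARE (full data)"),
]:
    points, percent = gain(success[ours], success[baseline])
    print(f"{ours} vs {baseline}: +{points} points ({percent}% relative)")
```

On these numbers, full-data DIAL leads FLARE by 15.2 points, and the 10% DIAL run still leads full-data FLARE by 3.3 points.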

Ablation Analyses

A systematic dissection reveals:

  • World Modeling Crucial for Performance: Removing explicit world modeling supervision collapses success rates (<22%-31%).
  • Loose vs. Strict Interfaces: Concatenative or auxiliary foresight tokens (+SEER, +FLARE variants) do not enforce causal intent usage, yielding suboptimal grounding (49%-52%). Only explicit inverse dynamics in the native intent space achieves optimal policy grounding (>58%).
  • Feature Space Alignment: DIAL's performance degrades significantly when substituting VLM-native latents with DINO-v2 features, confirming the necessity of latent consistency across reasoning and control.

    Figure 7: Few-shot performance and ablation studies—structural interface and latent consistency are essential for stable VLM-to-action grounding.

Leveraging Human Data for Scalability

By pre-training on EgoDex human demonstration data, DIAL inherits physically grounded manipulation priors. This integration yields:

  • Improved In-Distribution and OOD Generalization: Success rates for pick-and-place and OOD tasks increase by up to 7%, and average OOD transfer rises from 46.2% to 51.2%. Gains are substantial for skills represented in the human data (e.g., object manipulation) but limited for categories absent from the demonstrations (e.g., articulated objects).

    Figure 9: Incorporating EgoDex boosts few-shot simulation and OOD generalization via effective cross-embodiment learning.

Real-World Validation and Stability

On the IRON-R01-1.11 robot, DIAL maintains robust physical performance:

  • In-Distribution: Warmup phase critical for training stability and real-robot success (77.5% with warmup vs 57.5% without).
  • Generalization: Consistently outperforms baselines in combinatorial, distractor, and instance-level OOD tasks, enabled by structured comparison between current ViT features and predicted foresight.
  • Role of Human Data: Removing human data from pre-training halves OOD performance, emphasizing the need for cross-embodiment demonstration as a semantic prior.

    Figure 4: DIAL demonstrates stable execution and generalization on real robots; warmup and human priors are essential.

    Figure 6: DIAL's zero-shot robustness extends across combinatorial, distractor, and instance transfer settings in the real world.

Foresight Interpretability

Qualitative analyses using PCA projections show that DIAL's latent foresight aligns closely with ground-truth future features in localized, task-relevant areas while diverging from current observations precisely where change is needed. This indicates that System-2 generates an actionable visual roadmap, not mere scene reconstructions, confirming that the structural bottleneck encodes semantically meaningful, task-specific transitions.

Figure 8: DIAL's latent foresight anticipates spatially- and semantically-precise future scenes, providing interpretable structural guidance for downstream action.
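The comparison behind this analysis can be approximated without PCA by measuring, for each patch, how far the foresight is from the current features and from the ground-truth future features. The sketch below swaps the PCA visualization for this simpler per-patch distance map. The function names, the threshold value, and the toy feature values are hypothetical and only illustrate the kind of pattern the text describes.

```python
from typing import Dict, List

Patches = List[List[float]]  # one feature vector per patch (hypothetical layout)


def sq_dist(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance between two patch feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def patch_report(current: Patches, foresight: Patches, future: Patches) -> List[Dict[str, float]]:
    """For each patch, measure how far the foresight is from 'now' and from the true future."""
    return [
        {
            "patch": i,
            "change_vs_current": sq_dist(f, c),
            "error_vs_future": sq_dist(f, g),
        }
        for i, (c, f, g) in enumerate(zip(current, foresight, future))
    ]


def changed_patches(report: List[Dict[str, float]], threshold: float) -> List[int]:
    """Indices of patches where the foresight departs from the current observation."""
    return [int(row["patch"]) for row in report if row["change_vs_current"] > threshold]


# Toy scene with three patches; only patch 1 is supposed to change.
current = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
future = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
foresight = [[1.0, 0.0], [0.9, 0.1], [1.0, 1.0]]

report = patch_report(current, foresight, future)
for row in report:
    print(row)
print("patches predicted to change:", changed_patches(report, threshold=0.5))
```

A foresight that behaves as described would show a large `change_vs_current` and a small `error_vs_future` on task-relevant patches, with both near zero elsewhere.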

Implications and Future Directions

DIAL's structural decoupling establishes a paradigm where high-level VLM reasoning is directly harnessed for action, enabling:

  • Highly data-efficient embodied policy learning with robust zero-shot OOD transfer.
  • Seamless integration of human-centric priors through action-free demonstrations.
  • Native compatibility with future advances in VLM/ViT pre-training and scalable action modules.

The architecture supports rapid transfer and iteration: a strong System-1 policy can be efficiently paired with new versions of VLM backbones, while further scaling foresight pre-training to large-scale, unlabeled human videos is anticipated to significantly enrich semantic grounding—pushing toward the development of more generalist and adaptive robotic agents.

Conclusion

DIAL addresses the structural grounding problem inherent to current VLAs by establishing an explicit, differentiable latent intent bottleneck between VLM-based decision-making and action decoding. This not only enables strong performance and generalization with high sample efficiency but also provides a modular template for future research in scalable, intent-aware embodied agents. Continued advances in VLMs and ViTs, together with effective use of unlabeled video, remain promising avenues for extending the DIAL architecture and its real-world applicability.
