AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

Published 13 Apr 2026 in cs.RO and cs.LG | (2604.11135v1)

Abstract: Pretrained video generation models provide strong priors for robot control, but existing unified world action models still struggle to decode reliable actions without substantial robot-specific training. We attribute this limitation to a structural mismatch: while video models capture how scenes evolve, action generation requires explicit reasoning about where to interact and the underlying manipulation intent. We introduce AIM, an intent-aware unified world action model that bridges this gap via an explicit spatial interface. Instead of decoding actions directly from future visual representations, AIM predicts an aligned spatial value map that encodes task-relevant interaction structure, enabling a control-oriented abstraction of future dynamics. Built on a pretrained video generation model, AIM jointly models future observations and value maps within a shared mixture-of-transformers architecture. It employs intent-causal attention to route future information to the action branch exclusively through the value representation. We further propose a self-distillation reinforcement learning stage that freezes the video and value branches and optimizes only the action head using dense rewards derived from projected value-map responses together with sparse task-level signals. To support training and evaluation, we construct a simulation dataset of 30K manipulation trajectories with synchronized multi-view observations, actions, and value-map annotations. Experiments on RoboTwin 2.0 benchmark show that AIM achieves a 94.0% average success rate, significantly outperforming prior unified world action baselines. Notably, the improvement is more pronounced in long-horizon and contact-sensitive manipulation tasks, demonstrating the effectiveness of explicit spatial-intent modeling as a bridge between visual world modeling and robot control.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper proposes an intermediate spatial value map interface that decouples future video prediction from action generation for efficient robotic manipulation.
It employs a two-stage training protocol with intent-causal attention and GRPO-based policy optimization to enhance sample efficiency and robustness.
Experiments on RoboTwin 2.0 demonstrate state-of-the-art performance with success rates of 94.0% (Easy) and 92.1% (Hard) across various manipulation tasks.

AIM: Intent-Aware Unified World Action Modeling with Spatial Value Maps

Motivation and Problem Statement

Existing unified world action models, which build on strong pretrained video generation backbones, exhibit limitations in robot control: decoding reliable actions from future visual representations remains data- and adaptation-intensive because the action inference is confounded by dense, task-agnostic appearance signals. The core issue is a structural mismatch between world modeling (what changes visually) and control execution (where and why to interact). Robotic manipulation necessitates explicit spatial reasoning about interaction regions and manipulation intent, which are only implicitly encoded in prior generative or VLA approaches.

AIM Model: Spatial Value Map Interface for World-to-Action Decoding

The proposed AIM model precisely addresses this by introducing an explicit spatial value map interface between video-based future prediction and action generation. Rather than directly inferring control actions from future visual latents, AIM decodes an intermediate value map spatially aligned with forthcoming RGB observations, encoding task-relevant interaction structures in a control-oriented abstraction. Actions are then conditioned on this value map, bridging the gap between perception and actionable intent.

The high-level contrast between classical models and AIM is captured in (Figure 1):

Figure 1: (a) Traditional world action models decode actions directly from visual futures; (b) AIM inserts an intermediate, spatially-aligned value map as the control interface.

AIM utilizes a pretrained visual backbone (Wan2.2-TI2V-5B), adapting it via mixture-of-transformers to jointly model future frames, spatial value maps, and actions. Key to AIM is the “intent-causal attention” mechanism: action decoding only attends to predicted value maps, not the dense future video tokens, enforcing structural disentanglement between world and action reasoning.

The architecture and sequential training protocol are shown in (Figure 2):

Figure 2: The AIM framework: Stage I jointly learns visual prediction, action generation, and value-map modeling via intent-causal attention. Stage II applies policy optimization (GRPO), finetuning only the action head against dense/sparse rewards derived from value map responses.

Methodology Details

Data and Tokenization

AIM operates on manipulation trajectories with synchronized multi-view video, action vectors, and per-step value map annotations (highlighting contact/affordance regions via simulator-based ground-truth generation). Video and value maps are discretized using a shared VAE tokenizer, maintaining geometric alignment; actions are projected to tokens via an MLP, and linguistic instructions are encoded by a pretrained T5 encoder, injected only into the world modeling pipeline.

Policy Factorization and Attention Structure

AIM factorizes the predictive model as:

$p(X^+, M^+, A^+ \mid \mathcal{H}_t) = p(X^+, M^+ \mid \mathcal{H}_t) \, p(A^+ \mid \mathcal{H}_t, M^+)$

with $\mathcal{H}_t$ denoting the history window. Under the intent-causal self-attention, action tokens are restricted visibilities: at each transformer layer, actions can access only the value tokens—not future video tokens or direct language features. This forces manipulation intent to be mediated through an interpretable, spatially localized map, simplifying inverse dynamics learning for the action head.

Self-Distillation Reinforcement Learning

AIM incorporates a two-stage training paradigm: initial supervised chunked sequence prediction (joint world, value, and action learning), followed by RL-based policy optimization (Stage II) focused exclusively on the action head. The world and value branches are frozen; the action head is updated via GRPO, driven by (i) sparse environment reward and (ii) dense, model-derived reward from the alignment of executed actions with high-value map regions (self-distillation). This approach stabilizes visual semantics and prevents catastrophic forgetting of visual priors.

Dataset and Value Map Supervision

The study introduces a comprehensive simulation dataset on RoboTwin 2.0, providing 30K trajectories. Value maps are annotated by simulating grasp/contact (pick) or object placement (place), projecting physical contact areas into the observation space, and producing Gaussian-smoothed affordance maps. This enables consistent, large-scale spatial supervision critical for explicit interaction reasoning.

Experimental Results

AIM is evaluated on 50 RoboTwin 2.0 manipulation tasks under both Easy and Hard settings, systematically compared to leading world-action and VLA baselines including $\pi_{0.5}$ , Motus, X-VLA, Fast-WAM, GigaWorld, and LingBot-VA. AIM achieves an average SR of 94.0% (Easy) / 92.1% (Hard), outperforming all external baselines and delivering the strongest success rates, particularly on tasks demanding long-horizon planning or precise spatial contact (see the representative task executions in (Figure 3)).

Figure 3: Qualitative rollouts on complex RoboTwin 2.0 tasks, demonstrating temporally-aligned visual prediction and stage-localized value maps directing interaction.

AIM’s explicit value-map interface confers especially robust gains on contact-sensitive and spatially localized manipulation tasks, where prior models falter by overfitting to appearance features or failing to identify actionable regions.

Discussion and Implications

The introduction of a spatial value map as an explicit, temporally grounded intent interface decouples visual prediction from control requirements, enabling more sample-efficient and interpretable action generation. This design solves the task sparsity and intent localization challenge that plagues dense action-from-video methods. The two-stage training protocol, utilizing self-distilled dense rewards from frozen value maps, demonstrates stable policy finetuning without retraining the entire generative backbone.

Practically, AIM advances the construction of generalist robot policies by enabling more reliable action decoding across diverse tasks without expensive robot-specific data. Theoretically, it provides further evidence that intermediate, spatially grounded intent representations are not only beneficial but essential for efficient visuomotor control. Related avenues for future work include scaling this spatial value framework to real-world robot vision under partial observability and extending the value map concept toward 3D/affordance reasoning in unstructured settings.

Conclusion

AIM redefines world-to-action modeling for robotic manipulation by introducing an intent-causal, spatial value map interface, backed by a principled mixture-of-transformers architecture and a robust two-phase training regime. The resultant system achieves state-of-the-art control success on the RoboTwin 2.0 benchmark, evidencing the value of explicit spatial reasoning for generalizable, interpretable robotic action generation. Continued exploration of spatial intent interfaces holds promise for bridging generative world models and actionable control in embodied AI.