Disentangled Visual Foresight in VLA Systems

Updated 27 November 2025
  • DVF is a model component that disentangles high-dimensional visual forecasting from semantic reasoning using a dedicated diffusion Transformer head.
  • It employs latent-action queries and a residual connection to the current frame to encode inter-frame dynamics, enabling rapid training convergence and robust action planning.
  • DVF’s architecture, integrated within the Mantis system, demonstrates state-of-the-art results on benchmarks like LIBERO and real-world robotic manipulation tasks.

Disentangled Visual Foresight (DVF) is a model component introduced within the Mantis Vision-Language-Action (VLA) framework, designed to separate high-dimensional pixel-space forecasting from semantic understanding and reasoning. DVF enables a VLA system to predict future visual states and corresponding latent actions while avoiding the computational and capacity bottlenecks associated with end-to-end image prediction. The DVF mechanism achieves this by attaching a diffusion Transformer head to a frozen vision-language encoder backbone, employing learnable latent-action queries, and utilizing a residual connection to the current visual frame. This separation facilitates fast convergence and robust comprehension, as demonstrated in Mantis's empirical performance on robotic manipulation tasks and instruction-following scenarios (Yang et al., 20 Nov 2025).

1. Motivation and Conceptual Foundation

The central challenge in Vision-Language-Action (VLA) models is the integration of high-dimensional visual input and sparse language instructions to generate effective action policies. Traditional approaches that require direct future-frame prediction by the backbone burden model capacity and slow training, while approaches that compress future observations into low-dimensional supervisory signals (e.g., keypoints) suffer from loss of fine-grained motion information.

DVF addresses these trade-offs by disentangling the visual foresight prediction task from the reasoning backbone. Instead of overloading the backbone with forecasting objectives, DVF delegates look-ahead generation to a separate diffusion Transformer head, which operates in conjunction with latent-action queries. These latent variables are optimized to encode essential inter-frame dynamics, allowing the backbone to maintain a focus on perception and language grounding. The key outcome is an efficient model that achieves rapid convergence and retains strong semantic and reasoning capabilities (Yang et al., 20 Nov 2025).

2. Architectural Structure and Functional Components

The Mantis architecture operationalizes DVF through the following modular structure at each timestep $t$:

  • Backbone $\mathbb{P}$ (Qwen2.5-VL): Processes the current visual observation $o_t$, language instruction $\ell$, and latent-action queries [LAT], producing hidden states $h_t = \mathbb{P}(o_t, \ell, [\mathrm{LAT}])$.
  • Connector $\mathcal{C}$: A 12-layer Transformer that merges $o_t$ and $h_t$ into the condition vector $c_t$ for the foresight head.
  • Diffusion Transformer (DiT) DVF Head $\mathcal{D}$: Uses Sana’s linear-complexity DiT, ingesting noisy future-frame latents $z_n$ and $c_t$ to generate denoised representations. A residual connection to $o_t$ lets the head predict only the change between frames.
  • Action Head $\pi$: A DiT-based policy module that takes $(h_t, [\mathrm{LAT}], [\mathrm{ACT}])$ and denoises action trajectories $a_{t:t+n}$, where [LAT] guides [ACT] via causal attention.

A summary table of Mantis’s components is presented below:

| Component | Input Modality | Role / Function |
|---|---|---|
| Backbone (Qwen2.5-VL) | $o_t$, $\ell$, [LAT] | Visual-language perception |
| Connector ($\mathcal{C}$) | $o_t$, $h_t$ | Condition vector for DiT |
| DVF Head ($\mathcal{D}$) | $z_n$, $c_t$, $o_t$ (residual) | Future-frame denoising |
| Action Head ($\pi$) | $h_t$, [LAT], [ACT] | Action trajectory forecasting |

This disentangled architecture enables scalable training and effective multi-modal integration.
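
To make the data flow concrete, the following is a minimal sketch of how these components could be wired at a single timestep. The module bodies, token counts, and dimensions are placeholders standing in for Qwen2.5-VL, the connector $\mathcal{C}$, the Sana DiT head, and the DiT policy head; it is not the released Mantis implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: model width, visual/text token counts, [LAT] query count,
# action horizon, and action dimensionality.
D, N_VIS, N_TXT, N_LAT, HORIZON, ACT_DIM = 256, 64, 16, 8, 16, 7

backbone  = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)  # ~ Qwen2.5-VL
connector = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)  # ~ connector C
dvf_head  = nn.TransformerDecoderLayer(D, nhead=8, batch_first=True)  # ~ Sana DiT head
act_head  = nn.TransformerDecoderLayer(D, nhead=8, batch_first=True)  # ~ policy head pi
act_proj  = nn.Linear(D, ACT_DIM)

lat_queries = nn.Parameter(torch.randn(1, N_LAT, D))    # learnable [LAT] queries
act_queries = nn.Parameter(torch.randn(1, HORIZON, D))  # learnable [ACT] queries

def forward_step(vis_tokens, txt_tokens, cur_latent, noisy_future_latent, noisy_actions):
    B = vis_tokens.shape[0]
    # 1) Backbone: perception, language grounding, and latent-action queries.
    h_t = backbone(torch.cat([vis_tokens, txt_tokens,
                              lat_queries.expand(B, -1, -1)], dim=1))
    # 2) Connector: fuse observation tokens and hidden states into condition c_t.
    c_t = connector(torch.cat([vis_tokens, h_t], dim=1))
    # 3) DVF head: denoise the future-frame latent conditioned on c_t; the
    #    residual to the current-frame latent means the head models only change.
    delta = dvf_head(noisy_future_latent, c_t)
    future_latent_pred = cur_latent + delta
    # 4) Action head: denoise the action trajectory conditioned on h_t, with
    #    [ACT] queries attached to the noisy action tokens.
    act_tokens = act_head(noisy_actions + act_queries.expand(B, -1, -1), h_t)
    actions_pred = act_proj(act_tokens)                  # (B, HORIZON, ACT_DIM)
    return future_latent_pred, actions_pred

# Toy usage with random tensors in place of encoded observations and latents.
B = 2
future, actions = forward_step(
    torch.randn(B, N_VIS, D), torch.randn(B, N_TXT, D),
    torch.randn(B, N_VIS, D), torch.randn(B, N_VIS, D),
    torch.randn(B, HORIZON, D))
print(future.shape, actions.shape)   # (2, 64, 256) and (2, 16, 7)
```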

3. Mathematical Formulation

DVF’s operational principles are encapsulated in three loss objectives, each corresponding to distinct output modalities:

  • Next-State Image Prediction: For a future frame $o_{t+n}$, its latent $z_0 = \mathrm{Enc}(o_{t+n})$ is diffused via $z^t \sim q(z^t \mid z_0)$. The DiT head learns $p_\theta(z^{t-1} \mid z^t, c_t)$ by optimizing the noise-prediction loss:

$$L_{DVF} = \mathbb{E}_{t \sim [1,T],\, z_0,\, \varepsilon \sim \mathcal{N}(0, I)} \left\| \varepsilon - \varepsilon_\theta(z^t, c_t, t) \right\|^2$$

At inference, the head samples $z^T \sim \mathcal{N}(0, I)$, runs the reverse chain conditioned on $c_t$, and decodes $\mathrm{Dec}(z^0) + o_t \rightarrow \hat{o}_{t+n}$.

  • Action Diffusion Loss: For an $n$-step action sequence, a latent $y_0$ is created and diffused. Optimization targets:

$$L_{action} = \mathbb{E}_{t,\, y_0,\, \eta} \left\| \eta - \eta_\theta(y^t, h_t, [\mathrm{LAT}], [\mathrm{ACT}], t) \right\|^2$$

  • Language Loss: If an instruction $\ell$ is present, a cross-entropy loss is applied to masked-token prediction or VQA supervision.
  • Combined Objective: During pretraining, the system minimizes:

$$\min_\theta\; \alpha L_{DVF} + L_{action} + \beta L_{lang}$$

where $\alpha$ and $\beta$ balance the modalities.

This structure ensures that visual foresight, action planning, and semantic understanding are optimized jointly, but disentangled through specialized architectural heads.
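
For concreteness, the following is a minimal, self-contained sketch of how these three objectives can be combined under a standard DDPM-style forward process. The noise schedule, tensor shapes, default weights, and the placeholder callables `eps_theta` and `eta_theta` (standing in for the DVF head and action head) are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffuse(x0, t, noise):
    """q(x^t | x_0): standard DDPM forward noising (illustrative schedule)."""
    a = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def combined_loss(z0, y0, c_t, h_t, logits, lang_targets,
                  eps_theta, eta_theta, alpha=0.1, beta=1.0):
    B = z0.shape[0]
    t = torch.randint(0, T, (B,))
    # L_DVF: noise prediction on the future-frame latent z_0 = Enc(o_{t+n})
    eps = torch.randn_like(z0)
    l_dvf = F.mse_loss(eps_theta(diffuse(z0, t, eps), c_t, t), eps)
    # L_action: noise prediction on the action-trajectory latent y_0
    eta = torch.randn_like(y0)
    l_act = F.mse_loss(eta_theta(diffuse(y0, t, eta), h_t, t), eta)
    # L_lang: cross-entropy on language tokens when an instruction is present
    l_lang = F.cross_entropy(logits.flatten(0, 1), lang_targets.flatten())
    return alpha * l_dvf + l_act + beta * l_lang

# Toy usage with identity "networks" and random tensors (shapes hypothetical).
B, Nz, D, H, A, L, V = 2, 64, 256, 16, 7, 12, 100
loss = combined_loss(torch.randn(B, Nz, D), torch.randn(B, H, A),
                     torch.randn(B, Nz, D), torch.randn(B, Nz, D),
                     torch.randn(B, L, V), torch.randint(0, V, (B, L)),
                     eps_theta=lambda z, c, t: z, eta_theta=lambda y, h, t: y)
print(loss.item())
```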

4. Latent-Action Queries and Implicit Dynamics Encoding

Latent-action queries ([LAT]) serve as bottleneck variables encoding inter-frame dynamics not recoverable from $o_t$ alone. The residual connection in the DVF head forces [LAT] to focus on change-related information. Empirically, [LAT] embeddings align with motion primitives, providing low-dimensional, explicit summaries of the required object or effector motions (e.g., translation, rotation, gripper opening/closing). The latent action at time $t$ is defined as:

$$a^{lat}_t = f_{query}([\mathrm{LAT}] \mid c_t)$$

These representations are fed, with $h_t$, into the action head for policy denoising. This suggests [LAT] acts as a compact, implicitly supervised forecaster, easing translation into torque commands and facilitating multi-modal generalization.
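
As a rough illustration of $f_{query}$, the snippet below cross-attends learnable [LAT] queries to the condition tokens $c_t$ and projects the result into a compact latent action. The attention layout, query count, and dimensions are assumptions for illustration rather than the released interface.

```python
import torch
import torch.nn as nn

D, N_LAT = 256, 8   # hypothetical width and number of [LAT] queries

class LatentActionReadout(nn.Module):
    """Sketch of f_query: cross-attend learnable [LAT] queries to the
    condition c_t and project them to a compact latent action a_t^lat."""
    def __init__(self):
        super().__init__()
        self.lat = nn.Parameter(torch.randn(1, N_LAT, D))
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
        self.proj = nn.Linear(D, D)

    def forward(self, c_t):
        q = self.lat.expand(c_t.shape[0], -1, -1)   # [LAT] queries per batch item
        out, _ = self.attn(q, c_t, c_t)             # queries attend to condition tokens
        return self.proj(out)                       # a_t^lat: (B, N_LAT, D)

readout = LatentActionReadout()
a_lat = readout(torch.randn(2, 96, D))              # c_t with 96 condition tokens (assumed)
print(a_lat.shape)                                  # torch.Size([2, 8, 256])
```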

5. Training Regimen and Data Utilization

DVF integration within Mantis follows a progressive, staged recipe:

  • Stage 1: Multiple-Gap Vision Only
    • Data: SSV2 human manipulation videos (~220K).
    • Training: DVF head, [LAT], [GAP] queries; frozen backbone. Objective: $L_{DVF}$ across random gaps.
  • Stage 2: Vision + Action Joint
    • Data: DROID robot demonstrations (~76K episodes).
    • Training: Unfreeze action queries. Optimize $\alpha L_{DVF} + L_{action}$ (gap = action chunk size); backbone frozen.
  • Stage 3: Language Supervised Mix
    • Data: DROID + 38 image-text datasets (VQA, OCR, planning).
    • Training: Unfreeze backbone; optimize $\alpha L_{DVF} + L_{action} + \beta L_{lang}$.

Fine-tuning for the LIBERO benchmark uses vision and action objectives with $\alpha = 0.1$ to maximize downstream success rates.
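
One compact way to express this recipe is as a freezing-and-loss schedule, sketched below. Module and dataset keys are placeholders and the loss weights are left symbolic; only the staging logic mirrors the description above.

```python
import torch.nn as nn

# Sketch of the three-stage schedule as data; names are illustrative placeholders.
STAGES = [
    {   # Stage 1: multiple-gap, vision-only pretraining on human video (SSV2)
        "data": ["SSV2"],
        "train": ["dvf_head", "lat_queries", "gap_queries"],
        "losses": {"dvf": 1.0},
    },
    {   # Stage 2: vision + action joint training on robot demos (DROID)
        "data": ["DROID"],
        "train": ["dvf_head", "lat_queries", "action_head", "act_queries"],
        "losses": {"dvf": "alpha", "action": 1.0},
    },
    {   # Stage 3: add language supervision and unfreeze the backbone
        "data": ["DROID", "image-text (VQA/OCR/planning)"],
        "train": ["backbone", "dvf_head", "lat_queries", "action_head", "act_queries"],
        "losses": {"dvf": "alpha", "action": 1.0, "lang": "beta"},
    },
]

def set_trainable(model_parts, stage):
    """Freeze everything except the parts listed for the current stage."""
    for name, module in model_parts.items():
        for p in module.parameters():
            p.requires_grad = name in stage["train"]

# Toy usage with stand-in modules.
parts = {
    "backbone": nn.Linear(8, 8), "dvf_head": nn.Linear(8, 8),
    "action_head": nn.Linear(8, 8), "lat_queries": nn.Embedding(8, 8),
    "gap_queries": nn.Embedding(4, 8), "act_queries": nn.Embedding(8, 8),
}
set_trainable(parts, STAGES[0])   # Stage 1: only the DVF head and queries train
```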

6. Empirical Evaluation and Ablations

Mantis’s empirical performance demonstrates the effectiveness of DVF:

  • LIBERO Benchmark: 96.7% average success; 98.8% (Spatial), 99.2% (Object), 94.4% (Goal), 94.2% (Long). This surpasses both vision-augmented and non-vision-augmented baselines.
  • Convergence: Achieves rapid convergence comparable to non-vision-augmented models; end-to-end foresight alternatives require up to ten additional epochs to succeed.
  • Real-World Robotics: Consistently outperforms $\pi_{0.5}$ in instruction-following, generalization, and reasoning on Agilex platforms.
  • DVF Ablation Study: Removal or modification of DVF yields lower performance (no-DVF: 91.3%, flawed-DVF: 94.4%, vanilla-DVF: 95.7%, pretrained-DVF: 96.2%), highlighting the necessity of foresight, the value of residual connectivity, and the impact of extensive video pretraining.
  • Adaptive Temporal Ensemble (ATE): Selective activation based on dynamic patch overlap cuts inference costs ~50% with no decrease in success rate.
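
The precise ATE rule is not reproduced in this summary. As one plausible reading, the sketch below re-runs the policy only when the fraction of image patches that changed since the last model call exceeds a threshold, otherwise reusing the cached action chunk; the patch-change measure, thresholds, and function names are all assumptions for illustration.

```python
import torch

def patch_change_fraction(prev_frame, cur_frame, patch=16, tau=0.05):
    """Fraction of patches whose mean absolute change exceeds tau.
    Frames are (B, C, H, W) tensors with values in [0, 1]; sizes assumed."""
    diff = (cur_frame - prev_frame).abs()
    patches = diff.unfold(2, patch, patch).unfold(3, patch, patch)
    per_patch = patches.mean(dim=(1, 4, 5))        # (B, H/patch, W/patch)
    return (per_patch > tau).float().mean().item()

def maybe_predict(policy, prev_frame, cur_frame, cached_chunk, gate=0.2):
    """Re-run the policy only when enough patches changed; otherwise reuse the
    cached action chunk, which is where the inference savings come from."""
    if cached_chunk is None or patch_change_fraction(prev_frame, cur_frame) > gate:
        return policy(cur_frame)
    return cached_chunk
```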

A plausible implication is that residual-based DVF efficiently isolates dynamics from semantic context, leading to robust, generalizable action execution and efficient deployment in both simulation and physical settings.

7. Implementation Details and Reproduction Guidelines

Released code and weights are provided via https://github.com/zhijie-group/Mantis. Core implementation features include:

  • Backbone: Qwen2.5-VL (3.7B parameters).
  • DVF Head: Sana DiT (1.4B) with 12-layer connector.
  • Action Head: DiT-based policy (0.3B); 30 diffusion steps per action trajectory.
  • Training Tools: DeepSpeed and AdamW optimizer; image sizes 512×512, wrist camera 256×256.
  • Reproduction Steps: (i) Freeze backbone, (ii) attach a diffusion head with latent queries and residual connection, (iii) sequentially introduce action/language losses via staged optimization.
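
A minimal sketch of steps (i) and (ii), assuming a PyTorch-style setup with stand-in modules: freeze the backbone and build an AdamW optimizer over only the newly attached parameters. Names and hyperparameters are placeholders, not the released training configuration.

```python
import torch
import torch.nn as nn

backbone    = nn.Linear(256, 256)         # stands in for Qwen2.5-VL
dvf_head    = nn.Linear(256, 256)         # stands in for the Sana DiT head
action_head = nn.Linear(256, 7)           # stands in for the DiT policy head
lat_queries = nn.Parameter(torch.randn(8, 256))

for p in backbone.parameters():           # (i) freeze the backbone
    p.requires_grad = False

trainable = (list(dvf_head.parameters()) + list(action_head.parameters())
             + [lat_queries])             # (ii) attached heads + latent queries
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
# (iii) action and language losses are then introduced stage by stage (Section 5).
```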

In summary, Disentangled Visual Foresight refines dense frame forecasting from a costly auxiliary signal into concise latent-action cues. This structural disentanglement co-optimizes perception, language comprehension, and action planning, enabling state-of-the-art success in both simulated and real robotic manipulation, with high convergence speed and minimal computational overhead (Yang et al., 20 Nov 2025).
