Disentangled Visual Foresight in VLA Systems

Updated 27 November 2025
  • DVF is a model component that disentangles high-dimensional visual forecasting from semantic reasoning using a dedicated diffusion Transformer head.
  • It employs latent-action queries and a residual connection to the current frame to encode inter-frame dynamics, enabling rapid training convergence and robust action planning.
  • DVF’s architecture, integrated within the Mantis system, demonstrates state-of-the-art results on benchmarks like LIBERO and real-world robotic manipulation tasks.

Disentangled Visual Foresight (DVF) is a model component introduced within the Mantis Vision-Language-Action (VLA) framework, designed to separate high-dimensional pixel-space forecasting from semantic understanding and reasoning. DVF enables a VLA system to predict future visual states and corresponding latent actions while avoiding the computational and capacity bottlenecks associated with end-to-end image prediction. The DVF mechanism achieves this by attaching a diffusion Transformer head to a frozen vision-language encoder backbone, employing learnable latent-action queries, and utilizing a residual connection to the current visual frame. This separation facilitates fast convergence and robust comprehension, as demonstrated in Mantis's empirical performance on robotic manipulation tasks and instruction-following scenarios (Yang et al., 20 Nov 2025).

1. Motivation and Conceptual Foundation

The central challenge in Vision-Language-Action (VLA) models is the integration of high-dimensional visual input and sparse language instructions to generate effective action policies. Traditional approaches that require direct future-frame prediction by the backbone burden model capacity and slow training, while approaches that compress future observations into low-dimensional supervisory signals (e.g., keypoints) suffer from loss of fine-grained motion information.

DVF addresses these trade-offs by disentangling the visual foresight prediction task from the reasoning backbone. Instead of overloading the backbone with forecasting objectives, DVF delegates look-ahead generation to a separate diffusion Transformer head, which operates in conjunction with latent-action queries. These latent variables are optimized to encode essential inter-frame dynamics, allowing the backbone to maintain a focus on perception and language grounding. The key outcome is an efficient model that achieves rapid convergence and retains strong semantic and reasoning capabilities (Yang et al., 20 Nov 2025).

2. Architectural Structure and Functional Components

The Mantis architecture operationalizes DVF through the following modular structure at each timestep $t$:

  • Backbone $\mathbb{P}$ (Qwen2.5-VL): Processes the current visual observation $o_t$, language instruction $\ell$, and latent-action queries [LAT], producing hidden states $h_t = \mathbb{P}(o_t, \ell, [\mathrm{LAT}])$.
  • Connector $\mathcal{C}$: A 12-layer Transformer that merges $o_t$ and $h_t$ into the condition vector $c_t$ for the foresight head.
  • Diffusion Transformer (DiT) DVF Head $\mathcal{D}$: Uses Sana’s linear-complexity DiT, ingesting noisy future-frame latents $z_n$ and $c_t$ to generate denoised representations. A residual connection to $o_t$ lets the head predict only the change between frames.
  • Action Head $\pi$: A DiT-based policy module that takes $(h_t, [\mathrm{LAT}], [\mathrm{ACT}])$ and denoises action trajectories $a_{t:t+n}$, where [LAT] guides [ACT] via causal attention.

A summary table of Mantis’s components is presented below:

| Component | Input Modality | Role / Function |
|---|---|---|
| Backbone (Qwen2.5-VL) | $o_t$, $\ell$, [LAT] | Visual-language perception |
| Connector ($\mathcal{C}$) | $o_t$, $h_t$ | Condition vector for DiT |
| DVF Head ($\mathcal{D}$) | $z_n$, $c_t$, $o_t$ (residual) | Future-frame denoising |
| Action Head ($\pi$) | $h_t$, [LAT], [ACT] | Action trajectory forecasting |

This disentangled architecture enables scalable training and effective multi-modal integration.
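
To make the data flow concrete, the following is a minimal sketch of how these components could be wired at a single timestep. The module bodies, token counts, and dimensions are placeholders standing in for Qwen2.5-VL, the connector $\mathcal{C}$, the Sana DiT head, and the DiT policy head; it is not the released Mantis implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: model width, visual/text token counts, [LAT] query count,
# action horizon, and action dimensionality.
D, N_VIS, N_TXT, N_LAT, HORIZON, ACT_DIM = 256, 64, 16, 8, 16, 7

backbone  = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)  # ~ Qwen2.5-VL
connector = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)  # ~ connector C
dvf_head  = nn.TransformerDecoderLayer(D, nhead=8, batch_first=True)  # ~ Sana DiT head
act_head  = nn.TransformerDecoderLayer(D, nhead=8, batch_first=True)  # ~ policy head pi
act_proj  = nn.Linear(D, ACT_DIM)

lat_queries = nn.Parameter(torch.randn(1, N_LAT, D))    # learnable [LAT] queries
act_queries = nn.Parameter(torch.randn(1, HORIZON, D))  # learnable [ACT] queries

def forward_step(vis_tokens, txt_tokens, cur_latent, noisy_future_latent, noisy_actions):
    B = vis_tokens.shape[0]
    # 1) Backbone: perception, language grounding, and latent-action queries.
    h_t = backbone(torch.cat([vis_tokens, txt_tokens,
                              lat_queries.expand(B, -1, -1)], dim=1))
    # 2) Connector: fuse observation tokens and hidden states into condition c_t.
    c_t = connector(torch.cat([vis_tokens, h_t], dim=1))
    # 3) DVF head: denoise the future-frame latent conditioned on c_t; the
    #    residual to the current-frame latent means the head models only change.
    delta = dvf_head(noisy_future_latent, c_t)
    future_latent_pred = cur_latent + delta
    # 4) Action head: denoise the action trajectory conditioned on h_t, with
    #    [ACT] queries attached to the noisy action tokens.
    act_tokens = act_head(noisy_actions + act_queries.expand(B, -1, -1), h_t)
    actions_pred = act_proj(act_tokens)                  # (B, HORIZON, ACT_DIM)
    return future_latent_pred, actions_pred

# Toy usage with random tensors in place of encoded observations and latents.
B = 2
future, actions = forward_step(
    torch.randn(B, N_VIS, D), torch.randn(B, N_TXT, D),
    torch.randn(B, N_VIS, D), torch.randn(B, N_VIS, D),
    torch.randn(B, HORIZON, D))
print(future.shape, actions.shape)   # (2, 64, 256) and (2, 16, 7)
```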

3. Mathematical Formulation

DVF’s operational principles are encapsulated in three loss objectives, each corresponding to distinct output modalities:

  • Next-State Image Prediction: For a future frame $o_{t+n}$, its latent $z_0 = \mathrm{Enc}(o_{t+n})$ is diffused via $z^t \sim q(z^t \mid z_0)$. The DiT head learns $p_\theta(z^{t-1} \mid z^t, c_t)$ by optimizing the noise-prediction loss:

$$L_{DVF} = \mathbb{E}_{t \sim [1,T],\, z_0,\, \varepsilon \sim \mathcal{N}(0, I)} \left\| \varepsilon - \varepsilon_\theta(z^t, c_t, t) \right\|^2$$

At inference, the head samples $z^T \sim \mathcal{N}(0, I)$, runs the reverse chain conditioned on $c_t$, and decodes $\mathrm{Dec}(z^0) + o_t \rightarrow \hat{o}_{t+n}$.

  • Action Diffusion Loss: For an $n$-step action sequence, a latent $y_0$ is created and diffused. Optimization targets:

$$L_{action} = \mathbb{E}_{t,\, y_0,\, \eta} \left\| \eta - \eta_\theta(y^t, h_t, [\mathrm{LAT}], [\mathrm{ACT}], t) \right\|^2$$

  • Language Loss: If an instruction $\ell$ is present, a cross-entropy loss is applied to masked-token prediction or VQA supervision.
  • Combined Objective: During pretraining, the system minimizes:

$$\min_\theta\; \alpha L_{DVF} + L_{action} + \beta L_{lang}$$

where $\alpha$ and $\beta$ balance the modalities.

This structure ensures that visual foresight, action planning, and semantic understanding are optimized jointly, but disentangled through specialized architectural heads.
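
For concreteness, the following is a minimal, self-contained sketch of how these three objectives can be combined under a standard DDPM-style forward process. The noise schedule, tensor shapes, default weights, and the placeholder callables `eps_theta` and `eta_theta` (standing in for the DVF head and action head) are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffuse(x0, t, noise):
    """q(x^t | x_0): standard DDPM forward noising (illustrative schedule)."""
    a = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def combined_loss(z0, y0, c_t, h_t, logits, lang_targets,
                  eps_theta, eta_theta, alpha=0.1, beta=1.0):
    B = z0.shape[0]
    t = torch.randint(0, T, (B,))
    # L_DVF: noise prediction on the future-frame latent z_0 = Enc(o_{t+n})
    eps = torch.randn_like(z0)
    l_dvf = F.mse_loss(eps_theta(diffuse(z0, t, eps), c_t, t), eps)
    # L_action: noise prediction on the action-trajectory latent y_0
    eta = torch.randn_like(y0)
    l_act = F.mse_loss(eta_theta(diffuse(y0, t, eta), h_t, t), eta)
    # L_lang: cross-entropy on language tokens when an instruction is present
    l_lang = F.cross_entropy(logits.flatten(0, 1), lang_targets.flatten())
    return alpha * l_dvf + l_act + beta * l_lang

# Toy usage with identity "networks" and random tensors (shapes hypothetical).
B, Nz, D, H, A, L, V = 2, 64, 256, 16, 7, 12, 100
loss = combined_loss(torch.randn(B, Nz, D), torch.randn(B, H, A),
                     torch.randn(B, Nz, D), torch.randn(B, Nz, D),
                     torch.randn(B, L, V), torch.randint(0, V, (B, L)),
                     eps_theta=lambda z, c, t: z, eta_theta=lambda y, h, t: y)
print(loss.item())
```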

4. Latent-Action Queries and Implicit Dynamics Encoding

Latent-action queries ([LAT]) serve as bottleneck variables encoding inter-frame dynamics not recoverable from $o_t$ alone. The residual connection in the DVF head forces [LAT] to focus on change-related information. Empirically, [LAT] embeddings align with motion primitives, providing low-dimensional, explicit summaries of the required object or effector motions (e.g., translation, rotation, gripper opening/closing). The latent action at time $t$ is defined as:

$$a^{lat}_t = f_{query}([\mathrm{LAT}] \mid c_t)$$

These representations are fed, with $h_t$, into the action head for policy denoising. This suggests [LAT] acts as a compact, implicitly supervised forecaster, easing translation into torque commands and facilitating multi-modal generalization.
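
As a rough illustration of $f_{query}$, the snippet below cross-attends learnable [LAT] queries to the condition tokens $c_t$ and projects the result into a compact latent action. The attention layout, query count, and dimensions are assumptions for illustration rather than the released interface.

```python
import torch
import torch.nn as nn

D, N_LAT = 256, 8   # hypothetical width and number of [LAT] queries

class LatentActionReadout(nn.Module):
    """Sketch of f_query: cross-attend learnable [LAT] queries to the
    condition c_t and project them to a compact latent action a_t^lat."""
    def __init__(self):
        super().__init__()
        self.lat = nn.Parameter(torch.randn(1, N_LAT, D))
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
        self.proj = nn.Linear(D, D)

    def forward(self, c_t):
        q = self.lat.expand(c_t.shape[0], -1, -1)   # [LAT] queries per batch item
        out, _ = self.attn(q, c_t, c_t)             # queries attend to condition tokens
        return self.proj(out)                       # a_t^lat: (B, N_LAT, D)

readout = LatentActionReadout()
a_lat = readout(torch.randn(2, 96, D))              # c_t with 96 condition tokens (assumed)
print(a_lat.shape)                                  # torch.Size([2, 8, 256])
```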

5. Training Regimen and Data Utilization

DVF integration within Mantis follows a progressive, staged recipe:

  • Stage 1: Multiple-Gap Vision Only
    • Data: SSV2 human manipulation videos (~220K).
    • Training: DVF head, [LAT], [GAP] queries; frozen backbone. Objective: $L_{DVF}$ across random gaps.
  • Stage 2: Vision + Action Joint
    • Data: DROID robot demonstrations (~76K episodes).
    • Training: Unfreeze action queries. Optimize $\alpha L_{DVF} + L_{action}$ (gap = action chunk size); backbone frozen.
  • Stage 3: Language Supervised Mix
    • Data: DROID + 38 image-text datasets (VQA, OCR, planning).
    • Training: Unfreeze backbone; optimize $\alpha L_{DVF} + L_{action} + \beta L_{lang}$.

Fine-tuning for the LIBERO benchmark uses vision and action objectives with $\alpha = 0.1$ to maximize downstream success rates.
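
One compact way to express this recipe is as a freezing-and-loss schedule, sketched below. Module and dataset keys are placeholders and the loss weights are left symbolic; only the staging logic mirrors the description above.

```python
import torch.nn as nn

# Sketch of the three-stage schedule as data; names are illustrative placeholders.
STAGES = [
    {   # Stage 1: multiple-gap, vision-only pretraining on human video (SSV2)
        "data": ["SSV2"],
        "train": ["dvf_head", "lat_queries", "gap_queries"],
        "losses": {"dvf": 1.0},
    },
    {   # Stage 2: vision + action joint training on robot demos (DROID)
        "data": ["DROID"],
        "train": ["dvf_head", "lat_queries", "action_head", "act_queries"],
        "losses": {"dvf": "alpha", "action": 1.0},
    },
    {   # Stage 3: add language supervision and unfreeze the backbone
        "data": ["DROID", "image-text (VQA/OCR/planning)"],
        "train": ["backbone", "dvf_head", "lat_queries", "action_head", "act_queries"],
        "losses": {"dvf": "alpha", "action": 1.0, "lang": "beta"},
    },
]

def set_trainable(model_parts, stage):
    """Freeze everything except the parts listed for the current stage."""
    for name, module in model_parts.items():
        for p in module.parameters():
            p.requires_grad = name in stage["train"]

# Toy usage with stand-in modules.
parts = {
    "backbone": nn.Linear(8, 8), "dvf_head": nn.Linear(8, 8),
    "action_head": nn.Linear(8, 8), "lat_queries": nn.Embedding(8, 8),
    "gap_queries": nn.Embedding(4, 8), "act_queries": nn.Embedding(8, 8),
}
set_trainable(parts, STAGES[0])   # Stage 1: only the DVF head and queries train
```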

6. Empirical Evaluation and Ablations

Mantis’s empirical performance demonstrates the effectiveness of DVF:

  • LIBERO Benchmark: 96.7% average success; 98.8% (Spatial), 99.2% (Object), 94.4% (Goal), 94.2% (Long). This surpasses both vision-augmented and non-vision-augmented baselines.
  • Convergence: Achieves rapid convergence comparable to non-vision-augmented models; end-to-end foresight alternatives require up to ten additional epochs to succeed.
  • Real-World Robotics: Consistently outperforms $\pi_{0.5}$ in instruction-following, generalization, and reasoning on Agilex platforms.
  • DVF Ablation Study: Removal or modification of DVF yields lower performance (no-DVF: 91.3%, flawed-DVF: 94.4%, vanilla-DVF: 95.7%, pretrained-DVF: 96.2%), highlighting the necessity of foresight, the value of residual connectivity, and the impact of extensive video pretraining.
  • Adaptive Temporal Ensemble (ATE): Selective activation based on dynamic patch overlap cuts inference costs ~50% with no decrease in success rate.
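
The precise ATE rule is not reproduced in this summary. As one plausible reading, the sketch below re-runs the policy only when the fraction of image patches that changed since the last model call exceeds a threshold, otherwise reusing the cached action chunk; the patch-change measure, thresholds, and function names are all assumptions for illustration.

```python
import torch

def patch_change_fraction(prev_frame, cur_frame, patch=16, tau=0.05):
    """Fraction of patches whose mean absolute change exceeds tau.
    Frames are (B, C, H, W) tensors with values in [0, 1]; sizes assumed."""
    diff = (cur_frame - prev_frame).abs()
    patches = diff.unfold(2, patch, patch).unfold(3, patch, patch)
    per_patch = patches.mean(dim=(1, 4, 5))        # (B, H/patch, W/patch)
    return (per_patch > tau).float().mean().item()

def maybe_predict(policy, prev_frame, cur_frame, cached_chunk, gate=0.2):
    """Re-run the policy only when enough patches changed; otherwise reuse the
    cached action chunk, which is where the inference savings come from."""
    if cached_chunk is None or patch_change_fraction(prev_frame, cur_frame) > gate:
        return policy(cur_frame)
    return cached_chunk
```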

A plausible implication is that residual-based DVF efficiently isolates dynamics from semantic context, leading to robust, generalizable action execution and efficient deployment in both simulation and physical settings.

7. Implementation Details and Reproduction Guidelines

Released code and weights are provided via https://github.com/zhijie-group/Mantis. Core implementation features include:

  • Backbone: Qwen2.5-VL (3.7B parameters).
  • DVF Head: Sana DiT (1.4B) with 12-layer connector.
  • Action Head: DiT-based policy (0.3B); 30 diffusion steps per action trajectory.
  • Training Tools: DeepSpeed and AdamW optimizer; image sizes 512×512, wrist camera 256×256.
  • Reproduction Steps: (i) Freeze backbone, (ii) attach a diffusion head with latent queries and residual connection, (iii) sequentially introduce action/language losses via staged optimization.
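
A minimal sketch of steps (i) and (ii), assuming a PyTorch-style setup with stand-in modules: freeze the backbone and build an AdamW optimizer over only the newly attached parameters. Names and hyperparameters are placeholders, not the released training configuration.

```python
import torch
import torch.nn as nn

backbone    = nn.Linear(256, 256)         # stands in for Qwen2.5-VL
dvf_head    = nn.Linear(256, 256)         # stands in for the Sana DiT head
action_head = nn.Linear(256, 7)           # stands in for the DiT policy head
lat_queries = nn.Parameter(torch.randn(8, 256))

for p in backbone.parameters():           # (i) freeze the backbone
    p.requires_grad = False

trainable = (list(dvf_head.parameters()) + list(action_head.parameters())
             + [lat_queries])             # (ii) attached heads + latent queries
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
# (iii) action and language losses are then introduced stage by stage (Section 5).
```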

In summary, Disentangled Visual Foresight refines dense frame forecasting from a costly auxiliary signal into concise latent-action cues. This structural disentanglement co-optimizes perception, language comprehension, and action planning, enabling state-of-the-art success in both simulated and real robotic manipulation, with high convergence speed and minimal computational overhead (Yang et al., 20 Nov 2025).
