VideoVLA: Unified Multi-Modal Robotics
- VideoVLA is a unified, multi-modal framework that combines vision, language, and action forecasting to enable generalizable robotic manipulation.
- It employs a multi-modal diffusion transformer that jointly predicts future video latents and continuous action vectors to synthesize policies.
- Experimental results show enhanced performance in novel object and skill generalization by leveraging visual imagination alongside action planning.
VideoVLA refers to a unified, multi-modal framework designed to turn large-scale video generative models into generalizable robotic manipulation policies by tightly integrating vision, language, and action forecasting within a single architecture. The approach treats visual imagination as a basis for policy synthesis: a pre-trained video diffusion model is adapted to jointly forecast both future actions and their visual consequences, enabling robust generalization across novel objects, skills, and robot embodiments in manipulation tasks. VideoVLA thereby establishes a new paradigm for robot learning centered on multi-modal sequence modeling and joint predictive planning (Shen et al., 7 Dec 2025).
1. Model Architecture: Multi-Modal Diffusion Transformer
VideoVLA is centered on a multi-modal Diffusion Transformer (DiT) backbone, initialized from a large text-to-video DiT generator (CogVideoX-5B). The architecture consists of:
- Encoders: Instructions are encoded by a T5 text encoder into language token embeddings; current observations (video frames) are encoded by a 3D-causal VAE encoder into spatio-temporal latent maps.
- Transformer Backbone: A stack of transformer blocks denoises the noisy future video latents and action vectors, conditioned on the language embeddings and the current observation latent, with joint attention over all modalities.
- Output Heads: The model outputs both clean future video latents (which are decoded back to images for visualization) and a chunk of clean robot actions.
The input sequence concatenates language tokens, current visual latent, noised future latents, and noised actions prior to transformer processing. No discrete tokenization is applied to actions; they remain in their native continuous 7D real-valued format (3D rotation, 3D translation, binary gripper open/close).
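The sequence assembly can be pictured with the minimal sketch below; the tensor shapes, projection layers, and the `MultiModalSequenceBuilder` name are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal sketch of assembling the concatenated multi-modal token sequence for the
# DiT backbone. Shapes and module names are assumptions made for illustration.
import torch
import torch.nn as nn

class MultiModalSequenceBuilder(nn.Module):
    def __init__(self, d_model=1024, text_dim=4096, latent_dim=16, action_dim=7):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)     # T5 embeddings -> DiT width
        self.latent_proj = nn.Linear(latent_dim, d_model) # VAE latent channels -> DiT width
        self.action_proj = nn.Linear(action_dim, d_model) # continuous 7-D actions -> DiT width

    def forward(self, text_emb, cur_latent, noisy_future_latent, noisy_actions):
        # text_emb:            (B, L_text, text_dim)   language token embeddings
        # cur_latent:          (B, L_cur, latent_dim)  flattened current-frame VAE latents
        # noisy_future_latent: (B, L_fut, latent_dim)  noised future video latents
        # noisy_actions:       (B, K, action_dim)      noised action chunk (K steps)
        tokens = torch.cat([
            self.text_proj(text_emb),
            self.latent_proj(cur_latent),
            self.latent_proj(noisy_future_latent),
            self.action_proj(noisy_actions),
        ], dim=1)          # (B, L_text + L_cur + L_fut + K, d_model)
        return tokens      # fed to the transformer blocks for joint attention
```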
2. Diffusion-Based Joint Forecasting and Training Protocol
VideoVLA employs a diffusion process on both video latents and action vectors:
- Forward Process: Gaussian noise is incrementally added to the future video latents and action vectors over $T$ diffusion timesteps, $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)$, where $x$ denotes either modality and $\beta_t$ is the noise schedule.
- Reverse Process: The transformer learns to estimate the injected noise and reconstruct the clean data from its noisy version, modeling the reverse transitions $p_\theta(x_{t-1} \mid x_t)$.
- Objective: The loss is the sum of the noise-prediction errors on both modalities, $\mathcal{L} = \mathbb{E}_{t,\epsilon}\big[\|\hat{\epsilon}_v - \epsilon_v\|_2^2 + \|\hat{\epsilon}_a - \epsilon_a\|_2^2\big]$, where $\epsilon_v, \epsilon_a$ are the noises added to the video latents and actions and $\hat{\epsilon}_v, \hat{\epsilon}_a$ are the transformer's predictions.
Training is staged: 100,000 pre-training steps on Open X-Embodiment (OXE; 22.5M frames, 1M trajectories, 60 robot types) followed by 15,000 fine-tuning steps on a real-robot teleoperation dataset. No additional data augmentation is applied; robustness emerges from the video-model pre-training.
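A schematic training step consistent with this objective is sketched below, assuming a DDPM-style noise schedule and a `model(...)` interface invented for illustration; the paper's exact schedule and conditioning details may differ.

```python
# Schematic joint-denoising training step (DDPM-style). The noise schedule,
# `model` signature, and tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_diffusion_step(model, text_emb, cur_latent, future_latent, actions,
                         alphas_cumprod, optimizer):
    # future_latent: (B, F, C, H, W) clean future video latents
    # actions:       (B, K, 7)       clean action chunk
    B = future_latent.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=future_latent.device)
    a_bar = alphas_cumprod[t]              # (B,) cumulative noise-schedule product
    a_bar_v = a_bar.view(B, 1, 1, 1, 1)
    a_bar_a = a_bar.view(B, 1, 1)

    # Forward process: add Gaussian noise to both modalities at timestep t.
    eps_v, eps_a = torch.randn_like(future_latent), torch.randn_like(actions)
    noisy_latent = a_bar_v.sqrt() * future_latent + (1 - a_bar_v).sqrt() * eps_v
    noisy_actions = a_bar_a.sqrt() * actions + (1 - a_bar_a).sqrt() * eps_a

    # Reverse process: the transformer predicts the noise on both modalities,
    # conditioned on language tokens and the current observation latent.
    pred_eps_v, pred_eps_a = model(text_emb, cur_latent, noisy_latent, noisy_actions, t)

    # Objective: sum of per-modality noise-prediction errors.
    loss = F.mse_loss(pred_eps_v, eps_v) + F.mse_loss(pred_eps_a, eps_a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```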
3. Action Signal Representation and Execution Cycle
Robot actions in VideoVLA are represented as continuous real-valued vectors and embedded into the transformer’s feature space:
- Embedding: Each action is linearly projected to the transformer’s latent space with appended positional, time-step, and diffusion-timestep encodings.
- Prediction Cycle: The transformer attends jointly to language, current vision, future vision, and future action embeddings; at the final timestep, predicted action latents are mapped back via a linear head to robot commands.
- Execution: At run time, only the first few actions of each predicted 6-step chunk are executed before re-planning, balancing reactivity against prediction horizon (see the sketch below).
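The execution cycle can be summarized by the following receding-horizon sketch, in which the `policy` and `robot` interfaces and the number of executed steps per cycle are hypothetical placeholders.

```python
# Receding-horizon execution sketch. The `policy` and `robot` interfaces and
# `exec_steps` are hypothetical; the 6-step chunk length follows the text above.
def control_loop(policy, robot, instruction, chunk_len=6, exec_steps=2, max_cycles=100):
    for _ in range(max_cycles):
        obs = robot.get_camera_frame()                       # current RGB observation
        # One denoising pass yields an action chunk plus the imagined future frames.
        action_chunk, imagined_frames = policy.predict(instruction, obs)  # (chunk_len, 7)
        # Execute only the first few actions, then re-plan from fresh observations
        # to balance reactivity against the prediction horizon.
        for action in action_chunk[:exec_steps]:
            rot, trans, grip = action[:3], action[3:6], action[6]  # 7-D format from Sec. 1
            robot.apply_action(rotation=rot, translation=trans, gripper_open=grip > 0.5)
        if robot.task_done():
            return True
    return False
```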
4. Evaluation: Generalization Across Objects, Skills, and Embodiments
Experimental results across simulation (SIMPLER) and real-world settings demonstrate superior generalization:
| Task (success rate) | VideoVLA | CogACT | π0 | OpenVLA | SpatialVLA |
|---|---|---|---|---|---|
| In-domain (SIMPLER VM, Google) | 80.4% | 75.2% | 53.5% | 34.3% | - |
| Novel Objects (SIMPLER) | 65.2% | 42.4% | - | - | 50.8% |
| Novel Skills (SIMPLER) | 48.6% | 20.4% | - | - | 18.9% |
| Real-world In-Domain (Realman) | 64.6% | 58.4% | 50.7% | - | - |
| Real-world Novel Objects | 50.6% | 26.9% | 21.8% | - | - |
| Real-world New Skills | 58.0% | 35.1% | 28.5% | - | - |
Imagined video outcomes correlate with action success, as validated both by automated foreground-keypoint similarity and by human assessment of semantic and physical plausibility (84.0% for novel objects, 63.4% for new skills).
5. Visual Imagination: Mechanism and Reliability in Manipulation
The joint prediction of actions and their future video consequences creates an implicit reliability metric:
- Imagination–Execution Correlation: The similarity between predicted future frames and the frames actually observed during execution, assessed via SIFT+SAM keypoint tracking, is a robust predictor of manipulation success (a simplified scoring sketch appears at the end of this section).
- Dual Prediction Paradigm: Ablations show that removing video forecasting, i.e., denoising actions alone, drastically reduces generalization (>50% performance drop). Requiring visual imagination enforces consistency between planned actions and environmental outcomes.
This suggests that generative visual imagination is not only a planning tool but also a confidence measure for the manipulation decision cycle, supporting transfer to unseen objects and cross-embodiment skill imitation.
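As a rough stand-in for this reliability metric, the sketch below scores imagined-versus-executed frame agreement with SIFT keypoint matching alone; the paper's metric additionally uses SAM foreground masks, and the ratio-test threshold here is an assumed value.

```python
# Simplified imagination-execution similarity score using SIFT keypoint matching.
# VideoVLA's reported metric combines SIFT with SAM foreground masks; this stand-in
# omits the masks and uses a hand-picked Lowe ratio threshold (assumption).
import cv2
import numpy as np

def imagination_confidence(imagined_frame: np.ndarray, executed_frame: np.ndarray) -> float:
    """Return the fraction of imagined-frame keypoints matched in the executed frame."""
    sift = cv2.SIFT_create()
    gray_pred = cv2.cvtColor(imagined_frame, cv2.COLOR_BGR2GRAY)
    gray_exec = cv2.cvtColor(executed_frame, cv2.COLOR_BGR2GRAY)
    kp_pred, des_pred = sift.detectAndCompute(gray_pred, None)
    kp_exec, des_exec = sift.detectAndCompute(gray_exec, None)
    if des_pred is None or des_exec is None or len(kp_pred) == 0:
        return 0.0
    matches = cv2.BFMatcher().knnMatch(des_pred, des_exec, k=2)
    # Lowe's ratio test keeps only distinctive correspondences.
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    # Higher values indicate the imagined rollout was visually realized during execution.
    return len(good) / len(kp_pred)
```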
6. Broader Implications and Directions
VideoVLA demonstrates that video diffusion models can extend beyond generative tasks into closed-loop policy synthesis, bringing the visual and causal grounding of large video models to robot learning. The framework applies to varied robotic platforms and tasks in both simulation and the real world, and it points toward extensions requiring longer-horizon reasoning, richer sensory modalities, and explicit reality-grounding strategies (Shen et al., 7 Dec 2025). Integration with reinforcement learning (policy gradients, actor-critic) and scaling to larger, more diverse corpora remain important future avenues.
7. Limitations and Open Challenges
- Reality Gap: Human evaluation underscores persistent difficulty in fully grounding imagined futures.
- Data Diversity: Pre-trained video models provide robustness, but performance remains constrained by the domain coverage of current OXE data.
- Scaling and Modalities: Further work is needed to incorporate new sensory inputs (depth, force), hierarchical planning mechanisms, and adaptive horizon predictions.
By positioning joint visual–action forecasting at the center of manipulation, VideoVLA provides an extensible foundation for open-world robotics and sheds light on the critical link between imagination and generalizable control.