V-JEPA-2-AC: Video World Modeling for Robotics
- The paper demonstrates a two-stage model that decouples a frozen vision encoder from an action-conditioned predictor to enable zero-shot robotic manipulation.
- The method leverages a Vision Transformer pretrained on web-scale data and employs a compact Transformer for latent dynamics prediction using L₁ reconstruction loss.
- Empirical results show robust performance in grasping, reaching, and pick-and-place tasks, outperforming diffusion-based models in efficiency and control.
V-JEPA-2-AC is a two-stage, self-supervised video world modeling and planning system that leverages large-scale video representation learning and action-conditioned latent dynamics for zero-shot robotic manipulation. It is grounded in the V-JEPA 2 architecture and applies a frozen vision backbone for extracting latent video states, followed by post-training a compact action-conditioned predictor on robot data. The resulting world model enables image-goal planning in unseen physical environments using only monocular vision, without explicit task supervision or reward signals (Assran et al., 11 Jun 2025).
1. Model Architecture
V-JEPA-2-AC employs a two-stage design:
- Stage 1: Encoder Pre-training
- Utilizes the action-free V-JEPA 2 encoder $E$; each RGB frame $x_t$ is transformed into a latent feature map $z_t = E(x_t)$.
- The encoder backbone is a Vision Transformer (ViT-g, up to $1$B parameters), pretrained on over $1$M hours of web-scale video and image data (VM22M), supporting clips of up to $64$ frames.
- Stage 2: Action-Conditioned Predictor
- Freezes $E$ and uses it to embed video frames during all subsequent learning and inference.
- Introduces a $300$M-parameter, $24$-layer Transformer predictor $P$ with block-causal attention, ingesting three token types per timestep:
- Patch features: $z_t = E(x_t)$
- Proprioceptive states: $s_t$ (Cartesian xyz, orientation, gripper)
- Actions: $a_t$
- Input tokens are projected into a $1024$-dimensional hidden space and concatenated as chronological sequences with causal masking.
The predictor outputs the next latent patch map $\hat{z}_{t+1}$, matching the spatial dimensions of $z_t$. The encoder weights are never modified during post-training; the architecture is strictly modular (Assran et al., 11 Jun 2025).
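To make the token layout concrete, the following is a minimal PyTorch-style sketch of how per-timestep proprioception, action, and patch tokens could be projected into the shared hidden space and passed through a block-causal Transformer. The module names, patch embedding width, head count, and feed-forward settings are illustrative assumptions, not details of the released implementation.

```python
# Minimal sketch (assumed interfaces, not the released implementation) of an
# action-conditioned predictor with block-causal attention over three token
# types per timestep: proprioceptive state, action, and encoder patch latents.
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    def __init__(self, patch_dim=1408, state_dim=7, action_dim=7,
                 hidden_dim=1024, num_layers=24, num_heads=16):
        super().__init__()
        # Project each token type into the shared 1024-dimensional space.
        self.patch_proj = nn.Linear(patch_dim, hidden_dim)
        self.state_proj = nn.Linear(state_dim, hidden_dim)
        self.action_proj = nn.Linear(action_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(hidden_dim, patch_dim)   # next-latent prediction

    def forward(self, z, s, a):
        # z: (B, T, N, patch_dim) frozen-encoder patch latents
        # s: (B, T, state_dim)    proprioceptive states
        # a: (B, T, action_dim)   actions (proprioceptive deltas)
        B, T, N, _ = z.shape
        tokens = torch.cat([
            self.state_proj(s).unsqueeze(2),    # (B, T, 1, D)
            self.action_proj(a).unsqueeze(2),   # (B, T, 1, D)
            self.patch_proj(z),                 # (B, T, N, D)
        ], dim=2).reshape(B, T * (N + 2), -1)
        # Block-causal mask: tokens attend within their own timestep and to
        # all earlier timesteps, never to future ones (True = masked).
        step = torch.arange(T * (N + 2), device=z.device) // (N + 2)
        mask = step[:, None] < step[None, :]
        out = self.backbone(tokens, mask=mask)
        out = out.reshape(B, T, N + 2, -1)[:, :, 2:, :]   # keep patch slots
        return self.head(out)                  # (B, T, N, patch_dim)
```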
2. Training Objectives and Loss Functions
All learning proceeds without task labels or environmental rewards, relying exclusively on reconstruction losses:
- Teacher-Forcing Loss ($\mathcal{L}_{\text{teacher}}$): with ground-truth encoder latents supplied at every step,
  $\mathcal{L}_{\text{teacher}} = \sum_{t} \big\| P(z_{1:t}, s_{1:t}, a_{1:t}) - z_{t+1} \big\|_{1}$
  (Eq 2 in (Assran et al., 11 Jun 2025))
- Rollout Loss ($\mathcal{L}_{\text{rollout}}$): the predictor is unrolled autoregressively on its own predictions $\hat{z}$, and the latent reached after $H$ steps is compared to the encoder latent,
  $\mathcal{L}_{\text{rollout}} = \big\| \hat{z}_{t+H} - z_{t+H} \big\|_{1}$
  (Eq 3)
- Total Loss:
  $\mathcal{L} = \mathcal{L}_{\text{teacher}} + \mathcal{L}_{\text{rollout}}$
  (Eq 4)
There is no explicit contrastive or adversarial regularization; all supervision derives from the L₁ reconstruction of encoder-derived latents (Assran et al., 11 Jun 2025). A plausible implication is that modeling focuses solely on physical state prediction fidelity in a latent space, without explicit classification or discrimination tasks.
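As a concrete illustration of these objectives, the sketch below computes the teacher-forcing and rollout L₁ losses, assuming a predictor with the interface from the earlier sketch (predictions at index $t$ estimate the latent at $t+1$); the helper names and rollout horizon are assumptions.

```python
# Hedged sketch of the two L1 objectives; `predictor` follows the interface
# assumed above and `z` are frozen-encoder latents. Teacher forcing conditions
# on ground-truth latents; the rollout feeds predictions back autoregressively.
import torch
import torch.nn.functional as F

def teacher_forcing_loss(predictor, z, s, a):
    # z: (B, T, N, D); predict z[:, 1:] from ground-truth prefixes.
    pred = predictor(z[:, :-1], s[:, :-1], a[:, :-1])   # (B, T-1, N, D)
    return F.l1_loss(pred, z[:, 1:])

def rollout_loss(predictor, z, s, a, horizon):
    # Start from the first ground-truth latent and repeatedly feed the
    # predictor its own output for `horizon` steps.
    z_hat = z[:, :1]                                     # (B, 1, N, D)
    for t in range(horizon):
        nxt = predictor(z_hat, s[:, : t + 1], a[:, : t + 1])[:, -1:]
        z_hat = torch.cat([z_hat, nxt], dim=1)
    return F.l1_loss(z_hat[:, -1], z[:, horizon])

def total_loss(predictor, z, s, a, horizon=2):
    return teacher_forcing_loss(predictor, z, s, a) + \
           rollout_loss(predictor, z, s, a, horizon)
```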
3. Latent World Model Formulation
- Latent State Extraction: For every video frame $x_t$, the frozen encoder produces a spatial map of latent embeddings $z_t = E(x_t)$ corresponding to visual content.
- Action and State Conditioning: Actions are computed as differences in proprioceptive state vectors across timesteps, $a_t = s_{t+1} - s_t$.
- Transition Modeling: The core latent dynamics are parameterized as $\hat{z}_{t+1} = P(z_{1:t},\, s_{1:t},\, a_{1:t})$.
- No additional regularizers modify the latent transition; only L₁ reconstruction loss applies (Assran et al., 11 Jun 2025).
This formulation yields a fully action-conditioned latent world model for video prediction and planning, with all policy-relevant inference conducted in a shared, pretrained visual latent space; this suggests considerable architectural efficiency.
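A small sketch, under the same assumed interfaces as above, of how actions are derived as proprioceptive deltas and how a single imagined latent transition would be taken; the encoder call signature is an assumption.

```python
# Sketch (assumed interfaces) of delta actions and one imagined latent
# transition, reusing the predictor interface from the earlier sketches.
import torch

def action_deltas(states):
    # states: (T, 7) proprioceptive vectors (xyz, orientation, gripper);
    # actions are consecutive differences, a_t = s_{t+1} - s_t.
    return states[1:] - states[:-1]

@torch.no_grad()
def imagine_step(encoder, predictor, frame, state, action):
    # Encode the current frame into a latent patch map, then predict the
    # next latent map conditioned on the proprioceptive state and action.
    z = encoder(frame.unsqueeze(0)).unsqueeze(1)   # (1, 1, N, D)
    s = state.view(1, 1, -1)                       # (1, 1, 7)
    a = action.view(1, 1, -1)                      # (1, 1, 7)
    return predictor(z, s, a)[:, -1]               # predicted next latent map
```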
4. Planning and Inference Mechanism
Image-goal planning proceeds in the latent space using an iterative trajectory optimization:
- Goal Encoding: The user specifies a goal image $x_g$, embedded as $z_g = E(x_g)$.
- Current State Encoding: The latest frame and proprioceptive state are mapped to $z_t = E(x_t)$ and $s_t$, respectively.
- Trajectory Optimization: The method seeks a $T$-step action sequence $a_{1:T}$ that minimizes the L₁ energy
  $E(a_{1:T}) = \big\| \hat{z}_{t+T}(a_{1:T}) - z_g \big\|_{1}$, where $\hat{z}_{t+T}(a_{1:T})$ is obtained by autoregressively rolling out the predictor $P$ from $(z_t, s_t)$ under the candidate actions.
  Solution: $a^{\star}_{1:T} = \arg\min_{a_{1:T}} E(a_{1:T})$.
This is optimized via the Cross-Entropy Method (CEM), which samples action sequences from Gaussian distributions, selects the highest-scoring trajectories, and refines the mean/variance over several iterations (typically 5–10).
- Execution: The robot executes only the first planned action $a^{\star}_1$ before replanning (receding-horizon / MPC control) (Assran et al., 11 Jun 2025).
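The following is a hedged sketch of CEM planning over latent rollouts under the interfaces assumed above; the rollout helper, the way proprioception is advanced by integrating delta actions, and the hyperparameter defaults are illustrative assumptions rather than the paper's released code.

```python
# Cross-Entropy-Method planning sketch over latent rollouts (assumed interfaces).
import torch

def rollout_latent(predictor, z_t, s_t, actions):
    # Roll the predictor forward under a candidate action sequence, starting
    # from the current latent map and proprioceptive state; proprioception is
    # advanced by integrating the delta actions (a simplifying assumption).
    z_seq = z_t.unsqueeze(0).unsqueeze(0)                 # (1, 1, N, D)
    s_seq = s_t.view(1, 1, -1)                            # (1, 1, 7)
    a_seq = torch.zeros(1, 0, actions.shape[-1])
    for a in actions:                                     # actions: (H, 7)
        a_seq = torch.cat([a_seq, a.view(1, 1, -1)], dim=1)
        z_next = predictor(z_seq, s_seq, a_seq)[:, -1:]   # predict next latent
        z_seq = torch.cat([z_seq, z_next], dim=1)
        s_seq = torch.cat([s_seq, s_seq[:, -1:] + a.view(1, 1, -1)], dim=1)
    return z_seq[0, -1]                                   # final predicted map

@torch.no_grad()
def cem_plan(predictor, z_t, s_t, z_goal, horizon=2, num_samples=800,
             elite_frac=0.1, iters=10, action_dim=7):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    n_elite = max(1, int(elite_frac * num_samples))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = mean + std * torch.randn(num_samples, horizon, action_dim)
        # Energy = L1 distance between the rolled-out latent and the goal latent.
        energies = torch.stack([
            (rollout_latent(predictor, z_t, s_t, a) - z_goal).abs().mean()
            for a in candidates
        ])
        elite = candidates[energies.topk(n_elite, largest=False).indices]
        # Refit the sampling distribution to the elite trajectories.
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-6
    return mean   # planned sequence; execute mean[0], then replan (MPC)
```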
A plausible implication is that planning is robust to changes in environment and camera pose, as all inference is grounded in frozen visual latents.
5. Data, Training Protocols, and Deployment
- Pretraining Data: Mask-denoising JEPA pretraining on VM22M (web-scale video + image corpus, over $1$M hours); encoders up to ViT-g scale, trained on clips of up to $64$ frames.
- Action-Conditioned Post-Training: Conducted on unlabeled Droid robot tele-operation video using $16$-frame clips, with no rewards or task labels; AdamW optimizer, $96$K total training steps.
- Deployment: Franka Panda arms (Robotiq gripper, monocular uncalibrated camera), deployed zero-shot in two physical environments (no in-lab training data), with low-level operational-space control executing the planned Cartesian end-effector deltas (Assran et al., 11 Jun 2025).
This approach demonstrates domain generalization, leveraging tele-operation data from Droid for policy transfer without retraining.
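As a sketch of how the pieces above would be tied together at deployment time, the loop below alternates planning and single-action execution in receding-horizon fashion; the `camera`/`robot` interfaces and the stopping heuristic are hypothetical.

```python
# Receding-horizon deployment loop sketch (assumed hardware interfaces).
import torch

@torch.no_grad()
def run_episode(encoder, predictor, camera, robot, goal_image,
                max_steps=50, tol=1e-2):
    # Plan in latent space, execute only the first action, then re-observe
    # and replan. `camera` and `robot` are assumed interfaces, not part of
    # the paper's released stack.
    z_goal = encoder(goal_image.unsqueeze(0))[0]          # goal latent map
    for _ in range(max_steps):
        frame = camera.read()
        state = robot.proprioception()                    # (7,) xyz, orient, grip
        z_t = encoder(frame.unsqueeze(0))[0]              # current latent map
        plan = cem_plan(predictor, z_t, state, z_goal)    # (H, 7) action plan
        robot.apply_cartesian_delta(plan[0])              # execute first action
        if (z_t - z_goal).abs().mean() < tol:             # crude stop heuristic
            break
```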
6. Empirical Evaluation and Comparative Performance
Performance metrics and ablations establish V-JEPA-2-AC's capabilities:
| Method | Reach | Grasp (Cup/Box) | Reach w/ Obj (Cup/Box) | Pick-and-Place (Cup/Box) |
|---|---|---|---|---|
| Octo (BC) | 100% | 15% / 0% | 15% / 70% | 15% / 10% |
| V-JEPA 2-AC | 100% | 65% / 25% | 75% / 75% | 80% / 65% |
Comparison with a diffusion-based world model (Cosmos):
- Cosmos: 80 samples, 10 refinements, 4 minutes/action (horizon=1), reach = 80%, manipulation = 0–20%
- V-JEPA 2-AC: 800 samples, 16 seconds/action, reach = 100%, manipulation = 60–80%
Ablations confirm the necessity of action-conditioning: unconditioned variants fail to infer control. Performance degrades roughly linearly with camera-azimuth error, but this can be addressed via a simple linear calibration (Assran et al., 11 Jun 2025). A plausible implication is that latent-goal planning supports a broad spectrum of robot tasks without task-specific reward engineering or environment-specific retraining.
7. Context, Limitations, and Significance
V-JEPA-2-AC illustrates scalable video-based world modeling with the following properties:
- Large-scale joint embedding from internet videos yields transferable latent representations for action-conditioned prediction.
- The separation of vision (frozen encoder) and dynamics (learned predictor) modularizes generalization and control.
- Zero-shot deployment in unseen labs, sans reward signals or task labels, achieves high success rates (65–80%) across grasp, reach-with-object, and pick-and-place tasks (Assran et al., 11 Jun 2025).
- Comparative study with diffusion-based world models demonstrates substantial improvements in efficiency and manipulation performance.
A plausible implication is that V-JEPA-2-AC's architecture and training protocol can generalize to more diverse environments and tasks by scaling video encoder pretraining and action-conditioned post-training, without revisiting reward-driven RL or in-situ data collection.
V-JEPA-2-AC, through its large-scale, decoupled training and latent planning formalism, represents a significant stage in the unification of self-supervised video representation learning with zero-shot robotic world modeling and control (Assran et al., 11 Jun 2025).