
V-JEPA-2-AC: Video World Modeling for Robotics

Updated 1 January 2026
  • The paper demonstrates a two-stage model that decouples a frozen vision encoder from an action-conditioned predictor to enable zero-shot robotic manipulation.
  • The method leverages a Vision Transformer pretrained on web-scale data and employs a compact Transformer for latent dynamics prediction using L₁ reconstruction loss.
  • Empirical results show robust performance in grasping, reaching, and pick-and-place tasks, outperforming diffusion-based models in efficiency and control.

V-JEPA-2-AC is a two-stage, self-supervised video world modeling and planning system that leverages large-scale video representation learning and action-conditioned latent dynamics for zero-shot robotic manipulation. It builds on the V-JEPA 2 architecture: a frozen vision backbone extracts latent video states, and a compact action-conditioned predictor is then post-trained on robot data. The resulting world model enables image-goal planning in unseen physical environments using only monocular vision, without explicit task supervision or reward signals (Assran et al., 11 Jun 2025).

1. Model Architecture

V-JEPA-2-AC employs a bifurcated design:

  • Stage 1: Encoder Pre-training
    • Utilizes the action-free V-JEPA 2 encoder $E_\theta$; each RGB frame $x_t \in \mathbb{R}^{H_0 \times W_0 \times 3}$ is transformed into a latent map $z_t = E_\theta(x_t) \in \mathbb{R}^{H \times W \times D}$.
    • The encoder backbone is a Vision Transformer (ViT-g, up to 1B parameters), pretrained on over 1M hours of web-scale video and image data (VM22M), supporting up to 64-frame clips at $384 \times 384$ spatial resolution.
  • Stage 2: Action-Conditioned Predictor
    • Freezes $E_\theta$ and uses it to embed video frames during all subsequent learning and inference.
    • Introduces a 300M-parameter, 24-layer Transformer $P_\phi$ with block-causal attention, ingesting three token types per timestep:
      • Patch features: $z_t$
      • Proprioceptive states: $s_t \in \mathbb{R}^7$ (Cartesian xyz, orientation, gripper)
      • Actions: $a_t = s_{t+1} - s_t \in \mathbb{R}^7$
    • Input tokens are projected into a 1024-dimensional hidden space and concatenated as a chronological sequence with causal masking.

The predictor outputs the next latent patch map $\hat{z}_{t+1}$, matching the spatial dimensions of $z_{t+1}$. The encoder weights are never modified during post-training; the architecture is strictly modular (Assran et al., 11 Jun 2025).
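A minimal PyTorch-style sketch of this frozen-encoder/predictor interface is given below. The module names, feature width, attention-head count, and patch-token layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

D_LATENT, D_HIDDEN = 1408, 1024   # assumed ViT-g feature width; 1024-d predictor space

class ActionConditionedPredictor(nn.Module):
    """Sketch: frozen V-JEPA 2 encoder E_theta feeding a block-causal predictor P_phi."""

    def __init__(self, frozen_encoder: nn.Module, num_layers: int = 24):
        super().__init__()
        self.encoder = frozen_encoder.eval()           # E_theta, never updated
        for p in self.encoder.parameters():
            p.requires_grad_(False)

        # Per-timestep token types projected into the shared hidden space.
        self.patch_proj = nn.Linear(D_LATENT, D_HIDDEN)
        self.state_proj = nn.Linear(7, D_HIDDEN)       # proprioceptive state s_t
        self.action_proj = nn.Linear(7, D_HIDDEN)      # action a_t = s_{t+1} - s_t

        layer = nn.TransformerEncoderLayer(D_HIDDEN, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(D_HIDDEN, D_LATENT)      # regresses the next latent patch map

    def forward(self, frames, states, actions, causal_mask):
        # frames: (B, T, C, H, W); states, actions: (B, T, 7)
        with torch.no_grad():
            z = self.encoder(frames)                   # (B, T, N_patches, D_LATENT), assumed shape
        tokens = torch.cat(
            [self.patch_proj(z),
             self.state_proj(states).unsqueeze(2),
             self.action_proj(actions).unsqueeze(2)],
            dim=2,
        ).flatten(1, 2)                                # chronological (patch, state, action) stream
        h = self.backbone(tokens, mask=causal_mask)    # caller supplies the block-causal mask
        return self.head(h)                            # predicted \hat{z}_{t+1} tokens
```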

2. Training Objectives and Loss Functions

All learning proceeds without task labels or environmental rewards, relying exclusively on reconstruction losses:

  • Teacher-Forcing ($\mathcal{L}_{\rm tf}$):

$$\hat{z}_{k+1} = P_\phi\left((z_t, s_t, a_t)_{t \leq k}\right)$$

$$\mathcal{L}_{\rm tf}(\phi) = \frac{1}{T} \sum_{k=1}^{T} \left\|\hat{z}_{k+1} - z_{k+1}\right\|_1$$

(Eq 2 in (Assran et al., 11 Jun 2025))

  • Rollout Loss ($\mathcal{L}_{\rm ro}$):

$$\hat{z}_{t+1}^{(2)} = P_\phi\left(a_{1:t}, s_1, z_1\right), \quad (t = 2)$$

$$\mathcal{L}_{\rm ro}(\phi) = \left\|\hat{z}_3 - z_3\right\|_1$$

(Eq 3)

  • Total Loss:

$$\mathcal{L}(\phi) = \mathcal{L}_{\rm tf}(\phi) + \mathcal{L}_{\rm ro}(\phi)$$

(Eq 4)

There is no explicit contrastive or adversarial regularization; all supervision derives from the L₁ reconstruction of encoder-derived latents (Assran et al., 11 Jun 2025). A plausible implication is that modeling focuses solely on physical state prediction fidelity in a latent space, without explicit classification or discrimination tasks.
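Both objectives reduce to plain L1 regressions in latent space. Below is a minimal sketch assuming a hypothetical `predictor.step(z, s, a)` convenience method that maps precomputed latents, states, and actions to next-step latents; the mean-reduced L1 and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(predictor, z, s, a):
    """L_tf (Eq. 2): predict z_{k+1} from ground-truth (z, s, a) up to step k."""
    # z: (B, T, N, D) encoder latents; s, a: (B, T, 7)
    z_hat = predictor.step(z[:, :-1], s[:, :-1], a[:, :-1])   # hypothetical one-step API
    return F.l1_loss(z_hat, z[:, 1:])                         # mean-reduced ||z_hat - z||_1

def rollout_loss(predictor, z, s, a):
    """L_ro (Eq. 3): roll two steps from (z_1, s_1) using only actions, supervise z_3."""
    z2_hat = predictor.step(z[:, :1], s[:, :1], a[:, :1])     # predicted z_2
    z3_hat = predictor.step(z2_hat, s[:, :1], a[:, 1:2])      # predicted z_3 from own output
    return F.l1_loss(z3_hat, z[:, 2:3])

def total_loss(predictor, z, s, a):
    """Eq. 4: plain sum of the two reconstruction terms; no contrastive or adversarial terms."""
    return teacher_forcing_loss(predictor, z, s, a) + rollout_loss(predictor, z, s, a)
```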

3. Latent World Model Formulation

  • Latent State Extraction: For every video frame, $z_t = E_\theta(x_t)$ produces a spatial map of latent embeddings corresponding to visual content.
  • Action and State Conditioning: Actions $a_t$ are computed as differences in proprioceptive state vectors across timesteps, $a_t = s_{t+1} - s_t$.
  • Transition Modeling: The core latent dynamics are parameterized as:

$$z_{t+1} = f_\phi(z_{1:t}, a_{1:t}, s_{1:t}) \approx P_\phi\left((z_t, s_t, a_t)_{t \leq T}\right)$$

This formalism yields an end-to-end, action-conditioned latent space for video world modeling and planning: all policy-relevant inference is conducted in a shared, pretrained visual latent space, which suggests architectural efficiency.
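Under the same assumed `step` interface, the transition model can be unrolled autoregressively in latent space by feeding predictions back as context, which is the operation the planner in the next section relies on. This is an illustrative sketch, not the paper's exact rollout code.

```python
import torch

def rollout(predictor, z0, s0, actions):
    """Unroll z_{t+1} = P_phi((z, s, a)_{<=t}) for a candidate action sequence.

    z0: (B, N, D) latent of the current frame; s0: (B, 7); actions: (B, T, 7).
    Returns the predicted latent patch map after T steps.
    """
    context = [z0.unsqueeze(1)]                                 # running latent context (B, t, N, D)
    for t in range(actions.shape[1]):
        ctx = torch.cat(context, dim=1)
        states = s0.unsqueeze(1).expand(-1, ctx.shape[1], -1)   # only the initial state is observed
        z_next = predictor.step(ctx, states, actions[:, : t + 1])[:, -1:]
        context.append(z_next)                                  # feed the prediction back in
    return context[-1].squeeze(1)
```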

4. Planning and Inference Mechanism

Image-goal planning proceeds in the latent space using an iterative trajectory optimization:

  • Goal Encoding: The user specifies a goal image $x_g$, embedded as $z_g = E_\theta(x_g)$.
  • Current State Encoding: The latest frame $x_k$ is embedded as $z_k = E_\theta(x_k)$ and paired with the current proprioceptive state $s_k$.
  • Trajectory Optimization: The method seeks a $T$-step action sequence $a_{1:T}$ that minimizes

$$\mathcal{E}(a_{1:T}; z_k, s_k, z_g) = \left\|P_\phi(a_{1:T}; s_k, z_k) - z_g\right\|_1$$

Solution:

$$a^*_{1:T} = \arg\min_{a_{1:T}} \mathcal{E}(a_{1:T}; z_k, s_k, z_g)$$

This is optimized via the Cross-Entropy Method (CEM), which samples action sequences from Gaussian distributions, selects the highest-scoring trajectories, and refines the mean/variance over several iterations (typically 5–10).
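A compact sketch of this CEM loop is shown below, reusing the hypothetical `rollout` helper from Section 3. The sample count echoes the 800-sample figure quoted in Section 6 and the iteration budget sits in the 5–10 range noted above; the elite fraction and initial action variance are illustrative assumptions.

```python
import torch

def plan_cem(predictor, encoder, x_k, s_k, x_goal, horizon=10,
             n_samples=800, n_elite=40, n_iters=10):
    """Image-goal planning: minimize ||P_phi(a_{1:T}; s_k, z_k) - z_g||_1 over action sequences."""
    with torch.no_grad():
        z_k = encoder(x_k)                                     # current latent, (1, N, D)
        z_g = encoder(x_goal)                                  # goal latent,    (1, N, D)

        mean = torch.zeros(horizon, 7)                         # Gaussian over delta-Cartesian actions
        std = 0.1 * torch.ones(horizon, 7)                     # assumed initial spread

        for _ in range(n_iters):
            actions = mean + std * torch.randn(n_samples, horizon, 7)
            z_pred = rollout(predictor,
                             z_k.expand(n_samples, -1, -1),
                             s_k.expand(n_samples, -1),
                             actions)
            # Energy: L1 distance between the rolled-out latent and the goal latent.
            energy = (z_pred - z_g).abs().flatten(1).mean(dim=1)
            elite = actions[energy.topk(n_elite, largest=False).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0)    # refit the sampling distribution

    return mean                                                # planned action sequence a*_{1:T}
```

In a receding-horizon deployment, only the first action of the returned sequence would typically be executed before re-encoding the new frame and replanning.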

A plausible implication is that planning is robust to changes in environment and camera pose, as all inference is grounded in frozen visual latents.

5. Data, Training Protocols, and Deployment

  • Pretraining Data: Mask-denoising JEPA on VM22M (web-scale video+image corpus, over 1M hours); up to ViT-g scale, using 64-frame clips at $384 \times 384$ resolution.
  • Action-Conditioned Post-Training: Conducted with ~62 hours of unlabeled Droid robot video, 16-frame clips at $256 \times 256$ resolution, no rewards or task labels, AdamW optimizer, 96K total steps.
  • Deployment: Franka Panda arms (Robotiq gripper, monocular uncalibrated camera), zero-shot in two physical environments (no in-lab training), low-level operational-space control via planned $\Delta$-Cartesian actions (Assran et al., 11 Jun 2025).

This approach demonstrates domain generalization, leveraging tele-operation data from Droid for policy transfer without retraining.
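For reference, the post-training recipe above can be condensed into a configuration sketch; the values restate figures quoted in this section, and hyperparameters the text does not specify (learning rate, batch size, schedule) are deliberately omitted.

```python
# Action-conditioned post-training setup as described above; fields the source
# does not quote (learning rate, batch size, schedules) are intentionally absent.
post_training_config = {
    "encoder": "ViT-g V-JEPA 2 backbone (frozen)",
    "predictor": {"params": "300M", "layers": 24, "hidden_dim": 1024},
    "data": "~62 h of unlabeled Droid robot video",
    "clip_length_frames": 16,
    "resolution": (256, 256),
    "optimizer": "AdamW",
    "total_steps": 96_000,
    "supervision": "L1 latent reconstruction only (no rewards, no task labels)",
}
```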

6. Empirical Evaluation and Comparative Performance

Performance metrics and ablations establish V-JEPA-2-AC's capabilities:

| Method | Reach | Grasp (Cup/Box) | Reach w/ Obj (Cup/Box) | Pick-and-Place (Cup/Box) |
|---|---|---|---|---|
| Octo (BC) | 100% | 15% / 0% | 15% / 70% | 15% / 10% |
| V-JEPA 2-AC | 100% | 65% / 25% | 75% / 75% | 80% / 65% |

Comparison with a latent-diffusion world model (Cosmos):

  • Cosmos: 80 samples, 10 refinements, 4 minutes/action (horizon=1), reach = 80%, manipulation = 0–20%
  • V-JEPA 2-AC: 800 samples, 16 seconds/action, reach = 100%, manipulation = 60–80%

Ablations affirm the necessity of action-conditioning; unconditioned variants fail to infer control. Camera-pose dependence is linear in azimuth error, but can be addressed via linear calibration (Assran et al., 11 Jun 2025). A plausible implication is that latent-goal planning supports a broad spectrum of robot tasks without task-specific reward engineering or environment-specific retraining.

7. Context, Limitations, and Significance

V-JEPA-2-AC illustrates scalable, self-supervised video world modeling with the following properties:

  • Large-scale joint embedding from internet videos yields transferable latent representations for action-conditioned prediction.
  • The separation of vision (frozen encoder) and dynamics (learned predictor) modularizes generalization and control.
  • Zero-shot deployment in unseen labs, sans reward signals or task labels, achieves high success rates (65–80%) across grasp, reach-with-object, and pick-and-place tasks (Assran et al., 11 Jun 2025).
  • Comparative study with diffusion-based world models shows substantial improvements in planning efficiency and manipulation performance.

A plausible implication is that V-JEPA-2-AC's architecture and training protocol can generalize to more diverse environments and tasks by scaling video encoder pretraining and action-conditioned post-training, without revisiting reward-driven RL or in-situ data collection.

V-JEPA-2-AC, through its large-scale, decoupled training and latent planning formalism, represents a significant stage in the unification of self-supervised video representation learning with zero-shot robotic world modeling and control (Assran et al., 11 Jun 2025).
