V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2506.09985v1)

Published 11 Jun 2025 in cs.AI, cs.CV, cs.LG, and cs.RO

Abstract: A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a LLM, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

Summary

  • The paper introduces a two-stage framework that pre-trains on large-scale video data and fine-tunes with minimal robot interaction to learn predictive world models.
  • It leverages a Vision Transformer with a Joint-Embedding Predictive Architecture to capture spatiotemporal dynamics while avoiding costly pixel-level predictions.
  • The approach achieves state-of-the-art performance in motion understanding, action anticipation, and zero-shot planning, enabling efficient real-world robot control.

V-JEPA 2 (2506.09985) is a self-supervised video model that demonstrates the potential of learning world models from large-scale observational data (internet videos) combined with a small amount of interaction data (robot trajectories) to enable understanding, prediction, and planning capabilities. The core idea is to use a Joint-Embedding Predictive Architecture (JEPA) to learn robust representations and dynamics models in a learned latent space, avoiding computationally expensive pixel-level prediction common in generative models.

The approach involves a two-stage training procedure:

  1. Action-Free Pre-training (V-JEPA 2): A large-scale video encoder is pre-trained on over 1 million hours of internet video and 1 million images using a mask-denoising feature prediction objective. The model learns to predict representations of masked video segments in a learned embedding space.
  2. Action-Conditioned Post-training (V-JEPA 2-AC): The pre-trained V-JEPA 2 encoder is frozen, and a smaller action-conditioned predictor network is trained on top of its representations using a small amount of unlabeled robot interaction data. This model learns to predict future state representations given past states, actions, and proprioceptive information.

This staged approach allows the model to first learn general visual understanding and dynamics from diverse, web-scale observational data, and then specialize this knowledge for goal-conditioned planning by training on limited interactive experience.

V-JEPA 2 Pre-training: Scaling Self-Supervised Video Learning

The first stage focuses on training the V-JEPA 2 encoder, typically a Vision Transformer (ViT). The objective is to minimize the L1 distance between the output of a predictor network (which sees a masked view of the video and learnable mask tokens) and the representation of the full video from a target network (an exponential moving average of the encoder). This encourages the model to learn predictive features in a compact latent space.
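As a concrete illustration, here is a minimal PyTorch-style sketch of this mask-denoising objective, assuming hypothetical `encoder`, `predictor`, and `target_encoder` callables and a precomputed token mask; the EMA update of the target network and the multi-block masking strategy are omitted, so this is a sketch of the loss structure rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def vjepa_pretrain_loss(encoder, predictor, target_encoder, tokens, mask):
    """Mask-denoising feature prediction loss (illustrative sketch).

    tokens: (B, N, D_in) patchified tubelet tokens for one clip
    mask:   (B, N) boolean, True at positions hidden from the encoder;
            assumes the same number of masked tokens in every sample.
    """
    B, N, D_in = tokens.shape

    # Targets come from an exponential-moving-average copy of the encoder
    # applied to the full, unmasked clip; no gradients flow through it.
    with torch.no_grad():
        target = target_encoder(tokens)                    # (B, N, D)

    # The online encoder only sees the visible tokens.
    visible = tokens[~mask].view(B, -1, D_in)              # (B, N_visible, D_in)
    context = encoder(visible)                             # (B, N_visible, D)

    # The predictor fills in representations at the masked positions from the
    # context plus learnable mask tokens (assumed to be handled internally).
    pred = predictor(context, mask)                        # (B, N_masked, D)

    # L1 distance between predicted and target features at masked positions.
    target_masked = target[mask].view(B, -1, target.size(-1))
    return F.l1_loss(pred, target_masked)
```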

Key ingredients for scaling this pre-training effectively were identified:

  • Data Scaling: Increasing the dataset size from 2 million to 22 million videos (VideoMix22M) by combining public sources like SSv2, Kinetics, HowTo100M, YT-Temporal-1B (YT1B), and ImageNet. A retrieval-based data curation strategy for YT1B was crucial to filter noisy content and improve performance.
  • Model Scaling: Scaling the encoder architecture from 300 million (ViT-L) to 1 billion (ViT-g) parameters. The predictor architecture was kept smaller and fixed.
  • Longer Training: Using a warmup-constant-decay learning rate schedule enabled training for up to 252,000 iterations, effectively leveraging the larger dataset.
  • Higher Resolution and Longer Duration: Employing a progressive resolution training strategy where training starts with shorter, lower-resolution clips (16 frames at 256×256) and increases resolution and duration (up to 64 frames at 384×384) only during the final cooldown phase. This drastically reduces computational cost compared to training at full resolution throughout (up to an 8.4x speedup) while still yielding performance benefits (a minimal schedule sketch follows this list).
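The warmup-constant-decay learning rate schedule and late high-resolution cooldown described above could look roughly like the sketch below; apart from the 252,000 total iterations stated in the text, the phase boundaries, learning rates, and clip sizes are illustrative assumptions, not the paper's exact hyperparameters:

```python
def train_schedule(step, total_steps=252_000, warmup_steps=12_000):
    """Warmup-constant-decay learning rate with a late high-resolution
    cooldown (illustrative sketch; numbers are assumptions unless stated
    in the accompanying text)."""
    peak_lr, final_lr = 1e-3, 1e-6           # assumed values
    cooldown_start = int(0.9 * total_steps)  # assumed phase boundary

    if step < warmup_steps:                  # linear warmup
        lr = peak_lr * step / warmup_steps
    elif step < cooldown_start:              # long constant phase
        lr = peak_lr
    else:                                    # linear decay during cooldown
        frac = (step - cooldown_start) / (total_steps - cooldown_start)
        lr = peak_lr + frac * (final_lr - peak_lr)

    # Progressive resolution: short low-resolution clips for most of training,
    # longer higher-resolution clips only in the final cooldown phase.
    if step < cooldown_start:
        num_frames, resolution = 16, 256
    else:
        num_frames, resolution = 64, 384
    return lr, num_frames, resolution
```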

The V-JEPA 2 encoder uses a 3D extension of Rotary Position Embedding (RoPE) for encoding spatiotemporal position, which helped stabilize training for larger models. The input videos are patchified into tubelets before being processed by the transformer encoder.
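For concreteness, a minimal sketch of tubelet patchification is shown below; the 2×16×16 tubelet size and tensor layout are assumptions for illustration and not necessarily the paper's exact configuration:

```python
import torch

def patchify_tubelets(video, tubelet_t=2, patch_h=16, patch_w=16):
    """Split a video into spatiotemporal tubelet tokens (illustrative sketch).

    video: (B, C, T, H, W) tensor, e.g. (B, 3, 64, 384, 384)
    returns: (B, N, tubelet_t * patch_h * patch_w * C) flattened tokens
    """
    B, C, T, H, W = video.shape
    assert T % tubelet_t == 0 and H % patch_h == 0 and W % patch_w == 0
    x = video.reshape(B, C,
                      T // tubelet_t, tubelet_t,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w)
    # Group the (t, h, w) grid positions, then flatten each tubelet's pixels.
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)    # (B, nT, nH, nW, tt, ph, pw, C)
    return x.flatten(1, 3).flatten(2)        # (B, N, tubelet_dim)
```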

V-JEPA 2-AC: Learning an Action-Conditioned World Model

In the second stage, the frozen V-JEPA 2 encoder processes individual frames from robot interaction videos (Droid dataset, < 62 hours of unlabeled data) to produce a sequence of feature maps. An action-conditioned predictor network (a ~300M-parameter transformer with block-causal attention) takes these feature maps, along with corresponding end-effector states and computed actions (changes in end-effector state), and predicts the representation of the next video frame.

The training objective for V-JEPA 2-AC combines a teacher-forcing loss (predicting the ground truth next frame representation given previous ground truth inputs) and a two-step rollout loss (predicting the representation two steps ahead by feeding the predictor's output back as input). Both losses minimize the L1 distance between predicted and target representations.

  • Inputs: Sequence of V-JEPA 2 feature maps, 7D end-effector states, and 7D action vectors.
  • Architecture: Transformer network processing interleaved visual, state, and action tokens. Uses 3D-RoPE for visual tokens and temporal RoPE for action/state tokens.
  • Objective: L1 loss on predicted next-frame representations: $\mathcal{L}(\phi) \coloneqq \sum_{k=1}^{T-1} \lVert P_\phi((a_t, s_t, E(x_t))_{t \leq k}) - E(x_{k+1}) \rVert_1 + \lVert P_\phi(a_{1:2}, s_1, E(x_1)) - E(x_3) \rVert_1$, with $T = 15$ for the teacher-forcing term and $T = 2$ for the rollout term in practice.
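A schematic sketch of this combined objective is given below, assuming hypothetical `frozen_encoder` and `predictor` callables and batched tensors for frames, states, and actions; it illustrates the loss structure rather than the released training code:

```python
import torch
import torch.nn.functional as F

def vjepa2_ac_loss(frozen_encoder, predictor, frames, states, actions):
    """Teacher-forcing plus two-step rollout loss (illustrative sketch).

    frames:  (B, T, C, H, W) robot video frames
    states:  (B, T, 7) end-effector states
    actions: (B, T, 7) actions (changes in end-effector state)
    """
    # The frozen encoder embeds each frame independently; no gradients needed.
    with torch.no_grad():
        z = torch.stack([frozen_encoder(frames[:, t])
                         for t in range(frames.size(1))], dim=1)   # (B, T, N, D)

    T = z.size(1)

    # Teacher forcing: predict z_{k+1} from ground-truth (a_t, s_t, z_t), t <= k.
    tf_loss = sum(
        F.l1_loss(predictor(actions[:, :k + 1], states[:, :k + 1], z[:, :k + 1]),
                  z[:, k + 1])
        for k in range(T - 1))

    # Two-step rollout: P(a_{1:2}, s_1, z_1) targets z_3; the predictor is
    # assumed to feed its own one-step prediction back in internally.
    rollout_loss = F.l1_loss(predictor(actions[:, :2], states[:, :1], z[:, :1]),
                             z[:, 2])

    return tf_loss + rollout_loss
```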

Planning: Zero-shot Robot Control

V-JEPA 2-AC is used for robot control via Model Predictive Control (MPC). Given a goal image and the robot's current observation and state, the model plans a sequence of actions over a fixed time horizon $T$. This is done by minimizing a goal-conditioned energy function $\mathcal{E}(\hat{a}_{1:T};\ z_k, s_k, z_g) \coloneqq \lVert P(\hat{a}_{1:T}; s_k, z_k) - z_g \rVert_1$, where $z_k$ and $z_g$ are the V-JEPA 2 representations of the current and goal images, respectively. The minimization seeks an action sequence $\hat{a}_{1:T}$ whose predicted future state representation $P(\hat{a}_{1:T}; s_k, z_k)$ is close to the goal representation $z_g$ in the learned latent space.

  • Goal Specification: Visual goals (image of the desired final state), and optionally sub-goals for complex tasks.
  • Planning Algorithm: The Cross-Entropy Method (CEM) is used to optimize the sequence of actions (a minimal sketch follows this list). The first action of the optimized sequence is executed, and the process repeats in a receding-horizon control loop.
  • Deployment: Zero-shot on Franka arms in new environments not present in the Droid dataset, using monocular RGB camera input.
  • Tasks: Single-goal reaching, grasping, reach with object, and pick-and-place.
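A minimal Cross-Entropy Method loop over this energy function might look like the sketch below; the predictor interface, population size, elite count, and iteration budget are illustrative assumptions, not the paper's settings:

```python
import torch

def cem_plan(predictor, z_k, s_k, z_goal, horizon=2, action_dim=7,
             pop=256, n_elite=32, iters=10):
    """Minimize E(a_{1:T}) = || P(a_{1:T}; s_k, z_k) - z_g ||_1 with the
    Cross-Entropy Method (illustrative sketch).

    z_k:    (1, N, D) V-JEPA 2 feature map of the current frame
    s_k:    (1, 7)    current end-effector state
    z_goal: (1, N, D) feature map of the goal image
    """
    mu = torch.zeros(horizon, action_dim)
    sigma = torch.ones(horizon, action_dim)

    for _ in range(iters):
        # Sample a population of candidate action sequences.
        candidates = mu + sigma * torch.randn(pop, horizon, action_dim)

        # Energy: L1 distance between the predicted final latent and the goal.
        with torch.no_grad():
            z_pred = predictor(candidates,
                               s_k.expand(pop, -1),
                               z_k.expand(pop, -1, -1))
            energy = (z_pred - z_goal).abs().flatten(1).mean(dim=1)

        # Refit the sampling distribution to the lowest-energy candidates.
        elite = candidates[energy.topk(n_elite, largest=False).indices]
        mu, sigma = elite.mean(dim=0), elite.std(dim=0) + 1e-6

    # Receding-horizon control: execute mu[0], observe, and replan.
    return mu
```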

Performance and Applications:

The paper demonstrates V-JEPA 2's capabilities across understanding, prediction, and planning:

  • Understanding (Probe-based Classification): V-JEPA 2 excels at encoding fine-grained motion information, achieving state-of-the-art performance on motion understanding tasks (e.g., 77.3% top-1 accuracy on Something-Something v2) while being competitive on appearance understanding tasks compared to other self-supervised and vision-language pre-trained encoders. Scaling model size and input resolution consistently improved performance.
  • Understanding (Video Question-Answering): By aligning the V-JEPA 2 encoder with an LLM backbone (Qwen2-7B-Instruct, Llama 3.1 8B Instruct) using visual instruction tuning on a large dataset of image- and video-text pairs (up to 88.5M samples), V-JEPA 2 achieves state-of-the-art performance among 8B-parameter models on multiple VidQA benchmarks, including PerceptionTest (84.0), MVP (44.5), TempCompass (76.9), TemporalBench (36.7), and TOMATO (40.3). Notably, an encoder pre-trained without language supervision can achieve SOTA when aligned with sufficient data, challenging prior assumptions. Scaling the encoder and input resolution also improved VidQA performance.
  • Prediction (Action Anticipation): V-JEPA 2 achieves state-of-the-art performance on the Epic-Kitchens-100 human action anticipation task (39.7 recall-at-5), significantly outperforming previous task-specific and VL models. Performance scaled linearly with model size and benefited from higher input resolution.
  • Planning (Robot Control): V-JEPA 2-AC enabled successful zero-shot prehensile manipulation tasks (grasp, reach with object, pick-and-place) on real robots in new environments using image goals. Despite being trained on only < 62 hours of unlabeled data, it achieved higher success rates on these tasks (e.g., 80% average on the pick-and-place cup task) than baselines such as Octo (a vision-language-action behavior cloning model) and Cosmos (a video generation model used for planning), while requiring significantly less planning time per action (16 seconds vs. 4 minutes).

Implementation Considerations and Limitations:

  • Computational Requirements: Training V-JEPA 2 at scale requires significant compute resources (e.g., GPU-years). The progressive resolution training helps mitigate this. Planning with V-JEPA 2-AC using CEM is computationally less intensive than planning with generative models like Cosmos, but still requires seconds per action on a single GPU.
  • Data Curation: The performance of V-JEPA 2 pre-training benefits from curated large-scale video data, suggesting that sheer data volume is not enough; data quality and distribution matter.
  • Camera Sensitivity: V-JEPA 2-AC implicitly infers the action coordinate axis from monocular camera input. Without explicit calibration, it shows sensitivity to camera position, which can affect planning accuracy. A potential future direction is unsupervised online calibration.
  • Long-Horizon Planning: Autoregressive prediction in latent space can suffer from error accumulation, limiting the effective planning horizon. The search space for actions also grows exponentially with horizon. This necessitates using sub-goals for complex long-horizon tasks like pick-and-place in the current setup.
  • Goal Modality: The current planning setup relies on image goals. Extending this to language-based goals is an important future step.
  • Scaling Potential: While scaling to 1B parameters showed benefits, further scaling of vision encoders to larger sizes is a promising direction for future work.

In summary, V-JEPA 2 provides a strong foundation for building versatile AI agents by leveraging self-supervised learning on vast amounts of video data. The learned representations capture rich spatio-temporal information useful for various understanding and prediction tasks, and when combined with minimal interaction data, can power effective model-based planning for physical world tasks like robot manipulation. The paper demonstrates the feasibility of this approach and sets a new bar for zero-shot robot control capabilities derived from large-scale self-supervised pre-training.
