V-JEPA 2: Self-Supervised Video Learning
- The paper introduces V-JEPA 2, a self-supervised architecture that predicts masked spatio-temporal regions to enable robust video representation and zero-shot robotic planning.
- V-JEPA 2 is a video representation learning framework utilizing 3D tubelet tokenization, multi-block masking, and transformer encoders to achieve strong motion understanding and cross-modal alignment.
- The model employs a dual forward pass with EMA teacher-student training, yielding state-of-the-art metrics on benchmarks like Something-Something v2, Diving-48, and various robotic planning tasks.
V-JEPA 2 is a self-supervised video representation learning architecture based on the Joint-Embedding Predictive Architecture (JEPA) paradigm, with extensions enabling scalable training, strong downstream motion understanding, cross-modal alignment, and zero-shot robotic planning. The architecture is designed to learn generalizable latent representations by predicting masked spatio-temporal regions in latent space, using large-scale internet video data and a limited amount of robot interaction data. The V-JEPA 2 scheme introduces modifications over prior JEPA and V-JEPA frameworks, culminating in state-of-the-art results for video understanding and planning benchmarks, especially when combined with LLMs and action-conditioned policy heads (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).
1. Foundational Principles and Architectural Design
V-JEPA 2 builds upon the JEPA approach, where a video input is decomposed into spatio-temporal patches ("tubelets"), a subset of which are masked out, and a transformer-based encoder $E_\theta$ computes representations for the visible patches. A small transformer "predictor" $P_\phi$, conditioned on mask tokens marking the masked positions, is trained to predict the embeddings of the masked patches. V-JEPA 2 extends this principle through several technical innovations:
- 3D Tubelet Tokenization: Each video clip is divided into tubelets spanning 2 frames × 16 × 16 pixels (2×16×16).
- 3D-RoPE Positional Encoding: 3D rotary positional embeddings are applied over the temporal and spatial axes to enable efficient sequence modeling.
- Transformer Encoder: Large ViT-style backbone architectures (ViT-L, -H, -g up to 1B parameters) are employed, supporting deep scaling.
- Predictor: For action-free pre-training, a compact transformer (ViT-S-like) acts as the predictor (12 layers, width 384, ≈22M parameters). For action-conditioned planning (V-JEPA 2-AC), a larger (24-layer, width 1024, ≈300M parameters) predictor with block-causal attention is used.
- Masking Scheme: "Multi-block" spatio-temporal masking with short- and long-range blocks results in an overall masking ratio approaching 90%, with spatial scale $0.15$–$0.7$ and a temporal mask ratio of ≈1.0. Mask-block aspect ratios are sampled from a fixed range.
An exponential moving average (EMA) copy of the student encoder serves as the teacher, with identical architecture. The predictor maps the student’s visible patch embeddings (plus mask tokens for occluded positions) to the teacher’s target latent space.
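As a concrete illustration, the following minimal PyTorch sketch shows how 2×16×16 tubelet tokenization and multi-block spatial masking could be expressed; the module and function names (`TubeletEmbed`, `multiblock_mask`) and the block-sampling details are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Patchify a clip into 2x16x16 tubelets via a strided 3D convolution."""
    def __init__(self, embed_dim=1024, in_chans=3, t=2, p=16):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, video):                # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, D, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

def multiblock_mask(grid_t, grid_h, grid_w, n_blocks=8, scale=(0.15, 0.7)):
    """Union of rectangular spatial blocks extended across the whole clip
    (temporal ratio ~1.0); with several blocks the overall ratio nears 90%."""
    mask = torch.zeros(grid_t, grid_h, grid_w, dtype=torch.bool)
    for _ in range(n_blocks):
        s = torch.empty(1).uniform_(*scale).item()     # fraction of spatial area
        bh = max(1, int(grid_h * s ** 0.5))
        bw = max(1, int(grid_w * s ** 0.5))
        top = torch.randint(0, grid_h - bh + 1, (1,)).item()
        left = torch.randint(0, grid_w - bw + 1, (1,)).item()
        mask[:, top:top + bh, left:left + bw] = True   # mask spans all time steps
    return mask.flatten()                              # (N,) bool, True = masked
```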
2. Self-Supervised Objectives and Training Procedures
V-JEPA 2 employs a self-supervised masked-latent prediction loss tailored to large-scale video data. For a video input $x$, with $x_v$ denoting visible patches and $x_m$ denoting masked patches, the objective is:

$$\mathcal{L}(\theta, \phi) = \left\lVert P_\phi\big(E_\theta(x_v), \Delta_m\big) - \mathrm{sg}\big(E_{\bar\theta}(x)\big)_m \right\rVert_1$$

where $\Delta_m$ encodes the spatio-temporal positions of masked tubelets and $\mathrm{sg}(\cdot)$ blocks gradient propagation through the teacher pathway to prevent collapse. The EMA teacher is updated at each step via:

$$\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta$$
No explicit contrastive, variance, or redundancy reduction regularizers are used; collapse is prevented by the combination of EMA and stop-gradient. This process requires two forward passes per batch (student + teacher), with dynamic coupling between the two via the EMA update.
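A schematic training step consistent with these definitions is sketched below; tensor shapes, the predictor call signature, and the exact regression form are assumptions for illustration rather than the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    # theta_bar <- tau * theta_bar + (1 - tau) * theta
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def train_step(encoder, predictor, teacher, opt, tokens, mask, tau=0.999):
    # tokens: (B, N, D) tubelet embeddings; mask: (N,) bool, True = masked
    # teacher: detached copy of the student encoder, updated only via EMA
    z_vis = encoder(tokens[:, ~mask])        # student sees visible tokens only
    z_pred = predictor(z_vis, mask)          # predict latents at masked positions

    with torch.no_grad():                    # stop-gradient through the teacher
        z_tgt = teacher(tokens)[:, mask]     # teacher targets at masked positions

    loss = F.l1_loss(z_pred, z_tgt)          # masked-latent regression
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, encoder, tau)        # slow teacher update (second pathway)
    return loss.item()
```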
Action-conditioned extension (V-JEPA 2-AC): Post-pretraining, a separate predictor is learned on robot data with the frozen V-JEPA 2 encoder. Losses include:
- Teacher-forcing Loss: Predicts the next latent state from the current latent, the action, and proprioception, averaged over the predicted steps.
- Two-step Rollout Loss: Enforces model rollout consistency over multiple prediction steps.
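A hedged sketch of these two objectives is given below, assuming a predictor callable `pred(z, action, proprio)` that maps the current latent plus action and proprioception to the next latent; the names and shapes are illustrative.

```python
import torch.nn.functional as F

def teacher_forcing_loss(pred, latents, actions, proprio):
    # latents: (B, T+1, N, D) frozen V-JEPA 2 features of consecutive timesteps
    steps = latents.shape[1] - 1
    loss = 0.0
    for t in range(steps):
        z_hat = pred(latents[:, t], actions[:, t], proprio[:, t])
        loss = loss + F.l1_loss(z_hat, latents[:, t + 1])
    return loss / steps                       # average over predicted steps

def rollout_loss(pred, latents, actions, proprio, steps=2):
    # feed the predictor its own output to enforce multi-step consistency
    z = latents[:, 0]
    loss = 0.0
    for t in range(steps):
        z = pred(z, actions[:, t], proprio[:, t])
        loss = loss + F.l1_loss(z, latents[:, t + 1])
    return loss / steps
```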
3. Data Regime, Training Scaling, and Optimizations
V-JEPA 2 is pre-trained on VideoMix22M (VM22M), comprising approximately 1 million hours of video and 1 million images:
- Datasets: Something-Something v2, Kinetics-400/600/700, HowTo100M, YT-Temporal-1B (curated), plus ImageNet as static videos.
- Sampling Ratios: Empirically set, e.g., ImageNet 25%, HowTo100M 31%.
- Augmentations: Random temporal cropping (4 fps), random-resize-crop with aspect-ratio jitter, tubelet patchification, multi-block masking.
- Progressive Resolution Schedule: Training proceeds from 16 frames @ 256×256 pixels to 64 frames @ 384×384, achieving an 8× speedup.
- Scaling Impact: Each of the following increases yields +1 to +1.5% average downstream accuracy: (1) dataset size up to 22M, (2) model up to 1B parameters, (3) longer training (252K steps), (4) higher resolution/longer clips. Evaluating with longer clips (64 vs. 16 frames) yields +9.7% average accuracy.
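For intuition, the token count implied by 2×16×16 tubelet tokenization grows steeply with clip length and resolution, which is what makes the progressive schedule pay off; the back-of-the-envelope arithmetic below is illustrative only.

```python
def tokens_per_clip(frames, height, width, t=2, p=16):
    """Number of tubelet tokens under 2x16x16 tokenization."""
    return (frames // t) * (height // p) * (width // p)

short = tokens_per_clip(16, 256, 256)   # 8 * 16 * 16  = 2,048 tokens
long_ = tokens_per_clip(64, 384, 384)   # 32 * 24 * 24 = 18,432 tokens
print(short, long_, long_ / short)      # the long setting has 9x more tokens per clip
```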
4. Downstream Results: Video Understanding, QA, and Robotic Planning
V-JEPA 2 achieves state-of-the-art results in several domains:
- Motion Understanding (Frozen Probe): 77.3% top-1 on Something-Something v2 (ViT-g), 90.2% on Diving-48, and an average of 88.2% across six tasks, exceeding the prior InternVideo2-1B by 7.6 pp.
- Human Action Anticipation: Epic-Kitchens-100 recall@5 (action): 39.7% (+44% over PlausiVL).
- Video Question Answering (with LLM alignment): PerceptionTest (8B model) 84.0% vs. 82.7% (PerceptionLM 8B); MVP paired accuracy 44.5% vs. 39.7%; TempCompass 76.9% vs. 72.7%; improvements extend to longest-horizon and compositional temporal QA.
- Zero-Shot Robotic Planning (V-JEPA 2-AC): After 62h of Droid robot video, V-JEPA 2-AC achieves average 80% pick-and-place success, 75% reach-with-object, and 65% grasp on cup/box (vs. Octo: 15% grasp, Cosmos: 0–30%), deployed in two independent labs with no task-specific data or reward.
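The zero-shot planning results above come from searching over candidate action sequences so that the action-conditioned predictor's imagined latent matches the encoding of a goal image. A minimal cross-entropy-method sketch of that idea follows; the function names, energy definition, and hyperparameters are illustrative assumptions.

```python
import torch

@torch.no_grad()
def plan_action(encoder, pred, obs, proprio, goal_img,
                horizon=1, pop=256, elites=32, iters=5, act_dim=7):
    """Cross-entropy-method search: sample action sequences, roll them out in
    latent space, keep those whose predicted latent is closest to the goal."""
    z0, z_goal = encoder(obs), encoder(goal_img)       # (1, N, D) each
    mu = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        acts = mu + std * torch.randn(pop, horizon, act_dim)
        z = z0.expand(pop, *z0.shape[1:]).clone()
        for t in range(horizon):
            z = pred(z, acts[:, t], proprio)           # imagined next latent
        energy = (z - z_goal).abs().flatten(1).mean(dim=1)
        idx = energy.topk(elites, largest=False).indices
        mu, std = acts[idx].mean(0), acts[idx].std(0)  # refit Gaussian to elites
    return mu[0]                                       # execute the first action (MPC)
```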
Performance Summary (selected benchmarks):
| Method | Params | Pretrain FLOPs | SSv2 (%) | K400 (%) |
|---|---|---|---|---|
| V-JEPA 2 ViT-L | 300M | 1.9 | 73.7 | 85.1 |
| V-JEPA 2 ViT-H | 600M | 3.5 | 74.0 | 85.3 |
| V-JEPA 2 ViT-g | 1B | 5.3 | 75.3 | 86.6 |
| SALT ViT-L | 300M | 1.2 | 74.9 | 85.4 |
| SALT ViT-H | 600M | 1.5 | 75.4 | 86.0 |
| SALT ViT-g | 1B | 1.9 | 76.2 | 86.8 |
V-JEPA 2’s frozen-backbone evaluation follows the probing/best-pool strategy on standard datasets. Ablations reveal data and model scale, progressive resolution, and YT1B curation as key contributors to accuracy.
5. LLM Alignment and Multi-Modal Extensions
V-JEPA 2 supports large multimodal alignment for video QA tasks:
- Vision-Language Fusion: V-JEPA 2 encoders produce per-patch embeddings, projected to LLM token-space via MLP or cross-attention.
- LLM Backbone: Multi-stage training using Qwen2-7B, Llama3-8B, or similar.
- Training Regime: Image captioning, large-scale image QA, followed by video captioning and QA; standard cross-entropy loss.
- Parameterization: At the 8B model class, best results are achieved with stagewise fusion and optional vision-encoder fine-tuning.
This fusion mechanism enables compositional temporal understanding and competitive (often best) accuracy on PerceptionTest, MVP, TempCompass, TOMATO, TVBench, and MVBench benchmarks.
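A minimal sketch of the MLP projection path is shown below (the cross-attention variant is analogous); the class name and dimensions are assumptions chosen for illustration.

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Project frozen V-JEPA 2 per-patch embeddings into the LLM token-embedding space."""
    def __init__(self, vis_dim=1408, llm_dim=4096, hidden=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim))

    def forward(self, patch_emb):            # (B, N_patches, vis_dim)
        return self.mlp(patch_emb)           # (B, N_patches, llm_dim) visual "tokens"

# the projected visual tokens are concatenated with the text token embeddings and
# the combined sequence is trained with the standard next-token cross-entropy loss
```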
6. Comparative Analysis: Architectural Trade-offs and Limitations
A detailed comparison with the SALT recipe (Li et al., 29 Sep 2025) underscores several aspects:
- Compute Efficiency: V-JEPA 2 requires dual forward passes per batch (student and EMA teacher), inflating compute versus single-pass frozen-teacher regimes (SALT). Including initial teacher pretraining, SALT achieves 20–30% lower FLOPs for equivalent student scale.
- Accuracy–Compute Pareto: SALT strictly dominates V-JEPA 2 across the accuracy–FLOPs Pareto frontier at all scales (e.g., +2.3 pp average accuracy for ViT-L at matched compute).
- Architectural Coupling: EMA ties student and teacher, restricting independent tuning. In contrast, SALT’s decoupled frozen-teacher architecture enables optimal compute allocation and supports scaling up the student with small, potentially sub-optimal teachers.
- Training Stability and Model Selection: V-JEPA 2's training loss is uncorrelated with downstream accuracy, so model selection relies on proxies such as RankMe, LiDAR, and α-ReQ (see the RankMe sketch after this list), alongside brittle hyperparameter tuning (e.g., careful EMA momentum scheduling and stop-gradient placement). SALT exhibits a direct correlation between pretraining loss and frozen-backbone probing accuracy, improving transparency and stability.
- Collapse Prevention: V-JEPA 2 prevents zero-loss trivial collapse by restricting teacher gradient flow and using slow EMA updates, whereas SALT achieves collapse resistance via a static, frozen teacher and simple loss design.
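Because the pretraining loss is not predictive of downstream quality, spectrum-based proxies are computed on frozen-encoder features instead; a minimal sketch of the RankMe effective-rank estimate is shown below (the function name and feature-matrix shape are illustrative assumptions).

```python
import torch

def rankme(features, eps=1e-7):
    """RankMe: exponential of the entropy of the normalized singular-value spectrum.
    features: (N, D) matrix of frozen-encoder embeddings on held-out clips."""
    s = torch.linalg.svdvals(features)        # singular values
    p = s / (s.sum() + eps) + eps             # normalized spectrum
    return torch.exp(-(p * p.log()).sum())    # effective rank in [1, min(N, D)]
```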
7. Limitations, Open Questions, and Future Directions
V-JEPA 2 and its extensions exhibit several known limitations:
- Camera Pose Sensitivity: In robotic planning, the model’s implicit action-axis inference leads to errors that vary linearly with camera angle; potential remedies include unsupervised "calibration" at deployment.
- Short Horizon Rollouts: Planning is limited to short-horizon (T=1) due to error compounding and MPC search complexity, motivating development of hierarchical abstractions or longer-horizon planning modules.
- Goal Specification: The planning framework currently relies on image goals; future work suggests extending to language-conditioned planning via LLM integration.
- Scaling Ceiling: V-JEPA 2 is presently limited to the 1B-parameter scale; further improvements are anticipated with models exceeding 10B parameters.
- Dynamic EMA Complexity: EMA-based self-distillation entails additional engineering and training complexity, as described above.
This suggests that future advancements may focus on deeper scaling, more robust fusion with LLMs for language-guided tasks, adaptation to varied camera geometries, and the development of hierarchical models for long-horizon control.
V-JEPA 2 constitutes a high-capacity, versatile, and empirically benchmarked approach for learning video representations applicable to understanding, anticipation, multimodal reasoning, and robotic planning, with its performance and efficiency trade-offs now sharply contextualized by static-teacher alternatives such as SALT (Li et al., 29 Sep 2025, Assran et al., 11 Jun 2025).