V-JEPA 2: Self-Supervised Video Learning
- The paper introduces V-JEPA 2, a self-supervised architecture that predicts masked spatio-temporal regions to enable robust video representation and zero-shot robotic planning.
- V-JEPA 2 is a video representation learning framework utilizing 3D tubelet tokenization, multi-block masking, and transformer encoders to achieve strong motion understanding and cross-modal alignment.
- The model employs a dual forward pass with EMA teacher-student training, yielding state-of-the-art metrics on benchmarks like Something-Something v2, Diving-48, and various robotic planning tasks.
V-JEPA 2 is a self-supervised video representation learning architecture based on the Joint-Embedding Predictive Architecture (JEPA) paradigm, with extensions enabling scalable training, strong downstream motion understanding, cross-modal alignment, and zero-shot robotic planning. The architecture is designed to learn generalizable latent representations by predicting masked spatio-temporal regions in latent space, using large-scale internet video data and a limited amount of robot interaction data. The V-JEPA 2 scheme introduces modifications over prior JEPA and V-JEPA frameworks, culminating in state-of-the-art results for video understanding and planning benchmarks, especially when combined with LLMs and action-conditioned policy heads (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).
1. Foundational Principles and Architectural Design
V-JEPA 2 builds upon the JEPA approach, where a video input is decomposed into spatio-temporal patches ("tubelets"), a subset of which are masked out, and a transformer-based encoder $E_\theta$ computes representations for the visible patches. A small transformer "predictor" $P_\phi$, conditioned on mask tokens marking the masked positions, is trained to predict the embeddings of the masked patches. V-JEPA 2 extends this principle through several technical innovations:
- 3D Tubelet Tokenization: Each video clip is divided into tubelets spanning 2 frames × 16 × 16 pixels (2×16×16).
- 3D-RoPE Positional Encoding: 3D rotary positional embeddings are applied over the temporal and spatial axes to enable efficient sequence modeling.
- Transformer Encoder: Large ViT-style backbone architectures (ViT-L, -H, -g up to 1B parameters) are employed, supporting deep scaling.
- Predictor: For action-free pre-training, a compact transformer (ViT-S-like) acts as the predictor (12 layers, width 384, ≈22M parameters). For action-conditioned planning (V-JEPA 2-AC), a larger (24-layer, width 1024, ≈300M parameters) predictor with block-causal attention is used.
- Masking Scheme: "Multi-block" spatio-temporal masking with short- and long-range blocks results in an overall masking ratio approaching 90%, with spatial scale $0.15$–$0.7$ and a temporal mask ratio of ≈1.0. Mask-block aspect ratios are sampled from a fixed range.
An exponential moving average (EMA) copy of the student encoder serves as the teacher, with identical architecture. The predictor maps the student’s visible patch embeddings (plus mask tokens for occluded positions) to the teacher’s target latent space.
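As a concrete illustration, the following minimal PyTorch sketch shows how 2×16×16 tubelet tokenization and multi-block spatial masking could be expressed; the module and function names (`TubeletEmbed`, `multiblock_mask`) and the block-sampling details are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Patchify a clip into 2x16x16 tubelets via a strided 3D convolution."""
    def __init__(self, embed_dim=1024, in_chans=3, t=2, p=16):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, video):                # video: (B, C, T, H, W)
        x = self.proj(video)                 # (B, D, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

def multiblock_mask(grid_t, grid_h, grid_w, n_blocks=8, scale=(0.15, 0.7)):
    """Union of rectangular spatial blocks extended across the whole clip
    (temporal ratio ~1.0); with several blocks the overall ratio nears 90%."""
    mask = torch.zeros(grid_t, grid_h, grid_w, dtype=torch.bool)
    for _ in range(n_blocks):
        s = torch.empty(1).uniform_(*scale).item()     # fraction of spatial area
        bh = max(1, int(grid_h * s ** 0.5))
        bw = max(1, int(grid_w * s ** 0.5))
        top = torch.randint(0, grid_h - bh + 1, (1,)).item()
        left = torch.randint(0, grid_w - bw + 1, (1,)).item()
        mask[:, top:top + bh, left:left + bw] = True   # mask spans all time steps
    return mask.flatten()                              # (N,) bool, True = masked
```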
2. Self-Supervised Objectives and Training Procedures
V-JEPA 2 employs a self-supervised masked-latent prediction loss tailored to large-scale video data. For a video input $x$, with $x_v$ denoting visible patches and $x_m$ denoting masked patches, the objective is:

$$\mathcal{L}(\theta, \phi) = \left\lVert P_\phi\big(E_\theta(x_v), \Delta_m\big) - \mathrm{sg}\big(E_{\bar\theta}(x)\big)_m \right\rVert_1$$

where $\Delta_m$ encodes the spatio-temporal positions of masked tubelets and $\mathrm{sg}(\cdot)$ blocks gradient propagation through the teacher pathway to prevent collapse. The EMA teacher is updated at each step via:

$$\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta$$
No explicit contrastive, variance, or redundancy reduction regularizers are used; collapse is prevented by the combination of EMA and stop-gradient. This process requires two forward passes per batch (student + teacher), with dynamic coupling between the two via the EMA update.
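A schematic training step consistent with these definitions is sketched below; tensor shapes, the predictor call signature, and the exact regression form are assumptions for illustration rather than the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    # theta_bar <- tau * theta_bar + (1 - tau) * theta
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1.0 - tau)

def train_step(encoder, predictor, teacher, opt, tokens, mask, tau=0.999):
    # tokens: (B, N, D) tubelet embeddings; mask: (N,) bool, True = masked
    # teacher: detached copy of the student encoder, updated only via EMA
    z_vis = encoder(tokens[:, ~mask])        # student sees visible tokens only
    z_pred = predictor(z_vis, mask)          # predict latents at masked positions

    with torch.no_grad():                    # stop-gradient through the teacher
        z_tgt = teacher(tokens)[:, mask]     # teacher targets at masked positions

    loss = F.l1_loss(z_pred, z_tgt)          # masked-latent regression
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, encoder, tau)        # slow teacher update (second pathway)
    return loss.item()
```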
Action-conditioned extension (V-JEPA 2-AC): Post-pretraining, a separate predictor is learned on robot data with the frozen V-JEPA 2 encoder. Losses include:
- Teacher-forcing Loss: Predicts the next latent state from the current latent, the action, and proprioception, averaged over the predicted steps.
- Two-step Rollout Loss: Enforces model rollout consistency over multiple prediction steps.
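A hedged sketch of these two objectives is given below, assuming a predictor callable `pred(z, action, proprio)` that maps the current latent plus action and proprioception to the next latent; the names and shapes are illustrative.

```python
import torch.nn.functional as F

def teacher_forcing_loss(pred, latents, actions, proprio):
    # latents: (B, T+1, N, D) frozen V-JEPA 2 features of consecutive timesteps
    steps = latents.shape[1] - 1
    loss = 0.0
    for t in range(steps):
        z_hat = pred(latents[:, t], actions[:, t], proprio[:, t])
        loss = loss + F.l1_loss(z_hat, latents[:, t + 1])
    return loss / steps                       # average over predicted steps

def rollout_loss(pred, latents, actions, proprio, steps=2):
    # feed the predictor its own output to enforce multi-step consistency
    z = latents[:, 0]
    loss = 0.0
    for t in range(steps):
        z = pred(z, actions[:, t], proprio[:, t])
        loss = loss + F.l1_loss(z, latents[:, t + 1])
    return loss / steps
```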
3. Data Regime, Training Scaling, and Optimizations
V-JEPA 2 is pre-trained on VideoMix22M (VM22M), comprising approximately 1 million hours of video and 1 million images:
- Datasets: Something-Something v2, Kinetics-400/600/700, HowTo100M, YT-Temporal-1B (curated), plus ImageNet as static videos.
- Sampling Ratios: Empirically set, e.g., ImageNet 25%, HowTo100M 31%.
- Augmentations: Random temporal cropping (4 fps), random-resize-crop with aspect-ratio jitter, tubelet patchification, multi-block masking.
- Progressive Resolution Schedule: Training proceeds from 16 frames @ 256×256 pixels to 64 frames @ 384×384, achieving an 8× speedup.
- Scaling Impact: Each of the following increases yields +1 to +1.5% average downstream accuracy: (1) dataset size up to 22M, (2) model up to 1B parameters, (3) longer training (252K steps), (4) higher resolution/longer clips. Evaluating with longer clips (64 vs. 16 frames) yields +9.7% average accuracy.
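For intuition, the token count implied by 2×16×16 tubelet tokenization grows steeply with clip length and resolution, which is what makes the progressive schedule pay off; the back-of-the-envelope arithmetic below is illustrative only.

```python
def tokens_per_clip(frames, height, width, t=2, p=16):
    """Number of tubelet tokens under 2x16x16 tokenization."""
    return (frames // t) * (height // p) * (width // p)

short = tokens_per_clip(16, 256, 256)   # 8 * 16 * 16  = 2,048 tokens
long_ = tokens_per_clip(64, 384, 384)   # 32 * 24 * 24 = 18,432 tokens
print(short, long_, long_ / short)      # the long setting has 9x more tokens per clip
```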
4. Downstream Results: Video Understanding, QA, and Robotic Planning
V-JEPA 2 achieves state-of-the-art results in several domains:
- Motion Understanding (Frozen Probe): 77.3% top-1 on Something-Something v2 (ViT-g), 90.2% on Diving-48, and an average of 88.2% across six tasks, exceeding the prior InternVideo2-1B by 7.6 pp.
- Human Action Anticipation: Epic-Kitchens-100 recall@5 (action): 39.7% (+44% over PlausiVL).
- Video Question Answering (with LLM alignment): PerceptionTest (8B model) 84.0% vs. 82.7% (PerceptionLM 8B); MVP paired accuracy 44.5% vs. 39.7%; TempCompass 76.9% vs. 72.7%; improvements extend to longest-horizon and compositional temporal QA.
- Zero-Shot Robotic Planning (V-JEPA 2-AC): After 62h of Droid robot video, V-JEPA 2-AC achieves average 80% pick-and-place success, 75% reach-with-object, and 65% grasp on cup/box (vs. Octo: 15% grasp, Cosmos: 0–30%), deployed in two independent labs with no task-specific data or reward.
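The zero-shot planning results above come from searching over candidate action sequences so that the action-conditioned predictor's imagined latent matches the encoding of a goal image. A minimal cross-entropy-method sketch of that idea follows; the function names, energy definition, and hyperparameters are illustrative assumptions.

```python
import torch

@torch.no_grad()
def plan_action(encoder, pred, obs, proprio, goal_img,
                horizon=1, pop=256, elites=32, iters=5, act_dim=7):
    """Cross-entropy-method search: sample action sequences, roll them out in
    latent space, keep those whose predicted latent is closest to the goal."""
    z0, z_goal = encoder(obs), encoder(goal_img)       # (1, N, D) each
    mu = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        acts = mu + std * torch.randn(pop, horizon, act_dim)
        z = z0.expand(pop, *z0.shape[1:]).clone()
        for t in range(horizon):
            z = pred(z, acts[:, t], proprio)           # imagined next latent
        energy = (z - z_goal).abs().flatten(1).mean(dim=1)
        idx = energy.topk(elites, largest=False).indices
        mu, std = acts[idx].mean(0), acts[idx].std(0)  # refit Gaussian to elites
    return mu[0]                                       # execute the first action (MPC)
```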
Performance Summary (selected benchmarks):
| Method | Params | Pretrain FLOPs | SSv2 (%) | K400 (%) |
|---|---|---|---|---|
| V-JEPA 2 ViT-L | 300M | 1.9 | 73.7 | 85.1 |
| V-JEPA 2 ViT-H | 600M | 3.5 | 74.0 | 85.3 |
| V-JEPA 2 ViT-g | 1B | 5.3 | 75.3 | 86.6 |
| SALT ViT-L | 300M | 1.2 | 74.9 | 85.4 |
| SALT ViT-H | 600M | 1.5 | 75.4 | 86.0 |
| SALT ViT-g | 1B | 1.9 | 76.2 | 86.8 |
V-JEPA 2’s frozen-backbone evaluation follows the probing/best-pool strategy on standard datasets. Ablations reveal data and model scale, progressive resolution, and YT1B curation as key contributors to accuracy.
5. LLM Alignment and Multi-Modal Extensions
V-JEPA 2 supports large multimodal alignment for video QA tasks:
- Vision-Language Fusion: V-JEPA 2 encoders produce per-patch embeddings, projected to LLM token-space via MLP or cross-attention.
- LLM Backbone: Multi-stage training using Qwen2-7B, Llama3-8B, or similar.
- Training Regime: Image captioning, large-scale image QA, followed by video captioning and QA; standard cross-entropy loss.
- Parameterization: At the 8B model class, best results are achieved with stagewise fusion and optional vision-encoder fine-tuning.
This fusion mechanism enables compositional temporal understanding and competitive (often best) accuracy on PerceptionTest, MVP, TempCompass, TOMATO, TVBench, and MVBench benchmarks.
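A minimal sketch of the MLP projection path is shown below (the cross-attention variant is analogous); the class name and dimensions are assumptions chosen for illustration.

```python
import torch.nn as nn

class VisionProjector(nn.Module):
    """Project frozen V-JEPA 2 per-patch embeddings into the LLM token-embedding space."""
    def __init__(self, vis_dim=1408, llm_dim=4096, hidden=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim))

    def forward(self, patch_emb):            # (B, N_patches, vis_dim)
        return self.mlp(patch_emb)           # (B, N_patches, llm_dim) visual "tokens"

# the projected visual tokens are concatenated with the text token embeddings and
# the combined sequence is trained with the standard next-token cross-entropy loss
```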
6. Comparative Analysis: Architectural Trade-offs and Limitations
A detailed comparison with the SALT recipe (Li et al., 29 Sep 2025) underscores several aspects:
- Compute Efficiency: V-JEPA 2 requires dual forward passes per batch (student and EMA teacher), inflating compute versus single-pass frozen-teacher regimes (SALT). Including initial teacher pretraining, SALT achieves 20–30% lower FLOPs for equivalent student scale.
- Accuracy–Compute Pareto: SALT strictly dominates V-JEPA 2 across the accuracy–FLOPs Pareto frontier at all scales (e.g., +2.3 pp average accuracy for ViT-L at matched compute).
- Architectural Coupling: EMA ties student and teacher, restricting independent tuning. In contrast, SALT’s decoupled frozen-teacher architecture enables optimal compute allocation and supports scaling up the student with small, potentially sub-optimal teachers.
- Training Stability and Model Selection: V-JEPA 2's training loss is uncorrelated with downstream accuracy, so model selection relies on proxies such as RankMe, LiDAR, and α-ReQ (see the RankMe sketch after this list), alongside brittle hyperparameter tuning (e.g., careful EMA momentum scheduling and stop-gradient placement). SALT exhibits a direct correlation between pretraining loss and frozen-backbone probing accuracy, improving transparency and stability.
- Collapse Prevention: V-JEPA 2 prevents zero-loss trivial collapse by restricting teacher gradient flow and using slow EMA updates, whereas SALT achieves collapse resistance via a static, frozen teacher and simple loss design.
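Because the pretraining loss is not predictive of downstream quality, spectrum-based proxies are computed on frozen-encoder features instead; a minimal sketch of the RankMe effective-rank estimate is shown below (the function name and feature-matrix shape are illustrative assumptions).

```python
import torch

def rankme(features, eps=1e-7):
    """RankMe: exponential of the entropy of the normalized singular-value spectrum.
    features: (N, D) matrix of frozen-encoder embeddings on held-out clips."""
    s = torch.linalg.svdvals(features)        # singular values
    p = s / (s.sum() + eps) + eps             # normalized spectrum
    return torch.exp(-(p * p.log()).sum())    # effective rank in [1, min(N, D)]
```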
7. Limitations, Open Questions, and Future Directions
V-JEPA 2 and its extensions exhibit several known limitations:
- Camera Pose Sensitivity: In robotic planning, the model’s implicit action-axis inference leads to errors that vary linearly with camera angle; potential remedies include unsupervised "calibration" at deployment.
- Short Horizon Rollouts: Planning is limited to short-horizon (T=1) due to error compounding and MPC search complexity, motivating development of hierarchical abstractions or longer-horizon planning modules.
- Goal Specification: The planning framework currently relies on image goals; future work suggests extending to language-conditioned planning via LLM integration.
- Scaling Ceiling: V-JEPA 2 is presently limited to the 1B-parameter scale; further improvements are anticipated with models exceeding 10B parameters.
- Dynamic EMA Complexity: EMA-based self-distillation entails additional engineering and training complexity, as described above.
This suggests that future advancements may focus on deeper scaling, more robust fusion with LLMs for language-guided tasks, adaptation to varied camera geometries, and the development of hierarchical models for long-horizon control.
V-JEPA 2 constitutes a high-capacity, versatile, and empirically benchmarked approach for learning video representations applicable to understanding, anticipation, multimodal reasoning, and robotic planning, with its performance and efficiency trade-offs now sharply contextualized by static-teacher alternatives such as SALT (Li et al., 29 Sep 2025, Assran et al., 11 Jun 2025).