V-JEPA 2: Self-Supervised Video Learning

Updated 12 November 2025
  • The paper introduces V-JEPA 2, a self-supervised architecture that predicts masked spatio-temporal regions to enable robust video representation and zero-shot robotic planning.
  • V-JEPA 2 is a video representation learning framework utilizing 3D tubelet tokenization, multi-block masking, and transformer encoders to achieve strong motion understanding and cross-modal alignment.
  • The model employs a dual forward pass with EMA teacher-student training, yielding state-of-the-art metrics on benchmarks like Something-Something v2, Diving-48, and various robotic planning tasks.

V-JEPA 2 is a self-supervised video representation learning architecture based on the Joint-Embedding Predictive Architecture (JEPA) paradigm, with extensions enabling scalable training, strong downstream motion understanding, cross-modal alignment, and zero-shot robotic planning. The architecture is designed to learn generalizable latent representations by predicting masked spatio-temporal regions in latent space, using large-scale internet video data and a limited amount of robot interaction data. The V-JEPA 2 scheme introduces modifications over prior JEPA and V-JEPA frameworks, culminating in state-of-the-art results for video understanding and planning benchmarks, especially when combined with LLMs and action-conditioned policy heads (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).

1. Foundational Principles and Architectural Design

V-JEPA 2 builds upon the JEPA approach, where a video input is decomposed into spatio-temporal patches ("tubelets"), a subset of which are masked out, and a transformer-based encoder $f_\theta$ computes representations for the visible patches. A small transformer "predictor" $g_\phi$, conditioned on learnable mask tokens at the target positions, is trained to predict the embeddings of the masked patches. V-JEPA 2 extends this principle through several technical innovations:

  • 3D Tubelet Tokenization: Each video clip is divided into tubelets spanning 2 frames $\times$ 16 $\times$ 16 pixels in height and width (2×16×16).
  • 3D-RoPE Positional Encoding: 3D rotary positional embeddings are applied over the temporal and spatial axes to enable efficient sequence modeling.
  • Transformer Encoder: Large ViT-style backbone architectures (ViT-L, -H, -g up to 1B parameters) are employed, supporting deep scaling.
  • Predictor: For action-free pre-training, a compact transformer (ViT-S-like) acts as the predictor (12 layers, width 384, ≈22M parameters). For action-conditioned planning (V-JEPA 2-AC), a larger (24-layer, width 1024, ≈300M parameters) predictor with block-causal attention is used.
  • Masking Scheme: "Multi-block" spatio-temporal masking with short- and long-range blocks yields an overall masking ratio approaching 90%, with spatial scale sampled in $[0.15, 0.7]$ and a temporal mask ratio of ≈1.0. Block aspect ratios are sampled over $[0.75, 1.5]$ (a minimal sampling sketch follows this list).
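
The masking scheme can be made concrete with a short sketch. The code below builds the tubelet grid and samples multi-block masks under the hyperparameters listed above; function names (`tubelet_grid`, `sample_block_mask`, `multi_block_mask`) and the number of blocks are illustrative assumptions, not the released implementation.

```python
import math
import random
import numpy as np

def tubelet_grid(frames=16, height=256, width=256, t=2, p=16):
    """Number of tubelets along each axis for t x p x p patchification."""
    return frames // t, height // p, width // p  # (T', H', W')

def sample_block_mask(T_, H_, W_, spatial_scale=(0.15, 0.7),
                      aspect=(0.75, 1.5), temporal_ratio=1.0):
    """Sample one spatio-temporal block of masked tubelet indices."""
    # Spatial extent: fraction of the H' x W' grid with a random aspect ratio.
    scale = random.uniform(*spatial_scale)
    ar = random.uniform(*aspect)
    h = min(H_, max(1, int(round(math.sqrt(scale * H_ * W_ / ar)))))
    w = min(W_, max(1, int(round(math.sqrt(scale * H_ * W_ * ar)))))
    # Temporal extent: a ratio of ~1.0 masks the block across all frames.
    d = max(1, int(round(temporal_ratio * T_)))
    t0 = random.randint(0, T_ - d)
    y0 = random.randint(0, H_ - h)
    x0 = random.randint(0, W_ - w)
    mask = np.zeros((T_, H_, W_), dtype=bool)
    mask[t0:t0 + d, y0:y0 + h, x0:x0 + w] = True
    return mask

def multi_block_mask(T_, H_, W_, n_blocks=8):
    """Union of several blocks; with enough blocks the ratio approaches ~90%."""
    mask = np.zeros((T_, H_, W_), dtype=bool)
    for _ in range(n_blocks):
        mask |= sample_block_mask(T_, H_, W_)
    return mask

T_, H_, W_ = tubelet_grid()
m = multi_block_mask(T_, H_, W_)
print(f"masking ratio: {m.mean():.2f}")
```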

An exponential moving average (EMA) copy of the student encoder, $\bar f_\theta$, serves as the teacher, with identical architecture. The predictor maps the student's visible patch embeddings (plus mask tokens for occluded positions) to the teacher's target latent space.

2. Self-Supervised Objectives and Training Procedures

V-JEPA 2 employs a self-supervised masked-latent prediction loss tailored to large-scale video data. For a video input $x \cup y$, with $x$ denoting the visible patches and $y$ the masked patches, the objective is:

$$\mathcal{L}_{\mathrm{mlp}}(\theta,\phi) = \mathbb{E}_{x,y}\left\| \, g_\phi\big(f_\theta(x),\, \delta(y)\big) - \operatorname{stop\_grad}\big(\bar f_\theta(y)\big) \, \right\|_{1}$$

where $\delta(y)$ encodes the spatio-temporal positions of the masked tubelets and $\operatorname{stop\_grad}(\cdot)$ blocks gradient propagation through the teacher pathway to prevent collapse. The EMA teacher is updated at each step via:

$$\bar\theta_t = \tau\,\bar\theta_{t-1} + (1-\tau)\,\theta_t, \qquad \tau \lesssim 1$$

No explicit contrastive, variance, or redundancy reduction regularizers are used; collapse is prevented by the combination of EMA and stop-gradient. This process requires two forward passes per batch (student + teacher), with dynamic coupling between the two via the EMA update.
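
A schematic PyTorch training step under the loss and EMA update above is sketched below. The encoder, predictor, and masking interfaces (e.g., `predictor(ctx, target_positions=...)`) are placeholders chosen for illustration; only the structure (two forward passes, L1 latent loss, stop-gradient, EMA update) follows the description in this section.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    # theta_bar_t = tau * theta_bar_{t-1} + (1 - tau) * theta_t
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(tau).add_(p_s, alpha=1 - tau)

def jepa_step(student, predictor, teacher, video_tokens, ctx_idx, tgt_idx, opt):
    """One masked-latent prediction step: student + EMA-teacher forward passes."""
    # Teacher encodes the clip; targets are taken at the masked positions and
    # detached (stop-gradient), which together with slow EMA prevents collapse.
    with torch.no_grad():
        targets = teacher(video_tokens)[:, tgt_idx]

    # Student sees only the visible (context) tubelets.
    ctx = student(video_tokens[:, ctx_idx])

    # Predictor maps context embeddings + mask-position tokens to target latents.
    preds = predictor(ctx, target_positions=tgt_idx)

    loss = torch.nn.functional.l1_loss(preds, targets)  # L1 in latent space
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)
    return loss.item()

# The teacher starts as a copy of the student and is updated only via EMA:
# teacher = copy.deepcopy(student).requires_grad_(False)
```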

Action-conditioned extension (V-JEPA 2-AC): In a post-pretraining stage, a separate predictor is trained on robot interaction data on top of the frozen V-JEPA 2 encoder. Its losses include the following (a minimal sketch follows the list):

  • Teacher-forcing Loss: Predict next latent from actions, proprioception, and current latent, averaged over predicted steps.
  • Two-step Rollout Loss: Enforces model rollout consistency over multiple prediction steps.
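
A minimal sketch of the two objectives, assuming an action-conditioned predictor `pred` that maps a latent, proprioceptive state, and action to the next latent; the tensor layout and rollout horizon are assumptions for illustration. The latents would come from the frozen V-JEPA 2 encoder.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(pred, latents, actions, proprio):
    """Predict z_{t+1} from the ground-truth z_t at every step, then average.
    latents: (B, T, D), actions/proprio: (B, T, ...)."""
    losses = []
    for t in range(latents.shape[1] - 1):
        z_next = pred(latents[:, t], proprio[:, t], actions[:, t])
        losses.append(F.l1_loss(z_next, latents[:, t + 1]))
    return torch.stack(losses).mean()

def rollout_loss(pred, latents, actions, proprio, horizon=2):
    """Feed the model its own predictions for `horizon` steps (here two)."""
    z = latents[:, 0]
    losses = []
    for t in range(horizon):
        z = pred(z, proprio[:, t], actions[:, t])
        losses.append(F.l1_loss(z, latents[:, t + 1]))
    return torch.stack(losses).mean()
```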

3. Data Regime, Training Scaling, and Optimizations

V-JEPA 2 is pre-trained on VideoMix22M (VM22M), comprising approximately 1 million hours of video and 1 million images:

  • Datasets: Something-Something v2, Kinetics-400/600/700, HowTo100M, YT-Temporal-1B (curated), plus ImageNet as static videos.
  • Sampling Ratios: Empirically set, e.g., ImageNet 25%, HowTo100M 31%.
  • Augmentations: Random temporal cropping (4 fps), random-resize-crop (aspect ratio $[0.75, 1.35]$), tubelet patchification, multi-block masking.
  • Progressive Resolution Schedule: Training proceeds from 16 frames at 256×256 pixels to 64 frames at 384×384, achieving an ≈8× speedup (a configuration sketch follows this list).
  • Scaling Impact: Each of the following increases yields +1 to +1.5% average downstream accuracy: (1) dataset size up to 22M, (2) model up to 1B parameters, (3) longer training (252K steps), (4) higher resolution/longer clips. Evaluating with longer clips (64 vs. 16 frames) yields +9.7% average accuracy.
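
A hedged sketch of the data regime as configuration code: the ImageNet and HowTo100M weights are those reported above, while the remaining weights and the resolution switch point are placeholders, not the paper's exact values.

```python
import random

# Reported mixing weights: ImageNet 25%, HowTo100M 31%. The remaining sources
# share the rest here purely for illustration; the actual split may differ.
MIX = {
    "imagenet":       0.25,
    "howto100m":      0.31,
    "ssv2":           0.11,   # placeholder
    "kinetics":       0.11,   # placeholder
    "yt_temporal_1b": 0.22,   # placeholder
}

def sample_source(mix=MIX):
    """Draw the data source for the next clip according to the mixing weights."""
    names, weights = zip(*mix.items())
    return random.choices(names, weights=weights, k=1)[0]

def clip_config(step, total_steps, start=(16, 256), end=(64, 384)):
    """Progressive resolution: short/low-res clips early, long/high-res late.
    The single switch at 75% of training is an illustrative assumption."""
    return start if step < 0.75 * total_steps else end
```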

4. Downstream Results: Video Understanding, QA, and Robotic Planning

V-JEPA 2 achieves state-of-the-art results in several domains:

  • Motion Understanding (Frozen Probe): 77.3% top-1 on Something-Something v2 (ViT-g), 90.2% on Diving-48, 88.2% across six tasks, exceeding prior InternVideo2-1B by 7.6 pp.
  • Human Action Anticipation: Epic-Kitchens-100 recall@5 (action): 39.7% (+44% over PlausiVL).
  • Video Question Answering (with LLM alignment): PerceptionTest (8B model) 84.0% vs. 82.7% (PerceptionLM 8B); MVP paired accuracy 44.5% vs. 39.7%; TempCompass 76.9% vs. 72.7%; improvements extend to long-horizon and compositional temporal QA.
  • Zero-Shot Robotic Planning (V-JEPA 2-AC): After training on 62 h of DROID robot video, V-JEPA 2-AC achieves on average 80% pick-and-place success, 75% reach-with-object, and 65% grasp on cup/box (vs. Octo: 15% grasp; Cosmos: 0–30%), deployed in two independent labs with no task-specific data or rewards (a latent-space planner sketch follows this list).
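
The planning setup can be sketched as goal-conditioned search in latent space: encode a goal image, roll the action-conditioned predictor forward under candidate action sequences, and execute the first action of the sequence whose predicted latent lands closest to the goal. The cross-entropy-method search, horizon, and distance measure below are assumptions for illustration rather than the paper's exact planner.

```python
import torch

def plan_action(pred, encoder, obs, proprio, goal_img, act_dim=7,
                horizon=1, n_samples=256, n_elite=32, iters=5):
    """Receding-horizon planning: minimize distance between rolled-out and goal latents."""
    z_goal = encoder(goal_img)           # target embedding from the goal image
    z0 = encoder(obs)                    # embedding of the current observation
    mu = torch.zeros(horizon, act_dim)   # CEM: iteratively refit a Gaussian
    std = torch.ones(horizon, act_dim)   # over candidate action sequences
    for _ in range(iters):
        acts = mu + std * torch.randn(n_samples, horizon, act_dim)
        costs = []
        for a in acts:
            z = z0
            for t in range(horizon):
                z = pred(z, proprio, a[t])            # roll the world model forward
            costs.append((z - z_goal).abs().mean())   # L1 distance to the goal latent
        elite = acts[torch.stack(costs).argsort()[:n_elite]]
        mu, std = elite.mean(0), elite.std(0) + 1e-6
    return mu[0]  # execute the first action, then replan
```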

Performance Summary (selected benchmarks):

| Method | Params | Pretrain FLOPs | SSv2 (%) | K400 (%) |
|---|---|---|---|---|
| V-JEPA 2 ViT-L | 300M | 1.9 | 73.7 | 85.1 |
| V-JEPA 2 ViT-H | 600M | 3.5 | 74.0 | 85.3 |
| V-JEPA 2 ViT-g | 1B | 5.3 | 75.3 | 86.6 |
| SALT ViT-L | 300M | 1.2 | 74.9 | 85.4 |
| SALT ViT-H | 600M | 1.5 | 75.4 | 86.0 |
| SALT ViT-g | 1B | 1.9 | 76.2 | 86.8 |

V-JEPA 2’s frozen-backbone evaluation follows the probing/best-pool strategy on standard datasets. Ablations reveal data and model scale, progressive resolution, and YT1B curation as key contributors to accuracy.
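
A minimal sketch of a frozen-backbone probe in this style: a single learnable query cross-attends over the frozen patch embeddings and feeds a linear classifier. Layer sizes, the class count, and the module name are illustrative assumptions, not the paper's exact probe.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Lightweight classifier over frozen patch embeddings: a learnable query
    cross-attends to the patch tokens, then a linear head predicts the label."""
    def __init__(self, dim=1408, n_heads=16, n_classes=174):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patch_tokens):              # (B, N, dim) from the frozen encoder
        q = self.query.expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.head(pooled.squeeze(1))       # (B, n_classes)

# Only the probe is trained; the backbone stays frozen:
# logits = AttentiveProbe()(encoder(video).detach())
```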

5. LLM Alignment and Multi-Modal Extensions

V-JEPA 2 supports large multimodal alignment for video QA tasks:

  • Vision-Language Fusion: V-JEPA 2 encoders produce per-patch embeddings, which are projected into the LLM token space via an MLP or cross-attention (a projector sketch follows this list).
  • LLM Backbone: Multi-stage training using Qwen2-7B, Llama3-8B, or similar.
  • Training Regime: Image captioning, large-scale image QA, followed by video captioning and QA; standard cross-entropy loss.
  • Parameterization: At the 8B model scale, best results are achieved with stagewise fusion and optional fine-tuning of the vision encoder.
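
A minimal sketch of the MLP projection path from per-patch video embeddings into the LLM token space; the module name and all dimensions (including the assumed LLM hidden size) are placeholders, and the cross-attention variant is omitted.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps per-patch video embeddings to the LLM hidden size so they can be
    prepended to the text sequence as visual tokens."""
    def __init__(self, vision_dim=1408, llm_dim=3584, hidden=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim)
        )

    def forward(self, patch_embeddings):          # (B, N_patches, vision_dim)
        return self.proj(patch_embeddings)        # (B, N_patches, llm_dim)

# Training then optimizes standard next-token cross-entropy on the concatenated
# [visual tokens; text tokens] sequence, optionally unfreezing the vision
# encoder in a later stage.
```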

This fusion mechanism enables compositional temporal understanding and competitive (often best) accuracy on PerceptionTest, MVP, TempCompass, TOMATO, TVBench, and MVBench benchmarks.

6. Comparative Analysis: Architectural Trade-offs and Limitations

A detailed comparison with the SALT recipe (Li et al., 29 Sep 2025) underscores several aspects:

  • Compute Efficiency: V-JEPA 2 requires dual forward passes per batch (student and EMA teacher), inflating compute versus single-pass frozen-teacher regimes (SALT). Including initial teacher pretraining, SALT achieves 20–30% lower FLOPs for equivalent student scale.
  • Accuracy–Compute Pareto: SALT strictly dominates V-JEPA 2 across the accuracy-FLOPs Pareto frontier at all scales (e.g., +2.3pp avg. for ViT-L at matched compute).
  • Architectural Coupling: EMA ties student and teacher, restricting independent tuning. In contrast, SALT’s decoupled frozen-teacher architecture enables optimal compute allocation and supports scaling up the student with small, potentially sub-optimal teachers.
  • Training Stability and Model Selection: V-JEPA 2's training loss is uncorrelated with downstream accuracy, so model selection relies on proxies such as RankMe, LiDAR, and $\alpha$-ReQ (a RankMe sketch follows this list), alongside brittle hyperparameter tuning (e.g., careful EMA momentum scheduling and stop-gradient placement). SALT exhibits a strong direct correlation between pretraining loss and frozen-backbone probing accuracy ($R^2 \approx 0.95$), improving transparency and stability.
  • Collapse Prevention: V-JEPA 2 prevents zero-loss trivial collapse by restricting teacher gradient flow and using slow EMA updates, whereas SALT achieves collapse resistance via a static, frozen teacher and simple loss design.
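
Since the pretraining loss is not predictive of downstream accuracy, rank-based proxies over frozen embeddings are used instead. The sketch below computes a RankMe-style score (the exponential of the entropy of the normalized singular values of an embedding matrix), assuming a matrix of per-clip embeddings; higher values indicate less-collapsed representations.

```python
import torch

def rankme(embeddings, eps=1e-7):
    """Smooth effective rank of an (N, D) embedding matrix: exp of the entropy
    of the normalized singular-value distribution."""
    s = torch.linalg.svdvals(embeddings.float())
    p = s / s.sum() + eps
    return torch.exp(-(p * p.log()).sum()).item()

# e.g. z = encoder(val_clips)  ->  rankme(z) as a collapse/quality proxy
```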

7. Limitations, Open Questions, and Future Directions

V-JEPA 2 and its extensions exhibit several known limitations:

  • Camera Pose Sensitivity: In robotic planning, the model’s implicit action-axis inference leads to errors that vary linearly with camera angle; potential remedies include unsupervised "calibration" at deployment.
  • Short Horizon Rollouts: Planning is limited to short-horizon (T=1) due to error compounding and MPC search complexity, motivating development of hierarchical abstractions or longer-horizon planning modules.
  • Goal Specification: The planning framework currently relies on image goals; future work suggests extending to language-conditioned planning via LLM integration.
  • Scaling Ceiling: V-JEPA 2 is presently limited to 1B parameter scale; further improvements are anticipated with models exceeding 10B parameters.
  • Dynamic EMA Complexity: EMA-based self-distillation entails additional engineering and training complexity, as described above.

This suggests that future advancements may focus on deeper scaling, more robust fusion with LLMs for language-guided tasks, adaptation to varied camera geometries, and the development of hierarchical models for long-horizon control.


V-JEPA 2 constitutes a high-capacity, versatile, and empirically benchmarked approach for learning video representations applicable to understanding, anticipation, multimodal reasoning, and robotic planning, with its performance and efficiency trade-offs now sharply contextualized by static-teacher alternatives such as SALT (Li et al., 29 Sep 2025, Assran et al., 11 Jun 2025).
