V-JEPA 2: Video Joint-Embedding Predictive Architecture

Updated 28 June 2026

The paper introduces a self-supervised masked latent-prediction framework that uses high masking and a momentum teacher–student paradigm to predict masked video embeddings.
V-JEPA 2 leverages transformer-based context and predictor modules to efficiently learn video representations from large-scale datasets, achieving state-of-the-art results in recognition and anticipation tasks.
The method’s design supports integration into robotics and vision-language models, improving action planning and generalization without relying on pixel-level supervision.

Video Joint-Embedding Predictive Architecture 2 (V-JEPA 2) is a self-supervised masked latent-prediction framework for learning video representations. Built upon transformer-based architectures, V-JEPA 2 employs a high masking regime, a momentum teacher–student paradigm, and a predictive latent-space objective to produce embeddings tailored for generalizable motion understanding, temporal anticipation, and planning tasks. The method has demonstrated state-of-the-art results on video recognition, anticipation, and robotics applications without direct pixel-level supervision or language-based pretraining.

1. Architecture and Pretraining Framework

V-JEPA 2 consists of three key components:

Context (Student) Encoder ($f_\theta$): A Video Vision Transformer (ViT-L/H/g) that processes only the unmasked “context” tubelets of an input video and outputs their latent embeddings.
Predictor ($g_\phi$): A lightweight transformer (often 8–12 layers) that receives the student’s context embeddings and a binary mask indicating which tubelets are masked, and produces predictions for the masked latents.
EMA (Momentum) Teacher Encoder ($\bar f$): A separate encoder, maintained as an exponential moving average (EMA) of the student, that processes the masked-out target tubelets and yields stable target embeddings.

The masking strategy employs random spatial and spatio-temporal blocks with short- and long-range sizes. Typically, 90% of tubelets (16×16 px × 2 frames) are masked, leaving the student highly dependent on limited visible context to solve the masked latent prediction task (Li et al., 29 Sep 2025, Assran et al., 11 Jun 2025).
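To make the masking ratio concrete, the following sketch counts the tubelets in a clip and the number left visible at 90% masking. The clip size (16 frames at 256×256 pixels) and the helper name `count_tubelets` are illustrative assumptions, not values taken from the cited papers.

```python
def count_tubelets(frames, height, width, patch=16, tubelet_frames=2):
    """Number of non-overlapping tubelets (patch x patch pixels x tubelet_frames frames)."""
    if frames % tubelet_frames or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (frames // tubelet_frames) * (height // patch) * (width // patch)


# Hypothetical clip: 16 frames at 256x256 pixels (an assumption for illustration).
total = count_tubelets(frames=16, height=256, width=256)
masked = round(0.90 * total)   # 90% masking ratio quoted in the text
visible = total - masked       # tokens actually seen by the context encoder

print(total, masked, visible)
```

Under these assumptions the context encoder sees only about a tenth of the token sequence, which is where much of the compute saving of this kind of masking comes from.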

The teacher’s parameters $\theta_t$ are updated at each step as an EMA of the student’s parameters $\theta_s$:

$\theta_t \leftarrow m \cdot \theta_t + (1 - m) \cdot \theta_s,$

with $m \in [0.996, 0.9999]$ (typically $m = 0.999$), ensuring the teacher evolves slowly to provide stable targets and prevent collapse (Li et al., 29 Sep 2025).
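The EMA rule can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' implementation; the function name `ema_update` and the toy parameter values are assumptions.

```python
def ema_update(teacher, student, m=0.999):
    """Return new teacher parameters: m * teacher + (1 - m) * student, elementwise."""
    if len(teacher) != len(student):
        raise ValueError("parameter lists must have the same length")
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]


# Toy parameters (hypothetical values): the teacher starts at 0, the student sits at 1.
teacher = [0.0, 0.0]
student = [1.0, 1.0]
for _ in range(1000):
    teacher = ema_update(teacher, student, m=0.999)

# With a fixed student, k updates give teacher = 1 - m**k.
print(teacher[0], 1 - 0.999 ** 1000)
```

With $m = 0.999$ the teacher covers only about 63% of the gap to a fixed student after 1000 steps, which illustrates how slowly the targets move.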

2. Mathematical Formulation and Training Objective

The training objective is an $\ell_1$ masked latent regression loss. For each input, context patches $x$ and masked patches $y$ are processed as follows:

$\min_{\theta,\phi}\; \mathbb{E}_{(x,y)}\, \|\, g_\phi(f_\theta(x),\delta_y) - \text{stop\_grad}(\bar f(y))\,\|_1,$

where $g_\phi(f_\theta(x),\delta_y)$ predicts the target embeddings for the masked positions, $\delta_y$ is a binary indicator of the masked locations, and $\text{stop\_grad}$ blocks gradients from flowing into the teacher. The loss is averaged per token (Li et al., 29 Sep 2025, Assran et al., 11 Jun 2025, Miao et al., 12 Feb 2026, Bardes et al., 2024).
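As a forward-only sketch of this objective, the function below computes the mean absolute error between predicted and target embeddings at the masked positions. It assumes the loss is normalized over both tokens and embedding dimensions, which is one possible reading of "averaged per token"; the name `masked_l1_loss` and the numbers are hypothetical.

```python
def masked_l1_loss(predicted, target):
    """Mean absolute error over all masked tokens and embedding dimensions.

    predicted: predictor outputs for the masked positions (list of vectors).
    target:    teacher embeddings for the same positions, treated as constants
               (the stop-gradient has no effect in this forward-only sketch).
    """
    if len(predicted) != len(target):
        raise ValueError("need one target per predicted token")
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("embedding dimensions differ")
        total += sum(abs(p - t) for p, t in zip(p_vec, t_vec))
        count += len(p_vec)
    return total / count


# Two masked tokens with 3-dimensional embeddings (hypothetical numbers).
pred = [[0.5, 1.0, -1.0], [0.0, 2.0, 1.0]]
tgt = [[0.0, 1.0, -2.0], [1.0, 2.0, 1.0]]
loss = masked_l1_loss(pred, tgt)
print(loss)
```

In an autograd framework the targets would additionally be detached so that no gradient reaches the teacher.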

Masking involves multi-block spatial and temporal dropout. Short-range blocks cover approximately 15% of the spatial area and long-range blocks approximately 70%, with masking applied across the full video sequence. This forces the model to rely on contextually correlated temporal and spatial cues.
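A multi-block mask of this kind might be sampled as sketched below: rectangular blocks are drawn on the spatial token grid, unioned, and repeated over every temporal index. The number of blocks per mask (8 short-range, 2 long-range), the aspect-ratio range, and the grid size are assumptions chosen for illustration; only the 15% and 70% block areas come from the text.

```python
import math
import random


def sample_block(grid_h, grid_w, area_frac, rng, aspect_range=(0.75, 1.5)):
    """Sample one rectangular block covering roughly area_frac of the spatial grid."""
    aspect = rng.uniform(*aspect_range)
    area = area_frac * grid_h * grid_w
    h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
    w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid_h - h)
    left = rng.randint(0, grid_w - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}


def multiblock_mask(grid_t, grid_h, grid_w, area_frac, n_blocks, seed=0):
    """Union of n_blocks spatial blocks, applied to every temporal index."""
    rng = random.Random(seed)
    spatial = set()
    for _ in range(n_blocks):
        spatial |= sample_block(grid_h, grid_w, area_frac, rng)
    return {(t, r, c) for t in range(grid_t) for (r, c) in spatial}


GRID = (8, 16, 16)  # hypothetical token grid: time x height x width
n_tokens = GRID[0] * GRID[1] * GRID[2]
short_mask = multiblock_mask(*GRID, area_frac=0.15, n_blocks=8, seed=0)
long_mask = multiblock_mask(*GRID, area_frac=0.70, n_blocks=2, seed=1)

print(f"short-range masked fraction: {len(short_mask) / n_tokens:.2f}")
print(f"long-range masked fraction:  {len(long_mask) / n_tokens:.2f}")
```

Because the blocks overlap, the union covers less than the sum of the individual block areas, so the overall masking ratio depends on the block count as well as the block size.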

3. Data Regime, Hyperparameters, and Scaling

V-JEPA 2 is pretrained on large, web-scale video datasets such as VideoMix-22M (VM-22M), which aggregates Kinetics (400/600/700), Something-Something V2, HowTo100M, YT-Temporal-1B, and ImageNet images duplicated along the time axis, for a total of over 1 million hours of diverse video (Assran et al., 11 Jun 2025).

Typical pretraining settings:

Batch size: 3072
Optimizer: AdamW (momentum coefficients $\beta_1$, $\beta_2$), weight decay annealed from 0.04 to 0.4
Learning rate: cosine decay, with a 10k-step warmup to a peak value followed by decay to a final value
Gradient clipping: 0.02
Total steps: 240k (e.g., VM-22M), plus additional cooldown with longer clips/resolutions
Data augmentations: random crop, flip, block masking
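The schedules in this list can be sketched as simple functions of the step count. The warmup length, total steps, and weight-decay endpoints come from the list above; the peak and final learning rates used here are placeholders, and both the linear warmup from zero and the linear shape of the weight-decay annealing are assumptions.

```python
import math


def learning_rate(step, peak_lr, final_lr, warmup_steps=10_000, total_steps=240_000):
    """Linear warmup from 0 to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay(step, start=0.04, end=0.4, total_steps=240_000):
    """Anneal weight decay from start to end (the linear shape is an assumption)."""
    frac = min(1.0, step / total_steps)
    return start + (end - start) * frac


# Placeholder learning rates, NOT the values used in the paper.
PEAK_LR, FINAL_LR = 1e-3, 1e-6
for step in (0, 10_000, 125_000, 240_000):
    print(step, learning_rate(step, PEAK_LR, FINAL_LR), weight_decay(step))
```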

No explicit regularizers (e.g. VICReg, Barlow Twins) are required beyond stop-gradient and EMA (Li et al., 29 Sep 2025). The model is trained at increasing scales, with encoder sizes from ViT-L (300M parameters) through ViT-g (1B), and up to ViT-G (2B) for later variants (Mur-Labadia et al., 15 Mar 2026).

4. Downstream Performance and Empirical Outcomes

V-JEPA 2 achieves benchmark results on motion-centric and general video understanding tasks, outperforming both image-pretrained and pixel-reconstruction baselines (Assran et al., 11 Jun 2025, Bardes et al., 2024, Miao et al., 12 Feb 2026). Representative accuracies for frozen-backbone evaluation are:

Model	Params	SSv2	K400
ViT-L	300M	73.7%	85.1%
ViT-H	600M	74.0%	85.3%
ViT-g	1B	75.3%	86.6%
ViT-g₍₃₈₄₎	1B	77.3%	87.3%

On Epic-Kitchens-100 human action anticipation, V-JEPA 2 ViT-g₍₃₈₄₎ achieves 39.7% Recall@5, exceeding specialized anticipation models such as PlausiVL (8B parameters, 27.6%) (Assran et al., 11 Jun 2025).

When aligned with LLMs for video question answering on benchmarks such as PerceptionTest, TempCompass, and TOMATO, V-JEPA 2 establishes state-of-the-art results at the 8B parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass, and a 59.5 MVBench average), outperforming DINOv2, SigLIP2, and PE-G/14 (Assran et al., 11 Jun 2025).

Empirically, V-JEPA 2 sits on the Pareto frontier of the compute–accuracy tradeoff, though later works introduce further improvements.

5. Extensions: V-JEPA 2-AC, Dense Supervision, and V-JEPA 2.1

Latent Action-Conditioned World Modeling (V-JEPA 2-AC):

V-JEPA 2 can be post-trained for robotics via an action-conditioned predictor that models future latent states given the history of states and actions, with an additional rollout loss for planning. Zero-shot deployment on real Franka robots demonstrates planning performance surpassing prior latent diffusion methods in grasping and pick-and-place, with faster action cycles (Assran et al., 11 Jun 2025).
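Planning with an action-conditioned latent predictor can be illustrated with a toy example. The sketch below uses random shooting, a deliberately simple stand-in for whatever optimizer the authors use, and a hypothetical additive predictor in place of a learned transformer; all names and numbers are assumptions.

```python
import random


def l1(a, b):
    """L1 distance between two latent vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def toy_predictor(latent, action):
    """Hypothetical stand-in for a learned predictor: next latent = latent + action."""
    return [z + a for z, a in zip(latent, action)]


def rollout(latent, actions, predictor):
    """Apply the predictor autoregressively and return the final predicted latent."""
    for action in actions:
        latent = predictor(latent, action)
    return latent


def plan(start, goal, predictor, horizon=3, n_candidates=256, max_step=0.5, seed=0):
    """Random shooting: sample action sequences and keep the one whose predicted
    final latent is closest (in L1) to the goal latent."""
    rng = random.Random(seed)
    dim = len(start)
    best_actions = [[0.0] * dim for _ in range(horizon)]  # baseline: do nothing
    best_cost = l1(rollout(start, best_actions, predictor), goal)
    for _ in range(n_candidates):
        actions = [[rng.uniform(-max_step, max_step) for _ in range(dim)]
                   for _ in range(horizon)]
        cost = l1(rollout(start, actions, predictor), goal)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


start, goal = [0.0, 0.0], [1.0, -1.0]
actions, cost = plan(start, goal, toy_predictor)
print(f"first action: {actions[0]}, predicted distance to goal: {cost:.3f}")
```

In a receding-horizon setup, one would typically execute only the first action, re-encode the new observation, and plan again.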

Dense Prediction and Hierarchical Self-Supervision (V-JEPA 2.1):

V-JEPA 2.1 augments the original framework by supervising both masked and unmasked (“context”) tokens with a distance-weighted context loss, in which a per-token weight scales the prediction error for context tokens according to their proximity to masked regions. The loss is further applied hierarchically at multiple ViT layers (Mur-Labadia et al., 15 Mar 2026). Multi-modal tokenizers enable unified training on images and videos, eliminating prior inefficiencies due to temporal duplication of images.
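One possible way to write such an objective is given below, as a sketch under assumed notation rather than the paper's exact formula. Here $M$ and $C$ are the sets of masked and context token indices, $\hat z_i$ is the predictor output for token $i$, $z_i$ is the corresponding teacher embedding, $\mathrm{sg}$ is the stop-gradient, $d_j$ is the distance from context token $j$ to the nearest masked token, $w(\cdot)$ is a weighting function of that distance, and $\lambda$ balances the two terms. The symbols $d_j$, $w$, and $\lambda$ are hypothetical.

```latex
\mathcal{L}
  \;=\; \frac{1}{|M|} \sum_{i \in M} \bigl\| \hat z_i - \mathrm{sg}(z_i) \bigr\|_1
  \;+\; \lambda \, \frac{1}{|C|} \sum_{j \in C} w(d_j) \, \bigl\| \hat z_j - \mathrm{sg}(z_j) \bigr\|_1
```

The first term is the masked latent regression loss of the base framework; the second adds supervision on context tokens, weighted by their distance to the masked regions.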

Key empirical improvements of V-JEPA 2.1 include 7.71 mAP on Ego4D STA v2 (+21% over V-JEPA 2), 40.8 Recall@5 on Epic-Kitchens-100, 47.9 mIoU on ADE20K, and 0.307 RMSE on NYUv2 depth estimation (Mur-Labadia et al., 15 Mar 2026).

6. Computational Tradeoffs and Alternatives (SALT)

“Rethinking JEPA” introduces SALT (Static-teacher Asymmetric Latent Training), a frozen-teacher variant that decouples training into a pixel-reconstruction stage for the teacher and a masked latent-prediction stage for the student. SALT demonstrates better compute efficiency and easier model selection than V-JEPA 2: its student loss correlates linearly with probing accuracy (R²≈0.95), and it outperforms V-JEPA 2 by 2–3% under frozen-feature evaluation while using less compute per final model (Li et al., 29 Sep 2025).
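The model-selection claim rests on how well a straight line relates student loss to probing accuracy. The sketch below shows how such an R² value could be computed for a set of checkpoints; the data points are made up for illustration and are not results from the paper.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line fitted to (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up checkpoints: (student training loss, frozen-probe accuracy in %).
losses = [0.52, 0.48, 0.45, 0.41, 0.38]
accuracies = [68.0, 70.5, 71.9, 74.2, 75.8]
print(f"R^2 = {r_squared(losses, accuracies):.3f}")
```

A value close to 1 means the training loss alone is a good proxy for downstream accuracy, so checkpoints can be ranked without running a probe for each one.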

7. Practical Implications and Integration in Robotics/Policy Models

V-JEPA 2 embeddings are particularly well-suited for robotics and vision-language-action (VLA) policy fusion. Because it learns to predict only what is spatiotemporally predictable, V-JEPA 2 discards nuisance factors (lighting, clutter) and emphasizes task-relevant motion and state information (Miao et al., 12 Feb 2026). Fusing V-JEPA 2 into VLA models through early fusion or gated cross-attention yields improvements of 6–18 points in benchmark success rates across LIBERO, LIBERO+, RoboTwin 2.0, and real-world pick-and-place tasks. These gains hold across both simulation and hardware, even in challenging domain-randomized settings.
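Gated cross-attention fusion can be sketched as a residual update in which policy tokens attend to video tokens and a learned scalar gate controls how much of the attended signal is added. The version below is a simplified illustration: it uses a single head with no learned projections and a tanh gate, which is a common design but an assumption here, and all names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
                  for kv in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def gated_fusion(policy_tokens, video_tokens, gate):
    """Residual fusion: policy + tanh(gate) * attention(policy -> video)."""
    attended = cross_attention(policy_tokens, video_tokens)
    g = math.tanh(gate)
    return [[p + g * a for p, a in zip(p_vec, a_vec)]
            for p_vec, a_vec in zip(policy_tokens, attended)]


policy = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical VLA policy tokens
video = [[0.5, 0.5], [2.0, -1.0]]   # hypothetical video-encoder embeddings

print(gated_fusion(policy, video, gate=0.0))  # gate closed: policy tokens unchanged
print(gated_fusion(policy, video, gate=1.0))  # gate open: video features mixed in
```

Initializing the gate at zero leaves the pretrained policy untouched at the start of fine-tuning, so the video features are blended in only as the gate is learned.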

A plausible implication is that V-JEPA 2’s masked latent prediction objective induces inductive biases toward world modeling, anticipation, and action planning, further substantiated by its strong sample efficiency and generalization relative to prior image-pretrained encoders.

References:

(Li et al., 29 Sep 2025, Assran et al., 11 Jun 2025, Bardes et al., 2024, Miao et al., 12 Feb 2026, Mur-Labadia et al., 15 Mar 2026)