V-JEPA ViT: Self-Supervised Video Modeling
- The paper introduces a self-supervised vision transformer that predicts masked latent features instead of pixels, achieving state-of-the-art results in video understanding.
- It employs the JEPA framework with tubelet masking and an EMA teacher to prevent representational collapse and encourage spatiotemporal inference.
- Empirical results demonstrate robust performance in video classification, multimodal alignment, and robotic planning with compute-optimal training schemes.
The V-JEPA Vision Transformer family encompasses a class of self-supervised, video- and image-centric models that learn rich latent representations by predicting masked region embeddings in feature space, rather than reconstructing pixels. This architecture—rooted in the Joint-Embedding Predictive Architecture (JEPA) framework—has demonstrated state-of-the-art results in large-scale video understanding, human action anticipation, multimodal alignment, and robotics. Below, key aspects of the V-JEPA ViT family, including its core principles, architectural innovations, learning objectives, empirical performance, and efficiency trade-offs, are delineated.
1. Conceptual Foundations and Architecture
V-JEPA models realize the JEPA principle for video: learn general-purpose spatiotemporal representations via mask-predictive feature regression using Vision Transformers (ViT) as backbones (Assran et al., 11 Jun 2025, Bardes et al., 15 Feb 2024). The canonical V-JEPA pipeline comprises three primary modules:
- Masking Operator: Video clips of shape $T \times H \times W$ are partitioned into tubelets (2 × 16 × 16 temporal-spatial blocks), and a high-proportion random subset of the tubelets is masked.
- ViT-based Encoder $E_\theta$: Processes visible (unmasked) tubelets to produce token embeddings.
- Predictor $P_\phi$: Given masked token positions (as learnable embeddings) and encoder outputs, predicts the latent features of the masked tubelets.
A teacher encoder $\bar{E}_{\bar{\theta}}$, conventionally an exponential moving average (EMA) of the student encoder weights, is used to furnish prediction targets, with gradient flow stopped to prevent trivial solutions.
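The following minimal PyTorch sketch illustrates how these three modules interact at toy scale; the module definitions, sizes, and names (student, teacher, predictor, mask_token) are illustrative stand-ins rather than the released V-JEPA code.

```python
# Toy-scale sketch of the V-JEPA pipeline wiring; module definitions and names
# are illustrative stand-ins, not the released implementation.
import torch
import torch.nn as nn

B, N, D = 2, 128, 256                                          # clips, tubelet tokens, embed width
make_layer = lambda: nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
student = nn.TransformerEncoder(make_layer(), num_layers=2)    # encoder E_theta (stand-in)
teacher = nn.TransformerEncoder(make_layer(), num_layers=2)    # EMA copy of the student
predictor = nn.TransformerEncoder(make_layer(), num_layers=1)  # narrow predictor P_phi
mask_token = nn.Parameter(torch.zeros(1, 1, D))                # learnable mask embedding

tokens = torch.randn(B, N, D)                           # tubelet embeddings for one batch
perm = torch.randperm(N)
masked, visible = perm[: int(0.9 * N)], perm[int(0.9 * N):]    # ~90% of tubelets masked

ctx = student(tokens[:, visible])                       # encode only the visible tubelets
queries = mask_token.expand(B, masked.numel(), D)       # one query per masked position
pred = predictor(torch.cat([ctx, queries], dim=1))[:, ctx.size(1):]

with torch.no_grad():                                   # stop-gradient: teacher only supplies targets
    targets = teacher(tokens)[:, masked]
# The regression loss on (pred, targets) and the EMA update are sketched in Section 2.
```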
In advanced variants, such as V-JEPA 2, token embeddings incorporate a factorized 3D rotary positional encoding (3D-RoPE), splitting each token embedding into temporal and spatial segments and applying 1D RoPE to each independently (Assran et al., 11 Jun 2025). Each consecutive pair of embedding coordinates is rotated by a block of the form

$$
R(\theta_i) = \begin{pmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{pmatrix},
$$

where each block effects a planar rotation by an angle $\theta_i$ determined by the token's position along the corresponding temporal or spatial axis.
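A hedged sketch of the factorized scheme follows: the embedding is split into per-axis segments and a standard 1D RoPE rotation is applied using each token's coordinate along that axis. The three-way split and the frequency base are illustrative assumptions, not the exact V-JEPA 2 configuration.

```python
# Illustrative factorized 3D-RoPE: split the embedding into (t, h, w) segments
# and rotate each segment with 1D RoPE using the token's position on that axis.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (..., L, d) with even d; pos: (L,) integer positions."""
    d = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))  # (d/2,)
    angles = pos[:, None].float() * freqs[None, :]                            # (L, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Planar rotation of each (x1, x2) coordinate pair by its angle.
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: (B, L, D) token embeddings; t, h, w: (L,) tubelet coordinates."""
    D = x.shape[-1]
    s = D // 3 - (D // 3) % 2                          # even-sized segment per axis
    xt, xh, xw, rest = x[..., :s], x[..., s:2*s], x[..., 2*s:3*s], x[..., 3*s:]
    # Leftover dims (if D is not divisible by the split) pass through unrotated.
    return torch.cat([rope_1d(xt, t), rope_1d(xh, h), rope_1d(xw, w), rest], dim=-1)

# Example: coordinates for an 8 x 14 x 14 grid of tubelets.
t, h, w = torch.meshgrid(torch.arange(8), torch.arange(14), torch.arange(14), indexing="ij")
x = torch.randn(2, 8 * 14 * 14, 256)
x_rot = rope_3d(x, t.flatten(), h.flatten(), w.flatten())
```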
2. Learning Objective and Collapse Prevention
The central V-JEPA loss is a masked latent regression applied over the masked tubelet positions. Given student encoder $E_\theta$, EMA teacher $\bar{E}_{\bar{\theta}}$, predictor $P_\phi$, and the set of masked tubelet positions $M$, the loss is

$$
\mathcal{L}(\theta, \phi) = \frac{1}{|M|} \sum_{i \in M} \big\| P_\phi\big(E_\theta(x_{\mathrm{vis}})\big)_i - \operatorname{sg}\big[\bar{E}_{\bar{\theta}}(x)_i\big] \big\|_1 .
$$

This $\ell_1$ (or $\ell_2$) regression is restricted to masked regions. The stop-gradient operation $\operatorname{sg}[\cdot]$ on the teacher deters representational collapse, augmented by the use of a predictor head and EMA teacher (Bardes et al., 15 Feb 2024, Li et al., 29 Sep 2025).
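In code, the objective and the teacher update reduce to a few lines. The sketch below reuses the illustrative naming from Section 1 and assumes an $\ell_1$ regression with the reported EMA decay.

```python
# Masked-latent regression loss and EMA teacher update (illustrative sketch).
import torch

def vjepa_loss(pred: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """pred, teacher_feats: (B, M, D) latents at the M masked tubelet positions."""
    return torch.nn.functional.l1_loss(pred, teacher_feats.detach())  # stop-gradient on the teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.99925):
    """teacher <- decay * teacher + (1 - decay) * student, parameter by parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.lerp_(p_s, 1.0 - decay)   # in-place interpolation toward the student weights
```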
In the action-conditioned extension (V-JEPA 2-AC), the model is augmented for robotic planning: the predictor is conditioned on the control inputs and rolls the latent state forward,

$$
\hat{z}_{t+1} = P_\phi\big(z_t, a_t, s_t\big),
$$

where $a_t$ and $s_t$ are action and state vectors, respectively (Assran et al., 11 Jun 2025).
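One possible realization of such an action-conditioned predictor is sketched below; the conditioning scheme (prepending embedded action and state tokens), the dimensions, and the module names are assumptions for illustration, not the released model.

```python
# Illustrative action-conditioned latent predictor in the spirit of V-JEPA 2-AC.
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    def __init__(self, dim: int = 256, action_dim: int = 7, state_dim: int = 7):
        super().__init__()
        self.embed_action = nn.Linear(action_dim, dim)
        self.embed_state = nn.Linear(state_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, z_t: torch.Tensor, a_t: torch.Tensor, s_t: torch.Tensor) -> torch.Tensor:
        """z_t: (B, N, D) frame latents; a_t: (B, action_dim); s_t: (B, state_dim)."""
        cond = torch.stack([self.embed_action(a_t), self.embed_state(s_t)], dim=1)  # (B, 2, D)
        out = self.blocks(torch.cat([cond, z_t], dim=1))
        return out[:, cond.size(1):]        # predicted latents for the next timestep

# Toy usage: predict the next latent state from one frame latent, action, and state.
pred_next = ActionConditionedPredictor()(torch.randn(1, 64, 256), torch.randn(1, 7), torch.randn(1, 7))
```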
3. Vision Transformer Backbone and Tokenization
V-JEPA models utilize scalable ViT encoders. Key configurations include ViT-L/16 (24 layers, 1024-dim, 16 heads), ViT-H/16 (32 layers, 1280-dim, 16 heads), and ViT-g/16 (40 layers, 1408-dim, 22 heads) (Assran et al., 11 Jun 2025, Bardes et al., 15 Feb 2024). Tubelet tokenization is employed: each video is decomposed into 2×16×16 3D patches that are vectorized and linearly projected to $d$-dimensional tokens, with $d$ the backbone embedding width. Multi-block spatio-temporal masking, often covering ~90% of the tubelets, compels the model to perform high-level inference rather than trivial local interpolation (Bardes et al., 15 Feb 2024, Li et al., 29 Sep 2025).
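Tubelet tokenization of this kind is commonly implemented as a strided 3D convolution whose kernel equals the tubelet size; the sketch below follows that convention and is not necessarily the exact V-JEPA patch-embedding code.

```python
# 2x16x16 tubelet tokenization via a strided 3D convolution (common convention).
import torch
import torch.nn as nn

embed_dim = 1024                                           # e.g. ViT-L/16 width
patchify = nn.Conv3d(in_channels=3, out_channels=embed_dim,
                     kernel_size=(2, 16, 16), stride=(2, 16, 16))

clip = torch.randn(1, 3, 16, 224, 224)                     # (B, C, T, H, W)
tokens = patchify(clip)                                    # (1, 1024, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)                 # (1, 8*14*14, 1024) token sequence
```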
4. Pretraining, Optimization, and Scaling
Pretraining leverages web-scale collections such as VideoMix22M (22M samples, ~1M hours), combining large-scale video (SSv2, Kinetics-700, HowTo100M, Curated-YT1B) and static images (ImageNet as 16-frame pseudo-videos) (Assran et al., 11 Jun 2025). VideoMix2M (2M videos) is used in prior work (Bardes et al., 15 Feb 2024).
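Treating a static image as a short clip can be done by repeating it along the time axis; the snippet below shows that folding under the assumption of straightforward frame repetition.

```python
# Folding a static image into the video pipeline as a 16-frame pseudo-clip.
import torch

image = torch.randn(3, 224, 224)                       # (C, H, W) after standard augmentation
pseudo_clip = image.unsqueeze(1).repeat(1, 16, 1, 1)   # (C, T=16, H, W)
```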
Optimizers: AdamW with carefully tuned weight decay (0.04), progressive learning rate schedules (e.g., warmup to 5.25e-4, cooldown to 1e-6), and large batch sizes (3000) are standard. EMA decay for the teacher is set tightly (0.99925). Resolution is progressively increased during training. Similar settings hold for the SALT regime, which uses two-stage training: (1) pixel-reconstruction pretraining of the teacher, which is subsequently frozen, and (2) masked-latent prediction for the student (Li et al., 29 Sep 2025).
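The reported hyperparameters translate into a schedule along the following lines; the warmup length, total step count, and cosine cooldown shape are assumptions for illustration.

```python
# Sketch of the reported optimization setup: AdamW (weight decay 0.04), linear
# warmup to 5.25e-4, cosine cooldown to 1e-6. Schedule shape is an assumption.
import math
import torch

def lr_at(step: int, total: int, warmup: int, peak: float = 5.25e-4, floor: float = 1e-6) -> float:
    if step < warmup:                                  # linear warmup to the peak LR
        return peak * step / max(1, warmup)
    t = (step - warmup) / max(1, total - warmup)       # cosine cooldown to the floor
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(8, 8)                          # placeholder for the ViT encoder
opt = torch.optim.AdamW(model.parameters(), lr=5.25e-4, weight_decay=0.04)
for g in opt.param_groups:
    g["lr"] = lr_at(step=10_000, total=90_000, warmup=12_000)
```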
5. Extensions: Video-Language and Robotic Planning
Video-Language Alignment: V-JEPA 2 integrates with frozen LLMs (Qwen2-7B, Llama 3.1-8B) via a projector (MLP or attentive pooler), enabling visual instruction tuning (image captioning, QA, video captioning/QA) over up to 88.5M video-text pairs (Assran et al., 11 Jun 2025). The result is state-of-the-art performance across video-centric QA (PerceptionTest 84.0%, TempCompass 76.9%).
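A minimal projector of the MLP variety might look as follows; the hidden sizes (ViT-g width 1408, an assumed LLM width of 3584) and the choice of a plain two-layer MLP rather than the attentive pooler are illustrative assumptions.

```python
# Illustrative projector mapping frozen V-JEPA visual tokens into an LLM's
# embedding space for visual instruction tuning.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1408, 3584                 # assumed ViT-g width -> assumed LLM width
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

video_tokens = torch.randn(1, 1568, vision_dim)  # frozen encoder output for one clip
llm_inputs = projector(video_tokens)             # prepended to the text token embeddings
```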
Robotic World Modeling (V-JEPA 2-AC): Action-conditioned world models are trained with minimal robot data (e.g., 62 hours from Droid at 4 fps, 256² resolution), enabling zero-shot deployment on Franka arms for manipulation tasks. Notably, the model generalizes to novel labs and tasks without task-specific reward or in-situ data collection (Assran et al., 11 Jun 2025).
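Planning with such a world model can proceed by sampling action sequences, rolling the latent state forward with the action-conditioned predictor, and keeping the candidates whose predicted latents lie closest to a goal-image latent. The CEM-style loop below, reusing the illustrative predictor from Section 2, is a sketch with assumed horizon, population, and cost choices, not the paper's exact planner.

```python
# Goal-conditioned planning sketch: sample action sequences, roll out the
# action-conditioned predictor in latent space, refit to the elite candidates.
import torch

@torch.no_grad()
def plan(predictor, z_t, s_t, z_goal, horizon=5, samples=128, elites=16, iters=3, action_dim=7):
    """z_t, z_goal: (1, N, D) current and goal latents; s_t: (1, state_dim)."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        actions = mean + std * torch.randn(samples, horizon, action_dim)   # candidate sequences
        costs = []
        for k in range(samples):
            z = z_t
            for t in range(horizon):                                       # roll out in latent space
                z = predictor(z, actions[k, t].unsqueeze(0), s_t)
            costs.append((z - z_goal).abs().mean())                        # distance to the goal latent
        top = torch.stack(costs).argsort()[:elites]                        # keep the elites
        mean, std = actions[top].mean(dim=0), actions[top].std(dim=0)
    return mean[0]                                                         # execute the first action
```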
6. Empirical Performance and Comparative Analysis
V-JEPA models set state-of-the-art benchmarks in both frozen and fine-tuned regimes:
| Model | SSv2 Top-1 (%) | Kinetics-400 (%) | Epic-Kitchens-100 R@5 (%) | PerceptionTest (%) | Avg. (video, %) |
|---|---|---|---|---|---|
| V-JEPA 2 ViT-g/384 | 77.3 | 87.3 | 39.7 | 84.0 | 88.2 |
| V-JEPA ViT-H/16_384 | 72.2 | 81.9 | — | — | — |
| V-JEPA 2 vs VideoMAE (frozen) | +3 to +6 pp | +3 to +4 pp | — | — | — |
| SALT ViT-g | 76.2 | — | — | — | — |
V-JEPA 2 exceeds prior task-specific video models (Epic-Kitchens-100: 39.7 R@5, 44% gain over previous SOTA), and achieves robust transferability to static-image tasks (ImageNet: 85.1%) (Assran et al., 11 Jun 2025, Bardes et al., 15 Feb 2024). On robotics, zero-shot picking/placing achieves up to 80% success (pick-and-place cup) with no target-environment data collection.
7. SALT: Static Teacher Advances in Latent Prediction
SALT (Static-teacher Asymmetric Latent Training) revisits the EMA-based collapse prevention of standard V-JEPA: it first trains a ViT teacher via pixel reconstruction, then freezes it and fits a larger student via masked-latent regression against the frozen teacher's features (Li et al., 29 Sep 2025); a minimal sketch of this two-stage recipe appears at the end of this section. Key empirics:
- At fixed compute, SALT surpasses V-JEPA 2 in downstream accuracy; e.g., SALT ViT-L achieves 74.9% on SSv2 at 1.2×10²¹ FLOPs vs. V-JEPA 2's 73.7% at 1.9×10²¹ FLOPs.
- Student performance is robust to teacher size and quality; high-performing students emerge even with modest teachers.
- Compute-optimal allocation favors brief teacher training (~40k steps) and majority allocation to student (e.g., 200k steps).
SALT yields simpler, decoupled objectives and transparent loss metrics for model selection, and it consistently dominates the FLOPs-accuracy Pareto frontier across the evaluated scales.
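A compact, runnable rendering of the two-stage recipe on toy tensors is given below; the tiny MLP modules, the single training step per stage, and the omission of input masking for the student are simplifications for illustration, not the paper's implementation.

```python
# Two-stage SALT sketch: (1) pixel-reconstruction teacher, (2) frozen-teacher
# latent regression for a larger student. Toy modules and tensors throughout.
import torch
import torch.nn as nn

D, N, B = 256, 128, 2
tokens = torch.randn(B, N, D)                      # toy tubelet embeddings
pixels = torch.randn(B, N, 2 * 16 * 16 * 3)        # toy flattened tubelet pixels
masked = torch.randperm(N)[: int(0.9 * N)]

# Stage 1: train a small teacher + decoder with masked pixel reconstruction (one toy step).
teacher = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
decoder = nn.Linear(D, pixels.size(-1))
opt_t = torch.optim.AdamW(list(teacher.parameters()) + list(decoder.parameters()), lr=1e-3)
recon_loss = (decoder(teacher(tokens))[:, masked] - pixels[:, masked]).abs().mean()
opt_t.zero_grad()
recon_loss.backward()
opt_t.step()

# Stage 2: freeze the teacher; a larger student regresses its latents at masked positions.
teacher.requires_grad_(False)
student = nn.Sequential(nn.Linear(D, 2 * D), nn.GELU(), nn.Linear(2 * D, D))
predictor = nn.Linear(D, D)
opt_s = torch.optim.AdamW(list(student.parameters()) + list(predictor.parameters()), lr=1e-3)
with torch.no_grad():
    targets = teacher(tokens)[:, masked]           # frozen-teacher targets, no EMA needed
latent_loss = (predictor(student(tokens))[:, masked] - targets).abs().mean()
opt_s.zero_grad()
latent_loss.backward()
opt_s.step()
```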
8. Significance and Theoretical Insights
V-JEPA’s feature-prediction regimen, as opposed to pixel-reconstruction or contrastive paradigms, rapidly learns semantic, temporally-sensitive representations with high transfer utility. The masking strategy (multi-block, spatio-temporal, high-ratio) compels global reasoning and robustness. A plausible implication is that predicting in learned latent spaces rather than pixels or fixed features provides superior invariance and reduces the need for strong augmentations (Bardes et al., 15 Feb 2024, Assran et al., 2023).
The transition from EMA to static/frozen teacher in SALT streamlines scalability and architecture search, while empirical evidence suggests the bulk of the compute should be reallocated away from teacher pretraining to student fitting (Li et al., 29 Sep 2025).
References
- "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (Assran et al., 11 Jun 2025)
- "Revisiting Feature Prediction for Learning Visual Representations from Video" (Bardes et al., 15 Feb 2024)
- "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (Assran et al., 2023)
- "Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers" (Li et al., 29 Sep 2025)