V-JEPA 2 Video Encoder
- The paper introduces V-JEPA 2, a self-supervised vision transformer that predicts internal features using a momentum-distilled teacher-student paradigm without pixel reconstruction.
- V-JEPA 2 employs a 3D ViT backbone with tubelet patch embeddings and 3D rotary positional encoding to capture detailed spatio-temporal dynamics from large-scale video datasets.
- The encoder achieves state-of-the-art transfer performance on multiple benchmarks and extends to robotic planning through its action-conditioned variant, V-JEPA 2-AC.
The V-JEPA 2 video encoder is a self-supervised vision transformer (ViT) architecture engineered to learn high-fidelity, generalizable video representations at scale via feature-space masked prediction. Emerging from the family of Joint-Embedding Predictive Architectures, V-JEPA 2 dispenses with explicit pixel reconstruction, contrastive losses, negative samples, and text supervision, operating solely by predicting the internal features of masked video regions under a momentum-distilled teacher-student paradigm. It is optimized for large-scale, diverse video corpora and demonstrates strong transfer to both spatial and temporal downstream tasks under frozen-backbone evaluation. The action-conditioned extension, V-JEPA 2-AC, enables world-model-based robotic planning from video alone.
1. Architectural Foundations and Feature Pipeline
V-JEPA 2 utilizes a 3D vision transformer backbone with “tubelet” patch embedding and 3D rotary positional encoding (3D-RoPE) to process video clips as spatio-temporal token grids. For a canonical input of 16 frames at 256×256 pixels, a strided 3D convolution produces non-overlapping 2×16×16 tubelets, yielding 8×16×16 = 2048 tokens per clip (a tokenization sketch follows the configuration list below). The backbone is deployed in multiple configurations:
- ViT-L/16: 24 layers, width 1024, 16 heads, ≈300M parameters
- ViT-H/16: 32 layers, width 1280, 16 heads, ≈600M parameters
- ViT-g/16: 40 layers, width 1408, 22 heads, ≈1B parameters
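To make the tokenization concrete, the minimal PyTorch sketch below shows how a 2×16×16 tubelet embedding maps a 16×256×256 clip to 8×16×16 = 2048 tokens. It is an illustration of the scheme described above, not the released implementation; module and variable names are ours.

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Sketch of a 2x16x16 tubelet patch embedding (names are illustrative)."""
    def __init__(self, embed_dim=1024, tubelet=(2, 16, 16), in_chans=3):
        super().__init__()
        # A strided 3D convolution produces one token per non-overlapping tubelet.
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                    # video: (B, C, T, H, W)
        x = self.proj(video)                     # (B, D, T/2, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # (B, N_tokens, D)

clip = torch.randn(1, 3, 16, 256, 256)           # 16 frames at 256x256
tokens = TubeletEmbed()(clip)
print(tokens.shape)                              # torch.Size([1, 2048, 1024]) -> 8*16*16 tokens
```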
A distinctive aspect is the absence of a [CLS] token: each tubelet is represented individually, facilitating fine-grained localization in feature space. The encoder’s outputs for the visible (unmasked) regions are processed by a dedicated small ViT-based predictor (12 layers, width 384, ≈22M parameters), which also consumes position-aware “mask tokens” representing the locations of the masked-out tubelets (Assran et al., 11 Jun 2025).
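The sketch below illustrates how the predictor’s input sequence could be assembled from projected context features plus shared mask tokens; the widths come from the text, but the interface and assembly logic are assumptions, and positional information is injected via 3D-RoPE inside the attention layers (not shown).

```python
import torch
import torch.nn as nn

D_ENC, D_PRED = 1024, 384          # ViT-L encoder width and predictor width from the text

proj_in = nn.Linear(D_ENC, D_PRED)                    # project encoder features to predictor width
mask_token = nn.Parameter(torch.zeros(1, 1, D_PRED))  # shared learnable mask token

def build_predictor_input(enc_feats, num_masked):
    """enc_feats: (B, N_visible, D_ENC) features of the visible tubelets.
    Returns the predictor input; token positions are handled by 3D-RoPE in attention."""
    B = enc_feats.shape[0]
    ctx = proj_in(enc_feats)                           # context tokens
    masks = mask_token.expand(B, num_masked, D_PRED)   # placeholders whose features are predicted
    return torch.cat([ctx, masks], dim=1)              # (B, N_visible + N_masked, D_PRED)
```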
Positional encoding employs 3D rotary embeddings (3D-RoPE) along the temporal, height, and width axes, preserving spatio-temporal structural information across blocks (Assran et al., 11 Jun 2025).
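A compact sketch of how rotary embeddings can be factorized over the temporal, height, and width axes follows; the per-axis channel split of a 64-dimensional head and the function signatures are illustrative assumptions, not the exact V-JEPA 2 implementation.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last dim of x (sketch)."""
    d = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None] * freqs[None, :]             # (N, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin               # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(q, coords, dims=(16, 24, 24)):
    """Split each head's channels into (t, h, w) groups and rotate each group by its own axis
    coordinate. q: (N, 64) rows for one head; coords: (N, 3) integer (t, h, w) positions."""
    dt, dh, _ = dims
    qt, qh, qw = q[..., :dt], q[..., dt:dt+dh], q[..., dt+dh:]
    return torch.cat([rope_1d(qt, coords[:, 0].float()),
                      rope_1d(qh, coords[:, 1].float()),
                      rope_1d(qw, coords[:, 2].float())], dim=-1)

# Example: rotary positions for the 8x16x16 token grid of a 16x256x256 clip.
coords = torch.stack(torch.meshgrid(torch.arange(8), torch.arange(16), torch.arange(16),
                                    indexing="ij"), dim=-1).reshape(-1, 3)
q_rot = rope_3d(torch.randn(coords.shape[0], 64), coords)
```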
2. Self-Supervised Objective and Training Strategy
V-JEPA 2’s defining mechanism is joint-embedding prediction in latent space. The student encoder $E_\theta$ receives only the subset of visible tubelets $x_v$, while the teacher encoder $E_{\bar\theta}$ (an exponential moving average of $E_\theta$) processes the full, unmasked clip $x$. The predictor $P_\phi$ is tasked to match the teacher’s features at the masked positions:

$$\mathcal{L}(\theta, \phi) = \frac{1}{|M|} \sum_{i \in M} \big\| P_\phi\big(E_\theta(x_v)\big)_i - \operatorname{sg}\big(E_{\bar\theta}(x)\big)_i \big\|_1,$$

where $\operatorname{sg}(\cdot)$ is the stop-gradient operator and $M$ is the set of masked token indices (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).
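Under these definitions, a training step reduces to a few lines. The sketch below assumes simple `student`, `teacher`, and `predictor` callables with the indicated shapes; their exact interfaces are ours, not the released API.

```python
import torch
import torch.nn.functional as F

def vjepa_loss(student, teacher, predictor, clip, visible_idx, masked_idx):
    """Minimal sketch of the joint-embedding objective (module interfaces are assumptions)."""
    # Student sees only the visible tubelets; teacher sees the full clip with no gradients.
    z_ctx = student(clip, keep=visible_idx)             # (B, N_visible, D)
    with torch.no_grad():                                # stop-gradient on the teacher branch
        z_tgt = teacher(clip)[:, masked_idx]             # (B, N_masked, D)
    z_pred = predictor(z_ctx, visible_idx, masked_idx)   # (B, N_masked, D)
    return F.l1_loss(z_pred, z_tgt)                      # feature-space L1 regression
```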
Masking follows a multiblock regime: each iteration creates (i) a “short-range” masked block of ≈15% area spanning the full temporal window, and (ii) a “long-range” block covering ≈70% area, resulting in an aggregate ≈90% masking ratio. This aggressive spatial-temporal masking forces the predictor to capture both local and global semantics, promoting robustness in motion and appearance encoding (Bardes et al., 2024, Assran et al., 11 Jun 2025).
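As a toy illustration of the multiblock regime described above, the sketch below samples one ≈15% and one ≈70% spatial block, each extended over the full temporal window, on the 8×16×16 token grid. The actual recipe samples several blocks per scale and differs in detail, so the exact masked fraction here will not match the ≈90% aggregate quoted above.

```python
import numpy as np

def multiblock_mask(T=8, H=16, W=16, rng=np.random):
    """Illustrative multiblock mask on the token grid: one ~15% and one ~70% spatial block,
    each spanning all T temporal positions (single block per scale, unlike the real recipe)."""
    mask = np.zeros((T, H, W), dtype=bool)
    for area in (0.15, 0.70):
        bh = max(1, int(round(H * np.sqrt(area))))
        bw = max(1, int(round(W * np.sqrt(area))))
        top = rng.randint(0, H - bh + 1)
        left = rng.randint(0, W - bw + 1)
        mask[:, top:top+bh, left:left+bw] = True     # masked across the full clip length
    return mask

m = multiblock_mask()
print(f"masked fraction: {m.mean():.2f}")            # union of the two blocks (they may overlap)
```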
The teacher weights $\bar\theta$ are updated via momentum as $\bar\theta \leftarrow m\,\bar\theta + (1 - m)\,\theta$, with the momentum coefficient $m$ kept close to 1 (Li et al., 29 Sep 2025).
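A minimal EMA update over parameter pairs is shown below; the momentum value and its schedule are left as an argument rather than pinned to a specific number.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, m):
    """Momentum update of the teacher toward the student: theta_bar <- m*theta_bar + (1-m)*theta."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)
```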
3. Training Regime, Data, and Optimization
V-JEPA 2 is pre-trained on internet-scale video datasets. The largest configuration is trained on VM22M (≈22M samples, ~1.7M hours of video drawn from YT-Temporal-1B, Kinetics-400/600/700, Something-Something-v2, and HowTo100M, plus an additional 1M ImageNet images) using stratified sampling for category balance (Assran et al., 11 Jun 2025).
Training proceeds with the following recipe (a learning-rate schedule sketch follows the list):
- AdamW optimizer (weight decay scheduled from 0.04),
- Learning rate warmup (12K steps to peak 1e-4), followed by constant rate and “cooldown” phase with gradual decay,
- Progressive curriculum (shorter/lower-res clips early, transitioning to longer/higher-res later),
- Light augmentations: random resized crop (scale 0.3–1.0), aspect-ratio jitter (0.75–1.35),
- Large batch sizes (3072), 240K–252K pretraining iterations (ViT-g/16 example).
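The warmup-constant-cooldown schedule can be expressed as a simple piecewise function. The breakpoints below mirror the numbers quoted in the list; the cooldown shape (linear decay) and its length are assumptions for illustration.

```python
def lr_at(step, peak=1e-4, warmup=12_000, total=252_000, cooldown_frac=0.25, final=0.0):
    """Piecewise LR: linear warmup to peak, constant plateau, then linear cooldown (shape assumed)."""
    cooldown_start = int(total * (1 - cooldown_frac))   # cooldown length is an assumption
    if step < warmup:
        return peak * step / warmup                     # linear warmup over the first 12K steps
    if step < cooldown_start:
        return peak                                     # constant plateau at the peak rate
    frac = (step - cooldown_start) / (total - cooldown_start)
    return peak + (final - peak) * frac                 # gradual decay during cooldown
```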
Compute is substantial: ViT-g/16 requires ≈5.3×10²¹ FLOPs over 252K steps on 16×256×256 clips, counting both student and teacher passes (Li et al., 29 Sep 2025).
4. Evaluation Protocols and Benchmark Results
Downstream assessment adheres to the frozen-backbone convention: the encoder is fixed, and representations are probed with a four-layer attentive pooling stack (three self-attention blocks and one cross-attention block) plus a linear head; a minimal probe sketch follows the benchmark list below. Key benchmarks include:
- Video understanding (classification): Kinetics-400, Something-Something-v2, COIN, Jester, Diving-48
- Image recognition: ImageNet-1K (with images replicated as 16-frame clips)
- Action anticipation: EPIC-Kitchens-100 (recall@5)
- Video QA: Alignment and end-to-end tuning with LLMs
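The probe structure can be sketched as follows: three self-attention blocks over the frozen patch tokens, one cross-attention block with a learnable query that pools them into a single vector, and a linear classifier. Layer internals (MLP widths, norms, head counts) are simplified relative to the actual probe, and the class count is illustrative.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Sketch of the frozen-backbone probe: 3 self-attn blocks, 1 cross-attn pooling block,
    and a linear head. Details are simplified; dims follow ViT-g/16 (width 1408)."""
    def __init__(self, dim=1408, num_heads=16, num_classes=400):   # num_classes: e.g. Kinetics-400
        super().__init__()
        self.self_blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for _ in range(3)]
        )
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable pooling query
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                  # feats: (B, N_tokens, dim) from the frozen encoder
        for blk in self.self_blocks:
            feats = blk(feats)
        q = self.query.expand(feats.shape[0], -1, -1)
        pooled, _ = self.cross_attn(q, feats, feats)               # one pooled vector per clip
        return self.head(pooled.squeeze(1))
```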
Notable results (ViT-g/16, 384 input):
- SSv2 top-1: 77.3%
- Kinetics-400: 87.3%
- COIN: 91.1%
- Jester: 97.8%
- ImageNet: 85.1%
- EPIC-Kitchens-100 action anticipation: 39.7 recall@5

V-JEPA 2 thus establishes the state of the art in the ≤1B-parameter class for frozen-encoder transfer (Assran et al., 11 Jun 2025).
5. Analysis, Ablations, and Scaling Behavior
Key ablation studies demonstrate:
- Feature-space prediction (L₁ loss) consistently outperforms pixel-space reconstruction (MSE), e.g., by +5% (Kinetics-400) and +1.5% (SSv2) (Bardes et al., 2024).
- Multiblock masking achieves >1% absolute gain over tubelet or causal masks.
- Replacing global average pooling with attentive pooling yields gains of up to +17.3% (Kinetics-400) and +16.1% (SSv2).
- Label efficiency is high: with only 5% labeled data, SSv2 accuracy with ViT-H/16₃₈₄ drops by only ~14% (vs. ~26% for the pixel-based VideoMAEv2).
- Training curves scale smoothly by both model depth and input resolution.
Scaling studies further support that data curation, longer training, and higher input resolution each yield additive improvements (data curation +1.4%, model scaling +1.5%, longer training +0.8%, higher res/time +0.7%) in top-1 video classification metrics (Assran et al., 11 Jun 2025).
6. Extensions: Action-Conditioned World Models and Robotics
V-JEPA 2-AC extends the architecture for action-conditioned video prediction and robot planning. The frozen V-JEPA 2 encoder extracts per-frame embeddings, while a separate action-conditioned transformer operates block-causally over sequences of robot states, deltas (actions), and frame features.
Objective terms:
- Teacher-forced prediction loss: given the encoder features of past frames together with the corresponding robot states and actions, the action-conditioned predictor regresses the encoder features of the next frame, $\mathcal{L}_{\text{TF}} = \sum_t \big\| P_\phi(z_{\le t}, s_{\le t}, a_{\le t}) - z_{t+1} \big\|_1$, where $z_t$ denotes the frozen encoder features of frame $t$.
- Model rollout loss: the predictor’s own outputs are fed back autoregressively over multiple steps and matched against encoder features, penalizing compounding errors.
Zero-shot deployment is demonstrated on Franka arms across laboratories using image-based planning via the Cross-Entropy Method (CEM); a planning-loop sketch follows below. V-JEPA 2-AC achieves average success rates above 70% on complex manipulation tasks, compared to 12.5% for diffusion-based baselines (Assran et al., 11 Jun 2025).
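The planning loop is an energy minimization with CEM: sample candidate action sequences, roll them out in the learned latent model, score them by distance to the goal embedding, and refit the sampling distribution to the elites. The sketch below assumes a `world_model.rollout` interface, goal and current embeddings from the frozen encoder, and illustrative defaults for horizon, action dimension, and population size; it is not the released planner.

```python
import torch

def cem_plan(world_model, z0, z_goal, horizon=2, act_dim=7,
             pop=256, elites=32, iters=10):
    """Cross-Entropy Method over action sequences in latent space (interfaces and defaults assumed).
    z0, z_goal: current and goal-frame embeddings from the frozen encoder."""
    mu = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)
    for _ in range(iters):
        actions = mu + std * torch.randn(pop, horizon, act_dim)            # sample candidate plans
        z_final = world_model.rollout(z0.expand(pop, *z0.shape), actions)  # predicted end embeddings
        energy = (z_final - z_goal).abs().flatten(1).mean(dim=1)           # L1 distance to goal features
        elite_idx = energy.topk(elites, largest=False).indices             # lowest-energy plans
        elite = actions[elite_idx]
        mu, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4               # refit sampling distribution
    return mu[0]                                                           # execute the first action (MPC-style)
```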
Visualizations highlight intuitive physical knowledge (e.g., object-handling persistence) and emergent camera calibration, suggesting rich structural representations beyond semantics.
7. Comparative Perspectives and Limitations
Relative to static-teacher approaches such as SALT, V-JEPA 2 demands higher compute and enforces strict architectural coupling between student and teacher models. While V-JEPA 2 achieves strong transfer performance, its accuracy–FLOPs frontier is strictly dominated by SALT, which operates with decoupled two-stage optimization and allows student models to scale independently for better compute allocation. Additionally, V-JEPA 2’s EMA-based joint embedding loss does not correlate directly with downstream accuracy, necessitating surrogate heuristics for checkpoint selection, whereas alternatives report strong direct correlation and greater transparency (Li et al., 29 Sep 2025).
A plausible implication is that future large-scale video representation learning may shift toward frozen-teacher architectures or hybridized approaches to reconcile efficiency and accuracy.
Key References:
- "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (Assran et al., 11 Jun 2025)
- "Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers" (Li et al., 29 Sep 2025)
- "Revisiting Feature Prediction for Learning Visual Representations from Video" (Bardes et al., 2024)