VJEPA-2: Latent World Model
- VJEPA-2 is a self-supervised latent world model based on the JEPA paradigm that predicts future video representations in latent space.
- It employs a masked spatio-temporal prediction strategy with an EMA target encoder to stabilize training and capture intuitive physics priors.
- When used as an inference-time reward, VJEPA-2 significantly boosts video generation performance, improving physical plausibility on benchmarks like PhysicsIQ.
VJEPA-2 is a self-supervised latent world model built on the Joint-Embedding Predictive Architecture (JEPA) paradigm, designed to encode and predict the future latent representations of video sequences by leveraging large-scale unlabeled data. Unlike pixel-reconstruction approaches, VJEPA-2 operates entirely in latent space, with a training regime emphasizing masked prediction of spatio-temporal patches. Its design and use as a physics prior have enabled substantial advances in the physical plausibility of generated videos, notably improving the performance of state-of-the-art video generative models on benchmarks such as PhysicsIQ (Yuan et al., 15 Jan 2026).
1. Architectural Foundations
VJEPA-2 comprises several interlocking modules:
- Context Encoder ($E_\theta$): A Vision Transformer (ViT)-style spatio-temporal transformer encodes an input video $x \in \mathbb{R}^{T \times H \times W \times 3}$ of length $T$ into a sequence of latent tokens $z \in \mathbb{R}^{N \times D}$, where $H, W$ denote frame dimensions, $p$ is the patch size, and $D$ is the embedding dimensionality. Only a random subset of spatial and temporal patches, termed the "context," is revealed during training.
- Mask Tokens ($m$): Learnable embeddings are placed at masked positions (both spatial and temporal) to indicate missing content and allow the predictor to attend to both observed context tokens and mask positions.
- Predictor Network ($P_\phi$): This module, implemented as either a transformer or a multi-layer perceptron, predicts the withheld masked latent representations from the combined sequence of context tokens and mask tokens.
- EMA Target Encoder ($\bar{E}_{\bar\theta}$): A slow-moving exponential moving average of $E_\theta$ serves as the "teacher" network, processing the unmasked input to generate target latents. This strategy, mirroring the BYOL/MoCo family, stabilizes training and prevents collapse to trivial copying.
Notably, VJEPA-2 omits any explicit pixel-space decoder. The predictor reconstructs in the latent domain exclusively, prioritizing structured predictions over reconstructing raw appearance.
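The interplay of these modules can be illustrated with a minimal numpy sketch, in which a single linear projection stands in for each ViT backbone and all sizes are toy values; this is a schematic of the masked student/teacher setup, not the actual VJEPA-2 architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8    # embedding dimensionality (toy value)
N = 16   # number of spatio-temporal patch tokens (toy value)

# A single linear projection stands in for the ViT-style backbone.
W_student = rng.normal(size=(D, D))
W_teacher = W_student.copy()   # EMA target encoder starts as a copy of the student

def encode(W, tokens):
    return tokens @ W.T

def ema_update(W_teacher, W_student, tau=0.996):
    # Slow-moving exponential average of student weights (BYOL/MoCo-style teacher).
    return tau * W_teacher + (1.0 - tau) * W_student

tokens = rng.normal(size=(N, D))            # stand-in patch embeddings of a clip
mask = rng.permutation(N) < N // 2          # exactly half the positions are masked
mask_token = np.zeros(D)                    # a learnable embedding in the real model
student_in = np.where(mask[:, None], mask_token, tokens)

z_pred = encode(W_student, student_in)      # context encoder + (toy) predictor
z_target = encode(W_teacher, tokens)        # teacher sees the full, unmasked input
loss = float(np.mean((z_pred[mask] - z_target[mask]) ** 2))  # masked positions only

W_teacher = ema_update(W_teacher, W_student)
```

In the real model the predictor is a separate network over context plus mask tokens; the sketch collapses encoder and predictor into one projection purely to show the data flow.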
2. Self-Supervised Training Objectives
Training is based on masked prediction using two central objectives:
- Masked Denoising Regression Loss: At each training step, a contiguous set $M$ of spatio-temporal patches is masked, leaving the remaining patches visible as context $x_{\text{ctx}}$:

$$\mathcal{L}_{\text{reg}} = \frac{1}{|M|} \sum_{i \in M} \big\| P_\phi\big(E_\theta(x_{\text{ctx}}), m\big)_i - \operatorname{sg}\big(\bar{E}_{\bar\theta}(x)\big)_i \big\|_2^2,$$

with $\operatorname{sg}(\cdot)$ denoting stop-gradient. Only $E_\theta$ and $P_\phi$ are updated; the target encoder follows by EMA. The network is thereby constrained to recover future (masked) spatio-temporal dependencies solely from the visible partial context.
- Contrastive Temporal Loss (optional): To further sharpen the embedding space and prevent collapse, a contrastive InfoNCE term is sometimes employed:

$$\mathcal{L}_{\text{con}} = -\frac{1}{|M|} \sum_{i \in M} \log \frac{\exp\big(\operatorname{sim}(\hat{z}_i, z_i)/\tau\big)}{\sum_{j} \exp\big(\operatorname{sim}(\hat{z}_i, z_j)/\tau\big)},$$

where $\operatorname{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$, $\hat{z}_i$ and $z_i$ are predicted and target latents, and $\tau$ is a temperature parameter.
- Total Loss: $\mathcal{L} = \mathcal{L}_{\text{reg}} + \lambda \mathcal{L}_{\text{con}}$, with $\lambda$ weighting the optional contrastive term.
By minimizing this objective over millions of Internet-scale video clips, VJEPA-2 acquires an embedding space that encodes object permanence, continuity, and predictable scene dynamics.
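The two objectives can be sketched in numpy as follows, using random stand-in latents and an assumed weighting `lam`; the actual weighting, masking, and similarity details follow the paper rather than this toy:

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_regression_loss(z_pred, z_target_sg):
    # Masked denoising regression: MSE against stop-gradient teacher targets.
    return float(np.mean(np.sum((z_pred - z_target_sg) ** 2, axis=-1)))

def info_nce_loss(z_pred, z_target_sg, temperature=0.1):
    # InfoNCE over matched (prediction, target) pairs; positives on the diagonal.
    a = z_pred / np.linalg.norm(z_pred, axis=-1, keepdims=True)
    b = z_target_sg / np.linalg.norm(z_target_sg, axis=-1, keepdims=True)
    logits = a @ b.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Stand-in latents: targets are the predictions plus a little noise.
z_pred = rng.normal(size=(8, 16))
z_target = z_pred + 0.1 * rng.normal(size=(8, 16))

lam = 0.5   # assumed weighting of the optional contrastive term
total = l2_regression_loss(z_pred, z_target) + lam * info_nce_loss(z_pred, z_target)
```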
3. Physics-Plausibility Inductive Biases
VJEPA-2’s self-supervised structure yields several crucial inductive biases:
- Latent Future Prediction: Predicting spatio-temporal latent tokens from partial context compels the model to ignore unpredictable, decorrelated details (e.g., lighting flicker, non-persistent textures) and focus on consistent, causally driven structure (rigid-body motion, conservation, fluid flows).
- Mask-Based Spatio-Temporal Modeling: The masking strategy encourages learning of object permanence and tracking, promoting continuity across space and time.
- EMA Target Stabilization: The teacher-student arrangement prevents trivial copying and instead encourages the extraction of robust, predictive signals.
Collectively, these biases align the learned latent spaces with Newtonian regularities, despite the absence of explicit physics labels during training.
4. Inference-Time Physics Reward and Alignment
VJEPA-2 is converted at inference into a differentiable reward function, termed WMReward, quantifying “surprise” when a generated video deviates from the world-model predictions.
- Sliding-Window Evaluation: For a generated video of length $T$, a window of size $w$ slides over each start position $t$; the first $c$ frames of each window serve as context.
- Prediction: Predicted future latents from the context: $\hat{z}_{t+c:t+w} = P_\phi\big(E_\theta(x_{t:t+c}), m\big)$.
- Actual Latents: Computed on the full window: $z_{t:t+w} = \bar{E}_{\bar\theta}(x_{t:t+w})$.
- Surprise Score: For the masked future positions $F = \{t+c, \dots, t+w-1\}$, the reward is the mean cosine agreement (low agreement corresponds to high "surprise"), averaged over windows:

$$R = \frac{1}{\#\text{windows}} \sum_{t} \frac{1}{|F|} \sum_{i \in F} \cos\big(\hat{z}_i, z_i\big).$$

- Higher values of $R$ indicate samples more consistent with VJEPA-2’s learned world model, implying higher physics plausibility.
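The sliding-window scoring can be sketched as below; `naive_predictor` is a placeholder for the VJEPA-2 predictor, and the window, context, and stride values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine(a, b):
    # Row-wise cosine similarity between two (n, D) latent arrays.
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def wm_reward(latents, predict_fn, window=8, context=4, stride=2):
    """Mean cosine agreement between predicted and actual future latents,
    averaged over sliding windows. `latents` has shape (T, D)."""
    T = latents.shape[0]
    scores = []
    for t in range(0, T - window + 1, stride):
        ctx = latents[t : t + context]               # visible context frames
        future = latents[t + context : t + window]   # withheld future frames
        pred = predict_fn(ctx, window - context)     # predicted future latents
        scores.append(np.mean(cosine(pred, future)))
    return float(np.mean(scores))

# Stand-in predictor: repeats the last context latent; the real system would
# call the VJEPA-2 predictor network here.
def naive_predictor(ctx, n_future):
    return np.repeat(ctx[-1:], n_future, axis=0)

video_latents = rng.normal(size=(16, 32))   # toy per-frame latents
reward = wm_reward(video_latents, naive_predictor)
```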
Integration with Generative Models:
- Black-box Tilting: The generation distribution is tilted to $\tilde{p}(x) \propto p(x)\exp\big(\alpha R(x)\big)$, where the weight on a sample increases with its reward $R(x)$ and $\alpha > 0$ controls the tilting strength.
- Gradient-Based Guidance: Differentiability of $R$ with respect to the sample yields a guidance step, $x \leftarrow x + \eta\, \nabla_x R(x)$, interleaved with the denoising updates. Guidance is typically applied at select denoising steps for computational efficiency.
- Best-of-N Sampling: Draw $N$ candidate generations, score each with $R$, and select the highest-scoring one. Used both stand-alone and in combination with guidance.
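Best-of-N selection and black-box tilting can be sketched as follows; the toy reward below is a stand-in for WMReward, and `alpha` and the candidate count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def best_of_n(candidates, reward_fn):
    # Score every candidate with the world-model reward and keep the best.
    scores = np.array([reward_fn(c) for c in candidates])
    return candidates[int(np.argmax(scores))], float(scores.max())

def tilted_sample(candidates, reward_fn, alpha=5.0):
    # Black-box tilting: resample candidates with weights proportional to exp(alpha * R).
    scores = np.array([reward_fn(c) for c in candidates])
    w = np.exp(alpha * (scores - scores.max()))   # shift for numerical stability
    w /= w.sum()
    return candidates[int(rng.choice(len(candidates), p=w))]

# Toy reward: negative distance to a "physically plausible" target latent
# (stand-in for the WMReward surprise score).
target = np.ones(4)
reward = lambda x: -float(np.linalg.norm(x - target))

cands = [rng.normal(size=4) for _ in range(8)]
best, best_score = best_of_n(cands, reward)
tilted = tilted_sample(cands, reward)
```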
5. Integration and Evaluation in Video Generation Pipelines
VJEPA-2 reward has been applied to video diffusion architectures (e.g., MAGI-1) as an inference-time alignment tool, without retraining the generative model.
Pipeline Example (MAGI-1 + VJEPA-2): At each generation chunk, candidate samples are generated; VJEPA-2 evaluates surprise; samples are selected based on the highest reward, or used to guide diffusion with the negative gradient of the surprise.
Reward in RL and Diffusion Guidance: While end-to-end RL fine-tuning is possible (via reward-augmented losses), principal gains are observed even without retraining by leveraging VJEPA-2 as an inference-time plug-in reward or guidance signal.
Empirical Results on PhysicsIQ:
- V2V generation baseline (MAGI-1): 56.31
- MAGI-1 + VJEPA-2 (V2V): 62.64 (+6.33)
- I2V generation baseline: 30.23
- MAGI-1 + VJEPA-2 (I2V): 36.86 (+6.63)
These results were achieved on the ICCV 2025 Perception Test PhysicsIQ Challenge, with the VJEPA-2-based system exceeding previous state of the art by 7.42% (Yuan et al., 15 Jan 2026, Yuan et al., 22 Oct 2025).
Comparative Table: VJEPA-2 Integration Results on PhysicsIQ
| Model | V2V Score | I2V Score | Δ (V2V) | Δ (I2V) |
|---|---|---|---|---|
| MAGI-1 baseline | 56.31 | 30.23 | — | — |
| MAGI-1 + VJEPA-2 | 62.64 | 36.86 | +6.33 | +6.63 |
6. Ablation Studies and Robustness Analysis
Extensive ablations confirm the robustness and scaling properties of VJEPA-2:
- Sliding-Window Hyperparameters: Performance is stable across a range of window sizes, context lengths, strides, and frame rates.
- Backbone Scaling: Increasing model size (e.g., ViT-huge to ViT-giant) reliably improves PhysicsIQ scores under WMReward(BoN) evaluation.
- Example: For MAGI-1 V2V (stride = 16, 24 FPS):
- ViT-huge: PhysicsIQ ≈ 57.1%
- ViT-giant: PhysicsIQ ≈ 60.8%
- Sampling and Guidance Modes:
- Guidance only: +4.8 improvement (V2V)
- BoN only: +3.5
- Combined: +6.33, indicating strong synergy.
This suggests that inference-time use of stronger VJEPA-2 backbones produces more reliable rewards and physically plausible generations, independent of precise tuning parameters.
7. Significance and Broader Implications
VJEPA-2 demonstrates that self-supervised, masked-prediction latent world models can encode substantial intuitive physics priors, suitable for generalization across image-, video-, and text-conditioned video generation. Used as a differentiable “surprise-based” reward, VJEPA-2 facilitates inference-time alignment of black-box generative models, bridging the gap between visual fidelity and genuine physical plausibility in synthesized video.
These developments underscore the broader potential of latent world models as plug-in, domain-agnostic physicality priors for generative modeling. The approach obviates the need for explicit physics labels or model retraining, offering a lightweight, flexible tool for improving outcome plausibility across diverse temporal prediction problems (Yuan et al., 15 Jan 2026, Yuan et al., 22 Oct 2025).