
VJEPA-2: Latent World Model

Updated 15 February 2026
  • VJEPA-2 is a self-supervised latent world model based on the JEPA paradigm that predicts future video representations in latent space.
  • It employs a masked spatio-temporal prediction strategy with an EMA target encoder to stabilize training and capture intuitive physics priors.
  • When used as an inference-time reward, VJEPA-2 significantly boosts video generation performance, improving physical plausibility on benchmarks like PhysicsIQ.

VJEPA-2 is a self-supervised latent world model built on the Joint-Embedding Predictive Architecture (JEPA) paradigm, designed to encode and predict future latent representations of video sequences by leveraging large-scale unlabeled data. Unlike pixel-reconstruction approaches, VJEPA-2 operates entirely in latent space, with a training regime emphasizing masked prediction of spatio-temporal patches. Its design and use as a physics prior have enabled substantial advances in the physical plausibility of generated videos, notably improving the performance of state-of-the-art video generative models on benchmarks such as PhysicsIQ (Yuan et al., 15 Jan 2026).

1. Architectural Foundations

VJEPA-2 comprises several interlocking modules:

  • Context Encoder ($E_\theta$): A Vision Transformer (ViT)-style spatio-temporal transformer encodes an input video $x$ of length $T$ into a sequence of latent tokens $z \in \mathbb{R}^{T \cdot (H/P) \cdot (W/P) \times D}$, where $H, W$ denote frame dimensions, $P$ is the patch size, and $D$ is the embedding dimensionality. Only a random subset of spatial and temporal patches, termed the "context," is revealed during training.
  • Mask Tokens ($\Delta_m$): Learnable embeddings are placed at masked positions (both spatial and temporal) to indicate missing content and allow the predictor to attend to both observed context tokens and mask positions.
  • Predictor Network ($P_\phi$): Implemented as either a transformer or a multi-layer perceptron, this module predicts the withheld masked latent representations from the combined sequence $[E_\theta(x_\text{masked}); \Delta_m]$.
  • EMA Target Encoder ($\bar{E}_\theta$): A slow-moving exponential moving average of $E_\theta$ serves as the "teacher" network, processing the unmasked input to generate target latents. This strategy, mirroring the BYOL/MoCo family, stabilizes training and prevents collapse to trivial copying.

Notably, VJEPA-2 omits any explicit pixel-space decoder. The predictor reconstructs in the latent domain exclusively, prioritizing structured predictions over reconstructing raw appearance.
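As a concrete illustration of how these modules fit together, the tokenization and masking step can be sketched in NumPy. All sizes and the mask ratio below are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

# Shape sketch of VJEPA-2 tokenization and masking. All sizes and the
# 75% mask ratio are illustrative assumptions, not the paper's configuration.
T, H, W, P, D = 8, 224, 224, 16, 1024   # frames, height, width, patch size, embed dim
num_tokens = T * (H // P) * (W // P)    # one latent token per spatio-temporal patch

rng = np.random.default_rng(0)
z = rng.normal(size=(num_tokens, D))    # latent sequence from the context encoder E_theta
mask = rng.random(num_tokens) < 0.75    # contiguous block masking in practice; random here
context = z[~mask]                      # visible "context" tokens
mask_tokens = np.zeros((int(mask.sum()), D))  # learnable Delta_m embeddings (zeros here)

# The predictor P_phi attends to [context; mask_tokens] and regresses the
# masked latents; the EMA teacher encodes the full, unmasked clip as targets.
```

In the actual model the mask covers contiguous spatio-temporal blocks and $\Delta_m$ is learned; the sketch only fixes the shapes involved.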

2. Self-Supervised Training Objectives

Training is based on masked prediction using two central objectives:

  • Masked Denoising Regression Loss: At each training step, $M$ contiguous spatio-temporal patches are masked, leaving $C$ patches visible.

\mathcal{L}_{\text{pred}} = \left\| P_\phi\left(\Delta_m, E_\theta(x_\text{masked})\right) - \mathrm{sg}\left(\bar{E}_\theta(x)\right) \right\|_1

with $\mathrm{sg}(\cdot)$ denoting stop-gradient. Only $P_\phi$ and $E_\theta$ are updated. The network is thereby constrained to recover the masked (future) spatio-temporal dependencies solely from the visible partial context.
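A minimal numeric sketch of this objective, with random arrays standing in for the predictor output and the stop-gradient teacher latents (toy sizes, mean reduction assumed):

```python
import numpy as np

# Toy sketch of the masked denoising regression loss: an L1 penalty between
# the predictor's output at the M masked positions and the EMA teacher's
# latents. The teacher side is a plain constant here, mimicking sg(.).
rng = np.random.default_rng(0)
M, D = 32, 16                            # masked positions, embed dim (toy sizes)
pred = rng.normal(size=(M, D))           # stands in for P_phi(Delta_m, E_theta(x_masked))
target = rng.normal(size=(M, D))         # stands in for sg(E_bar_theta(x)) at masked slots
l_pred = np.abs(pred - target).mean()    # L1 regression, averaged over tokens and dims
```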

  • Contrastive Temporal Loss (optional): To further sharpen the embedding space and prevent collapse, a contrastive InfoNCE term is sometimes employed:

\mathcal{L}_{NCE} = -\mathbb{E}\left[\log\frac{\exp\left(\mathrm{sim}(\hat z_{t,i}, z_{t,i})/\tau\right)}{\sum_{(t',j)}\exp\left(\mathrm{sim}(\hat z_{t,i}, z_{t',j})/\tau\right)}\right]

where $\mathrm{sim}(a, b) = a^\top b / (\|a\|\,\|b\\|)$ and $\tau$ is a temperature parameter.
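The contrastive term can be sketched as a standard InfoNCE over cosine similarities; the batch layout (positives on the diagonal) and the temperature value are illustrative assumptions:

```python
import numpy as np

def info_nce(pred, target, tau=0.1):
    """InfoNCE sketch: each predicted latent should match its own target
    against all other latents in the batch (positives on the diagonal)."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = p @ t.T / tau               # pairwise sim(a, b) / tau
    # row-wise log-softmax; the loss keeps only the diagonal (positive) terms
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Matched prediction/target pairs yield a near-zero loss, while mismatched pairs are penalized, which is what keeps the embedding space from collapsing.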

  • Total Loss:

\mathcal{L}_{\text{VJEPA2}} = \mathcal{L}_{\text{pred}} + \lambda_{NCE}\,\mathcal{L}_{NCE}

Through minimization over millions of Internet-scale video clips, VJEPA-2 acquires an embedding that encodes object permanence, continuity, and predictable scene dynamics.
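The EMA teacher that supplies the regression targets is maintained by a simple momentum update, which can be sketched as follows (the 0.996 momentum is a typical value in this model family, not necessarily the paper's):

```python
import numpy as np

def ema_update(student_params, teacher_params, momentum=0.996):
    """BYOL/MoCo-style target update: the teacher E_bar_theta is an
    exponential moving average of the student E_theta's parameters.
    Called once per optimizer step; no gradients flow through it."""
    return [momentum * t + (1.0 - momentum) * s
            for s, t in zip(student_params, teacher_params)]
```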

3. Physics-Plausibility Inductive Biases

VJEPA-2’s self-supervised structure yields several crucial inductive biases:

  • Latent Future Prediction: Predicting spatio-temporal latent tokens from partial context compels the model to ignore unpredictable, decorrelated details (e.g., lighting flicker, non-persistent textures) and focus on consistent, causally driven structure (rigid-body motion, conservation, fluid flows).
  • Mask-Based Spatio-Temporal Modeling: The masking strategy encourages learning of object permanence and tracking, promoting continuity across space and time.
  • EMA Target Stabilization: The teacher-student arrangement prevents trivial copying and instead encourages the extraction of robust, predictive signals.

Collectively, these biases align the learned latent spaces with Newtonian regularities, despite the absence of explicit physics labels during training.

4. Inference-Time Physics Reward and Alignment

VJEPA-2 is converted at inference into a differentiable reward function, termed WMReward, quantifying “surprise” when a generated video deviates from the world-model predictions.

  • Sliding-Window Evaluation: For a generated video $x$ of length $L$, a window of size $C+M$ slides over positions $k \in K$:
    • Prediction: Future latents predicted from the context: $\hat{z}_k = P_\phi(\Delta_m, E_\theta(x^{k-C+1:k}))$
    • Actual Latents: Computed on the full window: $z_k = E_\theta(x^{k-C+1:k+M})$
    • Surprise Score: Over the $M$ masked future positions, the reward $r(x)$ is the negative mean cosine surprise:

    r(x) = -\frac{1}{|K|} \sum_{k \in K} \left[ 1 - \cos\left(\hat{z}_k^{\text{fut}}, z_k^{\text{fut}}\right) \right]

    Higher $r(x)$ values (lower surprise) indicate samples more consistent with VJEPA-2's learned world model, implying higher physics plausibility.

  • Integration with Generative Models:

    • Black-box Tilting: Generation distribution $p^*(x) \propto w(x)\,p(x)$, where $w(x)$ increases with $r(x)$.
    • Gradient-Based Guidance: Setting $w(x) = \exp(\lambda r(x))$ yields the guidance step:

    \nabla_{x_t} \log p_t^*(x_t) \approx \nabla_{x_t} \log p_t(x_t) + \lambda \nabla_{x_t} r\left(x_{0|t}(x_t)\right)

    Guidance is typically applied at select denoising steps for computational efficiency.
    • Best-of-N Sampling: Draw $N$ candidate generations, score each with $r(x)$, and select the best. Used either stand-alone or in combination with guidance.
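A sketch of the surprise-based reward and best-of-N selection, under the convention that higher reward means lower surprise; the array layouts are assumptions:

```python
import numpy as np

def wm_reward(pred_futs, true_futs):
    """WMReward sketch: negative mean cosine surprise over sliding windows.
    pred_futs[k] / true_futs[k] hold the M predicted vs. actual future
    latents for window k (each an (M, D) array)."""
    def cos(a, b):
        return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    surprise = np.mean([np.mean(1.0 - cos(p, t)) for p, t in zip(pred_futs, true_futs)])
    return -float(surprise)  # higher reward = more consistent with the world model

def best_of_n(candidates, reward_fn):
    """Best-of-N: score each candidate generation and keep the highest-reward one."""
    return max(candidates, key=reward_fn)
```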

5. Integration and Evaluation in Video Generation Pipelines

VJEPA-2 reward has been applied to video diffusion architectures (e.g., MAGI-1) as an inference-time alignment tool, without retraining the generative model.

  • Pipeline Example (MAGI-1 + VJEPA-2): At each generation chunk, candidate samples are generated; VJEPA-2 evaluates surprise; samples are selected based on the highest reward, or used to guide diffusion with the negative gradient of the surprise.

  • Reward in RL and Diffusion Guidance: While end-to-end RL fine-tuning is possible (via reward-augmented losses), principal gains are observed even without retraining by leveraging VJEPA-2 as an inference-time plug-in reward or guidance signal.

  • Empirical Results on PhysicsIQ:

    • V2V generation baseline (MAGI-1): 56.31
    • MAGI-1 + VJEPA-2: 62.64 (Δ=+6.33\Delta=+6.33)
    • I2V generation baseline: 30.23
    • MAGI-1 + VJEPA-2: 36.86 (Δ=+6.63\Delta=+6.63)

These results were achieved on the ICCV 2025 Perception Test PhysicsIQ Challenge, with the VJEPA-2-based system exceeding previous state of the art by 7.42% (Yuan et al., 15 Jan 2026, Yuan et al., 22 Oct 2025).

Comparative Table: VJEPA-2 Integration Results on PhysicsIQ

Model              V2V Score  I2V Score  Δ (V2V)  Δ (I2V)
MAGI-1 baseline    56.31      30.23      -        -
MAGI-1 + VJEPA-2   62.64      36.86      +6.33    +6.63
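The black-box tilting scheme from Section 4, used in pipelines like these, can be approximated by importance resampling over a candidate pool; the resampling step and the $\lambda$ value below are illustrative assumptions, not the paper's procedure:

```python
import numpy as np

def tilted_sample(candidates, rewards, lam=5.0, rng=None):
    """Sample a candidate with probability proportional to w(x) = exp(lam * r(x)),
    approximating the tilted distribution p*(x) ∝ w(x) p(x)."""
    if rng is None:
        rng = np.random.default_rng(0)
    r = np.asarray(rewards, dtype=float)
    w = np.exp(lam * (r - r.max()))      # subtract the max for numerical stability
    idx = rng.choice(len(candidates), p=w / w.sum())
    return candidates[idx]
```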

6. Ablation Studies and Robustness Analysis

Extensive ablations confirm the robustness and scaling properties of VJEPA-2:

  • Sliding-Window Hyperparameters: Performance is stable across a range of window sizes ($C+M$), context lengths ($C$), strides, and frame rates.
  • Backbone Scaling: Increasing model size (e.g., ViT-huge to ViT-giant) reliably improves PhysicsIQ scores under WMReward (BoN) evaluation.
    • Example: For MAGI-1 V2V with $C=16$, $M=16$, stride 16, 24 FPS:
      • ViT-huge: PhysicsIQ ≈ 57.1%
      • ViT-giant: PhysicsIQ ≈ 60.8%
  • Sampling and Guidance Modes:
    • Guidance only: +4.8 improvement (V2V)
    • BoN only: +3.5
    • Combined: +6.33, indicating strong synergy.

This suggests that inference-time use of stronger VJEPA-2 backbones produces more reliable rewards and physically plausible generations, independent of precise tuning parameters.

7. Significance and Broader Implications

VJEPA-2 demonstrates that self-supervised, masked-prediction latent world models can encode substantial intuitive physics priors, suitable for generalization across image-, video-, and text-conditioned video generation. Used as a differentiable “surprise-based” reward, VJEPA-2 facilitates inference-time alignment of black-box generative models, bridging the gap between visual fidelity and genuine physical plausibility in synthesized video.

These developments underscore the broader potential of latent world models as plug-in, domain-agnostic physicality priors for generative modeling. The approach obviates the need for explicit physics labels or model retraining, offering a lightweight, flexible tool for improving outcome plausibility across diverse temporal prediction problems (Yuan et al., 15 Jan 2026, Yuan et al., 22 Oct 2025).
