
VJEPA-2: Latent World Model

Updated 15 February 2026
  • VJEPA-2 is a self-supervised latent world model based on the JEPA paradigm that predicts future video representations in latent space.
  • It employs a masked spatio-temporal prediction strategy with an EMA target encoder to stabilize training and capture intuitive physics priors.
  • When used as an inference-time reward, VJEPA-2 significantly boosts video generation performance, improving physical plausibility on benchmarks like PhysicsIQ.

VJEPA-2 is a self-supervised latent world model built on the Joint-Embedding Predictive Architecture (JEPA) paradigm, designed to encode and predict future latent representations of video sequences by leveraging large-scale unlabeled data. Unlike pixel-reconstruction approaches, VJEPA-2 operates entirely in latent space, with a training regime emphasizing masked prediction of spatio-temporal patches. Its design and use as a physics prior have enabled substantial advances in the physical plausibility of generated videos, notably improving the performance of state-of-the-art video generative models on benchmarks such as PhysicsIQ (Yuan et al., 15 Jan 2026).

1. Architectural Foundations

VJEPA-2 comprises several interlocking modules:

  • Context Encoder ($E_\theta$): A Vision Transformer (ViT)-style spatio-temporal transformer encodes an input video $x$ of length $T$ into a sequence of latent tokens $z \in \mathbb{R}^{T \cdot (H/P) \cdot (W/P) \times D}$, where $H, W$ denote frame dimensions, $P$ is the patch size, and $D$ is the embedding dimensionality. Only a random subset of spatial and temporal patches, termed the "context," is revealed during training.
  • Mask Tokens ($\Delta_m$): Learnable embeddings are placed at masked positions (both spatial and temporal) to indicate missing content and allow the predictor to attend to both observed context tokens and mask positions.
  • Predictor Network ($P_\phi$): Implemented as either a transformer or a multi-layer perceptron, this module predicts the withheld masked latent representations from the combined sequence $[E_\theta(x_\text{masked}); \Delta_m]$.
  • EMA Target Encoder ($\bar{E}_\theta$): A slow-moving exponential moving average of $E_\theta$ serves as the "teacher" network, processing the unmasked input to generate target latents. This strategy, mirroring the BYOL/MoCo family, stabilizes training and prevents collapse to trivial copying.

Notably, VJEPA-2 omits any explicit pixel-space decoder. The predictor reconstructs in the latent domain exclusively, prioritizing structured predictions over reconstructing raw appearance.
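As a concrete illustration of how these modules fit together, the tokenization and masking step can be sketched in NumPy. All sizes and the mask ratio below are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

# Shape sketch of VJEPA-2 tokenization and masking. All sizes and the
# 75% mask ratio are illustrative assumptions, not the paper's configuration.
T, H, W, P, D = 8, 224, 224, 16, 1024   # frames, height, width, patch size, embed dim
num_tokens = T * (H // P) * (W // P)    # one latent token per spatio-temporal patch

rng = np.random.default_rng(0)
z = rng.normal(size=(num_tokens, D))    # latent sequence from the context encoder E_theta
mask = rng.random(num_tokens) < 0.75    # contiguous block masking in practice; random here
context = z[~mask]                      # visible "context" tokens
mask_tokens = np.zeros((int(mask.sum()), D))  # learnable Delta_m embeddings (zeros here)

# The predictor P_phi attends to [context; mask_tokens] and regresses the
# masked latents; the EMA teacher encodes the full, unmasked clip as targets.
```

In the actual model the mask covers contiguous spatio-temporal blocks and $\Delta_m$ is learned; the sketch only fixes the shapes involved.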

2. Self-Supervised Training Objectives

Training is based on masked prediction using two central objectives:

  • Masked Denoising Regression Loss: At each training step, $M$ contiguous spatio-temporal patches are masked, leaving $C$ patches visible.

\mathcal{L}_{\text{pred}} = \left\| P_\phi\left(\Delta_m, E_\theta(x_\text{masked})\right) - \mathrm{sg}\left(\bar{E}_\theta(x)\right) \right\|_1

with $\mathrm{sg}(\cdot)$ denoting stop-gradient. Only $P_\phi$ and $E_\theta$ are updated. The network is thereby constrained to recover the masked (future) spatio-temporal dependencies solely from the visible partial context.
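A minimal numeric sketch of this objective, with random arrays standing in for the predictor output and the stop-gradient teacher latents (toy sizes, mean reduction assumed):

```python
import numpy as np

# Toy sketch of the masked denoising regression loss: an L1 penalty between
# the predictor's output at the M masked positions and the EMA teacher's
# latents. The teacher side is a plain constant here, mimicking sg(.).
rng = np.random.default_rng(0)
M, D = 32, 16                            # masked positions, embed dim (toy sizes)
pred = rng.normal(size=(M, D))           # stands in for P_phi(Delta_m, E_theta(x_masked))
target = rng.normal(size=(M, D))         # stands in for sg(E_bar_theta(x)) at masked slots
l_pred = np.abs(pred - target).mean()    # L1 regression, averaged over tokens and dims
```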

  • Contrastive Temporal Loss (optional): To further sharpen the embedding space and prevent collapse, a contrastive InfoNCE term is sometimes employed:

\mathcal{L}_{NCE} = -\mathbb{E}\left[\log\frac{\exp\left(\mathrm{sim}(\hat z_{t,i}, z_{t,i})/\tau\right)}{\sum_{(t',j)}\exp\left(\mathrm{sim}(\hat z_{t,i}, z_{t',j})/\tau\right)}\right]

where $\mathrm{sim}(a, b) = a^\top b / (\|a\|\,\|b\\|)$ and $\tau$ is a temperature parameter.
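The contrastive term can be sketched as a standard InfoNCE over cosine similarities; the batch layout (positives on the diagonal) and the temperature value are illustrative assumptions:

```python
import numpy as np

def info_nce(pred, target, tau=0.1):
    """InfoNCE sketch: each predicted latent should match its own target
    against all other latents in the batch (positives on the diagonal)."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = p @ t.T / tau               # pairwise sim(a, b) / tau
    # row-wise log-softmax; the loss keeps only the diagonal (positive) terms
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Matched prediction/target pairs yield a near-zero loss, while mismatched pairs are penalized, which is what keeps the embedding space from collapsing.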

  • Total Loss:

\mathcal{L}_{\text{VJEPA2}} = \mathcal{L}_{\text{pred}} + \lambda_{NCE}\,\mathcal{L}_{NCE}

Through minimization over millions of Internet-scale video clips, VJEPA-2 acquires an embedding that encodes object permanence, continuity, and predictable scene dynamics.
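The EMA teacher that supplies the regression targets is maintained by a simple momentum update, which can be sketched as follows (the 0.996 momentum is a typical value in this model family, not necessarily the paper's):

```python
import numpy as np

def ema_update(student_params, teacher_params, momentum=0.996):
    """BYOL/MoCo-style target update: the teacher E_bar_theta is an
    exponential moving average of the student E_theta's parameters.
    Called once per optimizer step; no gradients flow through it."""
    return [momentum * t + (1.0 - momentum) * s
            for s, t in zip(student_params, teacher_params)]
```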

3. Physics-Plausibility Inductive Biases

VJEPA-2’s self-supervised structure yields several crucial inductive biases:

  • Latent Future Prediction: Predicting spatio-temporal latent tokens from partial context compels the model to ignore unpredictable, decorrelated details (e.g., lighting flicker, non-persistent textures) and focus on consistent, causally driven structure (rigid-body motion, conservation, fluid flows).
  • Mask-Based Spatio-Temporal Modeling: The masking strategy encourages learning of object permanence and tracking, promoting continuity across space and time.
  • EMA Target Stabilization: The teacher-student arrangement prevents trivial copying and instead encourages the extraction of robust, predictive signals.

Collectively, these biases align the learned latent spaces with Newtonian regularities, despite the absence of explicit physics labels during training.

4. Inference-Time Physics Reward and Alignment

VJEPA-2 is converted at inference into a differentiable reward function, termed WMReward, quantifying “surprise” when a generated video deviates from the world-model predictions.

  • Sliding-Window Evaluation: For a generated video $x$ of length $L$, a window of size $C+M$ slides over positions $k \in K$:
    • Prediction: Future latents predicted from the context: $\hat{z}_k = P_\phi(\Delta_m, E_\theta(x^{k-C+1:k}))$
    • Actual Latents: Computed on the full window: $z_k = E_\theta(x^{k-C+1:k+M})$
    • Surprise Score: Over the $M$ masked future positions, the reward $r(x)$ is the negative mean cosine surprise:

    r(x) = -\frac{1}{|K|} \sum_{k \in K} \left[ 1 - \cos\left(\hat{z}_k^{\text{fut}}, z_k^{\text{fut}}\right) \right]

    Higher $r(x)$ values (lower surprise) indicate samples more consistent with VJEPA-2's learned world model, implying higher physics plausibility.

  • Integration with Generative Models:

    • Black-box Tilting: Generation distribution $p^*(x) \propto w(x)\,p(x)$, where $w(x)$ increases with $r(x)$.
    • Gradient-Based Guidance: Setting $w(x) = \exp(\lambda r(x))$ yields the guidance step:

    \nabla_{x_t} \log p_t^*(x_t) \approx \nabla_{x_t} \log p_t(x_t) + \lambda \nabla_{x_t} r\left(x_{0|t}(x_t)\right)

    Guidance is typically applied at select denoising steps for computational efficiency.
    • Best-of-N Sampling: Draw $N$ candidate generations, score each with $r(x)$, and select the best. Used either stand-alone or in combination with guidance.
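A sketch of the surprise-based reward and best-of-N selection, under the convention that higher reward means lower surprise; the array layouts are assumptions:

```python
import numpy as np

def wm_reward(pred_futs, true_futs):
    """WMReward sketch: negative mean cosine surprise over sliding windows.
    pred_futs[k] / true_futs[k] hold the M predicted vs. actual future
    latents for window k (each an (M, D) array)."""
    def cos(a, b):
        return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    surprise = np.mean([np.mean(1.0 - cos(p, t)) for p, t in zip(pred_futs, true_futs)])
    return -float(surprise)  # higher reward = more consistent with the world model

def best_of_n(candidates, reward_fn):
    """Best-of-N: score each candidate generation and keep the highest-reward one."""
    return max(candidates, key=reward_fn)
```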

5. Integration and Evaluation in Video Generation Pipelines

VJEPA-2 reward has been applied to video diffusion architectures (e.g., MAGI-1) as an inference-time alignment tool, without retraining the generative model.

  • Pipeline Example (MAGI-1 + VJEPA-2): At each generation chunk, candidate samples are generated; VJEPA-2 evaluates surprise; samples are selected based on the highest reward, or used to guide diffusion with the negative gradient of the surprise.

  • Reward in RL and Diffusion Guidance: While end-to-end RL fine-tuning is possible (via reward-augmented losses), principal gains are observed even without retraining by leveraging VJEPA-2 as an inference-time plug-in reward or guidance signal.

  • Empirical Results on PhysicsIQ:

    • V2V generation baseline (MAGI-1): 56.31
    • MAGI-1 + VJEPA-2: 62.64 (Δ=+6.33\Delta=+6.33)
    • I2V generation baseline: 30.23
    • MAGI-1 + VJEPA-2: 36.86 (Δ=+6.63\Delta=+6.63)

These results were achieved on the ICCV 2025 Perception Test PhysicsIQ Challenge, with the VJEPA-2-based system exceeding previous state of the art by 7.42% (Yuan et al., 15 Jan 2026, Yuan et al., 22 Oct 2025).

Comparative Table: VJEPA-2 Integration Results on PhysicsIQ

Model              V2V Score  I2V Score  Δ (V2V)  Δ (I2V)
MAGI-1 baseline    56.31      30.23      -        -
MAGI-1 + VJEPA-2   62.64      36.86      +6.33    +6.63
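The black-box tilting scheme from Section 4, used in pipelines like these, can be approximated by importance resampling over a candidate pool; the resampling step and the $\lambda$ value below are illustrative assumptions, not the paper's procedure:

```python
import numpy as np

def tilted_sample(candidates, rewards, lam=5.0, rng=None):
    """Sample a candidate with probability proportional to w(x) = exp(lam * r(x)),
    approximating the tilted distribution p*(x) ∝ w(x) p(x)."""
    if rng is None:
        rng = np.random.default_rng(0)
    r = np.asarray(rewards, dtype=float)
    w = np.exp(lam * (r - r.max()))      # subtract the max for numerical stability
    idx = rng.choice(len(candidates), p=w / w.sum())
    return candidates[idx]
```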

6. Ablation Studies and Robustness Analysis

Extensive ablations confirm the robustness and scaling properties of VJEPA-2:

  • Sliding-Window Hyperparameters: Performance is stable across a range of window sizes ($C+M$), context lengths ($C$), strides, and frame rates.
  • Backbone Scaling: Increasing model size (e.g., ViT-huge to ViT-giant) reliably improves PhysicsIQ scores under WMReward (BoN) evaluation.
    • Example: For MAGI-1 V2V with $C=16$, $M=16$, stride 16, 24 FPS:
      • ViT-huge: PhysicsIQ ≈ 57.1%
      • ViT-giant: PhysicsIQ ≈ 60.8%
  • Sampling and Guidance Modes:
    • Guidance only: +4.8 improvement (V2V)
    • BoN only: +3.5
    • Combined: +6.33, indicating strong synergy.

This suggests that inference-time use of stronger VJEPA-2 backbones produces more reliable rewards and physically plausible generations, independent of precise tuning parameters.

7. Significance and Broader Implications

VJEPA-2 demonstrates that self-supervised, masked-prediction latent world models can encode substantial intuitive physics priors, suitable for generalization across image-, video-, and text-conditioned video generation. Used as a differentiable “surprise-based” reward, VJEPA-2 facilitates inference-time alignment of black-box generative models, bridging the gap between visual fidelity and genuine physical plausibility in synthesized video.

These developments underscore the broader potential of latent world models as plug-in, domain-agnostic physicality priors for generative modeling. The approach obviates the need for explicit physics labels or model retraining, offering a lightweight, flexible tool for improving outcome plausibility across diverse temporal prediction problems (Yuan et al., 15 Jan 2026, Yuan et al., 22 Oct 2025).
