V-JEPA 2: Physics-Guided Video Embedding
- The paper introduces V-JEPA 2's main contribution: a self-supervised architecture that predicts dense, physics-based tokens to enhance video realism.
- It employs a dual loss combining diffusion and physics regression to jointly optimize generative accuracy and latent physical representation.
- The model integrates cross-attention to infuse high-dimensional physical cues into diffusion backbones, bridging visual perception with dynamic motion modeling.
V-JEPA 2 (Video Joint Embedding Predictive Architecture 2) is a large-scale, self-supervised video representation model developed to provide a dense, predictive encoding of video dynamics suitable for downstream tasks such as generative video modeling, physical reasoning, and action anticipation. The architecture addresses an open challenge in diffusion-based video generation: capturing implicit physical realism and genuine temporal structure remains unsolved for classical transformer-based models. As integrated into physics-aware generative frameworks, V-JEPA 2 supplies a learned, patch-level physical representation that bridges the gap between high-dimensional perceptual space and latent physical cues such as object motion, interaction likelihoods, and dynamic constraints.
1. Definition and Conceptual Overview
V-JEPA 2 is a large video joint embedding predictive architecture, trained in a self-supervised manner to encode spatiotemporal video patches into continuous, predictive embeddings. The model is designed to distill predictive representations of physical dynamics from video input, providing high-level features such as object velocities, collision and interaction probabilities, and environmental cues directly from raw visual streams. In the context of advanced video generation, V-JEPA 2 serves as an auxiliary teacher network or embedding source, allowing generative architectures — particularly diffusion-based DiT models — to access dense, learned physics priors (Satish et al., 7 Jan 2026).
This representation is not hand-crafted for specific physical signals, but is trained to model the evolution of video content across space and time, such that the learned embeddings naturally encode physical structures. The architecture supports patch-wise encodings of the video, and the output tokens are of high dimensionality (e.g., 2048 spatiotemporal patches × 1408 channels), readily adaptable to cross-attention modules in diffusion backbones.
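The cited token count of 2048 spatiotemporal patches is consistent with a V-JEPA-style tubelet tokenization. The arithmetic can be sketched as follows; the clip geometry and tubelet size here are illustrative assumptions, not values stated in the source:

```python
# Patch-count arithmetic for a tubelet tokenizer (assumed geometry).
frames, height, width = 16, 256, 256   # hypothetical input clip
t_tube, p_size = 2, 16                 # hypothetical tubelet: 2 frames x 16 x 16 px

n_patches = (frames // t_tube) * (height // p_size) * (width // p_size)
embed_dim = 1408                       # channel width cited in the text

token_shape = (n_patches, embed_dim)   # (2048, 1408) token tensor per clip
```

Under these assumptions, one clip yields a (2048, 1408) token tensor, matching the dimensionality quoted above for the cross-attention interface.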
2. Architectural Components and Embedding Mechanism
The canonical pipeline utilizing V-JEPA 2 begins with standard video encoding: real video sequences are mapped via a pretrained VAE to a compressed latent representation. V-JEPA 2, operating on these video clips, computes patch-level physics tokens $p^\star \in \mathbb{R}^{N \times D}$ (with $N = 2048$ and $D = 1408$ in referenced configurations). The tokens are implicitly predictive — encouraging the network to represent information necessary for forecasting unseen regions of the video (joint embedding and context).
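The forward-noising step that produces the intermediate latents consumed downstream can be sketched as follows. The beta schedule, timestep count, and latent shape are hypothetical choices for illustration; the source does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-beta DDPM schedule with cumulative products alpha_bar.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Hypothetical VAE latent: channels x frames x height x width.
z0 = rng.standard_normal((4, 8, 32, 32))
t = 500
eps = rng.standard_normal(z0.shape)

# Noised latent at timestep t, as in the pipeline's zt.
zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

The noised latent `zt` is what both the denoising backbone and the physics-token predictor receive as input.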
A dedicated predictor network $P$ is trained to regress these V-JEPA 2 tokens directly from intermediate noisy latents (e.g., those occurring in the iterative denoising process of diffusion models). This predictor typically consists of stacked 3D convolutional layers for preliminary spatiotemporal feature extraction, followed by transformer blocks that use injected textual and timestep embeddings for conditioning (Satish et al., 7 Jan 2026).
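The timestep embedding injected for conditioning is commonly sinusoidal in diffusion models; the following minimal sketch assumes that form (the source does not specify the embedding used):

```python
import numpy as np

def timestep_embedding(t: int, dim: int) -> np.ndarray:
    """Sinusoidal embedding of a scalar timestep (assumed form)."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to ~1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

t_emb = timestep_embedding(500, 128)  # one 128-dim conditioning vector
```

A vector like `t_emb` would be added or concatenated into the predictor's transformer blocks alongside the projected text embedding.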
The predicted physics tokens are inserted into the temporal transformer blocks of the generative DiT backbone via specialized cross-attention layers. This injection occurs after temporal self-attention, allowing the generative model to modulate its temporal predictions using the high-level physics encodings. A learned gating scalar controls the magnitude of this guidance.
3. Training Objective and Multi-Task Optimization
The system is trained with a dual loss:
- Diffusion Loss: the standard denoising diffusion noise-prediction objective, quantifying the model's ability to reconstruct the noise added to the latent at timestep $t$:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, \epsilon,\, t}\left[\, \lVert \epsilon - \hat{\epsilon}_\theta(z_t, t, c) \rVert_2^2 \,\right]$$

- Physics Regression Loss: mean-squared error between the predictor's output $\hat{p}$ and the ground-truth V-JEPA 2 physics tokens $p^\star$:

$$\mathcal{L}_{\text{phy}} = \lVert \hat{p} - p^\star \rVert_2^2$$

The total optimization blends these objectives with a balancing coefficient $\lambda_{\text{phy}}$:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda_{\text{phy}}\, \mathcal{L}_{\text{phy}}$$

This structure enables the joint learning of generative capacity and physical representation recovery (Satish et al., 7 Jan 2026).
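A toy computation of the blended objective, using a hypothetical $\lambda_{\text{phy}} = 0.1$ (the source does not report the coefficient's value) and random stand-ins for the noise and token tensors:

```python
import numpy as np

rng = np.random.default_rng(1)

eps, eps_hat = rng.standard_normal((2, 64))       # true vs. predicted noise
p_star, p_hat = rng.standard_normal((2, 16, 8))   # true vs. predicted tokens

L_diff = np.mean((eps - eps_hat) ** 2)   # diffusion (noise-prediction) loss
L_phy = np.mean((p_hat - p_star) ** 2)   # physics-token regression loss

lambda_phy = 0.1                          # hypothetical balancing coefficient
L_total = L_diff + lambda_phy * L_phy
```

Both terms are plain mean-squared errors, so the blend reduces to a single weighted sum that backpropagates through the denoiser and the predictor jointly.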
4. Cross-Attention Injection and Physics Guidance
Physics tokens modulate the video generator by cross-attention at the temporal level. For each spatial patch sequence in time, a scaled dot-product attention operation is performed, with predicted V-JEPA 2 tokens as keys and values:

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

where $Q$ is derived from the current hidden state for a patch, and $K$, $V$ are linear projections of the predicted physics tokens $\hat{p}$. A residual connection and gating scalar $g$ control the strength of the update:

$$h' = h + g \cdot \text{Attn}(Q, K, V)$$

This mechanism enables temporally coherent motion that respects the high-level physics encoded by V-JEPA 2, and allows the diffusion process to leverage rich, predictive cues beyond simple appearance modeling.
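A minimal NumPy sketch of this gated cross-attention update. The dimensions, random projections, and zero-initialized gate are illustrative assumptions (zero initialization leaves the backbone's output unchanged at the start of training, a common practice for injected adapters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d = 64
h = rng.standard_normal((10, d))       # temporal hidden states for one patch
p_hat = rng.standard_normal((32, d))   # predicted physics tokens (assumed 32)

# Linear projections (random stand-ins for learned weights).
W_q, W_k, W_v = rng.standard_normal((3, d, d)) / np.sqrt(d)
Q, K, V = h @ W_q, p_hat @ W_k, p_hat @ W_v

# Scaled dot-product attention over the physics tokens.
attn = softmax(Q @ K.T / np.sqrt(d)) @ V

g = 0.0                 # gating scalar, zero-initialized (assumption)
h_out = h + g * attn    # residual, gated update
```

With `g = 0.0` the update is the identity; as the gate is learned, the physics guidance is blended in gradually.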
5. Model Validation, Empirical Observations, and Limitations
Empirical validation on the OpenVid-1M subset demonstrated stable convergence of both the diffusion and physics regression losses over standard training epochs. Key qualitative findings include:
- Predicted physics tokens visually correlate with video motion patterns.
- The joint model exhibits no significant instability or loss oscillations.
- Early analysis suggests the ability to recover nontrivial physics representations directly from corrupted diffusion latents.
The architecture faces computational challenges due to the large number of high-dimensional tokens (e.g., 2048 × 1408), resulting in increased memory consumption and the need for gradient checkpointing and mixed precision training. Future directions involve compressing the physics tokens (e.g., bottleneck/pooling to smaller sizes), ablation of cross-attention and predictor modules to quantify physical impact, and extending support to larger or alternative generative backbones (Satish et al., 7 Jan 2026).
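One compression option mentioned above, pooling the physics-token sequence before cross-attention, can be sketched as follows; the pooling factor of 8 is a hypothetical choice:

```python
import numpy as np

# Ground-truth-sized physics-token tensor: 2048 patches x 1408 channels.
p = np.random.default_rng(3).standard_normal((2048, 1408))

# Average-pool groups of consecutive tokens to shrink the sequence 8x.
factor = 8
p_pooled = p.reshape(2048 // factor, factor, 1408).mean(axis=1)
# p_pooled has shape (256, 1408): 8x fewer key/value tokens per clip.
```

Reducing the key/value sequence from 2048 to 256 tokens shrinks cross-attention cost proportionally, at the price of coarser spatiotemporal physics cues.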
6. Context: Role in Physics-Guided Video Generation
V-JEPA 2 constitutes a general paradigm for embedding learned physical representations within generative video models:
- By regressing V-JEPA 2 features from denoising latents and enabling direct guidance via physico-temporal attention, diffusion models are endowed with a form of latent "world-modeling" capacity.
- This approach is data- and modality-agnostic — it does not require explicit simulators or hand-crafted labels for training physical tasks, and is therefore universally compatible with large-scale self-supervised video datasets.
- Explicit evaluation against physics benchmarks (e.g., VideoPhy-2) is planned; a plausible implication is that this methodology enables significant improvements in semantic adherence and physical correctness without retraining the generative core.
7. Technical Summary and Pseudocode
The overall pipeline is as follows:
```
z0  = VAE.encode(video_clip)
t   = sample_random_timestep()
eps = sample_noise()
zt  = sqrt(alpha[t]) * z0 + sqrt(1 - alpha[t]) * eps

h_vis   = Conv3DEnc(zt)
h_fused = TransformerEncoder([h_vis, project(text), t_emb])
hat_p   = TransformerDecoder(Q_phys, h_fused)   # V-JEPA 2 regression

eps_hat = DiT_with_phys(zt, t, text, hat_p)

L_diff  = ||eps - eps_hat||^2
L_phy   = ||hat_p - p_star||^2                  # p_star: true V-JEPA 2 tokens
L_total = L_diff + lambda_phy * L_phy
L_total.backward()
optimizer.step()
```
V-JEPA 2–guided video generation represents a principled approach for injecting dense, predictive physics representations into modern T2V diffusion models, enabling temporally consistent and physically plausible video synthesis across a wide array of scenarios and datasets.