PhysVideoGenerator & V-JEPA 2 Integration
- PhysVideoGenerator is a physics-aware generative framework that integrates V-JEPA 2 to extract and inject high-dimensional physics feature tokens from video data.
- It leverages a dedicated PredictorP network and cross-attention layers to map diffusion latents to physics token predictions, ensuring realistic temporal dynamics.
- Multi-task optimization combines denoising and physics regression losses, stabilizing training and producing physically plausible synthetic video outputs.
A Video Joint Embedding Predictive Architecture (V-JEPA 2) is a specialized high-dimensional joint embedding model, designed for predictive, self-supervised representation learning over temporal video signals. V-JEPA 2 is deployed as a pretrained physics feature extractor in PhysVideoGenerator frameworks to enable explicit physics guidance for generative models, particularly in the context of physics-aware video synthesis (Satish et al., 7 Jan 2026). The architecture and integration details summarized below follow the specifications and empirical findings in the PhysVideoGenerator technical report.
1. V-JEPA 2: Definition, Representational Scope, and Feature Topology
V-JEPA 2 is a temporal joint embedding architecture trained to produce high-level physics-centric feature representations from raw video clips. Its core output is a set of patchwise joint tokens: for each video clip of shape (channels × frames × spatial dimensions), V-JEPA 2 emits a token tensor of shape 2048 × 1408 (tokens × feature dimension), where each "token" implicitly aggregates motion features (object velocity, collision likelihood, gravity direction) over a spatiotemporal patch. The architecture itself is originally introduced and trained by Assran et al. (Assran et al., 11 Jun 2025); PhysVideoGenerator utilizes it as a fixed, non-trainable feature source.
These tokens encode latent physics descriptors and do not decompose into hand-engineered primitives. Instead, they are leveraged in downstream generative models for tasks requiring consistent, realistic physical dynamics across the temporal video window.
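The 2048 × 1408 token topology is consistent with simple spatiotemporal patch arithmetic. A minimal sketch, assuming a 16-frame 256×256 clip and a 2×16×16 tubelet (these clip and patch sizes are illustrative assumptions, not stated in this section):

```python
# Sketch of V-JEPA 2-style token-grid arithmetic. The clip size and tubelet
# dimensions below are illustrative assumptions; only the resulting
# 2048 x 1408 token shape is taken from the report.
def token_grid(frames=16, height=256, width=256,
               tubelet_t=2, patch_h=16, patch_w=16, embed_dim=1408):
    """Return (number of spatiotemporal tokens, feature dimension)."""
    n_tokens = (frames // tubelet_t) * (height // patch_h) * (width // patch_w)
    return n_tokens, embed_dim

print(token_grid())  # (2048, 1408)
```

Each token thus summarizes one 2-frame × 16×16-pixel tubelet, which is why motion cues such as velocity and collision likelihood can be aggregated per token.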
2. Integration into Physics-Guided Generative Video Models
PhysVideoGenerator instantiates V-JEPA 2 as a physics prior injected into a DiT-based generator (specifically, Latte-1). The workflow consists of four steps:
- For each training video clip, compute the V-JEPA 2 physics tokens once and cache them.
- During training, at each diffusion step, corrupt the clean video latents with noise to produce noisy latents.
- A dedicated "PredictorP" subnetwork regresses predicted physics tokens from the noisy latents (plus text and timestep embeddings).
- The predicted physics tokens are then injected into the temporal Transformer attention via a learnable cross-attention sublayer.
This mechanism makes physically salient intermediate states recoverable from noisy diffusion latents and directly conditions the video generator on the V-JEPA 2 feature space, providing a fine-grained inductive bias for physically plausible synthesis.
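The workflow above can be sketched end-to-end in toy form. Every function, shape, and noise schedule below is an illustrative stand-in (scaled far down from the 2048 × 1408 tokens), not the report's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def vjepa2_features(video):
    """Frozen V-JEPA 2 extractor (stub returning fixed-shape tokens)."""
    return rng.standard_normal((4, 16))

feature_cache = {}

def cached_features(clip_id, video):
    # Step 1: compute physics tokens once per clip and cache them.
    if clip_id not in feature_cache:
        feature_cache[clip_id] = vjepa2_features(video)
    return feature_cache[clip_id]

def forward_diffuse(z0, t, noise, T=1000):
    # Step 2: corrupt clean video latents at diffusion step t (toy linear schedule).
    a = 1.0 - t / T
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise

W = rng.standard_normal((64, 64)) * 0.05  # linear stand-in for the PredictorP network

def predictor_p(z_t):
    # Step 3: regress predicted physics tokens from the noisy latents.
    return (z_t @ W).reshape(4, 16)

# One illustrative training example (step 4, cross-attention injection, not shown here).
target = cached_features("clip-0", video=None)
z0, noise = rng.standard_normal(64), rng.standard_normal(64)
z_t = forward_diffuse(z0, t=500, noise=noise)
pred = predictor_p(z_t)
l_phys = float(np.mean((pred - target) ** 2))  # regression supervision against cached tokens
```

Note that caching (step 1) is what keeps the frozen extractor out of the training loop: each clip's tokens are computed exactly once.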
3. PredictorP: Physics Token Regression Network
The PredictorP module is a lightweight regression branch tasked with mapping corrupted diffusion latents and projected text/timestep embeddings to predicted physics tokens matching the 2048 × 1408 V-JEPA 2 target. The pipeline:
- Input: the noisy latent is encoded via a 3D ConvNet, and the resulting feature map is flattened into a token sequence.
- The fused features are passed through a 4-layer Transformer encoder alongside the text and timestep embeddings.
- A set of learnable queries (one per output token) attends to the fused context via a Transformer decoder.
- A final linear head projects each query output to the target feature dimension.
The training objective is a regression loss between the predicted tokens and the cached V-JEPA 2 features, directly supervising physics token recovery against ground truth.
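The query-decoder stage can be sketched with plain numpy attention. The internal width, context length, and initialization scales below are illustrative assumptions; only the 2048 × 1408 output shape comes from the report:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128                     # assumed internal width
n_queries, out_dim = 2048, 1408   # matches the V-JEPA 2 token target

# Fused context: flattened 3D-conv latent features plus text/time embeddings (stub).
context = rng.standard_normal((300, d_model))
queries = rng.standard_normal((n_queries, d_model))  # learnable in practice

def cross_attend(q, kv):
    """Single-head dot-product cross-attention (no learned projections, for brevity)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

head = rng.standard_normal((d_model, out_dim)) * 0.02  # final linear head
pred_tokens = cross_attend(queries, context) @ head
print(pred_tokens.shape)  # (2048, 1408)
```

One query per output token lets the decoder emit a fixed-size physics token grid regardless of the latent sequence length.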
4. Physics Cross-Attention Sublayer Injection
Physical tokens are delivered to the temporal Transformer stack of the DiT video denoising model via a specially designed cross-attention block:
- For each temporal patch representation (frames × feature dim), compute attention scores against the physics tokens via learned query/key/value projections.
- The attention output is added back to the temporal representation with a residual connection modulated by a learnable gating scalar.
- This mechanism ensures that temporal modeling is explicitly influenced by patchwise physics features, directly steered by V-JEPA 2 signals as estimated from the noisy diffusion latents.
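A gated residual cross-attention update of this form can be sketched as follows. The shapes, the zero gate initialization, and single-head attention are assumptions for illustration, not the report's specification:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
frame_tokens = rng.standard_normal((16, d))  # temporal patch representations (frames x dim)
phys_tokens = rng.standard_normal((32, d))   # predicted physics tokens, projected to width d

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
gate = 0.0  # learnable gating scalar; zero init (assumed) leaves the base model intact at start

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Cross-attention: the temporal stream queries the physics tokens.
attn = softmax((frame_tokens @ Wq) @ (phys_tokens @ Wk).T / np.sqrt(d)) @ (phys_tokens @ Wv)
out = frame_tokens + gate * attn  # gated residual injection
```

With the gate at zero the sublayer is initially a no-op, so the pretrained DiT's temporal behavior is preserved while the gate learns how strongly to admit physics guidance.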
5. Multi-Task Optimization: Diffusion–Physics Joint Training
The overall training paradigm jointly minimizes the DDPM noise-prediction loss and the physics regression loss:

$\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda \, \mathcal{L}_{\text{phys}}$,

where $\mathcal{L}_{\text{diff}}$ is the standard denoising MSE and $\mathcal{L}_{\text{phys}}$ is the physics token regression loss; the weight $\lambda$ is set to $0.1$ for stable optimization. Diffusion and physics regression gradients are propagated in an interleaved fashion over training epochs, with V-JEPA 2 features held fixed.
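The combined objective reduces to a one-line weighting with the report's $\lambda = 0.1$; the example loss values below are hypothetical:

```python
def joint_loss(l_diff, l_phys, lam=0.1):
    """Multi-task objective: total = denoising loss + lam * physics regression loss."""
    return l_diff + lam * l_phys

# Hypothetical step where denoising dominates: the physics term contributes 0.15.
total = joint_loss(l_diff=0.25, l_phys=1.5)
```

The small weight keeps the physics branch from destabilizing the dominant diffusion objective, which is the gradient-balancing concern raised in the next section.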
6. Technical Challenges Encountered
Challenges reported include:
- Memory pressure from high-dimensional physics tokens (2048 × 1408), requiring gradient checkpointing and BF16 precision.
- Balancing gradients between physics objectives and the dominant diffusion generator task.
- GPU overhead for 3D-convolutions and transformer stacks in PredictorP.
Planned mitigations: compressed token bottlenecks, alternative backbones, and selective ablations to evaluate the physics cross-attention's efficacy.
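One of the listed mitigations, a compressed token bottleneck, could be as simple as pooling consecutive tokens. A hedged sketch: the report does not specify the compression scheme, and the pooling factor here is an illustrative choice:

```python
import numpy as np

def pool_tokens(tokens, factor=8):
    """Average-pool groups of `factor` consecutive tokens along the token axis.
    One candidate 'compressed token bottleneck' (illustrative, not from the report)."""
    n, d = tokens.shape
    assert n % factor == 0, "token count must divide evenly by the pooling factor"
    return tokens.reshape(n // factor, factor, d).mean(axis=1)

pooled = pool_tokens(np.ones((2048, 1408)))
print(pooled.shape)  # (256, 1408)
```

An 8× reduction shrinks the cross-attention key/value set from 2048 to 256 tokens, directly easing the memory pressure noted above.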
7. Empirical Validation and Future Directions
Empirical benchmarks establish that:
- The physics regression loss converges rapidly in initial epochs, indicating that diffusion latents hold sufficient information for physics recovery.
- Training remains stable without oscillatory loss patterns.
- Qualitative diagnostics confirm a visibly stronger association between motion patterns and predicted physics tokens in generative outputs.
Future evaluation will target standardized physics correctness scores (VideoPhy-2), optical flow consistency, and T-LPIPS. Directions include integrating pooled/quantized physics tokens, supporting larger DiT variants, and performing detailed ablation studies on the necessity and timing of physics cross-attention, along with investigating classifier-free-style scaling of the physics guidance strength at inference.
In sum, V-JEPA 2 serves as a fixed physics feature oracle for self-supervised video generation: its joint embedding tokens are recoverable from diffusion latents and inject real-world physics priors into the temporal dynamics of generative models via cross-attention and multi-task regression. This architecture enables physically grounded video synthesis without requiring hand-crafted priors, physics simulators, or retraining of backbone models, tuning the generative process toward consistency with latent dynamics captured directly from large-scale video corpora (Satish et al., 7 Jan 2026).