PhysVideoGenerator & V-JEPA 2 Integration

Updated 14 January 2026
  • PhysVideoGenerator is a physics-aware generative framework that integrates V-JEPA 2 to extract and inject high-dimensional physics feature tokens from video data.
  • It leverages a dedicated PredictorP network and cross-attention layers to map diffusion latents to physics token predictions, ensuring realistic temporal dynamics.
  • Multi-task optimization combines denoising and physics regression losses, stabilizing training and producing physically plausible synthetic video outputs.

V-JEPA 2 (Video Joint Embedding Predictive Architecture 2) is a high-dimensional joint embedding model designed for predictive, self-supervised representation learning over temporal video signals. V-JEPA 2 is deployed as a pretrained physics feature extractor in PhysVideoGenerator frameworks to enable explicit physics guidance for generative models, particularly in the context of physics-aware video synthesis (Satish et al., 7 Jan 2026). The architecture and integration details summarized below follow the specifications and empirical findings in the PhysVideoGenerator technical report.

1. V-JEPA 2: Definition, Representational Scope, and Feature Topology

V-JEPA 2 is a temporal joint embedding architecture trained to produce high-level physics-centric feature representations from raw video clips. Its core output is a set of patchwise joint tokens: for each video clip $x$ of shape $3 \times 16 \times 256 \times 256$ (channels × frames × spatial), V-JEPA 2 emits a tensor $p^* \in \mathbb{R}^{2048 \times 1408}$, where each "token" implicitly aggregates motion features (object velocity, collision likelihood, gravity direction) over a spatiotemporal patch. The architecture itself was originally introduced and trained by Assran et al. (Assran et al., 11 Jun 2025); PhysVideoGenerator utilizes it as a fixed, non-trainable feature source.

These tokens encode latent physics descriptors and do not split into hand-engineered primitives. Instead, they are leveraged in downstream generative models for tasks requiring consistent, realistic physical dynamics across temporal video windows.
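The feature-extraction interface can be sketched as follows. This is a hedged illustration, not the real V-JEPA 2 API: `PhysicsEncoder` is a hypothetical stand-in (a single linear tubelet projection in place of the actual ViT backbone) used only to show the clip-in, token-tensor-out contract and the frozen-weights usage described above.

```python
import torch

# Hypothetical stand-in for the frozen V-JEPA 2 backbone (the real model's
# API differs). It maps a clip of shape (3, 16, 256, 256) to 2048 tokens of
# dimension 1408, matching the shapes stated in the report.
class PhysicsEncoder(torch.nn.Module):
    def __init__(self, dim=1408):
        super().__init__()
        # Placeholder: one linear projection over 2x16x16 spatiotemporal
        # tubelets (8 * 16 * 16 = 2048 tubelets per clip).
        self.proj = torch.nn.Linear(3 * 2 * 16 * 16, dim)

    def forward(self, x):  # x: (B, 3, 16, 256, 256)
        b = x.shape[0]
        # Carve the clip into 2048 tubelets of size 2 frames x 16 x 16 pixels.
        patches = (x.unfold(2, 2, 2).unfold(3, 16, 16).unfold(4, 16, 16)
                    .permute(0, 2, 3, 4, 1, 5, 6, 7)
                    .reshape(b, 2048, -1))
        return self.proj(patches)  # (B, 2048, 1408)

encoder = PhysicsEncoder().eval()
for p in encoder.parameters():
    p.requires_grad_(False)  # fixed, non-trainable feature source

with torch.no_grad():
    clip = torch.randn(1, 3, 16, 256, 256)
    p_star = encoder(clip)   # cached once per training clip
print(tuple(p_star.shape))   # (1, 2048, 1408)
```

In the actual pipeline these `p_star` tensors are computed once per clip and cached, so the frozen extractor adds no per-step training cost.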

2. Integration into Physics-Guided Generative Video Models

PhysVideoGenerator instantiates V-JEPA 2 as a frozen physics prior injected into a DiT-based generator (specifically, Latte-1). The workflow consists of:

  • For each training video clip $x$, compute $p^* = \text{V-JEPA2}(x)$ once and cache it.
  • During training, at each diffusion step, corrupt the video latents to produce $z_t$.
  • A dedicated "PredictorP" subnetwork regresses predicted physics tokens $\hat p$ from $z_t$ (plus text and timestep embeddings).
  • $\hat p$ is then injected into the temporal Transformer attention via a learnable cross-attention sublayer.

This mechanism makes physically salient intermediate states recoverable from noisy diffusion latents and directly conditions the video generator on the V-JEPA 2 feature space, providing a fine-grained inductive bias for physically plausible synthesis.
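The latent-corruption step in the workflow above is the standard DDPM forward process. A minimal sketch, using the report's latent shape $4 \times 16 \times 32 \times 32$; the names (`corrupt_latents`, `alphas_cumprod`) and the linear beta schedule are illustrative assumptions, not the report's identifiers.

```python
import torch

# Standard DDPM forward corruption: z_t = sqrt(a_bar_t) z_0 + sqrt(1-a_bar_t) eps.
def corrupt_latents(z0, t, alphas_cumprod):
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)  # broadcast over (C,F,H,W)
    eps = torch.randn_like(z0)
    return a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps, eps

# Toy linear schedule (an assumption; the report does not state the schedule).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

z0 = torch.randn(2, 4, 16, 32, 32)   # clean video latents (B, C, F, H, W)
t = torch.randint(0, T, (2,))        # random per-sample diffusion timesteps
z_t, eps = corrupt_latents(z0, t, alphas_cumprod)
print(tuple(z_t.shape))              # (2, 4, 16, 32, 32)
```

PredictorP then receives these noisy `z_t` latents as its visual input.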

3. PredictorP: Physics Token Regression Network

The PredictorP module is a lightweight regression branch tasked with mapping corrupted diffusion latents $z_t$ (shape $4 \times 16 \times 32 \times 32$) and projected text/timestep embeddings to predicted physics tokens $\hat p \in \mathbb{R}^{2048 \times 1408}$. The pipeline:

  1. Input: $z_t$ is encoded via a 3D ConvNet, yielding $h_{\text{vis}} \in \mathbb{R}^{512 \times 8 \times 16 \times 16}$, which is then flattened.
  2. The fused features are passed through a 4-layer Transformer encoder alongside the text and timestep embeddings.
  3. A set of learnable queries $Q_{\text{phys}}$ attends via a Transformer decoder to the fused context, outputting $\tilde p \in \mathbb{R}^{2048 \times 512}$.
  4. The final output $\hat p$ is computed with a linear head to match the target dimension.

The training objective is $\mathcal{L}_{\text{phy}} = \|\hat p - p^*\|_2^2$, directly supervising the physics token recovery against V-JEPA 2 ground-truth features.
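The four-step pipeline above can be sketched as a module. Dimensions follow the report where stated (512-dim fused features, 2048 learnable queries, 1408-dim targets, a 4-layer encoder); the specific convolution kernels, head counts, and decoder depth are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of PredictorP under assumed hyperparameters; only the stated
# tensor shapes (512 / 2048 / 1408, 4 encoder layers) come from the report.
class PredictorP(nn.Module):
    def __init__(self, d=512, n_queries=2048, out_dim=1408):
        super().__init__()
        # Step 1: 3D ConvNet encodes z_t (4,16,32,32) -> (512, 8, 16, 16).
        self.conv = nn.Sequential(
            nn.Conv3d(4, 128, 3, stride=2, padding=1),  # -> (128, 8, 16, 16)
            nn.GELU(),
            nn.Conv3d(128, d, 3, stride=1, padding=1),  # -> (512, 8, 16, 16)
        )
        # Step 2: 4-layer Transformer encoder fuses visual + text/time tokens.
        enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Step 3: learnable physics queries Q_phys decoded against the context.
        self.queries = nn.Parameter(torch.randn(n_queries, d) * 0.02)
        dec_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=1)
        # Step 4: linear head maps 512 -> 1408 to match the V-JEPA 2 targets.
        self.head = nn.Linear(d, out_dim)

    def forward(self, z_t, cond):  # z_t: (B,4,16,32,32), cond: (B,L,512)
        b = z_t.shape[0]
        h_vis = self.conv(z_t).flatten(2).transpose(1, 2)    # (B, 2048, 512)
        ctx = self.encoder(torch.cat([h_vis, cond], dim=1))  # fused context
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        p_tilde = self.decoder(q, ctx)                       # (B, 2048, 512)
        return self.head(p_tilde)                            # (B, 2048, 1408)

model = PredictorP()
p_hat = model(torch.randn(1, 4, 16, 32, 32), torch.randn(1, 8, 512))
print(tuple(p_hat.shape))  # (1, 2048, 1408)
```

The query-decoder design lets the output token count (2048) be independent of the flattened visual token count, which happens to coincide here.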

4. Physics Cross-Attention Sublayer Injection

Predicted physics tokens $\hat p$ are delivered to the temporal Transformer stack of the DiT video denoising model via a specially designed cross-attention block:

  • For each temporal patch representation $x_{\text{temp}} \in \mathbb{R}^{F \times d}$ (frames × feature dim), compute attention scores against $\hat p$ via learned projections.
  • The attention output is added to $x_{\text{temp}}$ with a residual connection modulated by a gating scalar $\alpha_{\text{phy}}$.
  • This mechanism ensures that temporal modeling is explicitly influenced by patchwise physics features, directly steered by the V-JEPA 2 signals estimated from $z_t$.
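The gated residual injection above can be sketched as follows. The temporal hidden width `d=1152` and the zero initialization of the gate are assumptions (the report specifies only a gating scalar); zero-initializing the gate is a common choice that makes the new sublayer an identity at the start of training.

```python
import torch
import torch.nn as nn

# Sketch of the physics cross-attention sublayer; d=1152 and the
# zero-initialized gate are illustrative assumptions.
class PhysicsCrossAttention(nn.Module):
    def __init__(self, d=1152, d_phys=1408, nhead=8):
        super().__init__()
        # Queries come from temporal tokens (dim d); keys/values from the
        # predicted physics tokens (dim d_phys), via learned projections.
        self.attn = nn.MultiheadAttention(d, nhead, kdim=d_phys, vdim=d_phys,
                                          batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.alpha_phy = nn.Parameter(torch.zeros(1))  # gating scalar

    def forward(self, x_temp, p_hat):
        # x_temp: (B, F, d) temporal patch tokens; p_hat: (B, 2048, d_phys).
        out, _ = self.attn(self.norm(x_temp), p_hat, p_hat)
        return x_temp + self.alpha_phy * out  # gated residual injection

layer = PhysicsCrossAttention()
x = torch.randn(2, 16, 1152)
y = layer(x, torch.randn(2, 2048, 1408))
print(tuple(y.shape))  # (2, 16, 1152)
```

With the gate at zero the sublayer initially passes `x_temp` through unchanged, so inserting it does not perturb the pretrained DiT until training opens the gate.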

5. Multi-Task Optimization: Diffusion–Physics Joint Training

The overall training paradigm involves joint minimization of DDPM noise-prediction loss and physics regression loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda_{\text{phy}} \mathcal{L}_{\text{phy}}$$

where $\mathcal{L}_{\text{diff}}$ is the standard denoising MSE and $\mathcal{L}_{\text{phy}}$ is the physics token regression loss. $\lambda_{\text{phy}}$ is set to $0.1$ for stable optimization. Diffusion and physics regression gradients are propagated jointly throughout training, with the V-JEPA 2 features held fixed.
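The joint objective can be sketched directly; the tensors below are random stand-ins for actual model outputs and targets, and the mean-squared form of the losses is an assumption (the report writes the physics loss as a squared L2 norm).

```python
import torch
import torch.nn.functional as F

lambda_phy = 0.1  # weighting stated in the report

# Random stand-ins for model outputs / targets with the reported shapes.
eps      = torch.randn(2, 4, 16, 32, 32)   # true diffusion noise
eps_pred = torch.randn(2, 4, 16, 32, 32)   # denoiser's noise prediction
p_star   = torch.randn(2, 2048, 1408)      # cached V-JEPA 2 targets
p_hat    = torch.randn(2, 2048, 1408)      # PredictorP output

loss_diff = F.mse_loss(eps_pred, eps)      # standard DDPM noise-prediction loss
loss_phy  = F.mse_loss(p_hat, p_star)      # physics token regression loss
loss_total = loss_diff + lambda_phy * loss_phy
print(loss_total.item() > 0)               # True
```

A single `loss_total.backward()` then propagates both gradients jointly; the frozen V-JEPA 2 extractor sits outside the graph because `p_star` is precomputed.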

6. Technical Challenges Encountered

Challenges reported include:

  • Memory pressure from high-dimensional physics tokens (2048 × 1408), requiring gradient checkpointing and BF16 precision.
  • Balancing gradients between physics objectives and the dominant diffusion generator task.
  • GPU overhead for 3D-convolutions and transformer stacks in PredictorP.

Planned mitigations: compressed token bottlenecks, alternative backbones, and selective ablations to evaluate the physics cross-attention's efficacy.

7. Empirical Validation and Future Directions

Empirical benchmarks establish that:

  • The physics regression loss $\mathcal{L}_{\text{phy}}$ converges rapidly in the initial epochs, indicating that diffusion latents hold sufficient information for physics recovery.
  • Training remains stable without oscillatory loss patterns.
  • Qualitative diagnostics confirm a visibly stronger association between motion patterns and predicted physics tokens in generative outputs.

Future evaluation will target standardized physics correctness scores (VideoPhy-2), optical flow consistency, and T-LPIPS. Directions include integrating pooled/quantized physics tokens, supporting large DiT variants, and performing detailed ablation studies on the necessity and timing of physics cross-attention, along with investigating classifier-free scaling of $\alpha_{\text{phy}}$ at inference.


In sum, V-JEPA 2 serves as a fixed physics feature oracle for self-supervised video generation, whose joint embedding tokens are recoverable from diffusion latents and inject real-world physics priors into the temporal dynamics of generative models via cross-attention and multi-task regression. This architecture enables physically grounded video synthesis without requiring hand-crafted priors, physics simulators, or retraining of backbone models, tuning the generative process toward consistency with latent dynamics captured directly from large-scale video corpora (Satish et al., 7 Jan 2026).
