PhysVideoGenerator & V-JEPA 2 Integration
- PhysVideoGenerator is a physics-aware generative framework that integrates V-JEPA 2 to extract and inject high-dimensional physics feature tokens from video data.
- It leverages a dedicated PredictorP network and cross-attention layers to map diffusion latents to physics token predictions, ensuring realistic temporal dynamics.
- Multi-task optimization combines denoising and physics regression losses, stabilizing training and producing physically plausible synthetic video outputs.
A Video Joint Embedding Predictive Architecture (V-JEPA 2) is a specialized high-dimensional joint embedding model, designed for predictive, self-supervised representation learning over temporal video signals. V-JEPA 2 is deployed as a pretrained physics feature extractor in PhysVideoGenerator frameworks to enable explicit physics guidance for generative models, particularly in the context of physics-aware video synthesis (Satish et al., 7 Jan 2026). The architecture and integration details summarized below follow the specifications and empirical findings in the PhysVideoGenerator technical report.
1. V-JEPA 2: Definition, Representational Scope, and Feature Topology
V-JEPA 2 is a temporal joint embedding architecture trained to produce high-level physics-centric feature representations from raw video clips. Its core output is a set of patchwise joint tokens: for each video clip of shape (channels × frames × spatial dimensions), V-JEPA 2 emits a token tensor of shape 2048 × 1408 (tokens × feature dimension), where each "token" implicitly aggregates motion features (object velocity, collision likelihood, gravity direction) over a spatiotemporal patch. The architecture itself is originally introduced and trained by Assran et al. (Assran et al., 11 Jun 2025); PhysVideoGenerator utilizes it as a fixed, non-trainable feature source.
These tokens encode latent physics descriptors and do not decompose into hand-engineered primitives. Instead, they are leveraged in downstream generative models for tasks requiring consistent, realistic physical dynamics across the temporal video window.
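The 2048 × 1408 token topology is consistent with simple spatiotemporal patch arithmetic. A minimal sketch, assuming a 16-frame 256×256 clip and a 2×16×16 tubelet (these clip and patch sizes are illustrative assumptions, not stated in this section):

```python
# Sketch of V-JEPA 2-style token-grid arithmetic. The clip size and tubelet
# dimensions below are illustrative assumptions; only the resulting
# 2048 x 1408 token shape is taken from the report.
def token_grid(frames=16, height=256, width=256,
               tubelet_t=2, patch_h=16, patch_w=16, embed_dim=1408):
    """Return (number of spatiotemporal tokens, feature dimension)."""
    n_tokens = (frames // tubelet_t) * (height // patch_h) * (width // patch_w)
    return n_tokens, embed_dim

print(token_grid())  # (2048, 1408)
```

Each token thus summarizes one 2-frame × 16×16-pixel tubelet, which is why motion cues such as velocity and collision likelihood can be aggregated per token.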
2. Integration into Physics-Guided Generative Video Models
PhysVideoGenerator instantiates V-JEPA 2 as a physics prior injected into a DiT-based generator (specifically, Latte-1). The workflow consists of four steps:
- For each training video clip, compute the V-JEPA 2 physics tokens once and cache them.
- During training, at each diffusion step, corrupt the clean video latents with noise to produce noisy latents.
- A dedicated "PredictorP" subnetwork regresses predicted physics tokens from the noisy latents (plus text and timestep embeddings).
- The predicted physics tokens are then injected into the temporal Transformer attention via a learnable cross-attention sublayer.
This mechanism makes physically salient intermediate states recoverable from noisy diffusion latents and directly conditions the video generator on the V-JEPA 2 feature space, providing a fine-grained inductive bias for physically plausible synthesis.
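The workflow above can be sketched end-to-end in toy form. Every function, shape, and noise schedule below is an illustrative stand-in (scaled far down from the 2048 × 1408 tokens), not the report's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def vjepa2_features(video):
    """Frozen V-JEPA 2 extractor (stub returning fixed-shape tokens)."""
    return rng.standard_normal((4, 16))

feature_cache = {}

def cached_features(clip_id, video):
    # Step 1: compute physics tokens once per clip and cache them.
    if clip_id not in feature_cache:
        feature_cache[clip_id] = vjepa2_features(video)
    return feature_cache[clip_id]

def forward_diffuse(z0, t, noise, T=1000):
    # Step 2: corrupt clean video latents at diffusion step t (toy linear schedule).
    a = 1.0 - t / T
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise

W = rng.standard_normal((64, 64)) * 0.05  # linear stand-in for the PredictorP network

def predictor_p(z_t):
    # Step 3: regress predicted physics tokens from the noisy latents.
    return (z_t @ W).reshape(4, 16)

# One illustrative training example (step 4, cross-attention injection, not shown here).
target = cached_features("clip-0", video=None)
z0, noise = rng.standard_normal(64), rng.standard_normal(64)
z_t = forward_diffuse(z0, t=500, noise=noise)
pred = predictor_p(z_t)
l_phys = float(np.mean((pred - target) ** 2))  # regression supervision against cached tokens
```

Note that caching (step 1) is what keeps the frozen extractor out of the training loop: each clip's tokens are computed exactly once.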
3. PredictorP: Physics Token Regression Network
The PredictorP module is a lightweight regression branch tasked with mapping corrupted diffusion latents and projected text/timestep embeddings to predicted physics tokens matching the 2048 × 1408 V-JEPA 2 target. The pipeline:
- Input: the noisy latent is encoded via a 3D ConvNet, and the resulting feature map is flattened into a token sequence.
- The fused features are passed through a 4-layer Transformer encoder alongside the text and timestep embeddings.
- A set of learnable queries (one per output token) attends to the fused context via a Transformer decoder.
- A final linear head projects each query output to the target feature dimension.
The training objective is a regression loss between the predicted tokens and the cached V-JEPA 2 features, directly supervising physics token recovery against ground truth.
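The query-decoder stage can be sketched with plain numpy attention. The internal width, context length, and initialization scales below are illustrative assumptions; only the 2048 × 1408 output shape comes from the report:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128                     # assumed internal width
n_queries, out_dim = 2048, 1408   # matches the V-JEPA 2 token target

# Fused context: flattened 3D-conv latent features plus text/time embeddings (stub).
context = rng.standard_normal((300, d_model))
queries = rng.standard_normal((n_queries, d_model))  # learnable in practice

def cross_attend(q, kv):
    """Single-head dot-product cross-attention (no learned projections, for brevity)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

head = rng.standard_normal((d_model, out_dim)) * 0.02  # final linear head
pred_tokens = cross_attend(queries, context) @ head
print(pred_tokens.shape)  # (2048, 1408)
```

One query per output token lets the decoder emit a fixed-size physics token grid regardless of the latent sequence length.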
4. Physics Cross-Attention Sublayer Injection
Physical tokens are delivered to the temporal Transformer stack of the DiT video denoising model via a specially designed cross-attention block:
- For each temporal patch representation (frames × feature dim), compute attention scores against the physics tokens via learned query/key/value projections.
- The attention output is added back to the temporal representation with a residual connection modulated by a learnable gating scalar.
- This mechanism ensures that temporal modeling is explicitly influenced by patchwise physics features, directly steered by V-JEPA 2 signals as estimated from the noisy diffusion latents.
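A gated residual cross-attention update of this form can be sketched as follows. The shapes, the zero gate initialization, and single-head attention are assumptions for illustration, not the report's specification:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
frame_tokens = rng.standard_normal((16, d))  # temporal patch representations (frames x dim)
phys_tokens = rng.standard_normal((32, d))   # predicted physics tokens, projected to width d

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
gate = 0.0  # learnable gating scalar; zero init (assumed) leaves the base model intact at start

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Cross-attention: the temporal stream queries the physics tokens.
attn = softmax((frame_tokens @ Wq) @ (phys_tokens @ Wk).T / np.sqrt(d)) @ (phys_tokens @ Wv)
out = frame_tokens + gate * attn  # gated residual injection
```

With the gate at zero the sublayer is initially a no-op, so the pretrained DiT's temporal behavior is preserved while the gate learns how strongly to admit physics guidance.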
5. Multi-Task Optimization: Diffusion–Physics Joint Training
The overall training paradigm jointly minimizes the DDPM noise-prediction loss and the physics regression loss:

$\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda \, \mathcal{L}_{\text{phys}}$,

where $\mathcal{L}_{\text{diff}}$ is the standard denoising MSE and $\mathcal{L}_{\text{phys}}$ is the physics token regression loss; the weight $\lambda$ is set to $0.1$ for stable optimization. Diffusion and physics regression gradients are propagated in an interleaved fashion over training epochs, with V-JEPA 2 features held fixed.
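The combined objective reduces to a one-line weighting with the report's $\lambda = 0.1$; the example loss values below are hypothetical:

```python
def joint_loss(l_diff, l_phys, lam=0.1):
    """Multi-task objective: total = denoising loss + lam * physics regression loss."""
    return l_diff + lam * l_phys

# Hypothetical step where denoising dominates: the physics term contributes 0.15.
total = joint_loss(l_diff=0.25, l_phys=1.5)
```

The small weight keeps the physics branch from destabilizing the dominant diffusion objective, which is the gradient-balancing concern raised in the next section.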
6. Technical Challenges Encountered
Challenges reported include:
- Memory pressure from high-dimensional physics tokens (2048 × 1408), requiring gradient checkpointing and BF16 precision.
- Balancing gradients between physics objectives and the dominant diffusion generator task.
- GPU overhead for 3D-convolutions and transformer stacks in PredictorP.
Planned mitigations: compressed token bottlenecks, alternative backbones, and selective ablations to evaluate the physics cross-attention's efficacy.
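One of the listed mitigations, a compressed token bottleneck, could be as simple as pooling consecutive tokens. A hedged sketch: the report does not specify the compression scheme, and the pooling factor here is an illustrative choice:

```python
import numpy as np

def pool_tokens(tokens, factor=8):
    """Average-pool groups of `factor` consecutive tokens along the token axis.
    One candidate 'compressed token bottleneck' (illustrative, not from the report)."""
    n, d = tokens.shape
    assert n % factor == 0, "token count must divide evenly by the pooling factor"
    return tokens.reshape(n // factor, factor, d).mean(axis=1)

pooled = pool_tokens(np.ones((2048, 1408)))
print(pooled.shape)  # (256, 1408)
```

An 8× reduction shrinks the cross-attention key/value set from 2048 to 256 tokens, directly easing the memory pressure noted above.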
7. Empirical Validation and Future Directions
Empirical benchmarks establish that:
- The physics regression loss converges rapidly in initial epochs, indicating that diffusion latents hold sufficient information for physics recovery.
- Training remains stable without oscillatory loss patterns.
- Qualitative diagnostics confirm a visibly stronger association between motion patterns and predicted physics tokens in generative outputs.
Future evaluation will target standardized physics correctness scores (VideoPhy-2), optical flow consistency, and T-LPIPS. Directions include integrating pooled/quantized physics tokens, supporting larger DiT variants, and performing detailed ablation studies on the necessity and timing of physics cross-attention, along with investigating classifier-free-style scaling of the physics guidance strength at inference.
In sum, V-JEPA 2 serves as a fixed physics feature oracle for self-supervised video generation: its joint embedding tokens are recoverable from diffusion latents and inject real-world physics priors into the temporal dynamics of generative models via cross-attention and multi-task regression. This architecture enables physically grounded video synthesis without requiring hand-crafted priors, physics simulators, or retraining of backbone models, tuning the generative process toward consistency with latent dynamics captured directly from large-scale video corpora (Satish et al., 7 Jan 2026).