PredictorP: Physics Token Regression Network
- The paper introduces PredictorP, a novel network that regresses high-level physics tokens from noisy diffusion latents to guide video generation.
- The methodology leverages a modular pipeline combining Conv3D and Transformer layers to fuse visual, text, and timestep embeddings for accurate physics token regression.
- Empirical results show stable multi-task optimization over 50 epochs, with a physics regression loss that converges rapidly in early epochs, supporting the feasibility of physics-aware video generation.
PredictorP is a physics token regression network developed as an integral component of the PhysVideoGenerator framework, designed to embed explicit, learnable physics priors into deep video generation models. Addressing common deficiencies in video diffusion models—such as artifacts stemming from inadequate modeling of real-world dynamics—PredictorP regresses high-level physical representations, termed "physics tokens," from noisy diffusion latents. These tokens are subsequently injected via cross-attention into the temporal backbone of a DiT-based generator, steering the generative process toward physical plausibility and temporal consistency (Satish et al., 7 Jan 2026).
1. Functional Overview and Purpose
PredictorP’s central objective is to reconstruct patch-level, high-dimensional physical tokens—obtained from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2)—directly from latent diffused representations produced during the sampling process. Given the following inputs:
- Noisy diffusion latent $z_t$ (from a VAE encoding of video frames),
- Text conditioning embedding $c_{\text{text}}$,
- Timestep embedding $t_{\text{emb}}$,
PredictorP regresses physics tokens $\hat{p}$, which are then injected through a dedicated cross-attention mechanism into every temporal transformer block of the video generator backbone.
This architecture demonstrates that diffusion latents retain sufficient information to recover detailed physical representations, thereby enabling the generative model to benefit from explicit physics guidance.
2. PredictorP Architecture and Layer-by-Layer Specification
The architecture is a modular pipeline combining convolutional and transformer-based components for multimodal fusion and token decoding. The structural flow comprises:
- 3D Convolutional Encoder: Processes $z_t$ through 3–4 Conv3D layers (kernel size , stride for , padding 1, 512 channels), resulting in visual features $h_{\text{vis}}$.
- Flatten & Concatenate: Spatial and temporal axes are flattened into a sequence of length $2048$. This sequence is concatenated with the projected text and timestep embeddings, yielding a combined sequence of length $2048 + 226 + 1 = 2275$ (226 text positions and one timestep position, per the pseudocode in Section 7).
- Transformer Encoder (Fusion): The concatenated input is projected to dimension 512 and fed to a 4-layer TransformerEncoder (8 heads, model dim 512, MLP dim 2048), producing fused features of shape $2275 \times 512$.
- Transformer Decoder (Physics Queries): Employs $2048$ learnable queries $Q_{\text{phys}} \in \mathbb{R}^{2048 \times 512}$, decoded via cross-attention against the fused features, yielding intermediate representations of shape $2048 \times 512$.
- Linear Projection: Final mapping generates predicted physics tokens $\hat{p} \in \mathbb{R}^{2048 \times 1408}$.
This layered approach is engineered to recover the high-bandwidth, spatiotemporally structured physical representations necessary for controlling the underlying dynamics in the generative process.
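The query-decoding and projection stages can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the paper's implementation: a single attention head with no separate key/value projections stands in for the multi-layer TransformerDecoder, the weights are random, and the helper name `decode_physics_tokens` is hypothetical. Only the shapes (2048 queries, 226 text positions, one timestep position, width 512, token dimension 1408) come from the description above and the pseudocode in Section 7.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=axis, keepdims=True)

def decode_physics_tokens(fused, queries, w_out):
    """Single-head cross-attention from learnable queries to fused features,
    followed by a linear projection to the physics-token dimension.
    (Assumption: one head, no K/V projections, no decoder self-attention.)"""
    d = queries.shape[-1]
    attn = softmax(queries @ fused.T / np.sqrt(d))  # [n_queries, seq_len]
    dec_out = attn @ fused                          # [n_queries, d_model]
    return dec_out @ w_out                          # [n_queries, token_dim]

rng = np.random.default_rng(0)
n_vis, n_text, n_time = 2048, 226, 1       # sequence parts from the pseudocode
d_model, token_dim = 512, 1408
seq_len = n_vis + n_text + n_time

fused = rng.standard_normal((seq_len, d_model))         # stand-in encoder output
queries = rng.standard_normal((n_vis, d_model)) * 0.02  # stand-in for Q_phys
w_out = rng.standard_normal((d_model, token_dim)) * 0.02

hat_p = decode_physics_tokens(fused, queries, w_out)
print(seq_len, hat_p.shape)  # 2275 (2048, 1408)
```

Because the number of queries fixes the number of output tokens, the decoder output length is independent of how many text or timestep positions are concatenated into the fused sequence.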
3. Physics Token Representation and Semantics
Physics tokens are patch-level embeddings computed in advance using the V-JEPA 2 encoder. For a 16-frame video clip:
- V-JEPA 2 extracts a spatial grid of patch embeddings over 8 frames, generating $2048$ tokens per clip ($256$ per frame).
- Each token is a $1408$-dimensional vector.
- These vectors encode high-level predictive attributes including object motion patterns, collision events, and gravitational effects.
Aligning the output space of PredictorP with these tokens enables direct supervision and facilitates physically informed conditioning of the generative model, exploiting the semantic content encoded by V-JEPA 2.
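The token count above can be checked with a few lines of arithmetic. The sketch below assumes the per-frame patch grid is square, which the text does not state; `tokens_per_clip` is a hypothetical helper used only for this check.

```python
import math

def tokens_per_clip(temporal_positions, patches_per_side):
    """Number of patch tokens for a clip with a square spatial patch grid."""
    return temporal_positions * patches_per_side ** 2

num_tokens, temporal_positions, token_dim = 2048, 8, 1408

per_position = num_tokens // temporal_positions  # tokens per temporal position
side = math.isqrt(per_position)                  # grid side, if the grid is square
assert side * side == per_position               # holds: 256 is a perfect square
assert tokens_per_clip(temporal_positions, side) == num_tokens

target_shape = (num_tokens, token_dim)           # regression target per clip
print(per_position, side, target_shape)  # 256 16 (2048, 1408)
```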
4. Training Objectives and Multi-Task Optimization
PredictorP is trained jointly with the video diffusion generator using a composite loss:
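- Diffusion Loss ($\mathcal{L}_{\text{diff}}$): Standard DDPM noise regression with physics token conditioning, $\mathcal{L}_{\text{diff}} = \lVert \epsilon - \hat{\epsilon}(z_t, c_{\text{text}}, \hat{p}) \rVert^2$,
- Physics Regression Loss ($\mathcal{L}_{\text{phys}}$): squared-error regression to ground-truth V-JEPA 2 tokens, $\mathcal{L}_{\text{phys}} = \lVert \hat{p} - p_{\text{gt}} \rVert^2$,
- Total Loss ($\mathcal{L}_{\text{total}}$): Weighted sum, $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda \mathcal{L}_{\text{phys}}$.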
The weighting factor $\lambda$ (0.1 in the pseudocode of Section 7) is set empirically so that the gradient contributions of the two losses remain balanced. Both losses decrease steadily over 50 epochs of training, without oscillation or divergence, which is evidence of stable multi-task joint optimization.
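A minimal NumPy sketch of the composite objective follows. It assumes mean-reduced squared error for both terms (the pseudocode in Section 7 writes an unnormalized squared norm) and uses the weight 0.1 from that pseudocode; `composite_loss` is a hypothetical name.

```python
import numpy as np

def composite_loss(eps, eps_hat, p_hat, p_gt, lam=0.1):
    """Return (total, diffusion, physics) losses.
    Assumption: both terms are mean squared errors."""
    l_diff = np.mean((eps - eps_hat) ** 2)  # noise-prediction error
    l_phys = np.mean((p_hat - p_gt) ** 2)   # physics-token regression error
    return l_diff + lam * l_phys, l_diff, l_phys

# Toy tensors with known errors: 0.5 per noise element, 2.0 per token element.
eps, eps_hat = np.zeros((2, 3)), np.full((2, 3), 0.5)
p_hat, p_gt = np.full((4, 5), 2.0), np.zeros((4, 5))

total, l_diff, l_phys = composite_loss(eps, eps_hat, p_hat, p_gt)
print(l_diff, l_phys, total)
```

Here the physics term is 16 times larger than the diffusion term before weighting, and the factor 0.1 brings its contribution to the total down to 0.4 against 0.25, which is the kind of rebalancing the weighting factor is meant to provide.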
5. Cross-Attention Integration within the Temporal Generator
Physical guidance is imposed on the generated video sequence via cross-attention mechanisms within the temporal transformer blocks of the DiT-based (Latte) generator:
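- For each spatial patch representation $x \in \mathbb{R}^{F \times d}$ (where $F$ is the number of frames), standard temporal self-attention is first applied, giving $x_1 = \mathrm{TemporalSelfAttn}(x)$.
- The resulting features are then modulated by cross-attention against the regressed physics tokens $\hat{p}$:
$$x' = x_1 + \mathrm{CrossAttn}(Q = x_1 W_Q,\; K = \hat{p} W_K,\; V = \hat{p} W_V),$$
with $W_Q, W_K, W_V$ as learned projections.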
- This cross-attention enables each patch’s temporal evolution to attend to global physical signals, thus “steering” the diffusion backbone toward physically consistent video generation.
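The residual cross-attention step can be sketched in NumPy as follows. This is an illustration under assumptions: single-head scaled dot-product attention is used, the frame count (16) and feature width (64) are hypothetical, and the projection weights are random rather than learned. Only the physics-token shape (2048 tokens of dimension 1408) comes from the text.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=axis, keepdims=True)

def physics_cross_attention(x1, p_hat, w_q, w_k, w_v):
    """Residual cross-attention: frame features query the physics tokens."""
    q = x1 @ w_q                                    # [frames, d]
    k = p_hat @ w_k                                 # [tokens, d]
    v = p_hat @ w_v                                 # [tokens, d]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # [frames, tokens]
    return x1 + attn @ v                            # residual update

rng = np.random.default_rng(0)
num_frames, d_model = 16, 64           # hypothetical sizes
num_tokens, token_dim = 2048, 1408     # physics-token shape from the text

x1 = rng.standard_normal((num_frames, d_model))    # after temporal self-attention
p_hat = rng.standard_normal((num_tokens, token_dim))
w_q = rng.standard_normal((d_model, d_model)) * 0.02
w_k = rng.standard_normal((token_dim, d_model)) * 0.02
w_v = rng.standard_normal((token_dim, d_model)) * 0.02

out = physics_cross_attention(x1, p_hat, w_q, w_k, w_v)
print(out.shape)  # (16, 64)
```

Because the update is residual, setting the value projection to zero returns the self-attention output unchanged, so the physics branch can only add to the backbone's features rather than replace them.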
6. Empirical Analysis and Experimental Findings
Empirical validation of PredictorP was conducted over 50 epochs on a subset of OpenVid-1M. Both the noise prediction loss ($\mathcal{L}_{\text{diff}}$) and the physics token regression loss ($\mathcal{L}_{\text{phys}}$) exhibited smooth convergence without instability. Notably, the physics regression loss converged more rapidly in early epochs, indicating that even partially denoised latents contain sufficient information to reconstruct the V-JEPA 2 physical representations.
Planned ablation studies aim to compare:
- Backbone alone without physics token injection,
- PredictorP present but cross-attention disabled,
- Varying $\lambda$ to examine multi-task tradeoffs.
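One way to organize these planned comparisons is a small configuration grid. The sketch below is purely illustrative: the class, field names, and the loss-weight values other than 0.1 are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationConfig:
    name: str
    use_predictor: bool    # train PredictorP alongside the backbone
    inject_physics: bool   # enable physics cross-attention in temporal blocks
    lam: float             # weight on the physics regression loss

def build_ablation_grid(lams=(0.01, 0.1, 1.0)):
    """Enumerate the three planned comparisons (hypothetical weight values)."""
    configs = [
        AblationConfig("backbone_only", False, False, 0.0),
        AblationConfig("predictor_no_cross_attn", True, False, 0.1),
    ]
    configs += [AblationConfig(f"full_lam_{lam}", True, True, lam) for lam in lams]
    return configs

grid = build_ablation_grid()
for cfg in grid:
    print(cfg)
```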
The primary result is the demonstration that PredictorP can be optimized stably and jointly with the diffusion backbone, establishing that the mapping from $z_t$ to $\hat{p}$ is tractable and effective for physics-aware generative modeling (Satish et al., 7 Jan 2026).
7. Implementation Workflow and Pseudocode
The forward computation and integration of PredictorP can be summarized by the following pseudocode, which directly reflects the architectural description:
```
def PredictorP(z_t, c_text, t_emb):
    h_vis   = Conv3DEnc(z_t)                # → [2048,512]
    seq     = concatenate(h_vis, project(c_text), project(t_emb))
    fused   = TransEnc(seq)                 # → [(2048+226+1),512]
    queries = Q_phys                        # [2048,512] learnable
    dec_out = TransDec(queries, fused)      # → [2048,512]
    hat_p   = Linear(dec_out)               # → [2048,1408]
    return hat_p

def DiffusionBlock(x, hat_p):
    x1    = TemporalSelfAttn(x)
    delta = CrossAttn(Q=proj_q(x1), K=proj_k(hat_p), V=proj_v(hat_p))
    return x1 + delta

for each batch of (z0, c_text, p_gt):
    t     = sample_timestep()
    eps   = sample_noise()
    zt    = forward_diffusion(z0, t, eps)
    hat_p = PredictorP(zt, c_text, timestep_embed(t))
    # run through Latent Diffusion U-Net / DiT backbone, injecting hat_p:
    eps_hat = DiffusionModel(zt, c_text, hat_p)
    L_diff  = ||eps - eps_hat||²
    L_phys  = ||hat_p - p_gt||²
    L_total = L_diff + 0.1 * L_phys
    backprop(L_total)
```
This design supports end-to-end training and joint optimization of physics-aware video generation systems by leveraging explicit, high-capacity physics priors regressed from intermediate diffusion states. The feasibility of this approach is established through empirical training dynamics and architectural analysis (Satish et al., 7 Jan 2026).
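The `forward_diffusion` step in the pseudocode is not spelled out in the text. A minimal sketch of the standard DDPM forward process, $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, is given below. The linear beta schedule, the 1000-step horizon, and the latent shape are assumptions chosen for illustration; the source does not state which schedule is used.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffusion(z0, t, eps, alpha_bar):
    """Noise a clean latent z0 to timestep t with the given noise sample eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 8, 16, 16))  # hypothetical latent shape
eps = rng.standard_normal(z0.shape)
t = 500

z_t = forward_diffusion(z0, t, eps, alpha_bar)
print(z_t.shape, float(alpha_bar[0]), float(alpha_bar[-1]))
```

The signal fraction decreases monotonically with the timestep, so latents at small timesteps stay close to the clean latent while latents at large timesteps are dominated by noise.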