
PredictorP: Physics Token Regression Network

Updated 19 March 2026
  • The paper introduces PredictorP, a novel network that regresses high-level physics tokens from noisy diffusion latents to guide video generation.
  • The methodology leverages a modular pipeline combining Conv3D and Transformer layers to fuse visual, text, and timestep embeddings for accurate physics token regression.
  • Empirical results show stable multi-task optimization over 50 epochs, with the physics regression loss converging faster than the diffusion loss in early training.

PredictorP is a physics token regression network developed as an integral component of the PhysVideoGenerator framework, designed to embed explicit, learnable physics priors into deep video generation models. Addressing common deficiencies in video diffusion models—such as artifacts stemming from inadequate modeling of real-world dynamics—PredictorP regresses high-level physical representations, termed "physics tokens," from noisy diffusion latents. These tokens are subsequently injected via cross-attention into the temporal backbone of a DiT-based generator, steering the generative process toward physical plausibility and temporal consistency (Satish et al., 7 Jan 2026).

1. Functional Overview and Purpose

PredictorP’s central objective is to reconstruct patch-level, high-dimensional physics tokens, obtained from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2), directly from the noisy latents produced during the diffusion process. Given the following inputs:

  • Noisy diffusion latent $z_t \in \mathbb{R}^{4 \times 16 \times 32 \times 32}$ (from a VAE encoding of $16 \times 256 \times 256$ video frames),
  • Text conditioning embedding $c_{\text{text}} \in \mathbb{R}^{226 \times 4096}$,
  • Timestep embedding $t_{\mathrm{emb}} \in \mathbb{R}^{512}$,

PredictorP regresses physics tokens $\hat{p} \in \mathbb{R}^{2048 \times 1408}$, which are then injected through a dedicated cross-attention mechanism into every temporal transformer block of the video generator backbone.
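
The input and output shapes listed above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the helper `latent_shape` is hypothetical, and the 8× spatial downsampling factor with no temporal compression is an assumption inferred from the stated mapping of $16 \times 256 \times 256$ frames to a $4 \times 16 \times 32 \times 32$ latent.

```python
def latent_shape(frames, height, width, spatial_factor=8, latent_channels=4):
    """Shape (C, T, H, W) of a VAE latent.

    Assumption: the VAE downsamples height and width by `spatial_factor`
    and leaves the frame axis untouched.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")
    return (latent_channels, frames,
            height // spatial_factor, width // spatial_factor)


# Shapes as stated in the text.
z_t_shape = latent_shape(16, 256, 256)   # noisy diffusion latent
c_text_shape = (226, 4096)               # text conditioning embedding
t_emb_shape = (512,)                     # timestep embedding
p_hat_shape = (2048, 1408)               # regressed physics tokens

print(z_t_shape, c_text_shape, t_emb_shape, p_hat_shape)
```

Under these assumptions the helper reproduces the latent shape given for $z_t$, which makes it easier to see how the shapes would change for other clip lengths or resolutions.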

This architecture demonstrates that diffusion latents retain sufficient information to recover detailed physical representations, thereby enabling the generative model to benefit from explicit physics guidance.

2. PredictorP Architecture and Layer-by-Layer Specification

The architecture is a modular pipeline combining convolutional and transformer-based components for multimodal fusion and token decoding. The structural flow comprises:

  1. 3D Convolutional Encoder: Processes $z_t$ through 3–4 Conv3D layers (kernel size $3 \times 3 \times 3$, stride $(1,2,2)$ for $(T,H,W)$, padding 1, 512 channels), resulting in $h_{\mathrm{vis}} \in \mathbb{R}^{512 \times 8 \times 16 \times 16}$.
  2. Flatten & Concatenate: Spatial and temporal axes are flattened into a sequence of length $2048$ ($8 \times 16 \times 16$). This sequence is concatenated with the projected text and timestep embeddings, yielding a combined sequence $[\mathrm{Flatten}(h_{\mathrm{vis}});\ \mathrm{proj}(c_{\text{text}});\ \mathrm{proj}(t_{\mathrm{emb}})]$.
  3. Transformer Encoder (Fusion): The concatenated input is projected to dimension 512 and fed to a 4-layer TransformerEncoder (8 heads, model dim 512, MLP dim 2048), producing $h_{\mathrm{fused}} \in \mathbb{R}^{(2048+226+1) \times 512}$.
  4. Transformer Decoder (Physics Queries): Employs $2048$ learnable queries $Q_{\mathrm{phys}} \in \mathbb{R}^{2048 \times 512}$, decoded via cross-attention against $h_{\mathrm{fused}}$, yielding intermediate representations $\tilde{p} \in \mathbb{R}^{2048 \times 512}$.
  5. Linear Projection: Final mapping $\mathrm{Lin}: \mathbb{R}^{512} \to \mathbb{R}^{1408}$ generates predicted physics tokens $\hat{p} \in \mathbb{R}^{2048 \times 1408}$.
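
The shape bookkeeping of steps 1 to 3 can be traced with the standard convolution output-size formula. The sketch below is illustrative only. The helpers `conv_out` and `encoder_shape` are hypothetical, and the per-layer stride schedule is an assumption: the text gives a single stride of $(1,2,2)$, but applying that stride in every layer does not produce the stated $8 \times 16 \times 16$ grid, so the sketch assumes one layer with stride $(2,2,2)$ followed by shape-preserving layers.

```python
def conv_out(n, kernel=3, stride=1, padding=1):
    """Output length along one axis of a convolution (floor formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def encoder_shape(thw, strides, channels=512):
    """Apply a list of per-layer (T, H, W) strides to an input (T, H, W)."""
    for layer_stride in strides:
        thw = tuple(conv_out(n, stride=s) for n, s in zip(thw, layer_stride))
    return (channels, *thw)


latent_thw = (16, 32, 32)  # (T, H, W) of z_t

# Assumed schedule: one downsampling layer, then two shape-preserving layers.
assumed = [(2, 2, 2), (1, 1, 1), (1, 1, 1)]
h_vis = encoder_shape(latent_thw, assumed)

# For comparison: stride (1, 2, 2) applied in all three layers.
uniform = encoder_shape(latent_thw, [(1, 2, 2)] * 3)

visual_tokens = h_vis[1] * h_vis[2] * h_vis[3]  # flattened sequence length
fused_len = visual_tokens + 226 + 1             # + text tokens + timestep token

print(h_vis, uniform, visual_tokens, fused_len)
```

With the assumed schedule the encoder output flattens to 2048 visual tokens, and the fused sequence has $2048 + 226 + 1 = 2275$ entries, matching the dimensions given for $h_{\mathrm{fused}}$.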

This layered approach is engineered to recover the high-bandwidth, spatiotemporally structured physical representations necessary for controlling the underlying dynamics in the generative process.

3. Physics Token Representation and Semantics

Physics tokens are patch-level embeddings computed in advance using the V-JEPA 2 encoder. For a 16-frame video clip:

  • V-JEPA 2 extracts $16 \times 16$ spatial grid embeddings over 8 frames, generating $2048$ tokens per clip.
  • Each token is a $1408$-dimensional vector.
  • These vectors encode high-level predictive attributes including object motion patterns, collision events, and gravitational effects.

Aligning the output space of PredictorP with these tokens enables direct supervision and facilitates physically informed conditioning of the generative model, exploiting the semantic content encoded by V-JEPA 2.
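
The token count follows from how a patch-based video encoder tiles the clip. The sketch below is a rough illustration under stated assumptions: the helper `token_grid` is hypothetical, and the patch size of 16 pixels and tubelet size of 2 frames are assumed values chosen because they reproduce the $8 \times 16 \times 16$ grid described above for a $16 \times 256 \times 256$ clip.

```python
def token_grid(frames, height, width, patch=16, tubelet=2):
    """(T, H, W) token grid of a patch-based video encoder.

    Assumption: each token covers `tubelet` frames and a patch x patch
    pixel area, with no overlap between tokens.
    """
    return (frames // tubelet, height // patch, width // patch)


grid = token_grid(16, 256, 256)
num_tokens = grid[0] * grid[1] * grid[2]
token_dim = 1408

# Storage for one clip of precomputed targets in float32 (4 bytes per value).
bytes_per_clip = num_tokens * token_dim * 4

print(grid, num_tokens, bytes_per_clip / 2**20, "MiB")
```

Because the tokens are computed in advance, the storage figure gives a sense of the per-clip cost of caching the regression targets.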

4. Training Objectives and Multi-Task Optimization

PredictorP is trained jointly with the video diffusion generator using a composite loss:

  • Diffusion Loss ($\mathcal{L}_{\mathrm{diff}}$): Standard DDPM noise regression with physics token conditioning,

$$\mathbb{E}_{t,\,z_0,\,\epsilon}\left[ \|\epsilon - \epsilon_\theta(z_t, t, c_{\text{text}}, \hat{p})\|_2^2 \right].$$

  • Physics Regression Loss ($\mathcal{L}_{\mathrm{phys}}$): $L_2$ regression to ground-truth V-JEPA 2 tokens,

$$\mathbb{E}_{z_t,\,t}\left[ \|\hat{p} - p_{\mathrm{gt}}\|_2^2 \right].$$

  • Total Loss ($\mathcal{L}_{\mathrm{total}}$): Weighted sum,

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{diff}} + \lambda\,\mathcal{L}_{\mathrm{phys}}, \quad \lambda = 0.1.$$

The weighting factor $\lambda$ is empirically set to ensure gradient contributions remain balanced. Training progress indicates that both losses decrease steadily, without oscillation or divergence, over 50 epochs, providing evidence for stable multi-task joint optimization.
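
The composite objective can be made concrete on toy values. The following is a minimal sketch rather than the training code: `mse` and `total_loss` are hypothetical helpers operating on flat lists, and averaging the squared error over elements (instead of summing it) is an assumed normalization that the text does not specify.

```python
def mse(pred, target):
    """Mean squared error over two equal-length flat sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def total_loss(eps, eps_hat, p_hat, p_gt, lam=0.1):
    """Return (L_total, L_diff, L_phys) with L_total = L_diff + lam * L_phys."""
    l_diff = mse(eps_hat, eps)   # noise-prediction term
    l_phys = mse(p_hat, p_gt)    # physics-token regression term
    return l_diff + lam * l_phys, l_diff, l_phys


# Toy example with two-element "tensors".
l_total, l_diff, l_phys = total_loss(
    eps=[0.0, 0.0], eps_hat=[1.0, 1.0],
    p_hat=[2.0, 0.0], p_gt=[0.0, 0.0],
)
print(l_total, l_diff, l_phys)
```

With $\lambda = 0.1$, a physics loss twice the size of the diffusion loss contributes only a fifth as much to the total, which illustrates how the weighting keeps the auxiliary task from dominating.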

5. Cross-Attention Integration within the Temporal Generator

Physical guidance is imposed on the generated video sequence via cross-attention mechanisms within the temporal transformer blocks of the DiT-based (Latte) generator:

  • For each spatial patch representation $x_{\mathrm{temp}} \in \mathbb{R}^{F \times 512}$ (where $F$ is the number of frames), standard temporal self-attention is first applied.
  • The resulting features $\tilde{x}$ are then modulated by cross-attention against the regressed physics tokens $\hat{p}$:

$$x'_{\mathrm{temp}} = \tilde{x} + \mathrm{CrossAttn}(Q = W_q \tilde{x},\, K = W_k \hat{p},\, V = W_v \hat{p})$$

with $W_q, W_k, W_v \in \mathbb{R}^{512 \times 512}$ as learned projections.

  • This cross-attention enables each patch’s temporal evolution to attend to global physical signals, thus “steering” the diffusion backbone toward physically consistent video generation.
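
The residual cross-attention update can be sketched in plain Python on toy dimensions. This is an illustration under assumptions, not the generator's implementation: it uses a single attention head, the helper names are hypothetical, and the key and value projections are assumed to map from the physics-token width to the model width (the tokens are 1408-dimensional while the features are 512-dimensional, so some such mapping is needed before the residual sum).

```python
import math


def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attn_residual(x, p_hat, w_q, w_k, w_v):
    """x' = x + softmax(Q K^T / sqrt(d)) V with Q from x and K, V from p_hat."""
    q = matmul(x, w_q)                       # (F, d)
    k = matmul(p_hat, w_k)                   # (N, d)
    v = matmul(p_hat, w_v)                   # (N, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]     # (d, N)
    scores = matmul(q, k_t)                  # (F, N)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)            # (F, d)
    return [[xi + ai for xi, ai in zip(xr, ar)]
            for xr, ar in zip(x, attended)]


# Toy sizes: F=2 frames, model width 2, N=3 physics tokens of width 4.
x = [[1.0, 0.0], [0.0, 1.0]]
p_hat = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
w_v = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2], [0.0, 0.0]]
print(cross_attn_residual(x, p_hat, w_q, w_k, w_v))
```

Because the update is residual, a zero value projection leaves the temporal features unchanged, so the physics pathway can only add to what temporal self-attention already computed.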

6. Empirical Analysis and Experimental Findings

Empirical validation of PredictorP was conducted over 50 epochs on a subset of OpenVid-1M. Both the noise prediction loss ($\mathcal{L}_{\mathrm{diff}}$) and the physics token regression loss ($\mathcal{L}_{\mathrm{phys}}$) exhibited smooth convergence without instability. Notably, the physics regression loss converged more rapidly in early epochs, indicating that even partially denoised latents $z_t$ contain sufficient information to reconstruct the V-JEPA 2 physical representations.

Planned ablation studies aim to compare:

  1. Backbone alone without physics token injection,
  2. PredictorP present but cross-attention disabled,
  3. Varying $\lambda$ to examine multi-task tradeoffs.

The primary result is the demonstration that PredictorP can be optimized stably and jointly with the diffusion backbone, establishing that the mapping from $z_t$ to $\hat{p}$ is tractable and effective for physics-aware generative modeling (Satish et al., 7 Jan 2026).

7. Implementation Workflow and Pseudocode

The forward computation and integration of PredictorP can be summarized by the following pseudocode, which directly reflects the architectural description:

def PredictorP(z_t, c_text, t_emb):
    h_vis = flatten(Conv3DEnc(z_t))            # [512,8,16,16] → [2048,512]
    seq   = concatenate(h_vis, project(c_text), project(t_emb))
    fused = TransEnc(seq)                      # → [(2048+226+1),512]
    queries = Q_phys                           # [2048,512] learnable
    dec_out = TransDec(queries, fused)         # → [2048,512]
    hat_p  = Linear(dec_out)                   # → [2048,1408]
    return hat_p

def DiffusionBlock(x, hat_p):
    x1 = TemporalSelfAttn(x)
    delta = CrossAttn(Q=proj_q(x1), K=proj_k(hat_p), V=proj_v(hat_p))
    return x1 + delta                          # residual update

for each batch of (z0, c_text, p_gt):
    t    = sample_timestep()
    eps  = sample_noise()
    zt   = forward_diffusion(z0, t, eps)
    hat_p = PredictorP(zt, c_text, timestep_embed(t))
    # run through the DiT (Latte) backbone, injecting hat_p:
    eps_hat = DiffusionModel(zt, t, c_text, hat_p)
    L_diff = ||eps - eps_hat||²
    L_phys = ||hat_p - p_gt||²
    L_total = L_diff + 0.1 * L_phys
    backprop(L_total)
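
The `forward_diffusion` step in the pseudocode is left abstract. A minimal sketch of the standard DDPM forward process, $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, is given below. The linear beta schedule with 1000 steps and endpoints $10^{-4}$ and $0.02$ is an assumption taken from common DDPM defaults, since the noise schedule used with PredictorP is not stated here, and the latent is represented as a flat list for brevity.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t (assumed linear schedule)."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_diffusion(z0, t, eps):
    """Noise a clean latent z0 to timestep t using the given noise sample eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]


rng = random.Random(0)                       # fixed seed for repeatability
z0 = [0.5, -1.0, 2.0]                        # toy "latent"
eps = [rng.gauss(0.0, 1.0) for _ in z0]      # Gaussian noise sample
zt = forward_diffusion(z0, 500, eps)
print(alpha_bar(0), alpha_bar(500), alpha_bar(999))
```

As $t$ grows, $\bar\alpha_t$ shrinks and $z_t$ carries less of the clean latent, which is the regime in which the physics regression becomes harder.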

This design supports end-to-end training and joint optimization of physics-aware video generation systems by leveraging explicit, high-capacity physics priors regressed from intermediate diffusion states. The feasibility of this approach is established through empirical training dynamics and architectural analysis (Satish et al., 7 Jan 2026).

