
Physics-Grounded Video Generation

Updated 4 December 2025
  • Recent methods employ explicit physical constraints and iterative self-refinement loops to enhance realism, achieving measurable gains in physics compliance.
  • They integrate simulation, neural ODEs, and spectral regularizers to enforce fundamental laws such as Newton’s second law, momentum conservation, and energy conservation.
  • Practical methods such as negative prompting and reward tuning enable plug-and-play enforcement of physical behavior in diffusion and transformer models.

Physics-grounded constraints in video generation are explicit mechanisms, losses, algorithms, and representational strategies designed to ensure that synthesized videos—especially those generated by modern diffusion models or transformers—obey the laws of real-world dynamics as dictated by classical and continuum physics. Unlike conventional approaches that rely on data-driven pattern recognition, physics-grounded methods introduce formal priors and explicit penalties that target core physical laws, such as Newton’s second law, momentum and energy conservation, and kinematic bounds. Recent research demonstrates that these constraints can be programmatically enforced through self-refinement loops, physical simulation, neural ODEs, spectral regularizers, and reward-based preference tuning, leading to substantial gains in physical plausibility and controllability of synthesized content.

1. Fundamental Physical Laws and Constraints

Physics-grounded video generation begins with codifying the canonical principles that govern real-world motion:

  • Newton’s Second Law: Every object’s trajectory $x(t)$ must satisfy $\sum F_i(t) = m\,\frac{d^2 x_i}{dt^2}$, or, in discrete time with small $\Delta t$, $m\,(x_i^{t+1} - 2 x_i^t + x_i^{t-1}) / \Delta t^2 = \sum F_i^t$.
  • Conservation of Momentum and Energy: For isolated collisions, $m_1 v_1 + m_2 v_2 = m_1 v_1' + m_2 v_2'$, and for elastic events, $\frac{1}{2} m_1 \|v_1\|^2 + \frac{1}{2} m_2 \|v_2\|^2 = \frac{1}{2} m_1 \|v_1'\|^2 + \frac{1}{2} m_2 \|v_2'\|^2$.
  • Kinematic Bounds: For free-falling objects, $\|d^2x/dt^2\| \leq g$ (e.g., $g = 9.8\ \text{m/s}^2$).
  • Additional Phenomena: Elasticity (Hooke’s law), refraction (Snell’s law), and fluid behaviors (Bernoulli’s principle) as necessary for multifaceted video content (Liu et al., 25 Nov 2025; Guo et al., 1 May 2025; Chen et al., 26 Mar 2025).

These constraints serve as the backbone for the guidance, feedback, or loss functions across a broad spectrum of architectures.
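As a concrete illustration, the discrete form of Newton’s second law above can be turned into a per-frame residual check on a synthesized trajectory. This is a minimal sketch; the function and variable names are illustrative, not from any cited paper.

```python
# Sketch: checking a trajectory against the discrete Newton's-second-law
# constraint m*(x[t+1] - 2*x[t] + x[t-1]) / dt**2 = F[t] from the list above.
import numpy as np

def newton_residual(x, forces, mass, dt):
    """Per-step violation of F = m*a under central finite differences.

    x      : (T, D) array of object positions over T frames
    forces : (T, D) array of net applied force per frame
    """
    accel = (x[2:] - 2 * x[1:-1] + x[:-2]) / dt ** 2     # (T-2, D)
    return np.linalg.norm(mass * accel - forces[1:-1], axis=-1)

# A ball in free fall (net force m*g downward) should give ~zero residual.
dt, m, g = 0.1, 2.0, 9.8
t = np.arange(20) * dt
x = np.stack([3.0 * t, -0.5 * g * t ** 2], axis=-1)      # constant vx, gravity in y
F = np.tile([0.0, -m * g], (20, 1))
print(newton_residual(x, F, m, dt).max())                # ~ 0 for this trajectory
```

Losses of this shape (e.g., the residual norm summed over frames) are what "explicit penalties targeting core physical laws" refers to in practice.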

2. Iterative Refinement and Multimodal Feedback Loops

One major paradigm for enforcing physics-groundedness is iterative prompt or content refinement via specialized feedback loops. In "Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement" (Liu et al., 25 Nov 2025), the pipeline interleaves a vision-language model (VLM) $f(\cdot)$, which critiques candidate videos for physics violations, with a large language model (LLM) $g(\cdot)$, which rewrites this critique into an enhanced textual prompt for the generator $h(\cdot)$. This multimodal chain-of-thought (MM-CoT) process iteratively updates generation prompts, with the VLM explicitly checking for violations of Newtonian and conservation constraints at each refinement step. The loop halts when the prompt stabilizes under the LLM’s rewriting, producing outputs that increasingly obey physical laws with each iteration.

Ablation tests in (Liu et al., 25 Nov 2025) attribute the greatest improvements to precise VLM detection of $F = ma$ violations and LLM-driven looped rewriting, with up to +6.07 absolute gain in Physics-IQ (PhyIQ) scores over baseline models.
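The refinement loop described above can be sketched as a simple fixed-point iteration. This is a hedged sketch: `generate_fn`, `critique_fn`, and `rewrite_fn` are hypothetical stand-ins for the generator $h(\cdot)$, VLM $f(\cdot)$, and LLM $g(\cdot)$, not the paper's actual interfaces.

```python
# Minimal sketch of a VLM-guided self-refinement loop: generate, critique
# for physics violations, rewrite the prompt, and stop at a fixed point.
def refine_prompt(prompt, generate_fn, critique_fn, rewrite_fn, max_iters=5):
    for _ in range(max_iters):
        video = generate_fn(prompt)                 # h(prompt): candidate video
        critique = critique_fn(video)               # f(video): violation report
        new_prompt = rewrite_fn(prompt, critique)   # g(.): enhanced prompt
        if new_prompt == prompt:                    # prompt stabilized -> halt
            break
        prompt = new_prompt
    return prompt

# Toy usage: the "VLM" flags missing gravity once, the "LLM" appends a fix.
gen = lambda p: p
crit = lambda v: "" if "falls under gravity" in v else "object ignores gravity"
rewrite = lambda p, c: p + ", falls under gravity" if c else p
print(refine_prompt("a ball dropped from a table", gen, crit, rewrite))
```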

3. Physics-Integrated Simulation and Neural Approaches

Physics-grounded constraints can be enforced through explicit simulation or differentiable modeling:

  • Classical and Continuum Simulation: Approaches such as PhysGen3D (Chen et al., 26 Mar 2025) and PhysMotion (Tan et al., 26 Nov 2024) reconstruct full 3D scenes from single images, estimate object physical properties (mass, Young’s modulus, friction, restitution), and simulate object behavior using the Material Point Method (MPM), which solves the continuum momentum equation $\rho \frac{D \mathbf{v}}{Dt} = \nabla \cdot \boldsymbol{\sigma} + \mathbf{f}_{\rm ext}$ with constitutive updates to stress and deformation gradients. User controls (e.g., initial velocities, materials) directly affect simulation outcomes. Such methods guarantee exact satisfaction of Newtonian dynamics at the simulation level and propagate realistic dynamics to video frames via differentiable rendering, with no need for neural end-to-end updates (Chen et al., 26 Mar 2025; Tan et al., 26 Nov 2024).
  • Neural Newtonian Dynamics: In NewtonGen (Yuan et al., 25 Sep 2025), a trainable latent ODE system parameterizes each object’s state (position, velocity, rotation, area, etc.) with linear second-order terms plus a small residual MLP, enforcing the latent state $\mathbf{Z}(t)$ to evolve according to physics-informed neural ODEs. Training minimizes MSE in this latent space, while user-facing controls map directly to initial states or ODE coefficients, yielding controllable and physically consistent video motion.

These methodologies provide both exact compliance with physics and flexible, user-controllable generation pipelines.
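To make the latent-ODE idea concrete, the following sketch integrates a linear second-order system of the kind NewtonGen's description implies, with the learned residual term omitted for brevity. The matrix values are illustrative, not the paper's.

```python
# Sketch of a physics-informed latent ODE: the state z = [position, velocity]
# evolves under linear dynamics dz/dt = A z + b (a learned residual MLP
# would be added to the right-hand side in a full model).
import numpy as np

def integrate(z0, A, b, dt, steps):
    """Explicit-Euler rollout of dz/dt = A z + b from initial state z0."""
    traj = [z0]
    z = z0
    for _ in range(steps):
        z = z + dt * (A @ z + b)
        traj.append(z)
    return np.array(traj)

# Free fall: d(pos)/dt = vel, d(vel)/dt = -g. User control = initial state z0.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
b = np.array([0.0, -9.8])
traj = integrate(np.array([10.0, 0.0]), A, b, dt=0.01, steps=100)
print(traj[-1])   # after 1 s: velocity is exactly -9.8 under Euler integration
```

Changing `z0` or the coefficients in `A` and `b` directly changes the rollout, which is the sense in which user-facing controls "map to initial states or ODE coefficients."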

4. Spectral and Frequency-Domain Constraints

Physics-aware models can inject motion plausibility at the latent or feature level by regularizing the spectral content of generated videos. In "Motion aware video generative model" (Xue et al., 2 Jun 2025), motion-specific spectral losses are defined for translation, rotation, and scaling. For example, 3D-DCT planes or quadratic surfaces in frequency space correspond to constant or accelerated translation, while rotational motion produces annular rings and discrete temporal peaks. These spectral losses are combined with a zero-initialized frequency-domain enhancement module inserted into the backbone U-Net of a video diffusion model. Fine-tuning these modules with frequency-domain losses (while keeping the primary backbone fixed or lightly tuned) yields substantial improvements in action recognition, motion accuracy, warping error, and temporal alignment metrics—offering a principled frequency-theoretic path for physics-aware motion control (Xue et al., 2 Jun 2025).
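A toy version of a frequency-domain motion loss can illustrate the mechanism. This is not the exact loss of Xue et al.; it simply penalizes the gap between the temporal power spectra of a generated clip and a reference clip exhibiting the desired motion pattern.

```python
# Illustrative spectral motion loss: periodic motion concentrates temporal
# frequency energy at specific bins, so matching spectra encourages
# matching motion statistics.
import numpy as np

def temporal_spectrum(video):
    """video: (T, H, W) grayscale clip -> normalized temporal power spectrum."""
    spec = np.abs(np.fft.rfft(video, axis=0)) ** 2   # FFT along the time axis
    spec = spec.mean(axis=(1, 2))                    # average over pixels
    return spec / spec.sum()

def spectral_motion_loss(generated, reference):
    return np.abs(temporal_spectrum(generated) - temporal_spectrum(reference)).sum()

# A clip matching the reference's motion gives zero loss; a static clip does not.
t = np.arange(16)[:, None, None]
moving = np.sin(2 * np.pi * t / 8.0) * np.ones((16, 4, 4))
static = np.ones((16, 4, 4))
print(spectral_motion_loss(moving, moving), spectral_motion_loss(static, moving))
```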

5. Symbolic and Preference-Based Post-Training

Recent work focuses on model-agnostic post-training to steer existing video generators toward more plausible outputs without retraining from scratch.

  • Physics-based Reward Models: PhysCorr (Wang et al., 6 Nov 2025) introduces PhysicsRM, a dual-reward system measuring both intra-object temporal consistency (via DINOv2 embeddings) and inter-object interaction correctness (via physics-aware video QA). PhyDPO fine-tunes any backbone with a contrastive direct preference optimization loss weighted by the PhyScore gap, focusing model updates on physically non-compliant outputs.
  • Real-Data and Annotation-Free Preference Optimization: RDPO (Qian et al., 23 Jun 2025) reverse-samples real videos through the generator, composes preference pairs based on proximity in the generator’s latent space, and fine-tunes using DPO-style objectives on these pairs. Physical consistency is then quantified by smoothness, momentum error, and gravity-fit residuals; the method yields increases in Physics-IQ, subject consistency, and motion smoothness without requiring human-annotated preference data.
  • Prompt-Level Preference Alignment: PhysHPO (Chen et al., 14 Aug 2025) executes hierarchical DPO at four granularities (instance, state, motion, semantic), leveraging a curated dataset of “good” real-world videos, and finds that each level uniquely boosts physics compliance, as measured on VideoPhy and PhyGenBench.

These methods demonstrate that preference alignment and post-hoc reward tuning can distill physical priors into powerful generative architectures.
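The PhyDPO-style objective described above can be sketched as a standard DPO term scaled by the physics-reward gap. This is an assumed form for illustration, not the paper's exact loss; all argument names are hypothetical.

```python
# Sketch of a physics-weighted DPO loss: a -log(sigmoid) preference term
# scaled by the gap in physics scores, so updates concentrate on pairs
# where the dispreferred sample is clearly physically non-compliant.
import math

def phy_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 phy_score_w, phy_score_l, beta=0.1):
    """logp_*     : policy log-probs of preferred (w) / dispreferred (l) video
       ref_logp_* : frozen reference-model log-probs
       phy_score_*: physics reward of each sample (higher = more plausible)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    weight = phy_score_w - phy_score_l               # physics-score gap
    return -weight * math.log(1.0 / (1.0 + math.exp(-margin)))

# Larger physics gaps produce proportionally larger losses at the same margin.
print(phy_dpo_loss(-1.0, -2.0, -1.2, -1.9, phy_score_w=0.9, phy_score_l=0.2))
```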

6. Training-Free and Plug-and-Play Constraint Enforcement

Some methods enforce physics without parameter updates, leveraging runtime guidance:

  • Negative Prompting and Structured Decoupled Guidance: Hao et al. (29 Sep 2025) generate counterfactual prompts encoding targeted physics violations for the same entities and scene, then use "Synchronized Decoupled Guidance" (SDG) to push the generator away from the implausible trajectory at each denoising step. This is achieved by normalizing the positive-negative discrepancy direction and running two decoupled latent trajectories, ensuring immediate and persistent suppression of implausible behavior. The approach is plug-and-play across any diffusion-style backbone, requires no training, and yields measurable improvements in physics alignment.
  • Verifiable Rewards via Frozen Utility Models: NewtonRewards (Le et al., 29 Nov 2025) extracts measurable proxies from generated videos (optical flow for velocity, high-level features for mass) via frozen models (e.g., RAFT, V-JEPA-2). Training exclusively targets Newtonian kinematic structure (enforcing constant acceleration) and mass conservation through differentiable rewards, leading to lower trajectory and velocity errors under both in-distribution and OOD conditions, across motion primitives such as free fall and ramp motion.

7. Benchmarks and Quantitative Assessment

Systematic evaluation frameworks now probe the degree to which generative models obey physical laws. T2VPhysBench (Guo et al., 1 May 2025) defines compliance over twelve core laws (grouped by Newtonian, conservation, and phenomenon principles) using multi-annotator human scoring. It finds that commercial and open-source models uniformly score below 0.60 (best performer Wan 2.1: 0.56 for Newton's principles) and that prompt-level physical hints rarely fix violations. Quantitative metrics include Physics-IQ [PhyIQ], Physical Invariance Score (PIS) (Yuan et al., 25 Sep 2025), Motion Smoothness, and domain-specific adherence measures (e.g., spectral motion errors, momentum conservation residuals).

Preference-optimization methods (PhysCorr, PhysHPO, RDPO) also leverage VideoPhy, VBench, Physics-IQ, and human studies across physical realism, semantic alignment, and temporal coherence.
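One of the metrics listed above, the momentum-conservation residual, is straightforward to compute from tracked velocities. The exact benchmark formulas differ across papers; this is a hedged illustration for a two-object collision.

```python
# Momentum-conservation residual for a two-object collision event,
# computed from velocities tracked before and after the collision.
import numpy as np

def momentum_residual(m1, v1, m2, v2, v1_post, v2_post):
    """Norm of the change in total momentum across the collision."""
    before = m1 * np.asarray(v1) + m2 * np.asarray(v2)
    after = m1 * np.asarray(v1_post) + m2 * np.asarray(v2_post)
    return np.linalg.norm(after - before)

# Elastic equal-mass head-on collision: velocities swap, residual is zero.
print(momentum_residual(1.0, [2.0, 0.0], 1.0, [-1.0, 0.0],
                        [-1.0, 0.0], [2.0, 0.0]))
```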

Table: Representative Physical Metrics in Video Generation

| Method / Benchmark | Physical Law Target | Evaluation Metric | Top Score / Gain |
|---|---|---|---|
| MM-CoT (Liu et al., 25 Nov 2025) | Newton, conservation | Physics-IQ | +6 pts over baseline |
| PhysGen3D (Chen et al., 26 Mar 2025) | Newton, elasticity | Human realism Likert, VBench | 3.7/5 (realism), best |
| NewtonGen (Yuan et al., 25 Sep 2025) | Newtonian dynamics | Physical Invariance Score | Outperforms Sora, Veo3 |
| Spectral (Xue et al., 2 Jun 2025) | Translation, rotation | Action Rec., Motion Accur. | +9 pts, +5 pts |
| PhysCorr (Wang et al., 6 Nov 2025) | Stability, interactions | VBench2 (Mechanics, Rationality) | +1.75%, +29.99% |
| PhysHPO (Chen et al., 14 Aug 2025) | Multi-level physics | VideoPhy, PhyGenBench | +4.6 (VideoPhy), +0.07 |
| NewtonRewards (Le et al., 29 Nov 2025) | Newtonian kinematics | Velocity/Accel. RMSE | -8.5% RMSE (ID/OOD) |

8. Limitations, Frontiers, and Open Problems

Current methods remain restricted by the coverage of physical phenomena:

  • Rigid-Body vs. Continuum: Most frameworks address rigid-body or single-material systems; accurate multi-object, fluid, or highly deformable bodies (e.g., snow, sand, melting) require more complex simulators or differentiable engines.
  • Prompt & Reasoning Bottlenecks: Chain-of-thought planning, VLM feedback, and negative prompt generation increasingly depend on robust language and vision models. Errors in VLM/LLM physics interpretation propagate through the pipeline, and more complex laws (electromagnetism, thermodynamics, turbulence) pose significant challenges (Liu et al., 25 Nov 2025).
  • Evaluation Scarcity: Ground-truth evaluation for 3D motion, multi-body contact, or real materials is hampered by limited annotated or simulation-derived datasets (Meng et al., 10 Feb 2025).
  • Sim2Real and Robustness: Bridging simulated physics and real-world video remains difficult; universal physics priors for unseen materials or interactions remain an open area.

Advances will likely depend on tight integration of symbolic planners, differentiable simulators, spectral constraints, and scalable, preference-based fine-tuning.

