
LikePhys: Intuitive Physics in Video Diffusion

Updated 20 October 2025
  • LikePhys is a model-based evaluation protocol that quantifies intuitive physics in video diffusion models by using the denoising loss as an ELBO-based likelihood surrogate.
  • It employs paired video simulations with controlled physical law violations to differentiate between plausible and implausible dynamics, measured via the Plausibility Preference Error (PPE).
  • The protocol spans diverse physical domains and scenarios, providing actionable insights into model scalability and the challenges of capturing complex spatio-temporal physics.

LikePhys is a model-based evaluation protocol for quantifying intuitive physics understanding in video diffusion models, leveraging the denoising objective as an ELBO-based likelihood surrogate to discriminate between physically plausible and implausible videos. This method introduces a training-free approach for benchmarking generative models with respect to physics correctness, based on controlled simulation datasets spanning multiple physical domains.

1. Evaluation Principle and Methodology

LikePhys operates by measuring a video diffusion model's ability to reliably assign higher likelihoods to physically valid video sequences, as opposed to impossible counterparts constructed via controlled perturbations. Videos are paired such that each "valid" clip obeys the relevant physical law (e.g., conservation of energy, temporal continuity), and its "invalid" partner displays a specific violation (e.g., nonphysical object motion). The critical metric is the denoising loss:

$$\mathcal{L}_{\text{denoise}}(\theta; x_t) = \mathbb{E}_{t, \epsilon} \lVert \epsilon - \epsilon_{\theta}(x_t, t) \rVert^2$$

where $\epsilon_{\theta}(x_t, t)$ is the noise the model predicts for the noised video $x_t$ at diffusion timestep $t$. Because this loss serves as an ELBO-based surrogate for the model's likelihood, a lower loss indicates a higher estimated likelihood, so LikePhys can infer a "preference" for physical plausibility without explicit labels.
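
For reference, the link between the denoising loss and likelihood can be sketched for a standard DDPM-style noise-prediction parameterisation. This is an assumption made for illustration: the noise schedule $\bar{\alpha}_t$, the weights $w_t$, and the constant $C$ are not defined in the text above, and individual video models may use other parameterisations (for example velocity prediction or flow matching).

```latex
% Forward (noising) process, assuming a DDPM-style schedule \bar{\alpha}_t
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Variational bound: w_t > 0 are timestep-dependent weights,
% C collects terms that do not depend on \theta
-\log p_\theta(x_0) \;\le\;
\mathbb{E}_{t, \epsilon}\!\left[ w_t \,\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right] + C
```

Under these assumptions, and up to the treatment of the final decoding term, a lower denoising loss on a video corresponds to a tighter (higher) lower bound on its log-likelihood, which is the sense in which the loss acts as a likelihood surrogate.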

Videos in each paired set are rendered with strictly matched visual parameters (lighting, camera, textures), ensuring that denoising loss differences are solely attributable to physical plausibility rather than appearance.
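
The paired comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `denoising_loss`, `toy_eps_model`, the linear `alpha_bar` schedule, and the one-dimensional "ball drop" trajectories are all hypothetical stand-ins, and a DDPM-style forward process is assumed. The toy model is built to expect the valid trajectory, so it only demonstrates the mechanics of the comparison.

```python
import numpy as np

def denoising_loss(video, eps_model, alpha_bar, n_samples=64, seed=0):
    """Monte Carlo estimate of E_{t,eps} ||eps - eps_model(x_t, t)||^2 for one video."""
    rng = np.random.default_rng(seed)  # fixed seed: every video sees the same (t, eps) draws
    total = 0.0
    for _ in range(n_samples):
        t = int(rng.integers(len(alpha_bar)))    # random diffusion timestep
        eps = rng.standard_normal(video.shape)   # Gaussian noise
        # Assumed DDPM-style forward process
        x_t = np.sqrt(alpha_bar[t]) * video + np.sqrt(1.0 - alpha_bar[t]) * eps
        total += float(np.sum((eps - eps_model(x_t, t)) ** 2))
    return total / n_samples

# Hypothetical noise schedule and a 24-frame "video" reduced to one number per frame.
alpha_bar = np.linspace(0.99, 0.01, 50)
frames = np.linspace(0.0, 1.0, 24)
valid = 10.0 - 0.5 * 9.8 * frames ** 2   # ball height under gravity
invalid = valid.copy()
invalid[12:] = valid[12]                  # violation: the ball freezes in mid-air

def toy_eps_model(x_t, t):
    """Toy stand-in for a trained denoiser: it assumes the clean video is `valid`."""
    return (x_t - np.sqrt(alpha_bar[t]) * valid) / np.sqrt(1.0 - alpha_bar[t])

loss_valid = denoising_loss(valid, toy_eps_model, alpha_bar)
loss_invalid = denoising_loss(invalid, toy_eps_model, alpha_bar)
print(loss_valid < loss_invalid)  # True for this toy setup
```

Reusing the same seed for both videos keeps the sampled timesteps and noise identical across the pair, so the loss gap reflects only the difference between the two clips. With a real video diffusion model, `toy_eps_model` would be replaced by the model's noise prediction on (latent) video tensors.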

2. Plausibility Preference Error (PPE)

The statistic that encapsulates model performance is the Plausibility Preference Error (PPE), defined for $M$ valid and $N$ invalid samples as:

$$\text{PPE} = \frac{1}{MN} \sum_{j=1}^{M} \sum_{k=1}^{N} \mathbf{1}\left[\mathcal{L}_{\text{denoise}}(\theta; x_j^+) \geq \mathcal{L}_{\text{denoise}}(\theta; x_k^-)\right]$$

where $x_j^+$ and $x_k^-$ are valid and invalid videos, respectively. A low PPE indicates that the model robustly prefers physically plausible content according to its internal density estimation.
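
The PPE definition translates directly into a few lines of array code. The sketch below assumes the per-video denoising losses have already been computed; the function name `plausibility_preference_error` and the example loss values are illustrative only.

```python
import numpy as np

def plausibility_preference_error(valid_losses, invalid_losses):
    """Fraction of (valid, invalid) pairs where the valid video's loss is >= the invalid one's."""
    valid = np.asarray(valid_losses, dtype=float)[:, None]      # shape (M, 1)
    invalid = np.asarray(invalid_losses, dtype=float)[None, :]  # shape (1, N)
    # Broadcasting yields an (M, N) boolean matrix of "wrong preference" indicators.
    return float(np.mean(valid >= invalid))

# Made-up losses for M = 3 valid and N = 2 invalid videos.
ppe = plausibility_preference_error([0.10, 0.12, 0.30], [0.20, 0.25])
print(ppe)  # 2 of the 6 pairs are mis-ranked
```

Here only the third valid video (loss 0.30) is ranked above both invalid videos, giving 2 errors out of 6 pairs. A model that cannot tell the two groups apart would be expected to land near 0.5, so values well below that indicate a preference for plausible dynamics.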

LikePhys reports alignment between PPE and human preference, with Kendall's $\tau \approx 0.44$, which surpasses prior evaluators such as VideoPhy and Qwen2.5 VL.
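
To make the rank-correlation statistic concrete, the sketch below computes Kendall's $\tau$ (the tau-a variant, which assumes no ties) between two score lists. Everything here is illustrative: the function `kendall_tau`, the five "models", and both score lists are made up and are not data from the paper, and the exact way the paper pairs PPE with human ratings is not specified in this article.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1   # both lists order the pair the same way
        elif sign < 0:
            discordant += 1   # the lists disagree on this pair
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five models (lower is better in both lists).
ppe_scores = [0.52, 0.47, 0.41, 0.38, 0.44]
human_error = [0.60, 0.50, 0.45, 0.30, 0.35]
tau = kendall_tau(ppe_scores, human_error)
print(tau)  # 9 concordant pairs, 1 discordant pair -> 0.8
```

A value of 1 means the two rankings agree on every pair, 0 means no association, and -1 means complete disagreement.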

3. Physics Benchmark Suite

The experimental protocol spans twelve scenarios in four principal domains:

| Domain | Scenarios |
|---|---|
| Rigid Body Mechanics | Ball Collision, Ball Drop, Block Slide, Pendulum Oscillation, Pyramid Impact |
| Continuum Mechanics | Cloth Drape, Cloth Waving |
| Fluid Mechanics | Droplet Fall, Faucet Flow, River Flow |
| Optical Effects | Moving Shadow, Orbit Shadow |

Each scenario is constructed in Blender, guaranteeing that valid–invalid pairs are visually matched except for controlled physical law violations. This isolates the evaluation to purely physics-related model capacities.

4. Model Scaling and Architecture Effects

Empirical results indicate that larger model capacity and longer context are associated with lower PPE, that is, with better physics understanding. Transformer-based architectures (DiT) outperform earlier UNet-style designs, particularly as the number of generated frames increases and the model can coordinate over longer temporal ranges. PPE consistently declines as the context window grows, supporting the view that temporal depth enables better modeling of spatio-temporal physical coherence.

Classifier-free guidance (CFG) strength does not significantly affect PPE, suggesting that physics correctness is not sensitive to guidance scale within current settings.
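
For readers unfamiliar with the guidance scale, one common convention for classifier-free guidance combines conditional and unconditional noise predictions as shown below. The symbols $w$ (guidance scale), $c$ (conditioning, such as a text prompt), and $\varnothing$ (empty conditioning) are introduced here for illustration; how LikePhys plugs the guided prediction into the denoising loss is not specified in this article.

```latex
% Classifier-free guidance, one common convention:
% w = 1 recovers the purely conditional prediction, w > 1 strengthens the conditioning
\tilde{\epsilon}_\theta(x_t, t, c)
  = \epsilon_\theta(x_t, t, \varnothing)
  + w \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right)
```

Under this reading, the reported insensitivity means that varying $w$ leaves the relative ordering of valid and invalid losses largely unchanged.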

5. Domain-Specific Physics Sensitivity

Analysis across scenarios identifies distinct challenges:

  • Rigid Body and Continuum Mechanics: Moderate PPE values, with better performance on phenomena aligned with static image priors.
  • Fluid Mechanics: Highest error rates, particularly on river flow, marking limitations in modeling nonlinear and chaotic dynamics.
  • Optical Effects: Lower PPE, likely due to tight constraints imposed by vast pretraining on static image corpora.

Physical laws demanding long-range spatio-temporal reasoning (e.g., temporal continuity, conservation of momentum) yield higher errors than those related to geometric invariance, indicating that current models more readily internalize properties inheritable from static pretraining.
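
A per-domain breakdown of this kind can be obtained by averaging scenario-level PPE within each domain. The sketch below uses the domain and scenario names from the benchmark suite, but the helper `ppe_by_domain` and every numeric value are placeholders chosen for illustration, not results from the paper.

```python
from statistics import mean

# Domain -> scenarios, as listed in the benchmark suite.
DOMAINS = {
    "Rigid Body Mechanics": ["Ball Collision", "Ball Drop", "Block Slide",
                             "Pendulum Oscillation", "Pyramid Impact"],
    "Continuum Mechanics": ["Cloth Drape", "Cloth Waving"],
    "Fluid Mechanics": ["Droplet Fall", "Faucet Flow", "River Flow"],
    "Optical Effects": ["Moving Shadow", "Orbit Shadow"],
}

def ppe_by_domain(scenario_ppe):
    """Average scenario-level PPE within each domain."""
    return {domain: mean(scenario_ppe[s] for s in scenarios)
            for domain, scenarios in DOMAINS.items()}

# Placeholder PPE values (NOT measured results): 0.4 everywhere, then a few overrides.
scenario_ppe = {s: 0.4 for scenarios in DOMAINS.values() for s in scenarios}
scenario_ppe.update({"Droplet Fall": 0.5, "Faucet Flow": 0.5, "River Flow": 0.6,
                     "Moving Shadow": 0.3, "Orbit Shadow": 0.3})

by_domain = ppe_by_domain(scenario_ppe)
worst_domain = max(by_domain, key=by_domain.get)  # domain with the highest mean PPE
print(by_domain)
print(worst_domain)
```

Averaging within domains weights each scenario equally; since domains contain between two and five scenarios, a pooled average over all twelve scenarios would weight domains unevenly.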

6. Comparison to Human Judgments

LikePhys evaluation shows substantial covariation with human preferences. Models with advanced architectural features and larger temporal context approach—but do not match—human discrimination between possible and impossible events. The evaluation protocol succeeds in quantitatively separating physics understanding from mere visual realism.

7. Challenges, Limitations, and Research Opportunities

Current video diffusion models exhibit significant difficulty in generating physically plausible dynamics, especially for complex or chaotic systems and in multi-domain transfer. Difficulties include:

  • Disentangling appearance fidelity from physics correctness in synthetic and real video contexts
  • Ensuring global temporal consistency, particularly over long sequence generation
  • Optimizing jointly for physical plausibility and perceptual quality, two objectives that often conflict

Potential improvements include the development of physics-aware loss functions enforcing explicit conservation laws, integration of multiscale memory architectures to capture long-range dependencies, and expansion of benchmark domains and scenario complexity.


In summary, LikePhys provides a robust, quantitatively interpretable, and human-aligned metric for evaluating intuitive physics in video diffusion models. By leveraging the denoising loss directly as a likelihood-based measure of plausibility, it offers a training-free protocol suitable for benchmark scaling, model analysis, and cross-domain comparison—thereby setting a foundation for systematically advancing physics-aware generative modeling (Yuan et al., 13 Oct 2025).
