LikePhys: Intuitive Physics in Video Diffusion

Updated 20 October 2025

LikePhys is a model-based evaluation protocol that quantifies intuitive physics in video diffusion models by using the denoising loss as an ELBO-based likelihood surrogate.
It employs paired video simulations with controlled physical law violations to differentiate between plausible and implausible dynamics, measured via the Plausibility Preference Error (PPE).
The protocol spans diverse physical domains and scenarios, providing actionable insights into model scalability and the challenges of capturing complex spatio-temporal physics.

The approach is training-free: it scores pre-trained models directly through their denoising objective, treating lower loss as a sign of higher likelihood, and benchmarks physics correctness on controlled simulation datasets that span multiple physical domains.

1. Evaluation Principle and Methodology

LikePhys operates by measuring a video diffusion model's ability to reliably assign higher likelihoods to physically valid video sequences, as opposed to impossible counterparts constructed via controlled perturbations. Videos are paired such that each "valid" clip obeys the relevant physical law (e.g., conservation of energy, temporal continuity), and its "invalid" partner displays a specific violation (e.g., nonphysical object motion). The critical metric is the denoising loss:

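$\mathcal{L}_{\text{denoise}}(\theta; x) = \mathbb{E}_{t, \epsilon} \lVert \epsilon - \epsilon_{\theta}(x_t, t) \rVert^2$

where $x_t$ is the video $x$ noised to diffusion timestep $t$ with Gaussian noise $\epsilon$, and $\epsilon_{\theta}(x_t, t)$ is the model's prediction of that noise. Up to weighting terms and constants, this loss is the ELBO-based bound on the negative log-likelihood, so a lower denoising loss indicates a higher model likelihood. LikePhys can therefore infer a "preference" for physical plausibility without explicit labeling.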
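A minimal sketch of how the expectation above could be estimated by Monte Carlo sampling for a single clip. The cosine-style noise schedule, the flat list standing in for a video latent, and the names `alpha_bar`, `denoising_loss`, `predict_noise`, and `make_oracle` are assumptions for illustration; they are not taken from the LikePhys implementation, which would call an actual video diffusion model.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Cumulative signal level in (0, 1) for timestep t (assumed cosine-style schedule)."""
    return math.cos(0.5 * math.pi * t / (num_steps + 1)) ** 2


def denoising_loss(x0, predict_noise, num_steps=1000, num_samples=64, seed=0):
    """Monte Carlo estimate of E_{t, eps} ||eps - eps_theta(x_t, t)||^2 for one clip.

    x0            : flat list of floats standing in for a (latent) video
    predict_noise : callable (x_t, t) -> list of floats, a stand-in for the model
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        t = rng.randint(1, num_steps)                  # sample a timestep
        a = alpha_bar(t, num_steps)
        eps = [rng.gauss(0.0, 1.0) for _ in x0]        # sample Gaussian noise
        x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        eps_hat = predict_noise(x_t, t)                # model's noise prediction
        total += sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))
    return total / num_samples


def make_oracle(x0, num_steps=1000):
    """Toy predictor that knows x0 and inverts the noising step exactly."""
    def predict(x_t, t):
        a = alpha_bar(t, num_steps)
        return [(xt - math.sqrt(a) * x) / math.sqrt(1.0 - a) for xt, x in zip(x_t, x0)]
    return predict


if __name__ == "__main__":
    clip = [0.1 * i for i in range(8)]
    blind = lambda x_t, t: [0.0] * len(x_t)            # predictor that ignores its input
    print(denoising_loss(clip, make_oracle(clip)))     # close to zero
    print(denoising_loss(clip, blind))                 # roughly the latent dimension
```

Fixing `seed` means both clips in a valid-invalid pair see the same timesteps and noise draws, which reduces estimator variance when the two losses are compared. This is a design choice of the sketch rather than a documented detail of the protocol.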

Videos in each paired set are rendered with strictly matched visual parameters (lighting, camera, textures), ensuring that denoising loss differences are solely attributable to physical plausibility rather than appearance.

2. Plausibility Preference Error (PPE)

The statistic that encapsulates model performance is the Plausibility Preference Error (PPE), defined for $M$ valid and $N$ invalid samples as:

$\text{PPE} = \frac{1}{MN} \sum_{j=1}^M \sum_{k=1}^N 1[\mathcal{L}_{\text{denoise}}(\theta; x_j^+) \geq \mathcal{L}_{\text{denoise}}(\theta; x_k^-)]$

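where $x_j^+$ and $x_k^-$ are valid and invalid videos, respectively. PPE is the fraction of valid-invalid pairs in which the valid video receives a loss at least as high as the invalid one, so it ranges from 0 to 1 and sits near 0.5 for a model whose losses do not depend on plausibility. A low PPE indicates that the model robustly prefers physically plausible content according to its internal density estimation.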
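The PPE formula can be sketched directly from per-video denoising losses. The function name and the loss values below are illustrative placeholders, not results from the paper.

```python
def plausibility_preference_error(valid_losses, invalid_losses):
    """Fraction of (valid, invalid) pairs where the valid loss is >= the invalid loss."""
    if not valid_losses or not invalid_losses:
        raise ValueError("need at least one valid and one invalid loss")
    errors = sum(
        1
        for loss_valid in valid_losses
        for loss_invalid in invalid_losses
        if loss_valid >= loss_invalid      # the model failed to prefer the valid clip
    )
    return errors / (len(valid_losses) * len(invalid_losses))


# Placeholder losses for M = 2 valid and N = 3 invalid clips.
print(plausibility_preference_error([0.8, 1.1], [1.0, 1.3, 0.7]))  # 0.5
```

In this toy case three of the six pairs are misordered (0.8 vs 0.7, 1.1 vs 1.0, 1.1 vs 0.7), giving a PPE of 0.5.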

LikePhys demonstrates strong alignment between PPE and human preference, with a reported Kendall's $\tau \approx 0.44$, and its performance surpasses prior evaluators such as VideoPhy and Qwen2.5-VL.
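For readers unfamiliar with the statistic, the following is a minimal sketch of a Kendall rank correlation between PPE-based scores and human ratings. It uses the tau-a variant, which ignores ties; the source does not state which variant or pairing scheme was used, and all numbers below are made-up placeholders.

```python
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-model values: lower PPE is better, higher human rating is better,
# so PPE is negated before the two orderings are compared.
ppe = [0.30, 0.45, 0.40, 0.55]
human_rating = [4.0, 3.5, 2.5, 2.0]
print(kendall_tau_a([-p for p in ppe], human_rating))
```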

3. Physics Benchmark Suite

The experimental protocol spans twelve scenarios in four principal domains:

Domain	Scenarios
Rigid Body Mechanics	Ball Collision, Ball Drop, Block Slide, Pendulum Oscillation, Pyramid Impact
Continuum Mechanics	Cloth Drape, Cloth Waving
Fluid Mechanics	Droplet Fall, Faucet Flow, River Flow
Optical Effects	Moving Shadow, Orbit Shadow

Each scenario is constructed in Blender, guaranteeing that valid–invalid pairs are visually matched except for controlled physical law violations. This isolates the evaluation to purely physics-related model capacities.
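The suite can be represented as a simple mapping from domain to scenarios, which is convenient for aggregating per-scenario PPE into per-domain averages. The mapping mirrors the table above; the helper `mean_ppe_by_domain` and the unweighted averaging are assumptions of this sketch, not a documented part of the protocol.

```python
BENCHMARK = {
    "Rigid Body Mechanics": [
        "Ball Collision", "Ball Drop", "Block Slide",
        "Pendulum Oscillation", "Pyramid Impact",
    ],
    "Continuum Mechanics": ["Cloth Drape", "Cloth Waving"],
    "Fluid Mechanics": ["Droplet Fall", "Faucet Flow", "River Flow"],
    "Optical Effects": ["Moving Shadow", "Orbit Shadow"],
}


def mean_ppe_by_domain(scenario_ppe):
    """Average per-scenario PPE within each domain, skipping scenarios with no score."""
    result = {}
    for domain, scenarios in BENCHMARK.items():
        values = [scenario_ppe[name] for name in scenarios if name in scenario_ppe]
        if values:
            result[domain] = sum(values) / len(values)
    return result


# Sanity check: four domains, twelve scenarios in total.
print(len(BENCHMARK), sum(len(s) for s in BENCHMARK.values()))  # 4 12
```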

4. Model Scaling, Inference Settings, and Performance Trends

Empirical results reveal that model capacity and context length are directly correlated with improved physics understanding. Transformer-based architectures (DiT) outperform earlier UNet-style designs, particularly as the number of generated frames increases (providing longer-range temporal coordination). PPE consistently declines as the context window grows, confirming that temporal depth enables better modeling of spatio-temporal physical coherence.

Classifier-free guidance (CFG) strength does not significantly affect PPE, suggesting that physics correctness is not sensitive to guidance scale within current settings.

5. Domain-Specific Physics Sensitivity

Analysis across scenarios identifies distinct challenges:

Rigid Body and Continuum Mechanics: Moderate PPE values, with better performance on phenomena aligned with static image priors.
Fluid Mechanics: Highest error rates, particularly on river flow, marking limitations in modeling nonlinear and chaotic dynamics.
Optical Effects: Lower PPE, likely due to tight constraints imposed by vast pretraining on static image corpora.

Physical laws demanding long-range spatio-temporal reasoning (e.g., temporal continuity, conservation of momentum) yield higher errors than those related to geometric invariance, indicating that current models more readily internalize properties inheritable from static pretraining.

6. Comparison to Human Judgments

LikePhys evaluation shows substantial covariation with human preferences. Models with advanced architectural features and larger temporal context approach—but do not match—human discrimination between possible and impossible events. The evaluation protocol succeeds in quantitatively separating physics understanding from mere visual realism.

7. Challenges, Limitations, and Research Opportunities

Current video diffusion models exhibit significant difficulty in generating physically plausible dynamics, especially for complex or chaotic systems and in multi-domain transfer. Difficulties include:

Disentangling appearance fidelity from physics correctness in synthetic and real video contexts
Ensuring global temporal consistency, particularly over long sequence generation
Simultaneous optimization for physical plausibility and perceptual quality, often requiring conflicting objectives

Potential improvements include the development of physics-aware loss functions enforcing explicit conservation laws, integration of multiscale memory architectures to capture long-range dependencies, and expansion of benchmark domains and scenario complexity.

In summary, LikePhys provides a robust, quantitatively interpretable, and human-aligned metric for evaluating intuitive physics in video diffusion models. By leveraging the denoising loss directly as a likelihood-based measure of plausibility, it offers a training-free protocol suitable for benchmark scaling, model analysis, and cross-domain comparison—thereby setting a foundation for systematically advancing physics-aware generative modeling (Yuan et al., 13 Oct 2025).

References (1)

LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference (2025)
