NewtonRewards: Physics-Guided Video Generation
- NewtonRewards is a physics-guided post-training framework that improves video generation by enforcing Newtonian laws via differentiable, rule-based rewards.
- It computes verifiable rewards from optical flow (a motion proxy) and appearance embeddings (a mass proxy), enforcing constant acceleration and preventing motion collapse.
- Empirical results on NewtonBench-60K show enhanced visual fidelity and kinematic accuracy across standard Newtonian motion primitives, even under out-of-distribution conditions.
NewtonRewards is a physics-grounded post-training framework designed to improve physical realism in video generation models. It operates by leveraging verifiable, rule-based rewards derived entirely from measurable proxies extracted from generated videos. The system is characterized by its explicit enforcement of Newton's laws without reliance on human or VLM feedback. NewtonRewards demonstrates substantial gains in both physical plausibility and visual fidelity across canonical Newtonian motion primitives, and generalizes effectively under out-of-distribution physical parameters (Le et al., 29 Nov 2025).
1. Measurable Proxies for Physical Quantities
NewtonRewards utilizes two frozen utility models to estimate proxies for physical variables directly from video frames:
- Optical-Flow Model ($\Phi_{\text{flow}}$): Computes per-frame pixel displacement fields $d_t = \Phi_{\text{flow}}(x_t, x_{t+1})$. Under a fixed frame rate ($\Delta t = 1/\text{fps}$), the velocity proxy is defined as $v_t = d_t / \Delta t$, where $t = 1, \dots, T-1$.
- Video Encoder ($\Phi_{\text{enc}}$): Produces high-level appearance embeddings $z = \Phi_{\text{enc}}(x)$, interpreted as a proxy for the object's effective mass. Objects with similar visual appearances are assumed to possess similar inertial properties.
These proxies are fully differentiable, enabling gradient-based optimization and direct enforcement of physical constraints in pixel space.
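To make the proxy extraction concrete, the following is a minimal PyTorch-style sketch of the velocity-proxy computation; the `flow_model` callable, its pairwise-frame interface, and the frame rate are illustrative assumptions, not the framework's actual API.

```python
import torch

def velocity_proxy(frames: torch.Tensor, flow_model, fps: float = 24.0) -> torch.Tensor:
    """Estimate per-frame velocity proxies v_t = d_t / dt from optical flow.

    frames:     (T, C, H, W) generated video frames
    flow_model: frozen optical-flow network mapping a frame pair to a
                (1, 2, H, W) displacement field; assumed differentiable
                with respect to its inputs
    """
    dt = 1.0 / fps                                    # fixed frame interval
    disps = [flow_model(a[None], b[None])             # d_t for each (x_t, x_{t+1})
             for a, b in zip(frames[:-1], frames[1:])]
    d = torch.cat(disps, dim=0)                       # (T-1, 2, H, W)
    return d / dt                                     # v_t, still differentiable
```

Because only tensor operations are applied on top of the frozen flow network, gradients from any downstream reward can propagate back into the generated frames.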
2. Verifiable Physics Rewards
NewtonRewards defines two complementary loss functions, computed exclusively from the above-mentioned proxies:
- Newtonian Kinematic Constraint (Constant Acceleration): In accordance with Newton's second law, the image-plane acceleration ($a_t$) should remain invariant under constant force. The discrete second difference of the displacement proxies yields the kinematic residual $r_t = d_{t+2} - 2\,d_{t+1} + d_t$, which vanishes under constant acceleration and is penalized as
  $$\mathcal{L}_{\text{kin}} = \sum_{t} \lVert r_t \rVert_2^2 .$$
- Mass-Conservation Reward: Avoids degenerate solutions (velocity collapse) by matching generated appearance embeddings to those of physics-simulated reference clips. Denoting the simulator embeddings by $z^{\text{sim}}$,
  $$\mathcal{L}_{\text{mass}} = \lVert z - z^{\text{sim}} \rVert_2^2 .$$
  Minimization of $\mathcal{L}_{\text{mass}}$ maintains object persistence and non-zero motion, counteracting reward-gaming effects such as object freezing.
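A minimal sketch of the two rewards as losses, assuming displacement fields of shape `(T-1, 2, H, W)` and pooled appearance embeddings; the shapes and reduction choices are illustrative rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kinematic_loss(d: torch.Tensor) -> torch.Tensor:
    """Penalize deviation from constant acceleration.

    d: (T-1, 2, H, W) per-frame displacement fields d_t.
    The second difference d_{t+2} - 2*d_{t+1} + d_t vanishes for
    constant-acceleration motion, so its squared norm is the residual.
    """
    r = d[2:] - 2.0 * d[1:-1] + d[:-2]        # (T-3, 2, H, W) residual field
    return r.pow(2).mean()

def mass_loss(z_gen: torch.Tensor, z_sim: torch.Tensor) -> torch.Tensor:
    """Match generated appearance embeddings to simulator-reference embeddings,
    discouraging the degenerate "frozen object" solution.

    z_gen, z_sim: (D,) or (T, D) embeddings from the frozen video encoder.
    """
    return F.mse_loss(z_gen, z_sim)
```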
3. Post-Training Workflow
The framework applies a lightweight post-training phase to a pre-trained video diffusion generator ($G_\theta$), utilizing the NewtonBench-60K synthetic benchmark:
- Sample noise $\epsilon$ and a conditioning code $c$, and generate a video $x = G_\theta(\epsilon, c)$.
- Extract the displacement and velocity proxies $\{d_t, v_t\}$ via $\Phi_{\text{flow}}$ and the appearance embedding $z$ via $\Phi_{\text{enc}}$.
- Compute the combined physical loss $\mathcal{L}_{\text{phys}} = \lambda_{\text{kin}} \mathcal{L}_{\text{kin}} + \lambda_{\text{mass}} \mathcal{L}_{\text{mass}}$, where $\lambda_{\text{kin}}$ and $\lambda_{\text{mass}}$ weight the two terms.
- Backpropagate gradients through the generator; utility models remain frozen.
- Update generator parameters with AdamW for several thousand iterations.
The reward signal is entirely derived from pixel-level measurements, ensuring complete verifiability and reproducibility.
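The loop below sketches this workflow under stated assumptions: `generator.sample(cond)` stands in for the model's noise-to-video sampling, the dataloader is assumed to yield (conditioning, simulator reference clip) pairs, and all hyperparameters are placeholders; it reuses `kinematic_loss` and `mass_loss` from the sketch above.

```python
import torch
from itertools import cycle

def post_train(generator, flow_model, encoder, loader,
               steps=5000, lam_kin=1.0, lam_mass=1.0, lr=1e-5):
    """Illustrative post-training loop: only the generator is updated,
    the proxy models stay frozen."""
    opt = torch.optim.AdamW(generator.parameters(), lr=lr)
    for m in (flow_model, encoder):
        m.eval()
        for p in m.parameters():
            p.requires_grad_(False)

    data = cycle(loader)
    for step in range(steps):
        cond, sim_clip = next(data)
        video = generator.sample(cond)              # (T, C, H, W); sample() is a
                                                    # placeholder for noise -> video
        # Frozen proxies: displacement fields and appearance embeddings.
        d = torch.cat([flow_model(a[None], b[None])
                       for a, b in zip(video[:-1], video[1:])], dim=0)
        z_gen, z_sim = encoder(video), encoder(sim_clip)

        loss = lam_kin * kinematic_loss(d) + lam_mass * mass_loss(z_gen, z_sim)
        opt.zero_grad()
        loss.backward()                             # gradients reach only the generator
        opt.step()
```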
4. NewtonBench-60K Benchmark
NewtonBench-60K serves as the evaluation backbone, comprising 60,000 clips generated using Kubric, PyBullet, and Blender. It covers five canonical Newtonian motion primitives:
- Free fall
- Horizontal throw
- Parabolic throw
- Sliding down an inclined ramp (with friction)
- Sliding up an inclined ramp
Training utilizes 50,000 clips (10,000 per primitive). The 10,000-clip test set is partitioned into in-distribution (ID) and out-of-distribution (OOD) bands that stress generalization across physical parameters such as drop height, throw speed, ramp angle, and friction coefficient, with the OOD band drawn from ranges outside those seen in training. Metrics encompass both visual fidelity (trajectory L2 error, Chamfer distance, IoU) and physical plausibility (velocity RMSE, acceleration RMSE).
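As a rough illustration of the kinematic metrics, the snippet below computes trajectory L2 error and velocity/acceleration RMSE from 2D object trajectories via finite differences; the trajectory format and the exact NewtonBench-60K evaluation protocol are assumptions made for this sketch.

```python
import numpy as np

def kinematic_metrics(traj_pred: np.ndarray, traj_gt: np.ndarray, fps: float = 24.0):
    """Illustrative metrics for (T, 2) image-plane trajectories."""
    dt = 1.0 / fps
    v_pred, v_gt = np.diff(traj_pred, axis=0) / dt, np.diff(traj_gt, axis=0) / dt
    a_pred, a_gt = np.diff(v_pred, axis=0) / dt, np.diff(v_gt, axis=0) / dt

    def rmse(x, y):
        # Root-mean-square of per-frame Euclidean errors.
        return float(np.sqrt(np.mean(np.sum((x - y) ** 2, axis=-1))))

    return {"traj_l2": rmse(traj_pred, traj_gt),
            "rmse_v":  rmse(v_pred, v_gt),
            "rmse_a":  rmse(a_pred, a_gt)}
```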
5. Empirical Performance and Analysis
NewtonRewards outperforms the baseline and prior post-training strategies (e.g., OpenSora SFT, PISA-OF) across all motion primitives and metrics on both ID and OOD splits. Representative results from Table 1 (percentages denote relative improvement over SFT):
| Method | L2 ↓ | CD ↓ | IoU ↑ | RMSE₍v₎ ↓ | RMSE₍a₎ ↓ | Avg Δ |
|---|---|---|---|---|---|---|
| SFT | 0.1098 | 0.3159 | 0.1103 | 0.2792 | 3.3244 | — |
| PISA-OF | 0.1042 (+5.1%) | 0.2963 (+6.2%) | 0.1179 (+6.9%) | 0.2799 (–0.3%) | 2.7217 (+18.1%) | +7.6% |
| NewtonRewards | 0.0962 (+12.4%) | 0.2930 (+7.2%) | 0.1266 (+14.8%) | 0.2628 (+5.9%) | 3.0432 (+8.5%) | +9.8% |
NewtonRewards retains its average improvement on the OOD split. Ablation without the mass constraint produces “reward hacking”: the kinematic residual falls, but the velocity magnitude collapses, reducing motion realism.
Qualitatively, NewtonRewards yields straight, smooth trajectories in free fall, maintains stable contact and deceleration on ramps, and uniquely recovers correct ballistic arcs in throws. Residual-field diagnostics show a near-zero mean for the kinematic residual $r_t$, whereas baselines exhibit structured violations indicative of unphysical motion.
6. Strengths, Limitations, and Future Research
Strengths:
- Direct, explicit enforcement of Newton's laws via verifiable, deterministic rewards.
- Generalizable approach: any physical law and proxy model can yield a differentiable reward.
- Robustness across diverse single-object motion primitives; stable performance under OOD scenarios.
- Lightweight post-training procedure—no simulator–generator coupling at inference.
Limitations:
- Dependency on frozen metric proxies (optical flow, appearance embeddings), which may introduce noise.
- Limited to simple, single-object physical scenarios; does not address collisions, multi-body, or deformable object dynamics.
- Appearance-based mass proxy is coarse, lacking sensitivity to true mass or density variations.
Suggested Directions:
- Extension to multi-object scenarios, enforcing collision and momentum conservation.
- Integration of additional physical constraints (energy conservation, rotational dynamics, drag).
- Refinement of proxy models (e.g., inclusion of depth, segmentation, learned 3D pose).
- Application to real-world videos via fine-tuned proxy models.
- Exploration of continuous ODE- or Hamiltonian-network constraints within latent spaces.
This suggests that physics-aware video generation can be achieved without human supervision by grounding learning signals in deterministic, pixel-level measurements and rule-based rewards. A plausible implication is that broader classes of generative models may benefit from similar verifiable, physics-guided constraints, closing the gap between visual realism and actual physical plausibility in synthetic videos (Le et al., 29 Nov 2025).