
NewtonBench-60K: Physics-Based Video Benchmark

Updated 3 December 2025
  • NewtonBench-60K is a benchmark dataset of 60K synthetic videos featuring five canonical Newtonian motion primitives, designed to support controlled, quantitative evaluation of physical realism.
  • It uses a multi-stage simulation pipeline with PyBullet for dynamics, Kubric for asset orchestration, and Blender for high-fidelity rendering with consistent parameters.
  • The dataset employs rigorous evaluation metrics such as trajectory error, Chamfer distance, and velocity/acceleration RMSE to quantitatively assess spatial, temporal, and kinematic accuracy.

NewtonBench-60K is a large-scale benchmark dataset for evaluating video generation models on physical realism, specifically Newtonian motion. Developed to facilitate explicit, physics-grounded analysis, NewtonBench-60K introduces five canonical Newtonian Motion Primitives (NMPs) with rigorous parameterization, high-fidelity rendering, and strict evaluation protocols, positioning it as a foundation for reproducible, quantitative research at the intersection of video generation and physical-law compliance (Le et al., 29 Nov 2025).

1. Dataset Composition and Structure

NewtonBench-60K comprises 60,000 synthetic video clips, partitioned into 50,000 training videos and 10,000 held-out benchmark videos. Each of the five Newtonian Motion Primitives represents a distinct class of object motion governed by classical physics:

  1. Free Fall (NMP-F)
  2. Horizontal Throw (NMP-TH)
  3. Parabolic Throw (NMP-TP)
  4. Ramp Sliding Down (NMP-RD)
  5. Ramp Sliding Up (NMP-RU)

Each primitive is realized with 12,000 total clips, stratified into 10,000 for training and 2,000 held out for evaluation. The design ensures consistent scale per primitive and strict asset separation between train and test data.

2. Data Generation Pipeline

The generation of NewtonBench-60K utilizes a multi-stage simulation-rendering stack:

  • Physics Simulation: Rigid-body dynamics are computed with PyBullet under constant gravity ($g = 9.81\,\mathrm{m/s^2}$).
  • Scene Orchestration: Kubric manages asset sampling (objects from Google Scanned Objects, backgrounds from HDRI maps), spatial positioning, and camera setup.
  • Rendering: Scenes are rendered with Blender, using HDRI lighting and a fixed orthographic side view. The image’s vertical axis aligns with the gravity vector.

Parameter domains are uniformly sampled within defined ranges for each primitive. For example, free-fall height $h$ spans $[0.5, 1.5]$ m, while parabolic throws sample initial speeds $v_0 \in [2, 6]$ m/s and launch angles $\theta \in [15^\circ, 75^\circ]$. Ramp primitives sample angles $\alpha \in [15^\circ, 45^\circ]$ with a fixed friction coefficient $\mu = 0.06$.
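The uniform-range sampling described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the released pipeline; the dictionary keys and function names are hypothetical, and only the ranges the text specifies are included.

```python
import numpy as np

# Parameter ranges quoted in the text; other primitives (e.g. NMP-TH, NMP-RU)
# follow the same pattern but their exact ranges are not specified here.
PARAM_RANGES = {
    "NMP-F":  {"h": (0.5, 1.5)},                          # free-fall height (m)
    "NMP-TP": {"v0": (2.0, 6.0), "theta": (15.0, 75.0)},  # speed (m/s), launch angle (deg)
    "NMP-RD": {"alpha": (15.0, 45.0)},                    # ramp angle (deg)
}
FRICTION_MU = 0.06  # fixed friction coefficient for ramp primitives

def sample_params(primitive, rng):
    """Draw one parameter set uniformly within the documented ranges."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in PARAM_RANGES[primitive].items()}

rng = np.random.default_rng(0)
params = sample_params("NMP-TP", rng)
assert 2.0 <= params["v0"] <= 6.0 and 15.0 <= params["theta"] <= 75.0
```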

3. Video Specifications

All video clips are rendered at $512 \times 512$ pixel resolution, with uniform length $T = 32$ frames and a fixed frame rate of 16 fps. The camera is orthographic and side-mounted to eliminate perspective distortion and maintain a consistent, physics-aligned visual frame of reference. The vertical axis matches the direction of gravity, facilitating precise kinematic analysis.

4. Data Splitting and Out-of-Distribution Protocol

The dataset adheres to a rigorous partitioning scheme:

  • Training Set: 50,000 videos (10,000 per primitive).
  • Benchmark/Test Set: 10,000 videos, with 2,000 per primitive further subdivided as follows:
    • In-Distribution (ID): 1,000 videos, parameter ranges matching training domain.
    • Out-of-Distribution (OOD): 1,000 videos, sampling held-out parameter ranges (e.g., $v_0$ outside the training regime for throws, steeper ramp angles, or $\mu$ perturbed by $\pm 25\%$).

Model selection is based on the ID split within the benchmark set. There is no separate validation split. Asset pools (objects, backgrounds) are strictly disjoint between training and held-out subsets, ensuring no cross-contamination.
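The friction perturbation is the one OOD shift the text fully specifies, so it can be sketched concretely. The helper below is illustrative only; the other held-out ranges (speeds, angles) are not published here, and the function name is an assumption.

```python
import numpy as np

MU_TRAIN = 0.06  # fixed training-domain friction coefficient

def sample_mu_ood(rng):
    """Perturb the training friction coefficient by +/-25%, per the OOD protocol."""
    sign = rng.choice([-1.0, 1.0])
    return MU_TRAIN * (1.0 + sign * 0.25)

rng = np.random.default_rng(0)
mu = sample_mu_ood(rng)
# Resulting values are 0.045 or 0.075, both outside the (fixed) training value.
assert abs(mu - 0.045) < 1e-12 or abs(mu - 0.075) < 1e-12
```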

5. Statistical Analysis of Key Parameters

Since all generative parameters are drawn from uniform distributions, their statistical properties are analytically tractable. For each primitive, means and variances are precisely specified:

| Parameter | Distribution | Mean | Variance |
|---|---|---|---|
| Height (Free Fall) | $h \sim \mathcal{U}[0.5, 1.5]$ m | $1.0$ | $0.0833$ |
| Speed ($v_0$) | $v_0 \sim \mathcal{U}[2, 6]$ m/s | $4.0$ | $1.3333$ |
| Angle ($\theta$) | $\theta \sim \mathcal{U}[15^\circ, 75^\circ]$ | $45^\circ$ | $300\ (\text{deg}^2)$ |
| Ramp Angle ($\alpha$) | $\alpha \sim \mathcal{U}[15^\circ, 45^\circ]$ | $30^\circ$ | $75\ (\text{deg}^2)$ |
| Friction ($\mu$) | $\mu = 0.06$ (fixed) | $0.06$ | $0$ |

This strict control enables interpretable distributional shifts and statistical comparisons under both ID and OOD settings.
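The tabulated values follow directly from the standard uniform-distribution identities, mean $(a+b)/2$ and variance $(b-a)^2/12$, which a few lines of Python confirm:

```python
def uniform_stats(a, b):
    """Mean and variance of the continuous uniform distribution U[a, b]."""
    return (a + b) / 2.0, (b - a) ** 2 / 12.0

assert uniform_stats(0.5, 1.5) == (1.0, 1.0 / 12.0)   # height: var = 0.0833...
mean_v, var_v = uniform_stats(2.0, 6.0)                # speed
assert mean_v == 4.0 and abs(var_v - 1.3333) < 1e-3
assert uniform_stats(15.0, 75.0) == (45.0, 300.0)      # launch angle (deg^2)
assert uniform_stats(15.0, 45.0) == (30.0, 75.0)       # ramp angle (deg^2)
```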

6. Evaluation Metrics

NewtonBench-60K defines metrics that decompose model performance along spatial accuracy, temporal consistency, and physical realism. Letting $\Delta t = 1/16$ s be the inter-frame time step and $\mathbf{c}_t \in \mathbb{R}^2$ the per-frame object centroid, the benchmarks are:

  • Trajectory Position Error (L2):

$$\mathrm{L2}_{\mathrm{traj}} = \frac{1}{T}\sum_{t=1}^{T} \|\mathbf{c}_t^{\mathrm{gen}} - \mathbf{c}_t^{\mathrm{gt}}\|_2$$

  • Chamfer Distance (CD) of binary masks per frame:

$$\mathrm{CD} = \frac{1}{T} \sum_{t=1}^{T} \left( \sum_{p\in P_t}\min_{q\in Q_t}\|p-q\|_2^2 + \sum_{q\in Q_t}\min_{p\in P_t}\|q-p\|_2^2 \right)$$

  • Intersection over Union (IoU):

$$\mathrm{IoU} = \frac{1}{T} \sum_{t=1}^{T} \frac{|M_t^{\mathrm{gen}}\cap M_t^{\mathrm{gt}}|}{|M_t^{\mathrm{gen}}\cup M_t^{\mathrm{gt}}|}$$

  • Velocity RMSE:

$$\mathbf{v}_t = \frac{\mathbf{c}_{t+1} - \mathbf{c}_t}{\Delta t}, \quad \mathrm{RMSE}_v = \sqrt{\frac{1}{T-1}\sum_{t=1}^{T-1} \|\mathbf{v}_t^{\mathrm{gen}} - \mathbf{v}_t^{\mathrm{gt}}\|_2^2}$$

  • Acceleration RMSE:

$$\mathbf{a}_t = \frac{\mathbf{v}_{t+1}-\mathbf{v}_t}{\Delta t} = \frac{\mathbf{c}_{t+2}-2\mathbf{c}_{t+1}+\mathbf{c}_t}{\Delta t^2}$$

$$\mathrm{RMSE}_a = \sqrt{\frac{1}{T-2}\sum_{t=1}^{T-2}\|\mathbf{a}_t^{\mathrm{gen}} - \mathbf{a}_t^{\mathrm{gt}}\|_2^2}$$

  • Constant-Acceleration Residual (Motion Smoothness):

$$\mathcal{R}_{\mathrm{kin}} = \frac{1}{T-2}\sum_{t=1}^{T-2}\|\boldsymbol{\phi}_{t+1} - 2\boldsymbol{\phi}_t + \boldsymbol{\phi}_{t-1}\|_2^2$$

where $\boldsymbol{\phi}_t$ denotes the RAFT-estimated optical flow at frame $t$. This residual quantifies deviation from Newtonian constant-acceleration dynamics.

These metrics provide both object-centric and mask-based evaluation, covering spatial precision, mask overlap, kinematic accuracy, and motion smoothness.
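The metric definitions above translate directly into NumPy. The sketch below is an illustrative re-implementation from the formulas, not the official evaluation code; array shapes and some conventions (e.g., how the flow-field norm aggregates over pixels) are assumptions.

```python
import numpy as np

DT = 1.0 / 16.0  # inter-frame time step at 16 fps

def trajectory_l2(c_gen, c_gt):
    """Mean per-frame Euclidean distance between (T, 2) centroid tracks."""
    return float(np.mean(np.linalg.norm(c_gen - c_gt, axis=1)))

def chamfer_distance(masks_gen, masks_gt):
    """Per-frame symmetric sum of squared nearest-neighbour distances between
    mask pixel sets, averaged over frames. Masks: boolean (T, H, W); assumes
    both masks are non-empty in every frame."""
    cds = []
    for mg, mt in zip(masks_gen, masks_gt):
        P = np.argwhere(mg).astype(float)
        Q = np.argwhere(mt).astype(float)
        d2 = np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1)  # (|P|, |Q|)
        cds.append(d2.min(axis=1).sum() + d2.min(axis=0).sum())
    return float(np.mean(cds))

def mean_iou(masks_gen, masks_gt):
    """Per-frame mask IoU averaged over frames; masks are boolean (T, H, W)."""
    inter = np.logical_and(masks_gen, masks_gt).sum(axis=(1, 2))
    union = np.logical_or(masks_gen, masks_gt).sum(axis=(1, 2))
    return float(np.mean(inter / union))

def velocity_rmse(c_gen, c_gt, dt=DT):
    """RMSE between finite-difference velocities of two centroid tracks."""
    dv = (np.diff(c_gen, axis=0) - np.diff(c_gt, axis=0)) / dt
    return float(np.sqrt(np.mean(np.sum(dv ** 2, axis=1))))

def acceleration_rmse(c_gen, c_gt, dt=DT):
    """RMSE between second-difference accelerations of two centroid tracks."""
    da = (np.diff(c_gen, n=2, axis=0) - np.diff(c_gt, n=2, axis=0)) / dt ** 2
    return float(np.sqrt(np.mean(np.sum(da ** 2, axis=1))))

def kinematic_residual(flows):
    """Mean squared norm of the temporal second difference of flow fields
    (T, H, W, 2); aggregating the norm over all pixels is an assumption."""
    sd = flows[2:] - 2.0 * flows[1:-1] + flows[:-2]
    return float(np.mean(np.sum(sd ** 2, axis=(1, 2, 3))))
```

As a sanity check, a track compared against itself gives zero on every distance metric, and a flow sequence that changes linearly over time (constant acceleration) yields a zero kinematic residual.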

7. Usage Protocol and Best Practices

Benchmark usage is governed by standardized evaluation and ablation protocols:

  • Benchmark Splits: Report results separately for the 5,000 ID and 5,000 OOD videos, with breakdowns per NMP and overall averages. Always present ID vs. OOD performance gaps.
  • Conditioning: Inputs may be text plus the first 4 video frames, or text only. For fair comparison across models, all generated samples must be seeded with the same initial noise.
  • Mask Extraction: Ground-truth video masks are derived from the renderer; generated video masks are extracted using SAM2 with identical prompts. This is mandatory for all mask-based metric computation.
  • Optical Flow: Optical flow fields $\boldsymbol{\phi}_t$ are computed with RAFT for measuring velocity and kinematic residuals.
  • Reporting Standards: Metrics are averaged over at least three sampling seeds (due to diffusion stochasticity); error bars denote standard deviation.
  • Ablations: When evaluating physical reward schemes, ablation studies must disentangle kinematic and mass conservation term effects to detect and prevent reward-hacking collapse.
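The seed-averaged reporting standard amounts to a simple aggregation step. The helper below is a minimal sketch; whether the standard deviation uses the population or sample convention is not specified in the text, so the choice here is an assumption.

```python
import numpy as np

def aggregate_over_seeds(per_seed_scores):
    """Mean and (population) standard deviation of a metric across sampling
    seeds, per the reporting standard of at least three seeds."""
    scores = np.asarray(per_seed_scores, dtype=float)
    assert scores.shape[0] >= 3, "report at least three sampling seeds"
    return float(scores.mean()), float(scores.std())

mean, std = aggregate_over_seeds([0.12, 0.15, 0.09])
assert abs(mean - 0.12) < 1e-12
```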

Adherence to these guidelines ensures the comparability and reproducibility of model performance claims on NewtonBench-60K. The benchmark thus constitutes a controlled experimental environment for advancing physics-aware video generation (Le et al., 29 Nov 2025).
