
WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

Published 29 Jan 2026 in cs.CV | (2601.21282v1)

Abstract: Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.

Summary

  • The paper introduces a two-tiered benchmark that disentangles intuitive physics from precise parameter estimation to diagnose video generative models.
  • The paper reveals that current WFMs show high variance and reliance on training priors, failing to accurately model key physical parameters.
  • The paper highlights the necessity for architectural innovations to achieve robust physical grounding in next-generation generative models.

Concept-Specific Diagnostic Benchmarking of Physical Scene Understanding in World Foundation Models

Introduction

The evolution of world foundation models (WFMs) for video generation raises persistent questions about their fidelity in simulating authentic physical processes. "WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models" (2601.21282) presents a rigorous, disentangled diagnostic benchmark expressly tailored to quantify and localize the physical reasoning capabilities and deficiencies of video-generative WFMs. This work identifies and addresses a significant deficiency in prevailing benchmarks: the entanglement of multiple physics concepts in single tests and the excessive reliance on coarse or binary metrics, which impede nuanced model diagnosis. WorldBench thus establishes a two-tiered framework—intuitive physics and parameter estimation—facilitating in-depth, concept-specific model assessment and highlighting the current generation’s deficiencies in both generalization and parameter adherence.

Benchmark Structure and Methodology

WorldBench is specifically architected to evaluate WFMs’ capacity for concept-level physical scene understanding via video prediction tasks. The benchmark comprises two principal subsets: one targeting intuitive physics concepts and the other evaluating precise estimation of physical parameters. It draws on both synthetic and real videos, with all simulated videos generated via Kubric (PyBullet for physics, Blender for rendering).

Figure 1: Overview of the generation and evaluation process: video continuation generation (top), evaluation pipeline using object segmentations (bottom).
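
Since the generation stack is standard Kubric, a minimal sketch of how such a scene might be assembled is shown below, adapted from Kubric's public hello-world example; the specific objects, parameters, and output handling are assumptions for illustration, not the paper's actual generation scripts.

```python
# A minimal Kubric scene: PyBullet simulates the dynamics, Blender renders.
# Scene contents and parameters are illustrative, based on Kubric's public examples.
import kubric as kb
from kubric.simulator import PyBullet
from kubric.renderer.blender import Blender as KubricRenderer

scene = kb.Scene(resolution=(256, 256), frame_start=1, frame_end=24)

# A static floor and one falling ball, so a single interaction is probed.
scene += kb.Cube(name="floor", scale=(10, 10, 0.1), position=(0, 0, -0.1), static=True)
scene += kb.Sphere(name="ball", scale=0.5, position=(0, 0, 2.0))
scene += kb.DirectionalLight(name="sun", position=(-1, -0.5, 3), look_at=(0, 0, 0), intensity=1.5)
scene.camera = kb.PerspectiveCamera(name="camera", position=(3, -1, 4), look_at=(0, 0, 0.5))

# Run the physics, then render all frames (rgba, segmentation, depth, ...).
simulator = PyBullet(scene)
simulator.run()
frames = KubricRenderer(scene).render()

kb.write_png(frames["rgba"][0], "frame_0001.png")
```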

The evaluation protocol involves feeding WFMs several initial frames and requiring them to predict video continuations under a constrained setup designed to test single physics concepts. Post-prediction, the video sequences are passed to the SAM2 segmenter for object tracking; predicted segmentations are compared framewise to ground truth via metrics such as foreground mIoU and background RMSE.
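
For concreteness, here is a small numpy sketch of the two mask-based metrics just described, assuming SAM2 foreground masks have already been extracted for each predicted and ground-truth frame; the function names and the exact definition of background RMSE (pixel RMSE outside the ground-truth mask) are illustrative assumptions.

```python
import numpy as np

def foreground_miou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between predicted and ground-truth foreground masks (bool arrays, HxW)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 1.0

def background_rmse(pred_frame: np.ndarray, gt_frame: np.ndarray,
                    gt_mask: np.ndarray) -> float:
    """RMSE over pixels outside the ground-truth foreground mask (frames in [0, 1])."""
    bg = ~gt_mask
    diff = pred_frame[bg].astype(np.float64) - gt_frame[bg].astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def evaluate_rollout(pred_frames, gt_frames, pred_masks, gt_masks):
    """Frame-wise scores over a rollout; masks come from SAM2 tracking."""
    mious = [foreground_miou(p, g) for p, g in zip(pred_masks, gt_masks)]
    rmses = [background_rmse(pf, gf, gm)
             for pf, gf, gm in zip(pred_frames, gt_frames, gt_masks)]
    return float(np.mean(mious)), float(np.mean(rmses))
```

Averaging these per-frame scores by frame index over many rollouts yields the drift curves of the kind reported in Figure 7.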

Intuitive Physics Module

This module targets four fundamental concepts: motion physics, support relations, object permanence, and scale/perspective. Each scenario is systematically constructed, with multiple videos per scenario undergoing controlled randomization in object type, initial conditions, and material. This design isolates the core property under assessment and avoids confounding factors.

Figure 2: Qualitative examples for the Motion Physics scenario of the intuitive physics subset—demonstrating object collision and subsequent dynamics.
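
To illustrate the controlled-randomization idea, here is a small sketch of how per-scenario variants might be sampled so that only nuisance factors vary while the probed concept stays fixed; the factor lists and config fields are illustrative assumptions, not the paper's actual generation code.

```python
import random
from dataclasses import dataclass

# Illustrative nuisance factors; the probed concept (e.g., object permanence)
# is fixed by the scenario script itself, not randomized.
OBJECT_TYPES = ["cube", "sphere", "cylinder"]
MATERIALS = ["rubber", "metal", "wood"]

@dataclass
class ScenarioVariant:
    scenario: str          # which physical concept this scenario probes
    object_type: str       # randomized nuisance factor
    material: str          # randomized nuisance factor
    initial_speed: float   # randomized initial condition (m/s)
    seed: int

def sample_variants(scenario: str, n: int, base_seed: int = 0):
    """Generate n controlled variants of one scenario: the concept under test
    is held constant while object type, material, and initial speed vary."""
    rng = random.Random(base_seed)
    return [
        ScenarioVariant(
            scenario=scenario,
            object_type=rng.choice(OBJECT_TYPES),
            material=rng.choice(MATERIALS),
            initial_speed=rng.uniform(0.5, 2.0),
            seed=rng.randrange(2**31),
        )
        for _ in range(n)
    ]
```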

Figure 3: Qualitative examples for the Support Relations scenario—testing balance and stability after a roll-down and collision.

Figure 4: Qualitative examples for Object Permanence—evaluating WFMs' ability to track occluded objects correctly.

Physical Parameter Estimation Module

Designed for explicit, engineering-grade testing, this subset isolates the capacity to simulate and infer specific material constants or physical laws (e.g., gravitational acceleration, friction coefficients, viscosity). For each controlled scenario, physical parameters are systematically varied, and object trajectories are designed to minimize ambiguity. The calibration protocol employs checkerboard-based pose estimation and careful depth standardization, enabling accurate 3D object localization and parameter extraction via curve-fitting.

Figure 5: Pipeline for physical parameter estimation: checkerboard and SAM2-based 3D object localization, parameter curve fitting.
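
As a sketch of the calibration and curve-fitting steps, the snippet below recovers camera pose from a checkerboard with standard OpenCV calls and then fits a quadratic to a tracked object's height to estimate gravitational acceleration; the board layout, intrinsics handling, and tracking inputs are assumptions for illustration, not the paper's exact protocol.

```python
import cv2
import numpy as np

def camera_pose_from_checkerboard(gray, K, dist, board_size=(9, 6), square_m=0.025):
    """Recover camera pose relative to a checkerboard of known square size."""
    found, corners = cv2.findChessboardCorners(gray, board_size)
    if not found:
        raise ValueError("checkerboard not detected")
    # 3D board corners in the board frame (z = 0 plane), spaced square_m apart.
    obj = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    obj[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square_m
    ok, rvec, tvec = cv2.solvePnP(obj, corners, K, dist)
    return rvec, tvec  # rotation (Rodrigues vector) and translation of the board

def estimate_gravity(heights_m, fps):
    """Fit z(t) = 0.5*a*t^2 + v0*t + z0 to tracked heights (z up); return -a."""
    t = np.arange(len(heights_m)) / fps
    coef = np.polyfit(t, np.asarray(heights_m, dtype=np.float64), deg=2)
    return -2.0 * coef[0]  # leading coefficient is a/2; compare with 9.8 m/s^2
```

Under the paper's protocol, per-rollout estimates of this kind are compared against the ground-truth simulation parameters.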

Figure 6: Samples from the friction coefficient scenario: a steel block's trajectory on material-varied ramps.

Experimental Results and Empirical Findings

State-of-the-art WFMs (notably, various Cosmos variants and several prominent I2V models) were comprehensively evaluated, yielding several consistent empirical findings:

  • High Variance and Non-Adherence to Parameters: Models displayed substantial rollout-to-rollout variance in parameter estimation, particularly for gravity, a key finding demonstrated quantitatively in the paper's extensive evaluation tables. Most models followed qualitatively plausible object trajectories but with incorrect physical magnitude (e.g., downward accelerations deviating from 9.8 m/s²).
  • Reliance on Training Priors: WFMs generated prototypical trajectories and interactions for familiar objects and scenarios but failed to generalize to unseen materials or less frequent phenomena (e.g., plastic for friction, honey for viscosity). Scene realism did not equate to parameter realism.
  • Limited Improvement with Scale and Modality: Models failed in similar ways across both synthetic and real video domains, indicating the issue stems from a lack of physical abstraction rather than from domain shift alone.
  • Scenario Duration and Familiarity: Models performed better on scenes with longer, slower interactions or those matching frequent training data distributions (e.g., rolling a ball down a ramp), contrasting sharply with rapid, rare, or occlusion-rich events.
  • Baselines for VLMs: On the language-based subset (True/False or multiple-choice questions about continuation and physics), all tested SOTA VLMs achieved only marginally above-chance accuracy, reinforcing the limitation of current multimodal architectures in physical understanding.

Figure 7: mIoU and background RMSE decline over rollout time, reflecting compounding prediction errors and physical drift.
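
For concreteness, here is a small sketch of how such rollout statistics might be aggregated, assuming per-frame mIoU values and per-rollout parameter estimates have already been computed; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def drift_curve(per_frame_miou: np.ndarray) -> np.ndarray:
    """per_frame_miou: (num_rollouts, num_frames). Mean mIoU at each frame
    index; a downward slope indicates compounding error over the rollout."""
    return per_frame_miou.mean(axis=0)

def rollout_variance(param_estimates: np.ndarray):
    """param_estimates: one physical-parameter estimate (e.g., g) per rollout.
    A large std relative to the mean signals non-adherence to the constant."""
    return float(param_estimates.mean()), float(param_estimates.std(ddof=1))
```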

Implications for Physical AI and Diagnostic Evaluation

WorldBench’s contributions extend beyond benchmarking: the explicit disentanglement of physical concepts exposes the lack of genuine physical parameterization in current WFMs and provides actionable diagnostic granularity. The video-first, concept-targeted design aligns it closely with contemporary model architectures, whereas previous benchmarks require binary selection or confound multiple concepts per sample.

From an application perspective, the benchmark reveals a critical roadblock for physically grounded generative models as simulators: to be reliable as synthetic data sources or for autonomous planning, WFMs must internalize true physical constants rather than merely replicate plausible video patterns. The demonstrated limits further imply that simply scaling video datasets or models will not automatically confer physical grounding; architectural advances or explicit physical inductive biases may be necessary.

Figure 8: Model-specific qualitative failures: autoregressive model distorts shapes, diffusion model hallucinates unrealistic features.

Figure 9: SAM2 sustaining object identities through occlusion; supporting robust mask-based metric computation.

Future Prospects

The WorldBench diagnostic framework provides an extensible foundation for future physical reasoning benchmarks. Its modularity permits the addition of new physical concepts (e.g., optics, collision mechanics) and could incorporate more complex multi-object or interactive scenarios. Integrating more diverse sensors/modalities or supporting fine-grained intervention-based physical probing remains important future work. In model development, progress in physically grounded neural architectures or hybrid data-driven/analytical training may be catalyzed by such rigorous, isolating diagnostics.

Conclusion

WorldBench (2601.21282) represents a substantive advancement in model benchmarking for physical scene understanding: it forgoes entangled, coarse metrics in favor of meticulously designed concept-specific, video-continuation-based evaluations, and reveals the pronounced limitations of current WFMs in both physical abstraction and parameter adherence. This diagnostic granularity is a necessary step for advancing toward physically reliable generative models and sets a baseline for both model and benchmark evolution in physical AI research.
