Physical realism of video-based world models versus traditional simulators

Determine whether action-conditioned video-based world models learned from real-robot video-action datasets provide more realistic physical dynamics than traditional physics-based simulators for robot manipulation, given the presence of hallucinations in video models.

Background

Real-robot reinforcement learning is constrained by the high cost and risk of physical interaction, motivating alternatives such as supervised learning from demonstrations and training in software-based physics simulators. Recent work proposes world models—action-conditioned video generation models trained on real-robot data—that can approximate execution outcomes and may close the visual sim-to-real gap.

However, video-based world models can hallucinate, which casts doubt on their physical fidelity. The World-Gymnast paper (Sharma et al., 2026) explicitly notes that it is unclear whether such models offer more realistic physics than traditional simulators. It evaluates end-to-end performance instead, leaving the question of physical realism open.

References

However, it is unclear whether video world model offers more realistic physics than traditional simulators due to hallucinations (Yang et al., 2024).

World-Gymnast: Training Robots with Reinforcement Learning in a World Model (2602.02454 - Sharma et al., 2 Feb 2026) in Section 1 (Introduction)