YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Published 28 May 2026 in cs.CV | (2605.30346v1)

Abstract: As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.


Authors (6)

Summary

The paper demonstrates that current video generation models fail to capture causal structure, leading to inconsistent physical and temporal dynamics.
It critiques existing benchmarks for neglecting causal inference and highlights the need for metrics like arrow-of-time and counterfactual tests.
The study advocates hybrid architectures that integrate explicit causal reasoning to enhance video simulation realism and world modeling.

YoCausal: Causality in Video Generation and World Modeling

Introduction

"YoCausal: How Far is Video Generation from World Model? A Causality Perspective" (2605.30346) addresses the gap between state-of-the-art generative video models and world models, focusing on whether those models capture the causal structure of real-world dynamics. The paper examines this through a causality lens, relating current video generation architectures and benchmarks to world model desiderata. The essay below covers its main contributions, theoretical framing, methods, and implications for both vision-based generative modeling and broader AI research.

Theoretical Framework

Drawing on the causal inference literature (cf. Pearl [neuberg2003causality]), the paper argues that generative video models must capture underlying causal mechanisms rather than mere temporal consistency or statistical correlations. The authors distinguish three classes of models:

Video Generation Models: Typically trained for photorealistic synthesis or conditional generation, relying on statistical priors from data distributions (e.g., video diffusion [ho2022video], transformer-based generation [peebles2023scalable]).
World Models: Architected to encode, predict, and simulate the laws of physics and causal dynamics of agents, objects, and environments (see Ha & Schmidhuber [ha2018world]; LeCun et al. [lecun2022path]).
Hybrid Architectures: Models aiming for both visual realism and causal consistency, often via explicit or implicit constraints, supervised pretraining, or specialized regularization.

YoCausal asserts that most current video generation systems lack explicit causal reasoning capabilities and are evaluated primarily on metrics agnostic to causal correctness. The paper uses a causality perspective to redefine evaluation criteria, motivating changes in both model design and benchmarking.

Methodological Contributions

The paper systematically reviews contemporary video generation architectures, highlighting their limitations for causal world modeling:

Diffusion and Transformer Video Models: Advances in scaling, photorealism, and text conditioning (e.g., Sora [liu2024sora], Stable Video Diffusion [blattmann2023stable], HunyuanVideo [kong2024hunyuanvideo], Open-Sora [zheng2024open]) fail to address causal structure learning, and these models often produce plausible but physically inconsistent trajectories.
Benchmarks and Evaluation: Existing video generation benchmarks (VBench [huang2024vbench], WorldModelBench [li2025worldmodelbench], IntPhys [riochet2018intphys], VideoPhy [bansal2024videophy], Morpheus [zhang2025morpheus]) are critiqued for content bias [ge2024content], a lack of temporal causality tests, or insufficient coverage of counterfactual reasoning.
Violation-of-Expectation and Arrow-of-Time: Inspired by developmental cognitive science and intuitive physics [spelke1992origins, baillargeon1985object], YoCausal argues that benchmarks should include violation-of-expectation scenarios and arrow-of-time tasks (cf. Pickup et al. [pickup2014seeing], Wei et al. [wei2018learning], Xue et al. [xue2025seeing]) to directly probe causal inference in generative models.

The authors synthesize methods for constructing and evaluating generative models against those standards: (1) generating videos with manipulated causal structure, (2) designing metrics for causal consistency, and (3) integrating counterfactual and intervention-based test cases.
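The abstract describes temporally reversing real-world videos to obtain counterfactual samples and comparing denoising loss across the two playback directions. The following is a minimal sketch of that idea under stated assumptions, not the paper's definition of the Reverse Surprise Index: the normalized gap formula, the function names, the scalar-frame video representation, and `toy_loss` (a stand-in for a real model's denoising loss) are all hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical representation: a video is a sequence of frames, and each
# frame is reduced to one float (e.g. the position of a moving object).
Video = Sequence[float]


def reverse_video(video: Video) -> List[float]:
    """Return the frames in reversed temporal order."""
    return list(reversed(video))


def reverse_surprise_score(
    videos: Sequence[Video],
    loss_fn: Callable[[Video], float],
    eps: float = 1e-8,
) -> float:
    """Mean normalized loss gap between reversed and forward playback.

    Assumed formula (not taken from the paper):
        (rev - fwd) / (rev + fwd + eps)
    With non-negative losses the gap lies in [-1, 1]; positive values mean
    the model finds the reversed video more surprising.
    """
    gaps = []
    for video in videos:
        fwd = loss_fn(video)
        rev = loss_fn(reverse_video(video))
        gaps.append((rev - fwd) / (rev + fwd + eps))
    return mean(gaps)


def toy_loss(video: Video) -> float:
    """Stand-in for a denoising loss: a toy 'model' that expects every
    frame to increase by exactly 1.0 over the previous frame."""
    steps = [b - a for a, b in zip(video, video[1:])]
    return mean([(s - 1.0) ** 2 for s in steps])


rising = [[0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]  # match the toy model
static = [[2.0, 2.0, 2.0]]                         # time-symmetric clip

print(reverse_surprise_score(rising, toy_loss))  # strongly positive
print(reverse_surprise_score(static, toy_loss))  # no directional preference
```

In a real setting `loss_fn` would wrap a video diffusion model's noise-prediction loss on the clip. The normalization here is one convenient choice for keeping scores comparable across models with different loss scales.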

Empirical Findings

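The abstract reports an evaluation of 13 state-of-the-art VDMs, finding that perceiving the arrow of time does not imply understanding causality and that a significant gap remains relative to human-level causal cognition. The paper also references empirical studies from related work that support its major claims: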

State-of-the-art video diffusion models (e.g., Sora [liu2024sora], HunyuanVideo [kong2024hunyuanvideo]) excel in photorealism and compositionality but regularly produce physically impossible or causally incoherent samples, as demonstrated in action-centric benchmarks [bansal2025videophy].
Physical commonsense benchmarks (Morpheus [zhang2025morpheus] and VideoPhy [bansal2024videophy]) show that models yield low scores on physics and causality metrics even when they score well on FVD or realism metrics.
Arrow-of-time recognition tasks and temporal causality probes reveal that most diffusion and transformer models’ outputs lack chronologically consistent dynamics, further substantiating the inadequacy of standard evaluation metrics for causal reasoning.
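The abstract notes that the benchmark stratifies videos into causal and non-causal subsets so that causal reasoning can be separated from a general temporal bias. A minimal sketch of such a stratified comparison follows. It is an illustration only, not the paper's Causality Cognition Index: the record layout, the `is_causal` label (assumed to come from an external VLM annotation step), and the difference-of-means summary are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical record: (per-video reverse-vs-forward loss gap, is_causal label).
Record = Tuple[float, bool]


def stratified_gap(records: List[Record]) -> Dict[str, float]:
    """Average the per-video gap separately on causal and non-causal clips.

    A model that only tracks temporal direction would be expected to show
    similar gaps on both subsets; a larger gap on the causal subset is the
    kind of signal a stratified index is meant to expose.
    """
    causal = [gap for gap, is_causal in records if is_causal]
    non_causal = [gap for gap, is_causal in records if not is_causal]
    if not causal or not non_causal:
        raise ValueError("both subsets must contain at least one video")
    causal_mean = mean(causal)
    non_causal_mean = mean(non_causal)
    return {
        "causal": causal_mean,
        "non_causal": non_causal_mean,
        "difference": causal_mean - non_causal_mean,
    }


# Made-up numbers purely to exercise the function.
example = [(0.5, True), (0.75, True), (0.25, False), (0.25, False)]
summary = stratified_gap(example)
print(summary)
```

Reporting both subset means alongside their difference keeps the arrow-of-time signal visible, so a strong directional preference is not mistaken for causal understanding.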

Implications and Speculation

Practical Impact

The findings have immediate ramifications for video generation applications requiring causal reasoning, including robotics planning, embodied AI, augmented reality, and simulation environments. Without causal grounding, generated videos are unsuitable for downstream tasks involving prediction, control, or agent-based reasoning.

Theoretical Impact

The causality-first perspective shifts research focus from content and realism to mechanistic simulation. This calls for new architectures that integrate explicit causal structure learning, possibly via physics-informed latent spaces, structured intervention procedures, or hybrid supervised/self-supervised regimes (cf. World Models [ha2018world], Physion [bear2021physion], CRAFT [ates2022craft]).

Future Directions

Benchmarking: Development of compositional, intervention-based, and violation-of-expectation benchmarks targeting both visual and causal faithfulness.
Model Architectures: Progression towards hybrid models interfacing generative diffusion with simulation engines, counterfactual prediction modules, or explicit causal graphs.
Evaluation Metrics: Establishment of metrics quantifying causal consistency, intervention pass rates, and alignment with physical laws rather than content bias or surface-level realism.
Scaling Laws: Inquiry into scaling laws regarding causal structure learning and transfer, as current scaling research (cf. diffusion transformers [liang2024scaling]) focuses mostly on fidelity and diversity.

Conclusion

"YoCausal: How Far is Video Generation from World Model? A Causality Perspective" (2605.30346) argues that video generation models remain fundamentally limited in their capacity for world modeling due to a lack of explicit causal reasoning. The paper calls for causality-centric benchmarks, model architectures, and evaluation metrics that bridge the gap between current generative video synthesis and true world simulators. Its perspective sets an agenda for advancing both practical and theoretical frontiers in machine vision, generative modeling, and embodied AI.
