- The paper introduces WorldReasonBench, a benchmark that rigorously evaluates video generation models as world-state predictors by assessing both temporal dynamics and causal reasoning.
- It employs a two-stage QA pipeline and multi-dimensional scoring—including ScorePR and human-aligned reward modeling—to measure reasoning accuracy, temporal consistency, and visual aesthetics.
- Empirical results show that closed-source models outperform open-source ones by roughly a factor of two, and that Logic Reasoning and Information-Based tasks remain difficult for both groups.
WorldReasonBench: Stress Testing Video Generators as World-State Predictors
Benchmark Motivation and Contributions
WorldReasonBench is a comprehensive benchmark designed to evaluate video generation models not solely in terms of visual realism but as predictors of future world states. The central premise is that modern generative video models are increasingly tasked with simulating plausible world dynamics; thus, benchmarking must transition from assessing perceptual quality to rigorous stress testing of reasoning and temporal evolution.
The benchmark comprises 436 curated test cases, each pairing an image with an action or instruction and requiring a video that evolves the initial state into a plausible future trajectory. Evaluation has three components: (1) process-aware reasoning verification using structured QA for temporal and causal failure analysis, (2) multi-dimensional quality assessment measuring reasoning correctness, temporal consistency, and visual aesthetics, and (3) WorldRewardBench, which supplies ~6K expert-annotated preference pairs over 1.4K videos, supporting reward-model calibration and human alignment.
Reasoning Taxonomy and Benchmark Design
WorldReasonBench organizes reasoning into four top-level dimensions—World Knowledge, Human-Centric, Logic Reasoning, and Information-Based—spanning 22 granular subcategories. Test cases target diverse scenarios: physical transitions, social interactions, process timelines, quantitative math, data preservation, and creative transformations. Each case is annotated with 5–7 QA pairs stratified into four question types (factual, reasoning, detail, temporal), with difficulty levels enabling fine-grained diagnostic analysis.
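The annotation scheme described above can be pictured as a small record type. The sketch below is illustrative only: the class and field names (`BenchCase`, `QAPair`, `image_path`, `question_type`, and so on) are hypothetical and are not taken from any released schema. Only the four dimensions, the four question types, and the 5–7 QA pairs per case come from the description above.

```python
from dataclasses import dataclass, field

# Values come from the benchmark description; the identifiers are hypothetical.
DIMENSIONS = {"World Knowledge", "Human-Centric", "Logic Reasoning", "Information-Based"}
QUESTION_TYPES = {"factual", "reasoning", "detail", "temporal"}


@dataclass
class QAPair:
    question: str
    answer: str
    question_type: str  # must be one of QUESTION_TYPES

    def __post_init__(self):
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.question_type!r}")


@dataclass
class BenchCase:
    image_path: str    # initial world state (hypothetical field name)
    instruction: str   # action or instruction applied to that state
    dimension: str     # must be one of DIMENSIONS
    subcategory: str   # one of the 22 subcategories (not enumerated here)
    qa_pairs: list = field(default_factory=list)

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if not 5 <= len(self.qa_pairs) <= 7:
            raise ValueError("each case is expected to carry 5-7 QA pairs")
```

Validating in `__post_init__` is one possible design choice; it makes a malformed case fail at construction time rather than during scoring.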
Benchmark construction leverages VLM-assisted prompt and QA generation, followed by stringent human audit to ensure answerability, correctness, and uniqueness. WorldRewardBench then aggregates human preferences via calibrated scoring of reasoning quality, continuity, and aesthetics, facilitating both point-wise and pair-wise reward-model evaluation.
Evaluation Methodology
The evaluation protocol consists of:
- Process-aware Reasoning Verification: A two-stage QA pipeline checks if models reach the correct final state through plausible process transitions. Outcome accuracy (static content) and dynamic-phase scores (temporal/mechanistic reasoning) are contrasted via the reasoning gap metric, exposing "outcome hacking" (visually plausible but process-deficient generations).
- Multi-dimensional Quality Assessment: Models are scored on a 1–5 scale along three axes: reasoning correctness (primary), temporal consistency (secondary), and visual aesthetics (tertiary). Aggregation prioritizes reasoning quality, aligning with human annotation protocols and supporting reward model training.
- Human-Aligned Calibration: WorldRewardBench’s preference pairs are modeled with Bradley-Terry with ties, yielding human Elo rankings. Automatic metrics (ScorePR, AccQA) attain high rank correlation with human annotations (Spearman ρ=0.955), markedly outperforming pairwise VLM judges (e.g., ρ=0.804).
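The rank-correlation check in the calibration step can be illustrated with a short, dependency-free sketch. It computes Spearman's ρ as the Pearson correlation of average ranks, which is the standard definition; the function names and the toy numbers are assumptions for illustration and are not results from the benchmark.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy example: hypothetical human ratings vs. an automatic metric for 5 models.
human_rating = [5.0, 4.0, 3.0, 2.0, 1.0]
auto_metric = [50.0, 30.0, 40.0, 20.0, 10.0]
rho = spearman_rho(human_rating, auto_metric)  # one adjacent swap in the ranking
```

With five models and a single adjacent swap, ρ is 0.9, which gives a sense of how few rank disagreements a value such as 0.955 allows.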
Empirical Results and Analysis
Closed-source video generators (Seedance2.0, Veo3.1-Fast, Sora2, Kling, Wan2.6) substantially outperform all open-source systems, with a roughly two-fold gap in both reasoning (ScorePR: 32.4–39.8 vs. 14.4–17.9) and quality scores (S(v): 50.3–59.4 vs. 21.3–30.5). This gap is robust under bootstrap inference: no open-source model’s confidence interval overlaps with that of any closed-source system. Category-level analysis reveals:
- Logic Reasoning and Information-Based reasoning are persistently challenging. In Logic Reasoning, even the best closed-source ScorePR is only 31.7, and open-source models score below 14. Information-Based tasks (data reading, visual editing, process timelines) are bottlenecks for both generators and reward models.
- World Knowledge and Human-Centric reasoning are comparatively easy; closed-source models score above 35 in these dimensions (S(v) up to 80.1).
- Temporal and Mechanism Failures: Visual plausibility frequently masks deficits in process reasoning; outcome hacking is pervasive, especially among open-source models. The process-completeness ratio (Sdyn/AccQA) attributes failures primarily to dynamic phases.
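The two process-level diagnostics mentioned in this summary can be sketched as follows. The ratio Sdyn/AccQA is taken directly from the text. The reasoning gap is written here as a simple difference between outcome accuracy and the dynamic-phase score, which is an assumption: the summary names the metric without giving its formula. Function names and input values are hypothetical.

```python
def reasoning_gap(outcome_acc, dynamic_score):
    """Assumed form: outcome accuracy minus dynamic-phase score (same scale).

    A large positive gap suggests "outcome hacking": the final state looks
    right but the process that leads to it does not hold up.
    """
    return outcome_acc - dynamic_score


def process_completeness(s_dyn, acc_qa):
    """Ratio S_dyn / Acc_QA as named in the text; undefined for Acc_QA <= 0."""
    if acc_qa <= 0:
        raise ValueError("Acc_QA must be positive")
    return s_dyn / acc_qa


# Toy values on a 0-1 scale, not benchmark results.
gap = reasoning_gap(outcome_acc=0.8, dynamic_score=0.4)
ratio = process_completeness(s_dyn=0.3, acc_qa=0.6)
```

Under these assumed definitions, a ratio well below 1 together with a large gap would point to failures concentrated in the dynamic phases rather than in the final frame.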
Prompt-Side Guidance and Reasoning Assistance
Open-source generators exhibit larger gains from explicit transition hints (relative QA accuracy increases of 56–85%), indicating significant reliance on textual guidance rather than latent reasoning. Closed-source systems benefit less (+29%), suggesting more robust internal modeling of world dynamics.
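The gains quoted here are relative, not absolute. A minimal sketch of that arithmetic, with a hypothetical function name and toy accuracies that are not benchmark numbers:

```python
def relative_gain(baseline_acc, hinted_acc):
    """Relative change in QA accuracy after adding transition hints."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (hinted_acc - baseline_acc) / baseline_acc


# Toy example: accuracy rises from 20.0 to 31.2 points with hints.
gain = relative_gain(20.0, 31.2)
print(f"{gain:.0%}")
```

Because the baseline sits in the denominator, a model that starts from a low accuracy can show a large relative gain from a modest absolute improvement, so it helps to read these percentages alongside the absolute scores.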
Human Preference Calibration and Metrics Alignment
Process-aware QA metrics show tight calibration with human preferences, reinforcing their use over traditional visual judges, which conflate quality with realism. The proposed ScorePR metric penalizes outcome-hacking and maintains rank orderings consistent with human annotation protocols.
Reward-model evaluation under pairwise and pointwise protocols shows that direct pairwise comparison yields the highest agreement with human preferences, while pointwise scoring benefits training calibration. Information-Based cases are the hardest for reward models, corroborating generator-side findings.
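One way to compare the two protocols is to reduce pointwise scores to a preference label and measure agreement with human labels on the same pairs. The sketch below assumes a three-way label set (`"A"`, `"B"`, `"tie"`) and a hypothetical `tie_margin` parameter; neither is specified in this summary, and the data are toy values.

```python
def pointwise_to_preference(score_a, score_b, tie_margin=0.0):
    """Turn two pointwise scores into 'A', 'B', or 'tie'."""
    diff = score_a - score_b
    if abs(diff) <= tie_margin:
        return "tie"
    return "A" if diff > 0 else "B"


def pairwise_agreement(predicted, human):
    """Fraction of pairs where the predicted label equals the human label."""
    if len(predicted) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == h for p, h in zip(predicted, human)) / len(human)


# Toy example: four video pairs scored pointwise on a 1-5 scale.
score_pairs = [(4, 2), (3, 3), (1, 5), (2, 4)]
predicted = [pointwise_to_preference(a, b) for a, b in score_pairs]
human = ["A", "A", "B", "B"]
agreement = pairwise_agreement(predicted, human)
```

In this toy case the pointwise route loses one pair because equal scores collapse to a tie, which illustrates why a judge that compares the two videos directly can agree with annotators more often on close pairs.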
Practical and Theoretical Implications
WorldReasonBench’s results challenge prevailing assumptions about the world modeling capabilities of generative video systems. Despite visually convincing outputs, both closed- and open-source generators are incomplete as true world-state predictors; mechanism-level, temporal, and information preservation failures are dominant. The findings have several implications:
- Benchmarking visual realism is insufficient; structured reasoning and process validation are necessary for auditing world model competence.
- Prompt engineering and reward modeling must account for process-phase failures, especially as open-source models close the visual fidelity gap.
- Human calibration data is essential; reward models aligned to expert preferences facilitate safer deployment and more accurate evaluation across commercial and open-source releases.
- Logic Reasoning and Information-Based reasoning remain unsolved; progress in these domains will be pivotal for next-generation world simulators, impacting downstream applications in video QA, agentic simulation, and multimodal RL.
Future Directions
WorldReasonBench and WorldRewardBench will support community-driven extension. Key priorities include:
- Expansion of taxonomy to multi-agent, counterfactual, and multi-stage event chains.
- Compositional benchmarking against numerical physics and structured symbolic reasoning tasks.
- End-to-end training of reward models from expert preference data and reward-driven fine-tuning of generators.
- Broader cross-family judge evaluation, addressing residual biases on close pairs and model clusters.
Conclusion
WorldReasonBench establishes a rigorous, human-aligned standard for next-generation video generator evaluation, reframing video prediction as world-state reasoning under physical, social, logical, and informational constraints. The benchmark exposes a persistent gap between visual plausibility and mechanistic reasoning, with closed-source and open-source systems differing by roughly a factor of two in both reasoning and quality. Logic Reasoning and Information-Based cases are the most discriminative and challenging dimensions. WorldRewardBench enables calibrated, preference-driven reward modeling. The released tools and data are intended to support further work on world-aware video generation and reward-model design (2605.10434).