- The paper introduces VR-Bench, a benchmark that evaluates video models' spatial reasoning through maze-solving tasks spanning dynamic visual perception and temporal planning.
- It demonstrates that video models using chain-of-frame reasoning outperform text-based methods in efficiently solving complex maze tasks.
- The study shows scaling effects, as increased inference sampling boosts accuracy and generalization across diverse maze variations.
Reasoning via Video: Evaluation of Video Models through Maze-Solving Tasks
Introduction
The paper "Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks" (2511.15065) investigates the potential of video models to engage in spatial reasoning via video generation. With recent advancements in video models capable of coherent motion dynamics and high-fidelity video generation, this research seeks to explore their reasoning capabilities, akin to LLMs evolving from text generation to reasoning. The paper introduces VR-Bench, a benchmark designed to evaluate video models on maze-solving tasks, offering insights into their spatial planning and dynamic visual reasoning.
Dataset and Benchmark Design
VR-Bench Composition
VR-Bench includes five distinct maze types: Regular Maze, Trapfield, Irregular Maze, Sokoban, and 3D Maze, capturing both 2D and 3D spatial reasoning. The mazes are procedurally generated across diverse visual styles and difficulty levels, for a total of 7,920 videos. Each maze is paired with a path-tracing task that requires the model to find an optimal route, probing its spatial perception and planning abilities.
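The paper does not reproduce its generation code; as a minimal sketch, a regular maze of the kind benchmarked here can be produced with randomized depth-first search. The grid representation below (cells mapped to sets of open directions) is an assumption for illustration, not VR-Bench's actual format:

```python
import random

def generate_maze(width, height, seed=None):
    """Generate a regular maze via randomized depth-first search.

    Returns a dict mapping each cell (x, y) to the set of open
    directions ('N', 'S', 'E', 'W'); all walls start closed.
    """
    rng = random.Random(seed)
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    maze = {(x, y): set() for x in range(width) for y in range(height)}

    stack = [(0, 0)]
    visited = {(0, 0)}
    while stack:
        x, y = stack[-1]
        # Unvisited in-bounds neighbours of the current cell.
        options = [
            (d, (x + dx, y + dy))
            for d, (dx, dy) in moves.items()
            if (x + dx, y + dy) in maze and (x + dx, y + dy) not in visited
        ]
        if not options:
            stack.pop()  # Dead end: backtrack.
            continue
        direction, nxt = rng.choice(options)
        maze[(x, y)].add(direction)       # Knock down the shared wall
        maze[nxt].add(opposite[direction])
        visited.add(nxt)
        stack.append(nxt)
    return maze
```

Varying the grid size and the rendering texture then yields the difficulty and style variations shown in Figure 1.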
Figure 1: Variations of difficulty level and maze texture.
Reasoning Paradigm
Unlike traditional reasoning methods expressed in textual formats, video models employ a chain-of-frame (CoF) reasoning paradigm, generating sequential visual frames that encode spatial dynamics and temporal causality. This approach utilizes the inherent continuity of video data, making it a suitable substrate for multimodal reasoning.
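Concretely, a CoF trajectory has to be read back off the generated frames before it can be scored. The paper does not detail its parser; a minimal sketch, assuming the agent is rendered in a distinctive color (red here) and the maze is axis-aligned with a fixed cell size, might look like:

```python
import numpy as np

AGENT_COLOR = np.array([255, 0, 0])  # assumption: agent rendered in red

def trajectory_from_frames(frames, cell_size):
    """Recover a grid trajectory from generated video frames.

    `frames` is an iterable of HxWx3 uint8 arrays. We locate the agent
    in each frame by nearest-color matching, map its centroid to a grid
    cell, and collapse consecutive duplicates into discrete steps.
    """
    path = []
    for frame in frames:
        # Pixel-wise distance to the agent color; the threshold is a guess.
        dist = np.linalg.norm(frame.astype(float) - AGENT_COLOR, axis=-1)
        ys, xs = np.where(dist < 60)
        if len(xs) == 0:
            continue  # agent not visible in this frame
        cell = (int(xs.mean() // cell_size), int(ys.mean() // cell_size))
        if not path or path[-1] != cell:
            path.append(cell)
    return path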
Evaluation Metrics
VR-Bench evaluates video models across several trajectory metrics (a sketch of plausible implementations follows the list):
- Exact Match (EM): Assessing complete trajectory correctness against the optimal path.
- Success Rate (SR): Measuring task completion by reaching designated goals.
- Precision Rate (PR): Quantifying the accuracy of steps along the trajectory.
- Step Deviation (SD): Evaluating path-length efficiency relative to the optimal solution.
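The paper states these metrics in prose rather than code; a minimal Python sketch of plausible formulas, assuming trajectories are lists of grid cells and that PR and SD are computed relative to the optimal path, is:

```python
def evaluate_path(pred, optimal, goal):
    """Compute the four trajectory metrics for one maze.

    `pred` and `optimal` are lists of grid cells; `goal` is the target
    cell. The exact formulas here follow the descriptions above and are
    assumptions, not the paper's reference implementation.
    """
    em = float(pred == optimal)                  # Exact Match
    sr = float(bool(pred) and pred[-1] == goal)  # Success Rate
    valid = set(optimal)
    pr = sum(c in valid for c in pred) / max(len(pred), 1)   # Precision Rate
    sd = (len(pred) - len(optimal)) / max(len(optimal), 1)   # Step Deviation
    return {"EM": em, "SR": sr, "PR": pr, "SD": sd}
```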
Figure 2: Failure-case visualization and the VLM-as-judge pipeline.
Additionally, rule compliance and structural fidelity are evaluated with a VLM-as-judge score (VLM-score) and Maze Fidelity (MF), which examine motion continuity, structural consistency, and the rationality of agent-environment interactions.
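The judging rubric is not reproduced here; a minimal sketch of a VLM-as-judge call, where `query_vlm` is a hypothetical client for any multimodal API and the 0-5 scale is an assumption rather than the paper's exact rubric, could look like:

```python
import json

JUDGE_PROMPT = """You are judging a maze-solving video.
Rate each criterion from 0 to 5:
1. Motion continuity: does the agent move smoothly between frames?
2. Structural consistency: do walls and layout stay fixed?
3. Interaction rationality: does the agent avoid passing through walls?
Answer as JSON: {"continuity": _, "consistency": _, "rationality": _}."""

def vlm_score(frames, query_vlm):
    """Score rule compliance with a VLM judge.

    `query_vlm(prompt, images) -> str` is a hypothetical interface
    standing in for whatever multimodal API is available.
    """
    raw = query_vlm(JUDGE_PROMPT, frames)
    scores = json.loads(raw)
    return sum(scores.values()) / (3 * 5)  # normalize to [0, 1]
```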
Experimental Analysis
The paper compares proprietary and open-source video models against VLMs on the VR-Bench tasks. Wan-R1, a fine-tuned video model, delivers top-tier performance on nearly every metric, with significant efficiency and accuracy gains over the baselines.
Figure 3: Model performance (PR and SR) on Irregular Maze and Trapfield across difficulty levels. Each curve represents a baseline, while the dashed and dotted lines indicate VLM and Video Model averages.
Reasoning Efficacy
Video models exhibit superior reasoning capabilities under complex task conditions, outperforming text-based reasoning methods. This advantage is attributed to their ability to leverage temporal visual context and spatial continuity, facilitating higher information density and more stable reasoning processes.
Scaling Effects and Generalization
The research highlights the test-time scaling effect, where increased sampling during inference leads to substantial performance gains, akin to self-consistency observed in LLMs. Video models explore diverse reasoning paths, improving reliability and accuracy.
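One plausible way to realize this test-time scaling is self-consistency-style voting over K sampled trajectories. In the sketch below, `generate_video` and `extract_path` are assumed interfaces standing in for the model and the frame parser, and majority voting is one possible aggregation rule, not necessarily the paper's:

```python
from collections import Counter

def best_of_k(generate_video, extract_path, maze, k=8):
    """Test-time scaling via repeated sampling.

    Samples `k` candidate videos for the same maze, extracts a
    trajectory from each, and returns the most frequent one, mirroring
    self-consistency voting in LLMs.
    """
    candidates = []
    for _ in range(k):
        frames = generate_video(maze)            # one stochastic sample
        candidates.append(tuple(extract_path(frames)))
    votes = Counter(candidates)
    path, _count = votes.most_common(1)[0]       # majority trajectory
    return list(path)
```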
Figure 4: Performance on Irregular Maze using Wan-R1 under test-time scaling. Results are shown across sampling numbers K ∈ {1, 4, 8, 12, 16} and difficulty levels.
Wan-R1 also generalizes strongly across difficulty levels and maze types, maintaining robust reasoning on unseen maze variations and textures, which underscores its scalability to diverse environments.
Conclusion
This paper presents an in-depth evaluation of video models' reasoning abilities through the VR-Bench benchmark, showcasing their potential in spatial reasoning tasks. The findings show that video models are proficient at dynamic visual reasoning, outperform text-based methods on these tasks, and scale reliably at test time while generalizing across diverse scenarios.
Future research could extend VR-Bench to broader complex reasoning tasks and explore embodied reasoning settings, enhancing video models' application in interactive environments.