Reasoning Capabilities of Video Generative Models vs. LLMs

Determine whether contemporary video generative models, including image-to-video generation systems, can exhibit reasoning capabilities similar to those of large language models (LLMs); that is, whether they possess comparable step-by-step reasoning abilities that go beyond visual fidelity and temporal coherence.

Background

LLMs have demonstrated strong step-by-step reasoning abilities, while recent video generative models are beginning to tackle tasks requiring physical plausibility and logical consistency. Veo 3’s introduction of chain-of-frames reasoning suggests a pathway toward visual reasoning in image-to-video generation.

Despite this progress, existing benchmarks largely assess visual fidelity, temporal smoothness, and physical plausibility, leaving higher-order visual reasoning under-evaluated. This motivates a dedicated benchmark to assess whether video generative models can truly match the reasoning capabilities associated with LLMs.

References

"However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to LLMs."

Source: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models (arXiv:2511.13704, Chen et al., 17 Nov 2025), Abstract
