VideoThinkBench: Evaluating Video Reasoning Models

Updated 8 November 2025
  • VideoThinkBench is a unified benchmark designed to evaluate the temporal, multimodal reasoning abilities of video generation models.
  • It categorizes tasks into vision-centric puzzles and text-centric challenges, enabling precise measurements of visual inference and logical reasoning.
  • The framework employs automated checks on video frames and audio transcripts, enhancing evaluation reliability through self-consistency and aggregation techniques.

The Video Thinking Benchmark (VideoThinkBench) is a systematic evaluation suite developed to probe the reasoning capabilities of video generation models, specifically under the "Thinking with Video" paradigm. Unlike traditional benchmarks predicated on text or static images, VideoThinkBench leverages the temporal and dynamic nature of video to create a unified assessment platform for both visual and textual reasoning. Its goal is to establish whether video-generation models can function as general, multimodal reasoners by synthesizing perception, logical inference, and action within temporally structured tasks.

1. Conceptual Foundations and Design Rationale

VideoThinkBench is motivated by two intrinsic limitations of earlier reasoning paradigms: static images cannot capture dynamic or procedural knowledge, and separating vision and text impedes a truly unified multimodal understanding. The benchmark is constructed to address these limitations by:

  • Enabling illustration of step-by-step reasoning and imagination through video generation.
  • Encompassing both vision-centric and text-centric reasoning tasks in a single evaluation scaffold.
  • Directly assessing if a model’s output video visualizes not just an answer, but also the full reasoning trajectory.

This design philosophy positions VideoThinkBench as a diagnostic tool for evaluating the native reasoning capacities of video generation models, with an emphasis on temporal abstraction, multimodal fusion, and procedural logic (Tong et al., 6 Nov 2025).

2. Benchmark Structure and Task Taxonomy

VideoThinkBench is organized into two primary task categories:

  1. Vision-Centric Tasks: Require visual reasoning via drawing, geometric construction, or pattern induction, and can often be automatically checked via frame inspection.
    • Subtypes/Examples:
      • Eyeballing Puzzles: Estimation of geometric properties (e.g., incenter, midpoint, collinearity) via construction and marking, posed as questions such as "Which point is the incenter?"
      • Visual Puzzles: Adaptations of PuzzleVQA, requiring color-filling, region marking, or shape drawing.
      • ARC-AGI-2: Abstract pattern completion or transformation, often with few-shot demonstrations.
      • Mazes: Navigational path generation within parametric grid mazes.
  2. Text-Centric Tasks: Pose language-based problems (e.g., arithmetic or general knowledge), requiring that the solution be constructed and revealed in the video, typically by drawing or writing.
    • Subtypes/Examples:
      • Math Benchmarks: GSM8K, MATH-500, AIME (multi-step grade-school to advanced problems).
      • General Knowledge: BBH, MMLU, GPQA, SuperGPQA, and multimodal logic tasks (e.g., MMMU, MathVista) recast for video-based answer encoding.

Each task or benchmark sample is mapped to a programmatically verifiable schema: vision-centric tasks rely on geometric checks or color matching on final video frames, while text-centric tasks leverage OCR and audio transcription.
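The record format itself is not reproduced in this summary; the following is a hypothetical sketch of what such programmatically verifiable samples could look like, with every field name and value (task_id, answer_spec, coordinates, checker types) invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical schema sketch: field names are illustrative assumptions,
# not the benchmark's published format.
@dataclass
class VideoThinkSample:
    task_id: str
    category: Literal["vision-centric", "text-centric"]
    subtype: str                 # e.g. "eyeballing", "maze", "gsm8k"
    prompt: str                  # instruction given to the video model
    answer_spec: dict = field(default_factory=dict)  # what the checker verifies

# Vision-centric sample: verified by geometric inspection of the final frame.
eyeballing = VideoThinkSample(
    task_id="eyeball-0001",
    category="vision-centric",
    subtype="eyeballing",
    prompt="Mark the incenter of the displayed triangle.",
    answer_spec={"check": "point_near", "target_xy": (412, 305), "tol_px": 15},
)

# Text-centric sample: verified via OCR on the last frame and/or the audio transcript.
gsm8k_item = VideoThinkSample(
    task_id="gsm8k-0042",
    category="text-centric",
    subtype="gsm8k",
    prompt="Write out the solution step by step and state the final answer aloud.",
    answer_spec={"check": "exact_number", "gold": "42"},
)
```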

A representative dataset breakdown:

| Task Group     | #Samples |
|----------------|----------|
| Vision-Centric | 2,696    |
| Text-Centric   | 1,453    |
| Total          | 4,149    |

3. Evaluation Methodology

Evaluation is multi-modal and channel-aware:

  • For Vision-Centric Tasks, three assessment protocols are deployed:
    • Audio Channel: The spoken answer, transcribed via whisper-1.
    • Last Frame: Machine inspection of the final frame for correct drawing/marking or writing.
    • Major Frame: Majority vote over predictions from every 5th frame to handle temporal ambiguity (a scoring sketch follows this list).
    • Automated scripts analyze position, color, trajectory, and geometric features.
  • For Text-Centric Tasks, both video (visual answer extraction) and audio (speech-to-text) channels are evaluated, using both intersection ("V∩A" requires both to be correct) and union ("V∪A") scoring.
    • LLM-as-a-judge (GPT-4o) is used to verify correctness for open-ended problems.
  • For Multiple-Choice VLM Baselines, models are prompted to provide explicit answer labels, which are parsed and compared.
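The checking scripts themselves are not reproduced here; the sketch below illustrates, under stated assumptions, two mechanisms described in the list above: majority voting over every 5th frame, and V∩A / V∪A scoring. It assumes per-frame and audio answers have already been extracted (e.g., via OCR and whisper-1 transcription); all function names are illustrative.

```python
from collections import Counter

def major_frame_answer(frame_answers: list[str], stride: int = 5) -> str:
    """Majority vote over every `stride`-th frame's extracted answer."""
    sampled = frame_answers[::stride]
    return Counter(sampled).most_common(1)[0][0]

def score_text_centric(video_answer: str, audio_answer: str, gold: str) -> dict:
    """Intersection (V∩A) and union (V∪A) scoring for text-centric tasks."""
    v_ok = video_answer.strip() == gold.strip()
    a_ok = audio_answer.strip() == gold.strip()
    return {"V": v_ok, "A": a_ok, "V∩A": v_ok and a_ok, "V∪A": v_ok or a_ok}

# Toy usage with answers already extracted from frames and audio.
frames = ["18"] * 15 + ["17"] * 5
print(major_frame_answer(frames))                 # -> "18"
print(score_text_centric("18", "17", gold="18"))  # V∪A is True, V∩A is False
```

In this formulation, union scoring credits a sample if either channel carries the correct answer, while intersection scoring additionally demands cross-modal consistency.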

For vision tasks such as visual puzzles, a deviation metric is computed:

$$\mathrm{Diff} = \sum_{(x, y) \in \mathrm{Puzzle\ Area}} \delta\left(\mathrm{Pixel}_{\mathrm{gen}}(x, y),\, \mathrm{Pixel}_{\mathrm{gt}}(x, y)\right)$$

Here, $\delta$ is a per-pixel RGB or binary error term.
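As a concrete reading of this formula, the NumPy sketch below sums the per-pixel error over a boolean puzzle-area mask; the specific binary and RGB-distance instantiations of $\delta$ are assumptions for illustration.

```python
import numpy as np

def pixel_diff(gen_frame: np.ndarray, gt_frame: np.ndarray,
               puzzle_mask: np.ndarray, binary: bool = True) -> float:
    """Deviation between generated and ground-truth frames over the puzzle area.

    gen_frame, gt_frame: (H, W, 3) uint8 RGB images.
    puzzle_mask: (H, W) boolean mask selecting the puzzle area.
    binary=True counts mismatched pixels; otherwise sums per-pixel RGB distances.
    """
    gen = gen_frame[puzzle_mask].astype(np.int32)
    gt = gt_frame[puzzle_mask].astype(np.int32)
    if binary:
        # delta = 1 where any RGB channel differs, else 0
        return float(np.any(gen != gt, axis=-1).sum())
    # delta = Euclidean RGB distance per pixel
    return float(np.linalg.norm(gen - gt, axis=-1).sum())
```

A lower Diff indicates closer agreement with the reference solution; in practice the mask would be derived from the puzzle's known layout.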

4. Empirical Results and Diagnostic Findings

Across both categories, evaluation tables are provided for major state-of-the-art systems; a selected excerpt (major frame evaluation):

| Task         | Sora-2 | Gemini 2.5 Pro | Claude Sonnet 4.5 |
|--------------|--------|----------------|-------------------|
| Eyeballing   | 40.2   | 26.5           | 35.1              |
| Visual Color | 67.0   | 73.9           | 85.6              |
| ARC-AGI-2    | 1.3    | 4.9            | 13.6              |

On text-centric math tasks (audio accuracy):

| Dataset  | Sora-2 | Gemini | GPT5 High | Claude |
|----------|--------|--------|-----------|--------|
| GSM8K    | 98.9   | 94.8   | 97.2      | 90.0   |
| MATH-500 | 92     | —      | —         | —      |
| MMMU     | 69.2   | —      | —         | —      |

Findings:

  • Video generation models (e.g., Sora-2) achieve high accuracy (>90%) in grade-school to advanced math and general knowledge tasks when allowed to generate answers via drawings and verbal utterances.
  • Performance on vision-centric tasks varies: these models are competitive or even superior for geometric point estimation, but less so on abstract pattern induction (ARC-AGI-2).
  • Self-consistency and temporal aggregation (major frame voting) substantially boost Sora-2’s accuracy (e.g., Arc Connect puzzle: 68% to 90% with 5-retry aggregation); a probabilistic sketch of this aggregation effect follows this list.
  • In-context learning (few-shot) yields measurable accuracy gains, especially for abstract pattern tasks.
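To make the aggregation effect concrete, the sketch below estimates majority-vote accuracy over k independent retries from a single-run accuracy p. The independence assumption is purely illustrative: the benchmark's actual aggregation may combine frame-level voting with retries, and the reported 90% on Arc Connect exceeds this simple estimate.

```python
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k independent runs is correct,
    given per-run accuracy p (i.i.d. assumption; ties count as incorrect)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# Arc Connect: single-run accuracy 0.68, aggregated over 5 retries.
print(round(majority_vote_accuracy(0.68, 5), 2))  # ~0.81 under independence
```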

5. Diagnostic Analyses: Reasoning and Modality Insights

Multiple analyses elucidate reasoning provenance and robustness:

  • Reasoning Source Analysis: For math problems, performance remains stable on perturbed variants designed to control for test-set leakage, indicating genuine procedural reasoning rather than memorization.
  • Prompt Rewriter Effect: Control experiments (Wan2.5) reveal that video model reasoning often emerges from an internal LLM/VLM module that generates step-wise instructions for video synthesis.
  • Reasoning Quality: While end answers may be correct, manual inspection finds that only a minority (<14%) of Sora-2’s written chains of thought are fully correct.
  • Self-Consistency: Aggregation over frames/generations consistently improves accuracy, mirroring the test-time scaling observed in pure LLMs.
  • Modality Cross-validation: Discrepancies between video and audio channels highlight the necessity for unified multimodal evaluation; intersection scoring penalizes inconsistent outputs.

6. Implications for Unified Multimodal Reasoning

VideoThinkBench provides evidence that "Thinking with Video" is a viable and diagnostic paradigm for unified multimodal reasoning:

  • Dynamic manipulation (drawing, writing, constructing) in the video domain enables assessment of reasoning types (spatial, procedural, abstract, symbolic) that static-image or text-only benchmarks cannot.
  • Direct video generation allows verifiable inspection of the reasoning process, not just the answer, bridging the gap between “showing” and “telling.”
  • The benchmark supports robust, reproducible evaluation and cross-model comparison, with fully automated checking for most tasks.
  • Analysis indicates that reasoning performance in current video models is tightly coupled to the architecture’s LLM component, rather than the generative engine per se.

7. Prospects and Future Directions

VideoThinkBench’s comprehensive structure and diagnostic regime set a precedent for subsequent benchmarks targeting video-based, multimodal reasoning:

  • The paradigm is extensible to systematically cover spatial, temporal, inductive, and symbolic tasks of arbitrary complexity.
  • Performance gains from self-consistency and in-context learning in Sora-2 suggest that general video reasoning models can benefit from aggregation and prompt engineering techniques analogous to those in text-based models.
  • Persistent performance gaps in abstract pattern reasoning and multi-step written CoT generation highlight areas demanding new model architectures, training recipes, or fusion interfaces.
  • Future iterations may integrate real-world video, add tasks emphasizing counterfactual or causal reasoning, and expand to cross-embodiment settings spanning vision, language, and action.

VideoThinkBench thus establishes a rigorous, multimodal foundation for unified reasoning assessment in AI, supporting both vision-centric and language-centric inference within the generative video paradigm, and providing an empirical springboard for the development of more cognitively aligned, human-like AI systems (Tong et al., 6 Nov 2025).

References

  1. Tong et al., 6 Nov 2025.
