TiViBench: Hierarchical I2V Benchmark

Updated 3 July 2026

TiViBench is a hierarchical benchmark for assessing the higher-order reasoning capabilities of image-to-video generation models across diverse challenges.
It organizes evaluation into four reasoning dimensions using 24 scenarios at varying difficulty levels, employing both quantitative and qualitative metrics.
VideoTPO, a test-time optimization strategy, iteratively refines video generation prompts via self-analysis to improve logical and perceptual consistency.

TiViBench is a hierarchical benchmark designed to assess the higher-order reasoning capabilities of image-to-video (I2V) generation models. Unlike prior benchmarks focused primarily on visual fidelity or temporal coherence, TiViBench systematically probes forms of reasoning analogous to those demonstrated by LLMs: structural search, visual pattern reasoning, symbolic logic, and action planning. It defines 24 diverse video generation scenarios instantiated at multiple levels of difficulty and provides a protocol for quantitative and qualitative evaluation. The framework also introduces VideoTPO, a test-time strategy that improves reasoning performance through self-analysis by a vision-LLM (VLM), without updating model weights (Chen et al., 17 Nov 2025).

1. Benchmark Structure and Reasoning Dimensions

TiViBench organizes assessment across four primary hierarchical reasoning dimensions:

Structural Reasoning & Search: Evaluates whether a video generative model can traverse, search, or extrapolate within abstract structures. Tasks include graph traversal, maze solving, sorting numbers, temporal ordering, rule extrapolation, and game-move prediction. The underlying rationale is to test whether a model implicitly learns environment topology and can generate coherent solution paths.
Spatial & Visual Pattern Reasoning: Probes the ability to detect, complete, or extend spatial and temporal visual patterns. Representative tasks include shape fitting, connecting colors, pattern recognition, odd-one-out identification, counting objects, and visual analogy. Such tasks assess perceptual grouping, symmetry, repetitive pattern recognition, and quantitative reasoning.
Symbolic & Logical Reasoning: Focuses on abstract symbol manipulation and rule-following. Examples include Sudoku completion, arithmetic operations, symbolic deduction, visual deduction (e.g., fill-in-the-blank), transitive inference, and game-rule application. This dimension tests capabilities beyond raw pixel-level understanding, demanding abstraction and formal rule handling.
Action Planning & Task Execution: Assesses multi-step, temporally coherent physical tasks. These include tool use, robot navigation, goal-directed planning, multi-step manipulation, visual instruction following, and game-strategy planning. Tasks simulate causally entangled, real-world processes.

Each of the 24 base scenarios is instantiated at three graded difficulty levels (Easy, Medium, Hard) that vary the number of steps, the rules to infer, or the amount of visual distraction, yielding a total of 595 benchmark samples.

Dimension	Example Scenario	Easy	Medium	Hard
Structural	Maze Solve	5×5	7×7	10×10
Spatial	Shape Fitting	3 pieces	5 pieces	7 pieces
Symbolic	Sudoku	4×4 grid	6×6 grid	9×9 grid
Planning	Robot Navigation	3 steps	5 steps	8 steps

2. Evaluation Protocols and Metrics

Evaluation is divided into quantitative and qualitative methodologies:

Pass@k: The fraction of tasks for which at least one out of $k$ sampled videos is correct:

$\text{Pass@}k = \frac{\text{number of tasks where at least one of the } k \text{ outputs is correct}}{\text{total number of tasks}}$

Accuracy (Pass@1): Fraction of tasks solved correctly by the top output:

$\text{Accuracy} = \frac{\text{Number of correctly solved tasks}}{\text{Total tasks}}$
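The two ratios above can be computed directly from per-sample correctness flags. The following is a minimal sketch, assuming each task has an ordered list of booleans (one per sampled video) and that Pass@k is taken over the first k samples; the task names and flags are hypothetical illustrations, not benchmark data.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k samples is correct."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(1 for outcomes in results.values() if any(outcomes[:k]))
    return solved / len(results)


# Hypothetical correctness flags for five sampled videos per task.
results = {
    "maze_easy_01": [False, False, True, False, False],
    "sudoku_hard_03": [False, False, False, False, False],
    "shape_fit_medium_07": [True, False, True, True, False],
    "robot_nav_easy_02": [False, True, False, False, False],
}

print(pass_at_k(results, k=1))  # 0.25 (Accuracy, i.e. Pass@1)
print(pass_at_k(results, k=5))  # 0.75
```

With k = 1 the function reduces to the Accuracy definition, since only the first output of each task is considered.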

Final-State Validation: Compares the final frame or extracted facts against ground truth using OpenCV-based checks (e.g., digit grids, color segmentation), DINO/DINO-X feature similarity, or SSIM for structural fidelity.
Process-and-Goal Consistency: Tracks sequences for logical trajectory compliance using DINO-X tracking, bounding-box grounding, or vision-LLM-based QA, ensuring intermediates obey intended task rules.
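To make the distinction between the two checks concrete, here is a minimal sketch for a maze task. It assumes the agent's per-frame grid cell has already been extracted from the video (for example by a tracker); the function name `check_maze_trajectory`, the grid encoding, and the example paths are hypothetical and are not taken from the benchmark's tooling.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column)


def check_maze_trajectory(
    grid: List[str], path: List[Cell], start: Cell, goal: Cell
) -> Dict[str, bool]:
    """Check a tracked path on a grid where '#' marks walls and '.' free cells."""
    rows, cols = len(grid), len(grid[0])

    def is_free(cell: Cell) -> bool:
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#"

    # Final-state check: the path starts at the start cell and ends on the goal.
    goal_reached = bool(path) and path[0] == start and path[-1] == goal
    # Process check: every visited cell is free, and consecutive frames move
    # at most one cell horizontally or vertically (staying in place is allowed).
    cells_free = all(is_free(cell) for cell in path)
    steps_local = all(
        abs(r1 - r0) + abs(c1 - c0) <= 1
        for (r0, c0), (r1, c1) in zip(path, path[1:])
    )
    process_valid = cells_free and steps_local
    return {
        "goal_reached": goal_reached,
        "process_valid": process_valid,
        "passed": goal_reached and process_valid,
    }


grid = [
    "..#",
    ".##",
    "...",
]
honest = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
shortcut = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]  # passes through a wall at (1, 1)

print(check_maze_trajectory(grid, honest, (0, 0), (2, 2)))
print(check_maze_trajectory(grid, shortcut, (0, 0), (2, 2)))
```

The second path reaches the goal but crosses a wall, so it satisfies the final-state check while failing the process check; requiring both is what separates a genuine solution from a video that merely ends in the right place.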

Scores are reported per dimension and broken down by difficulty level, rather than aggregated into a single scalar metric. An aggregate score, if desired, can be formulated as:

$S_{\rm aggregate} = \sum_{d\in\{\text{Struct,Spatial,Symb,Plan}\}} w_d \left(\frac{\text{Pass@1}_d(\text{Easy})+\text{Pass@1}_d(\text{Med})+\text{Pass@1}_d(\text{Hard})}{3}\right)$

with $w_d$ as per-dimension weights (uniform in default reporting). However, TiViBench primarily analyzes each reasoning dimension independently.
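As a worked illustration of the aggregate formula, the sketch below averages Pass@1 over the three difficulty levels within each dimension and then applies the per-dimension weights. It assumes uniform weights are normalized to sum to 1, which the text does not state explicitly; the Pass@1 values are hypothetical placeholders, not reported results.

```python
from typing import Dict, Optional


def aggregate_score(
    pass1: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum over dimensions of the mean Pass@1 across difficulty levels."""
    if weights is None:
        # Assumption: uniform weights that sum to 1.
        weights = {dim: 1.0 / len(pass1) for dim in pass1}
    total = 0.0
    for dim, by_level in pass1.items():
        level_mean = sum(by_level.values()) / len(by_level)
        total += weights[dim] * level_mean
    return total


# Hypothetical Pass@1 values per dimension and difficulty level.
scores = {
    "Struct": {"Easy": 0.30, "Med": 0.20, "Hard": 0.10},
    "Spatial": {"Easy": 0.45, "Med": 0.30, "Hard": 0.15},
    "Symb": {"Easy": 0.30, "Med": 0.20, "Hard": 0.10},
    "Plan": {"Easy": 0.60, "Med": 0.40, "Hard": 0.20},
}

print(round(aggregate_score(scores), 3))  # 0.275
```

Here the per-dimension means are 0.20, 0.30, 0.20, and 0.40, so the uniform aggregate is their average, 0.275.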

3. Model Performance and Comparative Results

TiViBench has been used to evaluate both commercial and open-source I2V models:

Commercial Models: Sora 2, Veo 3.1, Kling 2.1
Open-Source Models: CogVideoX1.5, HunyuanVideo, Wan2.1, Wan2.2

Pass@1 Overall Accuracy:

Sora 2: 27.9%
Veo 3.1: 26.1%
Kling 2.1: 11.6%
Wan2.2: 9.4%
Wan2.1: 8.4%
HunyuanVideo: 4.0%
CogVideoX1.5: 2.0%

Sora 2 exhibits the highest overall reasoning performance, with notable strengths in action planning/execution (38.2% Pass@1), spatial pattern reasoning (31.8%), symbolic logic (22.0%), and structural search (18.7%). Open-source models show latent potential when more samples are drawn: Wan2.2, for example, rises from 9.4% at Pass@1 to 16.5% at Pass@5. Failure analyses reveal particular challenges in tasks such as maze solving, temporal ordering, odd-one-out identification, and Sudoku completion, attributed to explicit rule violations, loss of fine-grained features, and difficulty with intermediate state tracking. The paper's qualitative assessment (Appendix E) provides side-by-side video montages of both successful and failed model outputs.

4. VideoTPO: Test-Time Preference Optimization

VideoTPO is a test-time reasoning enhancement strategy inspired by preference optimization. Rather than updating model weights, VideoTPO iteratively refines generation prompts by leveraging self-analysis via a vision-LLM (VLM), such as GPT-4o:

Input: image I; initial prompt P_0; max steps T
for t = 0 … T-1:
  # 1. Generate candidates
  V1, V2 ← I2V_generate(I, P_t)
  # 2. Self‐analysis
  L_t ← VLM_analyze(V1, V2, P_t)
  # 3. Textual gradient / suggestions
  G_t ← VLM_suggest(P_t, L_t)
  # 4. Prompt update
  P_{t+1} ← VLM_refine(P_t, G_t)
end
return best of {V1, V2} under final P_T

At each iteration, the VLM analyzes the generated videos and the current prompt, then provides qualitative loss feedback ($\mathcal{L}_t$) and prompt-rewriting suggestions ($\mathcal{G}_t$), leading to improved prompt versions for subsequent generation cycles. The optimization objective is to maximize a latent “preference score,” where pairwise VLM judgments approximate expected task accuracy:

$\max_{P'} \; \mathbb{E}_{V\sim \text{I2V}(I;P')} [\text{accuracy}(V)]$

Candidate reranking is driven purely by self-analysis, with no external reward model or additional data required.
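The loop described in this section can be sketched in Python as follows. This is an illustrative skeleton under stated assumptions, not the authors' implementation: the I2V model and the VLM are passed in as plain callables, the final candidate pair is generated under the last refined prompt, and the `pick_best` selector and all `toy_*` stand-ins are hypothetical.

```python
from typing import Any, Callable, List, Tuple

Video = Any  # placeholder type for a generated video


def video_tpo(
    image: Any,
    prompt: str,
    generate: Callable[[Any, str], Tuple[Video, Video]],
    analyze: Callable[[Video, Video, str], str],
    suggest: Callable[[str, str], str],
    refine: Callable[[str, str], str],
    pick_best: Callable[[Video, Video, str], Video],
    max_steps: int = 2,
) -> Tuple[Video, str, List[str]]:
    """Refine the prompt for max_steps rounds, then return the preferred video."""
    history = [prompt]
    for _ in range(max_steps):
        v1, v2 = generate(image, prompt)      # 1. generate two candidates
        loss = analyze(v1, v2, prompt)        # 2. textual self-analysis (L_t)
        gradient = suggest(prompt, loss)      # 3. textual suggestions (G_t)
        prompt = refine(prompt, gradient)     # 4. prompt update (P_{t+1})
        history.append(prompt)
    # Assumption: the final pair is sampled under the last refined prompt.
    v1, v2 = generate(image, prompt)
    return pick_best(v1, v2, prompt), prompt, history


# Hypothetical stand-ins so the sketch runs without any model or API access.
def toy_generate(image, prompt):
    return (f"{image}|{prompt}|a", f"{image}|{prompt}|b")

def toy_analyze(v1, v2, prompt):
    return "the agent crosses a wall in both candidates"

def toy_suggest(prompt, loss):
    return "state that walls are impassable"

def toy_refine(prompt, gradient):
    return f"{prompt} [{gradient}]"

def toy_pick_best(v1, v2, prompt):
    return v1


best, final_prompt, history = video_tpo(
    "maze.png", "Solve the maze.", toy_generate, toy_analyze,
    toy_suggest, toy_refine, toy_pick_best, max_steps=2,
)
print(final_prompt)
```

Passing the models in as callables keeps the loop independent of any particular I2V system or VLM API; in practice each callable would wrap a real model call, and only the prompt changes between rounds.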

5. Limitations and Prospects for Advancing Video Reasoning

TiViBench analyses expose several limitations intrinsic to current I2V approaches:

Rule Encoding Deficits: Models frequently violate explicit task constraints (e.g., traversing maze walls) due to the absence of an internal symbolic reasoning or rule engine.
Loss of Fine Visual Details: Video autoencoder compressions diminish features critical for symbolic manipulation tasks (e.g., small Sudoku digits).
Insufficient Process Supervision: Generated videos can achieve correct final states by following logically invalid trajectories.

Proposed future research directions include hybridizing rule modules or planners with diffusion backbones, implementing frame-level reinforcement learning to enforce both process and goal consistency, pretraining on larger/more diverse reasoning-centric datasets, and extending TiViBench to encompass multi-agent interactions and continuous control tasks.

TiViBench, coupled with VideoTPO, provides a systematic protocol for evaluating and incrementally improving the reasoning abilities of video generation models, supporting rigorous advancement toward video models with genuine higher-order reasoning capacities (Chen et al., 17 Nov 2025).


References

Chen et al. TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models (17 Nov 2025)
