T2V-CompBench: Video Synthesis Benchmark
- T2V-CompBench is a systematic framework that rigorously assesses text-to-video models on complex compositional prompts involving multiple objects, attributes, and actions.
- It employs MLLM-based, detection-based, and tracking-based metrics to evaluate spatial, dynamic, and numerical fidelity across seven distinct compositional categories.
- Benchmarking reveals commercial models often outperform open-source ones, highlighting persistent challenges in dynamic attribute transitions and spatial reasoning.
Text-to-video compositional benchmarking ("T2V-CompBench") refers to the systematic assessment of text-to-video (T2V) generative models with respect to their ability to synthesize videos that accurately bind multiple objects, attributes, actions, and spatial–temporal relationships—well beyond simple prompt-to-video evaluation. The T2V-CompBench framework introduces rigorous multi-category, multi-metric evaluation with a comprehensive dataset and measurement protocol that addresses the limitations of prior video generation benchmarks by targeting complex compositionality, dynamic attribute control, spatial reasoning, object interaction, and quantitative fidelity.
1. Rationale and Benchmark Overview
T2V-CompBench was proposed to systematically assess whether current T2V models can produce coherent and semantically faithful video outputs from compositional prompts—i.e., prompts involving multiple interacting objects, evolving attributes, specified motions, and nuanced spatial relationships (Sun et al., 19 Jul 2024). Earlier benchmarks treated video-text alignment and aesthetic quality as their primary criteria, typically using simplistic prompts featuring single objects or actions. T2V-CompBench fills the gap by covering compositional complexity, which is critical for high-fidelity video synthesis in natural and artificial scenes.
The benchmark is constructed with 1,400 text prompts, each designed using carefully crafted templates across seven functional categories: consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. Prompts are verified by human annotators for correctness and diversity.
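As a loose illustration of this template-driven construction (not the benchmark's actual generation code; the template strings and slot vocabularies below are hypothetical), prompts for a category can be produced by filling slotted templates with curated word lists and then passing the results to human verification:

```python
import itertools

# Hypothetical templates and slot vocabularies -- illustrative only,
# not the wording or code used to build the actual 1,400 prompts.
TEMPLATES = {
    "consistent attribute binding": "A {color1} {obj1} {verb_sg} a {color2} {obj2}.",
    "generative numeracy": "{count} {obj1}s {verb_pl} near a {color2} {obj2}.",
}
VOCAB = {
    "color1": ["blue", "red"], "color2": ["green", "white"],
    "obj1": ["car", "dog"], "obj2": ["fence", "basket"],
    "verb_sg": ["drives past", "stops beside"], "verb_pl": ["run", "sit"],
    "count": ["Two", "Three"],
}

def expand(template: str) -> list[str]:
    """Fill every slot in the template with all combinations of its vocabulary."""
    slots = [s for s in VOCAB if "{" + s + "}" in template]
    return [template.format(**dict(zip(slots, combo)))
            for combo in itertools.product(*(VOCAB[s] for s in slots))]

for category, template in TEMPLATES.items():
    print(category, "->", expand(template)[:2])
```

In the actual benchmark, the generated candidates are additionally filtered by human annotators for correctness and diversity, as noted above.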
2. Compositional Categories
Seven compositional categories are engineered to capture distinct challenges in video generation:
| Category | Evaluation Focus | Example Prompt |
|---|---|---|
| Consistent attribute binding | Attribute preservation per object across time | "A blue car drives past a white fence on a sunny day." |
| Dynamic attribute binding | Attribute change over time | "A timelapse of a leaf turning from green to bright red." |
| Spatial relationships | Object positions in space | "A bird flies above a branch to the left of a house." |
| Motion binding | Object motion vectors and directions | "A cat jumps forward while a ball rolls backwards." |
| Action binding | Correct action-object pairings | "A dog runs and a cat jumps in the garden." |
| Object interactions | Physical/social interactions | "Two boys shake hands while a dog chases a ball." |
| Generative numeracy | Quantitative object counts | "Three apples bounce near a green basket." |
These categories were selected to probe both static consistency (attributes, spatial relations) and dynamic reasoning (motion, actions, interactions, numeracy).
3. Evaluation Metrics and Protocol
T2V-CompBench introduces a suite of compositional evaluation metrics, designed per category type:
A. MLLM-based Metrics
Multimodal large language models (MLLMs) such as LLaVA and Grid-LLaVA are used to interpret and score compositional alignment. For consistent and dynamic attribute binding, for example, image grids sampled from the video are fed to the MLLM, which produces chain-of-thought descriptions; scoring functions then compare these machine-generated descriptions with phrases parsed from the prompt by GPT-4.
Dynamic attribute binding is assessed via D-LLaVA by comparing initial and final states against expected transitions.
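The benchmark's precise prompting and scoring are more elaborate than can be shown here, but the general grid-based MLLM pattern can be sketched as follows. This is a minimal, assumption-laden sketch: `query_mllm` is a hypothetical stand-in for whichever MLLM (e.g., a LLaVA-style model) is used, and the phrase-matching score is a deliberately crude proxy for the benchmark's actual scoring function.

```python
from PIL import Image

def make_frame_grid(frame_paths, cols=3, cell=(256, 256)):
    """Tile uniformly sampled video frames into a single image grid."""
    frames = [Image.open(p).convert("RGB").resize(cell) for p in frame_paths]
    rows = (len(frames) + cols - 1) // cols
    grid = Image.new("RGB", (cols * cell[0], rows * cell[1]))
    for i, frame in enumerate(frames):
        grid.paste(frame, ((i % cols) * cell[0], (i // cols) * cell[1]))
    return grid

def query_mllm(image, question):
    """Hypothetical stand-in for an MLLM call; it should return a free-form
    textual description of the grid in answer to the question."""
    raise NotImplementedError("plug in your own MLLM inference here")

def attribute_binding_score(frame_paths, parsed_phrases):
    """Crude alignment proxy: fraction of prompt phrases (e.g. 'blue car')
    that the MLLM mentions when describing the frame grid."""
    grid = make_frame_grid(frame_paths)
    answer = query_mllm(grid, "Describe each object and its attributes.").lower()
    hits = sum(phrase.lower() in answer for phrase in parsed_phrases)
    return hits / max(len(parsed_phrases), 1)
```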
B. Detection-Based Metrics
Object detectors (GroundingDINO) identify object locations and counts across frames. For spatial relationships, 2D and 3D positional rules are evaluated using predicted masks and depth maps; generative numeracy is directly scored by counting objects of the specified class.
Spatial verification is formalized as follows: given object centers (x₁, y₁) and (x₂, y₂), object 1 is considered "left of" object 2 if x₁ < x₂ and the horizontal offset |x₁ − x₂| exceeds a threshold.
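A minimal sketch of these detection-based checks, assuming per-frame detections have already been obtained from an open-vocabulary detector such as GroundingDINO. The data layout, the margin value, and the frame-averaging shown here are illustrative assumptions rather than the benchmark's exact procedure:

```python
def is_left_of(c1, c2, margin=0.05):
    """2D rule: object 1 counts as 'left of' object 2 when its center lies to
    the left by more than a margin (the margin value here is an assumption)."""
    (x1, _y1), (x2, _y2) = c1, c2
    return x1 < x2 and (x2 - x1) > margin

def spatial_score(frames, subj, obj):
    """Fraction of frames in which the stated relation holds. Each frame is a
    dict mapping class name -> list of (x, y) box centers in normalized coords."""
    valid = [f for f in frames if f.get(subj) and f.get(obj)]
    if not valid:
        return 0.0
    return sum(is_left_of(f[subj][0], f[obj][0]) for f in valid) / len(valid)

def numeracy_score(frames, cls, expected_count):
    """Fraction of frames whose detected count of `cls` matches the prompt."""
    if not frames:
        return 0.0
    return sum(len(f.get(cls, [])) == expected_count for f in frames) / len(frames)
```

The 3D relations ("in front of", "behind") mentioned above would additionally compare per-object depth values from the estimated depth maps, which this sketch omits.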
C. Tracking-Based Metrics
For motion binding, tracking based on DOT (Dense Optical Tracking) computes foreground and background motion vectors from segmented regions, with segmentation masks supplied by Grounded-SAM. The metric assesses whether the main object's motion matches the textual description in both magnitude and direction.
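A rough sketch of this idea, assuming dense point tracks for the foreground object and the background are already available (e.g., from a dense tracker applied over segmentation masks). Subtracting the background motion approximately cancels camera movement; the array shapes, direction names, and dominance test are assumptions for illustration:

```python
import numpy as np

def dominant_direction(tracks):
    """Mean displacement (dx, dy) from first to last frame for an (N, T, 2) track array."""
    displacement = tracks[:, -1, :] - tracks[:, 0, :]
    return displacement.mean(axis=0)

def motion_binding_score(fg_tracks, bg_tracks, expected="left"):
    """Check whether the object's motion, relative to the background (to cancel
    camera motion), points in the direction stated by the prompt."""
    dx, dy = dominant_direction(fg_tracks) - dominant_direction(bg_tracks)
    directions = {
        "left": dx < 0 and abs(dx) > abs(dy),
        "right": dx > 0 and abs(dx) > abs(dy),
        "up": dy < 0 and abs(dy) > abs(dx),    # image y-axis points downward
        "down": dy > 0 and abs(dy) > abs(dx),
    }
    return float(directions.get(expected, False))
```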
Metric outputs for several categories exhibit strong rank correlation (Kendall's τ and Spearman's ρ) with human evaluations, demonstrating their effectiveness in capturing compositional accuracy.
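Such agreement is typically computed as rank correlation between the automatic metric scores and averaged human ratings over the same set of videos; the scores below are made-up placeholders for illustration:

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-video scores: automatic metric vs. averaged human ratings.
metric_scores = [0.82, 0.41, 0.67, 0.90, 0.33, 0.58]
human_scores = [4.5, 2.5, 3.0, 5.0, 1.5, 3.5]

tau, tau_p = kendalltau(metric_scores, human_scores)
rho, rho_p = spearmanr(metric_scores, human_scores)
print(f"Kendall's tau = {tau:.3f} (p = {tau_p:.3f}), "
      f"Spearman's rho = {rho:.3f} (p = {rho_p:.3f})")
```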
4. Model Benchmarking Results and Analysis
Twenty T2V models (13 open-source, 7 commercial) are benchmarked. Commercial models consistently outperform open-source models, especially in complex categories like dynamic attribute binding and generative numeracy, which require modeling nuanced temporal changes and precise counting across frames.
Consistent attribute binding and spatial relationships are also challenging, with frequent failures in accurate object positioning and attribute maintenance, indicating the need for more robust layout-awareness in model design.
Action binding and motion binding reveal difficulties in assigning correct activities to multiple entities and in preserving true-to-text motion trajectories. The model evaluations pinpoint deficiencies not evident with coarse or single-object benchmarks.
5. Limitations and Identified Challenges
T2V-CompBench exposes several open challenges:
- Compositional Complexity: Faithful multi-object, multi-attribute video synthesis remains highly challenging.
- Temporal Dynamics: Models often fail in dynamic attribute transition and motion direction control.
- Metric Integration: No single metric reliably evaluates all compositional dimensions, necessitating category-specialized scoring.
- Generalization: Out-of-distribution or imaginative prompts frequently induce failures; models lack generalized compositional world knowledge (Wang et al., 9 Oct 2025).
A plausible implication is that future benchmarks may require further integration of world logic and temporal causality evaluation, as proposed in subsequent works.
6. Future Directions
The findings point toward several research priorities:
- Improving Compositional Reasoning: Mechanisms such as enhanced layout planners and richer language-to-visual conditioning must be developed.
- Unified Metrics: There is a need for robust multimodal evaluators capable of simultaneously scoring spatial and temporal alignment.
- Extending Duration: Models should maintain compositional consistency over longer sequences and varied scenarios.
- Addressing Social Impact: Mitigation strategies for fake content generation and hallucinations are increasingly important, as highlighted by error taxonomy in hallucination-focused benchmarks (Rawte et al., 16 Nov 2024).
7. Broader Impact and Benchmark Integration
T2V-CompBench provides both a dataset and a diagnostic evaluation framework that enables comprehensive analysis of current and upcoming T2V generation systems. Researchers gain actionable insights into specific model deficiencies across compositional categories, guiding more sophisticated architectural and training approaches.
Its multimodal metric suite and careful prompt categorization serve as a direct blueprint for real-world video synthesis system evaluation. The platform also inspires future benchmarks that could probe temporal causality and world knowledge with techniques such as event-level QA, longest common subsequence temporal scoring (Wang et al., 9 Oct 2025), and fine-grained object-centric annotation.
In conclusion, T2V-CompBench represents a critical advance for benchmarking compositional video generation, rigorously measuring multi-object, multi-attribute, and dynamic scenario adherence. It sets the current standard for assessing both fidelity and complexity in text-to-video synthesis, informing both foundational research and practical deployment in video foundation models.