SeqBench: Benchmarking Sequential Narrative Coherence
- SeqBench is a benchmarking framework that evaluates sequential narrative coherence in text-to-video generation by analyzing event dependencies and temporal ordering.
- It offers a purpose-built dataset with diverse narrative complexities, granular annotations, and comprehensive evaluations across eight state-of-the-art models.
- The DTG metric innovatively measures narrative coherence through dynamic temporal graphs, highlighting model limitations and guiding future improvements.
SeqBench is a benchmarking framework introduced to rigorously evaluate sequential narrative coherence in text-to-video (T2V) generation models (Tang et al., 14 Oct 2025). Unlike prior benchmarks that mostly emphasize frame-level visual fidelity, SeqBench directly targets a model's core capability to construct temporally coherent stories across generated video sequences. It provides a domain-relevant dataset, granular annotation protocols, and a novel automatic evaluation metric that captures long-range dependencies and temporal ordering, thereby enabling reproducible, model-agnostic assessment of sequential reasoning in generated video content.
1. Motivation and Scope
The emergence and rapid progress of text-to-video models have revealed a significant gap between visual synthesis capabilities and robust narrative coherence. Conventional benchmarks evaluate results using static image metrics (e.g., frame quality, aesthetic appeal), failing to address whether generated videos maintain logical consistency across multiple events, object states, and interactions. SeqBench was created to systematically address this deficit by focusing on narrative coherence—testing a model’s ability to produce videos where each subsequent event logically follows prior ones, object states evolve consistently, and timing/order relationships are physically plausible (Tang et al., 14 Oct 2025). This approach reflects real-world storytelling requirements and exposes the sequential inferential limitations of existing T2V architectures.
2. Dataset Construction and Annotation Protocols
SeqBench offers a purpose-built dataset designed to span a wide spectrum of narrative complexity. It comprises 320 text prompts divided into four high-level thematic categories—Animal, Human, Object, and Imaginary—and further organized into 32 subcategories defined by combinations of narrative difficulty and temporal ordering:
- Difficulty Levels:
- Single Subject–Single Action (SSSA)
- Single Subject–Multi Action (SSMA)
- Multi Subject–Single Action (MSSA)
- Multi Subject–Multi Action (MSMA)
- Temporal Ordering Variants:
- Strictly Sequential (SS)
- Flexible Order (FO)
- Simultaneous (SI)
For each prompt, videos are generated using eight state-of-the-art T2V models. Each video undergoes dual annotation: human evaluators score visual quality and, critically, narrative coherence, following explicit rubrics to ensure the benchmark’s reliability and suitability for diagnostic research.
Table: SeqBench Dataset Structure

| Category | Narrative Difficulty | Temporal Orderings |
|---|---|---|
| Animal, Human, Object, Imaginary | SSSA, SSMA, MSSA, MSMA | SS, FO, SI |
The dataset totals 2,560 videos, establishing broad coverage and enabling detailed cross-model comparison under controlled narrative criteria.
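For concreteness, one way to represent a single entry in this taxonomy is sketched below; the field names and example values are hypothetical and do not reflect the benchmark's released schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical representation of one SeqBench prompt entry.
# Field names and values are illustrative only, not the released schema.
@dataclass
class SeqBenchPrompt:
    prompt_id: str
    category: str            # "Animal" | "Human" | "Object" | "Imaginary"
    difficulty: str          # "SSSA" | "SSMA" | "MSSA" | "MSMA"
    temporal_order: str      # "SS" (strictly sequential) | "FO" (flexible) | "SI" (simultaneous)
    text: str                # the natural-language prompt given to the T2V model
    events: List[str] = field(default_factory=list)  # ordered event descriptions used for annotation

example = SeqBenchPrompt(
    prompt_id="animal-msma-ss-001",
    category="Animal",
    difficulty="MSMA",
    temporal_order="SS",
    text="A dog digs up a bone, then a crow swoops down and steals it.",
    events=["dog digs up bone", "crow swoops down", "crow steals bone"],
)
```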
3. Dynamic Temporal Graphs (DTG) Metric for Automated Evaluation
A defining contribution of SeqBench is the introduction of the Dynamic Temporal Graphs (DTG) metric, engineered to measure sequential narrative coherence efficiently and at scale. The DTG formalism proceeds as follows:
- Each text prompt is decomposed into a sequence of events $e_1, \ldots, e_n$; each $e_i$ reflects a state transition or an action.
- Dependencies $\mathrm{Pre}(e_i)$ encode which prior events must logically precede $e_i$. This yields a directed acyclic graph superstructure expressing temporal and logical constraints.
- Let $s_i \in \{0, 1\}$ indicate whether the checkpoint for event $e_i$ is satisfied in the generated video.
- The coherence score is calculated via:

$$\mathrm{Score} = \frac{1}{n} \sum_{i=1}^{n} s_i \cdot \mathbb{1}\!\left[\, s_j = 1 \ \text{for all } e_j \in \mathrm{Pre}(e_i) \,\right],$$

where $\mathbb{1}[\cdot]$ is the indicator function enforcing dependency filtering—events are counted as coherent only if all their predecessors are correctly realized.
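In code, this dependency-filtered scoring reduces to a short graph traversal. The following is a minimal sketch assuming the formulation above (binary per-event checkpoint results and a predecessor map); names are illustrative and this is not the benchmark's released implementation:

```python
from typing import Dict, List

def dtg_coherence_score(satisfied: List[bool],
                        predecessors: Dict[int, List[int]]) -> float:
    """Dependency-filtered narrative coherence score.

    satisfied[i]    -- whether the checkpoint for event e_i holds in the video
    predecessors[i] -- indices of the events in Pre(e_i) that must precede e_i

    An event contributes to the score only if it is satisfied AND every
    event in its predecessor set is also satisfied (the indicator term).
    """
    n = len(satisfied)
    if n == 0:
        return 0.0
    credited = 0
    for i in range(n):
        deps_met = all(satisfied[j] for j in predecessors.get(i, []))
        if satisfied[i] and deps_met:
            credited += 1
    return credited / n

# Example: three events where e_2 depends on e_0 and e_1.
# e_1 fails, so e_2 receives no credit even though it appears in the video.
score = dtg_coherence_score(
    satisfied=[True, False, True],
    predecessors={1: [0], 2: [0, 1]},
)
print(score)  # 1/3 ≈ 0.333
```

Under this filtering, an event that is rendered correctly but whose prerequisites are missing contributes nothing, which is what distinguishes DTG from a per-event checklist accuracy.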
Empirical validation shows strong agreement between the DTG metric and human annotations, with Spearman rank correlations of up to 0.85, supporting its use as a reliable proxy for large-scale automated benchmarking.
4. Systematic Evaluation and Observed Model Limitations
SeqBench was applied to eight representative T2V models, resulting in direct comparative analysis across 2,560 annotated videos. Systematic evaluation uncovered several generalizable deficiencies:
- Inconsistent Object States: Models frequently fail to preserve the identity and attributes of objects as they undergo sequential actions; e.g., an object may appear changed between events without logical cause.
- Physically Implausible Multi-object Interactions: When two or more subjects must interact, models inadequately encode the dependency chain, leading to implausible or contradictory outcomes.
- Temporal Ordering Violations: Models often produce events out of sequence, leading to narratives where required logical transitions (e.g., cause/effect, action/reaction) are omitted or reversed.
- Timing and Duration Artifacts: Difficulty in sustaining realistic pacing across long sequences, with actions often truncated, repeated, or misaligned.
Of note, the best-performing evaluated model (Kling 2.0) achieved an average narrative coherence score of only about 0.252. This result underscores a field-wide gap between frame-level visual quality and multi-step storytelling competency.
5. Implications for Model Architecture and Training
Findings from SeqBench carry important implications for future T2V model design. Success in visual synthesis can no longer serve as a sole proxy for storytelling achievement; instead, substantial advances in model architecture are needed to address long-range dependencies and state consistency. Potential strategies involve:
- Integrating explicit temporal modeling components or learned memory modules to maintain event chains and object histories.
- Employing feedback from DTG-like coherence evaluation throughout the training process to penalize narrative inconsistency (a sketch of one such penalty appears at the end of this section).
- Recasting the training paradigm to emphasize chaining and temporal logic, ensuring models are robust not just to visually complex prompts but also to demanding narrative scripts.
This suggests that architectural innovations and workflow refinements directed at sequential reasoning will be required to close the narrative coherence gap documented by SeqBench.
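As a purely illustrative sketch of the coherence-feedback idea, binary DTG checkpoints could be relaxed into satisfaction probabilities produced by a hypothetical learned verifier, yielding a soft penalty that can be added to a generator's training loss; nothing below is part of SeqBench itself:

```python
from typing import Dict, List
import torch

def soft_dtg_penalty(event_probs: torch.Tensor,
                     predecessors: Dict[int, List[int]]) -> torch.Tensor:
    """Soft, differentiable analogue of the DTG coherence score (hypothetical).

    event_probs[i] -- probability (from a hypothetical learned verifier)
                      that event e_i is realized in the generated video.
    predecessors   -- Pre(e_i) as index lists, mirroring the DTG formulation.

    Each event's credit is its own probability times the product of its
    predecessors' probabilities; the penalty is one minus the mean credit.
    """
    credits = []
    for i in range(event_probs.shape[0]):
        dep_terms = [event_probs[j] for j in predecessors.get(i, [])]
        dep_prob = torch.stack(dep_terms).prod() if dep_terms else torch.tensor(1.0)
        credits.append(event_probs[i] * dep_prob)
    coherence = torch.stack(credits).mean()
    return 1.0 - coherence  # add this term to the generator's training loss

# Example: e_2 depends on e_0 and e_1; a weak e_1 drags down the credit for e_2.
probs = torch.tensor([0.9, 0.4, 0.8])
penalty = soft_dtg_penalty(probs, {1: [0], 2: [0, 1]})
```

This mirrors the indicator-based filtering of the DTG metric while remaining differentiable with respect to the verifier's outputs.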
6. Accessibility, Further Resources, and Future Directions
SeqBench is hosted at https://videobench.github.io/SeqBench.github.io/, where researchers can find supplementary materials, code, dataset visualizations, and detailed experimental results. The benchmark framework is poised for extensibility—future research may leverage its annotation protocol and DTG metric for new experiments, alternate domains, or next-generation model training. Anticipated improvements include enhanced memory modeling, richer temporal grammar constructs, and possibly direct use of DTG outputs as differentiable learning signals.
In sum, SeqBench establishes the first standardized methodology for evaluating sequential narrative coherence in T2V generation, revealing the persistent limitations of current models and charting a path toward meaningful advances in sequential reasoning for integrated multimodal generation systems (Tang et al., 14 Oct 2025).