Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
The paper, "Evaluation of Text-to-Video Generation Models: A Dynamics Perspective," addresses a pivotal yet often neglected aspect of text-to-video (T2V) generation—dynamics. In the evolving landscape of video generation, evaluating the sophistication and effectiveness of T2V models is becoming increasingly crucial. Existing evaluation frameworks largely focus on temporal consistency and continuity of content. However, the intrinsic dynamics of video content, which are imperative for assessing visual vividness and fidelity to text prompts, remain underexplored. The authors propose a structured evaluation protocol named DEVIL, concentrating on this neglected dimension.
Key Contributions
1. Dynamics Evaluation Protocol
The authors introduce DEVIL, an evaluation protocol that treats dynamics as a first-class criterion for assessing T2V models. It comprises three core metrics (a computational sketch follows the list):
- Dynamics Range: Reflects the model's capability to generate videos exhibiting both subtle and dramatic temporal variations.
- Dynamics Controllability: Measures the model's efficacy in modulating video dynamics according to text prompts.
- Dynamics-based Quality: Evaluates video quality across varying levels of dynamics, addressing the common bias whereby videos with higher dynamics receive lower quality scores.
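The paper defines these metrics formally from per-video dynamics scores; below is a minimal Python sketch of how such metrics could be aggregated. The function names, quantile bounds, rank-correlation choice, and binning scheme are illustrative assumptions, not the authors' exact formulations.

```python
import numpy as np
from scipy.stats import spearmanr

def dynamics_range(scores, low_q=0.05, high_q=0.95):
    """Spread between the most subtle and the most dramatic dynamics a model
    produces; quantiles (an assumed choice) guard against outlier videos."""
    s = np.asarray(scores, dtype=np.float64)
    return float(np.quantile(s, high_q) - np.quantile(s, low_q))

def dynamics_controllability(prompt_grades, measured_scores):
    """Agreement between the dynamics grade each prompt requests and the
    dynamics measured in the generated video; rank correlation is an
    illustrative stand-in for the paper's matching criterion."""
    rho, _ = spearmanr(prompt_grades, measured_scores)
    return float(rho)

def dynamics_based_quality(scores, quality, n_bins=5):
    """Quality averaged within dynamics bins and then across bins, so that
    a glut of low-dynamic videos cannot inflate the overall score."""
    scores = np.asarray(scores, dtype=np.float64)
    quality = np.asarray(quality, dtype=np.float64)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.digitize(scores, edges[1:-1])  # bin index in [0, n_bins - 1]
    per_bin = [quality[idx == b].mean() for b in range(n_bins) if np.any(idx == b)]
    return float(np.mean(per_bin))
```

Binning quality by dynamics level is what neutralizes the bias named in the third metric: a model can no longer raise its average quality simply by emitting mostly static videos.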
2. Benchmark and Dynamics Scores
DEVIL includes a newly constructed benchmark of text prompts designed to span distinct dynamics grades; the prompts are categorized with GPT-4 and then manually refined. Dynamics scores at multiple temporal granularities (inter-frame, inter-segment, and whole-video) are introduced to characterize videos thoroughly. This complements existing approaches with a comprehensive evaluation of video dynamics that aligns with human perception; a simplified sketch of such multi-granularity scores appears below.
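For concreteness, here is a toy Python sketch of dynamics scores at the three granularities. The paper derives its scores from stronger visual cues (e.g., optical flow and perceptual features); the raw pixel differences, segment length, and weights used here are placeholder assumptions.

```python
import numpy as np

def inter_frame_dynamics(frames):
    """Mean absolute pixel change between consecutive frames; a crude proxy
    for the paper's inter-frame score."""
    f = np.asarray(frames, dtype=np.float32)  # shape (T, H, W, C)
    return float(np.mean(np.abs(np.diff(f, axis=0))))

def inter_segment_dynamics(frames, seg_len=8):
    """Variation between the mean appearance of adjacent fixed-length
    segments, capturing slower, scene-level change (segment length is an
    assumption)."""
    f = np.asarray(frames, dtype=np.float32)
    n_seg = len(f) // seg_len
    if n_seg < 2:
        return 0.0
    seg_means = np.array([f[i * seg_len:(i + 1) * seg_len].mean(axis=0)
                          for i in range(n_seg)])
    return float(np.mean(np.abs(np.diff(seg_means, axis=0))))

def video_dynamics(frames, w=(0.5, 0.5)):
    """Whole-video score as a weighted sum of the finer granularities; the
    weights are placeholders, not the paper's calibrated values."""
    return w[0] * inter_frame_dynamics(frames) + w[1] * inter_segment_dynamics(frames)
```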
Numerical Results and Insights
The experiments show that DEVIL achieves a Pearson correlation with human ratings exceeding 90%, indicating strong alignment with human judgment. Notably, the authors find that existing datasets and models often exhibit biased dynamics distributions: models tend to generate low-dynamic content to obtain higher quality scores, exposing a loophole in current evaluation metrics. For instance, top-ranking models such as Gen-2 and Pika score well on conventional quality metrics yet restrict themselves to low-dynamic outputs, as the paper's dynamics analysis verifies.
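The correlation check itself is straightforward to reproduce in principle: collect per-video metric scores alongside human ratings and compute the Pearson coefficient. The arrays below are toy placeholders; the >90% figure comes from the paper's own benchmark, not from this data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-video metric scores and human ratings, for illustration only.
metric_scores = np.array([0.12, 0.35, 0.48, 0.66, 0.81, 0.93])
human_ratings = np.array([1.0, 2.0, 2.5, 3.5, 4.0, 4.8])

r, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```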
Theoretical and Practical Implications
The findings of this paper have significant implications. Practically, they indicate a need for more diverse datasets that better represent dynamic variability. Theoretically, the methodology offers a robust framework for evaluating and interpreting dynamics in generated video, supporting a more nuanced understanding of model performance. The proposed metrics and benchmark pave the way for further development and refinement of T2V models, encouraging the generation of more dynamic and realistic video content.
Conclusion and Future Directions
In conclusion, the paper introduces a comprehensive framework for evaluating the dynamics dimension of T2V generation, emphasizing its importance for evolving standards of quality and realism. Future research could refine the granularity of dynamics grades for finer-grained assessment and integrate dynamics-based evaluation with other emerging techniques to further improve video generation models. The protocol lays a foundation for advancing T2V generation and urges the community toward more holistic evaluation standards.