Towards A Better Metric for Text-to-Video Generation (2401.07781v1)

Published 15 Jan 2024 in cs.CV

Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation.

Overview of the Text-to-Video Score (T2VScore)

Evaluating machine-generated videos from textual descriptions remains a complex task. The Text-to-Video Score (T2VScore) seeks to refine the assessment process by focusing on text-video alignment and video quality.

Evaluating Text-Video Alignment

The first of T2VScore's two core criteria, Text-Video Alignment, measures how faithfully the generated video reflects the content of the text prompt that produced it.

To assess text-video alignment, T2VScore first decomposes the prompt into its semantic elements and uses an LLM to formulate questions about them; a question-answering model then answers those questions from the generated video, and the accuracy of its answers yields the alignment score. This question-answering approach supports a more detailed and nuanced evaluation, capturing temporal dynamics and specific elements that coarser metrics can overlook.
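
As a concrete illustration, the overall flow can be sketched as follows. This is a minimal sketch under assumed names (generate_qa_pairs, answer_with_video) and a simple exact-match scoring rule; it is not the paper's implementation, which relies on LLM-generated question-answer pairs and a multimodal question-answering model.

```python
# A minimal, illustrative sketch of QA-based text-video alignment scoring.
# Function names and the exact-match scoring rule are assumptions made for
# this example; they are not the paper's actual implementation.

from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    expected_answer: str

def generate_qa_pairs(prompt: str) -> list[QAPair]:
    """Stand-in for the LLM step that decomposes the prompt into semantic
    elements (objects, attributes, actions) and emits question-answer pairs.
    The returned pairs are a fixed example for the prompt
    'a red car drives through the rain'."""
    return [
        QAPair("What color is the car?", "red"),
        QAPair("What is the weather like?", "rainy"),
        QAPair("What is the car doing?", "driving"),
    ]

def answer_with_video(question: str, video) -> str:
    """Stand-in for a video question-answering model that answers the
    question by inspecting the generated video."""
    raise NotImplementedError("plug in a multimodal QA model here")

def alignment_score(prompt: str, video) -> float:
    """Fraction of auto-generated questions answered correctly from the video."""
    qa_pairs = generate_qa_pairs(prompt)
    correct = sum(
        answer_with_video(qa.question, video).strip().lower() == qa.expected_answer
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```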

Measuring Video Quality

The second core criterion is video quality, which looks beyond textual alignment to the structural and technical integrity of the video itself. The evaluation pipeline combines a technical expert, adept at detecting distortions and artifacts, with a semantic expert that assesses content coherence. Their combined judgments yield a robust, nuanced quality score informed by the different facets that contribute to overall video fidelity.
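
A rough sketch of how two such expert scores might be fused is shown below; the placeholder functions and the simple weighted average are assumptions for illustration, not the paper's exact fusion scheme.

```python
# A minimal sketch of combining two quality "experts" into one score.
# Both expert functions are placeholders, and the weighted average is an
# illustrative fusion choice.

def technical_quality(video) -> float:
    """Placeholder for a low-level model scoring distortions, blur,
    flicker, and other artifacts (assumed to return a value in [0, 1])."""
    raise NotImplementedError

def semantic_quality(video) -> float:
    """Placeholder for a high-level model scoring content coherence and
    semantic plausibility (assumed to return a value in [0, 1])."""
    raise NotImplementedError

def video_quality_score(video, w_technical: float = 0.5) -> float:
    """Fuse the two expert judgments; a weighted average is one simple
    option when both scores share the same scale."""
    t = technical_quality(video)
    s = semantic_quality(video)
    return w_technical * t + (1.0 - w_technical) * s
```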

The TVGE Dataset

To support the development and validation of T2VScore, the authors introduce the Text-to-Video Generation Evaluation (TVGE) dataset, which collects human ratings of 2,543 generated videos on both the alignment and quality criteria. This resource serves as a benchmark for calibrating T2VScore against human perception.

Verification through Experiments

Experiments on the TVGE dataset show that T2VScore correlates with human judgment substantially better than baseline metrics. Its two components, alignment and quality, each address a distinct and essential dimension of the generated content, supporting the need for a dual-pronged approach to evaluating text-to-video generation.
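
Agreement with human opinion in such evaluations is commonly reported with rank statistics such as Spearman's rho and Kendall's tau. The snippet below shows how that agreement would be computed over per-video scores; the data are invented for illustration, and only the SciPy calls are real library APIs.

```python
# Rank correlation between metric scores and per-video human ratings.
from scipy.stats import spearmanr, kendalltau

metric_scores = [0.82, 0.41, 0.67, 0.90, 0.33]   # one metric value per video (made up)
human_ratings = [4.5, 2.0, 3.5, 4.8, 1.9]        # mean human rating per video (made up)

rho, _ = spearmanr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```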

The T2VScore provides a comprehensive metric for the evaluation of text-to-video generation, offering a more refined tool for developers and researchers to gauge the quality and relevance of generated video content.

Authors (14)
  1. Jay Zhangjie Wu (14 papers)
  2. Guian Fang (9 papers)
  3. Haoning Wu (68 papers)
  4. Xintao Wang (132 papers)
  5. Yixiao Ge (99 papers)
  6. Xiaodong Cun (61 papers)
  7. David Junhao Zhang (19 papers)
  8. Jia-Wei Liu (20 papers)
  9. Yuchao Gu (26 papers)
  10. Rui Zhao (241 papers)
  11. Weisi Lin (118 papers)
  12. Wynne Hsu (32 papers)
  13. Ying Shan (252 papers)
  14. Mike Zheng Shou (165 papers)
Citations (23)