Overview of the Text-to-Video Score (T2VScore)
Evaluating videos generated from textual descriptions remains a difficult task. The Text-to-Video Score (T2VScore) addresses it by scoring two aspects of a generated video: text-video alignment and video quality.
Evaluating Text-Video Alignment
T2VScore evaluates how accurately the content of a video reflects the text prompt that produced it. This aspect, Text-Video Alignment, is one of the two core criteria addressed by T2VScore.
To assess text-video alignment, T2VScore first decomposes the prompt into semantic elements and then formulates questions about those elements, which are answered by advanced LLMs. This question-answering approach yields a finer-grained evaluation, capturing temporal dynamics and specific elements that coarser metrics can overlook.
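The question-answering idea can be sketched as a simple accuracy computation. This is a minimal illustration, not the authors' implementation: the `QAItem` structure, the `alignment_score` function, and the stub answerer are all hypothetical names, and the questions are hand-written rather than generated by an LLM. The sketch assumes the alignment score is the fraction of questions answered as expected.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QAItem:
    """One question derived from a semantic element of the prompt (hypothetical structure)."""
    question: str
    choices: tuple[str, ...]
    expected: str


def alignment_score(
    items: Sequence[QAItem],
    answer_fn: Callable[[str, tuple[str, ...]], str],
) -> float:
    """Return the fraction of questions whose answer matches the expected one."""
    if not items:
        raise ValueError("at least one question is required")
    correct = sum(
        answer_fn(item.question, item.choices).strip().lower()
        == item.expected.strip().lower()
        for item in items
    )
    return correct / len(items)


# Illustrative questions for the prompt "a dog runs across a beach".
# A real pipeline would generate these automatically; these are made up.
QUESTIONS = [
    QAItem("What animal is in the video?", ("dog", "cat"), "dog"),
    QAItem("What is the animal doing?", ("running", "sleeping"), "running"),
    QAItem("Where does the scene take place?", ("beach", "forest"), "beach"),
    QAItem("Is the animal moving across the scene?", ("yes", "no"), "yes"),
]

# Canned answers standing in for a model that watches the video.
# The location answer is deliberately wrong to show a partial score.
CANNED_ANSWERS = {
    "What animal is in the video?": "dog",
    "What is the animal doing?": "running",
    "Where does the scene take place?": "forest",
    "Is the animal moving across the scene?": "yes",
}


def stub_answerer(question: str, choices: tuple[str, ...]) -> str:
    """Stand-in for a video question-answering model (assumption, not a real API)."""
    return CANNED_ANSWERS[question]


if __name__ == "__main__":
    print(alignment_score(QUESTIONS, stub_answerer))
```

With the stub above, three of the four questions are answered as expected, so the score is 0.75. Because each question targets one element of the prompt, a low score also indicates which element the video failed to depict.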
Measuring Video Quality
The second core aspect is video quality, which concerns the structural and technical integrity of the video itself rather than its agreement with the text. The evaluation pipeline combines a technical expert, which detects distortions and artifacts, with a semantic expert, which judges content coherence. Their combined judgments yield a quality score that reflects the varied aspects contributing to overall video fidelity.
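One simple way to picture the combination of the two experts is a weighted average of their scores. The text above does not specify the combination rule, so the weighted average, the `combine_quality` name, the `technical_weight` parameter, and the [0, 1] score range are all assumptions made for illustration.

```python
def combine_quality(
    technical: float,
    semantic: float,
    technical_weight: float = 0.5,
) -> float:
    """Blend two expert scores into one quality score.

    Assumes both scores lie in [0, 1]; the weighted average is an
    illustrative choice, not the method described in the source.
    """
    for name, value in (
        ("technical", technical),
        ("semantic", semantic),
        ("technical_weight", technical_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return technical_weight * technical + (1.0 - technical_weight) * semantic


if __name__ == "__main__":
    # A video with few artifacts but somewhat incoherent content.
    print(combine_quality(technical=0.8, semantic=0.6))
```

Keeping the two scores separate until the final step makes it possible to report them individually, which helps distinguish a video that is sharp but incoherent from one that is coherent but full of artifacts.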
The TVGE Dataset
To support the development and fine-tuning of T2VScore, the authors introduce the Text-to-Video Generation Evaluation (TVGE) dataset. This resource gathers an extensive array of human judgments on generated videos, providing a benchmark for checking how closely T2VScore agrees with human perception.
Verification through Experiments
Experiments on the TVGE dataset show that T2VScore correlates more strongly with human judgment than baseline metrics. Its two components, focused on alignment and quality, each address a distinct and essential dimension of the generated content, which supports a two-part approach to evaluating text-to-video generation.
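Agreement between a metric and human judgment is commonly measured with a rank correlation. The sketch below computes Spearman's rank correlation in pure Python as one plausible example; the choice of Spearman's coefficient is an assumption, the function names are hypothetical, and the numbers are made up for illustration rather than taken from TVGE.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("both sequences must have the same length")
    if len(metric_scores) < 2:
        raise ValueError("at least two items are required")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


# Made-up scores for five videos; not real TVGE data.
METRIC_SCORES = [0.2, 0.5, 0.4, 0.9, 0.7]
HUMAN_SCORES = [1, 2, 3, 5, 4]

if __name__ == "__main__":
    print(spearman(METRIC_SCORES, HUMAN_SCORES))
```

On these illustrative numbers the metric swaps the order of only two videos relative to the human ratings, giving a correlation of about 0.9. A rank-based measure is a natural fit here because metric scores and human ratings usually live on different scales, and only their ordering needs to agree.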
Taken together, T2VScore gives developers and researchers a more complete metric for gauging both the quality and the relevance of generated video content.