Evaluation of Large Video Generation Models Using EvalCrafter
The paper presents EvalCrafter, a detailed benchmarking framework for evaluating large video generation models, specifically in the context of text-to-video (T2V) generation. The framework addresses the growing complexity and variability of evaluating such models, which have become markedly more capable and more accessible to users. EvalCrafter is noteworthy both for assessing multiple aspects of video generation and for aligning its evaluation metrics with human preferences.
Framework Overview
The EvalCrafter framework stands out by providing a structured pipeline to assess T2V models. It begins with a diverse set of 700 prompts derived from real-world user data and refined with a large language model (LLM). The evaluation metrics are grouped into four primary dimensions: visual quality, text-video alignment, motion quality, and temporal consistency. To produce a final evaluation score, user opinions are integrated with the objective measures, ensuring that the metrics reflect human judgment; the overall shape of this pipeline is sketched below.
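The following minimal Python sketch illustrates that pipeline shape only. Every name in it (the model's generate method, the metric callables, the weights) is a hypothetical placeholder, not the authors' code; the actual benchmark runs 17 concrete metrics and fits the weights against human ratings.

```python
# Illustrative sketch of an EvalCrafter-style evaluation loop.
# All function and attribute names here are hypothetical placeholders.

def evaluate_t2v_model(model, prompts, metrics, weights):
    """Score a text-to-video model on a prompt benchmark.

    prompts : list[str]           -- e.g. the 700 benchmark prompts
    metrics : dict[str, callable] -- name -> fn(video, prompt) -> float
    weights : dict[str, float]    -- per-metric weights fit to human ratings
    """
    per_metric = {name: [] for name in metrics}
    for prompt in prompts:
        video = model.generate(prompt)            # frames for one prompt
        for name, metric_fn in metrics.items():
            per_metric[name].append(metric_fn(video, prompt))

    # Average each metric over all prompts, then combine the averages
    # with the human-aligned weights into a single benchmark score.
    averaged = {name: sum(v) / len(v) for name, v in per_metric.items()}
    final_score = sum(weights[name] * averaged[name] for name in averaged)
    return averaged, final_score
```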
Evaluation Metrics
The paper introduces a suite of 17 objective metrics to appraise the generated videos thoroughly:
- Visual Quality: Assessed with video quality assessment (VQA) scores covering aesthetic and technical aspects, plus the Inception Score (IS) to gauge the diversity and quality of generated content.
- Text-Video Alignment: Measured with CLIP-Score for prompt-frame consistency (sketched after this list), SD-Score for conceptual fidelity relative to state-of-the-art image generation, and content-specific scores such as Detection-Score and OCR-Score.
- Motion Quality: Includes metrics for action recognition and motion amplitude to verify the quality and appropriateness of movements within the videos.
- Temporal Consistency: Warping error and semantic consistency (CLIP-Temp, also sketched below) verify smooth transitions and stable outputs across frames.
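As a concrete illustration of two of these metrics, the sketch below approximates CLIP-Score and CLIP-Temp with an off-the-shelf CLIP model from Hugging Face Transformers. It assumes the video is supplied as a list of PIL frames; this is a plausible reimplementation for exposition, not the authors' exact code.

```python
# Approximate CLIP-Score (text-frame alignment) and CLIP-Temp (semantic
# consistency of consecutive frames). Assumes `frames` is a list of PIL images.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(frames, prompt):
    """Mean cosine similarity between the prompt and every frame."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = txt / txt.norm(dim=-1, keepdim=True)    # unit-normalize embeddings
    img = img / img.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

@torch.no_grad()
def clip_temp(frames):
    """Mean cosine similarity between CLIP embeddings of consecutive frames."""
    inputs = processor(images=frames, return_tensors="pt")
    emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[:-1] * emb[1:]).sum(dim=-1).mean().item()
```

Higher values are better for both: clip_score rewards frames that match the prompt, while clip_temp rewards videos whose adjacent frames stay semantically stable.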
Human Alignment
To bring the evaluation closer to human expectations, EvalCrafter fits a regression model that aligns the objective scores with user ratings collected in a user study covering several T2V models. This alignment not only validates the chosen metrics but also reveals which attributes matter most, indicating that users weight facets like visual quality more heavily than strict text-video correspondence. A minimal sketch of this alignment step follows.
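The sketch below shows one plausible realization of the alignment step as ordinary least squares, with synthetic stand-in data in place of the study's actual metric scores and ratings.

```python
# Hedged sketch of the human-alignment step: fit per-metric weights so a
# linear combination of objective scores predicts human ratings.
# The arrays below are synthetic stand-ins, not the study's measurements.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 17))        # 200 videos x 17 objective metrics
true_w = rng.normal(size=17)          # hidden "human preference" weights
y = X @ true_w + rng.normal(scale=0.1, size=200)  # simulated mean ratings

reg = LinearRegression().fit(X, y)    # learn metric weights from ratings
aligned = reg.predict(X)              # final human-aligned scores

print("learned per-metric weights:", np.round(reg.coef_, 2))
print("Spearman rho vs. ratings:", round(spearmanr(aligned, y).correlation, 3))
```

The learned coefficients directly expose how much each metric contributes to predicted preference, which is how a pattern such as visual quality outweighing strict alignment would surface.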
Findings and Implications
The results from EvalCrafter reveal several critical insights:
- Single-dimensional evaluation is inadequate for modern T2V models, necessitating a multi-faceted approach.
- Higher resolution and film-like visual qualities do not necessarily translate into higher user satisfaction, and users tend to prefer subtle, coherent motion over large motion amplitudes.
- Evaluation metrics must continually adapt as models evolve, underscoring the need for adaptive benchmarks.
The paper's findings also illustrate that despite advancements, current T2V models still encounter challenges like generating coherent text within videos or executing complex motion accurately. The evaluation framework's reliance on pre-trained models reflects the broader trend in artificial intelligence to use existing robust models to assess newer innovations.
Conclusion and Future Work
This paper sets a precedent for evaluating large generative models, particularly those involving T2V generation. EvalCrafter's comprehensive benchmark offers valuable insights into model performance across various dimensions, paving the way for more nuanced and user-aligned assessments. Future work could expand the benchmark's scale and refine metrics further to accommodate continued advancements in generative technology, potentially incorporating end-to-end evaluation models that learn from larger and more varied datasets.