Face Consistency Benchmark for GenAI Video: A Critical Framework
As AI-driven video generation evolves, one pressing issue has emerged: maintaining the consistency of characters, especially their facial features, across video sequences. The paper "Face Consistency Benchmark for GenAI Video" presents a framework designed specifically to evaluate facial consistency in AI-generated videos. It introduces the Face Consistency Benchmark (FCB), filling a notable gap in current evaluation tools, which focus primarily on motion quality, temporal consistency, and video realism.
Benchmark Framework
The proposed Face Consistency Benchmark employs widely used face recognition models to assess how well identity and facial features are preserved across video sequences. Key models integrated into the framework include VGG-Face, Facenet, Facenet512, ArcFace, SFace, and GhostFaceNet, chosen for their proven robustness in facial feature extraction. These models are accessed through the DeepFace library, which provides a unified interface for face detection and embedding extraction.
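As a rough sketch of how such per-frame embeddings can be obtained, the snippet below wraps DeepFace's `represent` call. The function name, the frame path argument, and the choice of ArcFace as the default model are illustrative assumptions, not details specified by the paper:

```python
def frame_embedding(frame_path, model_name="ArcFace"):
    """Return the face embedding vector for one video frame.

    Assumes a frame has been exported as an image file; any of the
    models named above (e.g. "VGG-Face", "Facenet512") can be passed
    as model_name.
    """
    # Imported lazily so the sketch stays importable without DeepFace installed.
    from deepface import DeepFace

    # represent() detects the face and returns one dict per detected face,
    # each with an "embedding" key; we take the first detected face.
    result = DeepFace.represent(img_path=frame_path, model_name=model_name)
    return result[0]["embedding"]
```

In practice one would extract frames from the generated video (e.g. with ffmpeg or OpenCV) and call this per frame to build the sequence of embeddings that the benchmark compares.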
Evaluation Methodology
The authors evaluate four text-to-video generation models: HunyuanVideo, Vchitect-2.0, CogVideoX1.5-5B, and Runway Gen-3, all recognized for their high performance on existing benchmarks such as VBench. The FCB evaluation process consists of two distinct modes:
- Frame-to-Model Comparison: Measures the similarity between every frame in a video and a selected representative frame that serves as the reference.
- Random Frame Pair Comparison: Measures coherence by comparing 200 randomly sampled pairs of frames within each video.
Both modes employ cosine distance as the metric, with lower distances indicating greater consistency.
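The two scoring modes can be sketched over precomputed per-frame face embeddings as follows. The function names and the toy inputs are illustrative; only the cosine-distance metric and the 200-pair sample count come from the paper:

```python
import math
import random

def cosine_distance(a, b):
    """1 minus cosine similarity; lower means more consistent faces."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def frame_to_model_score(embeddings, ref_index=0):
    """Frame-to-Model mode: mean distance of each frame to one reference frame."""
    ref = embeddings[ref_index]
    others = [e for i, e in enumerate(embeddings) if i != ref_index]
    return sum(cosine_distance(ref, e) for e in others) / len(others)

def random_pair_score(embeddings, n_pairs=200, seed=0):
    """Random Frame Pair mode: mean distance over sampled frame pairs."""
    rng = random.Random(seed)  # seeded for reproducible evaluation runs
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(embeddings)), 2)  # two distinct frames
        total += cosine_distance(embeddings[i], embeddings[j])
    return total / n_pairs
```

A perfectly consistent video (identical embeddings in every frame) scores 0.0 under both modes, and the score grows as facial features drift between frames.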
Experimental Findings
Results indicate significant discrepancies in facial consistency between AI-generated and real videos. While models like HunyuanVideo and Runway Gen-3 perform comparatively well, they still lag markedly behind the consistency of real videos. This gap underscores the limitations of current generative models in maintaining facial coherence and highlights a crucial area for continued development.
Implications and Future Directions
The Face Consistency Benchmark offers a specialized tool for AI video model evaluation, promoting improvements in character realism across applications requiring high-quality animation. Future research might extend the benchmark to multi-character scenes and full-body coherence, enhancing its ability to measure realistic interactions and overall dynamics in video generation.
Conclusion
The FCB provides an essential methodology for evaluating AI video generation in terms of character facial consistency. By utilizing state-of-the-art recognition models, the benchmark facilitates accurate assessments of generative capabilities, pushing forward advancements in AI video quality. The work laid out in this paper represents a substantial contribution towards solving the facial consistency challenge, and further iterative developments are anticipated to refine and broaden the scope of this evaluation framework.