Face Consistency Benchmark for GenAI Video: A Critical Framework
As AI-driven video generation evolves, one pressing issue has emerged: maintaining the consistency of characters, especially their facial features, across video sequences. The paper "Face Consistency Benchmark for GenAI Video" presents a framework designed specifically to evaluate facial consistency in AI-generated videos. It introduces the Face Consistency Benchmark (FCB), filling a notable gap in current evaluation tools, which focus primarily on motion quality, temporal consistency, and video realism.
Benchmark Framework
The proposed Face Consistency Benchmark employs widely used face recognition models to assess how well identity and facial features are preserved across video sequences. Key models integrated into the framework include VGG-Face, Facenet, Facenet512, ArcFace, SFace, and GhostFaceNet, chosen for their proven robustness in facial feature extraction. These models are accessed through the DeepFace library, which provides a unified interface for face detection and embedding extraction.
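As a rough sketch of how such per-frame embeddings can be obtained, the snippet below wraps DeepFace's `represent` call. The function name, the frame path argument, and the choice of ArcFace as the default model are illustrative assumptions, not details specified by the paper:

```python
def frame_embedding(frame_path, model_name="ArcFace"):
    """Return the face embedding vector for one video frame.

    Assumes a frame has been exported as an image file; any of the
    models named above (e.g. "VGG-Face", "Facenet512") can be passed
    as model_name.
    """
    # Imported lazily so the sketch stays importable without DeepFace installed.
    from deepface import DeepFace

    # represent() detects the face and returns one dict per detected face,
    # each with an "embedding" key; we take the first detected face.
    result = DeepFace.represent(img_path=frame_path, model_name=model_name)
    return result[0]["embedding"]
```

In practice one would extract frames from the generated video (e.g. with ffmpeg or OpenCV) and call this per frame to build the sequence of embeddings that the benchmark compares.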
Evaluation Methodology
The authors evaluate four text-to-video generation models: HunyuanVideo, Vchitect-2.0, CogVideoX1.5-5B, and Runway Gen-3, all recognized for their high performance on existing benchmarks such as VBench. The FCB evaluation process consists of two distinct modes:
- Frame-to-Model Comparison: Measures the similarity between every frame in a video and a selected representative frame that serves as the reference.
- Random Frame Pair Comparison: Measures coherence by comparing 200 randomly sampled pairs of frames within each video.
Both modes employ cosine distance as the metric, with lower distances indicating greater consistency.
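The two scoring modes can be sketched over precomputed per-frame face embeddings as follows. The function names and the toy inputs are illustrative; only the cosine-distance metric and the 200-pair sample count come from the paper:

```python
import math
import random

def cosine_distance(a, b):
    """1 minus cosine similarity; lower means more consistent faces."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def frame_to_model_score(embeddings, ref_index=0):
    """Frame-to-Model mode: mean distance of each frame to one reference frame."""
    ref = embeddings[ref_index]
    others = [e for i, e in enumerate(embeddings) if i != ref_index]
    return sum(cosine_distance(ref, e) for e in others) / len(others)

def random_pair_score(embeddings, n_pairs=200, seed=0):
    """Random Frame Pair mode: mean distance over sampled frame pairs."""
    rng = random.Random(seed)  # seeded for reproducible evaluation runs
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(embeddings)), 2)  # two distinct frames
        total += cosine_distance(embeddings[i], embeddings[j])
    return total / n_pairs
```

A perfectly consistent video (identical embeddings in every frame) scores 0.0 under both modes, and the score grows as facial features drift between frames.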
Experimental Findings
Results indicate significant discrepancies in facial consistency between AI-generated and real videos. While models like HunyuanVideo and Runway Gen-3 perform comparatively well, they still lag markedly behind the consistency of real videos. This gap underscores the limitations of current generative models in maintaining facial coherence and highlights a crucial area for continued development.
Implications and Future Directions
The Face Consistency Benchmark offers a specialized tool for AI video model evaluation, promoting improvements in character realism across applications requiring high-quality animation. Future research might extend the benchmark to multi-character scenes and full-body coherence, enhancing its ability to measure realistic interactions and overall dynamics in video generation.
Conclusion
The FCB provides an essential methodology for evaluating AI video generation in terms of character facial consistency. By utilizing state-of-the-art recognition models, the benchmark facilitates accurate assessments of generative capabilities, pushing forward advancements in AI video quality. The work laid out in this paper represents a substantial contribution towards solving the facial consistency challenge, and further iterative developments are anticipated to refine and broaden the scope of this evaluation framework.