CSFD Score: Face Consistency in AI Videos
- CSFD Score is a metric that quantifies the consistency of a main character’s facial features across multiple scenes in AI-generated videos.
- It computes pairwise face similarity using face detection, facial encoding via pretrained models, and cosine similarity.
- Higher CSFD values indicate improved character identity stability, addressing the issue of character drift in narrative video generation.
The Cross-Scene Face Distance Score (CSFD Score) is a quantitative metric designed to measure the consistency of character facial features across multiple scenes in generated long-form, multi-scene videos. It addresses a previously unmet need in video generation evaluation by offering an explicit assessment of character identity stability—an aspect critical for maintaining narrative coherence in AI-generated cinematic content.
1. Foundational Definition and Motivation
The CSFD Score quantifies the degree of similarity between a main character’s face as it appears in different scenes of a video. Traditional metrics such as Fréchet Inception Distance (FID), Inception Score (IS), and Fréchet Video Distance (FVD) assess image and video quality, diversity, or overall temporal coherence, but they do not directly measure whether a character’s facial identity remains consistently represented across disparate scenes. Inconsistencies in facial appearance, often termed "character drift," undermine narrative clarity and user perception. The CSFD Score was introduced to enable targeted, interpretable evaluation of this form of consistency, an essential aspect of long-form storytelling and multi-agent video generation frameworks (Xie et al., 21 Aug 2024).
2. Metric Computation Methodology
The calculation process for the CSFD Score is as follows:
- Face Detection: For each keyframe (one per scene), the system detects and crops the main character’s face using facial landmark localization (the 68-point landmark scheme is referenced for precise cropping).
- Face Encoding: Each extracted face is encoded via a pretrained facial recognition model, such as those based on OpenAI CLIP’s Vision Transformer (ViT).
- Similarity Computation: Each pair of faces across all keyframes is compared using a similarity function, typically cosine similarity applied to the facial embeddings.
- Averaging: The CSFD Score is computed as the mean similarity across all unique pairs:

$$\text{CSFD} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \text{sim}(F_i, F_j)$$

where $\text{sim}(F_i, F_j)$ is the similarity score between the face embeddings of keyframes $i$ and $j$, and $n$ is the number of keyframes. For example, with $n = 3$ keyframes and pairwise similarities 0.90, 0.80, and 0.85, the CSFD Score is (0.90 + 0.80 + 0.85)/3 = 0.85.
Pseudocode representation from Algorithm 1:

```
total ← 0
count ← n * (n - 1) / 2
for i = 1 to n:
    for j = i + 1 to n:
        similarity ← CFS(F_i, F_j)
        total ← total + similarity
averageScore ← total / count
return averageScore
```
The implementation relies on standard face detection and encoding toolkits (e.g., dlib, face-recognition, CLIP ViT), and presupposes that each keyframe contains one reliably detectable face.
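The pairwise-averaging step can be sketched as runnable Python. This is a minimal illustration assuming face embeddings have already been extracted by some encoder; detection and encoding are outside the snippet, and `cosine_similarity` here plays the role of the CFS function in Algorithm 1:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def csfd_score(embeddings: list) -> float:
    """Mean cosine similarity over all unique pairs of face embeddings
    (one embedding per scene keyframe)."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("CSFD requires at least two keyframes")
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += cosine_similarity(embeddings[i], embeddings[j])
    return total / (n * (n - 1) / 2)
```

In practice the `embeddings` list would come from running a face detector and a pretrained encoder over each scene's keyframe; identical embeddings yield a score of 1.0, and dissimilar faces pull the mean toward 0.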
3. Practical Applications and Evaluation Protocols
The CSFD Score is applied as an evaluation benchmark for long-form, multi-scene video generation models, particularly those that employ multi-agent or keyframe-iteration frameworks (such as DreamFactory (Xie et al., 21 Aug 2024)). The operational workflow is:
- Extract faces from all keyframes of generated videos.
- Calculate pairwise similarities as detailed above.
- Aggregate and report the resulting score alongside related metrics to assess both facial and stylistic consistency.
Interpretation of results is straightforward: higher CSFD values correspond to greater cross-scene facial consistency (i.e., lower identity drift). In experimental protocols, CSFD is usually accompanied by metrics such as Cross-Scene Style Consistency Score (CSSC) and average CLIP score for comprehensive evaluation.
4. Comparison to Traditional and Related Metrics
| Metric | Measured Property | Limitation for Cross-Scene Consistency |
|---|---|---|
| FID/IS/CLIP Score | Visual quality, image-text alignment | Not identity/consistency-sensitive (only evaluates per-frame) |
| FVD/KVD | Overall video coherence | Aggregated feature distributions; not character-specific |
| CSFD Score | Character face consistency | Direct, interpretable, scene-aware identity assessment |
Traditional metrics excel at evaluating fidelity, diversity, or text-image alignment but do not specifically evaluate the temporal coherence of character identities across scenes. The CSFD Score thus fills this methodological gap by focusing specifically on cross-scene character stability.
5. Empirical Results and Observations
Experimental results reported in (Xie et al., 21 Aug 2024) show that models incorporating multi-agent collaboration and keyframe iteration methods achieve markedly higher CSFD Scores than direct script-to-video baselines:
| Model | CSFD Score | CSSC Score | av-CLIP Score |
|---|---|---|---|
| DreamFactory (GPT-4) + DALL-E 3 | 0.89 | 0.97 | 0.31 |
| GPT-4 Script + DALL-E 3 | 0.77 | 0.85 | 0.29 |
| GPT-4 Script + Diffusion | 0.75 | 0.83 | 0.28 |
| GPT-4 Script + Midjourney | 0.68 | 0.66 | 0.26 |
Scores above 0.5 indicate significant consistency, with DreamFactory’s approach demonstrating the highest CSFD (0.89), linking architectural choices directly to improvements in face consistency across scenes.
6. Current Limitations and Future Directions
The CSFD Score, while targeted and informative, exhibits several limitations:
- Single-character focus: It presumes one main face per keyframe; extension to multi-character settings is noted as future work.
- Reliance on detection robustness: Failures in face detection or encoding due to occlusions or extreme stylization can degrade metric reliability.
- Restriction to facial identity: The score does not account for other character attributes (e.g., body shape, clothing, or contextual cues).
Suggested avenues for improvement include expanding the methodology to robustly handle multiple characters per scene, enhancing facial detection under occlusion and low resolution, developing analogous metrics for body or clothing consistency, and incorporating temporal tracking schemes to move beyond pairwise static comparisons.
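One way the multi-character extension above could work is to match each scene's detected faces to reference character embeddings before scoring, then compute a per-character CSFD over each matched track. The following sketch uses a greedy nearest-reference assignment; the matching strategy and all function names are illustrative assumptions, not part of the paper's method:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_faces(scene_faces, references):
    """Greedily assign each detected face in a scene to the most similar
    reference character embedding; each reference is used at most once."""
    candidates = [(cosine(f, r), fi, ri)
                  for fi, f in enumerate(scene_faces)
                  for ri, r in enumerate(references)]
    assignments, used_faces, used_refs = {}, set(), set()
    for sim, fi, ri in sorted(candidates, reverse=True):
        if fi not in used_faces and ri not in used_refs:
            assignments[ri] = scene_faces[fi]
            used_faces.add(fi)
            used_refs.add(ri)
    return assignments  # character index -> matched face embedding

def per_character_csfd(scenes, references):
    """Per-character CSFD: mean pairwise cosine similarity of each
    character's matched faces across all scenes."""
    tracks = {ri: [] for ri in range(len(references))}
    for faces in scenes:
        for ri, emb in assign_faces(faces, references).items():
            tracks[ri].append(emb)
    scores = {}
    for ri, embs in tracks.items():
        pairs = [cosine(embs[i], embs[j])
                 for i in range(len(embs)) for j in range(i + 1, len(embs))]
        scores[ri] = sum(pairs) / len(pairs) if pairs else float("nan")
    return scores
```

A more robust variant would use optimal bipartite matching (e.g., the Hungarian algorithm) instead of greedy assignment, and temporal tracking rather than independent per-scene matching, as the future-work discussion suggests.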
7. Significance and Ongoing Development
The introduction of the CSFD Score signals a methodological shift in the evaluation of AI-generated video, particularly for long-form, narrative-driven content. Its role as a diagnostic and benchmarking tool enables researchers to isolate and address the issue of character drift, providing objective grounds for advancement in both model development and evaluation standards. As generative frameworks evolve toward more complex cinematic outputs, the CSFD and its potential future extensions are expected to remain central to character-focused video assessment (Xie et al., 21 Aug 2024).