CSFD Score: Face Consistency in AI Videos

Updated 4 July 2025

CSFD Score is a metric that quantifies the consistency of a main character’s facial features across multiple scenes in AI-generated videos.
It computes pairwise face similarity using face detection, facial encoding via pretrained models, and cosine similarity.
Higher CSFD values indicate improved character identity stability, addressing the issue of character drift in narrative video generation.

The Cross-Scene Face Distance Score (CSFD Score) is a quantitative metric that measures how consistently a character’s facial features are rendered across the scenes of a generated long-form, multi-scene video. It gives video generation evaluation an explicit assessment of character identity stability, which is critical for maintaining narrative coherence in AI-generated cinematic content.

1. Foundational Definition and Motivation

The CSFD Score quantifies the degree of similarity between a main character’s face as it appears in different scenes of a video. Traditional metrics such as Fréchet Inception Distance (FID), Inception Score (IS), and Fréchet Video Distance (FVD) assess image and video quality, diversity, or overall temporal coherence, but they do not directly measure whether a character’s facial identity remains consistently represented across disparate scenes. Inconsistencies in facial appearance, often termed "character drift," undermine narrative clarity and user perception. The CSFD Score was introduced to enable targeted, interpretable evaluation of this form of consistency, an essential aspect of long-form storytelling and multi-agent video generation frameworks (Xie et al., 21 Aug 2024).

2. Metric Computation Methodology

The calculation process for the CSFD Score is as follows:

Face Detection: For each keyframe (one per scene), the main character’s face is detected and cropped using a facial landmark localization algorithm; 68-point landmark localization is referenced for precise cropping.
Face Encoding: Each cropped face is mapped to an embedding vector by a pretrained encoder, such as a face recognition model or the Vision Transformer (ViT) image encoder from OpenAI CLIP.
Similarity Computation: Each pair of faces $(F_i, F_j)$ across all $n$ keyframes is compared using a similarity function, typically cosine similarity applied to the facial embeddings.
Averaging: The CSFD Score is computed as the mean similarity across all $\binom{n}{2}$ unique pairs:

$\text{CSFD} = \frac{1}{\binom{n}{2}} \sum_{i=1}^{n} \sum_{j=i+1}^n \text{CFS}(F_i, F_j)$

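where $\text{CFS}(F_i, F_j)$ is the similarity between the faces cropped from keyframes $i$ and $j$. The score is defined only for $n \geq 2$, since a single keyframe yields no pairs. Despite the word "Distance" in its name, CSFD is a mean similarity, so higher values mean more consistent faces.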
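
As a small worked illustration (the similarity values here are invented for the example, not taken from the paper), consider $n = 3$ keyframes, which give $\binom{3}{2} = 3$ unique pairs. Assuming cosine similarity as the choice of CFS, and writing $e_i$ for the embedding of face $F_i$ (a symbol introduced only for this illustration):

$\text{CFS}(F_i, F_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}$

If the three pairwise similarities come out as 0.90, 0.80, and 0.70, then

$\text{CSFD} = \frac{1}{3}\left(0.90 + 0.80 + 0.70\right) = 0.80$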

Pseudocode representation from Algorithm 1:

total ← 0
count ← n*(n-1) / 2
For i = 1 to n:
  For j = i+1 to n:
    similarity ← CFS(F_i, F_j)
    total ← total + similarity
averageScore ← total / count
Return averageScore

This methodology relies on the robustness of the underlying face detection and encoding libraries (e.g., dlib, face-recognition, CLIP ViT), and presupposes that each keyframe contains one reliably detectable face.
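
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: it assumes the face embeddings have already been extracted as plain lists of floats and that CFS is cosine similarity. The function names `cosine_similarity` and `csfd_score` are chosen for this example only.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (assumed choice of CFS)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return dot / (norm_u * norm_v)


def csfd_score(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean similarity over all unordered pairs of face embeddings."""
    if len(embeddings) < 2:
        raise ValueError("need at least two faces to form a pair")
    # combinations(..., 2) enumerates exactly the n*(n-1)/2 pairs with i < j.
    pairs = list(combinations(embeddings, 2))
    total = sum(cosine_similarity(a, b) for a, b in pairs)
    return total / len(pairs)


# Toy embeddings: two identical faces and one orthogonal outlier.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_score = csfd_score(toy_embeddings)
```

Raising an error for fewer than two faces is a design choice of this sketch; the pseudocode itself would divide by zero in that case.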

3. Practical Applications and Evaluation Protocols

The CSFD Score is applied as an evaluation benchmark for long-form, multi-scene video generation models, particularly those that employ multi-agent or keyframe-iteration frameworks such as DreamFactory (Xie et al., 21 Aug 2024). The operational workflow is:

Extract faces from all keyframes of generated videos.
Calculate pairwise similarities as detailed above.
Aggregate and report the resulting score alongside related metrics to assess both facial and stylistic consistency.
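
The three workflow steps above could be wired together roughly as below. This is a hedged sketch: `detect_face`, `encode_face`, and `similarity` are hypothetical callables supplied by the caller (for example, wrappers around whatever detection and encoding libraries are in use), and skipping keyframes without a detected face is an assumption of this example rather than something the source specifies.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")
Face = TypeVar("Face")
Embedding = Sequence[float]


def evaluate_video(
    keyframes: Sequence[Frame],
    detect_face: Callable[[Frame], Optional[Face]],   # hypothetical: returns None if no face
    encode_face: Callable[[Face], Embedding],          # hypothetical: face crop -> embedding
    similarity: Callable[[Embedding, Embedding], float],
) -> dict:
    """Extract faces, score all pairs, and report the mean with coverage counts."""
    embeddings = []
    for frame in keyframes:
        face = detect_face(frame)
        if face is None:
            # Assumption of this sketch: skip frames with no detectable face.
            continue
        embeddings.append(encode_face(face))

    pairs = list(combinations(embeddings, 2))
    score = sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else None
    return {
        "csfd": score,                    # None when fewer than two faces were found
        "faces_found": len(embeddings),
        "keyframes": len(keyframes),
    }
```

Reporting `faces_found` next to the score makes it visible when the mean was computed over fewer keyframes than the video contains.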

Interpretation of results is straightforward: higher CSFD values correspond to greater cross-scene facial consistency (i.e., lower identity drift). In experimental protocols, CSFD is usually accompanied by metrics such as Cross-Scene Style Consistency Score (CSSC) and average CLIP score for comprehensive evaluation.

4. Comparison with Traditional Metrics

Metric	Measured Property	Relevance to Cross-Scene Consistency
FID/IS/CLIP Score	Visual quality, image-text alignment	Evaluated per frame; not sensitive to identity or consistency
FVD/KVD	Overall video coherence	Aggregated feature distributions; not character-specific
CSFD Score	Character face consistency	Direct, interpretable, scene-aware identity assessment

Traditional metrics excel at evaluating fidelity, diversity, or text-image alignment, but none of them evaluates whether a character’s identity stays coherent across scenes. The CSFD Score fills this methodological gap by targeting cross-scene character stability directly.

5. Empirical Results and Observations

Experimental results reported by Xie et al. (21 Aug 2024) show that models incorporating multi-agent collaboration and keyframe iteration methods achieve markedly higher CSFD Scores than direct script-to-video baselines:

Model	CSFD Score	CSSC Score	Average CLIP Score
DreamFactory(GPT4)+Dalle-e3	0.89	0.97	0.31
GPT4-Script+Dalle-e3	0.77	0.85	0.29
GPT4-Script+Diffusion	0.75	0.83	0.28
GPT4-Script+Midjourney	0.68	0.66	0.26

Scores above 0.5 are taken to indicate significant consistency, a bar that all four configurations clear. DreamFactory’s approach achieves the highest CSFD (0.89), 0.12 above the strongest baseline (0.77), which links its multi-agent, keyframe-iteration design to improved face consistency across scenes.
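
The gaps between configurations can be read off the table with a few lines of arithmetic. The snippet below is only a convenience sketch: the CSFD values are copied from the table above, and the variable names are illustrative.

```python
# CSFD scores copied from the results table (Xie et al., 21 Aug 2024).
csfd = {
    "DreamFactory(GPT4)+Dalle-e3": 0.89,
    "GPT4-Script+Dalle-e3": 0.77,
    "GPT4-Script+Diffusion": 0.75,
    "GPT4-Script+Midjourney": 0.68,
}

# Configuration with the highest score.
best = max(csfd, key=csfd.get)

# Absolute margin of the best configuration over each baseline, rounded to 2 decimals.
margins = {
    name: round(csfd[best] - score, 2)
    for name, score in csfd.items()
    if name != best
}

# Check the 0.5 threshold mentioned in the text for every configuration.
all_above_threshold = all(score > 0.5 for score in csfd.values())
```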

6. Current Limitations and Future Directions

The CSFD Score, while targeted and informative, exhibits several limitations:

Single-character focus: It presumes one main face per keyframe; extension to multi-character settings is noted as future work.
Reliance on detection robustness: Failures in face detection or encoding due to occlusions or extreme stylization can degrade metric reliability.
Restriction to facial identity: The score does not account for other character attributes (e.g., body shape, clothing, or contextual cues).

Suggested avenues for improvement include expanding the methodology to robustly handle multiple characters per scene, enhancing facial detection under occlusion and low resolution, developing analogous metrics for body or clothing consistency, and incorporating temporal tracking schemes to move beyond pairwise static comparisons.
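
One way the multi-character extension might look is sketched below. This is purely hypothetical and not part of the published metric: it assumes some upstream step has already grouped face embeddings by character identity (the `tracks` mapping), computes a CSFD-style mean for each character, and macro-averages the results. All names are invented for this illustration.

```python
import math
from itertools import combinations
from typing import Dict, Optional, Sequence

Embedding = Sequence[float]


def _cosine(u: Embedding, v: Embedding) -> float:
    """Cosine similarity between two non-zero embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def per_character_csfd(tracks: Dict[str, Sequence[Embedding]]) -> Dict[str, float]:
    """Mean pairwise similarity per character; characters seen once are skipped."""
    scores = {}
    for character, embeddings in tracks.items():
        pairs = list(combinations(embeddings, 2))
        if not pairs:
            continue  # a single appearance gives no pair to compare
        scores[character] = sum(_cosine(a, b) for a, b in pairs) / len(pairs)
    return scores


def macro_csfd(tracks: Dict[str, Sequence[Embedding]]) -> Optional[float]:
    """Unweighted average of the per-character scores, or None if none exist."""
    scores = per_character_csfd(tracks)
    return sum(scores.values()) / len(scores) if scores else None
```

A macro-average weights every character equally regardless of screen time; weighting by number of appearances would be an equally plausible choice.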

7. Significance and Ongoing Development

The introduction of the CSFD Score signals a methodological shift in the evaluation of AI-generated video, particularly for long-form, narrative-driven content. Its role as a diagnostic and benchmarking tool enables researchers to isolate and address the issue of character drift, providing objective grounds for advancement in both model development and evaluation standards. As generative frameworks evolve toward more complex cinematic outputs, the CSFD and its potential future extensions are expected to remain central to character-focused video assessment (Xie et al., 21 Aug 2024).

References

Xie et al. (21 Aug 2024). DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework.
