Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

Published 23 Apr 2026 in cs.CV | (2604.21478v1)

Abstract: Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can't achieve satisfactory performance because of poor generalization ability across datasets. The key factor that led to this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that can compute AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). It is interesting to find that evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. Besides, we also propose the novel framework Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts, and the facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on the public datasets prove that the proposed method achieves superior performance compared with the state-of-the-art methods with various suitable metrics.


Authors (3)

Summary

The paper introduces the Cross-AUC metric and SFAM framework to address cross-domain score misalignment in deepfake detection.
It employs patch-level image-text alignment and facial region-specific experts to improve local and global detection robustness.
Empirical results on multiple benchmarks show that SFAM achieves higher cross-domain AUC with lower variance compared to baselines.

Cross-Domain Evaluation in Face Forgery Detection: Semantic Fine-Grained Alignment and Mixture-of-Experts

Introduction

Face forgery detection remains a core research area as generative models that synthesize highly realistic fake facial content proliferate. While recent CNN- and ViT-based models have substantially improved within-dataset detection, generalization to unseen forgeries and to datasets with different biases remains a bottleneck. The prevailing practice is to report AUC computed within each test dataset separately. This hides a cross-domain weakness: detection score distributions can shift significantly between datasets, which compromises real-world applicability.

To address this, the paper "Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts" (2604.21478) makes two principal contributions: (1) the Cross-AUC metric, which measures cross-domain score comparability and exposes previously hidden robustness defects, and (2) the SFAM (Semantic Fine-grained Alignment and Mixture-of-Experts) framework, which integrates patch-level image-text alignment and a facial region-specific expert mixture to improve cross-domain discrimination. The reported results show SFAM outperforming the compared baselines across standard benchmarks.

Theoretical Motivation and Cross-AUC Metric

Detection methods are commonly benchmarked using the Area Under the ROC Curve (AUC), reported within a single dataset. This approach is fundamentally limited for open-world deployment: because AUC depends only on the ranking of scores within the evaluated set, it is invariant under monotonic score transformations and cannot guarantee that detection scores are calibrated or even comparable across domains. Because of dataset-specific biases (e.g., acquisition pipelines, post-processing), detectors can maintain strong intra-dataset separation while their score distributions shift substantially between datasets, which makes deployment with a single decision threshold unreliable.

This issue is highlighted by the score distribution analysis across FaceForensics++, WDF, and Celeb-DF datasets, where intra-dataset separability contrasts sharply with inconsistent cross-domain score alignments.

Figure 1: Prediction score distributions of authentic and forged samples across FF++, WDF, and Celeb-DF datasets, showing intra-dataset separability but severe cross-domain distribution shifts.

The Cross-AUC metric is introduced to measure this directly. Instead of evaluating only on (real, fake) pairs within a single domain, Cross-AUC averages AUC values across all pairs where real samples from dataset $i$ are contrasted with fake samples from dataset $j$ ($i \neq j$), directly penalizing models whose decision scores are not comparable between datasets. This exposes previously unmeasured vulnerabilities and sets a higher bar for generalization.
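A minimal sketch of how such a metric could be computed is given here in pure Python. It assumes that higher scores mean "more likely fake", that AUC is estimated as the fraction of (fake, real) pairs ranked correctly, and that the minimum and standard deviation are taken over ordered dataset pairs. The function names and the toy scores are illustrative and are not taken from the paper.

```python
from itertools import permutations
from statistics import mean, pstdev


def auc(real_scores, fake_scores):
    """Fraction of (fake, real) pairs where the fake scores higher; ties count 0.5."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


def cross_auc(scores):
    """scores maps dataset name -> {"real": [...], "fake": [...]} of detector outputs.

    Every ordered pair (i, j) with i != j contrasts reals from i with fakes from j.
    """
    pair_aucs = {
        (i, j): auc(scores[i]["real"], scores[j]["fake"])
        for i, j in permutations(scores, 2)
    }
    values = list(pair_aucs.values())
    return {
        "mean": mean(values),
        "min": min(values),
        "std": pstdev(values),  # population std over pairs (an assumption)
        "pairs": pair_aucs,
    }


# Toy scores: both datasets are perfectly separable on their own,
# but dataset B's scores are shifted upward relative to dataset A's.
toy_scores = {
    "A": {"real": [0.1, 0.2], "fake": [0.8, 0.9]},
    "B": {"real": [0.85, 0.88], "fake": [0.95, 0.99]},
}
intra = {name: auc(d["real"], d["fake"]) for name, d in toy_scores.items()}
result = cross_auc(toy_scores)
print(intra, result["mean"], result["min"])
```

In this toy case each dataset has a perfect intra-dataset AUC, yet the pair (real from B, fake from A) is ranked correctly only half the time, which is the kind of score shift the metric is meant to surface.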

Semantic Fine-Grained Alignment and Mixture-of-Experts Framework

The SFAM architecture builds on a frozen CLIP backbone and introduces two critical modules: 1) Patch-Level Image-Text Alignment (PaITA) and 2) Facial Region Mixture-of-Experts (FaRMoE).

Architecture Overview

The overall workflow combines mask-guided hybrid data augmentation, region-aligned feature specialization, and local patch-level alignment imposed via explicit loss constraints.

Figure 2: The workflow of the proposed SFAM framework.

Mask-Guided Hybrid Data Augmentation

Mask-guided hybrid augmentation uses facial region masks derived from landmarks to generate synthetic images in which selected regions (e.g., eyes, nose, mouth) are forged, drawing on combinations of real-fake and self-blended image pairs. This mechanism explicitly supervises the model to focus on manipulated areas while widening the diversity of forgeries.
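The core operation can be illustrated with a small pure-Python sketch of region-masked blending. It assumes grayscale images stored as nested lists and a rectangular box standing in for a landmark-derived region; the paper's actual masks, blending weights, and self-blending transforms are not specified here, so `region_mask` and `blend_region` are hypothetical helpers.

```python
def region_mask(height, width, box):
    """Binary mask that is 1.0 inside box=(top, left, bottom, right), else 0.0.

    A rectangle is a stand-in for a landmark-derived facial region (assumption).
    """
    top, left, bottom, right = box
    return [
        [1.0 if top <= y < bottom and left <= x < right else 0.0 for x in range(width)]
        for y in range(height)
    ]


def blend_region(real, donor, mask):
    """Pixel-wise blend: mask=1 takes the donor (forged) pixel, mask=0 keeps the real one."""
    return [
        [m * d + (1.0 - m) * r for r, d, m in zip(real_row, donor_row, mask_row)]
        for real_row, donor_row, mask_row in zip(real, donor, mask)
    ]


# Toy 4x4 example: forge only a "mouth" box in the lower middle of the image.
real_img = [[0.2] * 4 for _ in range(4)]
donor_img = [[0.9] * 4 for _ in range(4)]   # e.g. a fake or self-blended source
mouth = region_mask(4, 4, box=(2, 1, 4, 3))
augmented = blend_region(real_img, donor_img, mouth)

# The same mask doubles as a per-pixel label marking where the forgery is.
for row in augmented:
    print(row)
```

Because the mask is known at generation time, it can serve as localized supervision for the patch-level objectives described later.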

Facial Region Mixture-of-Experts (FaRMoE)

To mitigate the limitation of global feature sharing in ViT/CLIP, FaRMoE partitions facial images into anatomically meaningful regions and assigns each to a specialized expert MLP. For patches that fall inside a facial region, FaRMoE replaces the keys in self-attention layers with region-specific features, encouraging the extraction of organ-localized artifacts.
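The routing idea can be sketched as follows, under strong simplifying assumptions: each patch carries a known region label, routing is a hard lookup by that label, and each "expert" is an element-wise affine map standing in for an MLP. None of these details are confirmed by the summary, and the names `make_expert` and `region_specific_keys` are hypothetical.

```python
def make_expert(scale, bias):
    """Stand-in for a region expert MLP: an element-wise affine map (assumption)."""
    def expert(feature):
        return [scale * v + bias for v in feature]
    return expert


# One hypothetical expert per facial region.
experts = {
    "eyes": make_expert(2.0, 0.0),
    "nose": make_expert(1.0, 1.0),
    "mouth": make_expert(0.5, 1.0),
}


def region_specific_keys(keys, regions, experts):
    """Replace the attention key of each patch lying in a facial region with
    its expert's output; patches outside every region keep their original key."""
    return [
        experts[region](key) if region in experts else list(key)
        for key, region in zip(keys, regions)
    ]


# Three patch keys with their region labels.
patch_keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
patch_regions = ["eyes", "background", "mouth"]
new_keys = region_specific_keys(patch_keys, patch_regions, experts)
print(new_keys)
```

Only the keys are modified in this sketch; queries and values would pass through the attention layer unchanged.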

Figure 3: The structure of vision encoder with our proposed FaRMoE module.

Patch-Level Image-Text Alignment (PaITA)

PaITA extends CLIP’s global image-text contrastive training to a local, mask-supervised paradigm. Patch embeddings from the vision backbone are paired with local textual prompts corresponding to real/forged attributes, and two ranking losses are imposed: an intra-image loss (forged patches should score higher than authentic patches within the same image) and a cross-image loss (a forged patch should score higher than the corresponding real patch across matched real/fake pairs). This fine-grained alignment provides strong gradients for artifact detection at a local level.
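The two ranking constraints can be sketched with a margin-based hinge loss. The hinge form, the margin value, and the averaging are assumptions made for illustration; the summary states only the ordering each loss enforces. Patch scores are taken to be per-patch "forged" scores, for example the similarity of a patch embedding to a forged-attribute prompt.

```python
def hinge(diff, margin):
    """Zero once the score gap `diff` reaches the margin, linear penalty otherwise."""
    return max(0.0, margin - diff)


def intra_image_ranking_loss(patch_scores, forged_mask, margin=0.5):
    """Within one image, every forged patch should outscore every authentic patch."""
    forged = [s for s, m in zip(patch_scores, forged_mask) if m]
    authentic = [s for s, m in zip(patch_scores, forged_mask) if not m]
    if not forged or not authentic:
        return 0.0
    total = sum(hinge(f - a, margin) for f in forged for a in authentic)
    return total / (len(forged) * len(authentic))


def cross_image_ranking_loss(fake_scores, real_scores, forged_mask, margin=0.5):
    """Across a matched real/fake pair, each forged patch should outscore the
    real patch at the same position."""
    pairs = [(f, r) for f, r, m in zip(fake_scores, real_scores, forged_mask) if m]
    if not pairs:
        return 0.0
    return sum(hinge(f - r, margin) for f, r in pairs) / len(pairs)


# Toy example with four patches; patches 0 and 2 are forged in the fake image.
mask = [1, 0, 1, 0]
fake_patch_scores = [1.0, 0.25, 0.5, 0.25]
real_patch_scores = [0.25, 0.25, 0.25, 0.25]
l_intra = intra_image_ranking_loss(fake_patch_scores, mask)
l_cross = cross_image_ranking_loss(fake_patch_scores, real_patch_scores, mask)
print(l_intra, l_cross)
```

In the toy example the strongly forged patch already clears the margin and contributes nothing, while the weakly scored forged patch is the only source of loss in both terms.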

Loss Formulation and Training

The complete loss combines global classification, intra-image ranking, and inter-image ranking terms. The ranking loss weights are chosen empirically, with $\lambda_1=0.3$ and $\lambda_2=0.2$ reported as giving the best intra-dataset AUC and Cross-AUC.
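One plausible way to write this objective is shown here. It assumes a unit weight on the classification term and that $\lambda_1$ weights the intra-image ranking term while $\lambda_2$ weights the inter-image term; the summary does not state this assignment explicitly, and the symbols $\mathcal{L}_{\text{cls}}$, $\mathcal{L}_{\text{intra}}$, and $\mathcal{L}_{\text{inter}}$ are introduced only for illustration.

```latex
\mathcal{L} = \mathcal{L}_{\text{cls}}
            + \lambda_1 \, \mathcal{L}_{\text{intra}}
            + \lambda_2 \, \mathcal{L}_{\text{inter}},
\qquad \lambda_1 = 0.3, \quad \lambda_2 = 0.2
```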

Experimental Evaluation

Datasets and Experimental Rigor

The evaluation uses FaceForensics++ as the training domain and tests on Celeb-DF, DFDCP, DFDC, and UADFV, which together cover diverse manipulation techniques and post-processing artifacts. Metrics include AUC, Cross-AUC, Cross-AUC minimum, and standard deviation, capturing both average and worst-case generalization.

AUC and Cross-AUC Analysis

While leading baselines such as Forensics Adapter achieve intra-dataset AUC above 0.90, their Cross-AUC minimum often collapses below 0.67. In contrast, SFAM reports the highest Cross-AUC average (0.885) and minimum (0.747) with the lowest standard deviation (0.066). The small gap between SFAM’s AUC and Cross-AUC suggests that its scores are comparable across domains rather than fitted to dataset-specific artifacts.

Ablation Studies

Systematic ablations indicate that the SFAM modules are complementary and that each contributes. The mask-guided augmentation and PaITA provide the principal generalization improvements, raising the Cross-AUC average from 0.740 (raw CLIP) to 0.881, while FaRMoE further improves organ-specific discrimination without sacrificing generalization or stability.

Feature Embedding Visualization

Feature space analysis using t-SNE shows that baselines such as CLIP, Effort, and Forensics Adapter produce dataset-specific clusters, implying reliance on spurious correlations and dataset biases. SFAM instead learns a representation in which real/fake separation is maintained and samples from different domains overlap more closely, supporting robust cross-domain inference.

Figure 4: t-SNE visualization results of four methods.

Implications and Future Directions

Theoretical impact: The Cross-AUC metric offers a stricter evaluation protocol for deepfake detection models, revealing issues with score calibration and intra-dataset overfitting that previous protocols hid.

Practical relevance: Models like SFAM reduce catastrophic performance drops in real-world deployment, where the origin and corruption type of forgeries are unknown. The modular integration of mask guidance, patch-level alignment, and mixture-of-experts represents a blueprint for similar improvements in other cross-domain forensic tasks.

Future research may extend Cross-AUC to more diverse synthetic and authentic domains, investigate dynamic expert routing conditioned on wider attributes (beyond facial anatomy), and generalize the principle of fine-grained alignment in multimodal transformers. The framework is also extensible to audio and video modalities and can be integrated with explainability and interpretability pipelines.

Conclusion

This work argues that conventional single-dataset evaluation metrics mask critical generalization faults in face forgery detectors. By introducing the Cross-AUC metric and designing SFAM, a framework coupling patch-level image-text alignment with a facial region mixture-of-experts, the research reports more robust and stable cross-domain deepfake detection. The approach outlines a pathway toward generalizable forensics suited to deployment in unconstrained real-world media analysis.
