Multi-Face Forgery Detection

Updated 31 January 2026
  • Multi-face forgery detection is the task of identifying manipulated faces within images or videos containing multiple subjects, emphasizing contextual inconsistencies.
  • State-of-the-art methods integrate relational modeling, contrastive learning, and attention-based architectures to accurately classify and localize fake elements.
  • Evaluations on benchmarks like OpenForensics and FFIW-10K show high accuracy yet highlight challenges with small face detection and diverse manipulation methods.

Multi-face forgery detection refers to the task of detecting and localizing manipulated faces within multimedia containing two or more human faces per image or video frame. Unlike single-face detection—which assesses the authenticity of an isolated face—multi-face forgery detection addresses far more complex, realistic, and adversarial scenarios, where only a subset of faces may be fake, diverse manipulation pipelines are present, and context is essential for discrimination. This task has emerged as critical due to the evolution of face swapping, deepfake technologies, and the proliferation of multi-person visual media.

1. Problem Formulation and Motivation

Traditional face forgery detection assumes each image or video contains at most one face and typically employs a deep per-face classifier such as Xception, EfficientNet, or M2TR to assess authenticity (Lin et al., 2023). The multi-face variant generalizes detection to scenarios where an unknown number of faces, $n$, is present in each image $X = \{x_i\}_{i=1}^n$, and the faces may result from different deepfake or face-manipulation methods. The objective is to infer, for each face $x_i$, a label $y_i \in \{0,1\}$ (real/fake).
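
To ground the notation, the sketch below shows the single-face baseline that the multi-face task generalizes: each detected face $x_i$ is scored independently, with no cross-face context. The class name and the assumption that the backbone maps face crops to an $(n, d)$ feature matrix are illustrative, not taken from any cited paper.

```python
import torch
import torch.nn as nn

class PerFaceBaseline(nn.Module):
    """Scores each detected face independently; multi-face methods replace this
    with models that also exploit cross-face context within the same image."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone            # e.g., an Xception/EfficientNet trunk (assumed)
        self.head = nn.Linear(feat_dim, 1)  # binary real/fake score per face

    def forward(self, face_crops: torch.Tensor) -> torch.Tensor:
        # face_crops: (n, 3, H, W), the n cropped faces x_1..x_n of one image
        feats = self.backbone(face_crops)                    # assumed output: (n, feat_dim)
        return torch.sigmoid(self.head(feats)).squeeze(-1)   # y_hat_i in [0, 1] for each face
```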

The motivation for modeling facial relationships arises from empirical observations that, in realistic scenes, faces are jointly subject to the same capture conditions (camera, lighting, compression). Forgeries in a scene may thus display contextually inconsistent artifacts with respect to their companions, especially in multi-face settings. Mutual constraints—such as global scene consistency and anomaly detection among face embeddings—provide powerful cues for pinpointing forgeries.

2. Benchmark Datasets and Task Structure

The multi-face forgery detection task relies on richly annotated benchmarks with diverse forgeries and multi-face annotations. The primary datasets and their attributes are:

| Dataset | # Images/Videos | Avg. Faces/Image | Annotations | Notable Features |
|---|---|---|---|---|
| OpenForensics (Le et al., 2021) | 115,325 images | 2.9 | Per-face labels, masks, landmarks | GAN synthesis, Poisson blending, in-the-wild, robust test splits |
| FFIW-10K (Zhou et al., 2021) | 10,000 videos | 3.15 | Face/video labels, tracks | High-fidelity swaps, temporal tracking, automatic quality control |
| OFV2, FFIW-MF (Miao et al., 2024) | 79k / 84k (train) | 2+ | Masks, image labels | Cross-domain, GAN and deepfake, localization |

OpenForensics (Le et al., 2021) defines detection as the per-face prediction of forgery status and segmentation as per-face mask estimation. FFIW-10K (Zhou et al., 2021) introduces long-form video analysis, with multi-face multi-tracklet annotations and multiple face-swapping pipelines (DeepFaceLab, FSGAN, FaceSwap).

Evaluation on these datasets is conducted using protocols such as per-face AUC/ACC, AP@[.50:.05:.95], detection and segmentation metrics (mask IoU, oLRP), and video-level aggregation.
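
For concreteness, the per-face AUC/ACC and mask-IoU computations can be expressed as follows (a sketch using NumPy and scikit-learn; variable names are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_face_metrics(y_true, y_score, threshold=0.5):
    """y_true: 0/1 ground truth per face; y_score: predicted fake probability per face."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    auc = roc_auc_score(y_true, y_score)
    acc = ((y_score >= threshold).astype(int) == y_true).mean()
    return auc, acc

def mask_iou(pred_mask, gt_mask):
    """Binary masks of equal shape; returns intersection-over-union."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```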

3. Representative Architectures and Core Methodologies

3.1 Two-stage and End-to-End Detectors

Generic instance-segmentation detectors, both two-stage (e.g., Mask R-CNN) and single-stage (e.g., BlendMask, SOLOv2, YOLACT++), localize faces and classify each as real or fake (Le et al., 2021). These backbones are trained with standard losses (a minimal sketch combining them follows the list):

  • Detection: $\mathcal{L}_\text{cls}$ (binary cross-entropy)
  • Bounding box: $\mathcal{L}_\text{bbox}$ (smooth $L_1$)
  • Mask/segmentation: $\mathcal{L}_\text{seg}$ (pixelwise BCE/Dice)
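
The three terms above are typically summed into a single objective. Below is a minimal PyTorch sketch of such a combined loss, assuming per-face classification logits, box regressions, and mask logits already matched to their targets; the weighting factors are illustrative placeholders, not values from the cited papers.

```python
import torch.nn.functional as F

def detector_loss(cls_logits, cls_targets, box_preds, box_targets,
                  mask_logits, mask_targets, w_box=1.0, w_seg=1.0):
    """Standard detection objective: BCE classification + smooth-L1 boxes + BCE masks."""
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets.float())
    l_box = F.smooth_l1_loss(box_preds, box_targets)
    l_seg = F.binary_cross_entropy_with_logits(mask_logits, mask_targets.float())
    return l_cls + w_box * l_box + w_seg * l_seg
```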

3.2 Joint Relationship and Feature Learning

The FILTER framework (Lin et al., 2023) exemplifies state-of-the-art relational modeling. It comprises:

  • A facial relationship learning module: computes a cosine self-similarity matrix $S_{ij} = \frac{f_i \cdot f_j}{\|f_i\| \|f_j\|}$ across all $n$ faces, then applies a 1×1 convolution and a Transformer encoder to yield contextually enhanced features $F = \{F_i\}$.
  • Local classifier: predicts a per-face probability $\hat{y}_i$ from $F_i$ with cross-entropy loss $\mathcal{L}_\text{local}$.
  • Metric learning: pull–push losses ($\mathcal{L}_\text{pull}$ clusters same-class faces, $\mathcal{L}_\text{push}$ separates different-class faces).
  • Global aggregation: pooling of $\{F_i\}$ for image-level consistency, with loss $\mathcal{L}_\text{global}$.
  • Overall objective: $\mathcal{L} = \mathcal{L}_\text{global} + \lambda_1\mathcal{L}_\text{local} + \lambda_2\mathcal{L}_\text{pull} + \lambda_3\mathcal{L}_\text{push}$.

FILTER delivers marked accuracy improvements (AUC, ACC) over non-relational baselines.
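
The relational idea can be sketched roughly as follows. This is a simplified PyTorch illustration, not FILTER's published architecture: the way the refined similarity matrix is fused with the face features, and all layer sizes, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalEnhancer(nn.Module):
    """Cosine self-similarity over the faces of one image, followed by a
    Transformer encoder that produces context-enhanced per-face features."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        self.proj = nn.Conv2d(1, 1, kernel_size=1)   # 1x1 conv refining the similarity map
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, 2)     # per-face real/fake logits

    def forward(self, f):                            # f: (n, feat_dim) features of one image
        f_norm = F.normalize(f, dim=-1)
        S = f_norm @ f_norm.t()                      # (n, n) cosine self-similarity S_ij
        S = self.proj(S.unsqueeze(0).unsqueeze(0)).squeeze(0).squeeze(0)
        f_rel = torch.softmax(S, dim=-1) @ f         # relation-weighted mixing (assumed fusion)
        context = self.encoder(f_rel.unsqueeze(0)).squeeze(0)   # (n, feat_dim) enhanced F_i
        return self.classifier(context)
```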

3.3 End-to-End Contrastive and Attention-based Models

COMICS (Zhang et al., 2023) eliminates stage separation by integrating face localization, forgery detection, and pixelwise mask prediction in a single unified pipeline, built on instance segmentation detectors. The critical advances are:

  • Coarse-grained contrastive learning (CCL): Contrastive objectives at the proposal/embedding level across augmentations and pyramid layers, using FlatNCE variants for same-/different-class pulling/pushing. Dynamic prototypes per class are maintained at each spatial scale.
  • Fine-grained contrastive learning (FCL): Per-pixel contrast through within-mask (face-background) and between-mask (real-real, real-fake) objectives, processing masked feature maps; background and face regions are explicitly contrasted to enhance discriminability of subtle artifacts.
  • Frequency Enhanced Attention (FEA): High-pass SRM filters and spatial attention bolster the representation of forensic traces.
  • The network is trained with an aggregated objective: detection, CCL, and two FCL losses.

COMICS demonstrates AP gains up to 20.7 points over BlendMask in challenging settings.
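
The coarse-grained branch can be illustrated with a generic prototype-based contrastive loss. Note that this uses a plain InfoNCE-style cross-entropy against per-class prototypes with an exponential-moving-average update, rather than the FlatNCE variant actually used by COMICS; it is meant only to convey the pull/push intuition.

```python
import torch
import torch.nn.functional as F

def coarse_contrastive_loss(embeddings, labels, prototypes, temperature=0.07):
    """embeddings: (N, d) proposal features; labels: (N,) long tensor in {0, 1};
    prototypes: (2, d) running class prototypes (row 0 = real, row 1 = fake)."""
    z = F.normalize(embeddings, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / temperature          # (N, 2) similarity to each class prototype
    return F.cross_entropy(logits, labels)    # pulls toward own prototype, pushes from the other

def update_prototypes(prototypes, embeddings, labels, momentum=0.99):
    """EMA update of the dynamic class prototypes (a common choice, assumed here)."""
    for c in (0, 1):
        mask = labels == c
        if mask.any():
            prototypes[c] = momentum * prototypes[c] + (1 - momentum) * embeddings[mask].mean(0)
    return prototypes
```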

3.4 Unified Detection-Localization Networks

MoNFAP (Miao et al., 2024) introduces the Forgery-aware Unified Predictor (FUP) and a Mixture-of-Noises Module (MNM):

  • FUP uses a multi-scale pyramid of Forgery-aware Transformers with token-based learning to simultaneously output global image classification (real/fake) and dense pixelwise masks.
  • MNM augments features at each scale via a mixture-of-experts over four noise-focused extractors, directly introducing forensically relevant residual cues.
  • Masked cross-attention mechanisms enforce locality and provide fine-grained localization.
  • The full loss function combines image-level, pixel-level, auxiliary-mask, and importance (MiNE) loss terms.

MoNFAP sets new intra- and cross-dataset benchmarks, attaining, for example, 99.10% ACC, 94.82% F1-f, and 90.15% IoU-f on the OFV2 test set (Miao et al., 2024).
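
The mixture-of-noises idea can be sketched as a small mixture-of-experts over residual extractors with a learned per-location gate. The expert design and gating below are illustrative assumptions, not the published MNM.

```python
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfNoises(nn.Module):
    """Combines several noise/residual extractors via a learned per-location gate."""
    def __init__(self, channels, n_experts=4):
        super().__init__()
        # Each "expert" here is a small conv producing a residual map from the feature
        # (a stand-in for the noise-focused extractors described in the text).
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1, bias=False) for _ in range(n_experts)]
        )
        self.gate = nn.Conv2d(channels, n_experts, 1)   # per-location expert weights

    def forward(self, x):                               # x: (B, C, H, W) backbone features
        weights = F.softmax(self.gate(x), dim=1)        # (B, E, H, W) gating weights
        residual = sum(weights[:, i:i + 1] * expert(x) for i, expert in enumerate(self.experts))
        return x + residual                             # noise-augmented features
```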

4. Training, Evaluation Protocols, and Robustness

Training strategies leverage extensive augmentation for robustness: geometric (crop, flip, rotation), appearance (saturation, block occlusion, noise), and domain-randomized (color, blur, weather perturbation) (Le et al., 2021, Zhang et al., 2023). Loss balancing and hyperparameter selection (e.g., pull–push weights, importance loss) are tuned via validation sets.
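
As an illustration, a torchvision pipeline combining geometric, appearance, and degradation augmentations of the kind listed above might look like the following; the specific operations and parameters are arbitrary examples, not the settings used in the cited papers.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # geometric: crop
    transforms.RandomHorizontalFlip(),                       # geometric: flip
    transforms.RandomRotation(10),                           # geometric: rotation
    transforms.ColorJitter(brightness=0.2, saturation=0.2),  # appearance: color/saturation
    transforms.GaussianBlur(kernel_size=3),                  # degradation: blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.3),                         # appearance: block occlusion
])
```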

Testing utilizes COCO-style AP measures, per-face accuracy, mask IoU, and others. On OpenForensics, BlendMask achieves 87.0% AP_det (test-dev), while BlendMask+COMICS achieves 88.2% (dev) and 74.6% (challenge) (Zhang et al., 2023). FILTER achieves up to 99.88% AUC and 99.00% ACC (dev) (Lin et al., 2023).

Ablation studies across these models indicate significant performance drops when relational, contrastive, or noise modules are ablated, confirming their necessity (Lin et al., 2023, Zhang et al., 2023, Miao et al., 2024).

5. Challenges, Limitations, and Open Questions

  • Small face detection remains a bottleneck; state-of-the-art methods (e.g., COMICS, MoNFAP) yield lower AP_S compared to AP_M or AP_L (Le et al., 2021, Zhang et al., 2023).
  • Occlusion and domain shifts present robustness issues—performance drops by ~40% AP on OpenForensics test-challenge splits (Le et al., 2021).
  • Forgeries beyond face swapping (expression editing, attribute transfer, full-body manipulation) and multimodal cues (audio-visual consistency) are largely unsolved.
  • Generalization to unseen forgery types or social media–degraded imagery is an ongoing problem. Cross-dataset tests show severe performance reduction for most methods except those with explicit noise or contrastive modeling (MoNFAP, COMICS).
  • Explainability and false positives: Even strong methods may flag genuine faces with color/lighting anomalies or miss minimally manipulated subregions (Miao et al., 2024).
  • Scaling to extreme face counts (>10 faces per image) challenges the relational modeling and computational tractability (Lin et al., 2023).

6. Directions for Future Research

Research trends point toward:

  • Incorporating spatial (layout), temporal (video), and identity-specific cues for relational reasoning (Lin et al., 2023).
  • Adaptive graph or mask sparsification to manage computational costs with large face counts (Lin et al., 2023).
  • Joint detection and verification systems, as well as continual learning to counter evolving synthesis pipelines (Le et al., 2021).
  • More robust fusion of frequency, noise, and artifact cues (e.g., via learned mixtures or adversarial augmentation) (Miao et al., 2024).
  • Exploring multi-modal and cross-modal forensics (e.g., audio–visual, transcript–visual parity).
  • Full integration into real-time on-device pipelines, with efficiency and interpretability constraints for practical deployment (Le et al., 2021).

7. Summary Table of Notable Approaches and Benchmarks

| Method | Architecture | Relational/Contrastive | Localization | OpenForensics (dev) | FFIW ACC (%) | Key Modules |
|---|---|---|---|---|---|---|
| BlendMask (Le et al., 2021) | Single-stage instance segmentation | No | Yes (mask) | 87.0 AP | — | Standard detection + segmentation |
| FILTER (Lin et al., 2023) | Two-stage + Transformer + aggregation | Yes | No | 99.82 AUC | 82.5 | Relational similarity matrix, MLP |
| COMICS (Zhang et al., 2023) | End-to-end proposal + mask + FCL | Yes (bi-grained) | Yes (mask) | 88.2 AP | — | CCL, FCL, FEA |
| MoNFAP (Miao et al., 2024) | Unified Transformer + noise | Yes (masked attention) | Yes (mask) | 99.10 ACC | 92.86 | FUP, MNM, MoNE, FAT layers |

Ensembles and hybrids (e.g., FILTER+M2TR) yield further incremental gains. Cross-domain generalization remains most effective when explicit noise or distributional contrast is exploited.


This overview synthesizes the foundational principles, leading methodologies, empirically validated benchmarks, unresolved challenges, and clear trajectories in multi-face forgery detection, as documented in (Lin et al., 2023, Zhou et al., 2021, Zhang et al., 2023, Le et al., 2021), and (Miao et al., 2024).
