OpenForensics: Multi-Face Forgery & Segmentation
- OpenForensics is a large-scale dataset comprising over 115K images and 334K annotated face instances, supporting both forgery detection and instance segmentation.
- It features detailed annotations, including bounding boxes, pixel-level masks, and 68-point facial landmarks, systematically generated through a rigorous synthesis and validation pipeline.
- Benchmark experiments with various models reveal challenges in handling small faces and severe perturbations, highlighting opportunities for improved deepfake countermeasures.
OpenForensics is a large-scale dataset explicitly constructed for multi-face forgery detection and instance segmentation in complex, unconstrained visual environments (“in-the-wild”). It is motivated by the need for robust, fine-grained countermeasures to synthetic face manipulation in social media and other real-world contexts, where multiple faces of varying scales and occlusions appear in diverse, natural scenes. The dataset provides face-wise detailed annotations and supports rigorous benchmarking across both detection and segmentation tasks, filling a critical gap in deepfake forensics research (Le et al., 2021).
1. Design and Construction
OpenForensics comprises 115,325 images and 334,136 annotated face instances, split into 160,476 real and 173,660 forged faces. Pristine images are sourced from Google Open Images and are filtered to exclude non-human faces, leading to a curated set of 45,473 pristine images. Forged faces are generated by synthesizing non-target face swaps using GAN latent-vector manipulation—specifically, identity latent modification and face generation with StyleGAN or Adversarial Latent Autoencoders. The synthesis pipeline includes face extraction, quality control, pose/expression alignment, Poisson blending with color-adaptive facial-landmark masks, and a spoofing validation using an XceptionNet classifier to ensure realism and classifier evasion. This pipeline yields high-resolution (512 × 512) forged faces with varied identities and naturalistic blending (Le et al., 2021).
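The Poisson-blending step can be illustrated with OpenCV's `seamlessClone`. The sketch below is a minimal stand-in, not the authors' released pipeline: it assumes the synthesized face has already been aligned to the target, and uses a plain convex-hull landmark mask in place of the paper's color-adaptive masks.

```python
# Illustrative sketch of the Poisson-blending step, using OpenCV's
# seamlessClone. The hull-based mask is a simplified stand-in for the
# paper's color-adaptive facial-landmark masks.
import cv2
import numpy as np

def blend_synthesized_face(target_img, synth_face, landmarks):
    """Blend a synthesized face into the target image at the landmark region.

    target_img: HxWx3 uint8 scene image containing the original face.
    synth_face: HxWx3 uint8 image with the GAN-generated face, aligned to target.
    landmarks:  (68, 2) array of facial landmark coordinates in target space.
    """
    # Build a binary mask from the convex hull of the 68 landmarks.
    mask = np.zeros(target_img.shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)

    # Center of the face region, required by seamlessClone.
    x, y, w, h = cv2.boundingRect(hull)
    center = (x + w // 2, y + h // 2)

    # Poisson blending: gradients from synth_face, boundary colors from target_img.
    return cv2.seamlessClone(synth_face, target_img, mask, center, cv2.NORMAL_CLONE)
```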
2. Annotation, Protocols, and Splits
Each face instance in OpenForensics is annotated with a bounding box, pixel-level face mask, binary forgery mask, binary class (real or fake), and 68-point facial landmarks. An in-house annotation tool with semi-automatic, landmark-based mask smoothing supports batch annotation with refinement. The dataset is divided into four canonical splits (an illustrative annotation record is sketched after the table):
| Split | Images | Real Faces | Forged Faces |
|---|---|---|---|
| Training | 44,122 | 85,392 | 65,972 |
| Validation | 7,308 | 4,786 | 10,566 |
| Test-Dev | 18,895 | 21,071 | 28,679 |
| Test-Challenge | 45,000 | 49,218 | 68,452 |
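The per-face annotation schema can be summarized as a simple record. The field names below are hypothetical, since the dataset's actual on-disk serialization may differ, but each instance carries these five annotation types per the paper.

```python
# Hypothetical shape of one face-instance annotation; the real on-disk format
# may differ, but each instance carries these fields per the paper.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FaceInstance:
    bbox: Tuple[float, float, float, float]   # (x, y, width, height)
    face_mask: List[List[int]]                # pixel-level polygon(s) for the face
    forgery_mask: List[List[int]]             # binary mask of manipulated pixels
    is_fake: bool                             # real vs. forged label
    landmarks: List[Tuple[float, float]]      # 68 (x, y) facial keypoints

@dataclass
class ImageRecord:
    file_name: str
    faces: List[FaceInstance]                 # multiple faces per image
```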
The Test-Challenge set introduces cross-domain stressors through augmented manipulations—including color/edge transforms, block-wise distortion, compression noise, blur/sharpen, and synthetic weather effects—distributed evenly across three difficulty levels (easy/medium/hard) (Le et al., 2021).
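A rough sketch of such perturbations is shown below, assuming OpenCV and illustrative strength parameters; the paper's exact augmentation settings and difficulty tiers are not reproduced here.

```python
# A minimal sketch of Test-Challenge-style perturbations using OpenCV; the
# strength parameters per difficulty level are illustrative assumptions.
import cv2
import numpy as np

def perturb(img, level=1):
    """Apply blur + JPEG compression noise + color shift, scaled by level (1-3)."""
    # Gaussian blur: kernel size grows with difficulty.
    k = 2 * level + 1
    out = cv2.GaussianBlur(img, (k, k), 0)

    # JPEG compression artifacts: lower quality at higher difficulty.
    quality = max(10, 60 - 20 * level)
    ok, buf = cv2.imencode(".jpg", out, [cv2.IMWRITE_JPEG_QUALITY, quality])
    out = cv2.imdecode(buf, cv2.IMREAD_COLOR)

    # Simple color transform: random channel-wise gain.
    gains = np.random.uniform(0.8, 1.2, size=3)
    return np.clip(out.astype(np.float32) * gains, 0, 255).astype(np.uint8)
```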
3. Statistical Properties
OpenForensics exhibits extensive scale, diversity, and granularity:
- Face Count and Scale: Images contain 2.9 faces on average; measured by face diagonal in pixels, approximately 14% of faces are small, 58% medium, and 28% large.
- Mask Coverage: 12% of face masks cover under 1k px², 63% span 1–10k px², and 25% exceed 10k px² (see the sketch after this list).
- Spatial Bias: 70% of face centroids lie within the central 50% of each image.
- Scene/Resolution Diversity: Indoor scenes compose 63.7% and outdoor scenes 36.3% (inferred via the Places2 classifier), and image resolutions vary widely. Scenes include streets, homes, offices, sports venues, and crowded locations (Le et al., 2021).
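Given per-face binary masks, the mask-coverage breakdown above can be recomputed with a few lines of NumPy. The helper below is illustrative, not part of any released toolkit.

```python
# Reproduce the mask-coverage breakdown: fraction of faces under 1k px²,
# between 1k and 10k px², and over 10k px² (mask arrays assumed binary).
import numpy as np

def mask_area_breakdown(masks):
    """masks: iterable of HxW binary numpy arrays, one per face instance."""
    areas = np.array([int(m.sum()) for m in masks])
    small  = np.mean(areas < 1_000)
    medium = np.mean((areas >= 1_000) & (areas <= 10_000))
    large  = np.mean(areas > 10_000)
    return small, medium, large  # expected roughly (0.12, 0.63, 0.25)
```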
4. Task Definitions and Metrics
OpenForensics enables two primary tasks:
- Multi-Face Forgery Detection: Given an image $I$, detect all face instances $\{b_i\}$ and assign each a class $c_i \in \{\text{real}, \text{fake}\}$, so a detector yields the set $\{(b_i, c_i)\}$. Optimization uses the composite detection loss $\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}}$, where $\mathcal{L}_{\text{cls}}$ is cross-entropy and $\mathcal{L}_{\text{reg}}$ is smooth-$L_1$ (Le et al., 2021).
- Multi-Face Forgery Segmentation: For each detected face, output a binary mask $M$, trained with element-wise binary cross-entropy: $\mathcal{L}_{\text{mask}} = -\frac{1}{|M|} \sum_{p} \left[ M_p \log \hat{M}_p + (1 - M_p) \log (1 - \hat{M}_p) \right]$. A minimal loss-computation sketch follows this list.
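The two losses above correspond to the following minimal PyTorch sketch; the tensor shapes are illustrative rather than the exact head design of any benchmarked model.

```python
# Minimal PyTorch sketch of the losses defined above; shapes are illustrative.
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets):
    """L_det = L_cls (cross-entropy over real/fake) + L_reg (smooth-L1 on boxes)."""
    l_cls = F.cross_entropy(cls_logits, cls_targets)   # (N, 2) vs. (N,) labels
    l_reg = F.smooth_l1_loss(box_preds, box_targets)   # (N, 4) box regression
    return l_cls + l_reg

def mask_loss(mask_logits, mask_targets):
    """Element-wise binary cross-entropy over predicted per-face masks."""
    return F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
```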
Evaluation employs COCO-style mean Average Precision (AP@[IoU=0.50:0.05:0.95]) for both detection and mask outputs, augmented by scale-specific AP ($\mathrm{AP}_S$, $\mathrm{AP}_M$, $\mathrm{AP}_L$) and optimal LRP (oLRP) metrics for error decomposition (localization, false positives, false negatives) (Le et al., 2021).
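Because the metrics are COCO-style, detection and mask AP (including the scale breakdown) can be computed with `pycocotools`, assuming ground truth and predictions serialized in COCO JSON format; the file names below are placeholders, and oLRP requires a separate implementation.

```python
# COCO-style AP evaluation as described; file names are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("openforensics_test_dev_gt.json")      # ground-truth annotations
coco_dt = coco_gt.loadRes("model_predictions.json")   # detector outputs

for iou_type in ("bbox", "segm"):                     # detection AP and mask AP
    evaluator = COCOeval(coco_gt, coco_dt, iouType=iou_type)
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints AP@[IoU=0.50:0.95] plus AP_S / AP_M / AP_L
```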
5. Benchmarking and Experimental Methodology
The dataset benchmarks twelve state-of-the-art methods spanning two-stage (Mask R-CNN, Mask Scoring R-CNN), single-stage (YOLACT, YOLACT++, CenterMask, BlendMask), anchor-free (PolarMask, SOLO, SOLOv2), and conditional/implicit models (MEInst, CondInst, RetinaMask). All models use an FPN-ResNet50 backbone pre-trained on ImageNet. Inputs are resized to a maximum of $1333$ px (shorter side $800$ px). Training is performed on a single Tesla P100 with 32 GB RAM, using SGD with a learning rate of $0.02$ and momentum $0.9$, over 12 epochs with staged decay (Le et al., 2021).
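In PyTorch terms, the reported optimizer corresponds to the sketch below. The decay milestones (epochs 8 and 11, the common 1x schedule) are an assumption, since the text above only states "staged decay".

```python
# Sketch of the reported optimization schedule; milestone epochs are assumed.
import torch

def build_optimizer(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[8, 11], gamma=0.1  # decay LR by 10x at each stage
    )
    return optimizer, scheduler
```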
6. Quantitative and Qualitative Results
On the Test-Dev set, BlendMask achieves the highest detection AP ($87.0$) and mask AP ($89.2$), with the lowest oLRP; Mask R-CNN obtains a detection AP of $79.2$ and a mask AP of $83.6$. Two-stage models exhibit lower false positives but higher false negatives, while single-stage/dense models better localize small faces. Scale breakdowns reveal substantially higher AP for medium and large faces than for small faces; BlendMask's small-face AP, for instance, falls well below its medium- and large-face AP (Le et al., 2021). On the Test-Challenge set with unseen augmentations, overall AP drops by approximately $35$ points across methods, with BlendMask's detection and mask AP both falling sharply; YOLACT++ demonstrates comparatively better robustness, with a less pronounced AP drop. Qualitative analysis illustrates that current models struggle with extreme occlusions, severe color distortions, and blurred manipulations, especially for tiny faces (Le et al., 2021).
7. Research Challenges and Future Opportunities
Findings indicate that dataset scale and scenario complexity expose significant weaknesses in current detectors, particularly for small faces and under cross-domain perturbation. Anchor-free and dense-mask methods (e.g., BlendMask, SOLO) excel under standard conditions but are less robust to severe perturbations. Two-stage detectors better minimize false positives but are computationally costlier. Prominent research avenues include: robust small-face detection (e.g., multiscale feature fusion, specialized detection heads), forgery boundary refinement (integrating edge detection), enhanced explainability (visual localization of forgery cues), and domain adaptation (perturbation-aware augmentation or adversarial training to improve generalization) (Le et al., 2021).
OpenForensics establishes a comprehensive, richly annotated platform for advancing the state of multi-face forgery detection and segmentation under real-world, unconstrained conditions, and offers a foundation for future progress in both deepfake prevention and general face analysis (Le et al., 2021).