OpenForensics: Multi-Face Forgery & Segmentation

  • OpenForensics is a large-scale dataset comprising over 115K images and 334K annotated face instances that support both forgery detection and instance segmentation.
  • It features detailed annotations, including bounding boxes, pixel-level masks, and 68-point facial landmarks, systematically generated through a rigorous synthesis and validation pipeline.
  • Benchmark experiments with various models reveal challenges in handling small faces and severe perturbations, highlighting opportunities for improved deepfake countermeasures.

OpenForensics is a large-scale dataset explicitly constructed for multi-face forgery detection and instance segmentation in complex, unconstrained visual environments (“in-the-wild”). It is motivated by the need for robust, fine-grained countermeasures to synthetic face manipulation in social media and other real-world contexts, where multiple faces of varying scales and occlusions appear in diverse, natural scenes. The dataset provides face-wise detailed annotations and supports rigorous benchmarking across both detection and segmentation tasks, filling a critical gap in deepfake forensics research (Le et al., 2021).

1. Design and Construction

OpenForensics comprises 115,325 images and 334,136 annotated face instances, split into 160,476 real and 173,660 forged faces. Pristine images are sourced from Google Open Images and are filtered to exclude non-human faces, leading to a curated set of 45,473 pristine images. Forged faces are generated by synthesizing non-target face swaps using GAN latent-vector manipulation—specifically, identity latent modification and face generation with StyleGAN or Adversarial Latent Autoencoders. The synthesis pipeline includes face extraction, quality control, pose/expression alignment, Poisson blending with color-adaptive facial-landmark masks, and a spoofing validation using an XceptionNet classifier to ensure realism and classifier evasion. This pipeline yields high-resolution (512 × 512) forged faces with varied identities and naturalistic blending (Le et al., 2021).
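The Poisson-blending step can be illustrated with OpenCV's seamless cloning; the sketch below is a minimal stand-in for the paper's color-adaptive landmark-mask blending, and the convex-hull mask construction is our simplifying assumption rather than the authors' exact procedure.

```python
import cv2
import numpy as np

def blend_face(swapped_face, target_image, landmarks_68):
    """Poisson-blend a synthesized face into a target image.

    Minimal illustration of the blending step, not the authors' exact
    pipeline: the mask is rasterized from the convex hull of the
    68 facial landmarks (a simplifying assumption).
    """
    # Binary mask covering the face region, from the landmark hull.
    hull = cv2.convexHull(landmarks_68.astype(np.int32))
    mask = np.zeros(target_image.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)

    # Blend at the center of the face's bounding rectangle.
    x, y, w, h = cv2.boundingRect(hull)
    center = (x + w // 2, y + h // 2)

    # Seamless cloning adapts colors and gradients along the seam.
    return cv2.seamlessClone(swapped_face, target_image, mask,
                             center, cv2.NORMAL_CLONE)
```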

2. Annotation, Protocols, and Splits

Each face instance in OpenForensics is annotated with a bounding box, pixel-level face mask, binary forgery mask, binary class (real or fake), and 68-point facial landmarks. An in-house annotation tool with semi-automatic, landmark-based mask smoothing supports batch annotation with refinement. The dataset is divided into four canonical splits:

Split            Images    Real Faces    Forged Faces
Training         44,122    85,392        65,972
Validation        7,308     4,786        10,566
Test-Dev         18,895    21,071        28,679
Test-Challenge   45,000    49,218        68,452
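For orientation, the per-face annotation fields described above can be pictured as one COCO-style record; the key names and values below are hypothetical, not the dataset's actual schema.

```python
# Hypothetical COCO-style record for one face instance; key names and
# values are illustrative assumptions, not the dataset's schema.
face_annotation = {
    "image_id": 12345,
    "bbox": [412.0, 188.0, 96.0, 120.0],  # [x, y, width, height] in pixels
    "segmentation": [[415.2, 190.1, 506.8, 192.4, 500.3, 306.0]],  # face-mask polygon
    "forgery_mask": "masks/12345_0.png",  # binary mask of the manipulated region
    "category": "fake",                   # binary class: "real" or "fake"
    "landmarks": [[420.5, 201.3], [428.1, 214.9]],  # first 2 of 68 (x, y) points
}
```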

The Test-Challenge set introduces cross-domain stressors through augmented manipulations—including color/edge transforms, block-wise distortion, compression noise, blur/sharpen, and synthetic weather effects—distributed evenly across three difficulty levels (easy/medium/hard) (Le et al., 2021).

3. Statistical Properties

OpenForensics exhibits extensive scale, diversity, and granularity:

  • Face Count and Scale: Faces per image average 2.9; approximately 14% are small (diagonal $d < 32$ px), 58% medium ($32 \leq d < 96$ px), and 28% large ($d \geq 96$ px); see the helper after this list.
  • Mask Coverage: 12% of faces cover less than 1k px in area, 63% span 1–10k px, and 25% exceed 10k px.
  • Spatial Bias: 70% of face centroids lie within the central 50% of each image.
  • Scene/Resolution Diversity: Indoor scenes compose 63.7% and outdoor scenes 36.3% (inferred via the Places2 classifier). Image dimensions range from $200 \times 200$ to $4{,}000 \times 3{,}000$, with a median of $800 \times 600$. Scenes include streets, homes, offices, sports venues, and crowded locations (Le et al., 2021).
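The scale binning above follows fixed diagonal thresholds; a small helper (the function name is ours) makes the rule explicit.

```python
import math

def face_scale_bucket(width: float, height: float) -> str:
    """Bin a face by bounding-box diagonal, per the thresholds above:
    small < 32 px, 32 <= medium < 96 px, large >= 96 px."""
    d = math.hypot(width, height)  # diagonal length in pixels
    if d < 32:
        return "small"
    return "medium" if d < 96 else "large"
```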

4. Task Definitions and Metrics

OpenForensics enables two primary tasks:

  • Multi-Face Forgery Detection: Given an image $I$, detect the $N$ face instances $\{b_i\}$ and assign each a class $c_i \in \{\mathrm{real}, \mathrm{fake}\}$. A detector yields $\langle \hat{b}_i,\ p_i \in [0,1],\ \hat{c}_i \rangle$. Optimization uses the composite detection loss $L_{\mathrm{det}} = L_{\mathrm{cls}}(p_i, c_i) + \lambda \cdot L_{\mathrm{reg}}(\hat{b}_i, b_i)$, where $L_{\mathrm{cls}}$ is cross-entropy and $L_{\mathrm{reg}}$ is smooth-$L_1$ (Le et al., 2021).
  • Multi-Face Forgery Segmentation: For each detected face, output a mask $\hat{M}_i \in \{0,1\}^{H \times W}$, trained with element-wise binary cross-entropy: $L_{\mathrm{mask}} = -\sum_{x,y} \left[ M_i(x,y) \log \hat{M}_i(x,y) + (1 - M_i(x,y)) \log(1 - \hat{M}_i(x,y)) \right]$. A PyTorch sketch of both losses follows this list.
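A minimal PyTorch rendering of the two losses, assuming hard labels, per-instance tensors, and an illustrative $\lambda$; this is a sketch of the formulas above, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_preds, box_targets, lam=1.0):
    """L_det = L_cls + lambda * L_reg: cross-entropy on real/fake logits
    plus smooth-L1 box regression, matching the definition above."""
    l_cls = F.cross_entropy(cls_logits, labels)       # labels: 0 = real, 1 = fake
    l_reg = F.smooth_l1_loss(box_preds, box_targets)  # [N, 4] box coordinates
    return l_cls + lam * l_reg

def mask_loss(mask_logits, mask_targets):
    """Element-wise binary cross-entropy over predicted face masks."""
    return F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
```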

Evaluation employs COCO-style mean Average Precision (AP@[IoU=0.50:0.05:0.95]) for both detection and mask outputs, augmented by scale-specific AP ($AP_S$, $AP_M$, $AP_L$) and optimal LRP (oLRP) metrics for error decomposition into localization error, false positives, and false negatives (Le et al., 2021).
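The COCO-style protocol can be reproduced with pycocotools once ground truth and predictions are serialized in COCO JSON format; the file names below are placeholders, and oLRP requires a separate implementation.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO-format ground truth and detection results.
coco_gt = COCO("openforensics_test_dev.json")
coco_dt = coco_gt.loadRes("detections.json")

# iouType="segm" scores mask AP; use "bbox" for detection AP.
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # reports AP@[0.50:0.95] plus AP_S / AP_M / AP_L
```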

5. Benchmarking and Experimental Methodology

The dataset benchmarks twelve state-of-the-art methods spanning two-stage (Mask R-CNN, Mask Scoring R-CNN), single-stage (YOLACT, YOLACT++, CenterMask, BlendMask), anchor-free (PolarMask, SOLO, SOLOv2), and conditional/implicit models (MEInst, CondInst, RetinaMask). All models utilize an FPN-ResNet50 backbone pre-trained on ImageNet. Inputs are resized to a maximum of $1333$ px (shorter side $800$ px). Training is performed on a single Tesla P100 with 32 GB RAM, using SGD with a learning rate of $0.02$ and momentum $0.9$, over 12 epochs with staged decay (Le et al., 2021).
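In PyTorch terms, the reported recipe corresponds roughly to the following; the decay milestones at epochs 8 and 11 are an assumption (the conventional "1x" detection schedule), since the text states only "staged decay".

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for any benchmarked FPN-ResNet50 detector

optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

# Staged decay over 12 epochs; milestones [8, 11] follow the conventional
# "1x" detection schedule and are an assumption, not stated above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[8, 11], gamma=0.1
)

for epoch in range(12):
    optimizer.step()  # placeholder for the per-batch training loop
    scheduler.step()
```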

6. Quantitative and Qualitative Results

On the Test-Development set, BlendMask achieves the highest detection AP ($87.0$) and mask AP ($89.2$), with the lowest oLRP. Mask R-CNN obtains a detection AP of $79.2$ and mask AP of $83.6$. Two-stage models exhibit fewer false positives but more false negatives, while single-stage/dense models better localize small faces. Scale breakdowns reveal substantially higher AP for medium and large faces than for small faces (e.g., BlendMask $AP_S = 32.7$, $AP_M = 86.3$, $AP_L = 88.0$) (Le et al., 2021). On the Test-Challenge set with unseen augmentations, overall AP drops by approximately $35$ points across methods (e.g., BlendMask $AP = 53.9$, $AP_{\mathrm{mask}} = 54.0$); YOLACT++ proves comparatively robust, with a slightly less pronounced AP drop. Qualitative analysis shows that current models struggle with extreme occlusions, severe color distortions, and blurred manipulations, especially for tiny faces ($<16$ px diagonal) (Le et al., 2021).

7. Research Challenges and Future Opportunities

Findings indicate that dataset scale and scenario complexity expose significant weaknesses in current detectors, particularly for small faces and under cross-domain perturbation. Anchor-free and dense-mask methods (e.g., BlendMask, SOLO) excel under standard conditions but are less robust to severe artifact introduction. Two-stage detectors better minimize false negatives but are computationally costlier. Prominent research avenues include: robust small-face detection (e.g., multiscale feature fusion, specialized detection heads), forgery boundary refinement (integrating edge detection), enhanced explainability (visual localization of forgery cues), and domain adaptation (perturbation-aware augmentation or adversarial training to increase generalization) (Le et al., 2021).
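One listed avenue, perturbation-aware augmentation, can be approximated with off-the-shelf transforms that mirror the Test-Challenge corruptions; the specific albumentations transforms and probabilities below are our illustrative choices, not the paper's recipe.

```python
import albumentations as A

# Illustrative augmentation pipeline loosely mirroring the Test-Challenge
# corruptions (color transforms, compression, blur/sharpen, noise,
# synthetic weather); transform choices and probabilities are assumptions.
train_augment = A.Compose([
    A.ColorJitter(p=0.3),       # color transforms
    A.ImageCompression(p=0.3),  # compression artifacts
    A.GaussianBlur(p=0.2),      # blur
    A.Sharpen(p=0.2),           # sharpen
    A.GaussNoise(p=0.2),        # additive noise
    A.RandomFog(p=0.1),         # synthetic weather
])

# Usage: augmented = train_augment(image=image)["image"]
```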

OpenForensics establishes a comprehensive, richly annotated platform for advancing the state of multi-face forgery detection and segmentation under real-world, unconstrained conditions, and offers a foundation for future progress in both deepfake prevention and general face analysis (Le et al., 2021).

References

Le, T.-N., Nguyen, H. H., Yamagishi, J., & Echizen, I. (2021). OpenForensics: Large-Scale Challenging Dataset for Multi-Face Forgery Detection and Segmentation In-The-Wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).