FantasyIDiap: ID Document Forgery Benchmark
- FantasyIDiap is a high-fidelity corpus of identity document images offering pixel-level annotations to support both binary forgery detection and precise localization of altered regions.
- The dataset is designed with dual-task annotations for classification and segmentation, enabling joint optimization for digital manipulations like face swapping and text inpainting.
- Statistical uniformity across 10 languages and three acquisition devices supports reliable evaluation and cross-device, cross-lingual generalization in realistic KYC settings.
The FantasyIDiap dataset is a high-fidelity corpus of identity document images specifically constructed to enable robust research into detection and localization of synthetic manipulations—principally face swapping and text inpainting—under realistic, multimodal acquisition conditions. It offers pixel-level annotations and supports both classification and segmentation tasks, establishing itself as a benchmark for evaluating joint detection-localization models in a multilingual, device-balanced KYC (Know Your Customer) context (Naseeb et al., 19 Jan 2026).
1. Dataset Design and Construction
FantasyIDiap comprises 2,358 JPEG images: 786 bona-fide (unaltered) identity documents, physically printed and then captured, plus 1,572 digitally manipulated counterparts, for a bona-fide-to-attack ratio of 1:2. Each bona-fide image originates from a unique ID card, with synthetic manipulations applied to create attack samples. The attack scenarios use two digital manipulation methods: “digital_1” (face swapping) and “digital_2” (text inpainting). The resulting set is curated to uniformly represent 10 languages—Russian, Ukrainian, Persian, Hindi, Arabic, French, English, Portuguese, Chinese, and Turkish—and three acquisition modalities: Huawei Mate 30 smartphone, Apple iPhone 15 Pro, and flatbed scanner, with each device contributing equally to the dataset.
Language and device distributions are controlled for statistical uniformity, with language coefficient of variation (CV) at 0.176 and device CV at 0.000, ensuring even representation across evaluation splits (Naseeb et al., 19 Jan 2026).
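The coefficient of variation reported above is the ratio of the standard deviation to the mean of per-category image counts; a CV of 0.000 means every device contributes exactly the same number of images. A minimal sketch of this check (the counts below are illustrative placeholders, not the published per-language tallies):

```python
import numpy as np

def coefficient_of_variation(counts):
    """CV = std / mean of per-category image counts."""
    counts = np.asarray(counts, dtype=float)
    return counts.std() / counts.mean()

# Three devices contributing exactly equally yields a CV of 0.0,
# matching the reported device CV for FantasyIDiap.
device_counts = [786, 786, 786]
print(coefficient_of_variation(device_counts))  # 0.0
```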
All images are JPEGs acquired at a mean resolution of 2,650 × 1,670 px. Bona-fide images average 2,692 × 1,686 px (1.15 MB), while manipulated images average 2,639 × 1,650 px (0.41 MB).
2. Annotation Protocol and Task Support
Every FantasyIDiap image is annotated for two supervised learning objectives:
- Binary detection: A label (0 for bona-fide, 1 for manipulated) designates whether the document is authentic or synthetically altered.
- Localization/segmentation: A pixel-level ground-truth mask delineates the exact region (face or text) subjected to manipulation, supporting precise evaluation of localization algorithms.
This dual-level annotation supports joint optimization of detection (classification) and segmentation objectives in deep learning models.
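The dual-level annotation implies that each training sample pairs an image with both a binary label and a pixel mask. A hypothetical sketch of such a sample record (field names and the rectangular-region helper are illustrative, not FantasyIDiap's actual loading code):

```python
import numpy as np

def make_sample(image, manipulated_region=None):
    """Return (image, binary_label, pixel_mask) for joint detection/segmentation training.

    manipulated_region is an illustrative (y0, x0, y1, x1) box; real masks
    in the dataset delineate arbitrary face or text regions.
    """
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    if manipulated_region is not None:
        y0, x0, y1, x1 = manipulated_region
        mask[y0:y1, x0:x1] = 1          # 1 marks tampered pixels
    label = int(mask.any())             # 0 = bona fide, 1 = manipulated
    return image, label, mask

img = np.zeros((100, 160, 3), dtype=np.uint8)
_, label, mask = make_sample(img, manipulated_region=(10, 20, 40, 80))
print(label, mask.sum())  # 1 1800  (30 x 60 tampered pixels)
```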
3. Data Split Regime and Summary Statistics
Data is partitioned into training (70%), validation (15%), and held-out test (15%) sets, yielding splits of 1,650, 354, and 354 images, respectively. Class balance and language/device uniformity are rigorously maintained per split, supporting statistically meaningful cross-device and cross-lingual generalization analysis.
| Split | Real | Manipulated | Total |
|---|---|---|---|
| Train | 550 | 1,100 | 1,650 |
| Validation | 118 | 236 | 354 |
| Test | 118 | 236 | 354 |
Mean image dimensions and data distributions per split are preserved to match the overall dataset, minimizing confounds in generalization experiments.
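The published split sizes can be checked directly against the 70/15/15 regime and the 1:2 bona-fide-to-attack ratio; a minimal sanity check using the table's figures:

```python
# Verify the published split sizes against the 70/15/15 regime
# and the 1:2 bona-fide-to-attack ratio.
total = 2358
splits = {"train": (550, 1100), "val": (118, 236), "test": (118, 236)}

for name, (real, fake) in splits.items():
    assert fake == 2 * real, f"{name}: ratio is not 1:2"

sizes = [real + fake for real, fake in splits.values()]
assert sum(sizes) == total
print([round(s / total, 3) for s in sizes])  # [0.7, 0.15, 0.15]
```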
4. Evaluation Metrics
FantasyIDiap supports rigorous joint evaluation of detection and localization models using established metrics:
- Classification accuracy: $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
- Area Under the ROC Curve (AUC): Measures separability of real vs. manipulated classes.
- F₁-score: $F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Dice similarity coefficient (segmentation): $\text{Dice}(P, G) = \dfrac{2\,|P \cap G|}{|P| + |G|}$, where $P$ is the predicted mask and $G$ the ground-truth mask.
These metrics enable comprehensive quantitative analysis of both the model's detection (classification) and localization (segmentation) performance.
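These metrics have standard definitions and can be sketched compactly; the AUC below uses the Mann-Whitney formulation (probability that a random positive outscores a random negative) and, as a simplification, does not handle tied scores:

```python
import numpy as np

def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def f1_score(y_true, y_pred):
    # Equivalent to 2PR/(P+R) written in terms of TP/FP/FN counts.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn)

def dice(pred_mask, gt_mask):
    # 2|P ∩ G| / (|P| + |G|) over binary masks.
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return 2 * inter / (pred_mask.sum() + gt_mask.sum())

def auc(y_true, scores):
    # Mann-Whitney AUC; ties between scores are not handled here.
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    return (pos[:, None] > neg[None, :]).mean()

print(dice([1, 1, 0, 0], [1, 0, 0, 0]))  # 2*1 / (2 + 1) ≈ 0.667
```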
5. Application in Deep Manipulation Detection Research
FantasyIDiap has been central to the evaluation of TwoHead-SwinFPN—a unified, dual-head architecture that jointly performs binary manipulation detection and precise localization via a Swin Transformer backbone with Feature Pyramid Network (FPN), UNet-style decoder, and CBAM attention modules. The model is trained using an uncertainty-weighted multi-task loss combining focal loss for detection and a blend of Dice, auxiliary, and boundary losses for segmentation (Naseeb et al., 19 Jan 2026):
- Joint Optimization: The architecture uses binary labels and pixel-level masks for end-to-end training.
- Performance on Test Set:
- Accuracy: 84.31%
- AUC: 90.78%
  - F₁-score (binary): 88.61%
- Mean Dice score (segmentation): 57.24%
- Cross-Device Generalization:
  - Huawei Mate 30: 85.2% accuracy / 89.5% F₁
  - iPhone 15 Pro: 84.1% / 88.2%
  - Scanner: 83.8% / 87.9%
Ablation studies report that architectural choices such as the Swin-Large backbone (+3.3% accuracy), FPN (+1.3%), CBAM attention (+1.1%), and uncertainty weighting (+0.6% F₁, +1.1% Dice) systematically improve results.
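An uncertainty-weighted multi-task loss of the kind described above is commonly formulated in the style of Kendall et al. (2018), with learned log-variances per task; the paper's exact formulation may differ, so the sketch below is an assumption:

```python
import numpy as np

def uncertainty_weighted_loss(l_det, l_seg, s_det, s_seg):
    """Combine detection and segmentation losses with learned log-variances.

    s_det and s_seg are per-task log-variance parameters: each task loss is
    down-weighted by exp(-s), while the +s regularizer keeps the effective
    weights from collapsing to zero. This is the Kendall-style scheme, used
    here as a stand-in for the paper's formulation.
    """
    return np.exp(-s_det) * l_det + np.exp(-s_seg) * l_seg + s_det + s_seg

# With zero log-variances the combination reduces to a plain sum.
print(uncertainty_weighted_loss(0.4, 0.9, 0.0, 0.0))  # 1.3
```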
6. Relation to the FantasyID Baseline and Other Benchmarks
FantasyIDiap is closely aligned with the broader FantasyID initiative (Korshunov et al., 28 Jul 2025), which introduces a larger (3,743-image) public dataset for ID document forgery detection, emphasizing pristine bona-fide exemplars and digital-only face/text attacks across multiple languages and acquisition modalities. However, FantasyIDiap distinguishes itself by providing pixel-level masks for segmentation, which are absent in the original FantasyID, and is constructed around a specific experimental protocol for joint detection-localization (Naseeb et al., 19 Jan 2026).
FantasyID benchmarks report significant challenges for state-of-the-art detectors even under controlled conditions: at a fixed operational threshold (FPR = 10%), false negative rates approach 50% for several contemporary models (TruFor, MMFusion, UniFD, FatFormer). The difficulty stems from subtle manipulation methods—face swaps with Gaussian-blur blending, localized text edits—and the high-resolution naturalism of document imagery, suggesting FantasyIDiap's high value for advancing robust detection techniques (Korshunov et al., 28 Jul 2025).
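Evaluating at a fixed operating FPR, as in the benchmark above, means choosing the score threshold from the bona-fide distribution and then measuring the miss rate on attacks. A minimal sketch of that protocol (the quantile-based thresholding is an assumption about how the operating point is set):

```python
import numpy as np

def fnr_at_fpr(bona_fide_scores, attack_scores, target_fpr=0.10):
    """False negative rate at a fixed false positive rate.

    The threshold is placed at the (1 - FPR) quantile of bona-fide scores,
    so that target_fpr of bona-fide samples are falsely flagged; scores
    above the threshold are predicted 'manipulated'.
    """
    bona_fide_scores = np.sort(np.asarray(bona_fide_scores))
    threshold = np.quantile(bona_fide_scores, 1.0 - target_fpr)
    attack_scores = np.asarray(attack_scores)
    return (attack_scores <= threshold).mean()

rng = np.random.default_rng(0)
bona = rng.normal(0.3, 0.1, 1000)     # synthetic bona-fide scores
attack = rng.normal(0.5, 0.2, 1000)   # synthetic attack scores
print(round(fnr_at_fpr(bona, attack), 2))
```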
7. Limitations, Challenges, and Future Extensions
Key limitations of FantasyIDiap, as inherited from the underlying design philosophy, include the exclusive focus on digital injection attacks (no analog print→attack→recapture scenario), controlled lighting/acquisition settings (no off-angle or adverse illumination), and absence of presentation-attack variants (e.g., screen versus paper presentation).
While the dataset is substantial for digital forgery detection and localization, robustness to broader adversarial conditions or physical forgeries remains an open research direction. Suggested future extensions identified in related work include annotating more complex attack types (global morphing, pattern edits) and adding more adverse acquisition conditions for comprehensive domain robustness (Korshunov et al., 28 Jul 2025).
References
- "TwoHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents" (Naseeb et al., 19 Jan 2026)
- "FantasyID: A dataset for detecting digital manipulations of ID-documents" (Korshunov et al., 28 Jul 2025)