SMDD: Synthetic Morphing Attack Detection Dataset
- SMDD is a suite of synthetic datasets in which both bona fide and morphed facial images are generated with advanced generative models, supporting scalable morphing attack detection research.
- It employs StyleGAN2-based synthesis, landmark and GAN-driven morph generation, and simulated print/scan processes to mimic real-world operational conditions while ensuring privacy.
- The dataset design enforces strict identity-disjoint splits and ISO/IEC 30107-3 compliant evaluation protocols, facilitating reproducible assessments in both single-image and differential MAD scenarios.
The Synthetic Morphing Attack Detection Development Dataset (SMDD) is a suite of datasets and methodologies designed to provide privacy-preserving, large-scale, and diverse benchmarks for training and evaluating face morphing attack detection (MAD) algorithms. Its core premise is the synthetic generation of both bona fide and attack images, thereby obviating the legal and ethical constraints of handling real biometric data. SMDD, as formalized in a series of works, comprises multiple generations and variants, each contributing technical innovations in synthetic data generation, morphing procedures, simulated print-scan (PS) processes, and evaluation protocols. The dataset has become foundational for research in single-image and differential morphing attack detection, supporting both classical and deep learning paradigms.
1. Definition, Motivations, and Privacy Considerations
SMDD defines a class of datasets where both bona fide ("real") face images and morphing attacks are generated synthetically—primarily using StyleGAN2-based architectures—circumventing the handling of sensitive personal data. This approach enables unrestricted data sharing for academic purposes and directly addresses legal barriers arising from the General Data Protection Regulation (GDPR), which requires extensive compliance documentation and imposes limitations on the use of real biometric datasets. By forgoing any real or even demographically labeled identities, SMDD datasets are intrinsically privacy-friendly and can be distributed under simple academic licenses (Damer et al., 2022).
The creation of SMDD was motivated by:
- The lack of large, public, diverse morphing attack datasets.
- The manual and laborious nature of print/scan data creation.
- The need for more realistic training data representing operational use cases (e.g., passport and border control pipelines) (Raja et al., 2020, Tapia et al., 2024).
2. Synthetic Image and Morph Generation Pipelines
a. Bona Fide Image Synthesis
Bona fide images are synthesized using StyleGAN2 (or StyleGAN2-ADA) architectures, typically pretrained on FFHQ 1024×1024 images. Synthetic samples are generated by sampling latent vectors z ~ N(0, I) and mapping them through the generator to images x = G(z). Quality assurance employs metrics such as CR-FIQA or VGGFace2-based diversity filtering, with only high-quality, identity-diverse images retained (Damer et al., 2022, Zhang et al., 2024).
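The sample-generate-filter loop described above can be sketched as follows. Since reproducing StyleGAN2 and CR-FIQA is out of scope here, `generator` and `quality_fn` are caller-supplied stand-ins; all names and the threshold value are illustrative, not taken from the SMDD release.

```python
import random

def sample_latent(dim=512, rng=None):
    """Sample a StyleGAN2-style latent vector z ~ N(0, I)."""
    rng = rng or random.Random(0)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def generate_bona_fide(n_samples, generator, quality_fn, threshold=0.5):
    """Generate candidate images and keep only those whose quality score
    passes the threshold, mirroring SMDD's CR-FIQA-style filtering step.
    `generator` stands in for StyleGAN2 synthesis (hypothetical interface)."""
    kept = []
    rng = random.Random(42)
    for _ in range(n_samples):
        z = sample_latent(rng=rng)
        img = generator(z)  # placeholder for G(z)
        if quality_fn(img) >= threshold:
            kept.append(img)
    return kept
```

In the real pipeline the retained set is additionally screened for identity diversity (e.g., via VGGFace2 embeddings), which a production version would add as a second filter pass.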
In more advanced pipelines, latent "neutralization" is applied to remove variation in pose, illumination, or expression by subtracting projections along specific directions in latent space, using SVM hyperplanes for the corresponding factors (Zhang et al., 2024).
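A minimal sketch of this neutralization step, assuming the semantic directions (pose, illumination, expression) have already been obtained as unit-norm SVM hyperplane normals: each factor is removed by subtracting the latent's projection onto the corresponding normal, z' = z - (z·n)n.

```python
def neutralize(z, directions):
    """Remove variation along given latent directions (e.g., pose,
    illumination, expression): z' = z - (z . n) n for each unit normal n.
    `directions` are assumed to be unit-norm SVM hyperplane normals."""
    out = list(z)
    for n in directions:
        dot = sum(a * b for a, b in zip(out, n))      # projection length
        out = [a - dot * b for a, b in zip(out, n)]   # subtract projection
    return out
```

After neutralization the latent has zero component along every supplied direction, so the synthesized face no longer varies in those factors.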
b. Morph Generation
Morphing between synthetic identities is achieved via:
- Landmark-based OpenCV pipelines: Dlib 68-point landmarks, Delaunay triangulation, affine warping, and blending with weight α (typically α = 0.5), filtering artifacts through both automatic and manual steps.
- GAN-based morphs (latest releases): Latent interpolations in StyleGAN2 or use of architectures like MIPGAN-II, followed by network-in-the-loop refinement (Zhang et al., 2024).
Morphs are generated exclusively from synthetic parents, with no inclusion of real faces or associated metadata in the primary SMDD releases.
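Both morphing routes reduce to a weighted interpolation, differing in where it is applied. A minimal sketch (the landmark detection and warping stages are omitted; α = 0.5 is the common convention, an assumption here):

```python
def landmark_morph(img_a, img_b, alpha=0.5):
    """Cross-dissolve step of a landmark-based morph: after both images
    have been warped to the averaged landmark geometry (warping omitted),
    pixels are blended as (1 - alpha) * a + alpha * b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(img_a, img_b)]

def latent_morph(z_a, z_b, alpha=0.5):
    """GAN-based morph: interpolate the parents' latent codes before
    synthesis, z_m = (1 - alpha) * z_a + alpha * z_b, as in StyleGAN2 /
    MIPGAN-style pipelines; the generator then renders G(z_m)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]
```

In practice the latent route is followed by network-in-the-loop refinement (e.g., identity-loss optimization in MIPGAN-II), which this sketch does not include.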
c. Simulating Print/Scan and Real-World Variation
To address operational nuances, extended SMDD pipelines introduce digital simulation of print-scan artifacts:
- GAN-based print/scan simulation using paired (pix2pix) or unpaired (CycleGAN) architectures. Inputs are digital and physically printed/scanned face pairs, with network outputs simulating high-fidelity PS textures at 600 dpi. Losses comprise adversarial, L1-reconstruction, and (for CycleGAN) cycle-consistency terms (Tapia et al., 2024).
- Texture-transfer simulation via handcrafted extraction of scanner/printer artifacts from color-patch sweeps, which are then added pixel-wise to digital faces, controlling for artifact blending (Tapia et al., 2024).
Parameters such as scan/paper type, device diversity, and augmentation protocols (jitter, rotation, recompression) are incorporated to maximize generalization.
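For the paired (pix2pix-style) case above, the generator objective combines the adversarial term with a weighted L1 reconstruction against the real printed/scanned target. A minimal sketch; the weighting λ = 100 follows the common pix2pix default and is an assumption, not a value taken from the SMDD papers:

```python
import math

def bce(pred, target):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))

def ps_simulation_loss(d_fake, fake_img, real_ps_img, lam=100.0):
    """Pix2pix-style generator objective for print/scan simulation:
    adversarial term (generator wants D(fake) -> 1) plus lam-weighted
    mean L1 distance to the real printed/scanned image."""
    adv = bce(d_fake, 1.0)
    l1 = sum(abs(f - r) for f, r in zip(fake_img, real_ps_img)) / len(fake_img)
    return adv + lam * l1
```

The unpaired (CycleGAN) variant would replace the L1 term with cycle-consistency losses in both directions, since no pixel-aligned target exists.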
3. Dataset Composition, Scale, and File Organization
The seminal SMDD (2022) version contains 50,000 bona fide and 30,000 attack samples, all at 1024×1024 resolution, split equally into development (training) and evaluation partitions. Morphed images are generated from documented source pairs (published as pair_lists) for full reproducibility, and no real-world demographic labels are included (Damer et al., 2022).
Second-generation datasets ("SynMorph" [Editor's term: SMDDv2]) scale to 2,350 identities (1,175 M, 1,175 F), providing over 250,000 non-morph and 115,000 morph samples per morph engine (landmark-based LMA-UBO and GAN-based MIPGAN-II) (Zhang et al., 2024). Splits are maintained strictly identity-disjoint for train/dev/test; mated samples support both single-image and differential MAD evaluation.
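The identity-disjoint constraint means splitting must be done over identities, not over images. A minimal sketch of such a partition; the split fractions and the `(identity_id, path)` sample format are illustrative, not the protocol's actual parameters:

```python
import random

def identity_disjoint_split(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Partition samples into train/dev/test so that no identity appears
    in more than one split. `samples` is a list of (identity_id, path)
    tuples; `fractions` should sum to 1."""
    ids = sorted({ident for ident, _ in samples})
    random.Random(seed).shuffle(ids)
    n = len(ids)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    bucket = {i: "train" for i in ids[:cut1]}
    bucket.update({i: "dev" for i in ids[cut1:cut2]})
    bucket.update({i: "test" for i in ids[cut2:]})
    splits = {"train": [], "dev": [], "test": []}
    for ident, path in samples:
        splits[bucket[ident]].append((ident, path))
    return splits
```

Morphs must then inherit the split of *both* parent identities (a morph whose parents straddle two splits would be discarded), a check the sketch leaves to the caller.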
A summary of SMDD and SynMorph scale:
| Version | Bona Fide | Morph Samples | Morph Types | Real Images | Notes |
|---|---|---|---|---|---|
| SMDD (2022) | 50,000 | 30,000 | OpenCV-LMA | None | 1024×1024, full synthetic, LMA only |
| SynMorph (2024) | ~500,000 | ~230,000 | LMA-UBO, MIPGAN | None | Pose/illum-neutralization, D-MAD ready |
In all open versions, file organization is systematic (e.g., train/bona_fide, train/morphed, eval/bona_fide), with reproducible source-pair documentation.
4. Evaluation Protocols and MAD Benchmarking
SMDD is designed to be compatible with ISO/IEC 30107-3 standards, providing for both single-image (S-MAD) and differential (D-MAD) attack detection scenarios:
- S-MAD: Evaluates classifiers distinguishing bona fide from morphed images based solely on enrollment images, with metrics including APCER, BPCER, and EER.
- D-MAD: Uses a trusted reference capture (e.g., a live image taken at a border gate) and compares it against a possibly morphed enrollment image (Raja et al., 2020, Zhang et al., 2024).
Performance is reported for baselines including Inception-MAD, MixFaceNet, PW-MAD (U-Net), MorphHRNet (HRNet), and classical feature-based SVMs. The datasets enable algorithmic validation on real-world operational datasets (FRLL-Morph, MAD22, etc.), highlighting generalization performance. Example results (Damer et al., 2022):
- MixFaceNet trained on SMDD achieves EER of 4.39% on FRLL-OpenCV morphs, surpassing analogously trained classifiers on real datasets.
Protocols enforce strict identity-disjoint evaluation and transparent reporting across operating points, including BPCER at fixed APCER thresholds.
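The reported metrics can be computed from classifier scores as sketched below. The score convention (higher score = more bona fide-like, accept when score ≥ threshold) is an assumption of this sketch; ISO/IEC 30107-3 defines the rates themselves, not the polarity.

```python
def apcer(attack_scores, thr):
    """APCER: proportion of morph attacks wrongly accepted as bona fide."""
    return sum(s >= thr for s in attack_scores) / len(attack_scores)

def bpcer(bona_fide_scores, thr):
    """BPCER: proportion of bona fide samples wrongly rejected."""
    return sum(s < thr for s in bona_fide_scores) / len(bona_fide_scores)

def eer(bona_fide_scores, attack_scores):
    """Equal Error Rate: sweep thresholds, take the point where the two
    error rates are closest, and report their mean."""
    thrs = sorted(set(bona_fide_scores) | set(attack_scores))
    best = min(thrs, key=lambda t: abs(apcer(attack_scores, t) - bpcer(bona_fide_scores, t)))
    return (apcer(attack_scores, best) + bpcer(bona_fide_scores, best)) / 2

def bpcer_at_apcer(bona_fide_scores, attack_scores, target=0.10):
    """BPCER at a fixed APCER operating point (e.g., BPCER@APCER=10%):
    the lowest threshold still holding APCER at or below the target."""
    for t in sorted(set(bona_fide_scores) | set(attack_scores)):
        if apcer(attack_scores, t) <= target:
            return bpcer(bona_fide_scores, t)
    return 1.0
```

A production harness would interpolate between observed thresholds rather than sweep only the score values, but the fixed-operating-point logic is the same.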
5. Strengths, Limitations, and Impact
a. Advantages
- Privacy: Full synthetic composition ensures compliance-free distribution and use, as the dataset contains no personal data.
- Scale: SMDD and SynMorph surpass real-data alternatives by orders of magnitude, especially in the number of unique identities and morphs (Damer et al., 2022, Zhang et al., 2024).
- Diversity (Synthetic): Multiple morphing operators, quality-controlled latent editing, and in new versions, print/scan and mated-sample simulation, enhance coverage.
- Realism: Digital PS simulation (GAN-based, texture-transfer) closes the domain gap towards physically printed/scanned attack samples (Tapia et al., 2024).
b. Limitations
- Absent demographic representation: age, gender, and ethnicity are not explicitly modeled in purely synthetic sets, so subgroup fairness and bias cannot be directly measured.
- Morph Process Bias: Early releases (e.g., OpenCV-LMA only) inadequately model the diversity of real-world morphing pipelines (e.g., GAN, diffusion), leading to reduced cross-attack generalization. Methods trained solely on SMDD may underperform on GAN-based or highly curated attacks (Paulo et al., 28 Jan 2026).
- Print/Scan Fidelity: While simulation pipelines improve scalability, their fidelity compared to real P&S devices may not capture all possible operational artifacts unless carefully modeled (Tapia et al., 2024).
6. Expansion, Platform Integration, and Community Usage
SMDD is supported by online evaluation platforms (e.g., FVC-onGoing) where sequestered, demographically diverse, partially synthetic datasets enable closed evaluation of submitted algorithms (Raja et al., 2020). These platforms enforce stateless algorithmic evaluation and reproduce ISO-standard detection reporting metrics.
Recommendations for further scaling include:
- Device and morphing operator diversity: 10 printer and scanner types, 8 morph engines spanning landmark, GAN, and professional software.
- Data augmentation: mild geometric/color/sensor-noise perturbations, JPEG recompression.
- Detailed metadata: for each image, record device, paper type, morph algorithm, and attack/genuine status, enabling stratified and fairness-oriented analysis (Tapia et al., 2024).
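The per-image metadata recommendation above could be realized as a simple typed record; the field names and example values here are illustrative, not a schema from the SMDD releases:

```python
from dataclasses import dataclass, asdict

@dataclass
class SampleMetadata:
    """Per-image record enabling stratified and fairness-oriented
    analysis (hypothetical schema; field names are illustrative)."""
    image_path: str
    label: str            # "bona_fide" or "morph"
    morph_algorithm: str  # e.g., "lma_ubo", "mipgan2", or "none"
    device: str           # printer/scanner model, or "digital"
    paper_type: str       # e.g., "glossy", "matte", or "digital"

# Example record for a digitally generated morph sample
record = SampleMetadata("train/morphed/000123.png", "morph",
                        "mipgan2", "digital", "digital")
```

Serializing records via `asdict` makes them trivial to export as CSV/JSON sidecar files for stratified evaluation.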
Recent works propose the development of “SMDD 2.0”/“SynMorph”, incorporating GAN-based morphs, mated-sample editing for D-MAD, and exhaustive operational scenario simulation (Zhang et al., 2024).
7. Legal, Ethical, and Reproducibility Aspects
SMDD offers a case study in privacy-preserving biometric research. By entirely avoiding real facial images, SMDD sidesteps GDPR restrictions on biometric data, removing requirements for informed consent, DPIAs, data protection officers, and subject withdrawal rights. Open-source licensing (“research use only”) is standard, with full reproducibility detailed via code and source pairs (Damer et al., 2022). This model is recommended for future biometric dataset releases where privacy constraints are paramount.
By systematically addressing challenges of privacy, scale, operational realism, and algorithmic benchmarking, SMDD and its successor datasets underpin a significant share of contemporary MAD research, enabling robust evaluation, reproducible experiments, and the exploration of emerging detection architectures (Damer et al., 2022, Raja et al., 2020, Tapia et al., 2024, Zhang et al., 2024, Paulo et al., 28 Jan 2026).