SMDD: Synthetic Morphing Attack Detection Dataset
- SMDD is a suite of synthetic datasets in which both bona fide and morphed facial images are generated with advanced generative models, supporting scalable morphing attack detection research.
- It employs StyleGAN2-based synthesis, landmark and GAN-driven morph generation, and simulated print/scan processes to mimic real-world operational conditions while ensuring privacy.
- The dataset design enforces strict identity-disjoint splits and ISO/IEC 30107-3 compliant evaluation protocols, facilitating reproducible assessments in both single-image and differential MAD scenarios.
The Synthetic Morphing Attack Detection Development Dataset (SMDD) is a suite of datasets and methodologies designed to provide privacy-preserving, large-scale, and diverse benchmarks for training and evaluating face morphing attack detection (MAD) algorithms. Its core premise is the synthetic generation of both bona fide and attack images, thereby obviating the legal and ethical constraints of handling real biometric data. SMDD, as formalized in a series of works, comprises multiple generations and variants, each contributing technical innovations in synthetic data generation, morphing procedures, simulated print-scan (PS) processes, and evaluation protocols. The dataset has become foundational for research in single-image and differential morphing attack detection, supporting both classical and deep learning paradigms.
1. Definition, Motivations, and Privacy Considerations
SMDD defines a class of datasets where both bona fide ("real") face images and morphing attacks are generated synthetically—primarily using StyleGAN2-based architectures—circumventing the handling of sensitive personal data. This approach enables unrestricted data sharing for academic purposes and directly addresses legal barriers arising from the General Data Protection Regulation (GDPR), which requires extensive compliance documentation and imposes limitations on the use of real biometric datasets. By forgoing any real or even demographically labeled identities, SMDD datasets are intrinsically privacy-friendly and can be distributed under simple academic licenses (Damer et al., 2022).
The creation of SMDD was motivated by:
- The lack of large, public, diverse morphing attack datasets.
- The manual and laborious nature of print/scan data creation.
- The need for more realistic training data representing operational use cases (e.g., passport and border control pipelines) (Raja et al., 2020, Tapia et al., 2024).
2. Synthetic Image and Morph Generation Pipelines
a. Bona Fide Image Synthesis
Bona fide images are synthesized using StyleGAN2 (or StyleGAN2-ADA) architectures, typically pretrained on FFHQ 1024×1024 images. Synthetic samples are generated by sampling latent vectors z ~ N(0, I) and mapping them through the generator to images x = G(z). Quality assurance employs metrics such as CR-FIQA or VGGFace2-based diversity filtering, with only high-quality, identity-diverse images retained (Damer et al., 2022, Zhang et al., 2024).
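The sample-generate-filter loop described above can be sketched as follows. Since reproducing StyleGAN2 and CR-FIQA is out of scope here, `generator` and `quality_fn` are caller-supplied stand-ins; all names and the threshold value are illustrative, not taken from the SMDD release.

```python
import random

def sample_latent(dim=512, rng=None):
    """Sample a StyleGAN2-style latent vector z ~ N(0, I)."""
    rng = rng or random.Random(0)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def generate_bona_fide(n_samples, generator, quality_fn, threshold=0.5):
    """Generate candidate images and keep only those whose quality score
    passes the threshold, mirroring SMDD's CR-FIQA-style filtering step.
    `generator` stands in for StyleGAN2 synthesis (hypothetical interface)."""
    kept = []
    rng = random.Random(42)
    for _ in range(n_samples):
        z = sample_latent(rng=rng)
        img = generator(z)  # placeholder for G(z)
        if quality_fn(img) >= threshold:
            kept.append(img)
    return kept
```

In the real pipeline the retained set is additionally screened for identity diversity (e.g., via VGGFace2 embeddings), which a production version would add as a second filter pass.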
In more advanced pipelines, latent "neutralization" is applied to remove variation in pose, illumination, or expression by subtracting projections along specific directions in latent space, using SVM hyperplanes for the corresponding factors (Zhang et al., 2024).
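A minimal sketch of this neutralization step, assuming the semantic directions (pose, illumination, expression) have already been obtained as unit-norm SVM hyperplane normals: each factor is removed by subtracting the latent's projection onto the corresponding normal, z' = z - (z·n)n.

```python
def neutralize(z, directions):
    """Remove variation along given latent directions (e.g., pose,
    illumination, expression): z' = z - (z . n) n for each unit normal n.
    `directions` are assumed to be unit-norm SVM hyperplane normals."""
    out = list(z)
    for n in directions:
        dot = sum(a * b for a, b in zip(out, n))      # projection length
        out = [a - dot * b for a, b in zip(out, n)]   # subtract projection
    return out
```

After neutralization the latent has zero component along every supplied direction, so the synthesized face no longer varies in those factors.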
b. Morph Generation
Morphing between synthetic identities is achieved via:
- Landmark-based OpenCV pipelines: Dlib 68-point landmarks, Delaunay triangulation, affine warping, and blending with weight α (typically α = 0.5), filtering artifacts through both automatic and manual steps.
- GAN-based morphs (latest releases): Latent interpolations in StyleGAN2 or use of architectures like MIPGAN-II, followed by network-in-the-loop refinement (Zhang et al., 2024).
Morphs are generated exclusively from synthetic parents, with no inclusion of real faces or associated metadata in the primary SMDD releases.
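Both morphing routes reduce to a weighted interpolation, differing in where it is applied. A minimal sketch (the landmark detection and warping stages are omitted; α = 0.5 is the common convention, an assumption here):

```python
def landmark_morph(img_a, img_b, alpha=0.5):
    """Cross-dissolve step of a landmark-based morph: after both images
    have been warped to the averaged landmark geometry (warping omitted),
    pixels are blended as (1 - alpha) * a + alpha * b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(img_a, img_b)]

def latent_morph(z_a, z_b, alpha=0.5):
    """GAN-based morph: interpolate the parents' latent codes before
    synthesis, z_m = (1 - alpha) * z_a + alpha * z_b, as in StyleGAN2 /
    MIPGAN-style pipelines; the generator then renders G(z_m)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]
```

In practice the latent route is followed by network-in-the-loop refinement (e.g., identity-loss optimization in MIPGAN-II), which this sketch does not include.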
c. Simulating Print/Scan and Real-World Variation
To address operational nuances, extended SMDD pipelines introduce digital simulation of print-scan artifacts:
- GAN-based print/scan simulation using paired (pix2pix) or unpaired (CycleGAN) architectures. Inputs are digital and physically printed/scanned face pairs, with network outputs simulating high-fidelity PS textures at 600 dpi. Losses comprise adversarial, L1-reconstruction, and (for CycleGAN) cycle-consistency terms (Tapia et al., 2024).
- Texture-transfer simulation via handcrafted extraction of scanner/printer artifacts from color-patch sweeps, which are then added pixel-wise to digital faces, controlling for artifact blending (Tapia et al., 2024).
Parameters such as scan/paper type, device diversity, and augmentation protocols (jitter, rotation, recompression) are incorporated to maximize generalization.
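For the paired (pix2pix-style) case above, the generator objective combines the adversarial term with a weighted L1 reconstruction against the real printed/scanned target. A minimal sketch; the weighting λ = 100 follows the common pix2pix default and is an assumption, not a value taken from the SMDD papers:

```python
import math

def bce(pred, target):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    eps = 1e-12
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))

def ps_simulation_loss(d_fake, fake_img, real_ps_img, lam=100.0):
    """Pix2pix-style generator objective for print/scan simulation:
    adversarial term (generator wants D(fake) -> 1) plus lam-weighted
    mean L1 distance to the real printed/scanned image."""
    adv = bce(d_fake, 1.0)
    l1 = sum(abs(f - r) for f, r in zip(fake_img, real_ps_img)) / len(fake_img)
    return adv + lam * l1
```

The unpaired (CycleGAN) variant would replace the L1 term with cycle-consistency losses in both directions, since no pixel-aligned target exists.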
3. Dataset Composition, Scale, and File Organization
The seminal SMDD (2022) version contains 50,000 bona fide and 30,000 attack samples, all at 1024×1024 resolution, split equally into development (training) and evaluation partitions. Morphed images are generated from documented source pairs (published as pair_lists) for full reproducibility, and no real-world demographic labels are included (Damer et al., 2022).
Second-generation datasets ("SynMorph" [Editor's term: SMDDv2]) scale to 2,350 identities (1,175 M, 1,175 F), providing over 250,000 non-morph and 115,000 morph samples per morph engine (landmark-based LMA-UBO and GAN-based MIPGAN-II) (Zhang et al., 2024). Splits are maintained strictly identity-disjoint for train/dev/test; mated samples support both single-image and differential MAD evaluation.
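The identity-disjoint constraint means splitting must be done over identities, not over images. A minimal sketch of such a partition; the split fractions and the `(identity_id, path)` sample format are illustrative, not the protocol's actual parameters:

```python
import random

def identity_disjoint_split(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Partition samples into train/dev/test so that no identity appears
    in more than one split. `samples` is a list of (identity_id, path)
    tuples; `fractions` should sum to 1."""
    ids = sorted({ident for ident, _ in samples})
    random.Random(seed).shuffle(ids)
    n = len(ids)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    bucket = {i: "train" for i in ids[:cut1]}
    bucket.update({i: "dev" for i in ids[cut1:cut2]})
    bucket.update({i: "test" for i in ids[cut2:]})
    splits = {"train": [], "dev": [], "test": []}
    for ident, path in samples:
        splits[bucket[ident]].append((ident, path))
    return splits
```

Morphs must then inherit the split of *both* parent identities (a morph whose parents straddle two splits would be discarded), a check the sketch leaves to the caller.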
A summary of SMDD and SynMorph scale:
| Version | Bona Fide | Morph Samples | Morph Types | Real Images | Notes |
|---|---|---|---|---|---|
| SMDD (2022) | 50,000 | 30,000 | OpenCV-LMA | None | 1024×1024, full synthetic, LMA only |
| SynMorph (2024) | ~500,000 | ~230,000 | LMA-UBO, MIPGAN | None | Pose/illum-neutralization, D-MAD ready |
In all open versions, file organization is systematic (e.g., train/bona_fide, train/morphed, eval/bona_fide), with reproducible source-pair documentation.
4. Evaluation Protocols and MAD Benchmarking
SMDD is designed to be compatible with ISO/IEC 30107-3 standards, providing for both single-image (S-MAD) and differential (D-MAD) attack detection scenarios:
- S-MAD: Evaluates classifiers distinguishing bona fide from morphed images based solely on enrollment images, with metrics including APCER, BPCER, and EER.
- D-MAD: Uses a trusted reference capture (e.g., a live image taken at a border gate) and compares it against a possibly morphed enrollment image (Raja et al., 2020, Zhang et al., 2024).
Performance is reported for baselines including Inception-MAD, MixFaceNet, PW-MAD (U-Net), MorphHRNet (HRNet), and classical feature-based SVMs. The datasets enable algorithmic validation on real-world operational datasets (FRLL-Morph, MAD22, etc.), highlighting generalization performance. Example results (Damer et al., 2022):
- MixFaceNet trained on SMDD achieves EER of 4.39% on FRLL-OpenCV morphs, surpassing analogously trained classifiers on real datasets.
Protocols enforce strict identity-disjoint evaluation and transparent reporting across operating points, including BPCER at fixed APCER thresholds.
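The reported metrics can be computed from classifier scores as sketched below. The score convention (higher score = more bona fide-like, accept when score ≥ threshold) is an assumption of this sketch; ISO/IEC 30107-3 defines the rates themselves, not the polarity.

```python
def apcer(attack_scores, thr):
    """APCER: proportion of morph attacks wrongly accepted as bona fide."""
    return sum(s >= thr for s in attack_scores) / len(attack_scores)

def bpcer(bona_fide_scores, thr):
    """BPCER: proportion of bona fide samples wrongly rejected."""
    return sum(s < thr for s in bona_fide_scores) / len(bona_fide_scores)

def eer(bona_fide_scores, attack_scores):
    """Equal Error Rate: sweep thresholds, take the point where the two
    error rates are closest, and report their mean."""
    thrs = sorted(set(bona_fide_scores) | set(attack_scores))
    best = min(thrs, key=lambda t: abs(apcer(attack_scores, t) - bpcer(bona_fide_scores, t)))
    return (apcer(attack_scores, best) + bpcer(bona_fide_scores, best)) / 2

def bpcer_at_apcer(bona_fide_scores, attack_scores, target=0.10):
    """BPCER at a fixed APCER operating point (e.g., BPCER@APCER=10%):
    the lowest threshold still holding APCER at or below the target."""
    for t in sorted(set(bona_fide_scores) | set(attack_scores)):
        if apcer(attack_scores, t) <= target:
            return bpcer(bona_fide_scores, t)
    return 1.0
```

A production harness would interpolate between observed thresholds rather than sweep only the score values, but the fixed-operating-point logic is the same.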
5. Strengths, Limitations, and Impact
a. Advantages
- Privacy: Full synthetic composition ensures compliance-free distribution and use, as the dataset contains no personal data.
- Scale: SMDD and SynMorph surpass real-data alternatives by orders of magnitude, especially in the number of unique identities and morphs (Damer et al., 2022, Zhang et al., 2024).
- Diversity (Synthetic): Multiple morphing operators, quality-controlled latent editing, and in new versions, print/scan and mated-sample simulation, enhance coverage.
- Realism: Digital PS simulation (GAN-based, texture-transfer) closes the domain gap towards physically printed/scanned attack samples (Tapia et al., 2024).
b. Limitations
- Absent demographic representation: age, gender, and ethnicity are not explicitly modeled in purely synthetic sets, so subgroup fairness and bias cannot be directly measured.
- Morph Process Bias: Early releases (e.g., OpenCV-LMA only) inadequately model the diversity of real-world morphing pipelines (e.g., GAN, diffusion), leading to reduced cross-attack generalization. Methods trained solely on SMDD may underperform on GAN-based or highly curated attacks (Paulo et al., 28 Jan 2026).
- Print/Scan Fidelity: While simulation pipelines improve scalability, their fidelity compared to real P&S devices may not capture all possible operational artifacts unless carefully modeled (Tapia et al., 2024).
6. Expansion, Platform Integration, and Community Usage
SMDD is supported by online evaluation platforms (e.g., FVC-onGoing) where sequestered, demographically diverse, partially synthetic datasets enable closed evaluation of submitted algorithms (Raja et al., 2020). These platforms enforce stateless algorithmic evaluation and reproduce ISO-standard detection reporting metrics.
Recommendations for further scaling include:
- Device and morphing operator diversity: 10 printer and scanner types, 8 morph engines spanning landmark, GAN, and professional software.
- Data augmentation: mild geometric/color/sensor-noise perturbations, JPEG recompression.
- Detailed metadata: for each image, record device, paper type, morph algorithm, and attack/genuine status, enabling stratified and fairness-oriented analysis (Tapia et al., 2024).
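The per-image metadata recommendation above could be realized as a simple typed record; the field names and example values here are illustrative, not a schema from the SMDD releases:

```python
from dataclasses import dataclass, asdict

@dataclass
class SampleMetadata:
    """Per-image record enabling stratified and fairness-oriented
    analysis (hypothetical schema; field names are illustrative)."""
    image_path: str
    label: str            # "bona_fide" or "morph"
    morph_algorithm: str  # e.g., "lma_ubo", "mipgan2", or "none"
    device: str           # printer/scanner model, or "digital"
    paper_type: str       # e.g., "glossy", "matte", or "digital"

# Example record for a digitally generated morph sample
record = SampleMetadata("train/morphed/000123.png", "morph",
                        "mipgan2", "digital", "digital")
```

Serializing records via `asdict` makes them trivial to export as CSV/JSON sidecar files for stratified evaluation.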
Recent works propose the development of “SMDD 2.0”/“SynMorph”, incorporating GAN-based morphs, mated-sample editing for D-MAD, and exhaustive operational scenario simulation (Zhang et al., 2024).
7. Legal, Ethical, and Reproducibility Aspects
SMDD offers a case study in privacy-preserving biometric research. By entirely avoiding real facial images, SMDD sidesteps GDPR restrictions on biometric data, removing requirements for informed consent, DPIAs, data protection officers, and subject withdrawal rights. Open-source licensing (“research use only”) is standard, with full reproducibility detailed via code and source pairs (Damer et al., 2022). This model is recommended for future biometric dataset releases where privacy constraints are paramount.
By systematically addressing challenges of privacy, scale, operational realism, and algorithmic benchmarking, SMDD and its successor datasets underpin a significant share of contemporary MAD research, enabling robust evaluation, reproducible experiments, and the exploration of emerging detection architectures (Damer et al., 2022, Raja et al., 2020, Tapia et al., 2024, Zhang et al., 2024, Paulo et al., 28 Jan 2026).