Synthetic Data for Defect Detection

Updated 6 May 2026

Data generation for defect detection synthesizes defect images to address sample scarcity, class imbalance, and the need for precise annotations.
Modern approaches employ physics-based simulations, GANs, VAEs, and diffusion models to create realistic and semantically controlled defective samples.
Empirical studies reveal that blending synthetic and real data improves detection metrics, supporting robust quality control in manufacturing.

Data generation for defect detection is a core technique in computer vision for manufacturing quality control, enabling the training and benchmarking of machine learning models when real annotated defect data are insufficient, expensive, or highly imbalanced. The field spans rule-based, physically motivated, and deep generative approaches, supporting classification, localization, and semantic segmentation tasks across diverse defect types, substrates, and inspection modalities. Advanced pipelines not only synthesize visually realistic and physically plausible defective samples, but also provide exact semantic annotations, condition on defect type and morphology, and match the sampling distributions of real production data.

1. Principles and Motivations for Synthetic Data Generation

Synthetic data generation addresses three key challenges in defect detection: scarcity of real defective samples, class imbalance (defect/non-defect), and the need for precise annotations. Industrial defects are inherently rare, and manual annotation is costly. Synthetic data supports:

Large-scale augmentation of training datasets to mitigate overfitting and improve generalization, especially for rare or novel defect modes (Yang et al., 2023).
Controlled study of model robustness under varying defect types, morphologies, and nuisance factors (e.g., lighting, noise) (Gutierrez et al., 2021).
Precise, pixel-perfect ground-truth masks or bounding boxes aligned with physical geometry (Jeziorski et al., 5 Feb 2026).

Approaches include physics-driven simulation (e.g., Monte Carlo for X-ray or ultrasound), rule-based graphics pipelines (Voronoi-based geometric modeling, procedural texture synthesis), and generative models (GANs, VAEs, diffusion models) supporting semantic conditioning and sample diversity.

2. Generative Deep Learning Frameworks

Modern defect image synthesis leverages deep generative models with explicit architectures and mathematical underpinnings:

Denoising Diffusion Probabilistic Models (DDPMs): These provide a principled framework for high-fidelity and diverse synthesis of defect images. The diffusion process is defined as a sequence of forward (noising) and learned reverse (denoising) Markov transitions, governed by

$q(x_{t}\mid x_{t-1}) = \mathcal{N}(x_{t}; \sqrt{\alpha_{t}}x_{t-1}, (1 - \alpha_{t})I)$

with a neural approximation of the reverse process trained by minimizing an “ε-matching” loss, i.e., the squared error between the injected noise and the network's noise prediction (Yang et al., 2023, Capogrosso et al., 2024). Class or attribute conditioning can be enforced through embedding injection or FiLM-style modulation, enabling generation of specific defect types, shapes, and locations. Two-stage diffusion (local patch texture, then global structure) further improves realism and diversity (Yang et al., 2023).

GANs: GAN frameworks synthesize defect images by adversarial learning. The generator maps latent noise to visually plausible samples, and the discriminator enforces realism via feedback, often with auxiliary detection discrimination for improved alignment with downstream tasks (Posilović et al., 2021, Huang et al., 12 Jan 2025, Phan et al., 2024, Rožanec et al., 2022).
VAEs: For domains with extremely small datasets, VAEs are used to reconstruct and sample in the vicinity of real examples within a regularized latent space, producing high-fidelity samples with minimal overfitting (Ferdousi et al., 2023).
Vision-language and prompt-driven conditioning: Large vision-language models (VLMs) can be used for automatic prompt construction, driving LoRA-adapted diffusion or mask-guided inpainting to insert synthetic defects corresponding to complex defect taxonomies (Kühn et al., 29 Apr 2026).
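
As a minimal sketch of the forward (noising) transition given above for DDPMs: composing the one-step Gaussians gives the closed form $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)I)$ with $\bar\alpha_t = \prod_{s \le t} \alpha_s$. The linear β schedule, its endpoint values, the toy four-pixel "image", and all function names below are illustrative assumptions, not details taken from the cited works. The all-zero prediction stands in for a trained denoising network.

```python
import math
import random

def linear_alphas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step alpha_t = 1 - beta_t for an assumed linear beta schedule."""
    if num_steps == 1:
        return [1.0 - beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [1.0 - (beta_start + i * step) for i in range(num_steps)]

def cumulative_alphas(alphas):
    """alpha_bar_t = product of alpha_s for all s up to and including t."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out

def noise_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_x = math.sqrt(alpha_bar_t)
    scale_e = math.sqrt(1.0 - alpha_bar_t)
    x_t = [scale_x * x + scale_e * e for x, e in zip(x0, eps)]
    return x_t, eps

def eps_matching_loss(eps_true, eps_pred):
    """Mean squared error between injected and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)

rng = random.Random(0)
alphas = linear_alphas(1000)
alpha_bars = cumulative_alphas(alphas)
x0 = [0.5, -0.25, 1.0, 0.0]            # toy flattened "image" (assumption)
x_t, eps = noise_sample(x0, alpha_bars[499], rng)
zero_pred = [0.0] * len(eps)            # placeholder for a trained network
loss = eps_matching_loss(eps, zero_pred)
```

In a real pipeline the placeholder prediction would come from a conditional network that receives x_t, the timestep, and the defect-class embedding.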

3. Rule-Based and Physics-Based Synthetic Pipelines

High-fidelity rule-based pipelines remain prominent, particularly for metal forming, composite manufacturing, and NDT imaging:

Parametric Model-Based Simulation: Defect cross-sections are generated using 2D Voronoi tessellations, Delaunay triangulations, and parametric splines, which are then extruded to 3D for digital twin construction (Jeziorski et al., 5 Feb 2026). These defects are merged via Boolean operations with base object meshes and rendered using physics-based Monte Carlo simulators or Blender path-tracing for optical, X-ray, or acoustic modalities.
Physically Accurate Data Generation: In X-ray defect detection, data generation pipelines employ Monte Carlo photon transport to accurately simulate scatter and material-dependent attenuation, mirroring the imaging physics of real systems (Andriiashen et al., 2023). The necessity of including physical effects (e.g., scattering) is quantified via probability of detection (POD) metrics.
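
To illustrate the idea of Voronoi-based defect cross-sections, the following rough sketch rasterizes a Voronoi diagram by nearest-seed assignment and merges the cells whose seeds fall near a chosen center into one irregular defect region. This is not the cited authors' pipeline: the grid size, seed count, selection rule, and all names are hypothetical, and a production system would use an exact tessellation followed by 3D extrusion and rendering.

```python
import random

def voronoi_labels(width, height, seeds):
    """Assign each pixel the index of its nearest seed (rasterized Voronoi)."""
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = min(
                range(len(seeds)),
                key=lambda i: (seeds[i][0] - x) ** 2 + (seeds[i][1] - y) ** 2,
            )
            row.append(nearest)
        labels.append(row)
    return labels

def defect_mask(labels, seeds, center, radius):
    """Mark pixels whose Voronoi cell has its seed within `radius` of `center`."""
    chosen = {
        i for i, (sx, sy) in enumerate(seeds)
        if (sx - center[0]) ** 2 + (sy - center[1]) ** 2 <= radius ** 2
    }
    return [[1 if lab in chosen else 0 for lab in row] for row in labels]

rng = random.Random(42)
W, H = 32, 32
seeds = [(rng.uniform(0, W), rng.uniform(0, H)) for _ in range(20)]
labels = voronoi_labels(W, H, seeds)
mask = defect_mask(labels, seeds, center=(W / 2, H / 2), radius=8.0)
defect_area = sum(map(sum, mask))   # pixel count of the generated defect region
```

Because the mask is produced by the same geometry that defines the defect, the annotation is exact by construction, which is the property the rule-based pipelines above rely on.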

Rule-based strategies afford precise sampling control over defect-class distributions, geometric variability, and annotation extraction, facilitating dataset balancing and rare-defect oversampling (Jeziorski et al., 5 Feb 2026, Gutierrez et al., 2021).
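
One simple way to use this sampling control for dataset balancing is to compute, per defect class, how many synthetic samples would bring every class up to a common target count. The sketch below assumes hypothetical class names and counts and a "match the largest class" default; it is an illustration of the balancing idea, not a procedure described in the cited papers.

```python
def synthetic_quota(real_counts, target_per_class=None):
    """Number of synthetic samples per class needed to reach a common target.

    If no target is given, every class is topped up to the size of the
    largest real class. Classes already at or above the target get 0.
    """
    if target_per_class is None:
        target_per_class = max(real_counts.values())
    return {cls: max(0, target_per_class - n) for cls, n in real_counts.items()}

# Hypothetical real-data class counts for a casting inspection task.
real_counts = {"pore": 900, "crack": 90, "inclusion": 10}
quota = synthetic_quota(real_counts)
```

With these example counts, the rare "inclusion" class receives the largest quota and the majority "pore" class receives none.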

4. Evaluation Metrics and Empirical Findings

Quantitative assessment of synthetic data generation is performed through both image similarity/diversity metrics and downstream defect-detection task performance:

FID (Fréchet Inception Distance): Measures how close the distribution of generated defect images is to that of real defect data, with lower values indicating higher fidelity (Yang et al., 2023, Shi et al., 2024).
LPIPS (Learned Perceptual Image Patch Similarity): Assesses sample diversity, with higher values reflecting greater visual variety.
Downstream Model Performance: Relative mAP (mean average precision), mIoU (mean intersection over union), F1, accuracy, and recall are evaluated by training segmentation or detection architectures on real-only, synthetic-only, or mixed datasets. Gains of +3.7 mIoU over state-of-the-art baselines have been reported when using diffusion-generated synthetic data (Yang et al., 2023). Increases of up to 10 percentage points in mAP@0.5:0.95 are observed in low-data regimes via anomaly-guided pretraining (Liu et al., 23 Sep 2025).
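
For concreteness, mIoU on a segmentation map can be computed as in the sketch below. It assumes integer class labels stored as nested lists and follows one common convention of skipping classes that appear in neither prediction nor ground truth; benchmark implementations may handle that case differently. The tiny label maps are made up for illustration.

```python
def iou(pred, target, cls):
    """Intersection over union for one class; None if the class is absent in both."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p_hit, t_hit = (p == cls), (t == cls)
            inter += p_hit and t_hit
            union += p_hit or t_hit
    return inter / union if union else None

def mean_iou(pred, target, classes):
    """Average the per-class IoU over classes present in pred or target."""
    scores = [s for s in (iou(pred, target, c) for c in classes) if s is not None]
    return sum(scores) / len(scores) if scores else float("nan")

# Toy 2x4 label maps: 0 = background, 1 = defect.
target = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
pred   = [[0, 0, 0, 1],
          [0, 0, 1, 1]]
score = mean_iou(pred, target, classes=[0, 1])   # (0.8 + 0.75) / 2 = 0.775
```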

Empirical findings consistently show that blending synthetic and real data yields the best results among the compared training regimes, particularly under data-scarce and class-imbalanced conditions. Synthetic-only regimes generally underperform but are valuable for rare defect modes or completely new domains (Kühn et al., 29 Apr 2026).

5. Semantic, Conditional, and Annotation Mechanisms

State-of-the-art systems require fine-grained, semantically controlled generation and precise annotation pipelines:

Semantic Conditioning: Class-conditional sampling is achieved by embedding defect-type/categorical information (e.g., “scratch,” “crack,” “burr”) as learnable vectors injected into the generative model (Yang et al., 2023, Kühn et al., 29 Apr 2026). Classifier-free guidance is frequently used for trade-offs between synthetic diversity and semantic fidelity.
Pixel-Perfect Annotations: Ground-truth annotations are produced in parallel using geometry-aware renderers (object/material ID passes) (Jeziorski et al., 5 Feb 2026), mask propagation or segmentation model feedback (SAM-3) (Kühn et al., 29 Apr 2026), or direct label extraction from the digital twin (Gutierrez et al., 2021). Cross-attention maps in diffusion models are also reused for real-time mask extraction (Shi et al., 2024).
Mask-guided Inpainting: Synthesis of defects at arbitrary, controlled locations is feasible using mask-based inpainting in the latent or image domain, enabling model-driven augmentation for targeted defect types (Kühn et al., 29 Apr 2026, Valvano et al., 2024, Girella et al., 2024).
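
The diversity/fidelity trade-off of classifier-free guidance mentioned above can be seen in how the two noise predictions are combined at each sampling step. The sketch below uses the convention in which a scale of 1 recovers the purely conditional prediction; some papers parameterize the same rule as (1 + w) ε_cond − w ε_uncond. The function name and the toy vectors are assumptions for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 extrapolate toward the condition
    (stronger adherence to the defect class, typically less diversity).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for a two-element "image".
eps_u = [0.0, 1.0]   # prediction without the defect-class embedding
eps_c = [1.0, 3.0]   # prediction with, e.g., a "scratch" class embedding
guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=2.0)
```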

6. Component Ablation, Trade-offs, and Best Practices

Several studies ablate the effects of model hyperparameters, conditioning schemes, and pipeline components:

Receptive-Field Size and Multi-Stage Fusion: Fusion of small patch-level and global full-resolution diffusion stages offers a balance of local texture diversity and large-scale structural coherence (Yang et al., 2023).
Switch Timestep (u) and Diversity-Fidelity Trade-off: An early switch (small u) yields high diversity but low fidelity, whereas a late switch (large u) yields high fidelity but low diversity. Empirical selection of the switch point is crucial (Yang et al., 2023).
Annotation Quality and Filtering: Sample filtering using automatic metrics (CLIPScore, DreamSim) ensures that only realistic and task-relevant synthetic images are retained for model training (Kühn et al., 29 Apr 2026).
Effect of Augmentation Ratio: A synthetic:real ratio of 1:1, or a synthetic share of 25–50%, often preserves or improves detection AP; synthetic-only training underperformed real-data training in the reported cross-domain trials (Kühn et al., 29 Apr 2026, Phan et al., 2024).
Domain Randomization versus Domain Adaptation: Unsupervised domain adaptation provides only marginal benefits, whereas photorealism and systematic domain randomization drive larger gains (Gutierrez et al., 2021).
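
A mixing step of the kind discussed under augmentation ratio might look like the following sketch, which keeps every real sample and adds enough synthetic ones to reach a requested synthetic share of the final training set. The helper name, the string placeholders for images, and the choice to cap at the available synthetic pool are assumptions made for illustration.

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction, rng):
    """Keep all real samples and add synthetic ones so that they make up
    `synthetic_fraction` of the final training set (0 <= fraction < 1)."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Solve n_syn / (n_real + n_syn) = fraction for n_syn.
    n_syn = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_syn = min(n_syn, len(synthetic))   # cannot draw more than the pool holds
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed

rng = random.Random(0)
real = [f"real_{i}" for i in range(100)]        # placeholders for real images
synthetic = [f"syn_{i}" for i in range(500)]    # placeholders for generated images
half_and_half = mix_datasets(real, synthetic, 0.5, rng)    # 1:1 ratio
quarter_syn = mix_datasets(real, synthetic, 0.25, rng)     # 25% synthetic
```

Sweeping `synthetic_fraction` over a few values and retraining the detector is a straightforward way to reproduce the kind of ratio ablation described above on a new dataset.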

7. Application Domains, Limitations, and Future Directions

Synthetic data generation frameworks are now pervasive in:

Metal casting, additive manufacturing, electronics (PCB, semiconductor), automotive, utilities, and composite materials (Huang et al., 12 Jan 2025, Shinde et al., 15 May 2025, Phan et al., 2024, Dey et al., 2024, Posilović et al., 2021, Jeziorski et al., 5 Feb 2026).
Inspection modalities including visual, X-ray, SEM, and ultrasonic B-scans.
Benchmarks with fine-grained semantic labels, such as “Defect Spectrum” (Yang et al., 2023).

Identified challenges include:

Domain gap between synthetic and real data, particularly in capturing subtle manufacturing or imaging artifacts (Kühn et al., 29 Apr 2026).
The computational cost of the high-fidelity rendering that physics-based pipelines require (Andriiashen et al., 2023).
Generalization to unseen defect classes and products, addressed through prompt-based conditioning, transfer of defect-perturbation directions, and zero-shot consistency modeling (Shi et al., 2024).

Ongoing work aims to further automate prompt construction, integrate closed-loop hard-negative mining, and extend multi-modal diffusion conditioning for complex multi-class segmentation problems.

References

"Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics" (Yang et al., 2023)
"Quantifying the effect of X-ray scattering for data generation in real-time defect detection" (Andriiashen et al., 2023)
"Synthetic training data generation for deep learning based quality inspection" (Gutierrez et al., 2021)
"Defect Detection in Photolithographic Patterns Using Deep Learning Models Trained on Synthetic Data" (Shinde et al., 15 May 2025)
"Scalable AI Framework for Defect Detection in Metal Additive Manufacturing" (Phan et al., 2024)
"TransferD2: Automated Defect Detection Approach in Smart Manufacturing using Transfer Learning Techniques" (Mih et al., 2023)
"Defect Detection Network In PCB Circuit Devices Based on GAN Enhanced YOLOv11" (Huang et al., 12 Jan 2025)
"Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset" (Liu et al., 23 Sep 2025)
"Generative Model-Driven Synthetic Training Image Generation: An Approach to Cognition in Rail Defect Detection" (Ferdousi et al., 2023)
"Controllable Image Synthesis of Industrial Data Using Stable Diffusion" (Valvano et al., 2024)
"Generative adversarial network with object detector discriminator for enhanced defect detection on ultrasonic B-scans" (Posilović et al., 2021)
"Leveraging Latent Diffusion Models for Training-Free In-Distribution Data Augmentation for Surface Defect Detection" (Girella et al., 2024)
"Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations" (Jeziorski et al., 5 Feb 2026)
"Synthetic Data Augmentation Using GAN For Improved Automated Visual Inspection" (Rožanec et al., 2022)
"End-to-End Defect Detection in Automated Fiber Placement Based on Artificially Generated Data" (Zambal et al., 2019)
"Diffusion-based Image Generation for In-distribution Data Augmentation in Surface Defect Detection" (Capogrosso et al., 2024)
"Addressing Class Imbalance and Data Limitations in Advanced Node Semiconductor Defect Inspection: A Generative Approach for SEM Images" (Dey et al., 2024)
"SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection" (Kühn et al., 29 Apr 2026)
"DDPM-MoCo: Advancing Industrial Surface Defect Generation and Detection with Generative and Contrastive Learning" (He et al., 2024)
"Few-shot Defect Image Generation based on Consistency Modeling" (Shi et al., 2024)