SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

Published 29 Apr 2026 in cs.CV and cs.AI | (2604.26633v1)

Abstract: Learning-based industrial defect detection is often limited not by model capacity, but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrate the potential of combining real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on BSData, a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation dataset (MSD) to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.

Authors (4)

Summary

The paper introduces SynSur, a pipeline that integrates automated vision-language prompt extraction with LoRA adaptation to generate realistic synthetic industrial defects without heavy manual labor.
It employs mask-guided inpainting and dual-axis filtering (using DreamSim and CLIPScore) to ensure high domain fidelity and prompt adherence in defect synthesis.
Quantitative evaluations reveal that blending synthetic and real data enhances detector performance, although synthetic-only regimes remain less reliable.

SynSur: An End-to-End Diffusion Pipeline for Industrial Surface Defect Synthesis and Detection

Introduction and Problem Statement

The scarcity of labeled industrial defect data remains a persistent obstacle in deploying robust deep defect detectors. Existing approaches to synthetic data generation, including 3D rendering pipelines and classic image-based synthesis, either impose significant manual overhead or lack the realism needed for effective supervised learning. In this context, diffusion models have emerged as a powerful backbone for conditional, high-fidelity synthesis, but their potential for instance-level industrial defect generation in an end-to-end supervised detection pipeline remains underexplored.

The "SynSur" pipeline addresses this gap, leveraging a sequence of automated vision-language prompt extraction, LoRA-enabled diffusion adaptation, mask-guided inpainting, and multi-metric sample selection and annotation. The methodology facilitates the generation of synthetic defect datasets exhibiting high domain fidelity, with all processing steps designed for deployment without manual annotation effort aside from initial seed annotations.

Pipeline Overview

Figure 1: Overview of the end-to-end SynSur pipeline, illustrating each phase from real data processing and prompt extraction through LoRA-based diffusion, inpainting, blending, and automatic annotation.

The pipeline comprises several distinct stages designed for minimal manual intervention and high domain controllability. Specifically:

Data preparation involves defect patch extraction and annotation-based mask generation.
Prompts describing visual and context attributes are automatically synthesized from image batches using a vision-language model (Qwen2-VL).
Low-Rank Adaptation (LoRA) finetunes a pretrained diffusion foundation model to industrial domain specifics using selected image-prompt pairs.
Mask-guided inpainting with LoRA-modified diffusion models generates context-consistent synthetic defects in defect-free image regions.
Candidate samples are filtered using a combination of DreamSim (perceptual proximity) and CLIPScore (prompt adherence).
Automatic annotations are produced with Segment Anything Model v3 (SAM-3), yielding COCO-format labels without additional user input.
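
The final annotation step can be illustrated with a minimal sketch that turns a binary defect mask into a COCO-style box annotation. It assumes the mask is already available as rows of 0/1 values (for example, after thresholding a segmentation output); the function name `mask_to_coco_annotation` and the toy mask are hypothetical and are not taken from the paper.

```python
def mask_to_coco_annotation(mask, image_id, annotation_id, category_id=1):
    """Derive a COCO-style box annotation from a binary mask (list of rows of 0/1)."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None  # empty mask: nothing to annotate
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    area = sum(1 for row in mask for v in row if v)
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        # COCO boxes are [x, y, width, height] in pixels, with a top-left origin.
        "bbox": [x_min, y_min, x_max - x_min + 1, y_max - y_min + 1],
        "area": area,
        "iscrowd": 0,
    }


# Toy 4x5 mask standing in for a segmented synthetic defect (illustrative only).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
annotation = mask_to_coco_annotation(mask, image_id=7, annotation_id=1)
```

The sketch covers only the box and area fields; a full COCO record for segmentation would also carry a polygon or RLE `segmentation` entry.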

Dataset Instantiation and Practical Synthesis

Figure 3: Qualitative depiction of multiple typical pitting defects from the BSData corpus, demonstrating fine-grained texture and size diversity.

The main evaluation is performed on BSData (pitting defects on ball screw drives) and a scratch subset of the Mobile phone screen surface defect dataset (MSD), representing two challenging inspection modalities. Defect patches are carefully extracted, with distributional analysis ensuring realistic spatial priors and structural diversity. These patches serve as the basis for both prompt extraction and LoRA adaptation.
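
As a rough illustration of patch extraction, the sketch below crops a padded window around an annotated box and clamps it to the image bounds. The padding value, the function name `crop_defect_patch`, and the toy image are assumptions made for the example; the paper's actual extraction rules are not described here.

```python
def crop_defect_patch(image, bbox, pad=2):
    """Crop a padded patch around a COCO-style [x, y, w, h] box, clamped to the image."""
    height, width = len(image), len(image[0])
    x, y, w, h = bbox
    # Expand the box by `pad` pixels on each side without leaving the image.
    x0 = max(0, x - pad)
    y0 = max(0, y - pad)
    x1 = min(width, x + w + pad)
    y1 = min(height, y + h + pad)
    patch = [row[x0:x1] for row in image[y0:y1]]
    return patch, (x0, y0, x1 - x0, y1 - y0)


# Toy 6x8 image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(8)] for r in range(6)]
patch, patch_box = crop_defect_patch(image, bbox=[5, 1, 2, 2], pad=2)
```

Returning the clamped box alongside the patch keeps the patch traceable to its source location, which is useful when patch statistics are later used as spatial priors.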

Prompt Engineering via Vision-Language Models

A central element is the replacement of hand-crafted prompts with tags automatically extracted by Qwen2-VL from defect image batches. Tags receive light post-processing to prevent overgeneralization, resulting in attribute-rich strings that describe material, morphology, texture, and imaging parameters. Compared with manual prompt engineering, this approach improves reproducibility and consistency, so that LoRA adaptation and diffusion guidance are attuned to attribute distributions observed in the data.
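
One plausible form of such tag post-processing is sketched below: tags are normalized, counted once per image, and kept only if they recur across the batch and are not on a stoplist of generic words. The stoplist, the `min_share` threshold, the example tags, and the function name `build_prompt` are all hypothetical; the summary does not specify the actual rules.

```python
from collections import Counter

# Hypothetical stoplist of tags considered too generic to guide generation.
GENERIC_TAGS = {"image", "photo", "object", "defect"}


def build_prompt(tag_lists, min_share=0.5, max_tags=6):
    """Keep tags that recur across a batch, drop generic ones, and join them."""
    counts = Counter()
    for tags in tag_lists:
        # Normalize case and whitespace, and count each tag once per image.
        counts.update({t.strip().lower() for t in tags})
    threshold = min_share * len(tag_lists)
    # Sort by frequency (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [
        tag for tag, n in ranked
        if n >= threshold and tag not in GENERIC_TAGS
    ]
    return ", ".join(kept[:max_tags])


# Illustrative per-image tags, standing in for vision-language model output.
tag_lists = [
    ["Metallic surface", "pitting", "dark crater", "photo"],
    ["metallic surface", "pitting", "specular highlight"],
    ["metallic surface", "pitting", "dark crater", "defect"],
]
prompt = build_prompt(tag_lists)
```

The deterministic sort means the same batch always yields the same prompt string, which matches the reproducibility argument made for automated tagging.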

LoRA Diffusion Fine-Tuning and Ablation

The effectiveness of LoRA adaptation is empirically benchmarked against multiple sampling and size-binning strategies. While models trained on larger defect patches yielded marginal gains in visual similarity and prompt adherence (as measured by DreamSim and CLIPScore), overall synthesis performance was not strongly sensitive to patch selection heuristics, justifying simplicity in deployment.
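
For reference, LoRA in its standard formulation keeps a pretrained weight matrix $W_0$ frozen and learns only a low-rank update to it:

$$W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

Here $r$ is the adapter rank and $\alpha$ a scaling factor. The values used for SynSur are not given in this summary, so none are assumed here.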

Synthetic Data Filtering and Quality Control

To guarantee utility for supervised detector training, synthetic samples are ranked and selected using dual axes: DreamSim for proximity to real training images and CLIPScore for prompt-image fidelity. Samples at the top of these rankings exhibit both the characteristic morphology of true defects and prompt consistency, while low-ranked images display blurring or non-defect-like features.

Figure 2: Distributions of extreme ranking for synthetic samples under DreamSim and CLIPScore, illustrating selection for high realism and prompt support.
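
A minimal sketch of dual-axis selection follows, assuming DreamSim is used as a distance to real images (lower is closer) and CLIPScore as a prompt-image similarity (higher is better). Combining the two by summed rank is an assumption made for illustration, as are the function name `select_samples` and the scores; the paper's exact selection rule may differ.

```python
def select_samples(candidates, keep):
    """Rank by DreamSim distance and CLIPScore, then keep the best combined rank."""
    by_dreamsim = sorted(candidates, key=lambda c: c["dreamsim"])  # lower is closer
    by_clip = sorted(candidates, key=lambda c: c["clipscore"], reverse=True)  # higher is better
    dreamsim_rank = {c["id"]: r for r, c in enumerate(by_dreamsim)}
    clip_rank = {c["id"]: r for r, c in enumerate(by_clip)}
    # Sum the two ranks; ties are broken by id so the result is deterministic.
    ranked = sorted(
        candidates,
        key=lambda c: (dreamsim_rank[c["id"]] + clip_rank[c["id"]], c["id"]),
    )
    return [c["id"] for c in ranked[:keep]]


# Illustrative precomputed scores; these values are made up for the example.
candidates = [
    {"id": "a", "dreamsim": 0.12, "clipscore": 0.31},
    {"id": "b", "dreamsim": 0.40, "clipscore": 0.33},
    {"id": "c", "dreamsim": 0.15, "clipscore": 0.29},
    {"id": "d", "dreamsim": 0.55, "clipscore": 0.18},
]
selected = select_samples(candidates, keep=2)
```

Rank aggregation avoids having to put the two metrics on a common scale; a thresholded or weighted variant would be an equally reasonable alternative.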

Quantitative Evaluation on Detection

Downstream impact on object detection is assessed by (1) detector family (YOLOX, YOLOv26, LW-DETR), (2) various blendings of real/synthetic training data, and (3) transfer to the cross-domain scratch segmentation task. The critical findings are:

Synthetic-only training does not match the reliability or accuracy of real-only regimes.
The primary value of synthetic samples emerges as augmentation for scarce real datasets, with mixed- and union-regimes (e.g., $AP = 0.655$ for YOLOv26 with 75/25 real/synthetic) exhibiting performance parity or modest gains over real-only training.
The benefit is bounded: detection performance saturates or degrades if synthetic prevalence overwhelms the real set, and the marginal gains are architecture-dependent (Transformer-based LW-DETR displays lower sensitivity to synthetic augmentation).
Cross-dataset transfer to MSD shows that the pipeline remains operational in a second domain, but it also underlines the need for domain-specific annotation practices, since annotation noise from automatic segmentation can reduce utility in some modalities.
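
The mixed regimes above can be sketched as a simple sampling step. The example assumes that "75/25 real/synthetic" means all real samples are kept and synthetic samples are added until they make up 25% of the combined set; the summary does not state this, and a replacement-based reading is also possible. The function name `mix_training_set` and the file names are hypothetical.

```python
import random


def mix_training_set(real, synthetic, synthetic_share, seed=0):
    """Keep all real samples and add synthetic ones until they form `synthetic_share` of the total."""
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")
    # Solve n_syn / (n_real + n_syn) = synthetic_share for n_syn.
    n_synthetic = round(len(real) * synthetic_share / (1.0 - synthetic_share))
    n_synthetic = min(n_synthetic, len(synthetic))  # cannot draw more than exist
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(mixed)
    return mixed


real = [f"real_{i}" for i in range(75)]
synthetic = [f"syn_{i}" for i in range(200)]
mixed = mix_training_set(real, synthetic, synthetic_share=0.25)
```

Fixing the seed keeps the drawn synthetic subset identical across detector families, so that differences in AP reflect the architecture and not the sample draw.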

Qualitative Examples and Failure Modes

Figure 6: Representative successful and failed synthetic samples, highlighting realistic pitting and scratch morphology versus cases with mask boundary overlap and annotation artifacts.

Good synthetic samples are indistinguishable from real defects in morphology/placement and do not exhibit obvious artifacts. However, failures regularly stem from poor mask location (e.g., defect generation on image boundaries) or limitations in automatic annotation, supporting the need for further refinement in mask strategy or closed-loop annotation correction.
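
One simple guard against boundary placements, offered only as a possible refinement and not as the paper's method, is to sample mask positions so that the mask keeps a fixed margin from every image border. The function name `sample_mask_origin`, the margin, and the sizes below are illustrative assumptions.

```python
import random


def sample_mask_origin(image_size, mask_size, margin, rng):
    """Pick a top-left corner so the mask stays `margin` pixels away from every border."""
    img_w, img_h = image_size
    mask_w, mask_h = mask_size
    max_x = img_w - mask_w - margin
    max_y = img_h - mask_h - margin
    if max_x < margin or max_y < margin:
        return None  # the mask plus margin does not fit in the image
    # random.Random.randint is inclusive on both ends.
    return rng.randint(margin, max_x), rng.randint(margin, max_y)


rng = random.Random(0)
origins = [
    sample_mask_origin((256, 256), (64, 48), margin=16, rng=rng)
    for _ in range(100)
]
```

A uniform draw ignores where defects actually occur; weighting the draw by the spatial distribution of real defects would be a natural extension.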

Ablative and Failure Analysis

Figure 8: Limitation visualization of the foundation Flux.1-dev diffusion model (without LoRA) yielding contextually or morphologically implausible results regardless of prompt, substantiating the necessity of domain adaptation.

Ablation experiments confirm that prompt engineering alone is insufficient; LoRA adaptation is a critical prerequisite for domain-aligned synthesis. Domain-specific masks further enhance realism, but annotation post-processing with models such as SAM-3 remains a challenge for thin or elongated defect morphologies.

Implications and Future Directions

SynSur demonstrates that, for high-value industrial inspection contexts, an end-to-end generative pipeline incorporating automatic prompt extraction, targeted diffusion adaptation, mask-driven generation, and multi-metric curation can adequately address annotation scarcity when appropriately fused with scarce real samples. However, synthetic-only regimes remain suboptimal due to subtle domain gaps.

Open areas remain in: supporting multiple defect classes concurrently, learning placement or mask priors directly from unlabeled real distributions, prompt diversification for finer control, and automating the feedback loop between augmentation and detector performance. The approach is directly generalizable to other fine-grained industrial visual inspection problems contingent on mask and prompt engineering adaptation.

Conclusion

The SynSur framework advances the field of generative data augmentation for industrial defect detection by establishing a practical, largely automated pipeline from real image analysis to final annotated synthetic sample production, requiring only initial seed annotations. The value of diffusion-generated samples is clear when used as a complement to scarce real data, and the engineering of LoRA adaptation, prompts, mask processes, and post-filtering is shown to be both essential and tractable. The proposed design could underpin future exploration in self-improving closed-loop synthetic data generators for domain-robust industrial perception tasks.