UDG Dataset: Visual Defect Generation Resource

Updated 17 June 2026

The UDG dataset is a large-scale visual defect generation resource comprising 300K quadruplets, each consisting of a normal image, a defective image, a binary mask, and a structured caption.
It is built with a reproducible, semi-automated annotation pipeline that combines real and synthetic defect samples from industrial, natural, and medical domains.
The dataset maps 269 raw defect types into 28 standardized categories and supports the training of state-of-the-art anomaly detection and localization models.

In contemporary scientific literature, “UDG dataset” can refer to two unrelated kinds of large-scale resource: (1) catalogs of ultra-diffuse galaxies (“UDG” as a galaxy type, used predominantly in extragalactic astronomy), and (2) the “Universal Defect Generation” (UDG) dataset introduced in computer vision for defect and anomaly generation and synthesis. The former serves galaxy research in astrophysics, while the latter underpins visual anomaly synthesis and evaluation in machine learning. This article gives an overview of the structure, content, and research utility of the Universal Defect Generation dataset as described in (Fan et al., 10 Apr 2026), and mentions the astrophysical usage only to distinguish it.

1. Definition and Quadruplet Structure

The UDG dataset, as described in (Fan et al., 10 Apr 2026), is a curated, large-scale data resource explicitly constructed to address deficiencies in visual defect generation and defect/anomaly understanding. Each dataset sample is a quadruplet:

A normal (defect-free) image
The corresponding abnormal (defective) image
A pixel-aligned binary mask indicating the defect’s spatial extent
A structured, natural language caption that hierarchically describes the scene, object, and defect (global scene → object → defect morphology and severity)

The collection totals 300,000 such quadruplets, each of which can be consumed directly by generative or discriminative models in prompt-driven or reference-based visual defect processing.

Component	Description	Format/Source
Normal image	Defect-free, paired with mask and abnormal	Image, generated or curated
Abnormal image	Image with one or more annotated defects	Image, existing datasets
Mask	Binary, size-matched, localizing the defect	Derived or synthesized
Caption	Descriptive, structured, filtered by LLMs	Multimodal LLM, verified
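
As an illustration of the quadruplet structure, the following minimal Python sketch models one sample as a typed record. The class names, field names, and file layout are assumptions made for this example; they are not taken from the released dataset or its loader.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Caption:
    """Hierarchical caption: global scene -> object -> defect."""
    scene: str
    obj: str
    defect: str

    def as_text(self) -> str:
        # Join the three levels in coarse-to-fine order.
        return " ".join([self.scene, self.obj, self.defect])


@dataclass(frozen=True)
class Quadruplet:
    """One sample: normal image, defective image, binary mask, caption."""
    normal_path: Path
    abnormal_path: Path
    mask_path: Path
    caption: Caption
    is_synthetic: bool = False  # real vs. synthetically expanded sample


# Hypothetical file layout and caption text, for illustration only.
sample = Quadruplet(
    normal_path=Path("normal/000001.png"),
    abnormal_path=Path("abnormal/000001.png"),
    mask_path=Path("mask/000001.png"),
    caption=Caption(
        scene="A metal part on a conveyor belt.",
        obj="A machined steel bracket.",
        defect="A thin, shallow scratch near the upper edge.",
    ),
)
print(sample.caption.as_text())
```

Keeping the caption as three separate fields, rather than one string, preserves the scene, object, and defect levels for models that condition on them individually.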

2. Domain and Defect-Type Coverage

The assembly of the UDG dataset is explicitly multi-domain:

Industrial inspection (manufacturing surfaces, parts, assemblies)
Natural-world objects (agricultural produce, materials, everyday items)
Medical imagery (pathological regions, modalities)

Data are aggregated from 50 separate sources, including Real-IAD, MANTA, and 3CAD, to ensure broad domain coverage.

Defect classes are handled on two levels:

269 raw defect types, as originally encoded in the source sets
28 standardized defect macro-categories, into which the raw types are mapped while preserving the heterogeneity of real-world objects and domains

The top categories include missing, combined, deformation, discoloration, breakage, dirt, scratch, and others, providing wide coverage of real and synthetic anomaly conditions.
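
The two-level label scheme can be sketched as a lookup from source-specific labels to standardized categories. The raw label names below are hypothetical, and the table shows only a handful of entries; the actual mapping of 269 raw types is defined by the dataset authors.

```python
from collections import Counter

# Hypothetical raw labels; the real mapping table covers 269 raw types.
RAW_TO_CATEGORY = {
    "scratch_deep": "scratch",
    "hairline_scratch": "scratch",
    "missing_screw": "missing",
    "oil_stain": "dirt",
    "color_fade": "discoloration",
}


def standardize(raw_label: str) -> str:
    """Map a source-specific label to a standardized category."""
    # Normalize case and separators so that labels from different
    # sources resolve to the same lookup key.
    key = raw_label.strip().lower().replace(" ", "_").replace("-", "_")
    try:
        return RAW_TO_CATEGORY[key]
    except KeyError:
        raise KeyError(f"unmapped raw defect type: {raw_label!r}") from None


raw_labels = ["Scratch-Deep", "missing screw", "oil_stain", "hairline scratch"]
category_counts = Counter(standardize(label) for label in raw_labels)
print(category_counts)
```

Raising on an unmapped label, rather than falling back to a generic category, makes gaps in the mapping table visible when a new source is added.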

3. Annotation Pipeline and Data Generation

The construction of each quadruplet is defined by a reproducible, semi-automated annotation and synthesis protocol:

Mask and Normal Image Acquisition:

For 150,000 real quadruplets, abnormal images and binary masks are obtained from existing anomaly-detection datasets.
“Normal” (defect-free) images are produced by an inpainting agent based on FLUX.1-Fill-dev and trained on ≈600,000 defect scenario images, which excises the annotated defects.
For normal images not originally paired with a mask, random mask templates (from foreground extraction plus a template library) are generated to supply masks for inpainting and augment defect instance diversity.
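
The random mask step can be illustrated with a small sketch that pastes a template at a random position constrained to the foreground region. This is an assumed rejection-sampling procedure written for illustration; the source does not specify how placement is done, and the function name and parameters are hypothetical.

```python
import random

Mask = list[list[int]]  # rows of 0/1 values


def place_template(foreground: Mask, template: Mask, rng: random.Random,
                   max_tries: int = 100) -> Mask:
    """Paste `template` at a random position where it lies on the foreground."""
    height, width = len(foreground), len(foreground[0])
    t_height, t_width = len(template), len(template[0])
    if t_height > height or t_width > width:
        raise ValueError("template larger than image")
    for _ in range(max_tries):
        top = rng.randint(0, height - t_height)
        left = rng.randint(0, width - t_width)
        # Accept only if every template pixel lands on the foreground object.
        inside = all(
            foreground[top + r][left + c]
            for r in range(t_height) for c in range(t_width)
            if template[r][c]
        )
        if inside:
            mask = [[0] * width for _ in range(height)]
            for r in range(t_height):
                for c in range(t_width):
                    mask[top + r][left + c] = template[r][c]
            return mask
    raise RuntimeError("no valid placement found")


# Toy 6x6 foreground with the object occupying the central 4x4 block.
foreground = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)]
              for r in range(6)]
template = [[1, 1], [0, 1]]  # small L-shaped defect template
mask = place_template(foreground, template, random.Random(0))
```

The resulting mask would then be passed, together with the image, to the inpainting or generation model.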

Caption Generation:

A Captioner Agent, leveraging multimodal LLMs such as Gemini-3-Pro, GPT-5.1, Qwen3-VL-235B, and GLM-4.6V, is driven by a system prompt that enforces a global → object → defect structure.
Each caption receives an intrinsic confidence score in [0, 1]; a Verifier Agent discards captions that are low-confidence (<0.8) or inconsistent, so that only reliable annotations enter the dataset.
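
The verification rule can be sketched as a simple filter. Only the 0.8 threshold comes from the description above; the record type and the consistency check are placeholders assumed for this example, since the actual Verifier Agent relies on model-based judgments.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # captions scoring below this are discarded


@dataclass
class CaptionCandidate:
    scene: str
    obj: str
    defect: str
    confidence: float  # intrinsic score in [0, 1]


def is_consistent(c: CaptionCandidate) -> bool:
    """Placeholder consistency check: all three levels must be non-empty."""
    return all(part.strip() for part in (c.scene, c.obj, c.defect))


def verify(candidates):
    """Keep captions that are consistent and meet the confidence threshold."""
    return [c for c in candidates
            if c.confidence >= CONFIDENCE_THRESHOLD and is_consistent(c)]


candidates = [
    CaptionCandidate("A tiled floor.", "A ceramic tile.",
                     "A hairline crack across the corner.", 0.93),
    CaptionCandidate("A tiled floor.", "A ceramic tile.", "", 0.95),
    CaptionCandidate("A fabric roll.", "A woven textile.",
                     "A small stain.", 0.61),
]
kept = verify(candidates)
print(len(kept))
```

Here the second candidate is rejected for its empty defect level and the third for its low score, so only the first survives.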

Synthetic Quadruplet Expansion:

An additional 150,000 samples are synthesized: normal images are re-masked using a template library extracted and clustered from real masks (2,800 clusters), followed by captioning and verification.

The inpainting model is trained for a total of 2.4 million iterations, with optimizer choice and image resolution matched to the requirements of the downstream generative model.

4. Dataset Statistics, Diversity, and Distribution

The final UDG dataset offers:

300,000 quadruplets: 50% real, 50% synthetic.
Coverage of 269 raw defect types mapped to 28 categories.
Representation across industrial, natural, and medical domains (per-domain proportions are not explicitly reported).

Category diversity can be quantified, for example, with the Gini–Simpson diversity index

$D = 1 - \sum_{i=1}^{K} p_i^2$

where $p_i$ is the fraction of samples in category $i$ and $K = 28$. The average relative defect area over $N$ samples is:

$\bar{A} = \frac{1}{N} \sum_{n=1}^{N} \frac{\mathrm{Area}(\mathrm{mask}_n)}{\mathrm{Area}(\mathrm{image}_n)}$

Although these statistics are not explicitly reported in (Fan et al., 10 Apr 2026), the dataset is constructed for high intra-category and inter-category variance, supporting both diversity-intensive generation and fine-grained detection/localization.
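
Both statistics are straightforward to compute from category labels and binary masks. The sketch below uses toy inputs, not values from the dataset; the function names are chosen for this example. For reference, a perfectly uniform distribution over $K = 28$ categories would give the maximum $D = 1 - 1/28 \approx 0.964$.

```python
from collections import Counter


def gini_simpson(labels):
    """D = 1 - sum_i p_i^2 over the observed category frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def mean_defect_area(masks):
    """Average fraction of pixels marked as defective, over binary masks."""
    ratios = []
    for mask in masks:
        defect_pixels = sum(sum(row) for row in mask)
        total_pixels = len(mask) * len(mask[0])
        ratios.append(defect_pixels / total_pixels)
    return sum(ratios) / len(ratios)


# Toy inputs: four labels and two 2x2 masks.
labels = ["scratch", "scratch", "dirt", "missing"]
masks = [
    [[1, 0], [0, 0]],  # 1 of 4 pixels defective
    [[1, 1], [0, 0]],  # 2 of 4 pixels defective
]
diversity = gini_simpson(labels)   # 1 - (0.5^2 + 0.25^2 + 0.25^2) = 0.625
mean_area = mean_defect_area(masks)  # (0.25 + 0.5) / 2 = 0.375
print(diversity, mean_area)
```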

5. Protocols for Split, Evaluation, and Use

The data is distributed as a monolithic pool, suitable for user-defined splitting:

No enforced train/val/test partitions; developers commonly reserve 5–10% for validation, while the rest is used directly for model development and experimentation.
All source datasets are publicly available under their original licenses. UDG imposes no further proprietary constraints.
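
Because no official partition is provided, a user-defined split has to be reproducible across runs and machines. One possible approach, sketched below, assigns each sample by hashing its identifier; the identifier format and the 5% validation fraction are assumptions for this example.

```python
import hashlib


def assign_split(sample_id: str, val_fraction: float = 0.05) -> str:
    """Deterministically assign a sample to 'train' or 'val' from its ID."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


sample_ids = [f"udg_{i:06d}" for i in range(10_000)]  # hypothetical IDs
splits = {sid: assign_split(sid) for sid in sample_ids}
val_count = sum(1 for s in splits.values() if s == "val")
print(f"validation share: {val_count / len(sample_ids):.3f}")
```

Hash-based assignment does not depend on file order or a random seed, so adding new samples never moves existing ones between splits. If several quadruplets derive from the same source image, hashing a shared source identifier instead of the sample ID would keep them in the same split and avoid leakage.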

Models such as UniDG are trained and validated exclusively on UDG-supplied quadruplets; downstream model evaluation is performed on held-out external benchmarks (e.g., MVTec-AD, VisA).

6. Licensing, Accessibility, and Research Context

The composite UDG dataset is governed by the licenses of its source components, and the public release of the dataset and associated code is announced at https://github.com/RetoFan233/UniDG. Downstream usage is not restricted beyond source licensing, which facilitates broad academic and industrial experimentation (Fan et al., 10 Apr 2026).

The dataset was conceived to address the limitations of few-shot and per-category anomaly/defect generation frameworks: small data, limited diversity, poor generalization, and inflexible category structure. By supporting both reference-based and text-instruction-based defect editing, the resource enables universal generative models without bespoke category-specific pipelines. Empirical results in (Fan et al., 10 Apr 2026) report state-of-the-art anomaly generation and detection/localization performance for models trained on the UDG dataset.

7. Relation to UDG Datasets in Astronomy

There is a parallel and entirely distinct literature on ultra-diffuse galaxy (“UDG”) datasets in astrophysics. These resources are catalogs of galaxies with low surface brightness and large effective radii, such as the Southern SMUDGes catalog (Zaritsky et al., 2022), the KiDS UDG catalog (Su et al., 17 Sep 2025), and HI-selected UDGs from WALLABY (O'Beirne et al., 22 Oct 2025). They should not be conflated with the machine-vision UDG dataset. Each discipline employs its own definitions, selection criteria, and catalog structures, and the two share nothing beyond the acronym.