
AnomVerse: Multimodal Anomaly Dataset

Updated 15 November 2025
  • AnomVerse is a large-scale multimodal dataset comprising 12,987 triplets, each pairing an anomalous image with a pixel-level mask and a detailed textual caption.
  • An automated LLM pipeline generates consistent, mask-aligned captions from visual cues and a fixed template.
  • The dataset is designed for training crossmodal generative models, supporting zero-shot anomaly generation and detection across five domains.

AnomVerse is a large-scale, multimodal dataset designed for research in zero-shot and prompt-driven anomaly generation. Developed as a foundational resource for the Anomagic framework, it comprises 12,987 annotated triplets, each consisting of an anomalous image, a precise pixel-level mask, and a textual caption generated by a multimodal LLM. The dataset aggregates samples from 13 public anomaly detection benchmarks spanning five major domains: industrial, textiles, consumer products, medical imagery, and electronics. This construction supports the training and evaluation of crossmodal generative models that synthesize semantically coherent, mask-aligned anomalies with high realism in both the visual and textual modalities.

1. Data Composition and Structure

AnomVerse consists of triplets (Iʳᵉᶠ, Mʳᵉᶠ, tʳᵉᶠ), where (a minimal schema sketch follows this list):

  • Iʳᵉᶠ is an image exhibiting a genuine anomaly,
  • Mʳᵉᶠ is a binary segmentation mask identifying the anomalous region,
  • tʳᵉᶠ is a natural-language caption describing the defect.
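
A minimal sketch of how one such triplet might be represented in code; the field names, types, and metadata are illustrative assumptions, not the dataset's published schema:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AnomalyTriplet:
    """One AnomVerse-style triplet (illustrative schema, not the official one)."""
    image: np.ndarray    # I_ref: H x W x 3 image containing a genuine anomaly
    mask: np.ndarray     # M_ref: H x W binary mask, 1 inside the anomalous region
    caption: str         # t_ref: template-based natural-language defect description
    source_dataset: str  # hypothetical metadata, e.g. "MVTec AD" or "VisA"
    domain: str          # hypothetical metadata: industrial, textiles, consumer,
                         # medical, or electronics
```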

Statistical Summary:

Statistic          Value
Total triplets     12,987
Source datasets    13 (including MVTec AD, VisA, and MANTA, among others)
Covered domains    Industrial, textiles, consumer products, medical, electronics

Captions tʳᵉᶠ are generated systematically by a multimodal LLM (Doubao-Seed-1.6), guided by structured visual cues and template-based textual hints. The pipeline supplies the LLM with cropped mask regions to anchor local context and prompts it with a fixed schema:

“The image depicts [object], with a [defect type] observed [location]. The defect is characterized by [detail] and exhibits [features].”

This approach yields fine-grained, standardized textual labels without manual curation.
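
As a concrete illustration of how the fixed schema standardizes output, the following sketch fills the five template slots programmatically; the slot values here are invented examples, and in the actual pipeline they come from the multimodal LLM's response:

```python
# Illustrative only: in AnomVerse the slot values are produced by the
# multimodal LLM (Doubao-Seed-1.6); here they are hard-coded examples.
CAPTION_TEMPLATE = (
    "The image depicts {obj}, with a {defect_type} observed {location}. "
    "The defect is characterized by {detail} and exhibits {features}."
)

def fill_caption(obj: str, defect_type: str, location: str,
                 detail: str, features: str) -> str:
    """Render one standardized caption from the five template slots."""
    return CAPTION_TEMPLATE.format(
        obj=obj, defect_type=defect_type, location=location,
        detail=detail, features=features,
    )

print(fill_caption(
    obj="a metal nut",
    defect_type="scratch",
    location="near the upper edge",
    detail="a thin, elongated groove",
    features="exposed bare metal with irregular borders",
))
```

Because every caption is rendered from the same schema, downstream text encoders see a consistent syntactic structure regardless of source dataset or defect type.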

2. Caption Generation and Curation Pipeline

Captioning in AnomVerse leverages automatic, LLM-driven annotation, systematically pairing each anomaly with a detailed and context-aware description. The pipeline operates as follows:

  • The anomalous region is enclosed by a bounding box derived from the mask Mʳᵉᶠ (a minimal sketch of this step follows the list).
  • The LLM receives both the whole image and the cropped region, ensuring its response focuses on the localized defect.
  • Caption generation adheres strictly to the fixed template, maximizing consistency across domains and defect types.
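
The geometric step is simple to sketch. Below is a minimal example of deriving the bounding-box crop from Mʳᵉᶠ; the context margin is an assumption, not a value reported for AnomVerse:

```python
import numpy as np

def crop_anomaly_region(image: np.ndarray, mask: np.ndarray,
                        margin: int = 16) -> np.ndarray:
    """Enclose the anomalous pixels of a binary mask in a bounding box
    (with a small context margin) and return the matching image crop.

    The margin value is an illustrative assumption.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no anomalous pixels")
    h, w = mask.shape[:2]
    y0 = max(0, int(ys.min()) - margin)
    y1 = min(h, int(ys.max()) + 1 + margin)
    x0 = max(0, int(xs.min()) - margin)
    x1 = min(w, int(xs.max()) + 1 + margin)
    return image[y0:y1, x0:x1]
```

Cropping to a bounding box with a margin, rather than to the exact mask, preserves some surrounding context, which helps the LLM describe the defect's location relative to the object.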

Unlike prior datasets that rely on human annotation or on brief, often ambiguous categorical labels, the AnomVerse caption protocol ensures semantic richness, unambiguous localization, and strong correlation with visual features.

3. Utility in Anomaly Generation and Detection

AnomVerse provides the data backbone for training crossmodal generative models, particularly those employing prompt-based inpainting with both visual and textual guidance. The triplet structure supports three forms of conditioning (a combined sketch follows this list):

  • Visual prompt encoding: Directly isolating the anomalous region for fine-grained visual conditioning.
  • Textual prompt encoding: Allowing for zero-shot or compositional anomaly generation via descriptive natural language.
  • Mask supervision: Enabling precise, pixel-accurate inpainting and alignment during both training and refinement.
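
A hedged sketch of how the three conditioning signals could be assembled from one triplet; the encoder callables are hypothetical stand-ins, not Anomagic's published interfaces:

```python
# Hypothetical interfaces: `visual_encoder` and `text_encoder` stand in for
# whatever region-aware image encoder and caption encoder a given pipeline
# uses; they are not Anomagic's actual API.
def build_conditioning(triplet, visual_encoder, text_encoder):
    """Turn (I_ref, M_ref, t_ref) into the three signals listed above:
    a visual prompt, a textual prompt, and pixel-accurate mask supervision."""
    region = triplet.image * triplet.mask[..., None]    # isolate the anomalous region
    visual_prompt = visual_encoder(region)              # fine-grained visual conditioning
    textual_prompt = text_encoder(triplet.caption)      # descriptive language conditioning
    return visual_prompt, textual_prompt, triplet.mask  # mask supervises inpainting
```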

These properties make AnomVerse well suited to workflows such as the Anomagic pipeline, where crossmodal prompt encoders leverage both regions and captions to steer inpainting-based anomaly synthesis. The dataset’s coverage of diverse defect types promotes strong generalization to both in-domain and out-of-domain anomaly generation.

4. Applications in Anomagic and Beyond

In Anomagic (Jiang et al., 13 Nov 2025), AnomVerse is central to conditioning the model to produce targeted, realistic anomalies. The dataset supports:

  • Crossmodal Prompt Encoding (CPE): Visual features are extracted from (Iʳᵉᶠ, Mʳᵉᶠ) via CLIP and region attention; textual captions tʳᵉᶠ are split, embedded, and pooled; both channels are fused into unified prompt vectors.
  • Zero-shot Prompt Retrieval: User queries are matched against tʳᵉᶠ via semantic similarity, retrieving exemplar triplets and their associated masks for conditioning (a retrieval sketch follows this list).
  • Refinement and Consistency: Precise mask annotations Mʳᵉᶠ enable post-generation contrastive refinement, enforcing tight localization of artifacts.
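
Zero-shot prompt retrieval reduces to nearest-neighbor search over caption embeddings. The sketch below uses sentence-transformers as an illustrative embedding model; the encoder Anomagic actually uses may differ:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model, not necessarily the one used by Anomagic.
model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(captions: list[str]) -> np.ndarray:
    """Embed all stored captions t_ref once, L2-normalized for cosine search."""
    return np.asarray(model.encode(captions, normalize_embeddings=True))

def retrieve(query: str, index: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return indices of the top-k captions most similar to the user query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                  # cosine similarity on unit vectors
    return np.argsort(-scores)[:top_k]
```

The returned indices map back to stored triplets, whose images and masks then serve as exemplars for conditioning the generator.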

The versatility of AnomVerse, especially its support for arbitrary prompts and compositional queries, extends its utility to anomaly detection benchmarks, unsupervised model evaluation, and the creation of synthetic datasets tailored to rare or unseen defect scenarios.

5. Comparative Analysis and Limitations

AnomVerse represents a substantive advancement over prior resources:

  • Previous anomaly datasets typically lack detailed language annotations or high-fidelity mask alignment.
  • Existing synthetic anomaly benchmarks often depend on heuristic cut-and-paste or augmentation protocols, which seldom establish semantic or structural coherence between anomalies and their descriptions.

AnomVerse’s strengths lie in automated, rich, and scalable captioning, broad domain coverage, and integration with current multimodal LLMs. However, because it relies entirely on automatic caption generation, it inherits any domain or descriptive bias present in the LLM, and its semantic granularity is upper-bounded by the information content and templates supplied to the LLM.

6. Role in Research and Future Directions

AnomVerse establishes a standard for benchmarking prompt-driven, crossmodal anomaly synthesis. Its structure facilitates investigations into:

  • Prompt-conditioned diffusion and inpainting models,
  • Evaluation of transfer and generalization abilities across domains or modalities,
  • Zero-shot and few-shot learning protocols for rare defect categories,
  • Retrieval, attribution, and fine-grained diagnosis in industrial and medical settings.

A plausible implication is that future iterations of AnomVerse may employ reinforcement learning or human-in-the-loop refinement to further enhance caption quality, as well as expanded coverage to more complex modalities (e.g., multispectral, 3D, sequential defects), scaling the resource as downstream requirements evolve.
