Anomagic: Zero-Shot Anomaly Generation
- Anomagic is a crossmodal prompt-driven zero-shot anomaly generation system that fuses visual and textual cues to create semantically coherent and realistic anomalies.
- It employs a masked latent diffusion model with crossmodal prompt encoding and contrastive mask refinement to ensure precise mask–anomaly alignment.
- Tested on the AnomVerse dataset across various domains, Anomagic enhances anomaly detection pipelines with improved realism, diversity, and spatial accuracy.
Anomagic is a crossmodal prompt-driven, zero-shot anomaly generation framework that synthesizes semantically coherent and realistic anomalies without requiring any exemplar anomalies. The system leverages a foundation model paradigm, integrating both visual and textual cues via a dedicated crossmodal prompt encoding scheme to steer an inpainting-based generation pipeline. Subsequent contrastive refinement ensures precise mask–anomaly alignment, enabling the robust augmentation of anomaly detection pipelines. The framework is trained on AnomVerse, a curated collection of anomaly–mask–caption triplets derived from industrial, textile, consumer, medical, and electronics datasets using multimodal LLMs for detailed caption generation.
1. Methodology: Crossmodal Prompt Encoding and Inpainting Diffusion
Anomagic operates by unifying visual and textual cues, processed through a Crossmodal Prompt Encoding (CPE) module that yields a unified embedding conditioning the generative process. The visual guidance component extracts region-specific semantics by masking CLIP-encoded feature maps with a reference anomaly mask and attending over the masked features so that background content is suppressed:

$$e_v = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \qquad Q = W_q F,\quad K = W_k (F \odot M),\quad V = W_v (F \odot M),$$

where $F$ is the CLIP-encoded feature map, $W_q$, $W_k$, $W_v$ are learnable projections, $M$ is the reference anomaly mask, and $\odot$ denotes element-wise masking.
Textual semantics are encoded by segmenting the anomaly caption and aggregating the CLIP embeddings of the segments:

$$e_t = \frac{1}{S}\sum_{i=1}^{S} t_i,$$

where $t_1, \dots, t_S$ are the text-segment encodings. The two modalities are fused via cross-attention, with the textual embedding attending over the visual tokens, to produce the final prompt embedding:

$$e_p = \operatorname{CrossAttn}(e_t, e_v).$$

Only the CPE parameters are updated during training, while the CLIP backbones remain frozen.
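A minimal PyTorch sketch of the CPE idea follows. The module names, dimensions, and the mean aggregation of text segments are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class CrossmodalPromptEncoder(nn.Module):
    """Minimal sketch of the CPE idea: mask-gated attention over frozen
    CLIP patch features, fused with aggregated text-segment embeddings
    via cross-attention. Dimensions and names are illustrative."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        # Learnable projections over frozen CLIP features.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Cross-attention that fuses textual and visual prompts.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, clip_feats, mask, text_segs):
        # clip_feats: (B, N, D) patch features from the frozen CLIP image encoder
        # mask:       (B, N)    binary reference-anomaly mask over patches
        # text_segs:  (B, S, D) CLIP embeddings of caption segments
        masked = clip_feats * mask.unsqueeze(-1)          # suppress background
        q = self.q_proj(clip_feats)
        k, v = self.k_proj(masked), self.v_proj(masked)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        e_v = attn @ v                                    # region-specific visual tokens
        e_t = text_segs.mean(dim=1, keepdim=True)         # aggregated text embedding
        e_p, _ = self.fuse(e_t, e_v, e_v)                 # text queries attend to visuals
        return e_p                                        # (B, 1, D) prompt embedding
```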
The inpainting generator is based on a masked latent diffusion model (LDM), specifically Stable Diffusion v1.5, fine-tuned via LoRA adapters injected into all cross-attention layers. Anomaly regions for inpainting are defined via morphological dilation of ground-truth masks sourced from AnomVerse.
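As a concrete illustration of the mask-dilation step, the OpenCV sketch below dilates a binary ground-truth mask with an elliptical structuring element; the kernel size is an assumed hyperparameter, since the exact dilation radius is not stated here:

```python
import cv2
import numpy as np

def inpainting_mask(gt_mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Dilate a binary ground-truth anomaly mask to define the
    inpainting region. kernel_size is an assumed hyperparameter."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    return cv2.dilate(gt_mask.astype(np.uint8), kernel, iterations=1)
```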
2. Training Procedure and Objective Function
Training proceeds on triplets $(x, m, c)$ of image, anomaly mask, and caption, with the inpainting mask $m_d$ generated by dilating $m$. The inpainted input image is encoded to a latent $z_0$, and noise is injected at a uniformly sampled diffusion timestep $t$:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

A UNet parameterized by frozen base weights $\theta$ and trainable LoRA adapters $\Delta\theta$ predicts the noise, outputting $\epsilon_{\theta+\Delta\theta}(z_t, t, e_p)$. The loss penalizes only the masked regions:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\!\left[\big\| m_d \odot \big(\epsilon - \epsilon_{\theta+\Delta\theta}(z_t, t, e_p)\big) \big\|_2^2\right].$$
Training runs for roughly 50,000 steps with the AdamW optimizer, a batch size of 1, and 20-step DDIM sampling.
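The masked objective can be sketched with standard diffusers-style components; `unet`, `vae_encode`, and `scheduler` stand in for the Stable Diffusion v1.5 parts, and the exact interfaces assumed here follow diffusers conventions rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(unet, vae_encode, x, mask_d, prompt_emb, scheduler):
    """Sketch of the masked-region noise-prediction objective.
    mask_d is the dilated inpainting mask as a (B, 1, H, W) float tensor."""
    z0 = vae_encode(x)                                # latent of inpainted image
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, noise, t)            # forward diffusion step
    noise_pred = unet(zt, t, encoder_hidden_states=prompt_emb).sample
    m = F.interpolate(mask_d, size=z0.shape[-2:])     # mask in latent resolution
    # Penalize prediction error only inside the (dilated) anomaly region.
    return ((m * (noise_pred - noise)) ** 2).mean()
```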
3. Contrastive Mask Refinement
To resolve alignment discrepancies between inpainting masks and synthesized anomalies, Anomagic applies a contrastive refinement strategy at inference. A discrepancy map is computed between the original image $x$ and the synthesized image $\hat{x}$,

$$d = \left| \hat{x} - x \right|,$$

and a pre-trained MetaUAS discrepancy detector scores this map to generate a refined anomaly mask

$$\hat{m} = \mathbb{1}\!\left[\operatorname{MetaUAS}(d) > \tau\right], \qquad \tau = 0.9.$$

This process requires no additional loss terms or model updates and improves the spatial precision of the synthesized anomaly masks.
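A compact sketch of this refinement step, treating MetaUAS as an opaque scoring callable (the callable and its output range are assumptions):

```python
import numpy as np

def refine_mask(x_orig, x_gen, metauas_score, tau: float = 0.9):
    """Compute a per-pixel discrepancy map between original and
    synthesized images, score it with a pre-trained MetaUAS detector
    (here an opaque callable), and threshold at tau = 0.9."""
    d = np.abs(x_gen.astype(np.float32) - x_orig.astype(np.float32)).mean(axis=-1)
    s = metauas_score(d)                  # anomaly score map in [0, 1], assumed
    return (s > tau).astype(np.uint8)     # refined binary anomaly mask
```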
4. AnomVerse Dataset Construction and Captioning
AnomVerse aggregates 12,987 anomaly–mask–caption triplets from 13 public datasets (e.g., MVTec AD, VisA, MANTA), distributed across five domains:
| Domain | Percentage (%) | Number of Triplets (derived from %) |
|---|---|---|
| Industrial | 56.5 | ≈7,338 |
| Textiles | 23.6 | ≈3,065 |
| Consumer | 8.7 | ≈1,130 |
| Medical | 5.9 | ≈766 |
| Electronics | 5.3 | ≈688 |
Each sample’s caption is generated by providing both the cropped anomaly region and a structured natural-language template to a multimodal LLM (Doubao-Seed-1.6). The LLM fills in templates such as “The image depicts [object]; a [defect type] is observed at [location]. The defect is characterized by [detail] and exhibits [features].” This yields fine-grained, contextually relevant prompting for anomaly generation.
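The template-filling step might be orchestrated as below; the instruction wording and helper names are hypothetical, since the exact prompt sent to Doubao-Seed-1.6 is not reproduced here:

```python
CAPTION_TEMPLATE = (
    "The image depicts [object]; a [defect type] is observed at [location]. "
    "The defect is characterized by [detail] and exhibits [features]."
)

def build_caption_request(crop_context: str) -> str:
    """Hypothetical prompt asking the multimodal LLM to fill the
    structured template for a cropped anomaly region."""
    return (
        "Given the attached cropped anomaly region, fill in every "
        f"bracketed slot of this template:\n{CAPTION_TEMPLATE}\n"
        f"Additional context: {crop_context}"
    )
```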
5. Quantitative and Qualitative Performance
Performance evaluation encompasses both generation realism/diversity and impact on anomaly detection:
Anomaly Generation
- IS (Inception Score): Realism of samples
- IL (Intra-cluster LPIPS): Diversity within anomaly types (a computation sketch follows this list)
- On VisA (12 categories): Anomagic achieves IS/IL = 2.16/0.39; exceeds prior zero-shot (AnoAny 1.94/0.33) and few-shot SOTA (AnoGen 2.10/0.39).
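For reference, intra-cluster LPIPS can be computed with the public `lpips` package as sketched below; the backbone choice (`net='alex'`) and the pairwise-mean aggregation are assumptions about how IL is instantiated:

```python
import itertools
import lpips
import torch

# Mean pairwise perceptual distance among generated samples of one
# anomaly type; higher values indicate greater diversity.
loss_fn = lpips.LPIPS(net='alex')

def intra_cluster_lpips(samples: torch.Tensor) -> float:
    # samples: (N, 3, H, W) images scaled to [-1, 1], one anomaly type
    dists = [loss_fn(a.unsqueeze(0), b.unsqueeze(0)).item()
             for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)
```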
Anomaly Detection Augmentation
Augmenting the SOTA detector INP-Former++ with Anomagic's synthetic data yields:
- I-ROC = 99.08%
- I-F1 = 96.77%
- PRO = 95.92%
- P-F1 = 54.00%
Compared with the unaugmented baseline and with augmentation using real anomalies, Anomagic's synthetic data delivers consistently higher pixel-level accuracy and region overlap.
Qualitative Alignment
t-SNE embeddings on ResNet50 features reveal that Anomagic’s synthetic anomalies cluster closely with real defect samples, outperforming AnoGen and others on distributional alignment.
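A sketch of how such an embedding analysis could be reproduced with torchvision and scikit-learn (the weight preset, preprocessing, and t-SNE perplexity are assumed, not the authors' exact settings):

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.manifold import TSNE

# Embed real and synthetic anomaly crops with a pre-trained ResNet50,
# then project the pooled features to 2-D with t-SNE.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()        # expose the 2048-d pooled features
model.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, H, W) in [0, 1]; preset handles resize/normalize
    return model(weights.transforms()(images))

def tsne_2d(feats: torch.Tensor):
    return TSNE(n_components=2, perplexity=30).fit_transform(feats.numpy())
```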
6. Capabilities, Application Scenarios, and Limitations
Anomagic supports both unimodal (text-only or image-only) and crossmodal prompting, with the highest generation fidelity from crossmodal input. User queries in natural language, routed through an MLLM and semantic retrieval in AnomVerse, enable zero-shot synthesis of plausible, context-specific anomaly types in new categories. Potential applications include:
- Synthetic data augmentation for anomaly detector training
- Digital-twin simulation and defect prototyping
- Interactive inspection and anomaly scenario prototyping in manufacturing and medical QA
Principal limitations are the reliance on large pretrained models (CLIP, Stable Diffusion, MLLM), the lack of explicit parametric control over synthesized anomaly geometry, and possible mask misalignment at the boundaries of low-contrast defects, particularly where the fixed 0.9 discrepancy threshold may be suboptimal.
7. System Architecture and Deployment Considerations
The main components are (a wiring sketch follows this list):
- CPE module (with trainable projections and cross-attention)
- LoRA-adapted latent diffusion model (Stable Diffusion v1.5)
- Integration with OpenCLIP ViT-H/14 (frozen encoders)
- Contrastive mask refinement at inference via MetaUAS
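An illustrative view of how these pieces wire together at inference; the class, field names, and call pattern are descriptive placeholders, not the authors' API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class AnomagicPipeline:
    clip_image_encoder: Any   # frozen OpenCLIP ViT-H/14
    clip_text_encoder: Any    # frozen OpenCLIP ViT-H/14
    cpe: Any                  # trainable Crossmodal Prompt Encoding module
    unet_with_lora: Any       # SD v1.5 UNet, base frozen + trainable LoRA
    mask_refiner: Any         # pre-trained MetaUAS, inference only

    def generate(self, image, mask, caption):
        # 1. Encode prompts with frozen CLIP backbones + trainable CPE.
        e_p = self.cpe(self.clip_image_encoder(image), mask,
                       self.clip_text_encoder(caption))
        # 2. Inpaint the dilated mask region with the LoRA-adapted LDM.
        synthesized = self.unet_with_lora(image, mask, e_p)
        # 3. Refine the anomaly mask via discrepancy scoring.
        refined_mask = self.mask_refiner(image, synthesized)
        return synthesized, refined_mask
```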
Deployment involves minimal backbone retraining, with updated CPE and LoRA weights sufficient for adaptation to new anomaly types via AnomVerse prompts. The pipeline is applicable as a generative foundation model for anomaly synthesis in diverse, data-scarce industrial settings, enabling augmentation and direct scenario simulation via multimodal, user-driven prompting.