Anomagic: Zero-Shot Anomaly Generation
- Anomagic is a crossmodal prompt-driven zero-shot anomaly generation system that fuses visual and textual cues to create semantically coherent and realistic anomalies.
- It employs a masked latent diffusion model with crossmodal prompt encoding and contrastive mask refinement to ensure precise mask–anomaly alignment.
- Tested on the AnomVerse dataset across various domains, Anomagic enhances anomaly detection pipelines with improved realism, diversity, and spatial accuracy.
Anomagic is a crossmodal prompt-driven, zero-shot anomaly generation framework that synthesizes semantically coherent and realistic anomalies without requiring any exemplar anomalies. The system leverages a foundation model paradigm, integrating both visual and textual cues via a dedicated crossmodal prompt encoding scheme to steer an inpainting-based generation pipeline. Subsequent contrastive refinement ensures precise mask–anomaly alignment, enabling the robust augmentation of anomaly detection pipelines. The framework is trained on AnomVerse, a curated collection of anomaly–mask–caption triplets derived from industrial, textile, consumer, medical, and electronics datasets using multimodal LLMs for detailed caption generation.
1. Methodology: Crossmodal Prompt Encoding and Inpainting Diffusion
Anomagic operates by unifying visual and textual cues, processed through a Crossmodal Prompt Encoding (CPE) module that yields a unified embedding conditioning the generative process. The visual guidance component extracts region-specific semantics by masking CLIP-encoded feature maps with a reference anomaly mask and attending over the masked features so that background content is suppressed:

$$e_v = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \qquad Q = W_q F,\quad K = W_k (F \odot M),\quad V = W_v (F \odot M),$$

where $F$ is the CLIP-encoded feature map, $W_q$, $W_k$, $W_v$ are learnable projections, $M$ is the reference anomaly mask, and $\odot$ denotes element-wise masking.
Textual semantics are encoded by segmenting the anomaly caption and aggregating the CLIP embeddings of the segments:

$$e_t = \frac{1}{S}\sum_{i=1}^{S} t_i,$$

where $t_1, \dots, t_S$ are the text-segment encodings. The two modalities are fused via cross-attention, with the textual embedding attending over the visual tokens, to produce the final prompt embedding:

$$e_p = \operatorname{CrossAttn}(e_t, e_v).$$

Only the CPE parameters are updated during training, while the CLIP backbones remain frozen.
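A minimal PyTorch sketch of the CPE idea follows. The module names, dimensions, and the mean aggregation of text segments are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class CrossmodalPromptEncoder(nn.Module):
    """Minimal sketch of the CPE idea: mask-gated attention over frozen
    CLIP patch features, fused with aggregated text-segment embeddings
    via cross-attention. Dimensions and names are illustrative."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        # Learnable projections over frozen CLIP features.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Cross-attention that fuses textual and visual prompts.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, clip_feats, mask, text_segs):
        # clip_feats: (B, N, D) patch features from the frozen CLIP image encoder
        # mask:       (B, N)    binary reference-anomaly mask over patches
        # text_segs:  (B, S, D) CLIP embeddings of caption segments
        masked = clip_feats * mask.unsqueeze(-1)          # suppress background
        q = self.q_proj(clip_feats)
        k, v = self.k_proj(masked), self.v_proj(masked)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        e_v = attn @ v                                    # region-specific visual tokens
        e_t = text_segs.mean(dim=1, keepdim=True)         # aggregated text embedding
        e_p, _ = self.fuse(e_t, e_v, e_v)                 # text queries attend to visuals
        return e_p                                        # (B, 1, D) prompt embedding
```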
The inpainting generator is based on a masked latent diffusion model (LDM), specifically Stable Diffusion v1.5, fine-tuned via LoRA adapters injected into all cross-attention layers. Anomaly regions for inpainting are defined via morphological dilation of ground-truth masks sourced from AnomVerse.
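As a concrete illustration of the mask-dilation step, the OpenCV sketch below dilates a binary ground-truth mask with an elliptical structuring element; the kernel size is an assumed hyperparameter, since the exact dilation radius is not stated here:

```python
import cv2
import numpy as np

def inpainting_mask(gt_mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Dilate a binary ground-truth anomaly mask to define the
    inpainting region. kernel_size is an assumed hyperparameter."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    return cv2.dilate(gt_mask.astype(np.uint8), kernel, iterations=1)
```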
2. Training Procedure and Objective Function
Training proceeds on triplets $(x, m, c)$ of image, anomaly mask, and caption, with the inpainting mask $m_d$ generated by dilating $m$. The inpainted input image is encoded to a latent $z_0$, and noise is injected at a uniformly sampled diffusion timestep $t$:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

A UNet parameterized by frozen base weights $\theta$ and trainable LoRA adapters $\Delta\theta$ predicts the noise, outputting $\epsilon_{\theta+\Delta\theta}(z_t, t, e_p)$. The loss penalizes only the masked regions:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\!\left[\big\| m_d \odot \big(\epsilon - \epsilon_{\theta+\Delta\theta}(z_t, t, e_p)\big) \big\|_2^2\right].$$
Training runs for roughly 50,000 steps with the AdamW optimizer, a batch size of 1, and 20-step DDIM sampling.
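The masked objective can be sketched with standard diffusers-style components; `unet`, `vae_encode`, and `scheduler` stand in for the Stable Diffusion v1.5 parts, and the exact interfaces assumed here follow diffusers conventions rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(unet, vae_encode, x, mask_d, prompt_emb, scheduler):
    """Sketch of the masked-region noise-prediction objective.
    mask_d is the dilated inpainting mask as a (B, 1, H, W) float tensor."""
    z0 = vae_encode(x)                                # latent of inpainted image
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, noise, t)            # forward diffusion step
    noise_pred = unet(zt, t, encoder_hidden_states=prompt_emb).sample
    m = F.interpolate(mask_d, size=z0.shape[-2:])     # mask in latent resolution
    # Penalize prediction error only inside the (dilated) anomaly region.
    return ((m * (noise_pred - noise)) ** 2).mean()
```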
3. Contrastive Mask Refinement
To resolve alignment discrepancies between inpainting masks and synthesized anomalies, Anomagic applies a contrastive refinement strategy at inference. A discrepancy map is computed between the original image $x$ and the synthesized image $\hat{x}$,

$$d = \left| \hat{x} - x \right|,$$

and a pre-trained MetaUAS discrepancy detector scores this map to generate a refined anomaly mask

$$\hat{m} = \mathbb{1}\!\left[\operatorname{MetaUAS}(d) > \tau\right], \qquad \tau = 0.9.$$

This process requires no additional loss terms or model updates and improves the spatial precision of the synthesized anomaly masks.
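A compact sketch of this refinement step, treating MetaUAS as an opaque scoring callable (the callable and its output range are assumptions):

```python
import numpy as np

def refine_mask(x_orig, x_gen, metauas_score, tau: float = 0.9):
    """Compute a per-pixel discrepancy map between original and
    synthesized images, score it with a pre-trained MetaUAS detector
    (here an opaque callable), and threshold at tau = 0.9."""
    d = np.abs(x_gen.astype(np.float32) - x_orig.astype(np.float32)).mean(axis=-1)
    s = metauas_score(d)                  # anomaly score map in [0, 1], assumed
    return (s > tau).astype(np.uint8)     # refined binary anomaly mask
```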
4. AnomVerse Dataset Construction and Captioning
AnomVerse aggregates 12,987 anomaly–mask–caption triplets from 13 public datasets (e.g., MVTec AD, VisA, MANTA), distributed across five domains:
| Domain | Percentage (%) | Number of Triplets (derived from %) |
|---|---|---|
| Industrial | 56.5 | ≈7,338 |
| Textiles | 23.6 | ≈3,065 |
| Consumer | 8.7 | ≈1,130 |
| Medical | 5.9 | ≈766 |
| Electronics | 5.3 | ≈688 |
Each sample’s caption is generated by providing both the cropped anomaly region and a structured natural-language template to a multimodal LLM (Doubao-Seed-1.6). The LLM fills in templates such as “The image depicts [object]; a [defect type] is observed at [location]. The defect is characterized by [detail] and exhibits [features].” This yields fine-grained, contextually relevant prompting for anomaly generation.
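The template-filling step might be orchestrated as below; the instruction wording and helper names are hypothetical, since the exact prompt sent to Doubao-Seed-1.6 is not reproduced here:

```python
CAPTION_TEMPLATE = (
    "The image depicts [object]; a [defect type] is observed at [location]. "
    "The defect is characterized by [detail] and exhibits [features]."
)

def build_caption_request(crop_context: str) -> str:
    """Hypothetical prompt asking the multimodal LLM to fill the
    structured template for a cropped anomaly region."""
    return (
        "Given the attached cropped anomaly region, fill in every "
        f"bracketed slot of this template:\n{CAPTION_TEMPLATE}\n"
        f"Additional context: {crop_context}"
    )
```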
5. Quantitative and Qualitative Performance
Performance evaluation encompasses both generation realism/diversity and impact on anomaly detection:
Anomaly Generation
- IS (Inception Score): Realism of samples
- IL (Intra-cluster LPIPS): Diversity within anomaly types (a computation sketch follows this list)
- On VisA (12 categories): Anomagic achieves IS/IL = 2.16/0.39; exceeds prior zero-shot (AnoAny 1.94/0.33) and few-shot SOTA (AnoGen 2.10/0.39).
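For reference, intra-cluster LPIPS can be computed with the public `lpips` package as sketched below; the backbone choice (`net='alex'`) and the pairwise-mean aggregation are assumptions about how IL is instantiated:

```python
import itertools
import lpips
import torch

# Mean pairwise perceptual distance among generated samples of one
# anomaly type; higher values indicate greater diversity.
loss_fn = lpips.LPIPS(net='alex')

def intra_cluster_lpips(samples: torch.Tensor) -> float:
    # samples: (N, 3, H, W) images scaled to [-1, 1], one anomaly type
    dists = [loss_fn(a.unsqueeze(0), b.unsqueeze(0)).item()
             for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)
```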
Anomaly Detection Augmentation
Augmenting the SOTA detector INP-Former++ with Anomagic's synthetic data yields:
- I-ROC = 99.08%
- I-F1 = 96.77%
- PRO = 95.92%
- P-F1 = 54.00%
Compared with the unaugmented baseline and with augmentation using real anomalies, Anomagic's synthetic data delivers consistently higher pixel-level accuracy and region overlap.
Qualitative Alignment
t-SNE embeddings on ResNet50 features reveal that Anomagic’s synthetic anomalies cluster closely with real defect samples, outperforming AnoGen and others on distributional alignment.
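A sketch of how such an embedding analysis could be reproduced with torchvision and scikit-learn (the weight preset, preprocessing, and t-SNE perplexity are assumed, not the authors' exact settings):

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.manifold import TSNE

# Embed real and synthetic anomaly crops with a pre-trained ResNet50,
# then project the pooled features to 2-D with t-SNE.
weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()        # expose the 2048-d pooled features
model.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, H, W) in [0, 1]; preset handles resize/normalize
    return model(weights.transforms()(images))

def tsne_2d(feats: torch.Tensor):
    return TSNE(n_components=2, perplexity=30).fit_transform(feats.numpy())
```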
6. Capabilities, Application Scenarios, and Limitations
Anomagic supports both unimodal (text-only or image-only) and crossmodal prompting, with the highest generation fidelity from crossmodal input. User queries in natural language, routed through an MLLM and semantic retrieval in AnomVerse, enable zero-shot synthesis of plausible, context-specific anomaly types in new categories. Potential applications include:
- Synthetic data augmentation for anomaly detector training
- Digital-twin simulation and defect prototyping
- Interactive inspection and anomaly scenario prototyping in manufacturing and medical QA
Principal limitations are the reliance on large pretrained models (CLIP, Stable Diffusion, MLLM), the lack of explicit parametric control over synthesized anomaly geometry, and possible mask misalignment at the boundaries of low-contrast defects, particularly where the fixed 0.9 discrepancy threshold may be suboptimal.
7. System Architecture and Deployment Considerations
The main components are (a wiring sketch follows this list):
- CPE module (with trainable projections and cross-attention)
- LoRA-adapted latent diffusion model (Stable Diffusion v1.5)
- Integration with OpenCLIP ViT-H/14 (frozen encoders)
- Contrastive mask refinement at inference via MetaUAS
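An illustrative view of how these pieces wire together at inference; the class, field names, and call pattern are descriptive placeholders, not the authors' API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class AnomagicPipeline:
    clip_image_encoder: Any   # frozen OpenCLIP ViT-H/14
    clip_text_encoder: Any    # frozen OpenCLIP ViT-H/14
    cpe: Any                  # trainable Crossmodal Prompt Encoding module
    unet_with_lora: Any       # SD v1.5 UNet, base frozen + trainable LoRA
    mask_refiner: Any         # pre-trained MetaUAS, inference only

    def generate(self, image, mask, caption):
        # 1. Encode prompts with frozen CLIP backbones + trainable CPE.
        e_p = self.cpe(self.clip_image_encoder(image), mask,
                       self.clip_text_encoder(caption))
        # 2. Inpaint the dilated mask region with the LoRA-adapted LDM.
        synthesized = self.unet_with_lora(image, mask, e_p)
        # 3. Refine the anomaly mask via discrepancy scoring.
        refined_mask = self.mask_refiner(image, synthesized)
        return synthesized, refined_mask
```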
Deployment involves minimal backbone retraining, with updated CPE and LoRA weights sufficient for adaptation to new anomaly types via AnomVerse prompts. The pipeline is applicable as a generative foundation model for anomaly synthesis in diverse, data-scarce industrial settings, enabling augmentation and direct scenario simulation via multimodal, user-driven prompting.