Anomagic: Zero-Shot Anomaly Generation

Updated 15 November 2025
  • Anomagic is a crossmodal prompt-driven zero-shot anomaly generation system that fuses visual and textual cues to create semantically coherent and realistic anomalies.
  • It employs a masked latent diffusion model with crossmodal prompt encoding and contrastive mask refinement to ensure precise mask–anomaly alignment.
  • Tested on the AnomVerse dataset across various domains, Anomagic enhances anomaly detection pipelines with improved realism, diversity, and spatial accuracy.

Anomagic is a crossmodal prompt-driven, zero-shot anomaly generation framework that synthesizes semantically coherent and realistic anomalies without requiring any exemplar anomalies. The system leverages a foundation model paradigm, integrating both visual and textual cues via a dedicated crossmodal prompt encoding scheme to steer an inpainting-based generation pipeline. Subsequent contrastive refinement ensures precise mask–anomaly alignment, enabling the robust augmentation of anomaly detection pipelines. The framework is trained on AnomVerse, a curated collection of anomaly–mask–caption triplets derived from industrial, textile, consumer, medical, and electronics datasets using multimodal LLMs for detailed caption generation.

1. Methodology: Crossmodal Prompt Encoding and Inpainting Diffusion

Anomagic operates by unifying visual and textual cues, processed through a Crossmodal Prompt Encoding (CPE) module that yields a unified embedding $\mathbf{P}_c$ conditioning the generative process. The visual guidance component extracts region-specific semantics by masking CLIP-encoded feature maps with reference anomaly masks, using an attention mechanism that suppresses background features:

$$\mathbf{P}_v = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{D}} - (1-\mathbf{M}^{\mathrm{ref}})\cdot C\right)\mathbf{V}$$

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are learnable projections and $C \gg 0$ is a large constant that drives attention weights on background positions toward zero.
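A minimal PyTorch sketch of this masked attention follows; the tensor shapes and the value of the constant `C` are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_visual_prompt(feats, mask, Wq, Wk, Wv, C=1e4):
    """Attention over CLIP patch features with background positions suppressed.

    feats: (B, N, D) CLIP-encoded patch features
    mask:  (B, N)    reference anomaly mask, 1 inside the anomaly region
    Wq, Wk, Wv: learnable (D, D) projection matrices
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    D = Q.shape[-1]
    logits = Q @ K.transpose(-2, -1) / D ** 0.5        # (B, N, N)
    # Subtracting a large constant C from logits that attend to background
    # key positions collapses their softmax weights toward zero.
    logits = logits - (1.0 - mask)[:, None, :] * C
    return F.softmax(logits, dim=-1) @ V               # (B, N, D)
```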

Textual semantics are encoded by segmenting the anomaly caption and aggregating the CLIP text embeddings of the segments:

$$\mathbf{P}_t = \frac{1}{N}\sum_{i=1}^{N} e_i$$

where the $e_i$ are text segment encodings. The two modalities are fused via cross-attention to produce the final prompt embedding:

$$\mathbf{P}_c = \mathrm{CrossFusion}(\mathbf{P}_v, \mathbf{P}_t)$$

Only the CPE parameters are updated during training, while the CLIP backbones remain frozen.
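The textual branch and the fusion step can be sketched as below; the single cross-attention layer standing in for `CrossFusion` is a simplified assumption about the module's internals.

```python
import torch
import torch.nn as nn

class CrossFusion(nn.Module):
    """Fuse visual prompt tokens with a pooled text embedding via
    cross-attention (a simplified stand-in for the CPE fusion block)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, P_v, P_t):
        # P_v: (B, N, D) visual prompt tokens; P_t: (B, D) pooled text embedding
        P_t = P_t.unsqueeze(1)                        # (B, 1, D)
        fused, _ = self.attn(query=P_v, key=P_t, value=P_t)
        return fused                                  # (B, N, D) conditioning tokens

def text_prompt(segment_embeddings):
    """segment_embeddings: (N, D) CLIP text embeddings of caption segments."""
    return segment_embeddings.mean(dim=0)             # P_t = (1/N) * sum_i e_i
```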

The inpainting generator is based on a masked latent diffusion model (LDM), specifically Stable Diffusion v1.5, fine-tuned via LoRA adapters injected into all cross-attention layers. Anomaly regions for inpainting are defined via morphological dilation of ground-truth masks sourced from AnomVerse.
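The morphological dilation that turns ground-truth masks into inpainting regions is standard; a short sketch with OpenCV follows, where the kernel size is an illustrative assumption.

```python
import cv2
import numpy as np

def make_inpaint_mask(gt_mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Dilate a binary ground-truth anomaly mask to define the inpainting region."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    return cv2.dilate(gt_mask.astype(np.uint8), kernel, iterations=1)
```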

2. Training Procedure and Objective Function

Training proceeds on triplets $(\mathbf{I}^{\mathrm{ref}}, \mathbf{M}^{\mathrm{ref}}, \mathbf{t}^{\mathrm{ref}})$, with inpainting masks $\mathbf{M}_{\mathrm{inp}}$ generated by dilating $\mathbf{M}^{\mathrm{ref}}$. The inpainting input image $\mathbf{I}_{\mathrm{inp}}$ is encoded to a latent $\mathbf{z}_0$, and noise is injected at variable diffusion timesteps:

$$\mathbf{z}_t = \sqrt{\alpha_t}\,\mathbf{z}_0 + \sqrt{1-\alpha_t}\,\varepsilon$$

A UNet parameterized by frozen base weights $\theta_{\mathrm{SD}}$ and trainable LoRA adapters $\theta_L$ predicts the noise, outputting $\varepsilon_\theta(\mathbf{z}_t, t, \mathbf{P}_c)$. The loss penalizes only the masked regions:

$$\mathcal{L}_{\mathrm{inpaint}} = \mathbb{E}_{\mathbf{z}_0, t, \varepsilon}\left\| \left(\varepsilon - \varepsilon_\theta(\mathbf{z}_t, t, \mathbf{P}_c)\right) \odot \mathbf{M}_{\mathrm{inp}} \right\|_2^2$$

Training runs for roughly 50,000 steps using the AdamW optimizer with a learning rate of approximately $10^{-4}$ and a batch size of 1; generation uses 20-step DDIM sampling.
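A schematic training step under these hyperparameters is sketched below. The `unet` and `vae` interfaces are placeholders for the LoRA-adapted Stable Diffusion components (in particular, `vae.encode` is assumed to return the latent directly), and the mask is assumed to be pre-downsampled to latent resolution.

```python
import torch

def training_step(unet, vae, alphas_cumprod, I_inp, M_inp_latent, P_c, optimizer):
    """One masked-inpainting diffusion step: noise the latent, predict the
    noise, and penalize the error only inside the latent-resolution mask."""
    with torch.no_grad():
        z0 = vae.encode(I_inp)                          # (B, 4, h, w) latent
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps          # forward diffusion
    eps_pred = unet(z_t, t, P_c)                        # LoRA-adapted UNet
    loss = (((eps - eps_pred) * M_inp_latent) ** 2).mean()
    loss.backward()
    optimizer.step()                                    # AdamW, lr ~ 1e-4
    optimizer.zero_grad()
    return loss.item()
```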

3. Contrastive Mask Refinement

To resolve alignment discrepancies between inpainting masks and synthesized anomalies, Anomagic applies a contrastive refinement strategy at inference. A discrepancy map is computed:

$$\Delta(x, y) = \left\| I_{\mathrm{inpainted}}(x, y) - I_{\mathrm{output}}(x, y) \right\|$$

A pre-trained MetaUAS discrepancy detector generates a refined anomaly mask

$$\mathbf{M}_r(x, y) = \mathbf{1}\left[\Delta(x, y) > \tau\right]$$

with $\tau = 0.9$. This process requires no additional loss terms or model updates and improves the spatial precision of the synthesized anomaly masks.
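A minimal NumPy sketch of the thresholding step follows. MetaUAS is abstracted away here, and the normalization of the discrepancy map to $[0, 1]$ is an assumption made so that the fixed threshold $\tau = 0.9$ is meaningful.

```python
import numpy as np

def refine_mask(I_inpainted: np.ndarray, I_output: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Refine the anomaly mask by thresholding a per-pixel discrepancy map.

    Both images are (H, W, 3) arrays; the discrepancy is the per-pixel L2 norm.
    """
    delta = np.linalg.norm(
        I_inpainted.astype(np.float32) - I_output.astype(np.float32), axis=-1
    )
    delta /= delta.max() + 1e-8                        # normalize to [0, 1]
    return (delta > tau).astype(np.uint8)              # M_r(x, y) = 1[delta > tau]
```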

4. AnomVerse Dataset Construction and Captioning

AnomVerse aggregates 12,987 anomaly–mask–caption triplets from 13 public datasets (e.g., MVTec AD, VisA, MANTA), distributed across five domains:

Domain        Share of triplets (%)
Industrial    56.5
Textiles      23.6
Consumer       8.7
Medical        5.9
Electronics    5.3

Per-domain triplet counts follow from these percentages of the 12,987-triplet total.

Each sample’s caption is generated by providing both the cropped anomaly region and a structured natural-language template to a multimodal LLM (Doubao-Seed-1.6). The LLM fills in templates such as “The image depicts [object]; a [defect type] is observed at [location]. The defect is characterized by [detail] and exhibits [features].” This yields fine-grained, contextually relevant prompting for anomaly generation.
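The captioning step reduces to prompting the MLLM with the cropped region plus a fill-in instruction. A sketch of the prompt construction is below; the exact wording of the instruction is an assumption, and the Doubao-Seed-1.6 API call itself is left abstract.

```python
TEMPLATE = (
    "The image depicts [object]; a [defect type] is observed at [location]. "
    "The defect is characterized by [detail] and exhibits [features]."
)

def build_caption_instruction() -> str:
    """Instruction sent alongside the cropped anomaly image to the MLLM,
    asking it to fill every bracketed slot of the template."""
    return (
        "Describe the anomaly in the attached cropped image by completing "
        f"this template, filling every bracketed slot: {TEMPLATE}"
    )
```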

5. Quantitative and Qualitative Performance

Performance evaluation encompasses both generation realism/diversity and impact on anomaly detection:

Anomaly Generation

  • IS (Inception Score): Realism of samples
  • IL (Intra-cluster LPIPS): Diversity within anomaly types (a computation sketch follows this list)
  • On VisA (12 categories): Anomagic achieves IS/IL = 2.16/0.39; exceeds prior zero-shot (AnoAny 1.94/0.33) and few-shot SOTA (AnoGen 2.10/0.39).
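The IL metric referenced above can be computed with the `lpips` package as the mean pairwise perceptual distance within one anomaly category; preprocessing and batching details are simplified assumptions.

```python
import itertools
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def intra_cluster_lpips(images: torch.Tensor) -> float:
    """Mean pairwise LPIPS over one anomaly category.

    images: (N, 3, H, W) tensor scaled to [-1, 1], all from the same category.
    """
    assert len(images) >= 2, "need at least two samples for pairwise distances"
    dists = [
        loss_fn(images[i : i + 1], images[j : j + 1]).item()
        for i, j in itertools.combinations(range(len(images)), 2)
    ]
    return sum(dists) / len(dists)
```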

Anomaly Detection Augmentation

With Anomagic's synthetic data used as augmentation, the SOTA detector INP-Former++ achieves:

  • I-ROC = 99.08%
  • I-F1 = 96.77%
  • PRO = 95.92%
  • P-F1 = 54.00%

Compared to the unaugmented baseline and to augmentation with real anomalies, Anomagic's synthetic data yields consistently higher pixel-level accuracy and region overlap.

Qualitative Alignment

t-SNE projections of ResNet50 features show that Anomagic's synthetic anomalies cluster closely with real defect samples, indicating better distributional alignment than AnoGen and other baselines.
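This check can be reproduced in outline with torchvision and scikit-learn; using global-pooled penultimate ResNet50 features is an assumption about the exact layer used.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.manifold import TSNE

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
extractor = torch.nn.Sequential(*list(model.children())[:-1])  # drop the fc head
preprocess = weights.transforms()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) float tensor -> (N, 2048) pooled features."""
    return extractor(preprocess(images)).flatten(1)

# With feats_real / feats_fake as embeddings of real and synthetic anomalies:
# coords = TSNE(n_components=2, perplexity=30).fit_transform(
#     torch.cat([feats_real, feats_fake]).numpy())
```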

6. Capabilities, Application Scenarios, and Limitations

Anomagic supports both unimodal (text-only or image-only) and crossmodal prompting, with crossmodal input yielding the highest generation fidelity. Natural-language user queries, routed through an MLLM and semantic retrieval over AnomVerse (a minimal retrieval sketch follows the list below), enable zero-shot synthesis of plausible, context-specific anomaly types in new categories. Potential applications include:

  • Synthetic data augmentation for anomaly detector training
  • Digital-twin simulation and defect prototyping
  • Interactive inspection and anomaly scenario prototyping in manufacturing and medical QA
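The retrieval step referenced above can be sketched as a cosine-similarity search over precomputed CLIP caption embeddings; plain brute-force search, rather than whatever index the system actually uses, is an assumption.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, caption_embs: torch.Tensor, k: int = 5):
    """Return indices of the k AnomVerse captions most similar to the query.

    query_emb:    (D,)   CLIP text embedding of the user's request
    caption_embs: (M, D) precomputed CLIP embeddings of AnomVerse captions
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), caption_embs, dim=-1)
    return sims.topk(k).indices  # top-k matches guide crossmodal prompting
```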

Principal limitations are the reliance on large pretrained models (CLIP, Stable Diffusion, MLLMs), the lack of explicit parametric control over synthesized anomaly geometry, and possible mask misalignment at the boundaries of low-contrast defects, where the fixed discrepancy threshold of 0.9 may be suboptimal.

7. System Architecture and Deployment Considerations

The main components are:

  • CPE module (with trainable projections and cross-attention)
  • LoRA-adapted latent diffusion model (Stable Diffusion v1.5)
  • Integration with OpenCLIP ViT-H/14 (frozen encoders)
  • Contrastive mask refinement at inference via MetaUAS

Deployment requires no backbone retraining: updating the CPE and LoRA weights is sufficient for adaptation to new anomaly types via AnomVerse prompts. The pipeline is applicable as a generative foundation model for anomaly synthesis in diverse, data-scarce industrial settings, enabling augmentation and direct scenario simulation through multimodal, user-driven prompting.
