AutoIAD: Automated Visual Anomaly Detection
- AutoIAD is a multi-agent framework that automates the complete development of industrial visual anomaly detection systems using a central Manager Agent and specialized sub-agents.
- It employs a review and refine mechanism where the Manager Agent iterates over sub-tasks, ensuring robust error correction and improved performance on benchmarks like MVTec AD.
- The framework integrates a domain-specific knowledge base to guide data preparation, model design, and training, significantly outperforming traditional AutoML approaches.
AutoIAD is a multi-agent collaboration framework designed to automate the end-to-end development of industrial visual anomaly detection (IAD) systems. It orchestrates domain-specialized sub-agents using a central Manager Agent and integrates a structured domain-specific knowledge base to guide decision making and pipeline execution. AutoIAD demonstrates substantial improvements over both traditional AutoML and general agentic frameworks in industrial anomaly detection tasks, as validated on the MVTec AD benchmark using various LLM backends (Ji et al., 7 Aug 2025).
1. Manager Agent and Pipeline Orchestration
The central component of AutoIAD is the Manager Agent, which parses a user-specified TaskCard, decomposes it into granular sub-tasks, and schedules four domain-specialized sub-agents in sequence: Data Preparation, Data Loader, Model Designer, and Trainer. The Manager supervises execution by reviewing deliverables and issuing corrective feedback.
The orchestration is governed by a scheduling function that relates four quantities: the shared workspace (files, code, artifacts), Manager-issued feedback, the pipeline state (0 or 1), and the next sub-agent to call. Execution proceeds iteratively: the Manager invokes each agent via a CALL routine, agents perform self-review, and the Manager may request refinement or a retry when errors or unsatisfactory results occur (Ji et al., 7 Aug 2025, Algorithm 1, Fig. 3, and §3.2). This “review and refine” mechanism is crucial for correcting LLM-induced hallucinations and maintaining output quality throughout the pipeline.
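The review-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration rather than the paper's Algorithm 1: the `run_pipeline` function, the `Review` record, the `max_retries` budget, and the toy agents are all hypothetical names introduced here, and real agents would be LLM-driven rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a workspace is a dict of artifacts; an agent takes the
# workspace plus optional Manager feedback and returns an updated workspace.
Workspace = Dict[str, object]
Agent = Callable[[Workspace, Optional[str]], Workspace]


@dataclass
class Review:
    ok: bool
    feedback: Optional[str] = None


def run_pipeline(
    agents: List[Tuple[str, Agent]],
    review: Callable[[str, Workspace], Review],
    max_retries: int = 3,  # assumed retry budget, not taken from the paper
) -> Tuple[int, Workspace]:
    """Call each sub-agent in order; re-call it with feedback until review passes.

    Returns (state, workspace), where state is 1 on success and 0 on failure.
    """
    workspace: Workspace = {}
    for name, agent in agents:
        feedback: Optional[str] = None
        for _ in range(max_retries):
            workspace = agent(workspace, feedback)
            verdict = review(name, workspace)
            if verdict.ok:
                break
            feedback = verdict.feedback  # handed back to the agent on the next call
        else:
            return 0, workspace  # retries exhausted for this sub-task
    return 1, workspace


# Toy usage: the data-preparation agent only succeeds after receiving feedback.
def data_prep(ws: Workspace, fb: Optional[str]) -> Workspace:
    return {**ws, "dataset.csv": "ok" if fb else "broken"}


def data_loader(ws: Workspace, fb: Optional[str]) -> Workspace:
    return {**ws, "Dataloader.py": "ok"}


def toy_review(name: str, ws: Workspace) -> Review:
    key = {"data_prep": "dataset.csv", "data_loader": "Dataloader.py"}[name]
    if ws.get(key) == "ok":
        return Review(ok=True)
    return Review(ok=False, feedback=f"{key} failed validation")


state, final_ws = run_pipeline(
    [("data_prep", data_prep), ("data_loader", data_loader)], toy_review
)
```

In the toy run, the first sub-agent is called twice (once without and once with feedback) before the pipeline moves on, which mirrors the corrective role the Manager plays.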
2. Sub-Agent Structure and Function
Each sub-agent in AutoIAD focuses on a distinct stage of the IAD workflow, contributing modular deliverables to the shared workspace:
| Agent | Input Artifacts | Output Artifacts |
|---|---|---|
| Data Preparation | Raw image folders | dataset.csv |
| Data Loader | dataset.csv | Dataloader.py, test code |
| Model Designer | Dataloader.py, TaskCard | Model.py, design rationale |
| Trainer | Model.py, Dataloader.py | Model checkpoint, log, AUROC |
- Data Preparation Agent inspects raw dataset structures, derives train/val/test splits, extracts class labels, and generates a canonical dataset.csv.
- Data Loader Agent builds a PyTorch dataloader with batch handling and augmentation hooks, consulting the knowledge base for appropriate transforms.
- Model Designer Agent selects or synthesizes an anomaly detection model (e.g., PatchCore, FastFlow), configures hyperparameters, and outputs tested, documented code.
- Trainer Agent sets up the training script, manages checkpointing, and conducts hyper-parameter optimization. Training progress is evaluated with image-level AUROC, and the Manager instructs retraining when validation metrics are poor (Ji et al., 7 Aug 2025, Fig. 2).
If a sub-agent’s output fails its own review or the Manager's review (e.g., code errors, insufficient AUROC), that agent is called again until the sub-task is completed satisfactorily.
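As an illustration of the first stage, the sketch below builds a `dataset.csv` from a folder tree laid out like MVTec AD (`<category>/train/good/*.png`, `<category>/test/<defect_type>/*.png`). The column names (`path`, `category`, `split`, `label`) and the function names are assumptions made for this example, not the schema used by AutoIAD, and the derivation of a validation split is omitted.

```python
import csv
from pathlib import Path
from typing import Dict, List

# Assumed column schema for the illustration only.
FIELDS = ["path", "category", "split", "label"]


def build_dataset_rows(root: Path) -> List[Dict[str, object]]:
    """Scan <root>/<category>/<split>/<defect_type>/*.png into table rows."""
    rows: List[Dict[str, object]] = []
    for image in sorted(root.glob("*/*/*/*.png")):
        category, split, defect = image.relative_to(root).parts[:3]
        if split not in ("train", "test"):
            continue  # skip other folders such as ground_truth masks
        rows.append(
            {
                "path": image.relative_to(root).as_posix(),
                "category": category,
                "split": split,
                "label": 0 if defect == "good" else 1,  # 0 = normal, 1 = anomalous
            }
        )
    return rows


def write_dataset_csv(root: Path, out_file: Path) -> int:
    """Write the rows to a CSV file and return how many rows were written."""
    rows = build_dataset_rows(root)
    with out_file.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `write_dataset_csv(Path("mvtec_ad"), Path("dataset.csv"))` (paths hypothetical) would produce the single file that the downstream Data Loader stage consumes.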
3. Domain-Specific Knowledge Base
AutoIAD’s “Domain Knowledge Module” is a curated, queryable repository containing anomaly detection best practices. It encompasses:
- Data augmentations: domain-relevant image transforms (resize, flip, noise, custom methods)
- Model templates: reference implementations for autoencoders, patch-embedding models, normalizing flows
- Hyper-parameter guidelines: suggested learning rates, regularization, coreset sampling ratios
- Training scripts: standard loss and logging routines
Agents access the knowledge base by keyword lookup (e.g., “anomaly_model_templates”) at decision points during execution. This structured foundation grounds the pipeline in proven industrial IAD practices and effectively mitigates LLM hallucination, as established by ablation: removing the knowledge base reduces task success to 60% and test AUROC to 0%, indicating model outputs are functionally ineffective without it (§4.4, Table 2 (Ji et al., 7 Aug 2025)).
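A keyword lookup of this kind might look like the following sketch. Only the key `anomaly_model_templates` appears in the text above; the other keys, the stored entries (taken loosely from the category list), and the `lookup` and `build_prompt` helpers are hypothetical and do not reflect the actual storage format of AutoIAD's knowledge base.

```python
from typing import Dict, List

# Hypothetical contents: keys other than "anomaly_model_templates" are invented
# for illustration; entries paraphrase the categories listed in the text.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "anomaly_model_templates": ["autoencoder", "patch-embedding", "normalizing flow"],
    "data_augmentations": ["resize", "flip", "noise"],
    "hyperparameter_guidelines": ["learning rate", "regularization", "coreset sampling ratio"],
    "training_scripts": ["loss routine", "logging routine"],
}


def lookup(keyword: str) -> List[str]:
    """Return the entries stored under an exact keyword, or an empty list."""
    return list(KNOWLEDGE_BASE.get(keyword, []))


def build_prompt(task: str, keyword: str) -> str:
    """Append retrieved entries to an agent's task description, if any exist."""
    entries = lookup(keyword)
    if not entries:
        return task
    hints = "\n".join(f"- {entry}" for entry in entries)
    return f"{task}\n\nReference material ({keyword}):\n{hints}"
```

Grounding the agent's prompt in retrieved entries, rather than letting it generate a design from scratch, is the design choice the ablation result speaks to.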
4. Loss Functions, Evaluation Metrics, and Optimization
AutoIAD employs standard loss functions contingent upon the model class:
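- Reconstruction-based: a reconstruction error such as the squared error $\mathcal{L}_{\mathrm{rec}} = \lVert x - \hat{x} \rVert_2^2$ between an input $x$ and its reconstruction $\hat{x}$
- Normalizing-flow-based: the negative log-likelihood $\mathcal{L}_{\mathrm{NF}} = -\log p_Z(f(x)) - \log \left| \det \frac{\partial f(x)}{\partial x} \right|$ for a flow $f$ with base density $p_Z$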
Model performance is universally evaluated using image-level AUROC: $\mathrm{AUROC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR})$, where TPR and FPR derive from thresholded anomaly scores. The Trainer Agent logs AUROC at each epoch, which the Manager uses to determine convergence or whether retraining is necessary. Hyper-parameter optimization is performed via grid or random search, steered by ranges and heuristics in the knowledge base (Ji et al., 7 Aug 2025, §3.3–3.4).
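For concreteness, image-level AUROC can be computed without any library by using its equivalence to the probability that a randomly chosen anomalous image receives a higher score than a randomly chosen normal one, with ties counted as one half. The function name `image_auroc` and the quadratic-time pairwise loop are choices made for this sketch; a production pipeline would more likely call an existing metrics library.

```python
from typing import Sequence


def image_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve from binary labels (1 = anomalous) and scores.

    Uses the pairwise-ranking formulation: the fraction of (anomalous, normal)
    pairs in which the anomalous sample scores higher, counting ties as 0.5.
    Returns NaN when only one class is present, since the ROC is undefined.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (anomalous, normal) pairs are ranked correctly.
example = image_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A value of 0.5 corresponds to scores that carry no ranking information, which is why a very low AUROC indicates a detector with no useful anomaly signal.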
5. Benchmark Dataset and Evaluation Protocol
Benchmarking uses the MVTec AD dataset, covering 15 tasks that span object categories (e.g., bottle, metal_nut) and textures (e.g., carpet, tile). Protocol specifics:
- Data regime: Training on defect-free samples only; test sets include both normal and defective samples, with pixel-wise masks withheld.
- Pipeline requirements: Each task must complete all four sub-agent stages within fixed time and token limits; output must include a non-NaN AUROC.
- Baselines: Comparisons are made against MLAgent-Bench and AutoML-Agent (AutoML approaches) and against openManus and openHands (generic agentic frameworks), all running on the same Gemini LLM backend.
In head-to-head comparison, AutoIAD achieves higher pipeline completion and anomaly detection performance than every baseline (Ji et al., 7 Aug 2025, Table 1). Evaluation requires both completion of all pipeline stages and anomaly detection efficacy as measured by AUROC.
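The success criterion in the protocol (all four stages completed and a non-NaN AUROC) can be expressed as a small check. The `TaskResult` record, the stage names, and the example values below are hypothetical; time and token limits are assumed to be enforced elsewhere and are not modeled here.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

# Stage names are labels chosen for this sketch.
STAGES = ("data_prep", "data_loader", "model_designer", "trainer")


@dataclass
class TaskResult:
    stages_completed: int    # how many of the four stages finished
    auroc: Optional[float]   # None if no metric was produced


def is_success(result: TaskResult) -> bool:
    """A run succeeds only if every stage finished and AUROC is a real number."""
    return (
        result.stages_completed == len(STAGES)
        and result.auroc is not None
        and not math.isnan(result.auroc)
    )


def success_rate(results: List[TaskResult]) -> float:
    """Percentage of runs that satisfy the success criterion."""
    if not results:
        return 0.0
    return 100.0 * sum(is_success(r) for r in results) / len(results)


# Made-up example runs, not results from the paper.
runs = [
    TaskResult(stages_completed=4, auroc=0.91),
    TaskResult(stages_completed=4, auroc=float("nan")),
    TaskResult(stages_completed=3, auroc=None),
    TaskResult(stages_completed=4, auroc=0.55),
]
rate = success_rate(runs)  # 50.0
```

Note that a run can pass this check while still producing a weak detector, which is why the tables report success rate and AUROC separately.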
6. Comparative Performance and Ablation Analysis
Results on the MVTec AD benchmark for AutoIAD and the baseline frameworks:
| Framework | Success Rate (%) | Test AUROC (%) |
|---|---|---|
| MLAgent-Bench | 0 | - |
| AutoML-Agent | 0 | - |
| openManus | 50 | 48.09 |
| openHands | 73.3 | 53.88 |
| AutoIAD | 88.3 | 63.69 |
LLM backbone evaluation demonstrates the importance of underlying model quality:
- Gemini-2.5-Flash: 88.3% success, 63.69% AUROC
- Qwen-Max: 77.8% success, 25.71% AUROC
- Claude-3.7: 63.3% success (timeout), no AUROC
- GPT-4o-Mini: 43.3% success, 25.00% AUROC
- DeepSeek-v3: 37.8% success, 0.0% AUROC
- Qwen3-235B: 50.0% success, 28.65% AUROC
The class-wise breakdown shows that most object and texture categories complete 3–4 pipeline stages and that select categories (e.g., carpet, tile, metal_nut) reach AUROC > 80%. Cases with “null” AUROC denote completed pipelines with no meaningful anomaly signal (Ji et al., 7 Aug 2025, Table 5).
Ablation experiments further demonstrate the core contributions:
- Without Manager Agent: Success drops to 83.3% and AUROC to 35.01%. Centralized review is pivotal for error correction.
- Without Knowledge Base: Success is 60.0%, AUROC is 0.0%. Domain priors are essential for producing meaningful anomaly detection results (§4.4, Table 2).
7. Significance, Context, and Implications
AutoIAD recasts the conventional industrial anomaly detection workflow, traditionally a manual sequence of data cleaning, augmentation, model and hyper-parameter selection, and evaluation, as a fully automated multi-agent pipeline. The Manager Agent functions as a meta-engineer, orchestrating, validating, and iterating each phase toward an operational anomaly detector. Integrating a structured domain knowledge base grounds the pipeline in empirically validated practices and reduces the LLM-induced hallucination and misconfiguration observed in baseline approaches.
On the MVTec AD benchmark, AutoIAD establishes a new state of the art for automated industrial anomaly detection, achieving both the highest end-to-end pipeline completion (88.3%) and model performance (mean AUROC ≈ 64%) among tested frameworks (Ji et al., 7 Aug 2025). Ablation results underscore the indispensable roles of both centralized managerial supervision and structured domain priors in robust industrial machine learning automation.
This suggests that effective automation of real-world visual anomaly detection workflows is contingent on both agentic oversight and domain-grounded guidance, not solely LLM-based generative code synthesis.