MRAD-FT: Fine-Tuned Anomaly Detection
- MRAD-FT is a fine-tuning variant of MRAD-TF that employs a two-level memory bank and frozen CLIP backbone to differentiate normal from anomalous samples.
- It introduces trainable projection matrices and a similarity-dropout operator for optimizing both image-level classification and pixel-level segmentation with minimal computational overhead.
- Empirical evaluations on industrial and medical benchmarks demonstrate substantial AUROC gains of up to 6 points over the train-free baseline.
MRAD-FT (Memory-Retrieval Anomaly Detection – Fine-Tuned) is a lightweight fine-tuning variant of the train-free MRAD-TF model for zero-shot anomaly detection. It is designed to sharpen the discrimination between normal and anomalous samples in both image-level anomaly classification and pixel-level anomaly segmentation. The method leverages large vision-language models such as CLIP while keeping the backbone frozen, and attains state-of-the-art performance with minimal computational overhead (Xu et al., 31 Jan 2026).
1. Conceptual Framework and Architecture
MRAD-FT builds upon the MRAD-TF paradigm by incorporating a two-level memory bank and a frozen CLIP ViT-L/14-336 backbone. The model separates the CLIP image encoder into two branches: a global branch yielding the class token, and a local V-V attention branch producing patch tokens. The memory structure consists of:
- Image-level memory: class tokens extracted from auxiliary images, each paired with a one-hot label (normal or anomalous).
- Pixel-level memory: patch tokens, each paired with the corresponding patch-level one-hot label.
All stored features are ℓ2-normalized.
MRAD-FT introduces two sets of trainable projection matrices operating on query and key vectors at the image and patch levels, respectively. Given a test image, a query class token (image level) and query patch tokens (pixel level) are extracted. Retrieval logits in a calibrated subspace are then computed by projecting queries and memory keys, measuring their cosine similarity, and softmax-weighting the stored one-hot labels.
During training, a similarity-dropout operator masks the top-k similarities before the softmax; this acts as hard negative mining and prevents trivial retrieval of near-duplicate memory entries. A temperature parameter controls the sharpness of the retrieval softmax.
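The retrieval-with-similarity-dropout step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the shapes, the projection initialization, and the values of `tau` and `k_drop` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: D = feature dim, N = memory-bank entries (assumed values).
D, N = 16, 50
memory = rng.normal(size=(N, D))          # stored keys (e.g., class tokens)
memory /= np.linalg.norm(memory, axis=1, keepdims=True)  # l2-normalize
labels_1hot = np.eye(2)[rng.integers(0, 2, N)]           # (normal, anomaly)

W = rng.normal(size=(D, D)) * 0.1 + np.eye(D)  # trainable projection (sketch)
tau = 0.07                                     # softmax temperature (assumed)

def retrieval_logits(query, train=False, k_drop=3):
    """Label-weighted soft retrieval over the memory bank.

    During training, similarity-dropout masks the top-k_drop similarities,
    so the model cannot rely on trivial nearest-neighbour matches.
    """
    q = W @ query
    q /= np.linalg.norm(q)
    keys = memory @ W.T
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    sim = keys @ q                       # cosine similarities in the subspace
    if train:
        top = np.argsort(sim)[-k_drop:] # indices of the k_drop largest
        sim[top] = -np.inf              # similarity-dropout: mask them out
    w = np.exp(sim / tau)
    w /= w.sum()
    return w @ labels_1hot              # soft (normal, anomaly) logits

query = rng.normal(size=D)
p = retrieval_logits(query, train=True)
```

Because the masked entries receive zero softmax weight, the gradient during training is forced through harder, less similar memory entries, which is what makes the operator behave like hard negative mining.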
2. Training Objective and Optimization
MRAD-FT is trained end-to-end on auxiliary datasets by jointly optimizing an image-level and a patch-level objective: a cross-entropy loss between the image-level retrieval logits and the ground-truth image label, and a cross-entropy loss between the patch-level retrieval logits and the pixel mask downsampled to the patch grid.
No additional regularization or margin terms are applied; similarity-dropout, applied in both the classification and segmentation branches, serves as the hard negative mining mechanism. The optimizer is Adam with batch size 8, run for a single training epoch. Only the two sets of projection matrices (about 2.8M parameters in total) are learned, and the CLIP backbone remains frozen throughout.
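Spelled out, the objective plausibly takes the following form; the notation here is illustrative rather than quoted from the paper:

```latex
\mathcal{L}
  = \mathrm{CE}\big(p^{\text{img}}(x),\, y\big)
  + \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}
    \mathrm{CE}\big(p^{\text{pix}}_{h,w}(x),\, M_{h,w}\big)
```

where \(p^{\text{img}}\) and \(p^{\text{pix}}_{h,w}\) are the image- and patch-level retrieval logits for image \(x\), \(y\) is the ground-truth image label, and \(M\) is the pixel mask downsampled to the \(H \times W\) patch grid.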
3. Inference Mechanism and Decision Process
During inference, similarity-dropout is disabled. The final anomaly score for a test image combines the image-level retrieval score with a pooled patch-level score,
where the pooling operator averages the highest k% of patch-level anomaly scores, emphasizing the most salient regions and suppressing background noise. An image is classified as anomalous if its final score exceeds a threshold.
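The top-k% pooling step is easy to state concretely. A minimal sketch, assuming a 24×24 patch grid and a pooling fraction of 5% (both illustrative; the paper's exact values are not reproduced here):

```python
import numpy as np

def topk_mean(scores, frac=0.05):
    """Average the highest `frac` fraction of patch-level anomaly scores."""
    flat = np.sort(np.ravel(scores))[::-1]        # descending
    k = max(1, int(round(frac * flat.size)))
    return flat[:k].mean()

# A synthetic anomaly map: one salient defect region over background noise.
amap = np.full((24, 24), 0.05)
amap[4:8, 4:8] = 0.9

score_topk = topk_mean(amap, frac=0.05)  # dominated by the defect region
score_mean = amap.mean()                 # diluted by the background
```

Compared with a plain mean, top-k% pooling keeps small defects visible: the defect covers only 16 of 576 patches, so the plain mean stays near background level while the pooled score stays high.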
4. Empirical Performance and Comparative Analysis
Across sixteen industrial and medical benchmarks, including MVTec-AD, VisA, BTAD, MPDD, and others, MRAD-FT consistently surpasses the train-free MRAD-TF baseline and other prompt-tuned or CLIP-based methods (AdaCLIP, AnomalyCLIP, FAPrompt):
| Metric | MRAD-TF | MRAD-FT |
|---|---|---|
| Pixel AUROC (%) | 85.5 | 91.9 |
| Pixel PRO (%) | 64.6 | 78.3 |
| Image AUROC (%) | 81.0 | 92.0 |
| Image AP (%) | 83.2 | 91.9 |
Notably, on MVTec-AD, MRAD-FT achieves 92.2% pixel-level AUROC and 92.3% image-level AUROC (versus 86.7% and 79.0% for MRAD-TF, respectively), demonstrating up to a 6-point improvement with the addition of only two linear projection layers.
5. Ablation Studies and Design Insights
- Metric calibration: The anomaly-on-anomaly versus normal-on-anomaly similarity gap (AA – NA) increases from ~0.08 (frozen) to ~0.16 (after fine-tuning), indicating a sharper separation of normal/anomalous features.
- Two-level memory: Removing the image-level memory branch degrades image-level AUROC by up to 3 points; discarding pixel-level memory reduces both localization (PRO) and image-level AUROC by 2–4 points. This establishes the complementarity of global and local memories.
- Memory budget: Reducing the patch memory from ~3,000 to 100 entries results in ≤1 point AUROC loss; MRAD-FT is robust to memory size reductions.
6. Implementation and Engineering Details
- Auxiliary data: VisA (2,162 images, 3,093 patches) or MVTec-AD for memory bank construction.
- Resolution: Images resized to 336×336, matching the CLIP ViT-L/14-336 backbone, which is always frozen.
- Feature extraction: Class tokens and patch tokens are stored in the memory banks with one-hot labels.
- Hyperparameters: a softmax temperature for retrieval, thresholds for binarizing the downsampled masks, and a top-k pooling fraction for image-level scoring.
- Computational profile: Peak memory <8 GB; parameter increase <3M; training converges in a single epoch on an NVIDIA RTX 3090.
- Additional techniques: Similarity-dropout prevents trivial retrieval; V-V attention in the local branch preserves local structure; ℓ2 normalization stabilizes similarity computations.
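The V-V attention mentioned above replaces the usual query-key affinity with a value-value affinity, so each patch attends mostly to visually similar patches (including itself), which keeps the patch tokens spatially faithful. A minimal numpy sketch, with illustrative shapes and random weights standing in for the frozen CLIP parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vv_attention(tokens, W_v, W_o):
    """Self-attention with a V-V affinity instead of the standard Q-K affinity.

    Because v_i . v_i is large, the affinity matrix is diagonally dominant:
    each token attends chiefly to itself and to similar tokens, which is why
    this variant preserves local structure in the patch features.
    """
    v = tokens @ W_v                     # value projection
    scale = np.sqrt(v.shape[-1])
    attn = softmax(v @ v.T / scale)      # V-V affinity, rows sum to 1
    return attn @ v @ W_o                # aggregate values, output projection

rng = np.random.default_rng(1)
T, D = 10, 8                             # tokens, feature dim (assumed)
tokens = rng.normal(size=(T, D))
out = vv_attention(tokens, rng.normal(size=(D, D)), rng.normal(size=(D, D)))
```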
7. Significance and Context Within Anomaly Detection
MRAD-FT exemplifies a non-parametric, memory-driven approach that leverages the empirical distribution of auxiliary data, departing from conventional parametric or prompt-tuned anomaly detection strategies. Its architectural simplicity (adding two learned projections), training efficiency (single-epoch convergence), and frozen backbone requirement position it as a compelling solution for scenarios requiring both high statistical efficiency and cross-domain robustness. The framework sets new baselines in both image- and pixel-level anomaly detection and segmentation across heterogeneous datasets without incurring the high computational or modeling cost of alternative approaches (Xu et al., 31 Jan 2026).