
MRAD-FT: Fine-Tuned Anomaly Detection

Updated 7 February 2026
  • MRAD-FT is a fine-tuning variant of MRAD-TF that employs a two-level memory bank and frozen CLIP backbone to differentiate normal from anomalous samples.
  • It introduces trainable projection matrices and a similarity-dropout operator for optimizing both image-level classification and pixel-level segmentation with minimal computational overhead.
  • Empirical evaluations on industrial and medical benchmarks show substantial AUROC gains, up to a 6-point improvement over the train-free baseline.

MRAD-FT (Memory-Retrieval Anomaly Detection – Fine-Tuned) is a lightweight fine-tuning variant of the train-free MRAD-TF model for zero-shot anomaly detection. It is designed to sharpen the discrimination between normal and anomalous samples in both image-level anomaly classification and pixel-level anomaly segmentation. It leverages large vision-language models such as CLIP while keeping the backbone frozen, delivering state-of-the-art performance with minimal computational overhead (Xu et al., 31 Jan 2026).

1. Conceptual Framework and Architecture

MRAD-FT builds upon the MRAD-TF paradigm by incorporating a two-level memory bank and a frozen CLIP ViT-L/14-336 backbone. The model separates the CLIP image encoder into two branches: a global branch $\Phi_{\text{cls}}$ yielding the class token, and a local V-V attention branch $\Phi_{\text{vv}}$ producing $R$ patch tokens. The memory structure consists of:

  • Image-level memory: $K_{\text{cls}} \in \mathbb{R}^{N_c \times d}$ (class tokens), with corresponding one-hot labels $V_{\text{cls}} \in \{0,1\}^{N_c \times 2}$ (normal, anomaly).
  • Pixel-level memory: $K_{\text{pat}} \in \mathbb{R}^{N_p \times d}$ (patch tokens), with $V_{\text{pat}} \in \{0,1\}^{N_p \times 2}$.

All features are $\ell_2$-normalized.
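As a minimal sketch, the two-level memory bank described above can be assembled as follows. Function names and the NumPy formulation are illustrative, not taken from the authors' code:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """L2-normalize feature vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_memory_bank(class_tokens, image_labels, patch_tokens, patch_labels):
    """Assemble the two-level memory bank.

    class_tokens: (N_c, d) class-token features from Phi_cls
    image_labels: (N_c,) ints in {0, 1} (0 = normal, 1 = anomaly)
    patch_tokens: (N_p, d) V-V attention patch features from Phi_vv
    patch_labels: (N_p,) ints in {0, 1}
    Returns normalized keys K and one-hot value matrices V at both levels.
    """
    K_cls = l2_normalize(class_tokens)
    K_pat = l2_normalize(patch_tokens)
    V_cls = np.eye(2)[image_labels]   # (N_c, 2) one-hot (normal, anomaly)
    V_pat = np.eye(2)[patch_labels]   # (N_p, 2)
    return K_cls, V_cls, K_pat, V_pat
```

The one-hot value matrices are what the softmax-weighted retrieval later turns into (normal, anomaly) probability pairs.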

MRAD-FT introduces two pairs of trainable $d \times d$ projection matrices $(W_q^{\text{cls}}, W_k^{\text{cls}})$ and $(W_q^{\text{seg}}, W_k^{\text{seg}})$, operating on query and key vectors at the image and patch levels, respectively. Given a test image $I$, query features $Q_{\text{cls}}$ (class token) and $Q_{\text{pat}}$ (patch tokens) are extracted. Retrieval logits in a calibrated subspace are computed as:

$$
\begin{aligned}
Y_{\text{cls}}^{n/a} &= \mathrm{softmax}\left(\frac{(Q_{\text{cls}} W_q^{\text{cls}})(K_{\text{cls}} W_k^{\text{cls}})^\top}{\tau} + M_p(Q_{\text{cls}}, K_{\text{cls}})\right)V_{\text{cls}}^{n/a} \\
Y_{\text{seg}}^{n/a} &= \mathrm{softmax}\left(\frac{(Q_{\text{pat}} W_q^{\text{seg}})(K_{\text{pat}} W_k^{\text{seg}})^\top}{\tau} + M_p(Q_{\text{pat}}, K_{\text{pat}})\right)V_{\text{pat}}^{n/a}
\end{aligned}
$$

Here, $M_p(\cdot,\cdot)$ denotes a similarity-dropout operator used during training to mask the top-$p\%$ similarities, acting as hard negative mining while preventing trivial retrieval. $\tau$ is the temperature parameter.
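The retrieval step above can be sketched in NumPy as follows. This is an illustrative single-level implementation under the stated definitions (the masking realizes $M_p$ by setting the top similarities to $-\infty$ before the softmax); it is not the authors' code:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve(Q, K, V, W_q, W_k, tau=1.0, drop_p=0.0, training=False):
    """Memory retrieval in the calibrated subspace.

    Q: (m, d) queries; K: (n, d) memory keys; V: (n, 2) one-hot labels.
    W_q, W_k: (d, d) trainable projections. During training, the top
    drop_p fraction of similarities per query is masked to -inf
    (similarity dropout M_p), acting as hard negative mining.
    Returns (m, 2) per-query (normal, anomaly) probabilities.
    """
    sims = (Q @ W_q) @ (K @ W_k).T / tau           # (m, n) logits
    if training and drop_p > 0:
        k = max(1, int(drop_p * sims.shape[1]))
        top_idx = np.argpartition(sims, -k, axis=1)[:, -k:]
        rows = np.arange(sims.shape[0])[:, None]
        sims[rows, top_idx] = -np.inf              # mask top-p% similarities
    return softmax(sims, axis=1) @ V
```

With the dropout disabled (inference), the most similar memory entries dominate the softmax; with it enabled, the model must match against harder, less similar entries.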

2. Training Objective and Optimization

MRAD-FT is trained end-to-end on auxiliary datasets by optimizing both image-level and patch-level objectives. Given a ground-truth image label $y \in \{0,1\}$ and a downsampled pixel mask $M \in \{0,1\}^R$:

$$
\mathcal{L} = \underbrace{\mathrm{BCE}(Y_{\text{cls}}, y)}_{\mathcal{L}_{\text{cls}}} + \underbrace{\mathrm{Dice}(Y_{\text{seg}}, M) + \mathrm{Focal}(Y_{\text{seg}}, M)}_{\mathcal{L}_{\text{seg}}}
$$

No additional regularization or margin terms are applied. Instead, $M_p$ with $p=5\%$ (classification) and $p=20\%$ (segmentation) acts as hard negative mining. The optimizer is Adam with a learning rate of $5 \times 10^{-4}$ and batch size 8, for a single training epoch. Only $W_q$ and $W_k$ (total $2d^2$ parameters; $\sim$2.8M for $d=768$) are learned, and the CLIP backbone remains frozen throughout.
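The loss above can be sketched as follows. The Dice and Focal formulations (and the focal exponent $\gamma=2$) are standard textbook versions assumed here, since the paper summary does not spell them out:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy over anomaly probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice(p, m, eps=1e-7):
    """Soft Dice loss between predicted patch scores p and mask m."""
    return 1 - (2 * (p * m).sum() + eps) / (p.sum() + m.sum() + eps)

def focal(p, m, gamma=2.0, eps=1e-7):
    """Focal loss; gamma=2 is an assumed default, not from the source."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(m == 1, p, 1 - p)
    return -np.mean((1 - pt) ** gamma * np.log(pt))

def mrad_ft_loss(y_cls_anom, y_img, y_seg_anom, mask):
    """Total objective: L = BCE(cls) + Dice(seg) + Focal(seg)."""
    return bce(y_cls_anom, y_img) + dice(y_seg_anom, mask) + focal(y_seg_anom, mask)
```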

3. Inference Mechanism and Decision Process

During inference, the similarity-dropout operator $M_p$ is disabled. The final anomaly score for a test image $I$ is computed as:

$$
S(I) = Y_{\text{cls}}^a + \mathrm{TopKMean}\left(Y_{\text{seg}}^a\right)
$$

where $\mathrm{TopKMean}$ averages the highest $1\%$ of patch-level anomaly scores, emphasizing the most salient regions and suppressing background noise. An image is classified as anomalous if $S(I)$ exceeds a threshold.
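The scoring rule can be sketched directly (function names are illustrative):

```python
import numpy as np

def topk_mean(scores, frac=0.01):
    """Average the top `frac` fraction of patch anomaly scores."""
    k = max(1, int(round(frac * scores.size)))
    return np.sort(scores.ravel())[-k:].mean()

def anomaly_score(y_cls_anom, y_seg_anom, frac=0.01):
    """S(I) = Y_cls^a + TopKMean(Y_seg^a), pooling the top 1% of patches."""
    return float(y_cls_anom) + topk_mean(np.asarray(y_seg_anom), frac)
```

Averaging only the top fraction of patch scores is what lets a small but confidently anomalous region drive the image-level decision.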

4. Empirical Performance and Comparative Analysis

Across sixteen industrial and medical benchmarks, including MVTec-AD, VisA, BTAD, MPDD, and others, MRAD-FT consistently surpasses the train-free MRAD-TF baseline and other prompt or CLIP-based methods (AdaCLIP, AnomalyCLIP, FAPrompt):

| Metric          | MRAD-TF | MRAD-FT |
|-----------------|---------|---------|
| Pixel AUROC (%) | 85.5    | 91.9    |
| Pixel PRO (%)   | 64.6    | 78.3    |
| Image AUROC (%) | 81.0    | 92.0    |
| Image AP (%)    | 83.2    | 91.9    |

Notably, on MVTec-AD, MRAD-FT achieves 92.2% pixel-level AUROC and 92.3% image-level AUROC (versus 86.7% and 79.0% for MRAD-TF, respectively), demonstrating up to a 6-point improvement with the addition of only two linear projection layers.

5. Ablation Studies and Design Insights

  • Metric calibration: The anomaly-on-anomaly versus normal-on-anomaly similarity gap ($A_q A_k - N_q A_k$) increases from ~0.08 (frozen) to ~0.16 (after fine-tuning), indicating a sharper separation of normal and anomalous features.
  • Two-level memory: Removing the image-level memory branch degrades image-level AUROC by up to 3 points; discarding pixel-level memory reduces both localization (PRO) and image-level AUROC by 2–4 points. This establishes the complementarity of global and local memories.
  • Memory budget: Reducing the patch memory from ~3,000 to 100 entries results in ≤1 point AUROC loss; MRAD-FT is robust to memory size reductions.

6. Implementation and Engineering Details

  • Auxiliary data: VisA (2,162 images, 3,093 patches) or MVTec-AD for memory bank construction.
  • Resolution: Images resized to $518 \times 518$; CLIP ViT-L/14-336 backbone used, always frozen.
  • Feature extraction: Class tokens ($\Phi_{\text{cls}}$) and $R = 14 \times 14$ patch tokens ($\Phi_{\text{vv}}$) are stored with one-hot labels.
  • Hyperparameters: Temperature $\tau=1$; mask thresholds $p_{\text{cls}}=5\%$, $p_{\text{seg}}=20\%$; top-k pooling fraction $k=1\%$.
  • Computational profile: Peak memory <8 GB; parameter increase <3M; training converges in a single epoch on an NVIDIA RTX 3090.
  • Additional techniques: Similarity-dropout $M_p$ prevents trivial retrieval; V-V attention ($\Phi_{\text{vv}}$) preserves local structure; $\ell_2$ normalization stabilizes similarity computations.
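For reference, the hyperparameters listed above can be collected into a single configuration. The key names are illustrative (our own), not taken from a released codebase:

```python
# Illustrative configuration mirroring the implementation details above.
MRAD_FT_CONFIG = {
    "backbone": "CLIP ViT-L/14-336",  # always frozen
    "input_resolution": (518, 518),
    "temperature": 1.0,               # tau
    "p_cls": 0.05,                    # similarity-dropout rate, classification
    "p_seg": 0.20,                    # similarity-dropout rate, segmentation
    "topk_fraction": 0.01,            # top-k pooling over patch scores
    "optimizer": "Adam",
    "learning_rate": 5e-4,
    "batch_size": 8,
    "epochs": 1,
}
```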

7. Significance and Context Within Anomaly Detection

MRAD-FT exemplifies a non-parametric, memory-driven approach that leverages the empirical distribution of auxiliary data, departing from conventional parametric or prompt-tuned anomaly detection strategies. Its architectural simplicity (adding two learned projections), training efficiency (single-epoch convergence), and frozen backbone requirement position it as a compelling solution for scenarios requiring both high statistical efficiency and cross-domain robustness. The framework sets new baselines in both image- and pixel-level anomaly detection and segmentation across heterogeneous datasets without incurring the high computational or modeling cost of alternative approaches (Xu et al., 31 Jan 2026).
