
MRAD-FT: Fine-Tuned Anomaly Detection

Updated 7 February 2026
  • MRAD-FT is a fine-tuning variant of MRAD-TF that employs a two-level memory bank and frozen CLIP backbone to differentiate normal from anomalous samples.
  • It introduces trainable projection matrices and a similarity-dropout operator for optimizing both image-level classification and pixel-level segmentation with minimal computational overhead.
  • Empirical evaluations on industrial and medical benchmarks show substantial AUROC gains, up to a 6-point improvement over the train-free baseline.

MRAD-FT (Memory-Retrieval Anomaly Detection – Fine-Tuned) is a lightweight fine-tuning variant of the train-free MRAD-TF model for zero-shot anomaly detection. It is designed to sharpen the discrimination between normal and anomalous samples in both image-level anomaly classification and pixel-level anomaly segmentation. It leverages large vision-language models such as CLIP while keeping the backbone frozen, delivering state-of-the-art performance with minimal computational overhead (Xu et al., 31 Jan 2026).

1. Conceptual Framework and Architecture

MRAD-FT builds upon the MRAD-TF paradigm by incorporating a two-level memory bank and a frozen CLIP ViT-L/14-336 backbone. The model separates the CLIP image encoder into two branches: a global branch $\Phi_{\text{cls}}$ yielding the class token, and a local V-V attention branch $\Phi_{\text{vv}}$ producing $R$ patch tokens. The memory structure consists of:

  • Image-level memory: $K_{\text{cls}} \in \mathbb{R}^{N_c \times d}$ (class tokens), with corresponding one-hot labels $V_{\text{cls}} \in \{0,1\}^{N_c \times 2}$ (normal, anomaly).
  • Pixel-level memory: $K_{\text{pat}} \in \mathbb{R}^{N_p \times d}$ (patch tokens), with $V_{\text{pat}} \in \{0,1\}^{N_p \times 2}$.

All features are $\ell_2$-normalized.
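As a minimal sketch, the two-level memory bank described above can be assembled as follows. Function names and the NumPy formulation are illustrative, not taken from the authors' code:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """L2-normalize feature vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_memory_bank(class_tokens, image_labels, patch_tokens, patch_labels):
    """Assemble the two-level memory bank.

    class_tokens: (N_c, d) class-token features from Phi_cls
    image_labels: (N_c,) ints in {0, 1} (0 = normal, 1 = anomaly)
    patch_tokens: (N_p, d) V-V attention patch features from Phi_vv
    patch_labels: (N_p,) ints in {0, 1}
    Returns normalized keys K and one-hot value matrices V at both levels.
    """
    K_cls = l2_normalize(class_tokens)
    K_pat = l2_normalize(patch_tokens)
    V_cls = np.eye(2)[image_labels]   # (N_c, 2) one-hot (normal, anomaly)
    V_pat = np.eye(2)[patch_labels]   # (N_p, 2)
    return K_cls, V_cls, K_pat, V_pat
```

The one-hot value matrices are what the softmax-weighted retrieval later turns into (normal, anomaly) probability pairs.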

MRAD-FT introduces two pairs of trainable $d \times d$ projection matrices $(W_q^{\text{cls}}, W_k^{\text{cls}})$ and $(W_q^{\text{seg}}, W_k^{\text{seg}})$, operating on query and key vectors at the image and patch levels, respectively. Given a test image $I$, query features $Q_{\text{cls}}$ (class token) and $Q_{\text{pat}}$ (patch tokens) are extracted. Retrieval logits in a calibrated subspace are computed as:

$$
\begin{aligned}
Y_{\text{cls}}^{n/a} &= \mathrm{softmax}\left(\frac{(Q_{\text{cls}} W_q^{\text{cls}})(K_{\text{cls}} W_k^{\text{cls}})^\top}{\tau} + M_p(Q_{\text{cls}}, K_{\text{cls}})\right)V_{\text{cls}}^{n/a} \\
Y_{\text{seg}}^{n/a} &= \mathrm{softmax}\left(\frac{(Q_{\text{pat}} W_q^{\text{seg}})(K_{\text{pat}} W_k^{\text{seg}})^\top}{\tau} + M_p(Q_{\text{pat}}, K_{\text{pat}})\right)V_{\text{pat}}^{n/a}
\end{aligned}
$$

Here, $M_p(\cdot,\cdot)$ denotes a similarity-dropout operator used during training to mask the top-$p\%$ similarities, acting as hard negative mining while preventing trivial retrieval. $\tau$ is the temperature parameter.
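The retrieval step above can be sketched in NumPy as follows. This is an illustrative single-level implementation under the stated definitions (the masking realizes $M_p$ by setting the top similarities to $-\infty$ before the softmax); it is not the authors' code:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve(Q, K, V, W_q, W_k, tau=1.0, drop_p=0.0, training=False):
    """Memory retrieval in the calibrated subspace.

    Q: (m, d) queries; K: (n, d) memory keys; V: (n, 2) one-hot labels.
    W_q, W_k: (d, d) trainable projections. During training, the top
    drop_p fraction of similarities per query is masked to -inf
    (similarity dropout M_p), acting as hard negative mining.
    Returns (m, 2) per-query (normal, anomaly) probabilities.
    """
    sims = (Q @ W_q) @ (K @ W_k).T / tau           # (m, n) logits
    if training and drop_p > 0:
        k = max(1, int(drop_p * sims.shape[1]))
        top_idx = np.argpartition(sims, -k, axis=1)[:, -k:]
        rows = np.arange(sims.shape[0])[:, None]
        sims[rows, top_idx] = -np.inf              # mask top-p% similarities
    return softmax(sims, axis=1) @ V
```

With the dropout disabled (inference), the most similar memory entries dominate the softmax; with it enabled, the model must match against harder, less similar entries.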

2. Training Objective and Optimization

MRAD-FT is trained end-to-end on auxiliary datasets by optimizing both image-level and patch-level objectives. Given a ground-truth image label $y \in \{0,1\}$ and a downsampled pixel mask $M \in \{0,1\}^R$:

$$
\mathcal{L} = \underbrace{\mathrm{BCE}(Y_{\text{cls}}, y)}_{\mathcal{L}_{\text{cls}}} + \underbrace{\mathrm{Dice}(Y_{\text{seg}}, M) + \mathrm{Focal}(Y_{\text{seg}}, M)}_{\mathcal{L}_{\text{seg}}}
$$

No additional regularization or margin terms are applied. Instead, $M_p$ with $p=5\%$ (classification) and $p=20\%$ (segmentation) acts as hard negative mining. The optimizer is Adam with a learning rate of $5 \times 10^{-4}$ and batch size 8, for a single training epoch. Only $W_q$ and $W_k$ (total $2d^2$ parameters; $\sim$2.8M for $d=768$) are learned, and the CLIP backbone remains frozen throughout.
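The loss above can be sketched as follows. The Dice and Focal formulations (and the focal exponent $\gamma=2$) are standard textbook versions assumed here, since the paper summary does not spell them out:

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Binary cross-entropy over anomaly probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice(p, m, eps=1e-7):
    """Soft Dice loss between predicted patch scores p and mask m."""
    return 1 - (2 * (p * m).sum() + eps) / (p.sum() + m.sum() + eps)

def focal(p, m, gamma=2.0, eps=1e-7):
    """Focal loss; gamma=2 is an assumed default, not from the source."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(m == 1, p, 1 - p)
    return -np.mean((1 - pt) ** gamma * np.log(pt))

def mrad_ft_loss(y_cls_anom, y_img, y_seg_anom, mask):
    """Total objective: L = BCE(cls) + Dice(seg) + Focal(seg)."""
    return bce(y_cls_anom, y_img) + dice(y_seg_anom, mask) + focal(y_seg_anom, mask)
```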

3. Inference Mechanism and Decision Process

During inference, the similarity-dropout operator $M_p$ is disabled. The final anomaly score for a test image $I$ is computed as:

$$
S(I) = Y_{\text{cls}}^a + \mathrm{TopKMean}\left(Y_{\text{seg}}^a\right)
$$

where $\mathrm{TopKMean}$ averages the highest $1\%$ of patch-level anomaly scores, emphasizing the most salient regions and suppressing background noise. An image is classified as anomalous if $S(I)$ exceeds a threshold.
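The scoring rule can be sketched directly (function names are illustrative):

```python
import numpy as np

def topk_mean(scores, frac=0.01):
    """Average the top `frac` fraction of patch anomaly scores."""
    k = max(1, int(round(frac * scores.size)))
    return np.sort(scores.ravel())[-k:].mean()

def anomaly_score(y_cls_anom, y_seg_anom, frac=0.01):
    """S(I) = Y_cls^a + TopKMean(Y_seg^a), pooling the top 1% of patches."""
    return float(y_cls_anom) + topk_mean(np.asarray(y_seg_anom), frac)
```

Averaging only the top fraction of patch scores is what lets a small but confidently anomalous region drive the image-level decision.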

4. Empirical Performance and Comparative Analysis

Across sixteen industrial and medical benchmarks, including MVTec-AD, VisA, BTAD, MPDD, and others, MRAD-FT consistently surpasses the train-free MRAD-TF baseline and other prompt or CLIP-based methods (AdaCLIP, AnomalyCLIP, FAPrompt):

| Metric          | MRAD-TF | MRAD-FT |
|-----------------|---------|---------|
| Pixel AUROC (%) | 85.5    | 91.9    |
| Pixel PRO (%)   | 64.6    | 78.3    |
| Image AUROC (%) | 81.0    | 92.0    |
| Image AP (%)    | 83.2    | 91.9    |

Notably, on MVTec-AD, MRAD-FT achieves 92.2% pixel-level AUROC and 92.3% image-level AUROC (versus 86.7% and 79.0% for MRAD-TF, respectively), demonstrating up to a 6-point improvement with the addition of only two linear projection layers.

5. Ablation Studies and Design Insights

  • Metric calibration: The anomaly-on-anomaly versus normal-on-anomaly similarity gap ($A_q A_k - N_q A_k$) increases from ~0.08 (frozen) to ~0.16 (after fine-tuning), indicating a sharper separation of normal and anomalous features.
  • Two-level memory: Removing the image-level memory branch degrades image-level AUROC by up to 3 points; discarding pixel-level memory reduces both localization (PRO) and image-level AUROC by 2–4 points. This establishes the complementarity of global and local memories.
  • Memory budget: Reducing the patch memory from ~3,000 to 100 entries results in ≤1 point AUROC loss; MRAD-FT is robust to memory size reductions.

6. Implementation and Engineering Details

  • Auxiliary data: VisA (2,162 images, 3,093 patches) or MVTec-AD for memory bank construction.
  • Resolution: Images resized to $518 \times 518$; CLIP ViT-L/14-336 backbone used, always frozen.
  • Feature extraction: Class tokens ($\Phi_{\text{cls}}$) and $R = 14 \times 14$ patch tokens ($\Phi_{\text{vv}}$) are stored with one-hot labels.
  • Hyperparameters: Temperature $\tau=1$; mask thresholds $p_{\text{cls}}=5\%$, $p_{\text{seg}}=20\%$; top-k pooling fraction $k=1\%$.
  • Computational profile: Peak memory <8 GB; parameter increase <3M; training converges in a single epoch on an NVIDIA RTX 3090.
  • Additional techniques: Similarity-dropout $M_p$ prevents trivial retrieval; V-V attention ($\Phi_{\text{vv}}$) preserves local structure; $\ell_2$ normalization stabilizes similarity computations.
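For reference, the hyperparameters listed above can be collected into a single configuration. The key names are illustrative (our own), not taken from a released codebase:

```python
# Illustrative configuration mirroring the implementation details above.
MRAD_FT_CONFIG = {
    "backbone": "CLIP ViT-L/14-336",  # always frozen
    "input_resolution": (518, 518),
    "temperature": 1.0,               # tau
    "p_cls": 0.05,                    # similarity-dropout rate, classification
    "p_seg": 0.20,                    # similarity-dropout rate, segmentation
    "topk_fraction": 0.01,            # top-k pooling over patch scores
    "optimizer": "Adam",
    "learning_rate": 5e-4,
    "batch_size": 8,
    "epochs": 1,
}
```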

7. Significance and Context Within Anomaly Detection

MRAD-FT exemplifies a non-parametric, memory-driven approach that leverages the empirical distribution of auxiliary data, departing from conventional parametric or prompt-tuned anomaly detection strategies. Its architectural simplicity (adding two learned projections), training efficiency (single-epoch convergence), and frozen backbone requirement position it as a compelling solution for scenarios requiring both high statistical efficiency and cross-domain robustness. The framework sets new baselines in both image- and pixel-level anomaly detection and segmentation across heterogeneous datasets without incurring the high computational or modeling cost of alternative approaches (Xu et al., 31 Jan 2026).
