
AnoRefiner: Anomaly-Aware Refiner

Updated 3 December 2025
  • The paper introduces a plug-in decoder (ARD) that refines coarse patch-level anomaly maps from vision transformers into precise pixel-level segmentations.
  • It employs adaptive attention and cross-residual BI blocks to fuse spatial features with anomaly scores, significantly enhancing defect localization.
  • The Progressive Group-Wise Test-Time Training strategy robustly updates ARD parameters, reducing reliance on synthetic anomalies even with noisy pseudo-labels.

The Anomaly-Aware Refiner (AnoRefiner) is a refinement framework designed for zero-shot industrial anomaly detection (ZSAD) that transforms coarse, patch-level anomaly maps produced by vision transformers (ViTs) into fine-grained, pixel-level anomaly segmentations. AnoRefiner operates as a plug-in decoder architecture that leverages complementary spatial information encoded in anomaly score maps to enhance the localization and delineation of defects. It introduces an Anomaly Refinement Decoder (ARD) that interacts with both feature and anomaly branches, and a Progressive Group-Wise Test-Time Training (PGT) strategy tailored to real-world, mass-production scenarios. This approach addresses the limitations of previous ZSAD methods, which rely heavily on synthetic anomaly data and are unable to recover subtle, fine-scale anomalies present in industrial settings (Huang et al., 27 Nov 2025).

1. Motivation and Problem Setting

ZSAD approaches typically extract patch-level descriptors using ViT backbones, resulting in coarse anomaly maps insufficient for high-precision localization. Prior attempts to leverage features from ZSAD models for refining anomaly maps have struggled to recover fine-grained anomalies, especially due to the domain gap between synthesized and real anomalies. AnoRefiner is motivated by the empirical observation that coarse anomaly score maps, though lacking in granularity, contain spatial cues absent from the backbone feature representation. This complementary property allows AnoRefiner to localize defects at the pixel level by fusing these score maps with reconstructed feature hierarchies via attention-based refinement (Huang et al., 27 Nov 2025).

2. Anomaly Refinement Decoder (ARD): Architecture and Data Flow

The ARD processes, for each image, a spatial feature map $F \in \mathbb{R}^{h\times w\times C}$ from a ViT backbone (e.g., $C=768$ for ViT-L/14), paired with a patch-level anomaly score map $A \in \mathbb{R}^{h\times w\times 1}$. The process is as follows:

  • Pre-fusion:
    • The score map $A$ is transformed by a $1\times 1$ convolution to produce $\bar{A}\in\mathbb{R}^{h\times w\times 64}$.
    • The concatenation $[F;\bar{A}]$ is processed by another $1\times 1$ convolution to yield $\tilde{F}\in\mathbb{R}^{h\times w\times 256}$.
  • Anomaly-Aware Refinement (AR) Block (applied $T=2$ times):
    • Both $\tilde{F}$ and $\bar{A}$ are upsampled by a factor of 2 (bilinear interpolation), giving $F_u$ and $A_u$.
    • Features are prepared: $Y_{\text{down}}=\mathrm{Conv}_{5\times 5}(F_u)$, $Y_{\text{top}}=F_u$, and the anomaly branch $\bar{A}_{\text{next}}=\mathrm{Conv}_{3\times 3}(A_u)$.
    • Anomaly-Attention (AA) Module: the fused feature $Y_{\text{fuse}} = \mathrm{Conv}_{3\times 3}(Y_{\text{top}}) + \mathrm{Conv}_{3\times 3}(Y_{\text{down}})$ passes through an adaptive gating mechanism $\zeta = \exp((\beta-1)\odot Y_{\text{fuse}})$, followed by two convolutions and a sigmoid to yield spatial weights $W$, which modulate $Y_{\text{down}}$.
    • The block output is the concatenation $[W\odot Y_{\text{down}};\,\bar{A}_{\text{next}}]$.
  • Bidirectional Perception & Interaction (BI) Block:
    • Cross-residual operations further enhance both the feature and anomaly branches:
    • $A_{\text{ref}} = \bar{A}^{(T)} + \mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(\tilde{F}^{(T)})))$
    • $F_{\text{ref}} = \tilde{F}^{(T)} + \mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{Conv}_{3\times 3}(\bar{A}^{(T)})))$
  • Final Upsampling and Segmentation:
    • $F_{\text{out}}$ is upsampled to the input image size $(H\times W)$, then passed through a $1\times 1$ convolution and a sigmoid activation to yield the final refined anomaly map $\bar{A}$.

3. Mathematical Formulation and Fusion Mechanism

The primary update equation within the AR block for each stage tt is:

$F^{(t+1)},\,A^{(t+1)} = \varphi\bigl(F^{(t)},A^{(t)};\theta_{\text{ARD}}\bigr)$

Expanding, the sequence is:

  • $\tilde F^{(t)} = \mathrm{Up}_2(F^{(t)})$, $\tilde A^{(t)} = \mathrm{Up}_2(A^{(t)})$
  • $Y_{\text{down}} = \mathrm{Conv}_{5\times 5}(\tilde F^{(t)})$, $A^{(t+1)} = \mathrm{Conv}_{3\times 3}(\tilde A^{(t)})$
  • $Y_{\text{fuse}} = \mathrm{Conv}_{3\times 3}(Y_{\text{top}}) + \mathrm{Conv}_{3\times 3}(Y_{\text{down}})$
  • $\zeta = \exp((\beta-1)\odot Y_{\text{fuse}})$
  • $Y' = Y_{\text{fuse}}\odot\zeta$, $W = \sigma(\mathrm{Conv}_{3\times 3}(\mathrm{Conv}_{3\times 3}(Y')))$
  • $F^{(t+1)} = \mathrm{concat}(W\odot Y_{\text{down}},\,A^{(t+1)})$

The BI block adds cross-residuals to both branches. This sequence of operations enables joint refinement of spatial features and anomaly likelihoods across scales.
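The update sequence above can be sketched numerically. Here `conv` is a random pointwise map standing in for the learned $3\times 3$ and $5\times 5$ convolutions (spatial kernel sizes are ignored) and $\beta$ is a scalar placeholder for the learnable gating parameter, so this illustrates only the gating arithmetic and the channel bookkeeping, not the trained weights.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
up2 = lambda x: x.repeat(2, axis=0).repeat(2, axis=1)

def conv(x, c_out):
    # Pointwise stand-in for a learned convolution (spatial extent ignored).
    w = rng.standard_normal((x.shape[-1], c_out)) * 0.05
    return x @ w

h, w = 8, 8
F_t = rng.standard_normal((h, w, 256))   # feature branch at stage t
A_t = rng.standard_normal((h, w, 64))    # anomaly branch at stage t
beta = 0.5                               # placeholder gating parameter

F_u, A_u = up2(F_t), up2(A_t)            # Up_2 on both branches

Y_top = F_u
Y_down = conv(F_u, 256)
A_next = conv(A_u, 64)                   # A^{(t+1)}

Y_fuse = conv(Y_top, 256) + conv(Y_down, 256)
zeta = np.exp((beta - 1.0) * Y_fuse)                  # adaptive gating
W = sigmoid(conv(conv(Y_fuse * zeta, 256), 256))      # spatial weights in (0,1)
F_next = np.concatenate([W * Y_down, A_next], axis=-1)

print(F_next.shape)   # (16, 16, 320)
```

Note that $F^{(t+1)}$ carries the anomaly branch along in its channels (256 + 64 = 320 here), which is how the anomaly cues stay available at every decoding stage.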

4. Progressive Group-Wise Test-Time Training (PGT)

PGT is designed to emulate the mass-production paradigm and minimize dependence on clean anomaly examples:

  1. All $N$ test images are split into $K$ non-overlapping groups $G_1,\dots,G_K$ of uniform size.
  2. Parameters $\theta_{\text{ARD}}$ are initialized randomly.
  3. For each group GkG_k:

    • The ZSAD backbone produces the coarse map $A_I$ for each $I\in G_k$.
    • If $k>1$, the ARD refines the map: $\bar{A}_I = \mathrm{ARD}(I, A_I; \theta_{\text{ARD}})$, and the final map is the average $(A_I+\bar{A}_I)/2$. For $k=1$, the final map is $A_I$.
    • The top-$r$ images with the lowest maximal anomaly score are assumed pseudo-normal.
    • For each pseudo-normal image, $l$ synthetic anomalies are rendered to create image-mask pairs $(I_i^P, M_i^P)$.
    • $\theta_{\text{ARD}}$ is updated by minimizing the pixelwise Dice loss over these pairs:

    $\theta_{\text{ARD}} \gets \theta_{\text{ARD}} - \eta \nabla_\theta \sum_{i=1}^{r\cdot l} \mathrm{Dice}(\mathrm{ARD}(I^P_i, A^P_i; \theta), M^P_i)$

  4. After all groups, output the refined anomaly maps for the entire dataset.

Empirically, even when up to 40% of the pseudo-normal pool contains true anomalies, pixel-AP degrades by less than 1%, indicating robustness to noisy pseudo-labels (Huang et al., 27 Nov 2025).
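The group-wise procedure can be sketched as a plain Python loop. The backbone, ARD forward pass, anomaly synthesis, and gradient step are abstracted into placeholder functions (all names and bodies here are illustrative stand-ins, not the authors' API); only the control flow mirrors the PGT steps above.

```python
import numpy as np

rng = np.random.default_rng(2)

def zsad_backbone(image):
    # Placeholder: returns a coarse patch-level anomaly score map.
    return rng.random((16, 16))

def ard_refine(image, coarse, theta):
    # Placeholder for the ARD forward pass with current parameters theta.
    return np.clip(coarse + 0.01 * theta, 0.0, 1.0)

def synthesize_anomaly(image):
    # Placeholder: paint a synthetic defect and return an (image, mask) pair.
    mask = np.zeros((16, 16)); mask[4:8, 4:8] = 1.0
    return image, mask

def train_step(theta, pairs):
    # Placeholder for one Dice-loss gradient update on theta.
    return theta + 0.1 * len(pairs)

N, K, r, l = 40, 4, 3, 2
images = [rng.random((64, 64)) for _ in range(N)]
groups = [images[i::K] for i in range(K)]     # K non-overlapping, uniform groups
theta = 0.0                                   # stands in for theta_ARD

final_maps = []
for k, group in enumerate(groups):
    coarse = [zsad_backbone(img) for img in group]
    if k == 0:
        final_maps += coarse                  # first group: coarse map only
    else:
        refined = [ard_refine(img, c, theta) for img, c in zip(group, coarse)]
        final_maps += [(c + a) / 2 for c, a in zip(coarse, refined)]
    # The r images with the lowest maximal anomaly score become pseudo-normals.
    order = np.argsort([c.max() for c in coarse])
    pseudo_normal = [group[i] for i in order[:r]]
    pairs = [synthesize_anomaly(img) for img in pseudo_normal for _ in range(l)]
    theta = train_step(theta, pairs)          # r*l image-mask pairs per group

print(len(final_maps))   # 40
```

The key property is that later groups benefit from parameters trained on earlier groups, while the first group falls back to the unrefined backbone output.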

5. Training Objective and Reduced Dependency on Synthetic Anomalies

The refinement loss for training is the sum of Dice losses over pseudo-anomaly image-mask pairs: $\mathcal{L}_{\text{refine}} = \sum_i \mathrm{Dice}(\bar{A}^P_i, M^P_i)$, where

$\mathrm{Dice}(P, G) = 1 - \dfrac{2\sum P \cdot G}{\sum P + \sum G}$
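The Dice term can be implemented directly. This minimal numpy version assumes soft predictions in $[0,1]$ and binary masks; the small epsilon (an addition of ours, not in the formula above) guards against empty masks.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2*sum(P*G) / (sum(P) + sum(G))."""
    inter = np.sum(pred * target)
    return 1.0 - 2.0 * inter / (np.sum(pred) + np.sum(target) + eps)

mask = np.zeros((8, 8)); mask[2:6, 2:6] = 1.0
print(round(dice_loss(mask, mask), 4))      # 0.0 for a perfect prediction
print(round(dice_loss(1 - mask, mask), 4))  # 1.0 for a fully wrong one
```

Because the loss is computed on soft predictions, it stays differentiable and rewards both precise boundaries and full coverage of the mask.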

ARD minimizes its reliance on synthetic anomalies by integrating the coarse anomaly map ($\bar{A}$) at every stage, anchoring the network on regions substantiated by the zero-shot backbone. The cross-branch refinement in the BI block further suppresses upsampling noise and enhances localization.

6. Quantitative and Qualitative Results

Observed experimental gains demonstrate the efficacy of AnoRefiner on the MVTec AD and VisA datasets:

| Method | w/o ARD (pixel-AP) | w/ ARD (pixel-AP) | Δ pixel-AP |
|---|---|---|---|
| APRIL-GAN | 40.8% | 44.4% | +3.6% |
| VCP-CLIP | 49.1% | 51.6% | +2.5% |
| MuSc | 61.2% | 66.3% | +5.1% |

Ablation on MVTec AD shows that the base decoder + anomaly-map branch yields +1.3%, adding the anomaly-attention module gives +1.2%, and the BI block a further +1.2% average pixel-AP improvement.

Qualitatively, refined maps from AnoRefiner display sharper boundaries, fewer spurious regions, and recover thin or small-scale anomalies missed by prior patch-level methods. On highly structured objects such as PCB, the method suppresses background noise and substantially reduces false positives (Huang et al., 27 Nov 2025).

7. Interpretations and Significance

AnoRefiner (ARD + PGT) marks a shift towards leveraging spatially complementary information from anomaly score maps in ZSAD, departing from approaches that treat detection and refinement as purely feature-driven or wholly reliant on synthetic data. This architecture demonstrates that the explicit fusion of coarse anomaly cues with hierarchical upsampling and adaptive attention is sufficient to bridge the granularity gap in industrial AD, achieving up to a 5.1% improvement in pixel-AP over backbones alone. This suggests that integrating anomaly-aware spatial guidance at every decoding stage is a robust paradigm for fine-grained segmentation in low-label or zero-shot regimes (Huang et al., 27 Nov 2025).
