ViewSense-AD (VSAD)

Updated 1 December 2025
  • ViewSense-AD (VSAD) is an unsupervised framework that leverages homography-guided alignment and latent diffusion to detect anomalies across multiple fixed viewpoints.
  • It integrates specialized modules—MVAM, VALDM, and FRM—to fuse, refine, and enforce global consistency in multi-view feature representations.
  • Experimental evaluations on RealIAD and MANTA datasets demonstrate significant gains in anomaly localization and scoring compared to baseline methods.

ViewSense-AD (VSAD) is an unsupervised, end-to-end visual anomaly detection framework tailored to objects imaged from multiple fixed viewpoints. VSAD addresses the core challenge of separating genuine anomalies from variations introduced by viewpoint changes by explicitly enforcing geometric consistency across views, fusing multi-view feature representations through homography-guided alignment within a latent diffusion process, and refining global feature consistency prior to anomaly scoring. The architecture is structured around three tightly coupled modules: the Multi-View Alignment Module (MVAM), the View-Align Latent Diffusion Model (VALDM), and the Fusion Refiner Module (FRM). Together they establish viewpoint-invariant representations for robust multi-level anomaly localization and scoring (Chen et al., 24 Nov 2025).

1. Architectural Components

VSAD integrates three crucial modules for multi-view anomaly detection:

  • Multi-View Alignment Module (MVAM) projects latent feature patches from each view into neighboring views using pre-computed homographies $H_{i\to j}$, aggregating information within a local $R \times R$ window via attention-weighted sums. This mechanism enforces patch-level geometric alignment, ensuring each feature attends to its spatially corresponding regions across all views.
  • View-Align Latent Diffusion Model (VALDM) augments a standard latent diffusion pipeline—consisting of a VAE encoder and DDIM-style U-Net denoiser—by inserting MVAM at every decoder layer. This enables progressive, coarse-to-fine multi-view alignment throughout the denoising process, yielding a holistic, semantically consistent feature representation.
  • Fusion Refiner Module (FRM) immediately follows MVAM in each decoder layer. FRM employs a lightweight convolutional network and a Squeeze-and-Excitation block to model global consistency and suppress residual feature noise, producing globally refined and discriminative multi-view features.

At inference, refined features extracted by DDIM inversion are compared on a patch-wise basis to a multi-level memory bank of normal prototypes, yielding pixel-, view-, and sample-level anomaly scores.
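
The composition of these modules can be summarized in a minimal structural sketch. The `UNetBlock`, `MVAM`, and `FRM` classes below are hypothetical placeholders for the authors' modules; only the composition order (denoising block, then alignment, then refinement, at every decoder layer) follows the description above.

```python
import torch.nn as nn

class VSADDecoderLayer(nn.Module):
    """Illustrative composition of one VALDM decoder layer: the standard U-Net
    block is followed by homography-guided alignment (MVAM) and global
    refinement (FRM). Module internals are placeholders, not the real code."""

    def __init__(self, unet_block: nn.Module, mvam: nn.Module, frm: nn.Module):
        super().__init__()
        self.unet_block = unet_block  # standard DDIM-style U-Net decoder block
        self.mvam = mvam              # multi-view alignment via pre-computed homographies
        self.frm = frm                # lightweight conv + Squeeze-and-Excitation refiner

    def forward(self, z, homographies, t):
        # z: (M, C, h, w) latents for all M views at diffusion step t
        z = self.unet_block(z, t)                # per-view denoising features
        z_aligned = self.mvam(z, homographies)   # patch-level cross-view alignment
        return self.frm(z_aligned)               # globally refined multi-view features
```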

2. Mathematical Framework

The core of VSAD lies in its explicit homography-guided feature alignment, multi-stage latent diffusion processing, feature refinement, and prototype-based anomaly scoring:

2.1 Homography-Guided Feature Alignment

Given two calibrated views $i$ and $j$, the homography $H_{i\to j} \in \mathbb{R}^{3 \times 3}$ maps the homogeneous pixel coordinate $p_i = [u_i, v_i, 1]^\top$ in view $i$ to $p_j$ in view $j$. The mapped location $[u_j, v_j]^\top$ is computed as:

$$p_j \propto H_{i\to j}\, p_i \implies [u_j, v_j]^\top = \frac{1}{(H_{i\to j}\, p_i)_3} \begin{bmatrix} (H_{i\to j}\, p_i)_1 \\ (H_{i\to j}\, p_i)_2 \end{bmatrix}$$

Within a local $R \times R$ window around $p_j$ in feature map $X_j$, candidate patches $p_j^k$ are offset from the projected $p_i$, embedded via a 2D positional encoding $\gamma(\Delta p_j^k)$, and transformed to queries, keys, and values via linear projections $(W_q, W_k, W_v)$. Attention weights $\alpha_{j,k}$ aggregate across view-window pairs, yielding the aligned feature for $p_i$.
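
As a concrete illustration of this mechanism, the sketch below projects feature-map coordinates from view $i$ into view $j$ with a homography, bilinearly samples an $R \times R$ candidate window around each projected location, and aggregates it with offset-encoded attention. It handles a single view pair for brevity (the full MVAM aggregates over all neighboring views), and the learned-linear positional encoding, bilinear sampling, and tensor shapes are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def project_coords(coords, H):
    """Map pixel coordinates from view i to view j via a 3x3 homography H.

    coords: (N, 2) [u_i, v_i] locations; returns (N, 2) projected [u_j, v_j]."""
    ones = torch.ones(coords.shape[0], 1, device=coords.device)
    p_i = torch.cat([coords, ones], dim=1)        # homogeneous coordinates
    p_j = p_i @ H.T                               # each row is H @ p_i
    return p_j[:, :2] / p_j[:, 2:3]               # perspective divide by the third component

class HomographyWindowAttention(nn.Module):
    """Sketch of MVAM's patch aggregation for one view pair (i -> j): each location
    in view i attends over an R x R window around its projected location in view j."""

    def __init__(self, dim, R=3):
        super().__init__()
        self.R = R
        self.w_q, self.w_k, self.w_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.pos = nn.Linear(2, dim)              # learned 2D offset embedding, standing in for gamma(.)

    def forward(self, x_i, x_j, H_ij):
        # x_i, x_j: (C, h, w) feature maps of views i and j; H_ij: (3, 3) homography
        C, h, w = x_i.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords_i = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()     # (h*w, 2)
        centers_j = project_coords(coords_i, H_ij)                          # projected centres in view j

        r = self.R // 2                                                     # R x R offsets around each centre
        dy, dx = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing="ij")
        offsets = torch.stack([dx, dy], dim=-1).reshape(-1, 2).float()      # (R*R, 2)
        cand = centers_j[:, None, :] + offsets[None, :, :]                  # (h*w, R*R, 2)

        # Bilinearly sample candidate features from view j (grid_sample expects coords in [-1, 1]).
        grid = cand.clone()
        grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
        grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
        sampled = F.grid_sample(x_j[None], grid[None], align_corners=True)  # (1, C, h*w, R*R)
        feats_j = sampled[0].permute(1, 2, 0)                               # (h*w, R*R, C)

        q = self.w_q(x_i.reshape(C, -1).T)[:, None, :]                      # (h*w, 1, C) queries
        k = self.w_k(feats_j + self.pos(offsets)[None])                     # keys carry the offset encoding
        v = self.w_v(feats_j)
        attn = torch.softmax((q * k).sum(-1) / C ** 0.5, dim=-1)            # (h*w, R*R) attention weights
        aligned = (attn[..., None] * v).sum(dim=1)                          # attention-weighted sum
        return aligned.T.reshape(C, h, w)                                   # aligned feature map for view i
```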

2.2 Latent Diffusion and Progressive Alignment

The VAE encoder $\mathcal{E}$ generates initial latent codes $z_0 \in \mathbb{R}^{M \times C \times h \times w}$ across all $M$ views. The noising process:

$$z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

is followed by a U-Net decoder which, at each decoder layer $l$, applies:

$$\tilde{z}_t^{(l)} = \mathrm{MVAM}^{(l)}\big(\mathrm{UNetBlock}^{(l)}(z_t^{(l)})\big), \quad z_t^{(l+1)} \leftarrow \mathrm{FRM}^{(l)}\big(\tilde{z}_t^{(l)}\big)$$

and is trained with the usual $L_2$ denoising loss.
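
A short sketch of this objective: one training step samples a timestep, applies the closed-form noising equation above, and minimizes the squared error of the predicted noise. Here `alpha_bar` is assumed to be a precomputed tensor of cumulative noise-schedule products, and `denoiser` is a placeholder for the MVAM/FRM-augmented U-Net.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(z0, denoiser, alpha_bar, num_steps=50):
    """One denoising step on multi-view latents z0: (M, C, h, w).

    alpha_bar: (num_steps,) cumulative products of the noise schedule.
    denoiser:  the VALDM U-Net whose decoder layers interleave MVAM and FRM.
    """
    t = torch.randint(0, num_steps, (1,)).item()       # sample a diffusion step
    eps = torch.randn_like(z0)                          # Gaussian noise
    a = alpha_bar[t]
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps          # z_t = sqrt(a_bar) z0 + sqrt(1 - a_bar) eps

    eps_pred = denoiser(z_t, t)                         # predicted noise, all views jointly
    return F.mse_loss(eps_pred, eps)                    # L2 denoising loss L_d
```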

2.3 Feature Refinement and Consistency

FRM computes refined features:

$$r_m^{(l)} = f(\tilde{Z}_m^{(l)}) \odot \mathcal{A}\big(f(\tilde{Z}_m^{(l)})\big), \qquad F_m^{(l)} = \tilde{Z}_m^{(l)} + r_m^{(l)}$$

where $f$ denotes the lightweight convolutional network and $\mathcal{A}$ the Squeeze-and-Excitation attention. A cross-view refinement loss enforces coherence:

$$\mathcal{L}_r = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \big\| F_i^{(l)} - F_j^{(l)} \big\|_2^2$$

yielding a total loss $\mathcal{L}_{\rm total} = \mathcal{L}_d + \lambda \mathcal{L}_r$.
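
A minimal sketch of the refinement step and the cross-view loss follows, assuming an illustrative two-convolution $f(\cdot)$ and a standard Squeeze-and-Excitation gate for $\mathcal{A}(\cdot)$; channel sizes and layer counts are not taken from the paper.

```python
import torch
import torch.nn as nn

class FRM(nn.Module):
    """Sketch of the Fusion Refiner Module: a lightweight conv stack gated by a
    Squeeze-and-Excitation block, added back to the input as a residual."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.f = nn.Sequential(                      # lightweight convolutional network f(.)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.se = nn.Sequential(                     # Squeeze-and-Excitation attention A(.)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, z_tilde):
        # r = f(z) * A(f(z));  F = z + r   (applied per view)
        fz = self.f(z_tilde)
        r = fz * self.se(fz)
        return z_tilde + r

def cross_view_refinement_loss(features_per_layer, pairs):
    """L_r: squared L2 difference of refined features over the view pairs P,
    averaged over pairs and decoder layers. features_per_layer: list of (M, C, h, w)."""
    loss = 0.0
    for F_l in features_per_layer:
        layer_loss = sum(((F_l[i] - F_l[j]) ** 2).sum() for i, j in pairs) / len(pairs)
        loss = loss + layer_loss
    return loss / len(features_per_layer)
```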

2.4 Memory Bank and Anomaly Scoring

Refined feature prototypes from normal samples are collected in memory banks $\mathcal{M}^{(l)}$. At test time, a pixel-wise anomaly score is computed as:

$$S_{\rm pixel}(u, v) = \sum_{l=1}^{L} w_l \min_{m \in \mathcal{M}^{(l)}} \big\| F_q^{(l)}(u, v) - m \big\|_2$$

with view-level ($S_{\rm view}$) and sample-level ($S_{\rm sample}$) scores as the maxima across pixels and views, respectively.
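
The scoring rule can be sketched as follows. `anomaly_scores` is a hypothetical helper, and the bilinear upsampling of per-layer distance maps to a common resolution is an assumption made so that layer-wise scores can be summed.

```python
import torch
import torch.nn.functional as F

def anomaly_scores(query_feats, memory_banks, layer_weights, out_size):
    """Multi-level scoring sketch: per layer, take the distance of every patch
    feature to its nearest normal prototype, upsample, and combine with w_l.

    query_feats:  list of (M, C_l, h_l, w_l) refined features from DDIM inversion.
    memory_banks: list of (N_l, C_l) prototype tensors built from normal samples.
    """
    M = query_feats[0].shape[0]
    pixel = torch.zeros(M, *out_size)
    for feats, bank, w_l in zip(query_feats, memory_banks, layer_weights):
        _, C, h, w = feats.shape
        patches = feats.permute(0, 2, 3, 1).reshape(-1, C)          # (M*h*w, C_l) patch features
        d = torch.cdist(patches, bank).min(dim=1).values            # nearest-prototype distance
        d = d.reshape(M, 1, h, w)
        d = F.interpolate(d, size=out_size, mode="bilinear", align_corners=False)
        pixel = pixel + w_l * d.squeeze(1)                          # accumulate S_pixel

    view = pixel.flatten(1).max(dim=1).values                       # S_view: max over pixels
    sample = view.max()                                             # S_sample: max over views
    return pixel, view, sample
```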

3. Training Protocols and Implementation Details

VSAD is validated on the RealIAD and MANTA datasets:

| Dataset | Categories | Images | Views per Object |
|---|---|---|---|
| RealIAD | 30 | 151,050 | 5 |
| MANTA | 38 | 137,338 | 5 |
  • All images are resized to $256 \times 256$.
  • VAE encoder output: $M \times 4 \times 64 \times 64$.
  • Hyperparameters: patch search radius $R = 3$ (empirically optimal), memory banks from decoder layers 3 and 4, $T = 50$ DDIM steps, AdamW optimizer (lr $10^{-4}$, weight decay $10^{-2}$), batch size 16 (each sample contains all 5 views), trained for 80 epochs on 4 NVIDIA A6000 GPUs.

Ablation studies confirm that $R = 3$, memory banks drawn from decoder layers 3 and 4, and the inclusion of both MVAM and FRM are all necessary for best performance.
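
For concreteness, the reported optimizer and schedule settings translate roughly to the configuration below; the model object is a placeholder, and the refinement-loss weight $\lambda$ is not specified here, so it is omitted.

```python
import torch
import torch.nn as nn

# Reported VSAD training configuration; `model` is a stand-in for the VALDM network.
model = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

IMAGE_SIZE = 256            # all images resized to 256 x 256
LATENT_SHAPE = (4, 64, 64)  # per-view VAE latent
NUM_VIEWS = 5               # fixed viewpoints per object
BATCH_SIZE = 16             # each sample carries all 5 views
NUM_EPOCHS = 80
DDIM_STEPS = 50             # T
PATCH_RADIUS = 3            # R x R search window in MVAM
MEMORY_LAYERS = (3, 4)      # decoder layers feeding the memory banks
```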

4. Experimental Evaluation

VSAD achieves the following on RealIAD and MANTA:

| Metric | RealIAD (%) | MANTA (%) |
|---|---|---|
| P-AUROC | 98.34 | 96.81 |
| V-AUROC | 91.71 | 93.94 |
| S-AUROC | 94.84 | 94.52 |
| Absolute gain | 1.16–1.31 | 1.08–1.27 |

Ablations removing MVAM yield severe performance drops (e.g., RealIAD S-AUROC –8.32%, MANTA S-AUROC –10.30%), demonstrating the indispensability of explicit geometric alignment. FRM removal results in additional, but smaller, drops (RealIAD P-AUROC –0.99%; MANTA –0.98%), highlighting its role in suppressing residual noise.

Qualitative anomaly maps demonstrate VSAD’s tight localization and low false positive rates compared to PatchCore and CKAAD, particularly on challenging textures and significant viewpoint shifts. t-SNE visualizations reveal progression from scattered to well-separated multi-view feature manifolds after successive alignment and refinement.

5. Robustness, Limitations, and Future Prospects

VSAD’s explicit homography-driven multi-stage alignment imparts robustness to challenging textures and large viewpoint variations, closing the gap between traditional single-view pipelines and multi-view human-like inspection. The described framework, however, is constrained by several factors:

  • Dependence on pre-calibrated homographies restricts applicability to unstructured capture setups. Future work could target learnable alignment transforms, such as CNN-based homography estimation or deformable fields.
  • MVAM’s assumption of local planarity limits effectiveness on non-rigid or highly curved geometries; volumetric or neural radiance field-based alignment mechanisms may extend capability for such cases.
  • Integration of temporal consistency or 3D priors (e.g., point clouds) is suggested as a further step to improve robustness under object pose variation and real-world acquisition pipelines.

These limitations point to a central research direction: end-to-end learning of geometric alignment, together with adaptation to complex, non-planar surface anomaly detection.

6. Significance and Context Within Visual Anomaly Detection

VSAD is characterized as the first unsupervised framework to integrate homography-guided, multi-stage feature alignment into a diffusion-based backbone, combined with a lightweight SE-style global refiner, to achieve state-of-the-art multi-view anomaly detection and localization. Extensive analysis demonstrates that enforcing geometric consistency and progressive feature fusion across views is essential for robust performance in industrial visual inspection scenarios marked by viewpoint diversity and texture complexity. The release of open-source code and detailed ablation studies positions VSAD as a critical reference point for future research on geometric consistency in visual inspection applications (Chen et al., 24 Nov 2025).
