ViewSense-AD (VSAD)

Updated 1 December 2025

ViewSense-AD (VSAD) is an unsupervised framework that leverages homography-guided alignment and latent diffusion to detect anomalies across multiple fixed viewpoints.
It integrates specialized modules—MVAM, VALDM, and FRM—to fuse, refine, and enforce global consistency in multi-view feature representations.
Experimental evaluations on RealIAD and MANTA datasets demonstrate significant gains in anomaly localization and scoring compared to baseline methods.

ViewSense-AD (VSAD) is an unsupervised end-to-end visual anomaly detection framework tailored for objects imaged from multiple fixed viewpoints. VSAD addresses the core challenge of separating genuine anomalies from variations introduced by viewpoint changes by explicitly enforcing geometric consistency across views, fusing multi-view feature representations through homography-guided alignment within a latent diffusion process, and refining global feature consistency prior to anomaly scoring. The architecture is structured around three tightly-coupled modules: the Multi-View Alignment Module (MVAM), the View-Align Latent Diffusion Model (VALDM), and the Fusion Refiner Module (FRM), establishing robust, viewpoint-invariant representations for robust multi-level anomaly localization and scoring (Chen et al., 24 Nov 2025).

1. Architectural Components

VSAD integrates three crucial modules for multi-view anomaly detection:

Multi-View Alignment Module (MVAM) projects latent feature patches from each view into neighboring views using pre-computed homographies $H_{i\to j}$ , aggregating information within a local $R \times R$ window via attention-weighted sums. This mechanism enforces patch-level geometric alignment, ensuring each feature attends to its spatially corresponding regions across all views.
View-Align Latent Diffusion Model (VALDM) augments a standard latent diffusion pipeline—consisting of a VAE encoder and DDIM-style U-Net denoiser—by inserting MVAM at every decoder layer. This enables progressive, coarse-to-fine multi-view alignment throughout the denoising process, yielding a holistic, semantically consistent feature representation.
Fusion Refiner Module (FRM) immediately follows MVAM in each decoder layer. FRM employs a lightweight convolutional network and a Squeeze-and-Excitation block to model global consistency and suppress residual feature noise, producing globally refined and discriminative multi-view features.

At inference, refined features extracted by DDIM inversion are compared on a patch-wise basis to a multi-level memory bank of normal prototypes, yielding pixel-, view-, and sample-level anomaly scores.

2. Mathematical Framework

The core of VSAD lies in its explicit homography-guided feature alignment, multi-stage latent diffusion processing, feature refinement, and prototype-based anomaly scoring:

2.1 Homography-Guided Feature Alignment

Given two calibrated views $i$ and $j$ , the homography $H_{i\to j} \in \mathbb{R}^{3 \times 3}$ maps the homogeneous pixel coordinate $p_i = [u_i, v_i, 1]^\top$ in $i$ to $p_j$ in $j$ . The mapped location $[u_j, v_j]^\top$ is computed as:

$p_j \propto H_{i\to j}\,p_i \implies [u_j, v_j]^\top = \frac{1}{(H_{i\to j}\,p_i)_3} \begin{bmatrix} (H_{i\to j}\,p_i)_1 \ (H_{i\to j}\,p_i)_2 \end{bmatrix}$

Within a local $R \times R$ window around $p_j$ in feature map $X_j$ , candidate patches $p_j^k$ are offset from the projected $p_i$ , embedded via a 2D positional encoding $\gamma(\Delta p_j^k)$ , and transformed to queries, keys, and values via linear projections $(W_q, W_k, W_v)$ . Attention weights $\alpha_{j, k}$ aggregate across view-window pairs, yielding the aligned feature for $p_i$ .

2.2 Latent Diffusion and Progressive Alignment

The VAE encoder $\mathcal{E}$ generates initial latent codes $z_0 \in \mathbb{R}^{M\times C \times h \times w}$ across all $M$ views. The noising process:

$z_t = \sqrt{\bar\alpha_t} z_0 + \sqrt{1-\bar\alpha_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

is followed by a U-Net decoder which, at each decoder layer $l$ , applies:

$\tilde{z}_t^{(l)} = \mathrm{MVAM}^{(l)}(\mathrm{U\!NetBlock}^{(l)}(z_t^{(l)})), \quad z_{t}^{(l+1)} \leftarrow \mathrm{FRM}^{(l)}(\tilde{z}_t^{(l)})$

and is trained with the usual $L_2$ denoising loss.

FRM computes refined features: \begin{align*} r_m^{(l)} &= f(\tilde{Z}_m^{(l)}) \odot \mathcal{A}(f(\tilde{Z}_m^{(l)})) \ F_m^{(l)} &= \tilde{Z}_m^{(l)} + r_m^{(l)} \end{align*} A cross-view refinement loss enforces coherence:

$\mathcal{L}_r = \frac{1}{L} \sum_{l=1}^L \frac{1}{|\mathcal{P}|} \sum_{(i, j) \in \mathcal{P}} \| F_i^{(l)} - F_j^{(l)} \|_2^2$

yielding a total loss $\mathcal{L}_{\rm total} = \mathcal{L}_d + \lambda \mathcal{L}_r$ .

2.4 Memory Bank and Anomaly Scoring

Refined feature prototypes from normal samples are collected in memory banks $\mathcal{M}^{(l)}$ . At test time, a pixel-wise anomaly score is computed as:

$S_{\rm pixel}(u, v) = \sum_{l=1}^L w_l \min_{m \in \mathcal{M}^{(l)}} \| F_q^{(l)}(u, v) - m \|_2$

with view-level ( $S_{\rm view}$ ) and sample-level ( $S_{\rm sample}$ ) scores as the maxima across pixels and views, respectively.

3. Training Protocols and Implementation Details

VSAD is validated on the RealIAD and MANTA datasets:

Dataset	Categories	Images	Views per Object
RealIAD	30	151,050	5
MANTA	38	137,338	5

All images are resized to $256\times 256$ .
VAE encoder output: $M \times 4 \times 64 \times 64$ .
Hyperparameters: patch search radius $R=3$ (empirically optimal), memory banks from decoder layers 3 and 4, $T=50$ DDIM steps, AdamW optimizer (lr= $10^{-4}$ , weight decay= $10^{-2}$ ), batch size 16 (samples contain all 5 views), trained for 80 epochs on 4 $\times$ NVIDIA A6000 GPUs.

Ablation confirms the necessity of R=3, memory from layers 3+4, and the inclusion of both MVAM and FRM for best performance.

4. Experimental Evaluation

VSAD achieves the following on RealIAD and MANTA:

Metric	RealIAD (%)	MANTA (%)
P-AUROC	98.34	96.81
V-AUROC	91.71	93.94
S-AUROC	94.84	94.52
Absolute Gain	1.16–1.31	1.08–1.27

Ablations removing MVAM yield severe performance drops (e.g., RealIAD S-AUROC –8.32%, MANTA S-AUROC –10.30%), demonstrating the indispensability of explicit geometric alignment. FRM removal results in additional, but smaller, drops (RealIAD P-AUROC –0.99%; MANTA –0.98%), highlighting its role in suppressing residual noise.

Qualitative anomaly maps demonstrate VSAD’s tight localization and low false positive rates compared to PatchCore and CKAAD, particularly on challenging textures and significant viewpoint shifts. t-SNE visualizations reveal progression from scattered to well-separated multi-view feature manifolds after successive alignment and refinement.

5. Robustness, Limitations, and Future Prospects

VSAD’s explicit homography-driven multi-stage alignment imparts robustness to challenging textures and large viewpoint variations, closing the gap between traditional single-view pipelines and multi-view human-like inspection. The described framework, however, is constrained by several factors:

Dependence on pre-calibrated homographies restricts applicability to unstructured capture setups. Future work could target learnable alignment transforms, such as CNN-based homography estimation or deformable fields.
MVAM’s assumption of local planarity limits effectiveness on non-rigid or highly curved geometries; volumetric or neural radiance field-based alignment mechanisms may extend capability for such cases.
Integration of temporal consistency or 3D priors (e.g., point clouds) is suggested as a further step to improve robustness under object pose variation and real-world acquisition pipelines.

This suggests a central research direction towards end-to-end geometric alignment learning, and adaptation for complex, non-planar surface anomaly detection.

6. Significance and Context Within Visual Anomaly Detection

VSAD is characterized as the first unsupervised framework to integrate homography-guided, multi-stage feature alignment into a diffusion-based backbone, combined with a lightweight SE-style global refiner, to achieve state-of-the-art multi-view anomaly detection and localization. Extensive analysis demonstrates that enforcing geometric consistency and progressive feature fusion across views is essential for robust performance in industrial visual inspection scenarios marked by viewpoint diversity and texture complexity. The release of open-source code and detailed ablation studies positions VSAD as a critical reference point for future research on geometric consistency in visual inspection applications (Chen et al., 24 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

Unsupervised Multi-View Visual Anomaly Detection via Progressive Homography-Guided Alignment (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to ViewSense-AD (VSAD).

ViewSense-AD (VSAD)

1. Architectural Components

2. Mathematical Framework

2.1 Homography-Guided Feature Alignment

2.2 Latent Diffusion and Progressive Alignment

2.3 Feature Refinement and Consistency

2.4 Memory Bank and Anomaly Scoring

3. Training Protocols and Implementation Details

4. Experimental Evaluation

5. Robustness, Limitations, and Future Prospects

6. Significance and Context Within Visual Anomaly Detection

Whiteboard

Follow Topic

Continue Learning

ViewSense-AD (VSAD)

1. Architectural Components

2. Mathematical Framework

2.1 Homography-Guided Feature Alignment

2.2 Latent Diffusion and Progressive Alignment

2.3 Feature Refinement and Consistency

2.4 Memory Bank and Anomaly Scoring

3. Training Protocols and Implementation Details

4. Experimental Evaluation

5. Robustness, Limitations, and Future Prospects

6. Significance and Context Within Visual Anomaly Detection

Sponsor

Whiteboard

Follow Topic

Continue Learning

Related Topics