ViewSense-AD (VSAD)
- ViewSense-AD (VSAD) is an unsupervised framework that leverages homography-guided alignment and latent diffusion to detect anomalies across multiple fixed viewpoints.
- It integrates specialized modules—MVAM, VALDM, and FRM—to fuse, refine, and enforce global consistency in multi-view feature representations.
- Experimental evaluations on the RealIAD and MANTA datasets demonstrate significant gains in anomaly localization and scoring over baseline methods.
ViewSense-AD (VSAD) is an unsupervised, end-to-end visual anomaly detection framework tailored to objects imaged from multiple fixed viewpoints. VSAD addresses the core challenge of separating genuine anomalies from variations introduced by viewpoint changes by explicitly enforcing geometric consistency across views, fusing multi-view feature representations through homography-guided alignment within a latent diffusion process, and refining global feature consistency prior to anomaly scoring. The architecture is structured around three tightly coupled modules: the Multi-View Alignment Module (MVAM), the View-Align Latent Diffusion Model (VALDM), and the Fusion Refiner Module (FRM), which together establish viewpoint-invariant representations for robust multi-level anomaly localization and scoring (Chen et al., 24 Nov 2025).
1. Architectural Components
VSAD integrates three crucial modules for multi-view anomaly detection:
- Multi-View Alignment Module (MVAM) projects latent feature patches from each view into neighboring views using pre-computed homographies $H_{m\to n}$, aggregating information within a local window via attention-weighted sums. This mechanism enforces patch-level geometric alignment, ensuring each feature attends to its spatially corresponding regions across all views.
- View-Align Latent Diffusion Model (VALDM) augments a standard latent diffusion pipeline—consisting of a VAE encoder and DDIM-style U-Net denoiser—by inserting MVAM at every decoder layer. This enables progressive, coarse-to-fine multi-view alignment throughout the denoising process, yielding a holistic, semantically consistent feature representation.
- Fusion Refiner Module (FRM) immediately follows MVAM in each decoder layer. FRM employs a lightweight convolutional network and a Squeeze-and-Excitation block to model global consistency and suppress residual feature noise, producing globally refined and discriminative multi-view features.
At inference, refined features extracted by DDIM inversion are compared on a patch-wise basis to a multi-level memory bank of normal prototypes, yielding pixel-, view-, and sample-level anomaly scores.
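Putting the three modules together, each VALDM decoder layer computes features, aligns them across views, and refines them. The following is a minimal PyTorch sketch of this composition; `VSADDecoderLayer` and its interfaces are illustrative assumptions rather than the released architecture:

```python
import torch
import torch.nn as nn

class VSADDecoderLayer(nn.Module):
    """Hypothetical composition of one VALDM decoder layer:
    U-Net feature computation -> MVAM alignment -> FRM refinement."""

    def __init__(self, channels: int, mvam, frm):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
        )
        self.mvam = mvam  # cross-view alignment (Section 2.1)
        self.frm = frm    # residual global refinement (Section 2.3)

    def forward(self, feats):
        # feats: list of per-view latent maps, each (B, C, H, W)
        feats = [self.conv(z) for z in feats]
        aligned = self.mvam(feats)             # geometrically aligned views
        return [self.frm(z) for z in aligned]  # globally refined features

# Toy usage with identity stand-ins for the two modules:
layer = VSADDecoderLayer(8, mvam=lambda fs: fs, frm=nn.Identity())
out = layer([torch.randn(1, 8, 16, 16) for _ in range(5)])
```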
2. Mathematical Framework
The core of VSAD lies in its explicit homography-guided feature alignment, multi-stage latent diffusion processing, feature refinement, and prototype-based anomaly scoring:
2.1 Homography-Guided Feature Alignment
Given two calibrated views $m$ and $n$, the homography $H_{m\to n} \in \mathbb{R}^{3 \times 3}$ maps the homogeneous pixel coordinate $p_m = (u_m, v_m, 1)^\top$ in view $m$ to its counterpart $p_n$ in view $n$. The mapped location is computed as:

$$\tilde{p}_n = H_{m\to n}\, p_m, \qquad p_n = \tilde{p}_n / \tilde{p}_{n,3}$$
Within a local window of radius $R$ around the projected location $p_n$ in feature map $Z_n$, candidate patches are taken at offsets $\delta \in [-R, R]^2$ from $p_n$, embedded via a 2D positional encoding $\mathrm{PE}(\delta)$, and transformed to queries, keys, and values via linear projections $W_Q, W_K, W_V$. Attention weights $\alpha_{n,\delta} = \mathrm{softmax}_{(n,\delta)}\big(q_m^\top k_{n,\delta} / \sqrt{d}\big)$ aggregate across view-window pairs, yielding the aligned feature $\hat{Z}_m(p_m) = \sum_{n,\delta} \alpha_{n,\delta}\, v_{n,\delta}$ for view $m$.
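Below is a minimal PyTorch sketch of this homography-guided window attention for a single query patch and one neighboring view. It omits the positional encoding and the learned $W_Q, W_K, W_V$ projections for brevity, and all function names and shapes are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def project_points(H: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """Map pixel coordinates through a 3x3 homography.
    pts: (N, 2) in (u, v); returns (N, 2) projected coordinates."""
    ones = torch.ones(pts.shape[0], 1)
    homog = torch.cat([pts, ones], dim=1)      # lift to homogeneous coords
    mapped = homog @ H.T                       # (N, 3)
    return mapped[:, :2] / mapped[:, 2:3]      # perspective divide

def align_patch(q_feat, nbr_feat, H, center, radius=3):
    """Single-patch alignment sketch: attend over a (2R+1)^2 window in the
    neighboring view, centered on the homography projection of `center`.
    q_feat: (C,) query patch feature; nbr_feat: (C, Hh, Ww) neighbor map."""
    C, Hh, Ww = nbr_feat.shape
    p = project_points(H, center.view(1, 2)).round().long().squeeze(0)
    u = p[0].clamp(radius, Ww - 1 - radius)    # keep the window in bounds
    v = p[1].clamp(radius, Hh - 1 - radius)
    window = nbr_feat[:, v - radius : v + radius + 1,
                         u - radius : u + radius + 1]
    keys = window.reshape(C, -1).T                     # (K, C), K=(2R+1)^2
    attn = F.softmax(keys @ q_feat / C ** 0.5, dim=0)  # scaled dot-product
    return attn @ keys                                 # aligned feature, (C,)

# Toy usage: identity homography, so the window centers on `center` itself.
aligned = align_patch(torch.randn(16), torch.randn(16, 32, 32),
                      torch.eye(3), center=torch.tensor([10.0, 12.0]))
```

In the full module, the same aggregation runs jointly over windows from all neighboring views, so each patch pools evidence from every viewpoint in a single softmax.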
2.2 Latent Diffusion and Progressive Alignment
The VAE encoder $\mathcal{E}$ generates initial latent codes $z_m^{(0)} = \mathcal{E}(x_m)$ across all views. The noising process

$$z_m^{(t)} = \sqrt{\bar{\alpha}_t}\, z_m^{(0)} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

is followed by a U-Net decoder which, at each decoder layer $l$, applies MVAM to the per-view features:

$$\tilde{Z}_m^{(l)} = \mathrm{MVAM}\big(Z_m^{(l)},\, \{Z_n^{(l)}\}_{n \neq m}\big)$$

and is trained with the usual denoising loss $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\epsilon}\big[\|\epsilon - \epsilon_\theta(z^{(t)}, t)\|_2^2\big]$.
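As a concrete reference, here is a sketch of the forward noising step and the $\epsilon$-prediction objective under the standard DDPM parameterization that DDIM-style pipelines build on; the linear $\beta$ schedule and all names are assumptions for illustration:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

def noise_latents(z0, t, alpha_bar):
    """Forward noising: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z0)
    return abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps, eps

def denoising_loss(eps_model, z0, t, alpha_bar):
    """Epsilon-prediction objective for the U-Net denoiser."""
    zt, eps = noise_latents(z0, t, alpha_bar)
    return torch.mean((eps_model(zt, t) - eps) ** 2)

# Toy check with a dummy epsilon-predictor:
z0 = torch.randn(2, 4, 8, 8)
t = torch.randint(0, T, (2,))
loss = denoising_loss(lambda z, t: torch.zeros_like(z), z0, t, alpha_bar)
```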
2.3 Feature Refinement and Consistency
FRM computes refined features:

$$r_m^{(l)} = f\big(\tilde{Z}_m^{(l)}\big) \odot \mathcal{A}\big(f(\tilde{Z}_m^{(l)})\big), \qquad F_m^{(l)} = \tilde{Z}_m^{(l)} + r_m^{(l)}$$

where $f(\cdot)$ is the lightweight convolutional network and $\mathcal{A}(\cdot)$ the Squeeze-and-Excitation gating. A cross-view refinement loss enforces coherence between refined features warped into a common view,

$$\mathcal{L}_{\mathrm{ref}} = \sum_{l} \sum_{m \neq n} \big\| F_m^{(l)} - \mathcal{W}_{n \to m}\big(F_n^{(l)}\big) \big\|_2^2,$$

yielding a total loss $\mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda\, \mathcal{L}_{\mathrm{ref}}$, where $\mathcal{W}_{n \to m}$ denotes homography warping and $\lambda$ a weighting coefficient.
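A compact sketch of such a refiner follows, assuming standard choices (3x3 convolutions, SiLU activations, reduction ratio 4) that are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class FRM(nn.Module):
    """Sketch of the Fusion Refiner Module: a lightweight conv stem f(.)
    gated by a Squeeze-and-Excitation branch A(.), added back residually."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.se = nn.Sequential(                     # A(.): channel gates
            nn.AdaptiveAvgPool2d(1),                 # squeeze
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                            # excitation in [0, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.f(z)
        r = h * self.se(h)    # r = f(z) (.) A(f(z))
        return z + r          # F = z + r

# Toy usage on a single-view feature map:
y = FRM(8)(torch.randn(1, 8, 16, 16))
```

The residual form means FRM only needs to learn a correction on top of the aligned features, which keeps the module lightweight and training stable.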
2.4 Memory Bank and Anomaly Scoring
Refined feature prototypes from normal samples are collected in memory banks $\mathcal{M}^{(l)}$, one per selected decoder layer. At test time, a pixel-wise anomaly score is computed as the distance to the nearest normal prototype:

$$s_{\mathrm{pix}}(p) = \min_{\mu \in \mathcal{M}^{(l)}} \big\| F^{(l)}(p) - \mu \big\|_2$$

with view-level ($s_{\mathrm{view}}$) and sample-level ($s_{\mathrm{samp}}$) scores taken as the maxima across pixels and views, respectively.
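A sketch of this scoring stage, assuming Euclidean nearest-prototype distances in the PatchCore style; the actual metric and multi-level aggregation in VSAD may differ:

```python
import torch

def pixel_scores(feats: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
    """Distance to the nearest normal prototype, per pixel and view.
    feats: (V, C, H, W) refined per-view features; bank: (N, C) prototypes."""
    V, C, H, W = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(-1, C)  # (V*H*W, C)
    d = torch.cdist(flat, bank)                      # pairwise L2 distances
    return d.min(dim=1).values.reshape(V, H, W)      # nearest-prototype score

def aggregate(s: torch.Tensor):
    """View-level and sample-level scores as maxima over pixels / views."""
    s_view = s.amax(dim=(1, 2))    # (V,) per-view scores
    s_sample = s_view.max()        # scalar sample score
    return s_view, s_sample

# Toy usage: 5 views, 16-dim features, 1000 normal prototypes.
scores = pixel_scores(torch.randn(5, 16, 28, 28), torch.randn(1000, 16))
s_view, s_sample = aggregate(scores)
```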
3. Training Protocols and Implementation Details
VSAD is validated on the RealIAD and MANTA datasets:
| Dataset | Categories | Images | Views per Object |
|---|---|---|---|
| RealIAD | 30 | 151,050 | 5 |
| MANTA | 38 | 137,338 | 5 |
- All images are resized to a fixed input resolution before encoding.
- The VAE encoder maps each view to a lower-resolution latent feature map.
- Hyperparameters: patch search radius $R = 3$ (empirically optimal), memory banks built from decoder layers 3 and 4, a fixed number of DDIM steps, AdamW optimizer, batch size 16 (each sample contains all 5 views), trained for 80 epochs on 4 NVIDIA A6000 GPUs.
Ablations confirm the necessity of $R = 3$, memory banks drawn from layers 3 and 4, and the inclusion of both MVAM and FRM for best performance.
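A minimal training-loop skeleton consistent with this protocol is shown below; the learning rate and weight decay are not reproduced in this summary, so the values in the sketch are placeholders, and `model.total_loss` is an assumed interface:

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs: int = 80, batch_size: int = 16):
    """Skeleton only: each batch element holds all 5 views of one sample."""
    opt = torch.optim.AdamW(model.parameters(),
                            lr=1e-4, weight_decay=1e-2)  # placeholder values
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for epoch in range(epochs):
        for views in loader:                # views: (B, 5, C, H, W)
            loss = model.total_loss(views)  # L_diff + lambda * L_ref
            opt.zero_grad()
            loss.backward()
            opt.step()
```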
4. Experimental Evaluation
VSAD achieves the following on RealIAD and MANTA:
| Metric | RealIAD (%) | MANTA (%) |
|---|---|---|
| P-AUROC | 98.34 | 96.81 |
| V-AUROC | 91.71 | 93.94 |
| S-AUROC | 94.84 | 94.52 |
| Absolute gain over baselines | 1.16–1.31 | 1.08–1.27 |
Ablations removing MVAM yield severe performance drops (e.g., RealIAD S-AUROC –8.32%, MANTA S-AUROC –10.30%), demonstrating the indispensability of explicit geometric alignment. FRM removal results in additional, but smaller, drops (RealIAD P-AUROC –0.99%; MANTA –0.98%), highlighting its role in suppressing residual noise.
Qualitative anomaly maps demonstrate VSAD’s tight localization and low false positive rates compared to PatchCore and CKAAD, particularly on challenging textures and significant viewpoint shifts. t-SNE visualizations reveal progression from scattered to well-separated multi-view feature manifolds after successive alignment and refinement.
5. Robustness, Limitations, and Future Prospects
VSAD’s explicit homography-driven multi-stage alignment imparts robustness to challenging textures and large viewpoint variations, closing the gap between traditional single-view pipelines and multi-view human-like inspection. The described framework, however, is constrained by several factors:
- Dependence on pre-calibrated homographies restricts applicability to unstructured capture setups. Future work could target learnable alignment transforms, such as CNN-based homography estimation or deformable fields.
- MVAM’s assumption of local planarity limits effectiveness on non-rigid or highly curved geometries; volumetric or neural radiance field-based alignment mechanisms may extend capability for such cases.
- Integration of temporal consistency or 3D priors (e.g., point clouds) is suggested as a further step to improve robustness under object pose variation and real-world acquisition pipelines.
This suggests a central research direction: end-to-end learning of geometric alignment, together with adaptation to complex, non-planar surface anomaly detection.
6. Significance and Context Within Visual Anomaly Detection
VSAD is characterized as the first unsupervised framework to integrate homography-guided, multi-stage feature alignment into a diffusion-based backbone, combined with a lightweight SE-style global refiner, to achieve state-of-the-art multi-view anomaly detection and localization. Extensive analysis demonstrates that enforcing geometric consistency and progressive feature fusion across views is essential for robust performance in industrial visual inspection scenarios marked by viewpoint diversity and texture complexity. The release of open-source code and detailed ablation studies positions VSAD as a critical reference point for future research on geometric consistency in visual inspection applications (Chen et al., 24 Nov 2025).