Vision-Attention Anomaly Scoring (VAAS)
- Vision-Attention Anomaly Scoring (VAAS) is a modular, attention-centric framework for detecting image manipulations and hallucinations in forensic and vision-language applications.
- It fuses global long-range attention from Vision Transformers with patch-level self-consistency from SegFormer to generate continuous, interpretable anomaly scores.
- The approach extends to multimodal scenarios, offering visual diagnostics and hybrid scoring to enhance the reliability and transparency of manipulation detection.
Vision-Attention Anomaly Scoring (VAAS) describes a modular, attention-centric paradigm for anomaly detection in vision and vision-language models. As introduced in "VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics" (Bamigbade et al., 17 Dec 2025), VAAS fuses global long-range attention statistics extracted from Vision Transformers (ViT) with patch-level consistency evaluations derived from semantic segmentation encoders (SegFormer), producing continuous and interpretable scores that quantify manipulations of digital image data. The paradigm has also been extended to multimodal scenarios, notably object hallucination detection in Large Vision-Language Models (LVLMs), by monitoring self-attention flows over generated tokens, as formalized in the Prelim Attention Score (PAS) (Hoang-Xuan et al., 14 Nov 2025). The underlying principle is that anomalies typically manifest as atypical, quantifiable deviations in the distribution or allocation of attention, whether spatial (image patches) or token-based (text and image tokens in LVLMs).
1. Architectural Foundation and Dual-Module Framework
VAAS operationalizes anomaly scoring via a dual-module pipeline:
- Full-Image Consistency Module (Fx): Employs a ViT-Base-Patch16-224 architecture, pretrained on ImageNet-21k. The module ingests standardized RGB images, processes them through the Transformer layers, and derives attention maps from the final four encoder layers. The aggregated attention distributions reflect the global context of the scene and are upsampled to the original resolution. Fx quantifies how much the image's global attention map diverges from a reference distribution computed on authentic training samples.
- Patch-Level Self-Consistency Module (Px): Utilizes SegFormer-B1, pretrained on ADE20K. The input image is partitioned into non-overlapping patches, and each patch $i$ is mapped to a $256$-dimensional feature embedding $e_i$. Anomaly scoring is based on the contextual consistency, measured as cosine similarity, between $e_i$ and the embeddings of its spatial neighbors. Patches that are inconsistent with their local context receive amplified local anomaly scores.
Outputs from Fx and Px consist of:
- $S_F$: a global anomaly score for the image (Fx)
- $s_i$: local scores per patch, and $\bar{S}_P$: the average patch anomaly (Px)
Hybrid fusion is performed via a Hybrid Scoring Mechanism (HSM): a weighted linear combination of $S_F$ and $\bar{S}_P$, or a harmonic mean variant, producing the final interpretable anomaly score (Bamigbade et al., 17 Dec 2025).
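The following is a minimal sketch of how the Fx attention maps could be extracted with the HuggingFace `ViTModel` interface. The checkpoint and the use of the final four encoder layers follow the description above, while the CLS-to-patch selection, head averaging, and bilinear upsampling are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F
from transformers import ViTImageProcessor, ViTModel

# Assumed checkpoint; the paper specifies ViT-Base-Patch16-224 pretrained on ImageNet-21k.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

@torch.no_grad()
def fx_attention_map(image):
    """Aggregate CLS-to-patch attention from the final four encoder layers (sketch)."""
    inputs = processor(images=image, return_tensors="pt")
    out = vit(**inputs, output_attentions=True)
    # out.attentions: tuple of 12 tensors, each (batch, heads, 197, 197)
    last_four = torch.stack(out.attentions[-4:])       # (4, B, heads, 197, 197)
    cls_to_patch = last_four[..., 0, 1:]                # attention from CLS to the 196 patches
    attn = cls_to_patch.mean(dim=(0, 2))                # average over layers and heads -> (B, 196)
    attn = attn.reshape(-1, 1, 14, 14)                  # 14x14 patch grid for 224/16
    attn = F.interpolate(attn, size=(224, 224), mode="bilinear", align_corners=False)
    return attn.squeeze(1)                              # (B, 224, 224) upsampled attention map
```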
2. Mathematical Definitions and Score Formulations
Fx Global Anomaly Score
For an image attention map $A \in \mathbb{R}^{H \times W}$, the global score is the normalized deviation of the test image's attention mass from reference statistics:
$$S_F = \frac{\lvert \mu_{\text{test}} - \mu_{\text{ref}} \rvert}{\sigma_{\text{ref}}},$$
where $\mu_{\text{test}}$ is the mean attention mass for the test image, and $\mu_{\text{ref}}$ and $\sigma_{\text{ref}}$ are the mean and standard deviation over authentic reference images.
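Read as a normalized deviation, the score can be computed from precollected reference statistics as in the sketch below, where `ref_means` (an assumed name) holds the per-image mean attention masses of the authentic validation split.

```python
import numpy as np

def fx_global_score(test_attention_map, ref_means):
    """Deviation of the test image's mean attention mass from authentic reference statistics."""
    mu_test = float(np.mean(test_attention_map))
    mu_ref, sigma_ref = float(np.mean(ref_means)), float(np.std(ref_means))
    return abs(mu_test - mu_ref) / (sigma_ref + 1e-8)  # epsilon guards against zero variance
```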
Px Patch-Level Score
Each patch $i$ has a spatial neighborhood $\mathcal{N}(i)$. Compute
$$s_i = 1 - \frac{1}{\lvert \mathcal{N}(i) \rvert} \sum_{j \in \mathcal{N}(i)} \cos\big(e_i, e_j\big), \qquad \bar{S}_P = \frac{1}{N} \sum_{i=1}^{N} s_i,$$
where $e_i$ is the embedding of patch $i$ and $N$ is the number of patches.
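A sketch of this patch-level scoring is shown below; the 4-neighborhood and the `1 - cosine` form are assumptions consistent with the description, not necessarily the exact released code.

```python
import torch
import torch.nn.functional as F

def px_patch_scores(emb):
    """emb: (H, W, D) grid of patch embeddings; returns (H, W) anomaly scores and their mean."""
    H, W, _ = emb.shape
    e = F.normalize(emb, dim=-1)
    scores = torch.zeros(H, W)
    for i in range(H):
        for j in range(W):
            nbrs = [e[i + di, j + dj]
                    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                    if 0 <= i + di < H and 0 <= j + dj < W]
            sim = torch.stack(nbrs) @ e[i, j]        # cosine similarity to each neighbor
            scores[i, j] = 1.0 - sim.mean()          # low consistency -> high anomaly
    return scores, scores.mean()
```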
Fusion Scoring (HSM)
Weighted: $S_{\text{VAAS}} = \alpha\, S_F + (1 - \alpha)\, \bar{S}_P$. Harmonic: $S_{\text{VAAS}} = \dfrac{2\, S_F\, \bar{S}_P}{S_F + \bar{S}_P}$.
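Both fusion variants translate directly into code; `alpha` is a free parameter tuned per dataset (the tuned values are not reproduced here).

```python
def hsm_fuse(s_f, s_p_mean, alpha=0.5, mode="weighted"):
    """Hybrid Scoring Mechanism: fuse the global (Fx) and mean patch (Px) anomaly scores."""
    if mode == "weighted":
        return alpha * s_f + (1.0 - alpha) * s_p_mean
    # harmonic-mean variant; small epsilon avoids division by zero
    return 2.0 * s_f * s_p_mean / (s_f + s_p_mean + 1e-8)
```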
Training Objective
Px is supervised by a composite segmentation loss $\mathcal{L}_{\text{seg}}$ with fixed component weights.
Attention alignment between Px and Fx is regularized through an auxiliary alignment term $\mathcal{L}_{\text{align}}$, yielding the composite objective
$$\mathcal{L} = \mathcal{L}_{\text{seg}} + \lambda\, \mathcal{L}_{\text{align}},$$
where $\lambda$ is the regularization weight examined in the ablations below.
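A hedged sketch of this objective is given below, assuming a cross-entropy segmentation term and a mean-squared alignment term between the Px anomaly map and a downsampled Fx attention map; the actual component losses and weights may differ.

```python
import torch
import torch.nn.functional as F

def vaas_loss(px_logits, gt_mask, px_anomaly_map, fx_attention_map, lam=0.1):
    """Composite objective sketch: segmentation supervision plus Fx-guided attention alignment.
    `lam` is the regularization weight (placeholder value, not the paper's)."""
    seg_loss = F.cross_entropy(px_logits, gt_mask)                    # assumed segmentation term
    fx_small = F.interpolate(fx_attention_map.unsqueeze(1),
                             size=px_anomaly_map.shape[-2:],
                             mode="bilinear", align_corners=False).squeeze(1)
    align_loss = F.mse_loss(px_anomaly_map, fx_small)                 # assumed alignment term
    return seg_loss + lam * align_loss
```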
3. Attention-Based Anomaly Scoring in Vision-Language Models
PAS extends VAAS principles to LVLMs to detect object hallucinations ("PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models" (Hoang-Xuan et al., 14 Nov 2025)). PAS quantifies over-dependence on preliminary decoded tokens ("prelim" tokens) using attention weights:
For the $t$'th generated token $y_t$:
$$\mathrm{PAS}(y_t) = \sum_{j \in \mathcal{P}_t} \alpha^{(\ell)}_{t,j},$$
where $\alpha^{(\ell)}_{t,j}$ are the self-attention weights in layer $\ell$ (early layers preferred) and $\mathcal{P}_t$ indexes the preliminary decoded tokens preceding $y_t$.
The PAS detector flags $y_t$ as hallucinated if $\mathrm{PAS}(y_t)$ exceeds a threshold $\tau$, with $\tau$ chosen empirically. PAS achieves state-of-the-art AUROC performance, surpassing NLL, entropy-based, and global-local similarity baselines (Hoang-Xuan et al., 14 Nov 2025).
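A sketch of PAS computation from a precomputed attention tensor follows; the layer choice, head averaging, and the construction of the prelim-token index set are assumptions consistent with the description above.

```python
import torch

def prelim_attention_score(attn_layer, t, prelim_idx):
    """attn_layer: (heads, seq, seq) self-attention weights of one (early) layer.
    t: position of the generated token; prelim_idx: positions of previously decoded output tokens."""
    row = attn_layer[:, t, :]                # attention from token t to all positions, per head
    pas = row[:, prelim_idx].sum(dim=-1)     # mass allocated to prelim tokens, per head
    return pas.mean().item()                 # average over heads (assumption)

def flag_hallucination(pas_value, tau):
    """Flag the token as hallucinated when it over-relies on prelim tokens (tau is tuned)."""
    return pas_value > tau
```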
4. Evaluation Protocols and Empirical Results
VAAS was rigorously benchmarked on CASIA v2.0 (12,614 images) and DF2023 (100k images sampled from the full million-image corpus) (Bamigbade et al., 17 Dec 2025). Fx reference statistics are computed from validation splits of authentic samples.
Metrics:
| Metric | Purpose |
|---|---|
| Precision | Manipulation detection |
| Recall | Manipulation detection |
| F1-Score | Binary detection (presence) |
| IoU | Mask localization |
Major results:
| Dataset | Precision | Recall | F1 | IoU |
|---|---|---|---|---|
| CASIA v2.0 | 93.5% | 94.8% | 94.1% | 89.0% |
| DF2023 | 95.9% | 94.2% | 94.9% | 91.1% |
Ablations revealed dataset-dependent optimal fusion weights $\alpha$ (with distinct settings for CASIA and for DF2023's generative manipulations) and a tuned regularization weight $\lambda$. ViT-Base exhibited the best efficiency-accuracy balance.
PAS was evaluated on MSCOCO and Pascal VOC object benchmarks using three LVLMs (LLaVA-7B, MiniGPT-4-7B, Shikra-7B) and attained average AUROC 85.0%, exceeding prior approaches (SVAR: 80.3%).
5. Visual Interpretability and Qualitative Diagnostics
VAAS generates interpretable heatmaps overlaying input images, revealing both the spatial footprint and intensity of manipulation. Outputs include:
- Binary Px mask (local manipulation regions)
- Px anomaly heatmap overlay (fine-grained)
- Fx attention overlay (global context deviation)
- Hybrid anomaly map (fused diagnostic)
These visualizations facilitate transparent forensic assessments, bridging raw detection accuracy with human-understandable evidence. High-anomaly cases demonstrate crisp mask boundaries and localized attention mass; mid-level anomalies show diffused edges but maintain semantically relevant global cues (Bamigbade et al., 17 Dec 2025).
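Such overlays can be reproduced with standard plotting utilities; the sketch below (not the released notebook code) blends a normalized anomaly map over the input image.

```python
import matplotlib.pyplot as plt
import numpy as np

def overlay_anomaly_map(image, anomaly_map, alpha=0.45, out_path="overlay.png"):
    """Blend a normalized anomaly map over the RGB image as a semi-transparent heatmap."""
    amap = np.asarray(anomaly_map, dtype=float)
    amap = (amap - amap.min()) / (amap.ptp() + 1e-8)
    plt.figure(figsize=(5, 5))
    plt.imshow(np.asarray(image))
    plt.imshow(amap, cmap="jet", alpha=alpha)   # heatmap overlay
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```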
PAS similarly visualizes anomalous attention flows in LVLMs, delineating object token generations that excessively depend on model-internal context rather than image tokens. Early attention layers signal raw grounding, with anomalous distributions corresponding tightly to hallucination events (Hoang-Xuan et al., 14 Nov 2025).
6. Implementation, Reproducibility, and Limitations
VAAS is implemented using PyTorch and HuggingFace Transformers, with open-source code provided for full reproducibility (Bamigbade et al., 17 Dec 2025). Key resources:
- Environment .yml (dependency specifications)
- Data download/preprocessing scripts (CASIA, DF2023)
- Training/inference scripts for Px (with Fx guidance)
- Jupyter notebooks for baseline and ablation analysis
PAS requires no extra forward passes and operates on precomputed attention tensors in LVLMs, affording training-free and reference-free deployment with minimal memory overhead (attention tensor: 18 GB VRAM for 7B models) (Hoang-Xuan et al., 14 Nov 2025).
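Because PAS only needs the attention tensors produced during decoding, they can be captured directly from a HuggingFace `generate` call. The sketch below demonstrates the pattern with a small causal LM (GPT-2 as a stand-in for a runnable example, since the paper's LVLMs are heavier); the wrappers used for LLaVA-style models may expose attentions slightly differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model to illustrate attention capture during generation.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The image shows a", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8,
                         output_attentions=True, return_dict_in_generate=True,
                         pad_token_id=tok.eos_token_id)

# out.attentions: one tuple per generated token; each is a tuple over layers of tensors
# shaped (batch, heads, query_len, key_len). With KV caching, query_len is 1 for steps
# after the first, so index 0 selects the newly generated token's attention row.
layer1_attn = out.attentions[-1][1][0]      # last step, layer index 1, batch 0 -> (heads, 1, key_len)
prelim_mass = layer1_attn[:, 0, inputs["input_ids"].shape[1]:].sum(-1).mean()
print(float(prelim_mass))                   # attention mass on previously generated (prelim) tokens
```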
Limitations include dependence on global self-attention structures (future sparse or local-window architectures may affect interpretability), a focus on object-existence hallucinations (requiring task-specific extensions for relational and attribute anomalies), and lexicon-based object token identification.
7. Extensions and Future Directions
A plausible implication is that VAAS can be generalized across multimodal tasks by cataloguing multi-channel attention metrics (over image, prelim, instruction, and special tokens), conducting per-head/layer diagnostics, and dynamically calibrating fusion weights per task or domain. MI-based approximations (conditional mutual information) offer additional distributional anomaly signals, and hybrid PAS/MI approaches could boost robustness.
The toolkit paradigm encapsulated by VAAS is extensible for real-time fine-grained monitoring in image integrity, hallucination detection, and other anomaly-prone multimodal reasoning settings. The core insight—"anomalous attention patterns reveal hallucinations"—suggests that modular attention diagnostics, fused across spatial, semantic, and linguistic modalities, may underpin future trustworthy, explainable detection systems (Bamigbade et al., 17 Dec 2025, Hoang-Xuan et al., 14 Nov 2025).