
Vision-Attention Anomaly Scoring (VAAS)

Updated 24 December 2025
  • Vision-Attention Anomaly Scoring (VAAS) is a modular, attention-centric framework for detecting image manipulations and hallucinations in forensic and vision-language applications.
  • It fuses global long-range attention from Vision Transformers with patch-level self-consistency from SegFormer to generate continuous, interpretable anomaly scores.
  • The approach extends to multimodal scenarios, offering visual diagnostics and hybrid scoring to enhance the reliability and transparency of manipulation detection.

Vision-Attention Anomaly Scoring (VAAS) describes a modular, attention-centric paradigm for anomaly detection in vision and vision-language models. As introduced in "VAAS: Vision-Attention Anomaly Scoring for Image Manipulation Detection in Digital Forensics" (Bamigbade et al., 17 Dec 2025), VAAS fuses global long-range attention statistics extracted from Vision Transformers (ViT) with patch-level consistency evaluations derived from semantic segmentation encoders (SegFormer), producing continuous, interpretable scores that quantify manipulations of digital image data. The paradigm has also been extended to multimodal scenarios, notably object hallucination detection in Large Vision-Language Models (LVLMs), by monitoring self-attention flows over generated tokens, as formalized in the Prelim Attention Score (PAS) (Hoang-Xuan et al., 14 Nov 2025). The underlying principle is that anomalies typically manifest as atypical, quantifiable deviations in how attention is distributed or allocated, whether spatially (image patches) or over tokens (text and image tokens in LVLMs).

1. Architectural Foundation and Dual-Module Framework

VAAS operationalizes anomaly scoring via a dual-module pipeline:

  • Full-Image Consistency Module (Fx): Employs a ViT-Base-Patch16-224 architecture pretrained on ImageNet-21k. The module ingests standardized $224 \times 224$ RGB images, processes them through the Transformer layers, and derives attention maps from the final four encoder layers. The aggregated attention distributions reflect the global context of the scene and are upsampled to the original resolution. Fx quantifies how far the image's global attention map diverges from a reference distribution computed on authentic training samples.
  • Patch-Level Self-Consistency Module (Px): Utilizes SegFormer-B1, pretrained on ADE20K. The input image is partitioned into non-overlapping $32 \times 32$ patches, and each patch $P_i$ is mapped to a 256-dimensional feature embedding $F(P_i)$. Anomaly scoring is based on contextual consistency, i.e., the cosine similarity between $F(P_i)$ and its $N$ spatial neighbors; patches that fall out of alignment with their local context receive amplified anomaly scores. A minimal loading sketch for both backbones follows this list.
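
The following sketch illustrates how the two raw signals (ViT attention maps for Fx and SegFormer feature maps for Px) could be extracted with HuggingFace Transformers. The checkpoint IDs and the CLS-to-patch aggregation over the last four layers are assumptions made here for concreteness, not the paper's exact procedure.

```python
import torch
from transformers import AutoImageProcessor, ViTModel, SegformerModel

# Fx backbone: ViT-Base/16 pretrained on ImageNet-21k (checkpoint ID assumed).
vit_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Px backbone: SegFormer-B1 trained on ADE20K (checkpoint ID assumed).
seg_processor = AutoImageProcessor.from_pretrained("nvidia/segformer-b1-finetuned-ade-512-512")
segformer = SegformerModel.from_pretrained("nvidia/segformer-b1-finetuned-ade-512-512")

@torch.no_grad()
def extract_signals(image):
    """Return (aggregated ViT attention map, SegFormer spatial feature map)."""
    # Fx: average CLS-to-patch attention over the final four encoder layers.
    vit_inputs = vit_processor(images=image, return_tensors="pt")
    vit_out = vit(**vit_inputs, output_attentions=True)
    last4 = torch.stack(vit_out.attentions[-4:])        # (4, B, heads, 197, 197)
    cls_to_patches = last4[..., 0, 1:]                  # CLS query attending to the 196 patches
    attn_map = cls_to_patches.mean(dim=(0, 2)).reshape(-1, 14, 14)  # (B, 14, 14)

    # Px: coarse spatial grid of feature vectors (one per cell of the last stage).
    seg_inputs = seg_processor(images=image, return_tensors="pt")
    seg_out = segformer(**seg_inputs)
    patch_feats = seg_out.last_hidden_state             # (B, C, H', W')
    return attn_map, patch_feats
```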

Outputs from Fx and Px consist of:

  • $S_F$: a global anomaly score for the image (Fx)
  • $S_P(i)$: local scores per patch, and $S_P$: the average patch anomaly (Px)

Hybrid fusion is performed via a Hybrid Scoring Mechanism (HSM): a weighted linear combination $S_H = \alpha S_F + (1 - \alpha) S_P$, or a harmonic-mean variant, producing the final interpretable anomaly score $S_{\mathrm{VAAS}}$ (Bamigbade et al., 17 Dec 2025).

2. Mathematical Definitions and Score Formulations

Fx Global Anomaly Score

For an image attention map $A$ of shape $L \times H \times W$:

$$S_F = \frac{|\mu(A) - \mu_{\mathrm{ref}}|}{\sigma_{\mathrm{ref}}}$$

where $\mu(A)$ is the mean attention mass for the test image, and $(\mu_{\mathrm{ref}}, \sigma_{\mathrm{ref}})$ are the mean and standard deviation computed over authentic reference images.
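
A direct transcription of this score, assuming $\mu(A)$ is a simple mean over the attention volume:

```python
import numpy as np

def fx_global_score(attn_map: np.ndarray, mu_ref: float, sigma_ref: float) -> float:
    """S_F: deviation of the test image's mean attention mass from reference statistics."""
    return float(abs(attn_map.mean() - mu_ref) / sigma_ref)

# Reference statistics are estimated once from authentic samples, e.g.:
#   mus = [m.mean() for m in authentic_attention_maps]
#   mu_ref, sigma_ref = float(np.mean(mus)), float(np.std(mus))
```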

Px Patch-Level Score

Each patch $P_i$ has a neighborhood $\mathcal{N}(i)$. Compute:

$$\mathrm{Sim}(P_i, P_j) = \frac{F(P_i) \cdot F(P_j)}{\|F(P_i)\|\,\|F(P_j)\|}$$

$$S_P(i) = 1 - \frac{1}{N}\sum_{j \in \mathcal{N}(i)} \mathrm{Sim}(P_i, P_j)$$

$$S_P = \frac{1}{M}\sum_{i=1}^{M} S_P(i)$$

where $N = |\mathcal{N}(i)|$ and $M$ is the total number of patches.
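
A minimal sketch of the patch-level score, assuming an 8-connected spatial neighborhood (the paper's exact choice of $\mathcal{N}(i)$ may differ):

```python
import torch
import torch.nn.functional as F

def px_patch_scores(feats: torch.Tensor, radius: int = 1):
    """Per-patch anomaly S_P(i) = 1 - mean cosine similarity to spatial neighbours,
    plus the image-level average S_P. `feats` has shape (H, W, D): one embedding per patch."""
    H, W, _ = feats.shape
    scores = torch.zeros(H, W)
    for i in range(H):
        for j in range(W):
            sims = []
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if (di, dj) != (0, 0) and 0 <= ni < H and 0 <= nj < W:
                        sims.append(F.cosine_similarity(feats[i, j], feats[ni, nj], dim=0))
            scores[i, j] = 1.0 - torch.stack(sims).mean()
    return scores, scores.mean().item()
```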

Fusion Scoring (HSM)

Weighted: $S_H = \alpha S_F + (1 - \alpha) S_P$, with $\alpha \in [0, 1]$.

Harmonic: $S_H^{\mathrm{harmonic}} = \frac{2}{1/S_F + 1/S_P}$
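
Both fusion variants are straightforward to implement; the default $\alpha = 0.6$ below reflects the ablation value reported later, and the epsilon guard is an implementation detail added here:

```python
def hybrid_score(s_f: float, s_p: float, alpha: float = 0.6, mode: str = "weighted") -> float:
    """Hybrid Scoring Mechanism: fuse the global (Fx) and patch-level (Px) anomaly scores."""
    if mode == "weighted":
        return alpha * s_f + (1.0 - alpha) * s_p
    if mode == "harmonic":
        eps = 1e-8  # guard against division by zero; not part of the original formulation
        return 2.0 / (1.0 / (s_f + eps) + 1.0 / (s_p + eps))
    raise ValueError(f"unknown fusion mode: {mode}")
```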

Training Objective

Px is supervised by a composite segmentation loss:

$$\mathcal{L}_{\mathrm{seg}} = \lambda_{\mathrm{bce}}\,\mathcal{L}_{\mathrm{BCE}} + \lambda_{\mathrm{dice}}\,\mathcal{L}_{\mathrm{Dice}} + \lambda_{\mathrm{focal}}\,\mathcal{L}_{\mathrm{Focal}}$$

with $(\lambda_{\mathrm{bce}}, \lambda_{\mathrm{dice}}, \lambda_{\mathrm{focal}}) = (1.0, 0.7, 1.0)$.

Attention alignment between Px and Fx is regularized via

$$\mathcal{L}_{fx} = 1 - \cos(F_{Px}, F_{Fx})$$

and the composite objective is

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{seg}} + \omega_{fx}\,\mathcal{L}_{fx}$$

where $\omega_{fx} = 0.1$.
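
A hedged sketch of this composite objective in PyTorch, using standard formulations of the Dice and Focal losses (their exact variants and the focusing parameter $\gamma$ are not specified in this summary and are assumptions):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities (standard formulation, assumed)."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + target.sum() + eps)

def focal_loss(logits, target, gamma=2.0):
    """Binary focal loss; gamma=2.0 is an assumed value."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pt = torch.exp(-ce)
    return ((1.0 - pt) ** gamma * ce).mean()

def vaas_loss(logits, target, f_px, f_fx, weights=(1.0, 0.7, 1.0), omega_fx=0.1):
    """L_total = L_seg (weighted BCE + Dice + Focal) + omega_fx * (1 - cos(F_Px, F_Fx))."""
    l_bce = F.binary_cross_entropy_with_logits(logits, target)
    l_seg = (weights[0] * l_bce
             + weights[1] * dice_loss(logits, target)
             + weights[2] * focal_loss(logits, target))
    l_fx = 1.0 - F.cosine_similarity(f_px.flatten(), f_fx.flatten(), dim=0)
    return l_seg + omega_fx * l_fx
```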

3. Attention-Based Anomaly Scoring in Vision-Language Models

PAS extends VAAS principles to LVLMs to detect object hallucinations ("PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models" (Hoang-Xuan et al., 14 Nov 2025)). PAS quantifies a generated token's over-dependence on preliminary (previously decoded) tokens using self-attention weights:

For the $k$-th generated token $y_k$:

$$s_{\mathrm{prel}}(y_k, \mathbf{y}, \mathbf{x}) = \frac{1}{H}\sum_{h=1}^{H}\sum_{j=m+1}^{k-1} \mathbf{A}^{(\ell,h)}(k, j)$$

where $\mathbf{A}^{(\ell,h)}$ are the self-attention weights of head $h$ in layer $\ell$ (with $\ell = 0$ preferred), $H$ is the number of attention heads, and $m$ indexes the last prompt token, so the inner sum runs over the previously generated (prelim) tokens.

The PAS detector flags $y_k$ as hallucinated if $s_{\mathrm{prel}}(y_k, \mathbf{y}, \mathbf{x}) \geq \tau$, with empirically optimal $\tau \in [0.2, 0.25]$. PAS achieves state-of-the-art AUROC performance, surpassing NLL, entropy-based, and global-local similarity baselines (Hoang-Xuan et al., 14 Nov 2025).
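
A minimal sketch of the score and detector, assuming 0-based token indexing and that `m` marks the position of the last prompt token:

```python
import torch

def prelim_attention_score(attn: torch.Tensor, k: int, m: int) -> float:
    """s_prel for token k: head-averaged attention mass on prelim tokens m+1 .. k-1.
    `attn` is one layer's self-attention tensor of shape (heads, seq_len, seq_len),
    typically taken from layer 0; `m` is the index of the last prompt token."""
    return attn[:, k, m + 1:k].sum(dim=-1).mean().item()

def is_hallucinated(attn: torch.Tensor, k: int, m: int, tau: float = 0.2) -> bool:
    """Flag token k as hallucinated when its prelim attention mass reaches the threshold tau."""
    return prelim_attention_score(attn, k, m) >= tau
```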

4. Evaluation Protocols and Empirical Results

VAAS was rigorously benchmarked on CASIA v2.0 (12,614 images) and DF2023 (100k images sampled from the full million-image corpus) (Bamigbade et al., 17 Dec 2025). Fx reference statistics are computed from validation splits of authentic samples.

Metrics:

| Metric | Functionality |
| --- | --- |
| Precision | Manipulation detection |
| Recall | Manipulation detection |
| F1-Score | Binary detection (presence) |
| IoU | Mask localization |
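
For concreteness, a pixel-level computation of these metrics on binary manipulation masks might look as follows; whether the reported precision, recall, and F1 are measured at image level or pixel level is not stated here, so this granularity is an assumption:

```python
import numpy as np

def mask_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """Pixel-level precision, recall, F1, and IoU for binary manipulation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```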

Major results:

| Dataset | Precision | Recall | F1 | IoU |
| --- | --- | --- | --- | --- |
| CASIA v2.0 | 93.5% | 94.8% | 94.1% | 89.0% |
| DF2023 | 95.9% | 94.2% | 94.9% | 91.1% |

Ablations revealed optimal trade-offs at $\alpha \approx 0.6$ (CASIA: $\alpha \in [0.4, 0.6]$; DF2023: $\alpha \approx 0.7$ for generative manipulations) and regularization weight $\omega_{fx} = 0.1$. ViT-Base exhibited the best efficiency-accuracy balance.

PAS was evaluated on the MSCOCO and Pascal VOC object benchmarks using three LVLMs (LLaVA-7B, MiniGPT-4-7B, Shikra-7B) and attained an average AUROC of 85.0%, exceeding prior approaches (SVAR: 80.3%).

5. Visual Interpretability and Qualitative Diagnostics

VAAS generates interpretable heatmaps overlaying input images, revealing both the spatial footprint and intensity of manipulation. Outputs include:

  • Binary Px mask (local manipulation regions)
  • Px anomaly heatmap overlay (fine-grained)
  • Fx attention overlay (global context deviation)
  • Hybrid anomaly map (fused diagnostic)

These visualizations facilitate transparent forensic assessments, bridging raw detection accuracy with human-understandable evidence. High-anomaly cases demonstrate crisp mask boundaries and localized attention mass; mid-level anomalies show diffused edges but maintain semantically relevant global cues (Bamigbade et al., 17 Dec 2025).
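
One way such heatmap overlays can be rendered is sketched below; the colormap, blending, and nearest-neighbour upsampling are choices made here, not specified by the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_anomaly_map(image: np.ndarray, anomaly: np.ndarray, out_path: str = "overlay.png"):
    """Overlay a coarse (H', W') anomaly map onto an (H, W, 3) RGB image as a heatmap."""
    h, w = image.shape[:2]
    # Upsample the coarse anomaly grid to image resolution (nearest-neighbour for simplicity).
    ys = (np.arange(h) * anomaly.shape[0] / h).astype(int)
    xs = (np.arange(w) * anomaly.shape[1] / w).astype(int)
    heat = anomaly[np.ix_(ys, xs)]
    plt.figure(figsize=(6, 6))
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.45)  # translucent anomaly heatmap
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```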

PAS similarly visualizes anomalous attention flows in LVLMs, delineating object token generations that excessively depend on model-internal context rather than image tokens. Early attention layers signal raw grounding, with anomalous distributions corresponding tightly to hallucination events (Hoang-Xuan et al., 14 Nov 2025).

6. Implementation, Reproducibility, and Limitations

VAAS is implemented using PyTorch and HuggingFace Transformers, with open-source code provided for full reproducibility (Bamigbade et al., 17 Dec 2025). Key resources:

  • Environment .yml (dependency specifications)
  • Data download/preprocessing scripts (CASIA, DF2023)
  • Training/inference scripts for Px (with Fx guidance)
  • Jupyter notebooks for baseline and ablation analysis

PAS requires no extra forward passes and operates on precomputed attention tensors in LVLMs, affording training-free and reference-free deployment with minimal memory overhead (attention tensor: 18 GB VRAM for 7B models) (Hoang-Xuan et al., 14 Nov 2025).

Limitations include dependence on global self-attention structures (future sparse or local-window architectures may impact interpretability), a focus on object-existence hallucinations (relational and attribute anomalies require task-specific extensions), and reliance on lexicon-based object token identification.

7. Extensions and Future Directions

A plausible implication is that VAAS can be generalized across multimodal tasks by cataloguing multi-channel attention metrics (over image, prelim, instruction, and special tokens), conducting per-head and per-layer diagnostics, and dynamically calibrating fusion weights per task or domain. Mutual-information (MI) based approximations, such as conditional mutual information, offer additional distributional anomaly signals, and hybrid PAS/MI approaches could boost robustness.

The toolkit paradigm encapsulated by VAAS is extensible for real-time fine-grained monitoring in image integrity, hallucination detection, and other anomaly-prone multimodal reasoning settings. The core insight—"anomalous attention patterns reveal hallucinations"—suggests that modular attention diagnostics, fused across spatial, semantic, and linguistic modalities, may underpin future trustworthy, explainable detection systems (Bamigbade et al., 17 Dec 2025, Hoang-Xuan et al., 14 Nov 2025).
