
X-GRAAD Anomaly Scoring Explained

Updated 27 April 2026
  • X-GRAAD anomaly scoring is an explainable method that uses gradient and attention signals to detect deviations from a normal background.
  • It employs pixel-wise ELBO gradient analysis in VAEs and token-level attention-gradient measures in transformer models to localize anomalies.
  • Experimental results show robust performance with high ROC-AUC for tumor localization and near-zero attack success in NLP backdoor detection.

X-GRAAD (eXplaining Gradient and Attention Anomaly Detection) anomaly scoring is a family of explainable gradient-based and attention-based scoring mechanisms for detecting data or behavior that deviates from an assumed normal background distribution. Originally developed for pixel-wise anomaly localization in generative models of medical images and later extended to explainable backdoor trigger detection in pre-trained language models (PLMs), X-GRAAD approaches exploit the way anomalous or adversarial inputs concentrate gradients and, where relevant, attention weights on a few input components. Model internals thereby become discriminative signals for flagging out-of-distribution phenomena and malicious artifacts (Das et al., 5 Oct 2025, Zimmerer et al., 2019).

1. Conceptual Foundations and Motivation

X-GRAAD anomaly scoring is rooted in the observation that, for both generative and discriminative neural models, anomalies can sharply alter the model's internal sensitivity to individual input components. In variational autoencoders (VAEs), input regions not supported by the training data induce large gradients of the model log-likelihood w.r.t. input pixels. In transformer-based language models carrying a backdoor, the trigger tokens planted by an attacker make attention heads and output logits highly sensitive to the trigger, exhibiting both “attention drift” and “gradient dominance” (Das et al., 5 Oct 2025, Zimmerer et al., 2019).

The core objective is to provide a per-component (pixel or token) anomaly score that accurately localizes abnormal structure and a robust sequence- or image-level score for detection and downstream filtering. A central property of X-GRAAD is explainability: the highest scoring component (pixel or token) is interpretable as the likely anomaly source.

2. Mathematical Formulations

2.1. VAE Gradient-Based Anomaly Scoring (Imaging)

Let $x \in \mathbb{R}^D$ be an input (typically an image) and $z \in \mathbb{R}^L$ a latent code. The VAE maximizes the evidence lower bound:

$$\log p_\theta(x) \geq \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x)\|p(z))$$

The X-GRAAD scoring function is the per-pixel (or per-dimension) norm of the ELBO gradient:

$$S_i(x) = \left| \left[ \nabla_x \mathcal{L}(x; \theta, \phi) \right]_i \right|$$

The gradient is computed by backpropagation through the standard VAE objective and is typically smoothed with “SmoothGrad” (averaging gradients over copies of the input perturbed with Gaussian noise) to suppress artifacts (Zimmerer et al., 2019).
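Because the ELBO is a difference of two terms, its input gradient splits the same way, so the reconstruction and KL contributions to the score can be examined separately. The following is a sketch under assumptions that the text above does not state: a diagonal Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x)))$ and a standard normal prior $p(z)$.

```latex
\nabla_x \mathcal{L}(x;\theta,\phi)
  = \nabla_x \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
  - \nabla_x D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,p(z)\right)

% Closed form of the KL term for the assumed Gaussian encoder and prior
D_{\mathrm{KL}} = \frac{1}{2}\sum_{j=1}^{L}\left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right)

% Its input gradient, by the chain rule through \mu_j(x) and \sigma_j^2(x)
\nabla_x D_{\mathrm{KL}}
  = \sum_{j=1}^{L}\left[\mu_j \nabla_x \mu_j
  + \frac{1}{2}\left(1 - \sigma_j^{-2}\right)\nabla_x \sigma_j^2\right]
```

Under these assumptions the KL part needs no sampling, while the reconstruction part is estimated from sampled latent codes.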

2.2. Attention-Gradient Scoring for Backdoored PLMs

For an input $x = (t_1, \dots, t_n)$ to a transformer with $L$ layers and $H$ heads, let $A_{l_i}^{h_j} \in \mathbb{R}^{n \times n}$ denote the softmax attention matrix for head $h_j$ in layer $l_i$. The attention importance $a_k$ of token $t_k$ is aggregated from these matrices across layers and heads.

Zero-mean normalization over the tokens of the input yields:

$$\tilde{a}_k = a_k - \frac{1}{n} \sum_{m=1}^{n} a_m$$

Gradient importance is defined as the L2 norm of the output logit gradient w.r.t. the token embedding $e_k$, where $f(x)$ denotes the output logit:

$$g_k = \left\| \nabla_{e_k} f(x) \right\|_2$$

The combined per-token anomaly score $S(t_k)$ joins the normalized attention importance $\tilde{a}_k$ and the gradient importance $g_k$; the sentence-level score is the tokenwise maximum:

$$S(x) = \max_{1 \leq k \leq n} S(t_k)$$
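The scoring in this subsection can be sketched in plain Python. Two choices here are assumptions of the sketch and may differ from the formulation in Das et al.: attention importance is taken as the total attention a token receives (summed over layers, heads, and query positions), and the attention and gradient terms are combined by multiplication. All function names are hypothetical, and the attention weights and embedding gradients are supplied as nested lists in place of tensors from a real model.

```python
import math


def attention_importance(attn):
    """Total attention each token receives, summed over layers, heads, queries.

    attn[l][h][q][k] is the softmax weight query token q puts on key token k.
    Summing received attention is an assumption of this sketch.
    """
    n = len(attn[0][0])
    received = [0.0] * n
    for layer in attn:
        for head in layer:
            for row in head:
                for k, weight in enumerate(row):
                    received[k] += weight
    return received


def zero_mean(values):
    """Center values by subtracting their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def gradient_importance(embedding_grads):
    """L2 norm of the output-logit gradient w.r.t. each token embedding."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in embedding_grads]


def token_scores(attn, embedding_grads):
    """Per-token score; combining by product is an assumption of this sketch."""
    att = zero_mean(attention_importance(attn))
    grad = gradient_importance(embedding_grads)
    return [a * g for a, g in zip(att, grad)]


def sentence_score(scores):
    """Sentence-level score (tokenwise maximum) and the index where it occurs."""
    k = max(range(len(scores)), key=scores.__getitem__)
    return scores[k], k
```

For a three-token input in which the last token attracts most of the attention and carries the largest embedding gradient, `sentence_score(token_scores(attn, grads))` returns that token's index, which is the token a defense would then inspect.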

3. Algorithms and Inference-Time Procedures

VAE Anomaly Localization and SmoothGrad

Given learned VAE parameters $(\theta, \phi)$, the following routine yields a pixelwise anomaly map:

  1. Normalize and (optionally) resize each test input $x$.
  2. For $k = 1, \dots, K$ noise samples:
    • Add small Gaussian noise: $\tilde{x}_k = x + \epsilon_k$, $\epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$.
    • Forward: obtain encoder posterior, sample $z_k \sim q_\phi(z|\tilde{x}_k)$.
    • Compute ELBO $\mathcal{L}(\tilde{x}_k; \theta, \phi)$.
    • Backpropagate: $g_k = \nabla_{\tilde{x}_k} \mathcal{L}(\tilde{x}_k; \theta, \phi)$.
  3. Aggregate: $\bar{g} = \frac{1}{K} \sum_{k=1}^{K} g_k$.
  4. Score: $S_i(x) = \left| \bar{g}_i \right|$ (Zimmerer et al., 2019).
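The routine above can be illustrated on a toy problem. The sketch below is not a trained VAE: it assumes a hypothetical four-pixel linear-Gaussian model (weights `W`, `V`, and `ENC_STD` are made up for illustration) and takes gradients by central finite differences instead of backpropagation, so that it runs with the standard library alone. The noise level and sample count are likewise arbitrary choices.

```python
import math
import random

W = [0.5, 0.5, 0.5, 0.5]  # toy encoder weights (hypothetical)
V = [0.5, 0.5, 0.5, 0.5]  # toy decoder weights (hypothetical)
ENC_STD = 0.1             # fixed encoder standard deviation (hypothetical)


def elbo(x, eps):
    """Single-sample ELBO of the toy model, up to additive constants."""
    mu = sum(w * xi for w, xi in zip(W, x))
    z = mu + ENC_STD * eps  # reparameterized latent sample
    recon = -0.5 * sum((xi - v * z) ** 2 for xi, v in zip(x, V))
    kl = 0.5 * (mu ** 2 + ENC_STD ** 2 - 1.0 - math.log(ENC_STD ** 2))
    return recon - kl


def input_gradient(x, eps, h=1e-4):
    """Central finite-difference gradient of the ELBO w.r.t. each pixel."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((elbo(up, eps) - elbo(down, eps)) / (2 * h))
    return grad


def smoothgrad_scores(x, num_samples=20, noise_std=0.05, seed=0):
    """Average ELBO gradients over noisy copies of x, then take magnitudes."""
    rng = random.Random(seed)
    mean_grad = [0.0] * len(x)
    for _ in range(num_samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        eps = rng.gauss(0.0, 1.0)
        g = input_gradient(noisy, eps)
        mean_grad = [m + gi / num_samples for m, gi in zip(mean_grad, g)]
    return [abs(m) for m in mean_grad]
```

In this toy setting the model explains inputs with equal pixel values, so `smoothgrad_scores([1.0, 1.0, 1.0, 4.0])` assigns its largest score to the last pixel. With a real VAE the finite-difference step would be replaced by automatic differentiation.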

Transformer Backdoor Defense Workflow

  1. Precompute sentence-level anomaly scores $S(x)$ over a clean validation set for threshold estimation.
  2. Set detection threshold $\tau$ as the $p$-th percentile of the clean validation scores (with separate values of $p$ for BERT-class models and for ALBERT).
  3. For new input $x$:
    • Compute $S(x)$.
    • If $S(x) \leq \tau$: proceed with standard prediction.
    • Otherwise: locate $k^* = \arg\max_k S(t_k)$; corrupt the flagged token by random character insertion/replacement and re-evaluate prediction.
  4. Output the (possibly sanitized) prediction (Das et al., 5 Oct 2025).
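A minimal sketch of this workflow follows, assuming per-token scores are already available. The function names, the nearest-rank percentile rule, and the stand-in `predict` callable are all hypothetical, and corruption is done by character insertion only, one of the two options named above.

```python
import math
import random
import string


def percentile_threshold(clean_scores, percentile):
    """Nearest-rank percentile of sentence-level scores from clean data."""
    ordered = sorted(clean_scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def corrupt_token(token, rng):
    """Break a suspected trigger by inserting one random lowercase letter."""
    pos = rng.randrange(len(token) + 1)
    return token[:pos] + rng.choice(string.ascii_lowercase) + token[pos:]


def defend(tokens, token_scores, tau, predict, rng=None):
    """Return (prediction, flagged_index); the index is None if nothing is flagged."""
    rng = rng or random.Random(0)
    if max(token_scores) <= tau:
        # Sentence-level score is within the clean range: predict as usual.
        return predict(tokens), None
    # Otherwise corrupt the highest-scoring token and predict again.
    k = max(range(len(tokens)), key=token_scores.__getitem__)
    sanitized = list(tokens)
    sanitized[k] = corrupt_token(tokens[k], rng)
    return predict(sanitized), k
```

The threshold is computed once from clean validation scores, for example `tau = percentile_threshold(clean_scores, 95)` (the value 95 is only a placeholder), and `defend` is then called per input at inference time without any retraining.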

4. Experimental Results and Benchmarks

Imaging (BraTS-2017 Tumor Localization)

Methods compared (pixelwise ROC-AUC):
  • Denoising AE (recon error)
  • VAE recon error
  • Smoothed recon error
  • VAE sampling variance
  • Grad (recon term only)
  • Grad (KL term only)
  • Full ELBO grad (X-GRAAD)

X-GRAAD (full ELBO gradient) matches or outperforms all prior unsupervised methods for tumor localization in MRI (Zimmerer et al., 2019).
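Pixelwise ROC-AUC measures how well the anomaly map ranks tumor pixels above healthy ones. As a sketch, it can be computed as the fraction of (anomalous, normal) pixel pairs that are ordered correctly, with ties counted as half. The function name and the pairwise implementation are illustrative assumptions; an evaluation over full images would use a rank-based implementation for speed.

```python
def roc_auc(scores, labels):
    """Probability that a random anomalous pixel outscores a random normal one.

    labels are 1 for anomalous pixels and 0 for normal ones; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pixel of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, scores `[0.1, 0.4, 0.35, 0.8]` with labels `[0, 0, 1, 1]` order three of the four pairs correctly, giving 0.75.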

Backdoor Detection in PLMs

Evaluated on the SST-2, IMDb, and AG's News datasets against several backdoor attacks (BadNets, RIPPLES, LWS) on BERT, RoBERTa, DistilBERT, and ALBERT:

  • X-GRAAD reduces attack success rate (ASR) from the undefended level to near zero in most settings, below the ASR left by prior methods.
  • Clean accuracy stays close to undefended performance.
  • Ablations: only the combined attention-gradient score achieves near-zero ASR while preserving clean accuracy (Das et al., 5 Oct 2025).

Example (BERT + SST-2 + BadNets):

| Method | ASR | CACC |
|---|---|---|
| ONION | 0.142 | - |
| RAP | 0.002 | - |
| FT | 1.0 | - |
| MEFT | 0.998 | - |
| PURE | 0.292 | - |
| X-GRAAD | 0.0 | 0.923 |
| Undefended | - | 0.931 |
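The two metrics in this comparison can be written down directly. The sketch below uses the usual definitions, which are assumed here and not quoted from the paper: ASR is the fraction of trigger-carrying inputs classified as the attacker's target label, and clean accuracy (CACC) is ordinary accuracy on trigger-free inputs. Function names are hypothetical.

```python
def attack_success_rate(poisoned_predictions, target_label):
    """Fraction of trigger-carrying inputs assigned the attacker's target label."""
    hits = sum(1 for p in poisoned_predictions if p == target_label)
    return hits / len(poisoned_predictions)


def clean_accuracy(clean_predictions, true_labels):
    """Fraction of trigger-free inputs classified correctly."""
    correct = sum(1 for p, y in zip(clean_predictions, true_labels) if p == y)
    return correct / len(true_labels)
```

A defense is effective when it drives ASR toward 0 while keeping clean accuracy near the undefended value.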

Computational cost: X-GRAAD inference-time scoring on SST-2 requires ≈44–50 s and is faster than head-pruning methods; no retraining is required (Das et al., 5 Oct 2025).

5. Interpretability, Explainability, and Visualization

A core property of X-GRAAD scores is their attributional clarity:

  • In VAEs, per-pixel score maps highlight only out-of-distribution regions (e.g., tumor voxels in brain MRI), with intensity proportional to the magnitude of gradient-based anomaly.
  • In transformer models, heatmaps over tokens show only rare trigger tokens (“cf”, “mn”, “tq”) as prominently anomalous: the maximally scoring token is flagged for targeted corruption/remediation.
  • The separation between the attention score, the gradient score, and the combined score on clean vs. poisoned data is visualized via histograms; only the combined score provides strong class separation (Das et al., 5 Oct 2025).
  • The decomposition of the combined score into attention and gradient factors enables fine-grained analysis of whether anomalies stem from syntactic/model-level (attention) or output-sensitivity (gradient) disruptions.

6. Relations to General Anomaly Scoring and Mitigation

X-GRAAD fits within a broader taxonomy of anomaly scoring, which includes statistical (z-score, p-value, meta-rarity), distance-based (Euclidean, Mahalanobis, kNN), density-based (Local Outlier Factor, mass-volume), and reconstruction-based methods (Zohrevand et al., 2019). X-GRAAD is a member of gradient-based and (for text) attention-enhanced reconstruction/density approximation approaches.

Threshold-setting and filtering strategies, such as dynamic/percentile thresholds, ROC/PR curve optimization, and tail modelling (e.g., extreme-value theory), are essential for robust deployment. The percentile threshold in X-GRAAD is empirically tuned by backbone and can be adapted in streaming or changing environments via monitoring of validation-set scores (Zohrevand et al., 2019, Das et al., 5 Oct 2025).
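One way to adapt a percentile threshold in a streaming setting is to recompute it over a sliding window of recent scores. The class below is a hypothetical sketch of that idea, not a procedure from the cited works; the window size and percentile are free parameters, and it assumes that only scores from inputs believed to be clean are fed in, since flagged inputs would otherwise pull the threshold upward.

```python
import math
from collections import deque


class RollingPercentileThreshold:
    """Percentile threshold over the most recent `window` clean scores."""

    def __init__(self, percentile, window):
        self.percentile = percentile
        self.scores = deque(maxlen=window)  # oldest scores drop out automatically

    def update(self, score):
        """Record the score of an input that was judged clean."""
        self.scores.append(score)

    def value(self):
        """Current threshold by the nearest-rank percentile rule."""
        if not self.scores:
            raise ValueError("no scores observed yet")
        ordered = sorted(self.scores)
        rank = max(1, math.ceil(self.percentile * len(ordered) / 100))
        return ordered[rank - 1]
```

A monitor would call `update` after each accepted input and compare new sentence-level scores against `value()`.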

A plausible implication is that X-GRAAD anomaly scores could be integrated with ensemble or hybrid methods—combining, for example, density-based and reconstruction-based scores for further robustness, as recommended in general anomaly detection systems (Zohrevand et al., 2019).

7. Limitations and Prospects for Extension

Limitations of current X-GRAAD methods include:

  • Pixel-wise scores can exhibit noise or checkerboarding (imaging) if model or data are not well calibrated; smoothing methods such as “SmoothGrad” are necessary but may introduce blurring (Zimmerer et al., 2019).
  • For “far-out” outliers (inputs for which the generative model assigns near-zero density), scores may collapse or become meaningless.
  • Slice-wise operation in imaging ignores 3D context; extension to volumetric or spatiotemporal modeling is a proposed direction.
  • In NLP, thresholds must be empirically tuned by architecture, and a wrong setting may affect recall or precision.
  • The method relies on model gradients and access to internal attention weights, precluding use with closed-box or inflexible models.

Future work includes more expressive base density models (e.g., Glow, PixelCNN++), joint score aggregation across model layers, learned regularization for anomaly map smoothing, and cross-domain score fusion (Zimmerer et al., 2019). For detection in large-scale or streaming settings, adaptive, persistence-based, and ensemble-based filtering procedures are promising avenues to further reduce false alarms and support high-throughput operation (Zohrevand et al., 2019).
