
GLA-Grad Model: Explainability & Vocoding

Updated 1 December 2025
  • GLA-Grad names two distinct frameworks: one offering global adversarial explainability for CNNs, the other phase-aware improvements for diffusion-based speech synthesis.
  • It quantifies adversarial-induced shifts in model focus using formal metrics (NISSIM, MOD, VID) and compares robustness across CNN architectures under FGSM attacks.
  • In speech synthesis, integrating the Griffin-Lim algorithm with WaveGrad leads to enhanced objective metrics (PESQ, STOI) and improved out-of-domain performance.

GLA-Grad refers to two distinct frameworks in the literature: one targeting global adversarial explainability in convolutional networks ("GLA-Grad" as Global Localized Adversarial Grad-CAM), and one targeting phase-aware improvements in diffusion-based speech synthesis ("GLA-Grad" as a Griffin-Lim-integrated vocoder method). Both frameworks are significant within their domains and are treated separately below.

1. Definition and Scope

GLA-Grad, in the adversarial explainability context, is a diagnostic extension of Grad-CAM that quantifies how a CNN’s class-discriminative focus shifts under adversarial perturbations using formal metrics such as NISSIM, MOD, and VID. In the generative audio synthesis context, GLA-Grad augments the WaveGrad diffusion vocoder by incorporating the Griffin-Lim algorithm (GLA) for spectral phase correction during inference, aiming to reduce conditioning mismatch and improve robustness to out-of-domain scenarios (Chakraborty et al., 2022, Liu et al., 9 Feb 2024).

2. Adversarial Explainability: GLA-Grad with Grad-CAM

GLA-Grad, or Global Localized Adversarial Grad-CAM, extends Grad-CAM from per-instance visual explanations to a population-level diagnostic. In standard Grad-CAM, heatmaps reveal regions most responsible for a single prediction but do not generalize to characterize a model’s global behavior. GLA-Grad bridges this gap by:

  • Training/fine-tuning multiple CNN architectures (VGG16, ResNet50, ResNet101, InceptionNetV3, XceptionNet) on facial identity data (50 identities from VGGFace2; 80/10/10 train/validation/test split).
  • Generating adversarial samples via the Fast Gradient Sign Method (FGSM) with varying ε magnitudes.
  • Computing clean and adversarial Grad-CAM heatmaps for each test image and model.
  • Quantifying the heatmap shifts using Normalized Inverted Structural Similarity (NISSIM), Mean Observed Dissimilarity (MOD), and Variation in Dissimilarity (VID).

This structure enables empirical assessment of how model attention disperses or shifts under adversarial stress (Chakraborty et al., 2022).
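As an illustration of the attack-generation step, a minimal FGSM sketch in PyTorch is shown below; the model handle, input range, and batch layout are assumptions for illustration, not the authors' code.

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Return an FGSM-perturbed copy of images x (labels y) at magnitude eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step along the sign of the input gradient, then clamp to the valid range.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

# Example over the paper's epsilon grid (model, images, labels assumed given):
# adv = {eps: fgsm(model, images, labels, eps) for eps in (0.0, 0.01, 0.05, 0.075, 0.10)}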

3. Formal Metrics and Computation

Three canonical metrics enable global characterization:

  • NISSIM: For image i, NISSIM_i = (1 - SSIM(H_{clean,i}, H_{adv,i})) / 2, where SSIM is the Structural Similarity Index.
    • NISSIM_i \in [0, 1], with 0 indicating identical heatmaps.
  • MOD: For perturbation ε, MOD(\epsilon) = \frac{1}{N} \sum_{i=1}^{N} NISSIM_i(\epsilon).
    • Aggregates the population-wide shift and is monotonically increasing in ε across models.
  • VID: Quantifies the consistency of MOD across K perturbation levels:
    • \mu = \frac{1}{K} \sum_{k=1}^{K} MOD(\epsilon_k),
    • VID = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (MOD(\epsilon_k) - \mu)^2}.

Metric computation pipeline:

  1. Extract clean and adversarial Grad-CAM heatmaps H_{clean} and H_{adv} for each image and ε.
  2. Min-max normalize the heatmaps; compute SSIM (11×11 Gaussian window, σ = 1.5).
  3. Calculate NISSIM_i(\epsilon) per image; aggregate into MOD(\epsilon) and VID (Chakraborty et al., 2022). A minimal implementation sketch follows this list.
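A sketch of that pipeline, assuming scikit-image's SSIM and heatmaps stored as 2D NumPy arrays; the data layout and helper names are illustrative, not the original implementation.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def normalize(h):
    """Min-max normalize a heatmap to [0, 1]."""
    return (h - h.min()) / (h.max() - h.min() + 1e-12)

def nissim(h_clean, h_adv):
    """NISSIM = (1 - SSIM) / 2, mapped into [0, 1]."""
    # gaussian_weights=True with sigma=1.5 yields the 11x11 Gaussian window.
    s = ssim(normalize(h_clean), normalize(h_adv),
             gaussian_weights=True, sigma=1.5, data_range=1.0)
    return (1.0 - s) / 2.0

def mod_and_vid(clean_maps, adv_maps_by_eps):
    """clean_maps: list of arrays; adv_maps_by_eps: {eps: list of arrays}."""
    mod = {eps: float(np.mean([nissim(c, a) for c, a in zip(clean_maps, advs)]))
           for eps, advs in adv_maps_by_eps.items()}
    vid = float(np.std(list(mod.values())))  # population std over the K levels
    return mod, vid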

4. Experimental Design and Empirical Results

CNN Architectures and Dataset

  • Models: VGG16, ResNet50, ResNet101 (deep); InceptionNetV3, XceptionNet (wide).
  • Dataset: 50 identities from VGGFace2, 80/10/10 train/validation/test, ~5,000 test images.
  • Training: only the top classifier layers are fine-tuned; Adam optimizer, learning rate 1 \times 10^{-4}, 60 epochs (a setup sketch follows this list).
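A sketch of that setup for one of the listed backbones, using torchvision's VGG16; the weights enum and head replacement are assumptions about the exact implementation.

import torch
import torchvision

# Load a pretrained backbone and freeze its convolutional layers.
model = torchvision.models.vgg16(weights="IMAGENET1K_V1")
for p in model.features.parameters():
    p.requires_grad = False

# Replace the top classifier layer for the 50 VGGFace2 identities.
model.classifier[6] = torch.nn.Linear(4096, 50)

# Optimize only the trainable (classifier) parameters, as stated above.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)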

FGSM Attacks and Accuracy

FGSM attacks use \epsilon \in \{0.00, 0.01, 0.05, 0.075, 0.10\}. Classification accuracy per model and ε:

Model         ε=0.00   ε=0.01   ε=0.05   ε=0.075   ε=0.10
VGG16          91.9%    91.6%    90.9%    89.9%     89.9%
ResNet50       90.5%    91.9%    88.5%    78.0%     70.6%
ResNet101      90.9%    92.2%    89.2%    75.3%     61.8%
InceptionV3    88.5%    88.5%    57.1%    35.1%     25.3%
Xception       92.6%    92.6%    81.4%    69.6%     56.4%
  • MOD increases monotonically with ε, signifying growing divergence in discriminative attention.
  • Deep architectures show consistently lower MOD than wide networks at the same ε.
  • VID is minimized for deep models (e.g., VGG16), indicating stable attention shifts; wider models yield broader, less consistent focus drift.

Qualitative examination of heatmaps confirms that, as ε increases, activation regions often drift from facial features to spurious background zones, more so in wide models (Chakraborty et al., 2022).

5. Speech Synthesis: GLA-Grad in Diffusion-Based Vocoding

GLA-Grad, in the context of speech synthesis, extends the WaveGrad diffusion vocoder by interleaving Griffin-Lim phase recovery iterations into early reverse diffusion steps.

  • WaveGrad: conditional noise-to-speech diffusion process, guided by a mel-spectrogram \tilde{X}.
  • GLA-Grad algorithm: At each of the first M reverse diffusion steps, a short GLA process projects the waveform onto the set of signals with magnitude spectrogram matching the pseudo-inverted mel conditioning. The Griffin-Lim algorithm performs alternating projections between STFT-consistency and magnitude constraints.

Pseudocode (abbreviated):

C = STFT(y_{n-1}^{WG})               # analyze the current WaveGrad waveform estimate
for k in range(K):                   # K Griffin-Lim iterations
    C = P_consistent(P_{|·|=Ŝ}(C))   # magnitude projection, then STFT-consistency projection
y_{n-1} = iSTFT(C)                   # corrected waveform for the next reverse-diffusion step

where \hat{S} is the magnitude spectrogram recovered from the mel filterbank pseudo-inverse (Liu et al., 9 Feb 2024). A runnable sketch of this correction step follows.
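A minimal runnable sketch of the correction step, assuming librosa's STFT pair; in practice S_hat would come from a mel pseudo-inverse such as librosa.feature.inverse.mel_to_stft, and the parameter values below are illustrative rather than the paper's configuration.

import numpy as np
import librosa

def gla_correction(y, S_hat, n_fft=1024, hop_length=256, K=2):
    """Nudge waveform y toward target magnitude S_hat via K Griffin-Lim iterations."""
    C = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    for _ in range(K):
        # Magnitude projection: keep the current phase, impose the target magnitude.
        C = S_hat * np.exp(1j * np.angle(C))
        # Consistency projection: round-trip through iSTFT and STFT.
        y = librosa.istft(C, hop_length=hop_length)
        C = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    return librosa.istft(C, hop_length=hop_length)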

6. Empirical Findings and Performance Analysis

Explainability (CNN context):

  • MOD and VID provide architecture-agnostic, global explainability metrics for robustness.
  • Lower MOD/VID corresponds to more robust and stable attention under adversarial manipulation.
  • Useful for model selection, comparative diagnostics, adversarial defense initialization.

Speech Synthesis (Diffusion context):

  • GLA-Grad shows improved objective metrics (PESQ, STOI, WARP-Q) over baseline WaveGrad when evaluated on out-of-distribution datasets (e.g., VCTK), notably reducing spectro-temporal conditioning mismatch.
  • No retraining: GLA-Grad operates entirely post hoc, compatible with any pretrained WaveGrad-type model.
  • Inference slows modestly: GLA-Grad runs at 39× real-time versus 54× real-time for WaveGrad-6.
System        PESQ (LJ→VCTK)   STOI (LJ→VCTK)   Inference speed (×RT)
WaveGrad-6         2.08             0.87                  54
GLA-Grad           2.73             0.94                  39
Griffin-Lim         —                —                    26

GLA-Grad’s projections toward the conditioning magnitude are particularly effective for unseen target speakers (Liu et al., 9 Feb 2024).

7. Implementation and Reproducibility

All source code is implemented in PyTorch 1.8 (for adversarial explainability) and standard WaveGrad codebases (for speech synthesis), leveraging Jupyter notebooks and fixed random seeds for reproducibility. Adversarial explainability experiments require precise data splits (random seed 42), min-max normalization, and consistent preprocessing. Speech synthesis experiments require accurate inverse mel processing, GLA iterations, and careful mapping of noise schedules (Chakraborty et al., 2022, Liu et al., 9 Feb 2024).
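A minimal reproducibility sketch matching the stated setup (fixed seed 42); the exact seeding calls used by the authors are an assumption.

import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # matches the data-split seed reported above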

8. Limitations and Significance

  • GLA-Grad for explainability does not attribute causality beyond measured focus shift, and its metrics are limited by the fidelity of Grad-CAM itself.
  • In the vocoder context, Griffin-Lim is not the optimal phase estimator; the method does not alleviate all forms of conditioning error, and parameters (e.g., number of GLA corrections) are manually selected.
  • Both lines enable new diagnostic and generative capabilities: global adversarial explainability for CNNs, and out-of-domain robust speech synthesis for diffusion vocoders (Chakraborty et al., 2022, Liu et al., 9 Feb 2024).

GLA-Grad thus represents two influential approaches that operationalize model introspection and robustness, each rooted in the integration of explainability or signal-theoretic constraints into deep learning pipelines.
