GMAR: Gradient-Driven Multi-Head Attention Rollout
- GMAR is an interpretability technique for Vision Transformers that uses gradient-based scoring to weight and aggregate attention heads.
- It enhances classical attention rollout by integrating per-head gradient sensitivity, yielding sharper and more focused saliency maps.
- Empirical evaluations show GMAR consistently outperforms classical attention rollout and is competitive with or better than Grad-CAM on ViT models across standard saliency metrics.
Gradient-Driven Multi-Head Attention Rollout (GMAR) is an interpretability technique designed for Vision Transformers (ViTs) that leverages gradient-based head importance scoring to construct weighted attention rollouts. The method addresses the key challenge of multi-head attention interpretability by quantifying the relative contribution of each head, thus generating sharper and more object-centric saliency maps than prior attention-based approaches. GMAR operates by integrating per-head gradient sensitivity into the attention propagation process, enhancing transparency in ViT predictions while maintaining compatibility with standard transformer architectures.
1. Vision Transformer Architecture and Classical Attention Rollout
The Vision Transformer (ViT) architecture processes images by partitioning the input into fixed-size patches, each patch being linearly projected into an embedding. A learnable class token is inserted, positional embeddings are added, and the entire sequence traverses stacked transformer encoder layers. Each layer contains a multi-head self-attention (MHSA) block with $H$ parallel heads. Each head $h$ computes an attention matrix

$$A^{(h)} = \operatorname{softmax}\!\left(\frac{Q^{(h)}\,{K^{(h)}}^{\top}}{\sqrt{d_k}}\right) \in \mathbb{R}^{N \times N},$$

where $N$ is the number of tokens (patches plus the class token) and $d_k$ is the head dimension. Token outputs from all heads are concatenated, projected, and the class-token embedding at the output is used for classification.
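As a concrete illustration of this per-head computation, here is a minimal PyTorch sketch with hypothetical dimensions and randomly initialized query/key projections (a real ViT learns these matrices and additionally applies value and output projections):

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions roughly matching a ViT-Large/16 layer at 224x224 input:
# N = 196 patches + 1 class token, H heads of dimension d_k, hidden size H * d_k.
N, H, d_k = 197, 16, 64
x = torch.randn(N, H * d_k)                 # token embeddings entering the layer

# Randomly initialized per-head query/key projections (learned in a real model).
W_q = torch.randn(H, H * d_k, d_k)
W_k = torch.randn(H, H * d_k, d_k)

Q = torch.einsum("nd,hdk->hnk", x, W_q)     # (H, N, d_k)
K = torch.einsum("nd,hdk->hnk", x, W_k)     # (H, N, d_k)

# A[h] = softmax(Q[h] K[h]^T / sqrt(d_k)): one (N, N) attention matrix per head.
A = F.softmax(Q @ K.transpose(-1, -2) / d_k**0.5, dim=-1)   # (H, N, N)
```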
Classical attention rollout, originating with Abnar & Zuidema (2020), generates saliency maps by recursively multiplying “residualized” attention matrices across layers:

$$\tilde{A}_\ell = \operatorname{normalize}\!\left(\bar{A}_\ell + I\right), \qquad R = \tilde{A}_L\,\tilde{A}_{L-1} \cdots \tilde{A}_1,$$

where $\bar{A}_\ell$ is the head-averaged attention of layer $\ell$, $I$ is the identity matrix accounting for residual connections, and normalization rescales each row to sum to one. The first row of $R$ corresponds to the class token’s aggregated attention over patches and is interpreted as a saliency map. A limitation of this approach is the equal treatment of all attention heads, despite empirical evidence that not all heads are equally informative for a given prediction.
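For reference, a compact PyTorch sketch of classical rollout, assuming `attentions` is a list of per-layer attention tensors of shape (H, N, N) extracted from a forward pass (exact residual weighting and normalization vary slightly between implementations):

```python
import torch

def attention_rollout(attentions):
    """Classical attention rollout sketch (after Abnar & Zuidema, 2020).

    attentions: list of L tensors, each (H, N, N) with per-head attention maps.
    Returns the class-token saliency over the N - 1 patch tokens.
    """
    N = attentions[0].shape[-1]
    rollout = torch.eye(N)
    for A_layer in attentions:
        A_mean = A_layer.mean(dim=0)                       # average over heads
        A_res = A_mean + torch.eye(N)                      # residual connection
        A_res = A_res / A_res.sum(dim=-1, keepdim=True)    # re-normalize rows
        rollout = A_res @ rollout                          # compose layer by layer
    return rollout[0, 1:]                                  # class-token row, patches only
```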
2. Gradient-Based Head Importance Quantification
GMAR introduces a principled mechanism for quantifying per-head importance through gradients with respect to the logit for the target class. For each head $h$ in layer $\ell$, the sensitivity of the target logit $y_c$ to that head's attention matrix $A_\ell^{(h)}$ is computed:

$$G_\ell^{(h)} = \frac{\partial y_c}{\partial A_\ell^{(h)}} \in \mathbb{R}^{N \times N}.$$
Aggregating these partial derivatives yields the head’s raw importance score $g_\ell^{(h)}$. Two norm-based aggregation schemes are used:
- L1 norm: $g_\ell^{(h)} = \big\| G_\ell^{(h)} \big\|_1 = \sum_{i,j} \big| (G_\ell^{(h)})_{ij} \big|$
- L2 norm: $g_\ell^{(h)} = \big\| G_\ell^{(h)} \big\|_2 = \sqrt{\sum_{i,j} (G_\ell^{(h)})_{ij}^2}$

The normalized head-wise weights are then

$$\alpha_\ell^{(h)} = \frac{g_\ell^{(h)}}{\sum_{h'=1}^{H} g_\ell^{(h')}}.$$
In a single layer, the weighted aggregate attention is the convex combination

$$A_\ell^{\text{weighted}} = \sum_{h=1}^{H} \alpha_\ell^{(h)}\, A_\ell^{(h)}.$$
Multi-layer GMAR generalizes this weighted aggregation using the classical rollout procedure, propagating head-weighted attention through all layers.
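A sketch of this head-weighting step in PyTorch, assuming `y_c` is the scalar target logit and `attn_per_layer` holds per-layer attention tensors of shape (H, N, N) that are still connected to the autograd graph:

```python
import torch

def gmar_head_weights(y_c, attn_per_layer, norm="l1"):
    """Per-layer head weights alpha_l[h] from gradients of the target logit (sketch)."""
    grads = torch.autograd.grad(y_c, attn_per_layer, retain_graph=True)
    weights = []
    for G in grads:                                 # G: (H, N, N) gradient tensor
        if norm == "l1":
            g = G.abs().sum(dim=(1, 2))             # L1 aggregation per head
        else:
            g = G.pow(2).sum(dim=(1, 2)).sqrt()     # L2 aggregation per head
        weights.append(g / g.sum())                 # normalize over heads -> (H,)
    return weights

# Single-layer weighted aggregation for layer l:
#   alpha = gmar_head_weights(y_c, attn_per_layer)[l]
#   A_weighted = (alpha[:, None, None] * attn_per_layer[l]).sum(dim=0)
```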
3. The GMAR Algorithm
GMAR comprises the following steps:
- Forward Pass: The image is processed by the ViT, yielding logits and per-layer, per-head attention matrices $A_\ell^{(h)}$. The maximum class logit $y_c$ is selected as the target.
- Backward Pass: Backpropagation computes the gradients $\partial y_c / \partial A_\ell$ for each layer, split into head-specific gradients $G_\ell^{(h)}$.
- Head Weight Calculation: For each head, aggregate $G_\ell^{(h)}$ with the selected norm to obtain $g_\ell^{(h)}$, then normalize across heads to compute $\alpha_\ell^{(h)}$.
- Weighted Attention Rollout: For each layer, compute the weighted sum of head attentions using $\alpha_\ell^{(h)}$, add the residual connection, and propagate via matrix multiplication across layers. The output saliency map is the first (class-token) row of the final aggregated rollout.
Pseudocode Outline:
```
# Forward pass: logits and per-layer attention tensors A[l] of shape (H, N, N)
logits, A = model(image)
target_logit = logits.max()                    # logit of the predicted class

# Backward pass: gradients of the target logit w.r.t. every attention map
zero_grad(model)
target_logit.backward()

# Head weight calculation
alpha = []
for l in range(L):
    g_l = zeros(H)
    for h in range(H):
        G_lh = grad_of(target_logit, A[l][h])  # gradient, shape (N, N)
        if use_L1:
            g_l[h] = abs(G_lh).sum()           # L1 aggregation
        else:
            g_l[h] = (G_lh**2).sum().sqrt()    # L2 aggregation
    alpha.append(g_l / g_l.sum())              # normalize over heads

# Weighted attention rollout
A_rollout = eye(N)
for l in range(L):
    A_weighted = weighted_sum_heads(A[l], alpha[l])  # sum_h alpha[l][h] * A[l][h]
    A_hat = A_weighted + alpha_res * eye(N)          # residual connection
    A_rollout = A_hat @ A_rollout                    # propagate across layers

saliency = A_rollout[0]                              # class-token row -> saliency map
```
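For orientation, below is a hedged end-to-end sketch of the same procedure with a Hugging Face ViT. The checkpoint name, preprocessing, and the assumption that the attention tensors returned with `output_attentions=True` remain connected to the autograd graph are illustrative choices to verify against one's own setup, not the authors' reference implementation:

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

name = "google/vit-large-patch16-224"           # stand-in for the fine-tuned model
model = ViTForImageClassification.from_pretrained(name, output_attentions=True).eval()
processor = ViTImageProcessor.from_pretrained(name)

image = Image.open("example.jpg")               # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
attentions = outputs.attentions                 # tuple of L tensors, each (1, H, N, N)
target_logit = outputs.logits.max()             # maximum class logit

# Gradients of the target logit w.r.t. every attention map.
grads = torch.autograd.grad(target_logit, attentions)

N = attentions[0].shape[-1]
rollout = torch.eye(N)
for A, G in zip(attentions, grads):
    A, G = A[0].detach(), G[0]                  # drop batch dim -> (H, N, N)
    g = G.abs().sum(dim=(1, 2))                 # L1 head importance
    alpha = g / g.sum()                         # normalized head weights
    A_weighted = (alpha[:, None, None] * A).sum(dim=0)
    A_hat = A_weighted + 1.0 * torch.eye(N)     # residual term with alpha_res = 1.0
    rollout = A_hat @ rollout                   # propagate across layers

saliency = rollout[0, 1:].reshape(14, 14)       # 14x14 patch grid for 224 / 16
```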
4. Empirical Evaluation and Comparative Analysis
The main evaluation uses ViT-Large-Patch16-224 fine-tuned on Tiny-ImageNet (200 classes; images upsampled to 224×224 to match the model's input resolution). Training uses the Adam optimizer with a batch size of 32 for 100 epochs, with standard augmentations.
GMAR is compared against Grad-CAM (applied to ViT embeddings) and classical attention rollout, using metrics suited to saliency explanations for image classification:
- Average Drop: mean percentage drop in the target-class probability when only the salient regions are kept (lower is better, ↓)
- Average Increase: percentage of cases in which the target-class probability increases when only the salient regions are kept (higher is better, ↑)
- Insertion/Deletion AUCs: aggregate effect of gradually inserting/removing pixels in order of saliency (Insertion: higher is better, ↑; Deletion: lower is better, ↓)

A minimal sketch of the Average Drop and Average Increase computation is given below, followed by the comparative results.
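The sketch below illustrates the first two metrics; the helper name, the use of the upsampled saliency map as a soft mask, and the assumption that `model` returns raw logits are illustrative rather than the paper's exact protocol:

```python
import torch

def average_drop_and_increase(model, images, saliency_masks, targets):
    """Average Drop / Average Increase over a batch (illustrative helper).

    images:         (B, 3, H, W) inputs
    saliency_masks: (B, 1, H, W) saliency maps in [0, 1], upsampled to image size
    targets:        (B,) target class indices
    """
    model.eval()
    with torch.no_grad():
        p_full = torch.softmax(model(images), dim=-1)
        p_masked = torch.softmax(model(images * saliency_masks), dim=-1)

    idx = torch.arange(images.size(0))
    y_full, y_masked = p_full[idx, targets], p_masked[idx, targets]

    avg_drop = (torch.clamp(y_full - y_masked, min=0) / y_full).mean() * 100
    avg_increase = (y_masked > y_full).float().mean() * 100
    return avg_drop.item(), avg_increase.item()
```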
| Method | Avg Drop ↓ | Avg Inc ↑ | Insertion ↑ | Deletion ↓ |
|---|---|---|---|---|
| Grad-CAM | 22.61 | 65.80 | 10.75 | 10.09 |
| Attention Rollout | 25.78 | 46.20 | 11.97 | 12.17 |
| GMAR (L1 norm) | 23.93 | 56.10 | 12.15 | 10.62 |
| GMAR (L2 norm) | 22.13 | 55.90 | 12.16 | 10.64 |
GMAR demonstrates consistent improvement over classical rollout in all metrics and performs comparably or better than Grad-CAM except for Average Increase, where Grad-CAM retains an advantage in highlighting the most discriminative patch.
5. Qualitative Visualizations and Interpretability Insights
Saliency maps produced by GMAR (Figure 1 in the original paper) show tightly focused attention on true object regions, contrasting with Grad-CAM’s tendency to highlight only the most discriminative patch and attention rollout’s tendency to diffuse over background regions. Head importance visualizations (Figure 2) indicate that only a small subset of heads have dominant influence in each layer. Difference maps (Figure 3) between unweighted and GMAR-weighted attention highlight that crucial regions are selectively amplified when gradients are considered, supporting the view that GMAR yields clearer head-level interpretability.
6. Ablation Studies and Parameter Sensitivity
L1 versus L2 normalization for gradient aggregation exhibits minimal sensitivity: L1 yields marginally sharper, sparser saliency maps (higher Average Increase), while L2 slightly reduces Average Drop; both norms are robust across metrics. Adjusting the residual weight $\alpha_{\text{res}}$ (tested at 0.5, 1.0, and 1.5) identifies $\alpha_{\text{res}} = 1.0$ as optimal, indicating that standard residual inclusion best supports information flow in GMAR.
7. Limitations and Prospective Extensions
Current experiments are limited to Tiny-ImageNet and a single ViT variant. Generalization to other datasets and architectures, such as hierarchical ViTs (e.g., Swin, DeiT), remains to be established. GMAR produces patch-level saliency; pixel-level explanations may require interpolation or integration with hybrid schemes. The computational cost, dominated by a single backward pass for gradient computation per input, matches that of Grad-CAM.
Proposed future directions include extending GMAR to hierarchical and sparsified ViTs and applying the technique in other vision tasks (e.g., object detection, segmentation) via cross-attention map aggregation. The integration of semantic extractors and the exploitation of GMAR scores for dynamic head pruning are also suggested for model compression and further interpretability.
In summary, Gradient-Driven Multi-Head Attention Rollout (GMAR) formalizes attention head importance via class-specific gradients, integrates these normalized weights into the attention aggregation and propagation process, and empirically delivers enhanced interpretability over prior methods in Vision Transformers. This approach grounds the analysis of model decision-making at the level of individual attention heads and establishes a framework for subsequent advances in transformer interpretability.