GradCAM++ for CNN Interpretability

Updated 30 March 2026
  • Grad-CAM++ is a gradient-based attribution method that generates spatially precise, class-discriminative heatmaps for enhanced CNN interpretability.
  • It refines the original Grad-CAM by incorporating higher-order derivatives and per-pixel weights to improve object localization and capture multiple instances.
  • Numerical analyses show Grad-CAM++ is equivalent to positive-gradient pooling, simplifying implementation without compromising explanation quality.

Grad-CAM++ is a gradient-based attribution method designed for the interpretability of deep convolutional neural networks (CNNs). As a refinement of the original Grad-CAM approach, Grad-CAM++ produces spatially precise, class-discriminative heatmaps that capture the importance of each spatial location in convolutional feature maps with respect to the decision for a target class. This approach is particularly known for enhancing the localization of entire objects and multiple instances in a single image, as compared to the earlier Grad-CAM formulation. Subsequent research has established that, in practical terms, Grad-CAM++ is numerically equivalent to a simpler variant of Grad-CAM based on positive gradients, which has significant implications for implementation efficiency and interpretability (Lerma et al., 2022).

1. Background and Motivation

Convolutional neural networks, while highly performant for vision tasks, are often criticized for their lack of interpretability. Grad-CAM (Gradient-weighted Class Activation Mapping) addressed this by using the gradient of a class score with respect to the convolutional feature maps to produce a heatmap indicating salient image regions for a given prediction. However, Grad-CAM has several limitations:

  • It often highlights only the most discriminative part instead of the entirety of an object.
  • It struggles to correctly localize multiple instances of the same class within an image.
  • The spatial resolution and coverage of Grad-CAM heatmaps are suboptimal due to uniform global-average pooling of gradients across spatial locations.

Grad-CAM++ was introduced to resolve these issues by generalizing the gradient aggregation scheme and incorporating higher-order derivative information to assign pixel-wise importance weights (Chattopadhyay et al., 2017).

2. Mathematical Formulation and Derivation

Let $A^k \in \mathbb{R}^{H\times W}$ be the $k$-th feature map at a chosen convolutional layer, and $S^c$ the pre-softmax score for class $c$. Grad-CAM++ introduces per-pixel weights $\alpha^{k,c}_{ij}$ in the linear combination of activation maps:

$$L^c_{\text{Grad-CAM++}}(i,j) = \mathrm{ReLU}\left(\sum_k w^c_k\, A^k_{ij}\right),$$

$$w^c_k = \sum_{i,j} \alpha^{k,c}_{ij}\, \mathrm{ReLU}\left(\frac{\partial S^c}{\partial A^k_{ij}}\right).$$

The pixel-wise weighting coefficients $\alpha^{k,c}_{ij}$ are derived via a Taylor-series-like expansion that ensures the class score is accurately reconstructed from the activation maps and their gradients. For $Y^c = \exp(S^c)$, the closed-form solution is:

$$\alpha^{k,c}_{ij} = \frac{\left(\frac{\partial S^c}{\partial A^k_{ij}}\right)^2}{2\left(\frac{\partial S^c}{\partial A^k_{ij}}\right)^2 + \sum_{a,b} A^k_{ab} \left(\frac{\partial S^c}{\partial A^k_{ij}}\right)^3}.$$

Thus, to compute $L^c_{\text{Grad-CAM++}}$ from its derivation, one must evaluate first-, second-, and third-order derivatives of the class score with respect to the activation maps at each spatial location; with the choice $Y^c = \exp(S^c)$, these higher-order derivatives reduce to powers of the first-order gradient, as the closed form above shows (Chattopadhyay et al., 2017, Omeiza, 2019, Omeiza et al., 2019).
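As a concrete illustration, the closed-form channel weights can be computed with plain NumPy once the activations and first-order gradients are available. The function below is a minimal sketch, assuming the exponential class score so that only first-order gradients are needed; it is not a reference implementation, and tensor shapes are an assumption for illustration:

```python
import numpy as np

def gradcampp_channel_weights(activations, grads, eps=1e-8):
    """Grad-CAM++ channel weights w^c_k for one convolutional layer.

    activations: (K, H, W) feature maps A^k from the forward pass
    grads:       (K, H, W) first-order gradients dS^c/dA^k_ij
    With Y^c = exp(S^c), the second- and third-order derivatives in the
    alpha formula reduce to powers of the first-order gradient.
    """
    g2 = grads ** 2
    g3 = grads ** 3
    # Denominator: 2*(dS/dA)^2 + (sum_ab A^k_ab) * (dS/dA)^3, per channel.
    denom = 2.0 * g2 + activations.sum(axis=(1, 2), keepdims=True) * g3
    alpha = g2 / np.where(denom != 0.0, denom, eps)  # alpha^{k,c}_{ij}
    # w^c_k: alpha-weighted sum of the positive gradients.
    return (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))
```

For example, with unit activations and unit gradients on a single 2x2 channel, the denominator is 6 everywhere, each alpha is 1/6, and the channel weight comes out to 2/3.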

3. Comparison to Grad-CAM and Equivalence to Positive-Gradient Pooling

While Grad-CAM++ is motivated by a more expressive per-pixel weighting, empirical analysis reveals that the additional complexity does not yield materially different results from a simpler variant, often called Grad-CAM+ in open-source implementations. In Grad-CAM+, channel weights are computed as the spatial average of the positive part of the gradients, clamping negative responses:

$$w^c_k = \frac{1}{HW} \sum_{i,j} \mathrm{ReLU}\left( \frac{\partial y^c}{\partial A^k_{ij}} \right).$$

Numerically, the Grad-CAM++ weights $\alpha^{k,c}_{ij}$ cluster around $0.5$ at nearly all locations with nonzero gradients for practical CNNs and datasets: the third-order term in the denominator is almost always negligible, so the denominator is close to twice the numerator. Thus,

$$w^c_k \approx \frac{1}{2} \sum_{i,j} \mathrm{ReLU}\left(\frac{\partial y^c}{\partial A^k_{ij}}\right).$$

Since the factor $1/2$ versus $1/(HW)$ is a global scale that vanishes under heatmap normalization, the practical outcome is that Grad-CAM++ and Grad-CAM+ produce visually and quantitatively indistinguishable heatmaps. This numeric equivalence means that, for object localization and interpretability, no additional accuracy is gained by the higher-order computations of Grad-CAM++ (Lerma et al., 2022).
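This clustering is easy to reproduce with synthetic tensors; the snippet below uses random activations and small random gradients as stand-ins for a real CNN's values (an illustrative experiment, not a replication of the published analysis), and compares the resulting Grad-CAM++ weights against half the positive-gradient sum:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(8, 7, 7))   # stand-in conv activations
g = rng.normal(0.0, 1e-3, size=(8, 7, 7))   # stand-in gradients dS/dA

# Closed-form alpha weights (third-order term is tiny for small gradients).
g2, g3 = g ** 2, g ** 3
denom = 2.0 * g2 + A.sum(axis=(1, 2), keepdims=True) * g3
alpha = np.where(denom != 0.0, g2 / denom, 0.0)

# Grad-CAM++ weights vs. half the positive-gradient sum.
w_pp = (alpha * np.maximum(g, 0.0)).sum(axis=(1, 2))
w_half = 0.5 * np.maximum(g, 0.0).sum(axis=(1, 2))
```

In this configuration every alpha lands within a few hundredths of 0.5, and the two weight vectors agree to within a few percent per channel.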

4. Algorithmic Implementation

The steps involved in Grad-CAM++ (and the numerically equivalent positive-gradient Grad-CAM variant) are as follows:

  1. Forward Pass: Compute the class score $y^c$ with a forward pass through the CNN.
  2. Backward Pass: Obtain the gradients $\frac{\partial y^c}{\partial A^k_{ij}}$.
  3. Clamping: Apply ReLU to the gradients to retain only positive values.
  4. Weight Computation: Compute channel weights $w^c_k$ via average pooling (for Grad-CAM+) or the sum of $\alpha^{k,c}_{ij}$-weighted positive gradients (for Grad-CAM++).
  5. Linear Combination: Aggregate the weighted feature maps.
  6. Rectification: Apply ReLU to form the class activation heatmap.
  7. Rescaling: Upsample and normalize the heatmap for visualization.
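The steps above can be sketched in a few lines of NumPy for the positive-gradient variant. This is an illustrative sketch operating on precomputed tensors: steps 1–2 (the forward and backward passes) are assumed done in whatever framework hosts the model, and a nearest-neighbor upsample stands in for the usual bilinear resize:

```python
import numpy as np

def positive_gradient_cam(activations, grads, out_hw, eps=1e-8):
    """Steps 3-7 of the pipeline for the positive-gradient (Grad-CAM+) variant.

    activations: (K, H, W) conv feature maps from the forward pass
    grads:       (K, H, W) gradients of the class score w.r.t. them
    out_hw:      (H_img, W_img) target size, assumed an integer multiple
                 of (H, W) for this nearest-neighbor upsample
    """
    pos_grads = np.maximum(grads, 0.0)                  # 3. clamp negatives
    weights = pos_grads.mean(axis=(1, 2))               # 4. average-pool weights
    cam = np.tensordot(weights, activations, axes=1)    # 5. linear combination
    cam = np.maximum(cam, 0.0)                          # 6. rectify
    sh = out_hw[0] // cam.shape[0]
    sw = out_hw[1] // cam.shape[1]
    cam = np.kron(cam, np.ones((sh, sw)))               # 7a. upsample (nearest)
    return (cam - cam.min()) / (cam.max() - cam.min() + eps)  # 7b. normalize
```

Swapping step 4 for the alpha-weighted sum turns this into full Grad-CAM++, which, per the equivalence result, changes the normalized heatmap negligibly.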

Importantly, for Grad-CAM++, adding per-pixel weights $\alpha^{k,c}_{ij}$ and higher-order derivatives is unnecessary for practical localization performance: first-order (gradient) computations with ReLU suffice. This yields dramatically simpler, numerically stable implementations with reduced code complexity and identical heatmap fidelity (Lerma et al., 2022).

5. Extensions: Smooth Grad-CAM++

Smooth Grad-CAM++ combines Grad-CAM++ with SmoothGrad, which averages heatmaps computed over noisy image perturbations. The procedure involves generating $n$ noisy input copies via Gaussian perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2)$, computing Grad-CAM++ for each, and taking the mean heatmap:

$$M^{\text{Smooth}}_c(x) = \frac{1}{n}\sum_{l=1}^n L^c_{\text{Grad-CAM++}}\left(x + \epsilon^{(l)}\right).$$

This process reduces gradient noise and enhances boundary sharpness, yielding cleaner and more spatially precise explanations. Smooth Grad-CAM++ is especially useful in medical imaging and bias auditing, as it provides more reliable localization in high-stakes contexts (Omeiza et al., 2019, Omeiza, 2019).
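The averaging step itself is straightforward to sketch. In the snippet below, `cam_fn` is a hypothetical placeholder for any single-input heatmap routine (a real Grad-CAM++ computation would go there); everything else follows the equation above:

```python
import numpy as np

def smooth_cam(cam_fn, x, n=25, sigma=0.1, rng=None):
    """Average cam_fn over n Gaussian-perturbed copies of input x.

    cam_fn: callable mapping an input array to a heatmap array
            (placeholder for a real Grad-CAM++ computation).
    """
    rng = np.random.default_rng() if rng is None else rng
    acc = np.zeros_like(cam_fn(x), dtype=float)  # extra call just to get the shape
    for _ in range(n):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)  # x + eps^(l)
        acc += cam_fn(noisy)
    return acc / n
```

With `sigma=0` the perturbations vanish and the routine degenerates to plain Grad-CAM++ on the unmodified input, which makes it easy to sanity-check.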

6. Empirical Performance and Applications

Grad-CAM++ outperforms baseline Grad-CAM in localizing multiple object instances and recovers more of the object area in single-instance images. Reported metrics include the average drop percentage (the fall in class confidence when the model is shown only the heatmap-highlighted region), the percent increase in confidence, and intersection-over-union (IoU) against segmentation masks. On ImageNet, Grad-CAM++ yields an average drop of 36.84% (vs. 46.56% for Grad-CAM); on Pascal VOC multi-label classification, Grad-CAM++ achieves an average drop of 19.5% (vs. 28.5%) (Chattopadhyay et al., 2017).
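For reference, the average drop metric quoted above reduces to a short function; this is a sketch of the definition from Chattopadhyay et al. (the per-image relative confidence drop, floored at zero, averaged and expressed as a percentage), with `full_scores` and `masked_scores` as hypothetical arrays of per-image class confidences:

```python
import numpy as np

def average_drop_percent(full_scores, masked_scores):
    """Mean over images of max(0, Y - O) / Y, as a percentage.

    full_scores:   class confidences Y on the original images
    masked_scores: class confidences O on images masked to the
                   heatmap-highlighted region
    """
    y = np.asarray(full_scores, dtype=float)
    o = np.asarray(masked_scores, dtype=float)
    return float(np.mean(np.maximum(y - o, 0.0) / y) * 100.0)
```

For instance, confidences falling from 0.8 to 0.4 on one image (a 50% drop) and rising from 0.5 to 0.6 on another (clamped to 0) average out to 25%.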

Smooth Grad-CAM++ further improves map sharpness and multi-instance coverage, increasing IoU by 8–15% and offering better localization of salient pixels compared to standard Grad-CAM++ (Omeiza et al., 2019, Omeiza, 2019).

Applications span face-recognition bias diagnostics, medical scan analysis, self-driving scene understanding, image captioning, and as an auxiliary objective in teacher-student knowledge distillation.

7. Limitations and Contemporary Perspective

Grad-CAM++’s formal derivation assumes differentiability of the class score; its per-pixel weighting is theoretically motivated but collapses to trivial constants in practical scenarios. As demonstrated by Lerma & Lucas, this renders the method empirically indistinguishable from positive-gradient-based pooling, suggesting that complexity does not translate into improved explanation quality (Lerma et al., 2022). High computational cost, due to higher-order derivatives, is avoidable without loss in interpretability by adopting the positive gradient variant. Extensions to other network classes, dense prediction, or arbitrary target subsets remain areas of ongoing investigation (Chattopadhyay et al., 2017).


Key References:

  • "Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks" (Chattopadhyay et al., 2017)
  • "Grad-CAM++ is Equivalent to Grad-CAM With Positive Gradients" (Lerma et al., 2022)
  • "A Step Towards Exposing Bias in Trained Deep Convolutional Neural Network Models" (Omeiza, 2019)
  • "Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models" (Omeiza et al., 2019)
