Grad-CAM Entropy Loss

Updated 8 January 2026

The paper presents an auxiliary loss leveraging the Shannon entropy of normalized Grad-CAM maps to promote concentrated and interpretable activation regions.
It integrates the entropy term with standard cross-entropy loss using a hyperparameter to control the trade-off between classification accuracy and spatial localization.
Empirical results demonstrate improved Grad-CAM sharpness and contrast on CNN architectures with only a minor accuracy reduction, all without modifying the network structure.

A Grad-CAM-based Dice loss does not appear in the referenced literature. The relevant approach instead adds an auxiliary loss term, the Shannon entropy of the normalized Grad-CAM map, to improve the interpretability and localization of the gradient-weighted class activation maps (Grad-CAMs) produced by convolutional neural network (CNN) classifiers. This entropy regularizer is added to the standard classification objective, requires no architectural modifications, and needs second-order derivatives during optimization. The method gives explicit control over the trade-off between classification accuracy and the spatial concentration of explanation maps, and because it adds no network layers it can be used on both deep and embedded architectures (Schöttl, 2020).

1. Formulation of the Grad-CAM Entropy Loss

The technique introduces an auxiliary loss term defined as the spatial Shannon entropy of a normalized Grad-CAM map for the ground-truth class. Let $L_{GC}^{(c)}$ denote the unnormalized Grad-CAM activation map for class $c$. The normalized map $\widetilde M$ and the entropy term $c_e$ are:

$\widetilde M_{ij} = \frac{L_{GC,ij}^{(c)}}{\sum_{k,\ell} L_{GC,k\ell}^{(c)}}$

$c_e\bigl(L_{GC}^{(c)}\bigr) = -\sum_{i,j} \widetilde M_{ij}\log\!\left(\widetilde M_{ij}\right)$

Here, $\widetilde M_{ij}$ is a spatial probability distribution over map locations: its entries are non-negative (the raw map passes through a ReLU, see Section 3) and satisfy $\sum_{i,j}\widetilde M_{ij}=1$. The term $c_e$ acts as an “interpretability regularizer”, penalizing distributed (high-entropy) attention maps and favoring concentrated regions of activation.
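A minimal pure-Python sketch of the two formulas above may help make the behaviour concrete. The function names `normalize_map` and `spatial_entropy` are illustrative, and treating $0\log 0$ as $0$ is an assumption made here (the source does not say how zero entries left by the ReLU are handled).

```python
import math


def normalize_map(cam):
    """Divide a non-negative 2-D map by its total so that its entries sum to 1."""
    total = sum(sum(row) for row in cam)
    if total <= 0:
        raise ValueError("map must have positive total activation")
    return [[v / total for v in row] for row in cam]


def spatial_entropy(cam):
    """Shannon entropy (natural log) of the normalized map.

    Assumption: zero entries contribute nothing, i.e. 0 * log(0) is taken as 0.
    """
    m = normalize_map(cam)
    return -sum(v * math.log(v) for row in m for v in row if v > 0)


diffuse = [[1.0, 1.0], [1.0, 1.0]]  # activation spread evenly over four cells
peaked = [[0.0, 0.0], [0.0, 5.0]]   # all activation in a single cell

print(spatial_entropy(diffuse))  # the maximum for four cells, log(4)
print(spatial_entropy(peaked))   # the minimum, zero
```

The evenly spread map attains the largest possible entropy for its size, while the single-cell map attains the smallest, which is why minimizing $c_e$ pushes maps toward concentrated activation.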

2. Integration With Classification Objective

The total loss function combines standard cross-entropy on softmax outputs with the entropy term scaled by a new non-negative hyperparameter $\beta$:

$\mathcal{L} = \mathcal{L}_{CE}\bigl(y_{\rm true}, y_{\rm pred}\bigr) + \beta\,c_e\left(L_{GC}^{(c)}\right)$

where $\mathcal{L}_{CE}(y_{\rm true},y_{\rm pred}) = -\sum_{m=1}^C y_{\rm true}^m \log y_{\rm pred}^m$. The hyperparameter $\beta$ governs the trade-off: larger values enforce greater localization in Grad-CAM maps, at a potential cost in classification accuracy. For $\beta=0$, the scheme reverts to conventional training.
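The combined objective can be sketched for a single sample as follows. This is only an illustration of the formula: `cross_entropy` and `total_loss` are hypothetical helper names, and the entropy value passed in is assumed to have been computed separately from the Grad-CAM map.

```python
import math


def cross_entropy(y_true, y_pred):
    """-sum_m y_true^m * log(y_pred^m); y_pred holds softmax probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def total_loss(y_true, y_pred, cam_entropy, beta):
    """Cross-entropy plus beta times the Grad-CAM entropy term c_e."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    return cross_entropy(y_true, y_pred) + beta * cam_entropy


y_true = [0.0, 1.0, 0.0]   # one-hot ground truth
y_pred = [0.2, 0.7, 0.1]   # softmax output
c_e = 0.5                  # illustrative entropy value, not taken from the source

print(total_loss(y_true, y_pred, c_e, beta=0.0))  # plain cross-entropy
print(total_loss(y_true, y_pred, c_e, beta=2.0))  # cross-entropy plus 2 * c_e
```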

3. Grad-CAM Map Computation and Second-Order Gradients

During training, the raw Grad-CAM map for class $c$ is constructed as follows. Let $A^k \in \mathbb{R}^{H\times W}$ denote the $k$-th feature map from the final convolutional layer, and $y^c$ the pre-softmax logit for class $c$:

$\alpha_k = \frac{1}{Z}\sum_{i,j}\frac{\partial y^c}{\partial A^k_{ij}}, \quad L_{GC,ij}^{(c)} = \mathrm{ReLU}\left(\sum_k \alpha_k A^k_{ij}\right)$

where $Z = H \cdot W$ .
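As a small worked sketch of these two formulas, the following pure-Python function builds the raw map from feature maps and gradients that are assumed to be given already (in practice the gradients come from automatic differentiation). The name `gradcam_map` and the toy numbers are illustrative only.

```python
def gradcam_map(feature_maps, gradients):
    """Return (alphas, cam) for one class.

    feature_maps[k][i][j] holds A^k_ij and gradients[k][i][j] holds
    d y^c / d A^k_ij; both are lists of K maps of size H x W.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    z = height * width  # Z = H * W
    # alpha_k: gradient averaged over all spatial positions of map k
    alphas = [sum(sum(row) for row in g) / z for g in gradients]
    # Weighted sum over channels followed by ReLU
    cam = [
        [
            max(0.0, sum(a * fm[i][j] for a, fm in zip(alphas, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    return alphas, cam


# Two 2x2 feature maps with constant toy gradients
A = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 0.0], [0.0, 2.0]]]
dY_dA = [[[0.5, 0.5], [0.5, 0.5]],
         [[-1.0, -1.0], [-1.0, -1.0]]]

alphas, cam = gradcam_map(A, dY_dA)
print(alphas)
print(cam)
```

In this toy case the second channel receives a negative weight, so its contribution at the bottom-right cell is clipped to zero by the ReLU and only the top-left cell remains active.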

Since the loss involves entropy of the normalized Grad-CAM map, and $L_{GC}^{(c)}$ in turn depends on gradients of the logit with respect to feature maps, the backward pass computes derivatives of derivatives (second-order gradients) with respect to the model parameters. In modern frameworks supporting higher-order differentiation, this is automatically handled, at the cost of increased computational overhead.
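To see where the second-order terms arise, the chain rule can be written out. This is a sketch under stated assumptions: all network parameters are collected in $\theta$, only positions where the ReLU is active and $\widetilde M_{ij}>0$ are considered, and $S=\sum_{k,\ell} L_{GC,k\ell}^{(c)}$ is shorthand introduced here.

$\frac{\partial c_e}{\partial \theta} = \sum_{i,j} \frac{\partial c_e}{\partial L_{GC,ij}^{(c)}}\,\frac{\partial L_{GC,ij}^{(c)}}{\partial \theta}, \quad \frac{\partial c_e}{\partial L_{GC,ij}^{(c)}} = -\frac{1}{S}\left(\log \widetilde M_{ij} + c_e\right)$

$\frac{\partial L_{GC,ij}^{(c)}}{\partial \theta} = \sum_k \left(\frac{\partial \alpha_k}{\partial \theta}\,A^k_{ij} + \alpha_k\,\frac{\partial A^k_{ij}}{\partial \theta}\right), \quad \frac{\partial \alpha_k}{\partial \theta} = \frac{1}{Z}\sum_{i,j}\frac{\partial^2 y^c}{\partial \theta\,\partial A^k_{ij}}$

The mixed derivative $\partial^2 y^c / \partial\theta\,\partial A^k_{ij}$ in the last expression is the second-order quantity that the framework has to trace.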

4. Definitions and Symbol Table

Symbol/Term	Meaning
$x$	Input image, $x\in\mathbb{R}^{H_{\rm in}\times W_{\rm in}\times C_{\rm in}}$
$A^k$	$k$-th feature map in last conv layer, $\in\mathbb{R}^{H\times W}$
$y^c$	Logit (pre-softmax score) for class $c$
$y_{\rm pred}$	Prediction vector, $\mathrm{softmax}(y)$
$y_{\rm true}$	One-hot ground truth
$\mathcal{L}_{CE}$	Cross-entropy loss
$\alpha_k$	Grad-CAM weight: gradient of $y^c$ averaged over all spatial positions of feature map $A^k$
$L_{GC,ij}^{(c)}$	Grad-CAM activation map (raw, unnormalized) for class $c$
$\widetilde M_{ij}$	Normalized Grad-CAM activation map
$c_e(L_{GC}^{(c)})$	Entropy of normalized Grad-CAM map
$\beta$	Weight for Grad-CAM entropy term

5. Training Procedure and Workflow

A condensed outline of the training loop integrating the Grad-CAM entropy term:

Initialize network parameters $\theta$; select $\beta \ge 0$.
For each epoch and mini-batch:
- Perform forward pass to compute feature maps $A^k$, logits $y$, and predictions.
- Compute standard cross-entropy loss.
- Compute the gradients of the ground-truth logit $y^c$ with respect to the feature maps, keeping them in the computation graph so they remain differentiable.
- Calculate the Grad-CAM weights $\alpha_k$ and construct the activation map $L_{GC}^{(c)}$.
- Normalize $L_{GC,ij}^{(c)}$ to obtain $\widetilde M_{ij}$.
- Evaluate the entropy $c_e$.
- Form the combined loss $\mathcal{L}_{CE} + \beta c_e$.
- Backpropagate (including required second-order gradients) and update parameters.

The key computational cost arises in the backward pass, where second-order derivatives are necessary due to the dependency of $c_e$ on $L_{GC}^{(c)}$, which itself is a function of gradients.
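The workflow above could be sketched in PyTorch roughly as follows. This is not the author's implementation: `TinyCNN`, `gradcam_entropy`, `train_step`, the small `eps` added for numerical stability, and the averaging of the entropy over the mini-batch are all assumptions made for illustration. The essential detail is `create_graph=True`, which keeps the logit gradients differentiable so the entropy term can be backpropagated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCNN(nn.Module):
    """Hypothetical stand-in classifier that also returns its last conv features."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(8 * 8 * 8, 16), nn.Tanh(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        feats = self.conv(x)       # A^k, shape (B, K, H, W)
        logits = self.head(feats)  # y, shape (B, C)
        return feats, logits


def gradcam_entropy(feats, logits, targets, eps=1e-8):
    """Batch-averaged entropy of the normalized Grad-CAM maps (eps is an assumption)."""
    # Sum of ground-truth logits; valid per sample because the head
    # contains no cross-sample operations such as batch normalization.
    class_logits = logits.gather(1, targets.unsqueeze(1)).sum()
    # create_graph=True makes the gradient itself part of the graph.
    grads, = torch.autograd.grad(class_logits, feats, create_graph=True)
    alpha = grads.mean(dim=(2, 3), keepdim=True)        # (B, K, 1, 1)
    cam = F.relu((alpha * feats).sum(dim=1)).flatten(1)  # (B, H*W)
    m = cam / (cam.sum(dim=1, keepdim=True) + eps)       # normalized map
    return -(m * torch.log(m + eps)).sum(dim=1).mean()


def train_step(model, optimizer, x, y, beta):
    """One update on the combined loss L_CE + beta * c_e."""
    optimizer.zero_grad()
    feats, logits = model(x)
    ce = F.cross_entropy(logits, y)  # applies log-softmax internally
    ent = gradcam_entropy(feats, logits, y)
    loss = ce + beta * ent
    loss.backward()                  # includes second-order terms
    optimizer.step()
    return ce.item(), ent.item()


torch.manual_seed(0)
model = TinyCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(4, 3, 8, 8)      # random stand-in images
y = torch.tensor([0, 1, 2, 0])   # random stand-in labels
print(train_step(model, optimizer, x, y, beta=1.0))
```

Setting `beta=0.0` in `train_step` still computes the entropy for monitoring but removes its influence on the update, which matches the reversion to conventional training described in Section 2.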

6. Empirical Findings on Interpretability and Classification

Experiments conducted on a ResNet-50 backbone with additional dense layers, using PASCAL VOC 2012 object crops, illustrate the quantitative effect of the entropy penalty:

- With $\beta=0$ (no penalty), test accuracy stabilized at ≈0.94, with relatively high-entropy, diffuse Grad-CAM maps.
- For $\beta=100$, test accuracy decreased marginally to ≈0.92, while CAM entropy dropped from ≈0.99 to ≈0.92, the “ellipsoidal area” $c_a$ contracted by ≈20%, and the “dispersion” $c_d$ increased six-fold, indicating more decisive, sharply focused attention maps.
- In an example for the class “dog”, Grad-CAM contrast increased by 114%, indicating markedly better localization on the relevant regions.
- Training was slower: higher $\beta$ values approximately doubled the duration of the backward pass because of second-order gradient tracing.

7. Applicability and Limitations

The approach can be incorporated into any standard CNN classifier without structural modification and applies to both deep and embedded architectures. The hyperparameter $\beta$ permits continuous adjustment between interpretability and prediction performance, with only minor degradation in classification accuracy in exchange for substantial gains in map localization and dispersion. The method does not introduce a Dice-style overlap loss; it targets the spatial entropy of the Grad-CAM explanation itself. The main cost is the slower backward pass caused by second-order gradients. This suggests that entropy penalization is a lightweight and effective means of encouraging more interpretable explanations from standard CNN-based image classifiers (Schöttl, 2020).


References

Schöttl (2020). A light-weight method to foster the (Grad)CAM interpretability and explainability of classification networks.
