Class Activation Maps (CAMs) Overview

Updated 31 May 2026

Class Activation Maps (CAMs) are post hoc interpretability methods that generate spatial heatmaps indicating how CNN features contribute to specific class predictions.
Methodological extensions, such as Grad-CAM, Score-CAM, and ensemble fusion approaches, improve localization, noise resistance, and overall model faithfulness.
CAMs are evaluated using metrics like IoU, AUC, and noise-robustness, and are applied in diverse domains including clinical, industrial, and security settings.

Class Activation Maps (CAMs) are a class of post hoc interpretability techniques that produce spatial heatmaps reflecting the degree to which different regions of an input—typically an image—contribute to a convolutional neural network’s (CNN) prediction of a specific class. These methods are foundational in explainable artificial intelligence (XAI) for deep vision models, having evolved from the original linear-projection-based formulations to a diverse family of gradient-based, perturbation-based, ensemble, theoretical, and hybrid approaches that emphasize precise localization, robustness, faithfulness, and interpretability. CAMs are applied across scientific, clinical, industrial, and security domains, with ongoing developments targeting challenges in spatial resolution, faithfulness, noise resistance, and semantic understanding.

1. Core Principles and Mathematical Frameworks

Class Activation Mapping centers on the construction of a class-specific heatmap using a linear or nonlinear combination of deep convolutional features. In its canonical form, given $K$ feature maps $A^k \in \mathbb{R}^{H \times W}$ ($k = 1, \dots, K$) and a class index $c$, CAM produces

$L^c_{\mathrm{CAM}}(x) = \sum_{k=1}^K w^c_k A^k(x)$

where $w^c_k$ are class-specific weights. In the original approach, these are the weights of the fully connected classification layer that follows global average pooling (GAP), which restricts applicability to architectures with an explicit GAP layer (Minh, 2023). Grad-CAM generalizes CAM by computing $w^c_k$ as the spatial average of the gradient $\partial y^c / \partial A^k(i,j)$, which makes it compatible with arbitrary CNNs:

$w^c_k = \frac{1}{H W} \sum_{i,j} \frac{\partial y^c}{\partial A^k(i,j)}$

followed by the heatmap

$L^c_{\mathrm{Grad-CAM}}(x) = \operatorname{ReLU} \left( \sum_k w^c_k A^k(x) \right)$

(Uysal et al., 2023, Minh, 2023).
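The two formulas above can be illustrated with a minimal NumPy sketch. It assumes a toy setting that is not part of the cited works: random feature maps, hypothetical class weights `w_c`, and a GAP + linear head, for which the gradient $\partial y^c / \partial A^k(i,j)$ equals $w^c_k / (HW)$ at every location. Under that assumption Grad-CAM reduces to the rectified CAM map up to the constant factor $1/(HW)$.

```python
import numpy as np

def cam(activations, class_weights):
    """CAM: channel-weighted sum. activations: (K, H, W); class_weights: (K,)."""
    return np.tensordot(class_weights, activations, axes=1)  # shape (H, W)

def grad_cam(activations, gradients):
    """Grad-CAM: weights are spatially averaged gradients, followed by ReLU."""
    weights = gradients.mean(axis=(1, 2))  # w^c_k, shape (K,)
    return np.maximum(np.tensordot(weights, activations, axes=1), 0.0)

rng = np.random.default_rng(0)
K, H, W = 4, 3, 3
A = rng.random((K, H, W))               # toy feature maps (assumption)
w_c = np.array([0.5, -0.2, 1.0, 0.3])   # hypothetical FC weights for class c

# For a GAP + linear head, y^c = sum_k w^c_k * mean(A^k), hence
# dy^c/dA^k(i,j) = w^c_k / (H * W) at every spatial location.
grads = np.broadcast_to((w_c / (H * W))[:, None, None], (K, H, W))

cam_map = cam(A, w_c)
gc_map = grad_cam(A, grads)
```

In a real network the gradients would come from a backward pass through the framework in use rather than from a closed-form expression.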

Variants such as Grad-CAM++ introduce higher-order weighting for improved localization of multiple object instances (Minh, 2023, Sarkar et al., 25 Aug 2025). Score-CAM and Ablation-CAM avoid gradients altogether and compute $w^c_k$ from channel-wise masking and the resulting differences in class score. All of these methods ultimately form a channel-weighted sum of the feature maps, yielding a low-resolution class evidence map.
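As a gradient-free illustration, the sketch below computes Ablation-CAM-style weights, taking $w^c_k = (y^c - y^c_{\setminus k}) / y^c$, where $y^c_{\setminus k}$ is the class score with channel $k$ zeroed. The toy GAP + linear head `class_score` and the weights `w_c` are assumptions made for the example; the sketch also assumes a non-zero baseline score.

```python
import numpy as np

def class_score(activations, class_weights):
    """Toy head (assumption): global average pooling, then a linear layer."""
    return float(class_weights @ activations.mean(axis=(1, 2)))

def ablation_cam_weights(activations, score_fn):
    """w_k = (y - y_without_k) / y, with channel k set to zero."""
    base = score_fn(activations)  # assumed non-zero
    weights = np.empty(activations.shape[0])
    for k in range(activations.shape[0]):
        ablated = activations.copy()
        ablated[k] = 0.0
        weights[k] = (base - score_fn(ablated)) / base
    return weights

rng = np.random.default_rng(1)
A = rng.random((4, 5, 5))
w_c = np.array([0.8, 0.1, 0.6, 0.4])  # hypothetical positive class weights

weights = ablation_cam_weights(A, lambda a: class_score(a, w_c))
heatmap = np.maximum(np.tensordot(weights, A, axes=1), 0.0)
```

Each weight costs one extra forward pass, which is the source of the "many forward passes" cost of perturbation-based variants.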

2. Methodological Extensions and Unification

The CAM family now encompasses a spectrum of methodological innovations:

Gradient-Based Approaches: Grad-CAM and Grad-CAM++ employ backpropagated gradients as importance scores. Further refinements, such as XGrad-CAM and Layer-CAM, modify the weighting scheme for spatial or semantic specificity (Kaczmarek et al., 2023).
Perturbation-/Region-Based Approaches: Score-CAM, Ablation-CAM, and their derivatives estimate feature importance by perturbing input or intermediate activations and observing variations in class scores (Minh, 2023, Kaczmarek et al., 2023).
Ensemble and Metric-Guided Synthesis: MetaCAM fuses the outputs of multiple CAM variants by consensus voting over each map's most salient pixels. It selects the ensemble composition using the Cumulative Residual Effect (CRE) and applies adaptive thresholding to the individual CAMs; the resulting ensemble is reported to consistently outperform any single method under the perturbation-based Remove and Debias (ROAD) metric (Kaczmarek et al., 2023). SyCAM formalizes and automates the search for CAM expressions optimized for user-specified faithfulness or localization metrics, using syntax-guided program synthesis over a grammar of CAM-weight compositions (Luque-Cerpa et al., 14 Apr 2025).
Theoretical Attribution Models: Treating CAM as an additive linear explanation model allows it to be analyzed axiomatically; under this view, SHAP (Shapley Additive Explanations) values emerge as the unique, theoretically justified choice of channel weights. LIFT-CAM approximates these SHAP values efficiently with a single DeepLIFT-style backward pass (Jung et al., 2021).
Collaborative and Higher-Order Fusion: Conceptor-CAM embeds both inter-channel and intra-channel (pixel–pixel) relations by learning low-rank subspace projectors (Conceptors) over weighted feature maps and fuses positive and pseudo-negative channel evidence via Boolean algebra, leading to superior quantitative faithfulness over classical CAMs (Qian et al., 2022).
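The consensus idea behind ensemble fusion can be sketched as follows. This is a simplified illustration in the spirit of top-pixel voting, not MetaCAM's actual procedure: the function names, the fixed `fraction`, and the three toy maps are all assumptions, and no CRE-based selection or adaptive thresholding is performed.

```python
import numpy as np

def top_fraction_mask(heatmap, fraction):
    """Boolean mask of pixels at or above the (1 - fraction) quantile."""
    threshold = np.quantile(heatmap, 1.0 - fraction)
    return heatmap >= threshold

def consensus_map(heatmaps, fraction=0.2):
    """Per pixel, the share of component maps voting it into their top set."""
    votes = np.stack([top_fraction_mask(h, fraction) for h in heatmaps])
    return votes.mean(axis=0)

# Three hypothetical 2x2 maps from different CAM variants.
maps = [
    np.array([[0.9, 0.1], [0.2, 0.0]]),
    np.array([[0.8, 0.3], [0.1, 0.2]]),
    np.array([[0.1, 0.7], [0.2, 0.3]]),
]
fused = consensus_map(maps, fraction=0.25)
```

Here two of the three maps agree on the top-left pixel, so it receives the highest consensus value.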

3. Spatial Resolution, Faithfulness, and Robustness

Spatial resolution and faithfulness are intrinsic concerns due to information bottlenecking in late-network feature maps. Multiple strategies specifically address these limitations:

Multilayer and Multi-Scale Fusion: Poly-CAM performs recursive refinement by integrating high-resolution early-layer and low-resolution deep-layer activations via locally normalized spatial multiplication, significantly sharpening object boundaries and class-discriminative details (Englebert et al., 2022). CAMERAS generalizes this via input upsampling: aggregating maps across a spectrum of input resolutions and fusing gradients and activations at full input scale (Jalwana et al., 2021). Aggregated- and hierarchical-CAM approaches group and fuse CAMs computed at different semantic abstraction levels or across class clusters for global–local coverage (Huang et al., 2019, Cherepanov et al., 2023).
Noise-Resistant and Denoised CAMs: Grad-CAM++ empirically achieves the highest noise robustness among canonical CAMs under the proposed RM_c metric, which is the product of two terms: stability under class-preserving perturbations (“consistency”) and sensitivity to perturbations that change the prediction (“responsiveness”). The result holds across the datasets and models evaluated (Sarkar et al., 25 Aug 2025). Truncation-based denoising, as in LT-CAM and Fusion-CAM, discards lower-percentile or weak gradient activations and fuses across layers to suppress noise, improving semantic segmentation mIoU and coverage (Dong et al., 2023, Dekdegue et al., 5 Mar 2026).
Ensemble and Adaptive Fusion: Methods such as MetaCAM and Fusion-CAM leverage cross-method consensus and dynamic weighting, producing explanations that are simultaneously robust, precise, and faithful, in contrast to single-method artifacts or incomplete region coverage (Kaczmarek et al., 2023, Dekdegue et al., 5 Mar 2026).
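Percentile-based truncation can be sketched in a few lines. The rule below (zeroing everything under a chosen percentile of the positive values) and the default of 30 are illustrative assumptions; LT-CAM and Fusion-CAM define their own truncation criteria.

```python
import numpy as np

def truncate_low(values, percentile=30.0):
    """Zero out entries below the given percentile of the positive values."""
    positive = values[values > 0]
    if positive.size == 0:
        return np.zeros_like(values)
    cutoff = np.percentile(positive, percentile)
    return np.where(values >= cutoff, values, 0.0)

# Hypothetical 2x2 saliency map with two weak (noisy) responses.
noisy = np.array([[0.05, 0.9], [0.1, 0.6]])
denoised = truncate_low(noisy, percentile=50.0)
```

With the 50th percentile as the cutoff, the two weak responses are removed and the two strong ones are kept unchanged.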

4. Algorithmic Pipelines and Practical Implementations

The canonical CAM pipeline comprises:

Forward pass through a trained CNN to obtain deep feature maps.
For each class $c$, compute class-specific channel weights $w^c_k$ via one of: classifier weights (CAM), average spatial gradients (Grad-CAM, Grad-CAM++), region perturbation scores (Score-CAM, Ablation-CAM), or a metric-optimized synthesis (SyCAM).
Generate the raw saliency map $L^c$ as the (possibly nonlinear) sum $\sum_k w^c_k A^k$.
Apply rectification (typically $\operatorname{ReLU}$).
Upsample the heatmap to input resolution, typically via bilinear interpolation; for methods respecting empirical receptive fields, explicit Gaussian smoothing is used to prevent grid artifacts (Kim et al., 2020).
Threshold or morphologically process saliency maps for localization, segmentation, or detection; e.g., Otsu thresholding and contour analysis (Uysal et al., 2023).
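The post-processing steps of the pipeline above (rectification, upsampling, thresholding) can be sketched in NumPy. Several choices here are assumptions made for brevity: the helper names are hypothetical, the bilinear interpolation aligns corner pixels, and a fixed threshold of 0.5 stands in for Otsu thresholding and contour analysis.

```python
import numpy as np

def normalize(heatmap):
    """Rectify with ReLU, then min-max scale to [0, 1]."""
    rectified = np.maximum(heatmap, 0.0)
    span = rectified.max() - rectified.min()
    if span == 0:
        return np.zeros_like(rectified)
    return (rectified - rectified.min()) / span

def upsample_bilinear(heatmap, out_h, out_w):
    """Bilinear interpolation with corner pixels aligned."""
    in_h, in_w = heatmap.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = heatmap[y0][:, x0] * (1 - wx) + heatmap[y0][:, x1] * wx
    bottom = heatmap[y1][:, x0] * (1 - wx) + heatmap[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def bounding_box(mask):
    """(row_min, row_max, col_min, col_max) of True pixels, or None if empty."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

raw = np.array([[-1.0, 0.0], [0.0, 2.0]])   # toy 2x2 raw saliency map
up = upsample_bilinear(normalize(raw), 4, 4)
box = bounding_box(up >= 0.5)               # fixed threshold (assumption)
```

In this toy case the evidence sits in the bottom-right corner, so the resulting box covers rows 2 to 3 and columns 2 to 3 of the 4x4 output.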

Table: High-level comparison of key CAM variants (columns: Method, Weight definition $w^c_k$, Notable properties).

Method	$w^c_k$ definition	Localization/Computational property
CAM	classifier FC weights	GAP required, high precision; architecture-limited
Grad-CAM	spatially averaged gradient $\partial y^c / \partial A^k$	General, single backward pass
Grad-CAM++	higher-order, pixel-wise weighted gradient	Improved for multi-object, sharper
Score-CAM	score-diff on masked input	No gradients; many forward passes
Ablation-CAM	class score drop by ablation	No gradients; many forward passes
Layer-CAM	spatial gradient per-location + sum across layers	Layer fusion, fine details
Conceptor-CAM	low-rank collab/inter-channel, Boolean algebra	Highest faithfulness; extra matrix ops
SyCAM	program-synthesized, metric-optimized	Metric-adaptive, potentially slow to synthesize
Fusion-CAM	weighted, denoised fusion of Grad/Score CAMs	Best AD/AI, input-adaptive explainability

All methods either enable weakly supervised localization (no annotation required beyond class labels) or provide diagnostic visualizations for XAI.

5. Evaluation Metrics and Benchmarking

Several complementary evaluation metrics are standard for CAM assessment:

Intersection over Union (IoU): Overlap of predicted vs. ground-truth object regions; relevant for applications such as lesion or object localization (Uysal et al., 2023, Meng et al., 2019).
Pointing Game: Proportion of maps whose most-salient pixel falls within ground-truth objects; quantifies focus sharpness (Jalwana et al., 2021).
Faithfulness/Deletion–Insertion Curves (AUC): Measures changes in confidence as salient pixels are incrementally removed or inserted; high insertion and low deletion AUC reflect strong faithfulness (Englebert et al., 2022).
Average Drop (AD) and Increase in Confidence (IC): Average confidence loss, and frequency of confidence gain, when the model sees only the input masked by the saliency map; lower AD and higher IC indicate that the map retains the evidence the model relies on (Jung et al., 2021, Uysal et al., 2023, Dekdegue et al., 5 Mar 2026).
ADCC (Average Drop, Coherency, Complexity): Harmonic mean of coherence (in-place stability), sparsity, and minimal confidence drop; penalizes trivial or cheating explanations (Poppi et al., 2021).
Noise-Robustness Metric (RM_c): Product of consistency and responsiveness across noise perturbations; highest for Grad-CAM++ (Sarkar et al., 25 Aug 2025).
Density Metrics: Ratio of class confidence to support size of salient/unsalient map regions (Jalwana et al., 2021).
Success Rate (SR): For lesion detection, the fraction of ground-truth microobjects covered by at least one predicted box (Uysal et al., 2023).
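Several of the metrics listed above are simple to compute once a heatmap, a ground-truth mask, and model confidences are available. The sketch below covers IoU, the Pointing Game hit test, Average Drop, and Increase in Confidence. The function names, the 0.5 binarization threshold, and all numeric inputs are assumptions for illustration; AD and IC are expressed as percentages.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

def pointing_game_hit(heatmap, gt_mask):
    """True if the most salient pixel lies inside the ground-truth region."""
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(gt_mask[peak])

def average_drop(full_scores, masked_scores):
    """Mean relative confidence loss on masked inputs, in percent."""
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)
    return float(np.mean(np.maximum(0.0, full - masked) / full) * 100.0)

def increase_in_confidence(full_scores, masked_scores):
    """Share of samples whose confidence rises on masked inputs, in percent."""
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)
    return float(np.mean(masked > full) * 100.0)

heatmap = np.array([[0.1, 0.2, 0.0],
                    [0.3, 0.9, 0.6],
                    [0.0, 0.7, 0.2]])
gt = np.zeros((3, 3), dtype=bool)
gt[1:3, 1:3] = True                      # hypothetical ground-truth region

score_iou = iou(heatmap >= 0.5, gt)
hit = pointing_game_hit(heatmap, gt)
ad = average_drop([0.9, 0.8], [0.45, 0.9])
ic = increase_in_confidence([0.9, 0.8], [0.45, 0.9])
```

The Pointing Game score over a dataset is the fraction of samples for which `pointing_game_hit` returns True.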

Qualitative assessments routinely accompany these metrics to evaluate localization precision, coverage of object parts, and freedom from spurious highlights or artifacts.
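The deletion curve among the faithfulness metrics can be sketched as follows: pixels are removed in order of decreasing saliency and the area under the resulting confidence curve is computed, with lower values indicating a more faithful map. The stand-in `score_fn` (a scaled pixel sum), the single-channel image, the zero fill value, and the step count are all assumptions; a real evaluation would query the trained model.

```python
import numpy as np

def deletion_curve(image, heatmap, score_fn, steps=4):
    """Scores after progressively zeroing pixels, most salient first."""
    order = np.argsort(heatmap, axis=None)[::-1]
    flat = image.astype(float).ravel().copy()
    scores = [score_fn(flat.reshape(image.shape))]
    for chunk in np.array_split(order, steps):
        flat[chunk] = 0.0
        scores.append(score_fn(flat.reshape(image.shape)))
    return np.array(scores)

def normalized_auc(scores):
    """Trapezoidal area under the curve with the x-axis rescaled to [0, 1]."""
    return float(np.sum((scores[1:] + scores[:-1]) / 2.0) / (len(scores) - 1))

image = np.array([[1.0, 2.0], [3.0, 4.0]])

def score_fn(x):
    # Stand-in for a model's class confidence (assumption).
    return float(x.sum() / 10.0)

# A map that ranks pixels by their true contribution, and one that inverts it.
good_auc = normalized_auc(deletion_curve(image, image, score_fn))
bad_auc = normalized_auc(deletion_curve(image, -image, score_fn))
```

The insertion curve is the mirror image: pixels are added to a blank or blurred baseline in the same order, and a higher area is better.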

6. Advanced Topics: Semantic Alignment and Global Analysis

Recent CAM advancements extend beyond raw spatial attribution:

Vision-Language Integration: TextCAM aligns per-channel feature activations with CLIP-derived semantic representations, generating textual rationales for predicted classes and mapping explicit regions to descriptive visual attributes. This facilitates multidimensional interpretability and the detection of spurious correlations or biases (Zhao et al., 1 Oct 2025).
Global and Aggregated Explanations: Aggregated-CAM visualizes average and variability statistics over large sets of samples, enabling the discovery of features consistently predictive for specific classes and identifying confounding or volatile predictors. Interactive drill-down histograms allow domain experts to refine interpretations and suggest model/data adjustments (Cherepanov et al., 2023).
Class Grouping and Multi-Level Fusion: Incorporating hierarchical class groupings or selecting representative class pairs increases cue complementarity, reducing localist artifacts and improving coverage in weakly-supervised and small-network settings (Meng et al., 2019, Huang et al., 2019).
Boolean and multi-evidence algebra: Conceptor-CAM’s matrix algebra enables explicit positive, negative, and fused evidential maps, further bolstering coverage and background suppression (Qian et al., 2022).
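The averaging and variability statistics used for global explanations can be sketched by stacking per-sample heatmaps. This is only an illustration of the idea: the function name and toy maps are assumptions, and the aggregation actually used by Aggregated-CAM may differ.

```python
import numpy as np

def aggregate_maps(heatmaps):
    """Pixel-wise mean and standard deviation over a set of heatmaps."""
    stack = np.stack(heatmaps)            # shape (N, H, W)
    return stack.mean(axis=0), stack.std(axis=0)

# Hypothetical heatmaps for three samples of the same class.
samples = [
    np.array([[1.0, 0.0], [0.0, 0.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[1.0, 0.0], [0.0, 0.0]]),
]
mean_map, std_map = aggregate_maps(samples)
```

A pixel with a high mean and low deviation (top-left here) is consistently used for the class, while a high deviation (bottom-right) flags a volatile predictor.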

7. Limitations, Open Directions, and Application Domains

Key acknowledged limitations include:

Spatial Resolution Bottlenecks: All post hoc CAMs suffer from upsampling and coarse spatial granularity in late-layer features. Resolution-enhancing techniques (e.g., fusion, denoising) mitigate but do not eliminate these issues.
Class Overlap, Multi-Label, and Small Object Detection: CAMs can have difficulty with overlapping or small target regions, as highlighted in weakly supervised lesion detection where mAP remains low but alternative coverage metrics provide complementary evaluation (Uysal et al., 2023).
Generalization and Robustness: Transformer-based CAMs show high variance in interpretability robustness, and modality transfer (e.g., to CT, MRI) remains underexplored (Sarkar et al., 25 Aug 2025).
Computational Cost: Ensemble and synthesis approaches (e.g., MetaCAM, SyCAM) can require significant compute for large model or metric sets.
Semantic Interpretability: Raw saliency does not guarantee semantic or human-understandable explanations; embedding-based approaches are an active area (Zhao et al., 1 Oct 2025).

CAMs are heavily deployed in safety-critical and high-stakes domains, including plant pathology (Uysal et al., 2023), radiology, biometric authentication, autonomous vehicles, and malware analysis, where both local and global explanations are essential for trustworthy deployment (Kaczmarek et al., 2023, Cherepanov et al., 2023).

Ongoing research aims to integrate CAMs with vision-language systems, improve weak supervision through dynamic class grouping, enhance robustness and domain transfer, and formalize diagnostic tools that support human-in-the-loop refinement and large-scale benchmarking (Minh, 2023).