Class Activation Maps (CAMs)
- Class Activation Maps (CAMs) are visualization techniques that highlight the most influential image regions for a CNN’s decision.
- Later variants add multi-layer feature fusion and attribution theory, such as SHAP values approximated with DeepLIFT, to improve attribution accuracy.
- Recent work addresses noise sensitivity and adversarial vulnerability, and introduces ensemble and metric-driven synthesis techniques.
Class Activation Maps (CAMs) are a class of visualization methods used to interpret convolutional neural networks by highlighting the image regions most influential to the network’s class decision. Since the introduction of the original CAM technique, the field has advanced rapidly with diverse methodologies, theoretical models, and practical applications spanning weakly supervised learning, model attribution, diagnostic explainability, and video analysis. The following sections present a technical synthesis of CAM methodologies, theoretical evolutions, improvements, and empirical findings grounded in the current research landscape.
1. Foundations and Evolution of CAM Methods
The canonical CAM approach projects the output-layer weights of a convolutional neural network (CNN) back onto its final convolutional feature maps to produce a class-specific heatmap. For a given class $c$, this takes the form:

$$M_c(x, y) = \sum_k w_k^c \, A_k(x, y)$$

where $A_k$ are the feature maps, and $w_k^c$ are the learned weights for class $c$.
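The weighted sum above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, array shapes, and random toy data are assumptions, and the classifier is assumed to be a bias-free linear layer applied after global average pooling.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a CAM as the weighted sum of feature maps for one class.

    feature_maps: array of shape (K, H, W), final convolutional activations.
    fc_weights:   array of shape (num_classes, K), linear-layer weights.
    """
    # M_c(x, y) = sum_k w_k^c * A_k(x, y)
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

def normalize_map(cam, eps=1e-8):
    """Rescale a map to [0, 1] for display."""
    cam = cam - cam.min()
    return cam / (cam.max() + eps)

# Toy data (hypothetical): 4 channels on a 7x7 grid, 3 classes.
rng = np.random.default_rng(0)
A = rng.random((4, 7, 7))
W = rng.standard_normal((3, 4))

cam = class_activation_map(A, W, class_idx=1)

# With global average pooling and no bias, the class logit is the spatial mean of the CAM.
logit = W[1] @ A.mean(axis=(1, 2))
```

The last line shows why the map is tied to the decision: under these assumptions, averaging the CAM over all locations recovers the class logit.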
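Grad-CAM removed CAM's architectural restriction (the need for a global-average-pooling layer followed by a fully connected layer) by using gradients to infer the importance of each feature-map channel, generating class-specific maps as:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A_k\Big), \qquad \alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_k(i, j)}$$

where $A_k$ are the feature maps, $y^c$ is the score for class $c$, and $Z$ is the number of spatial locations. Subsequent refinements include Grad-CAM++ (weighted averaging of positive gradients), Score-CAM (gradient-free, score-based channel importance), and LayerCAM/XGradCAM, which enhance spatial precision or generalize the approach (Minh, 2023).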
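As a rough sketch of the Grad-CAM weighting, the function below takes activations and their gradients as plain arrays, so no autodiff framework is needed. The toy head (global average pooling plus a linear layer) and all names are assumptions chosen so the gradient can be written analytically; in practice the gradients would come from backpropagation.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM from precomputed arrays.

    activations: (K, H, W) feature maps A_k.
    gradients:   (K, H, W) values of d y^c / d A_k(i, j).
    Returns the ReLU-ed map and the channel weights alpha_k^c.
    """
    alpha = gradients.mean(axis=(1, 2))  # average gradient per channel
    cam = np.tensordot(alpha, activations, axes=([0], [0]))
    return np.maximum(cam, 0.0), alpha

# Toy head (assumed): y^c = sum_k w_k^c * mean_xy A_k, so the gradient
# with respect to every location of A_k is w_k^c / (H * W).
rng = np.random.default_rng(0)
K, H, Wd = 4, 5, 5
A = rng.random((K, H, Wd))
w_c = rng.standard_normal(K)
grads = np.broadcast_to((w_c / (H * Wd))[:, None, None], A.shape)

cam, alpha = grad_cam(A, grads)
```

For this particular head, the Grad-CAM weights equal the CAM weights up to the constant factor `1 / (H * Wd)`, which illustrates how Grad-CAM generalizes the original formulation.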
Recent methodologies such as CAMERAS integrate multi-scale accumulation and upsampling to produce high-resolution, sanity-preserving saliency maps, directly addressing the spatial coarseness of deep-layer activations (Jalwana et al., 2021). F-CAM attaches parametric, trainable decoders to recover full-resolution localization maps, leveraging not only the feature structure but foreground–background priors and image statistics (Belharbi et al., 2021).
2. Theoretical Developments and Feature Attribution Models
CAM methods have increasingly been recast within formal frameworks. One significant development is the modeling of CAM as a linear additive explanation system in which the explanation map is a weighted sum of feature maps. The theoretical insight is that, for fixed feature activations, only the choice of weighting coefficients determines the explanation (Jung et al., 2021). In this light, SHAP values, which arise from cooperative game theory, emerge as the unique solution that satisfies local accuracy, missingness, and consistency in the attributions. The connection is formalized as:

$$L^c = \mathrm{ReLU}\Big(\sum_k \alpha_k A_k\Big)$$

where $\alpha_k$ may be the SHAP value for the $k$-th activation map $A_k$.
LIFT-CAM builds upon this theory by approximating the SHAP values efficiently using DeepLIFT, obviating the need for prohibitive combinatorial enumeration while retaining the theoretical guarantees of attribution faithfulness. Quantitative and qualitative evaluations across benchmarks like ImageNet and PASCAL VOC show that LIFT-CAM produces explanation maps that capture the truly influential image regions and outperform baseline methods in metrics such as Increase in Confidence, Average Drop, Insertion/Deletion AUC, and localization error (Jung et al., 2021).
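To make the cost of exact enumeration concrete, the sketch below computes exact Shapley values by iterating over every coalition of players. It is not LIFT-CAM or DeepLIFT; the value function `toy_value` and its numbers are hypothetical stand-ins for "the class score when only a subset of activation maps is kept".

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n_players):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    players = list(range(n_players))
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(n_players):
            # Shapley weight for coalitions of this size that exclude player i
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical value function over 3 "activation maps": additive
# contributions plus an interaction bonus when maps 0 and 1 are both present.
contrib = [0.5, 0.2, -0.1]

def toy_value(subset):
    v = sum(contrib[k] for k in subset)
    if 0 in subset and 1 in subset:
        v += 0.4
    return v

phi = exact_shapley(toy_value, 3)
```

The interaction bonus is split evenly between maps 0 and 1, and the values sum to the full-coalition score (local accuracy). The number of value-function calls grows exponentially with the number of maps, which is why a layer with hundreds of channels calls for an approximation.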
Further, ShapleyCAM formalizes the relationship between CAM and Shapley value–based explainers by modeling CNN predictions as cooperative games among pixel “players” aggregated over activation maps. A closed-form second-order Taylor approximation involving gradients and Hessians enables efficient Shapley value estimation for visual explanations, accommodating both the game-theoretic and content-reserving properties of attribution (Cai, 9 Jan 2025).
3. Enhanced Model Discrimination and Multi-Layer Fusion Techniques
A recognized limitation of standard CAM is its tendency to produce only localized, incomplete region activations—often corresponding to the most discriminative (not necessarily semantic) object parts. Multiple avenues have addressed this:
- Representative Class Selection and Clustering: Rather than contrasting the target class against all others, representative class selection via similarity clustering (using false positive probabilities) enables complementary binary classifiers to generate CAMs that jointly span global object regions. A class-similarity matrix is built from the predicted softmax outputs and guides an improved k-means clustering. This yields class pairings that produce more global, precise, and complementary activation regions and facilitate application to small networks, validated with significant mIoU improvements on PASCAL VOC and COCO (Meng et al., 2019).
- Multi-Layer Feature Fusion: Fusing activation maps across multiple network layers (e.g., at each ResNet block or before each VGG max pooling) integrates high-level semantic information with low-level spatially detailed cues. Aggregation is typically performed via sum or weighted sum, and sometimes multiplicative combinations with the final layer CAM, yielding activation maps that highlight both broad and fine object structure. Multi-layer fusion not only enhances localization but is critical for small networks where depth is limited (Meng et al., 2019).
- Hierarchical Class Grouping and Orthogonal Constraints: Training multiple classification models corresponding to hierarchical class clusters, each producing diverse, level-specific CAMs, and integrating their outputs can further enrich discriminative content. Orthogonal constraints between feature extraction branches, imposed through an orthogonality loss, enforce feature diversity, minimizing redundancy and yielding more complete activation (Huang et al., 2019).
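The multi-layer fusion idea in the list above can be sketched as follows: maps from layers of different resolution are upsampled to a common size, normalized, and combined by a weighted sum. The nearest-neighbor upsampling, per-map min-max normalization, and equal default weights are assumptions made for illustration and are not taken from the cited work.

```python
import numpy as np

def upsample_nearest(cam, out_hw):
    """Nearest-neighbor upsampling of a 2D map to shape out_hw."""
    h, w = cam.shape
    H, W = out_hw
    rows = (np.arange(H) * h) // H
    cols = (np.arange(W) * w) // W
    return cam[np.ix_(rows, cols)]

def minmax(cam, eps=1e-8):
    """Rescale a map to [0, 1]."""
    cam = cam - cam.min()
    return cam / (cam.max() + eps)

def fuse_cams(cams, out_hw, weights=None):
    """Weighted sum of per-layer maps after upsampling and normalization."""
    if weights is None:
        weights = np.ones(len(cams)) / len(cams)  # equal weights (assumed)
    fused = np.zeros(out_hw)
    for w, cam in zip(weights, cams):
        fused += w * minmax(upsample_nearest(cam, out_hw))
    return minmax(fused)

# Toy maps standing in for shallow (28x28), middle (14x14), and deep (7x7) layers.
rng = np.random.default_rng(0)
cams = [rng.random((28, 28)), rng.random((14, 14)), rng.random((7, 7))]
fused = fuse_cams(cams, out_hw=(28, 28))
```

A multiplicative variant would multiply the fused map elementwise by the upsampled final-layer CAM instead of summing.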
4. Integration of Channel Collaboration, Feature Decomposition, and Boolean Reasoning
Conventional CAM channel fusion treats channels as independent contributors or uses contrastive weighting, but ignores intra- and inter-channel collaboration:
- Conceptor Learning Integration: Conceptor matrices, learned via ridge regression over channel-weighted feature activations, synchronize feature vectors to capture both inter-channel correlation and intra-channel co-reconstruction. The resulting saliency map is computed from the conceptor matrix, the ridge-regression solution built on the inter-channel correlation matrix. The method supports Boolean operations (combining positive and pseudo-negative evidence via NOT/OR) that yield comprehensive and robust region attributions, validated by improvements of up to 168% in Average Drop metrics across large-scale datasets (Qian et al., 2022).
- Feature-Level SVD Decomposition: By decomposing activation maps using singular value decomposition (SVD), Decom-CAM produces orthogonal feature maps (OFMs) that isolate distinct, non-overlapping semantic parts (e.g., eyes, ears, wings) in the image. This feature-level interpretability clarifies which spatial regions anchor the decision and enables thorough diagnosis of model failures or biases (Yang et al., 2023).
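The SVD-based decomposition described above can be illustrated with a short NumPy sketch. It flattens the activation tensor into a channels-by-pixels matrix and reads the right singular vectors as orthonormal spatial patterns. This is only the linear-algebra core under assumed shapes; any preprocessing and the way Decom-CAM turns these components into saliency maps are not reproduced here.

```python
import numpy as np

def orthogonal_feature_maps(activations, n_components=3):
    """Decompose (K, H, W) activations into orthonormal spatial components.

    Returns the leading components, shape (n_components, H, W), and their
    singular values.
    """
    K, H, W = activations.shape
    flat = activations.reshape(K, H * W)
    # flat = U @ diag(S) @ Vt; the rows of Vt are orthonormal spatial patterns.
    U, S, Vt = np.linalg.svd(flat, full_matrices=False)
    maps = Vt[:n_components].reshape(n_components, H, W)
    return maps, S[:n_components]

# Toy activations (hypothetical): 8 channels on a 6x6 grid.
rng = np.random.default_rng(0)
A = rng.random((8, 6, 6))
ofms, sing = orthogonal_feature_maps(A, n_components=3)
```

Because the components are mutually orthogonal, each one captures a spatial pattern that the others do not, which is the property that supports part-level interpretation.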
5. Adaptations and Applications: Video, Small Data, Ensembles, and Metric-Based Synthesis
CAMs have been adapted for diverse contexts:
- Temporal Video Analysis (TCAM): TCAM extends class activation mapping to videos by temporally aggregating framewise CAMs using max pooling over recent frames (CAM-TMP), generating pseudo-labels for U-Net style decoders. This improves object region coverage in the temporal domain, with real-time inference supported by per-frame independence (Belharbi et al., 2022).
- Outcome-Agnostic and Small Data Scenarios (BroadCAM): Outcome-agnostic weighting using a Broad Learning System (BLS)—a flat architecture with feature/enhanced mapping and ridge regression—decouples CAM reliability from unstable outcomes in small datasets. This ensures the activation maps are both robust and class-informative even at low data scales (<5%), outperforming gradient or score-based methods (Lin et al., 2023).
- Ensemble Interpretability (MetaCAM): MetaCAM combines multiple CAMs via top-$k$% pixel consensus and adaptive thresholding to refine saliency precision and trustworthiness. The Cumulative Residual Effect (CRE) quantifies each component CAM's contribution to the ensemble. Experiments confirm significantly higher ROAD interpretability scores compared to individual methods (Kaczmarek et al., 2023).
- Metric-Guided Synthesis (SyCAM): SyCAM employs syntax-guided search over a defined template/grammar of CAM weighting expressions, optimizing for a specified evaluation metric (e.g., Deletion/Insertion, ground-truth overlap). Automated synthesis discovers expressions that match or surpass established CAMs, enabling user-driven, context-adaptive explanation (Luque-Cerpa et al., 14 Apr 2025).
- Out-of-Distribution Detection with Multi-Exit Networks (MECAM): By aggregating multi-exit CAMs at varying depths, MECAM leverages global and local features. Masking input images with inverted CAMs and assessing mean squared distance in intermediate embeddings robustly separates in-distribution from out-of-distribution medical images, outperforming prior OOD detection methods (Chen et al., 13 May 2025).
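Two of the adaptations in the list above lend themselves to short sketches: temporal max pooling of framewise CAMs and top-k% pixel consensus across several CAMs. Both functions are simplified illustrations with assumed names and parameters (`window`, `k_percent`, `min_votes`); they do not reproduce TCAM's decoder training or MetaCAM's adaptive thresholding.

```python
import numpy as np

def temporal_max_pool(frame_cams, window):
    """For each frame t, take the pixelwise max over the last `window` frames.

    frame_cams: array of shape (T, H, W).
    """
    T = frame_cams.shape[0]
    out = np.empty_like(frame_cams)
    for t in range(T):
        start = max(0, t - window + 1)
        out[t] = frame_cams[start:t + 1].max(axis=0)
    return out

def topk_mask(cam, k_percent):
    """Boolean mask of pixels at or above the (100 - k)th percentile."""
    thresh = np.percentile(cam, 100.0 - k_percent)
    return cam >= thresh

def consensus_map(cams, k_percent, min_votes):
    """Keep the mean map only where at least `min_votes` CAMs agree on top-k% pixels."""
    votes = sum(topk_mask(c, k_percent).astype(int) for c in cams)
    keep = votes >= min_votes
    return np.mean(cams, axis=0) * keep

# Usage with toy data: three 1x1 "frames" and a small 2x5 map.
pooled = temporal_max_pool(np.array([3.0, 1.0, 2.0]).reshape(3, 1, 1), window=2)
base = np.arange(10, dtype=float).reshape(2, 5)
agree = consensus_map([base, base], k_percent=20, min_votes=2)
```

Only past and current frames enter the temporal pooling, so it can run online; the consensus step suppresses regions that a single CAM method highlights in isolation.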
6. Robustness, Adversarial Vulnerabilities, and Future Directions
While CAMs are widely adopted, several limitations are actively addressed in recent research:
- Noise and Adversarial Robustness: CAM explanations are variably susceptible to noise perturbations and adversarial manipulation. A recently proposed robustness metric analyzes both consistency (stability of CAM ranking under invariant predictions) and responsiveness (appropriate CAM adaptation when predictions change), scored by rank-biased overlap (RBO) and AUC of class change discrimination. Grad-CAM++ emerges as especially robust, while methods like EigenCAM, though sometimes visually stable, lack discrimination when class outcomes shift (Sarkar et al., 25 Aug 2025).
- Contrastive Attribution for Softmax Networks (DiffGradCAM): CAMs can be passively fooled (via adversarial training to produce misleading maps with intact decision performance) due to their single-logit focus. DiffGradCAM computes gradients w.r.t. the contrastive logit difference (true class minus an aggregator over false classes), robustly aligning attributions with the decision boundary and resisting passive fooling. SHAMs (Salience-Hoax Activation Maps) formalize a benchmark for such attack scenarios (Piland et al., 10 Jun 2025).
- Open Questions and Advancements: Ongoing work aims to further refine multi-scale fusion, improve class grouping, integrate orthogonal constraints cross-layer, automate CAM expression synthesis for user-defined metrics, and harmonize CAM quality with uncertainty estimation, feature decomposition, and ensemble interpretability. Applications span real-time video analytics, robust safety-critical systems, and low-data contexts.
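Two quantities from the list above can be sketched in plain Python: a contrastive logit (target logit minus an aggregate of the other logits) and a truncated rank-biased overlap between two rankings. The text only specifies "an aggregator over false classes", so the `max` and `logsumexp` options here are assumptions; the RBO function is the standard truncated form and assumes rankings without duplicate items.

```python
import math

def contrastive_logit(logits, target, aggregator="logsumexp"):
    """Target logit minus an aggregate of the non-target logits."""
    others = [z for j, z in enumerate(logits) if j != target]
    if aggregator == "max":
        agg = max(others)
    elif aggregator == "logsumexp":
        m = max(others)  # subtract the max for numerical stability
        agg = m + math.log(sum(math.exp(z - m) for z in others))
    else:
        raise ValueError("unknown aggregator: " + aggregator)
    return logits[target] - agg

def rank_biased_overlap(ranking_a, ranking_b, p=0.9, depth=None):
    """Truncated RBO: (1 - p) * sum_d p^(d-1) * |A_d intersect B_d| / d."""
    if depth is None:
        depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score

# Usage: identical rankings of region indices score 1 - p**depth under truncation.
same = rank_biased_overlap([1, 2, 3], [1, 2, 3], p=0.5)
margin = contrastive_logit([2.0, 1.0, 0.0], target=0, aggregator="max")
```

Gradients taken with respect to the contrastive logit, rather than a single class logit, depend on the competing classes as well, which is the property the contrastive formulation relies on.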
7. Mathematical Formulations and Key Evaluation Metrics
CAM variants are distinguished by their weighting schemes, attribution models, and evaluation criteria. Table 1 summarizes salient mathematical notations and evaluation metrics:
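| CAM Variant / Technique | Mathematical Formulation | Special Notes |
|---|---|---|
| Original CAM | $M_c(x, y) = \sum_k w_k^c A_k(x, y)$ | Needs GAP+FC layer |
| Grad-CAM | $L^c = \mathrm{ReLU}(\sum_k \alpha_k^c A_k)$, $\alpha_k^c$ = avg. gradient | |
| LIFT-CAM (SHAP approx.) | $\alpha_k \approx$ SHAP value of $A_k$, estimated with DeepLIFT | Single backward pass |
| Representative Class Selection | Class-similarity matrix from false-positive softmax probabilities, k-means clustering over the similarity matrix | Builds binary classifiers for pairs |
| SVD Decomposition (Decom-CAM) | $A = U \Sigma V^\top$, project with the singular vectors to obtain OFMs | Feature-level interpretability |
| Multi-Exit OOD (MECAM) | CAMs aggregated over exits, mean squared distance between intermediate embeddings of original and inverted-CAM-masked inputs | OOD by masking salient regions |
| Robustness Metric (RM) | RBO (consistency), AUC of class-change discrimination (responsiveness) | |
| DiffGradCAM | $\Delta_c = z_c - \mathrm{agg}_{j \neq c}\, z_j$, attribute gradients w.r.t. $\Delta_c$ | Contrastive, adversarial-resistant |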
Major evaluation metrics include mIoU (mean Intersection over Union), Average Drop, Increase in Confidence, Insertion/Deletion AUC, pointing game accuracy, and advanced robustness metrics based on region ranking overlaps (Sarkar et al., 25 Aug 2025).
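Two of these metrics, Average Drop and Increase in Confidence, follow directly from their usual definitions and can be sketched as below. The inputs are assumed to be class scores for the full image and for the image masked by the explanation map; the function names and sample numbers are illustrative, and full-image scores are assumed to be strictly positive.

```python
import numpy as np

def average_drop(full_scores, masked_scores):
    """Mean relative drop in class score, in percent (lower is better).

    Per sample: max(0, Y - O) / Y, where Y is the score on the full image and
    O is the score on the explanation-masked image.
    """
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)
    return 100.0 * np.mean(np.maximum(0.0, full - masked) / full)

def increase_in_confidence(full_scores, masked_scores):
    """Percentage of samples whose score rises when only the explained region is shown."""
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)
    return 100.0 * np.mean(masked > full)

# Hypothetical scores for three images.
full = [0.8, 0.5, 0.4]
masked = [0.4, 0.6, 0.4]
drop = average_drop(full, masked)
inc = increase_in_confidence(full, masked)
```

In this toy case only the first image loses confidence (by half), so the Average Drop is one sixth of 100%, and one image in three gains confidence.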
In summary, Class Activation Maps are a rapidly evolving area at the intersection of interpretability, feature attribution, adversarial robustness, and weakly supervised learning. Theoretical advances now underpin practice; integration and fusion techniques yield explanations with higher spatial precision and semantic completeness; and emerging frameworks support adaptive, metric-driven synthesis, robust aggregation, and extension to temporal, small-sample, and medical OOD settings. The field continues to expand with open questions regarding uncertainty, composability, explainability metrics, and the harmonization of interpretation with downstream task performance.