- The paper presents a probabilistic framework that uses a latent variable to embed feature attribution directly into the training process, improving on CAM.
- It defines attribution maps based on likelihoods rather than post-hoc normalization, yielding clearer and more interpretable visual explanations.
- Empirical evaluations on WSOL and fine-grained benchmarks demonstrate CALM's superior ability to localize discriminative image features compared to CAM.
Enhancing Visual Feature Attribution with CALM
The paper "Keep CALM and Improve Visual Feature Attribution" presents a novel approach for improving the robustness and interpretability of Class Activation Mapping (CAM), a well-known technique in feature attribution. CAM is predominantly utilized in visual recognition tasks to highlight the influential regions of an input image responsible for the model's predictions. Despite its widespread application and simplicity, CAM suffers from several interpretability issues, primarily due to its reliance on ad-hoc calibrations outside the core computational graph. The authors address these limitations by introducing Class Activation Latent Mapping (CALM), which incorporates a probabilistic modeling approach to recalibrate CAM within the training framework.
Core Contributions
- Probabilistic Framework for Attribution: CALM introduces a latent variable Z, representing the cue location critical for recognition, into a probabilistic graphical model alongside the input X and the class label Y. The model is trained by maximizing the marginal likelihood over cue locations, either directly or with expectation-maximization (EM), which embeds the attribution process directly in the model's learning objective. This integration also allows attributions conditioned on the ground-truth class, improving human interpretability.
- Attribution Map Definition: Instead of deriving attribution values through post-hoc normalization as CAM does, CALM gives attribution values a probabilistic meaning: p(y, z | x) is the likelihood that class y is recognized from the cue at position z in image x (a minimal sketch of this forward pass appears after this list).
- Empirical Superiority over Baselines: Across visual recognition benchmarks such as weakly-supervised object localization (WSOL), CALM outperforms CAM at localizing the discriminative image features relevant to classification, particularly on fine-grained datasets.
- Independent Evaluation Metrics: The paper presents a rigorous experimental evaluation, adopting metrics that verify the quality of explanations beyond qualitative inspection. These include remove-and-classify, cue localization against ground-truth masks on datasets such as CUB-200-2011, and other fine-grained recognition benchmarks (a sketch of the remove-and-classify protocol also follows this list).
- Sanity and Axioms Compliance: CALM adheres to interpretability axioms better than CAM, addressing principles such as implementation invariance and sensitivity, which CAM struggles to satisfy because its heatmaps are derived from unnormalized, logit-scale feature maps.
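As referenced in the attribution-map bullet above, the following is a minimal sketch of a CALM-style forward pass, assuming a backbone that outputs spatial features and two 1x1-convolution heads: one yielding a per-location class distribution p(y | z, x) and one yielding a cue-location distribution p(z | x). Their product is the joint p(y, z | x), used directly as the attribution map, and its sum over locations is the marginal p(y | x), trained here with a negative log-likelihood (the paper also describes an EM variant). The class and function names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalmHead(nn.Module):
    """CALM-style attribution head (sketch)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.class_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.cue_head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor):
        b, _, h, w = features.shape
        # p(y | z, x): softmax over classes, computed independently at each location.
        p_y_given_z = F.softmax(self.class_head(features), dim=1)   # (B, K, H, W)
        # p(z | x): softmax over the H*W spatial locations.
        cue_logits = self.cue_head(features).view(b, -1)             # (B, H*W)
        p_z = F.softmax(cue_logits, dim=1).view(b, 1, h, w)          # (B, 1, H, W)
        # Joint p(y, z | x); summing it over z gives the marginal p(y | x).
        p_yz = p_y_given_z * p_z                                     # (B, K, H, W)
        p_y = p_yz.flatten(2).sum(dim=2)                             # (B, K)
        return p_yz, p_y

def calm_ml_loss(p_y: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Negative log marginal likelihood: -log p(y* | x) for the ground-truth class y*.
    return -torch.log(p_y.gather(1, targets[:, None]).clamp_min(1e-12)).mean()
```

Because p(y, z | x) already sums to one over classes and locations, the attribution map for a class y is read off directly as p_yz[:, y], with no min-max rescaling required.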
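The remove-and-classify evaluation mentioned above can be sketched as follows: delete the most strongly attributed pixels and measure how much the classifier's confidence in the target class drops; a steeper drop means the attribution pointed at more relevant evidence. The removal fractions, zero-fill, and confidence scoring below are simplifying assumptions, and the paper's exact protocol may differ in these details.

```python
import torch

def remove_and_classify(model, image, attribution, target, fractions=(0.1, 0.3, 0.5)):
    """Remove-and-classify (sketch).

    image:       (C, H, W) input tensor
    attribution: (H, W) attribution map, upsampled to image resolution
    target:      ground-truth class index
    Returns the model's confidence in the target class after each removal step.
    """
    scores = []
    flat = attribution.flatten()
    for frac in fractions:
        k = int(frac * flat.numel())
        # Zero out the top-k most-attributed pixels.
        top_idx = torch.topk(flat, k).indices
        mask = torch.ones_like(flat)
        mask[top_idx] = 0.0
        masked = image * mask.view(1, *attribution.shape)
        with torch.no_grad():
            prob = torch.softmax(model(masked[None]), dim=1)[0, target]
        scores.append(prob.item())
    return scores  # faster decay indicates a better attribution map
```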
Implications and Future Directions
The introduction of CALM enhances the dialogue between AI systems and human users by offering clear, probabilistically sound visual attributions. This is especially pertinent in contexts requiring transparency, such as critical decision-support systems and applications under regulatory scrutiny for explainability. CALM's framework not only enhances theoretical rigor but also opens avenues for future work integrating deeper probabilistic modeling into the core of neural network architectures.
Avenues for future research highlighted by the authors include:
- Exploring more complex latent variable structures that could encapsulate richer semantic contexts beyond individual cues.
- Extending the EM-based training framework to broader model classes, particularly architectures that do not rely on global average pooling.
- Application to other domains where interpretability and model transparency have heightened importance, such as medical imagery and autonomous decision-making systems.
CALM provides a robust alternative for visual feature attributions with clear empirical benefits, setting a new standard for both practical applications and theoretical developments in interpretable AI.