LaFAM: Unsupervised Feature Attribution with Label-free Activation Maps (2407.06059v2)

Published 8 Jul 2024 in cs.CV

Abstract: Convolutional Neural Networks (CNNs) are known for their ability to learn hierarchical structures, naturally developing detectors for objects and semantic concepts within their deeper layers. Activation maps (AMs) reveal these saliency regions, which are crucial for many Explainable AI (XAI) methods. However, the direct exploitation of raw AMs in CNNs for feature attribution remains underexplored in the literature. This work revises Class Activation Map (CAM) methods by introducing the Label-free Activation Map (LaFAM), a streamlined approach utilizing raw AMs for feature attribution without reliance on labels. LaFAM presents an efficient alternative to conventional CAM methods, demonstrating particular effectiveness in saliency map generation for self-supervised learning while maintaining applicability in supervised learning scenarios.

Summary

Introduction

The paper "LaFAM: Unsupervised Feature Attribution with Label-free Activation Maps" addresses feature attribution in Convolutional Neural Networks (CNNs), a core problem in Explainable AI (XAI). Activation maps (AMs) highlight salient regions in CNNs and underpin many XAI methods, yet the direct use of raw AMs for feature attribution has rarely been explored in the literature. LaFAM is proposed as an efficient, label-free alternative to traditional Class Activation Map (CAM) techniques. It is particularly effective for self-supervised learning (SSL) and remains applicable in supervised settings, because it generates saliency maps from raw AMs alone and needs no labels.

Methodology

LaFAM is a post hoc method that generates a saliency map by aggregating the activations of a single convolutional layer. It is computationally efficient: a single forward pass suffices to compute the activations. The AMs $A^{l,k}$ of layer $l$ are averaged across its $K$ channels at each spatial position $(i, j)$, which preserves the spatial coherence needed for saliency mapping:

$\bar{A}^l_{i,j} = \frac{1}{K}\sum_{k=1}^{K} A_{i,j}^{l,k}$

The averaged map is then min-max normalized and upsampled to the input image size, giving the final saliency map:

$M_\text{LaFAM} = \text{Up} \left( N\left(\bar{A}^l\right) \right)$

where $N$ denotes min-max normalization, and $\text{Up}$ represents upsampling.
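
The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `lafam_saliency`, the `eps` guard against division by zero, and the nearest-neighbour upsampling are all assumptions made here, since the summary does not say which interpolation $\text{Up}$ uses.

```python
from typing import Tuple

import numpy as np


def lafam_saliency(activations: np.ndarray,
                   out_hw: Tuple[int, int],
                   eps: float = 1e-8) -> np.ndarray:
    """Sketch of a LaFAM-style saliency map (hypothetical helper).

    activations: raw activation maps of one conv layer for one image,
                 shape (K, H, W), where K is the number of channels.
    out_hw:      (height, width) of the input image.
    """
    # Channel-wise mean: A_bar[i, j] = (1 / K) * sum_k A[k, i, j]
    mean_map = activations.mean(axis=0)

    # N(.): min-max normalization to [0, 1]; eps is an assumed guard
    # for the degenerate case of a constant map.
    lo, hi = mean_map.min(), mean_map.max()
    norm = (mean_map - lo) / (hi - lo + eps)

    # Up(.): nearest-neighbour upsampling to the input size.
    # ASSUMPTION: the interpolation method is not given in the text.
    out_h, out_w = out_hw
    rows = np.arange(out_h) * norm.shape[0] // out_h
    cols = np.arange(out_w) * norm.shape[1] // out_w
    return norm[np.ix_(rows, cols)]


# Usage with random stand-in activations (e.g. a 7x7 final conv layer).
rng = np.random.default_rng(0)
acts = rng.random((2048, 7, 7))
saliency = lafam_saliency(acts, (224, 224))
print(saliency.shape)  # (224, 224)
```

In practice the activations would come from a forward pass through the trained network, for example via a forward hook on the chosen layer. No labels and no backward pass are involved.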

Experimental Evaluation

The method was evaluated with SSL frameworks such as SimCLR and SwAV, as well as a supervised ResNet50, on the ImageNet-1k and PASCAL VOC 2012 datasets. LaFAM was compared with RELAX in the SSL setting and with Grad-CAM in the supervised setting.

LaFAM consistently outperforms RELAX in the SSL setting, scoring higher on all evaluated metrics, including Pointing-Game and Sparseness (Tables 1 and 2). This suggests that LaFAM produces precise, focused saliency maps, especially for scenes containing small objects.
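
For readers unfamiliar with the Pointing-Game metric, the sketch below shows its simplest common form: a saliency map scores a hit when its maximum falls inside the ground-truth object region, and accuracy is the fraction of hits. The function names and the binary-mask input are assumptions for illustration. The paper's exact evaluation protocol (for example, any pixel tolerance around the maximum) is not described in this summary.

```python
import numpy as np


def pointing_game_hit(saliency: np.ndarray, mask: np.ndarray) -> bool:
    """True if the most salient pixel lies inside the object mask.

    saliency: (H, W) saliency map.
    mask:     (H, W) boolean ground-truth object mask (assumed input format).
    """
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(mask[row, col])


def pointing_game_accuracy(saliency_maps, masks) -> float:
    """Fraction of images whose saliency maximum hits the object."""
    hits = [pointing_game_hit(s, m) for s, m in zip(saliency_maps, masks)]
    return float(np.mean(hits))


# Toy example: one hit and one miss.
sal_hit = np.array([[0.1, 0.2], [0.3, 0.9]])
sal_miss = np.array([[0.9, 0.2], [0.3, 0.1]])
obj_mask = np.array([[False, False], [False, True]])
print(pointing_game_accuracy([sal_hit, sal_miss], [obj_mask, obj_mask]))  # 0.5
```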

Figure 1: Comparison of saliency maps for scenes with two distinct objects. Left-hand labels indicate the ImageNet labels predicted by the ResNet50 classifier.

Compared to Grad-CAM in supervised scenarios, LaFAM remains competitive across several metrics. While Grad-CAM achieves a higher Sparseness score, indicating less scattered explanations, LaFAM's broader feature