How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations (2503.00641v1)

Published 1 Mar 2025 in cs.CV

Abstract: Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency on and demonstrate that the training details of a pre-trained model's classification layer (less than 10 percent of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are interpreted eventually. With this finding we also present simple yet effective adjustments to the classification layers, that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.

Summary

  • The paper finds that the quality of post-hoc explanations for DNNs largely depends on the training of the classification layer, not just the pre-training paradigm.
  • Replacing Cross-Entropy with Binary Cross-Entropy for training linear probes significantly enhances explanation specificity and quality.
  • Using non-linear probes, such as B-cos MLPs, further boosts localization performance, with these techniques validated across diverse models and datasets.

Overview of "How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations"

The paper "How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations" examines how the design and training of probing layers affect the quality of post-hoc explanations for Deep Neural Networks (DNNs). Its central contribution is to expose an overlooked assumption of post-hoc importance attribution methods, which are widely used to interpret DNN classifier decisions: explanation quality turns out to depend substantially on the training specifics of the classification layer rather than on the pre-training paradigm itself.

Core Findings and Techniques

  1. Classification Layer Influence: Contrary to the assumption that explanation methods generalize across pre-training strategies, the research finds that explanation quality, measured via localization accuracy and faithfulness to model decisions, is predominantly determined by how the classification layer, which comprises less than 10% of model parameters, is trained.
  2. Binary Cross-Entropy (BCE) vs. Cross-Entropy (CE): Through empirical evaluation, the authors demonstrate that replacing CE with BCE for training linear probes significantly enhances the specificity and quality of explanations. Because the softmax used with CE is invariant to adding a constant to all logits, CE-trained probes need not encode class-specific evidence in each logit; BCE's per-class sigmoid removes this ambiguity and yields more reliable, class-specific attributions (see the first sketch after this list).
  3. Impact of Non-linear Probes: The paper finds that multi-layer probes, particularly three-layer Multi-Layer Perceptrons (MLPs), further improve the localization performance of model explanations. B-cos MLPs, whose layers scale their outputs by the alignment (cosine similarity) between inputs and weights, outperform conventional MLPs on both downstream classification accuracy and interpretability metrics (see the second sketch after this list).
  4. Evaluation Across Models and Methods: The robustness of these techniques is validated across various architectures (both convolutional networks like ResNet-50 and vision transformers) and on a broad array of attribution methods such as Layer-wise Relevance Propagation (LRP), Integrated Gradients, GradCAM, and B-cos. The evaluation extends over diverse datasets including ImageNet, Pascal VOC, and MS COCO.
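
To make the BCE-vs-CE point concrete, here is a minimal PyTorch sketch (not the authors' implementation): it first checks the shift-invariance of softmax that motivates the switch, then trains a linear probe on frozen backbone features with a BCE objective. The tensors `features` and `labels`, as well as the hyperparameters, are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shift-invariance of softmax: adding a constant to every logit leaves the
# class probabilities unchanged, so a CE-trained probe is free to encode
# evidence shared across classes rather than class-specific evidence.
logits = torch.tensor([[2.0, -1.0, 0.5]])
shifted = logits + 10.0
assert torch.allclose(F.softmax(logits, dim=-1), F.softmax(shifted, dim=-1))
# Per-class sigmoid probabilities, as used with BCE, are not shift-invariant:
assert not torch.allclose(torch.sigmoid(logits), torch.sigmoid(shifted))

def train_bce_probe(features, labels, num_classes, epochs=10, lr=1e-3):
    """Linear probe on frozen features, trained with BCE instead of CE."""
    probe = nn.Linear(features.shape[1], num_classes)
    targets = F.one_hot(labels, num_classes).float()  # per-class binary targets
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(probe(features), targets)
        loss.backward()
        optimizer.step()
    return probe
```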
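The second sketch shows a correspondingly simplified B-cos probe, assuming the B-cos transform of Böhle et al. (each output is the unit-norm linear response scaled by |cos(x, w)|^(B-1)); the official bcos library and the paper's exact probe architecture may differ in details such as normalization, B, and layer widths, which are placeholders here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    """Simplified B-cos linear layer: out_j = |cos(x, w_j)|^(B-1) * (ŵ_jᵀ x)."""
    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b

    def forward(self, x):
        w_hat = F.normalize(self.weight, dim=1)            # unit-norm weight rows
        linear = x @ w_hat.t()                              # ŵᵀ x  (= ||x|| cos)
        cos = linear / (x.norm(dim=1, keepdim=True) + 1e-6)
        return linear * cos.abs().pow(self.b - 1.0)         # down-weights unaligned inputs

# A three-layer B-cos MLP probe on frozen backbone features (dimensions are placeholders);
# it would be trained with, e.g., the BCE objective sketched above.
probe = nn.Sequential(
    BcosLinear(2048, 1024), BcosLinear(1024, 1024), BcosLinear(1024, 1000)
)
out = probe(torch.randn(4, 2048))
```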

Theoretical and Practical Implications

  • Reconciling Interpretability and Training Dynamics: The findings urge reconsideration of the implicit assumption in XAI that attribution methods reflect only inherent model properties, regardless of how downstream layers are trained. This calls for rethinking model design and training so that interpretability goals are met consistently.
  • Probe Design: The work substantiates the importance of how probing mechanisms are designed and selected, and suggests that practitioners should focus in particular on the training objective of the probe layer to keep post-hoc interpretability frameworks useful.
  • Guide for Deep Learning Explainers: By demonstrating that minor adjustments, such as BCE training of probes and the use of B-cos layers, can significantly enhance transparency, these results can guide the design of explainable AI systems in sensitive and high-stakes domains.

Future Prospects

The research opens avenues for developing explainability paradigms tailored to the diverse training strategies of modern neural networks. It also prompts investigation of how DNN explanations could be designed jointly with advances in training methodology, potentially yielding models that are both highly interpretable and flexible across industrial and research settings.

This re-evaluation of how classification layers are trained and interpreted across diverse architectures marks a substantial step forward in the interpretation of machine learning models, reinforcing the point that understanding training mechanics is as pivotal as devising novel XAI techniques.