Analyzing Axiomatic Attribution Priors and Expected Gradients in Deep Learning Models
This paper presents an approach for integrating feature attribution methods into the training of deep learning models to improve both performance and interpretability. Its two core contributions are axiomatic attribution priors and a newly developed feature attribution method, expected gradients, which is efficient enough to be computed during training.
Attribution Methods in Model Training
Feature attribution methods are traditionally applied post hoc to interpret machine learning models by quantifying how much each input feature contributes to a specific prediction. The paper instead embeds these methods into the training phase, turning them into "attribution priors": penalties on a model's attributions that enforce higher-level behavior such as smoothness or sparsity. This shifts the paradigm so that feature attributions guide model training rather than merely critique it after the fact, reducing reliance on features with undesirable properties and promoting behavior more aligned with human intuition.
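Concretely, an attribution prior adds a penalty on the model's attributions to the usual training loss. The following is a minimal PyTorch sketch of that idea; the names `attribute_fn`, `omega`, and `lam` are illustrative placeholders, not the authors' API, and a cross-entropy task loss is assumed for concreteness.

```python
import torch
import torch.nn.functional as F

def attribution_prior_loss(model, x, y, attribute_fn, omega, lam=0.1):
    """Single-step training loss: task loss plus a penalty on attributions.

    attribute_fn(model, x) is assumed to return per-feature attributions
    (e.g. expected gradients) with the same shape as x; omega maps those
    attributions to a scalar penalty (e.g. a smoothness or sparsity measure);
    lam trades off the task loss against the prior.
    """
    task_loss = F.cross_entropy(model(x), y)   # standard supervised loss
    phi = attribute_fn(model, x)               # differentiable attributions
    prior_loss = omega(phi)                    # scalar penalty on attributions
    return task_loss + lam * prior_loss
```

Because the penalty is a function of the attributions rather than of specific input features, the same training loop supports very different priors simply by swapping `omega`.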
Expected Gradients
Expected gradients extend integrated gradients by removing the dependence on a single, hand-picked baseline input. Instead of integrating gradients along a path from one arbitrary baseline, the method takes an expectation over baselines drawn from the training data (and over interpolation points between baseline and input), which can be estimated by sampling. Expected gradients satisfy interpretability axioms such as completeness and implementation invariance, and the sampling formulation keeps the attributions cheap enough to compute during training.
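A minimal Monte Carlo estimator of expected gradients might look like the PyTorch sketch below. It assumes the model returns a single scalar per example (for a classifier, the logit of the class of interest); this is an illustration of the published formulation, not the authors' implementation.

```python
import torch

def expected_gradients(model, x, background, n_samples=32):
    """Estimate E_{x'~data, alpha~U(0,1)}[(x - x') * grad f(x' + alpha*(x - x'))].

    x:          batch of inputs to explain, shape (batch, ...)
    background: reference samples drawn from the training data
    n_samples:  number of (baseline, alpha) draws averaged per input
    """
    attributions = torch.zeros_like(x)
    for _ in range(n_samples):
        # Draw a random baseline x' from the data distribution.
        idx = torch.randint(0, background.shape[0], (x.shape[0],), device=x.device)
        baseline = background[idx]
        # Draw a random interpolation point alpha ~ U(0, 1) per example.
        alpha = torch.rand((x.shape[0],) + (1,) * (x.dim() - 1), device=x.device)
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        # Gradient of the scalar model output at the interpolated point.
        # create_graph=True lets the attributions themselves be differentiated
        # when they appear inside a training-time penalty.
        output = model(point).sum()
        grads = torch.autograd.grad(output, point, create_graph=True)[0]
        attributions = attributions + (x - baseline) * grads / n_samples
    return attributions
```

Fewer samples can be used when the estimate serves only as a training-time regularizer, since the stochastic noise averages out over many minibatches.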
Empirical Results
The paper presents strong experimental evidence on image, gene expression, and healthcare data, demonstrating consistent performance improvements when expected gradients and attribution priors are used together.
- Image Data: Models trained with a pixel smoothness prior on their attributions were more robust to input noise and produced smoother, more interpretable attribution maps than baseline models, suggesting better generalization under domain shift at a small cost in predictive accuracy.
- Gene Expression Data: A graph-based attribution prior built on protein-protein interaction networks improved predictive performance on drug response tasks and yielded biologically plausible attributions that concentrated in established pathways.
- Healthcare Data: In limited-data settings, a sparsity prior yielded models with better predictive performance whose attribution was concentrated on a small set of key features, highlighting the value of attribution priors for building parsimonious models when data are scarce. Sketches of example penalties for these three priors follow the list.
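The penalties behind these priors can be written compactly. Below is a hedged PyTorch sketch of plausible forms: a total-variation penalty for image smoothness, a graph Laplacian penalty for the protein-protein interaction prior, and a Gini-style sparsity penalty. The function names and exact scalings are illustrative assumptions; consult the paper for the precise definitions used in each experiment.

```python
import torch

def smoothness_prior(phi):
    """Total-variation penalty on image attributions (batch, channels, H, W):
    sums absolute differences between neighboring pixels, encouraging smooth,
    low-frequency attribution maps."""
    dh = (phi[..., 1:, :] - phi[..., :-1, :]).abs().sum()
    dw = (phi[..., :, 1:] - phi[..., :, :-1]).abs().sum()
    return (dh + dw) / phi.shape[0]

def graph_prior(phi, laplacian):
    """Graph penalty: quadratic form of mean absolute per-feature attributions
    with the Laplacian of a feature graph (e.g. protein-protein interactions),
    so that connected features receive similar attribution."""
    mean_abs = phi.abs().mean(dim=0)          # shape (n_features,)
    return mean_abs @ laplacian @ mean_abs

def sparsity_prior(phi, eps=1e-8):
    """Sparsity penalty: negative Gini coefficient of mean absolute attributions.
    Minimizing it pushes total attribution mass onto a few features."""
    mean_abs = phi.abs().mean(dim=0)
    vals, _ = torch.sort(mean_abs)
    n = vals.numel()
    idx = torch.arange(1, n + 1, dtype=vals.dtype, device=vals.device)
    gini = ((2 * idx - n - 1) * vals).sum() / (n * vals.sum() + eps)
    return -gini
```

Any of these functions can be passed as the `omega` argument in the training-loss sketch shown earlier.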
Theoretical and Practical Implications
This work strongly supports the notion that feature attribution is not merely an interpretive tool but can be a meaningful component of the learning process. By embedding human knowledge as priors into model training, the approach opens new directions for building robust, interpretable, and practically useful models across domains. Expected gradients, in particular, address the computational cost and subjective baseline selection of earlier methods, making attribution priors viable to train at scale.
Future Prospects
The generality of the proposed framework for expected gradients and attribution priors may catalyze further research into custom priors tailored to specific domains and tasks. Future studies could explore more complex priors informed by domain-specific insight, improving model alignment with expert intuition. There is also room to adapt the approach to more diverse model architectures and alternative attribution methods, enriching the toolkit for transparent and accountable AI systems.
In conclusion, this paper provides a substantive contribution to the integration of feature attribution into model training, demonstrating compelling empirical results and paving the way for theoretically sound and practically robust advancements in AI interpretability.