Axiomatic Attribution for Deep Networks: A Professional Overview
The paper "Axiomatic Attribution for Deep Networks" by Mukund Sundararajan, Ankur Taly, and Qiqi Yan tackles the key problem of attributing predictions made by deep neural networks to their input features. The authors introduce the concept of axiomatic attribution and propose a method called Integrated Gradients, which ensures that the attributions satisfy certain fundamental axioms, making it both theoretically sound and practically applicable.
Problem Definition and Motivations
Feature attribution in neural networks is crucial for understanding model behavior, improving models, debugging, and providing rationales for model predictions. Specifically, feature attribution helps in identifying which input features most significantly affect the prediction. This is not only important for interpretability but also for trustworthiness in deploying models in critical domains such as healthcare or finance.
The authors formalize attribution as follows. A deep network is represented by a function F : Rⁿ → [0, 1], and the input is x = (x₁, …, xₙ). Relative to a baseline input x′ (chosen to represent the absence of features), an attribution method assigns a vector A_F(x, x′) = (a₁, …, aₙ), where aᵢ is the contribution of xᵢ to the prediction F(x).
Fundamental Axioms
The paper posits two fundamental axioms that any attribution method should satisfy:
- Sensitivity: If the input and a baseline differ in exactly one feature and yield different predictions, then that feature should receive a non-zero attribution.
- Implementation Invariance: Attribution should be identical for functionally equivalent networks regardless of their implementations.
The authors demonstrate that most existing attribution methods violate one or both of these axioms, which makes their attributions unreliable. Plain gradients fail the Sensitivity axiom because they saturate: a feature can change the prediction yet receive zero gradient, as the example below shows. Methods like Deconvolutional Networks and Guided Back-propagation also violate Sensitivity, while DeepLift and Layer-wise Relevance Propagation (LRP) violate Implementation Invariance.
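A minimal NumPy sketch of the paper's one-variable, one-ReLU example, f(x) = 1 − ReLU(1 − x), shows the saturation problem. Moving from baseline 0 to input 2 changes the output from 0 to 1, yet the gradient at the input is zero; averaging gradients along the path (previewing Integrated Gradients, defined in the next section) recovers the full change:

```python
import numpy as np

f = lambda x: 1.0 - np.maximum(1.0 - x, 0.0)     # f(x) = 1 - ReLU(1 - x)
grad_f = lambda x: np.where(x < 1.0, 1.0, 0.0)   # df/dx is 0 once the ReLU saturates

x, baseline = 2.0, 0.0
print(f(x) - f(baseline))    # 1.0 -> the prediction clearly changes
print(grad_f(x))             # 0.0 -> yet the gradient at x assigns no credit

# Averaging gradients along the path from baseline to x restores Sensitivity:
alphas = (np.arange(50) + 0.5) / 50
ig = (x - baseline) * np.mean(grad_f(baseline + alphas * (x - baseline)))
print(ig)                    # 1.0 -> the full change is attributed to x
```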
Integrated Gradients
To address these issues, the authors propose Integrated Gradients, a simple and computationally efficient method. Integrated Gradients computes attributions by accumulating gradients along the straight-line path from a baseline input (representing the absence of features) to the actual input. The method requires no modification of the original network and satisfies both fundamental axioms.
Formally, the integrated gradient along the i-th dimension for an input x and baseline x′ is defined as:

IntegratedGradsᵢ(x) ::= (xᵢ − x′ᵢ) × ∫₀¹ ∂F(x′ + α × (x − x′)) / ∂xᵢ dα
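In practice the integral is approximated by a Riemann sum over a modest number of gradient evaluations. A minimal NumPy sketch, assuming the caller supplies the gradient function:

```python
import numpy as np

def integrated_gradients(grad_F, x, baseline, steps=50):
    """Riemann-sum approximation of the integral above.

    grad_F   -- gradient of the network F, mapping a point in R^n to a vector in R^n
    x        -- input vector, shape (n,)
    baseline -- baseline vector, shape (n,), e.g. all zeros
    """
    # Midpoints of `steps` equal sub-intervals of [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Average the gradients at points along the straight line from baseline to x.
    path_grads = [grad_F(baseline + a * (x - baseline)) for a in alphas]
    # Scale the averaged gradient by the element-wise input/baseline difference.
    return (x - baseline) * np.mean(path_grads, axis=0)
```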
The method also satisfies an additional axiom called Completeness: the attributions sum to F(x) − F(x′), the difference between the prediction at the input and at the baseline.
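Continuing the NumPy sketch above, Completeness doubles as a numerical sanity check: the attributions should sum (approximately) to F(x) − F(x′). A toy logistic model, chosen here purely for illustration:

```python
# Toy model: F(x) = sigmoid(w . x); its gradient has a simple closed form.
w = np.array([1.0, -2.0, 0.5])
F = lambda x: 1.0 / (1.0 + np.exp(-w @ x))
grad_F = lambda x: F(x) * (1.0 - F(x)) * w

x, baseline = np.array([1.0, 1.0, 1.0]), np.zeros(3)
attr = integrated_gradients(grad_F, x, baseline, steps=200)
print(attr.sum())          # approximately equal to ...
print(F(x) - F(baseline))  # ... the change in the prediction
```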
Empirical Applications
The practical applicability of Integrated Gradients is demonstrated across diverse domains:
- Object Recognition Networks: The method is applied to the GoogLeNet architecture trained on ImageNet, producing attributions that highlight distinctive image features (a sketch of this computation appears after this list).
- Diabetic Retinopathy Prediction: Integrated Gradients help retina specialists understand model predictions by highlighting relevant features in retinal fundus images.
- Question Classification: In natural language models, attributions reveal the trigger phrases for different question types, aiding the development of new rules for semantic parsing.
- Neural Machine Translation: The method effectively aligns input and output tokens in translation tasks, showing how the deep network maps one language to another.
- Chemistry Models: For Ligand-Based Virtual Screening, attributions help understand how specific features of molecules contribute to their activity against targets.
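For the vision use cases above, attributions are typically computed against a black-image baseline. Below is a minimal PyTorch sketch; the classifier `model`, its preprocessing, and the target class are assumptions for illustration, not the authors' original setup:

```python
import torch

def image_integrated_gradients(model, image, target_class, steps=50):
    """Integrated gradients for a single image against a black (all-zero) baseline."""
    baseline = torch.zeros_like(image)
    total = torch.zeros_like(image)
    for k in range(steps):
        alpha = (k + 0.5) / steps                       # midpoint of the k-th sub-interval
        point = (baseline + alpha * (image - baseline)).detach().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target_class]  # scalar class score
        score.backward()                                # populates point.grad
        total += point.grad
    return (image - baseline) * (total / steps)         # scale the averaged gradients

# Usage (assuming `model` is an eval-mode classifier and `img` a preprocessed
# (C, H, W) tensor): attr = image_integrated_gradients(model, img, predicted_class)
```

The resulting per-pixel attributions can then be summed across channels and overlaid on the image, as in the paper's visualizations.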
Theoretical Justifications and Uniqueness
The paper shows that Integrated Gradients is not the only method satisfying the axioms: every path method (a method that integrates gradients along some path from the baseline to the input) satisfies them. However, Integrated Gradients is the unique path method that is also symmetry-preserving: features that play interchangeable roles in the function receive identical attributions. This property is crucial for fair and meaningful attributions.
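Symmetry preservation is easy to verify numerically on a small symmetric function. The hypothetical example below uses F(x₁, x₂) = x₁·x₂ with input (1, 1) and baseline (0, 0); the straight-line path yields equal attributions, while a path that moves x₁ before x₂ does not:

```python
import numpy as np

# Symmetric function F(x1, x2) = x1 * x2; input (1, 1), baseline (0, 0).
grad = lambda p: np.array([p[1], p[0]])        # (dF/dx1, dF/dx2)
alphas = (np.arange(1000) + 0.5) / 1000

# Straight-line path: both coordinates move together from 0 to 1.
ig = np.mean([grad(np.array([a, a])) for a in alphas], axis=0)
print(ig)                                      # ~[0.5, 0.5]: symmetric attributions

# A two-leg path: move x1 from 0 to 1 first, then x2 from 0 to 1.
a1 = np.mean([grad(np.array([a, 0.0]))[0] for a in alphas])  # x1's share, accrued on leg 1
a2 = np.mean([grad(np.array([1.0, a]))[1] for a in alphas])  # x2's share, accrued on leg 2
print(a1, a2)                                  # ~0.0, 1.0: symmetry is broken
```

Both paths satisfy Completeness (the attributions sum to 1 in each case), but only the straight-line path treats the two symmetric features identically.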
Future Directions and Conclusion
Integrated Gradients stands out as a theoretically justified and easily implemented method suitable for a wide range of applications. Going forward, the open challenges include accounting for feature interactions and for the logical reasoning a network performs, neither of which per-feature attribution captures. Moreover, the insights from axiomatic approaches can pave the way for future developments in explainable AI, ensuring that models are both interpretable and trustworthy.
In conclusion, the paper presents a robust framework for attribution in deep networks, providing a method that is practical to implement and satisfies key theoretical properties. Integrated Gradients is a meaningful step towards more interpretable and reliable deep learning models.