
Axiomatic Attribution for Deep Networks (1703.01365v2)

Published 4 Mar 2017 in cs.LG

Abstract: We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms---Sensitivity and Implementation Invariance that attribution methods ought to satisfy. We show that they are not satisfied by most known attribution methods, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.

Authors (3)
  1. Mukund Sundararajan (27 papers)
  2. Ankur Taly (22 papers)
  3. Qiqi Yan (12 papers)
Citations (5,374)

Summary

Axiomatic Attribution for Deep Networks: A Professional Overview

The paper "Axiomatic Attribution for Deep Networks" by Mukund Sundararajan, Ankur Taly, and Qiqi Yan tackles the key problem of attributing predictions made by deep neural networks to their input features. The authors introduce the concept of axiomatic attribution and propose a method called Integrated Gradients, which ensures that the attributions satisfy certain fundamental axioms, making it both theoretically sound and practically applicable.

Problem Definition and Motivations

Feature attribution in neural networks is crucial for understanding model behavior, improving models, debugging, and providing rationales for model predictions. Specifically, feature attribution helps in identifying which input features most significantly affect the prediction. This is not only important for interpretability but also for trustworthiness in deploying models in critical domains such as healthcare or finance.

The authors define feature attribution formally by considering a function $F: \mathbb{R}^n \rightarrow [0,1]$ that represents a deep network, an input $x = (x_1, \ldots, x_n)$, and a baseline input $x'$. An attribution method assigns a vector $A_F(x, x') = (a_1, \ldots, a_n)$, where $a_i$ represents the contribution of $x_i$ to the prediction $F(x)$.

Fundamental Axioms

The paper asserts two fundamental axioms necessary for any attribution method:

  1. Sensitivity: If the input and baseline differ in exactly one feature and yield different predictions, then that feature should receive a non-zero attribution.
  2. Implementation Invariance: Attribution should be identical for functionally equivalent networks regardless of their implementations.

The authors demonstrate that most existing attribution methods do not satisfy one or both of these axioms, leading to unreliable feature attributions. Methods like Deconvolutional networks and Guided back-propagation fail the Sensitivity axiom, while DeepLift and Layer-wise Relevance Propagation (LRP) fail the Implementation Invariance axiom.
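
The Sensitivity failure of plain gradients can be made concrete with the paper's one-variable example $f(x) = 1 - \mathrm{ReLU}(1 - x)$: with baseline $x' = 0$, the prediction at $x = 2$ differs from the baseline prediction, yet the gradient at $x = 2$ is zero. A minimal PyTorch sketch (the function is from the paper; the code itself is illustrative):

```python
# Sensitivity failure of plain gradients on the paper's example
# f(x) = 1 - ReLU(1 - x), with baseline x' = 0.
import torch

def f(x):
    return 1 - torch.relu(1 - x)

baseline = torch.tensor(0.0)
x = torch.tensor(2.0, requires_grad=True)

y = f(x)
y.backward()

print(f(baseline).item())  # 0.0 -- baseline prediction
print(y.item())            # 1.0 -- prediction changed relative to baseline
print(x.grad.item())       # 0.0 -- yet the gradient attribution is zero
```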

Integrated Gradients

To address these issues, the authors propose Integrated Gradients, a simple and computationally inexpensive method. Integrated Gradients computes attributions by accumulating gradients along the straight-line path from a baseline input (representing the absence of features) to the actual input. The method requires no network modifications and satisfies both fundamental axioms.

Formally, the integrated gradient along the $i$-th dimension for an input $x$ and baseline $x'$ is defined as:

$$\mathrm{IntegratedGrads}_i(x) ::= (x_i - x'_i) \times \int_{0}^{1} \frac{\partial F(x' + \alpha \times (x - x'))}{\partial x_i} \, d\alpha$$
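
In practice the integral is approximated by a Riemann sum over a modest number of interpolation steps. A minimal PyTorch sketch, assuming `model` is any differentiable callable returning a scalar score for one input tensor (the names are illustrative, not from the paper's code):

```python
# Riemann-sum approximation of Integrated Gradients.
import torch

def integrated_gradients(model, x, baseline, steps=50):
    alphas = torch.linspace(0.0, 1.0, steps)
    grads = []
    for alpha in alphas:
        # Point on the straight-line path from the baseline to the input.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        model(point).backward()
        grads.append(point.grad.clone())
    # Average the path gradients and scale by the input-baseline difference.
    return (x - baseline) * torch.stack(grads).mean(dim=0)
```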

This method also satisfies an additional axiom called Completeness, which ensures that the attributions sum to the total difference in predictions between the input and the baseline: $\sum_{i} \mathrm{IntegratedGrads}_i(x) = F(x) - F(x')$.
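
Completeness doubles as a sanity check on the approximation above: the attributions should sum to the prediction gap up to discretization error. Continuing the sketch (with a hypothetical `model`, `x`, and `baseline`):

```python
# Completeness check: attributions should sum to F(x) - F(baseline),
# up to Riemann-sum error (increase `steps` if the check fails).
attributions = integrated_gradients(model, x, baseline, steps=300)
gap = model(x) - model(baseline)
assert torch.allclose(attributions.sum(), gap, atol=1e-2)
```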

Empirical Applications

The practical applicability of Integrated Gradients is demonstrated across diverse domains:

  1. Object Recognition Networks: The method is applied to the GoogleNet architecture trained on ImageNet, showing clear attributions reflecting distinctive image features.
  2. Diabetic Retinopathy Prediction: Integrated Gradients help retina specialists understand model predictions by highlighting relevant features in retinal fundus images.
  3. Question Classification: In a natural-language question classification model, attributions provide insights into the trigger phrases for different question types, aiding the development of new rules for semantic parsing.
  4. Neural Machine Translation: The method effectively aligns input and output tokens in translation tasks, showing how the deep network maps one language to another.
  5. Chemistry Models: For Ligand-Based Virtual Screening, attributions help understand how specific features of molecules contribute to their activity against targets.

Theoretical Justifications and Uniqueness

The paper acknowledges that Integrated Gradients is not the sole method satisfying the two axioms; a whole family of path methods does. It is, however, the unique path method that preserves symmetry: input variables that play symmetric roles in the function, and take identical values in both the input and the baseline, receive identical attributions. The authors argue this property is crucial for fair and meaningful feature attributions.
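
As a quick numeric illustration, the `integrated_gradients` sketch above can be applied to a hypothetical function that is symmetric in its two inputs:

```python
# For a function symmetric in p[0] and p[1], an input and baseline that give
# the two features identical values should yield identical attributions.
symmetric_model = lambda p: torch.sigmoid(p[0] + p[1])
attr = integrated_gradients(symmetric_model,
                            x=torch.tensor([1.0, 1.0]),
                            baseline=torch.zeros(2),
                            steps=100)
print(attr)  # both components match
```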

Future Directions and Conclusion

Integrated Gradients stands out as a theoretically justified and easily implementable method suitable for a wide range of applications. Going forward, the challenge lies in extending the understanding of feature interactions and the logical reasoning embedded within the network. Moreover, the insights from axiomatic approaches can pave the way for future developments in explainable AI, ensuring that models are both interpretable and trustworthy.

In conclusion, the paper presents a robust framework for attribution in deep networks, providing a method that is both feasible and satisfies key theoretical properties. The Integrated Gradients approach is a meaningful step towards more interpretable and reliable deep learning models.
