- The paper demonstrates that minor adversarial perturbations can drastically alter neural network interpretations without changing the prediction labels.
- It evaluates popular interpretation methods like simple gradients, integrated gradients, and DeepLIFT on datasets such as ImageNet and CIFAR-10.
- The study calls for developing more robust interpretation techniques and regularization strategies to ensure reliability in critical applications.
Interpretation of Neural Networks is Fragile
The paper by Ghorbani, Abid, and Zou examines an often-overlooked aspect of deploying machine learning (ML) systems and neural networks (NNs): the robustness of interpretation methods. Interpretability is crucial for deploying ML systems in critical sectors such as medicine, finance, and law, where understanding the reasoning behind a model's decision can be as important as the prediction itself. The paper systematically evaluates how susceptible several widely used interpretation methods are to adversarial attacks and finds significant vulnerabilities.
Key Contributions
- Introduction of Adversarial Perturbations to Interpretation:
- The authors extend adversarial attack methodologies to target not only the predictions but also the interpretations of neural networks.
- The notion of fragility in NN interpretation is defined: a model's interpretation is fragile if small adversarial perturbations can produce perceptually indistinguishable inputs with the same predicted label but substantially different interpretations.
- Evaluation of Interpretation Methods:
- The paper evaluates the robustness of three widely used feature-importance interpretation methods (simple gradients, integrated gradients, and DeepLIFT) along with an exemplar-based method, influence functions.
- The experiments are conducted on two prominent datasets, ImageNet and CIFAR-10, with systematic characterization of the impact on interpretation robustness.
- Design of Targeted Adversarial Attacks:
- The authors propose iterative attacks tailored to maximize dissimilarity between the original and perturbed interpretations while constraining changes to the input to be imperceptible.
- Specific attack methodologies include top-k attacks, mass-center attacks, and targeted attacks that semantically shift the focus of the interpretation; a minimal sketch of a top-k style attack follows this list.
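To make the attack concrete, here is a minimal PyTorch sketch of a top-k style attack on simple-gradient saliency. It is an illustrative sketch, not the paper's implementation: the model, the ε = 8/255 budget, the step size, the iteration count, and the surrogate loss (the total saliency remaining on the original top-k pixels) are all assumptions. Because the objective differentiates a gradient, second-order gradients are needed; these vanish almost everywhere for ReLU networks, so in practice attack gradients are computed through a smoothed (e.g. softplus) copy of the network.

```python
import torch

def saliency_map(model, x, label):
    """Simple-gradient feature importance: |d logit_label / d x|, summed over channels."""
    x = x.clone().detach().requires_grad_(True)
    model(x)[0, label].backward()
    return x.grad.detach().abs().sum(dim=1).squeeze(0)      # shape (H, W)

def topk_attack(model, x, label, k=1000, eps=8 / 255, step=1 / 255, iters=100):
    """Iteratively push saliency mass off the original top-k pixels while keeping
    the perturbation inside an L_inf ball and the predicted label unchanged."""
    orig_sal = saliency_map(model, x, label)
    topk_idx = orig_sal.flatten().topk(k).indices            # pixels to de-emphasize
    x_adv = x.clone().detach()

    for _ in range(iters):
        x_adv.requires_grad_(True)
        logit = model(x_adv)[0, label]
        # differentiable saliency of the current (perturbed) input
        grad_x, = torch.autograd.grad(logit, x_adv, create_graph=True)
        sal = grad_x.abs().sum(dim=1).flatten()
        # surrogate objective: importance still assigned to the ORIGINAL top-k pixels
        loss = sal[topk_idx].sum()
        loss.backward()          # second-order gradients; use a softplus-smoothed model

        with torch.no_grad():
            candidate = x_adv - step * x_adv.grad.sign()       # descend on the surrogate
            candidate = x + (candidate - x).clamp(-eps, eps)   # project onto the L_inf ball
            candidate = candidate.clamp(0, 1)
            if model(candidate).argmax(dim=1).item() != label:
                break                                          # never flip the prediction
        x_adv = candidate.detach()

    return x_adv.detach()
```

The mass-center and targeted variants described in the paper replace this surrogate with, respectively, the displacement of the saliency map's center of mass and the importance assigned to a chosen target region.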
Numerical Results and Observations
- Across 512 images from ImageNet, perturbations generated by the proposed adversarial methods consistently led to substantial changes in feature importance interpretations without altering the predicted label.
- Feature importance maps proved highly sensitive to the perturbations, while the accompanying drops in prediction confidence were minor.
- Quantitative evaluations using metrics such as Spearman’s rank-order correlation and top-k intersection demonstrated significant changes (a minimal sketch of both metrics follows this list):
- For instance, random sign perturbations (L∞=8) retained less than 30% overlap in the top 1000 most salient pixels across all interpretation methods.
- Targeted and mass-center attacks showed even more pronounced effects.
- Integrated gradients were observed to be the most resilient among the feature importance methods, although still susceptible to adversarial perturbations.
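Both robustness metrics are straightforward to compute from a pair of saliency maps. The sketch below uses NumPy and SciPy; the array names and the k = 1000 choice are illustrative assumptions mirroring the evaluation described above, not code from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def topk_intersection(sal_a, sal_b, k=1000):
    """Fraction of the top-k most salient pixels shared by two saliency maps."""
    top_a = set(np.argsort(sal_a.ravel())[-k:])
    top_b = set(np.argsort(sal_b.ravel())[-k:])
    return len(top_a & top_b) / k

def rank_correlation(sal_a, sal_b):
    """Spearman's rank-order correlation between the two importance rankings."""
    rho, _ = spearmanr(sal_a.ravel(), sal_b.ravel())
    return rho

# Example with hypothetical 224x224 saliency maps for an original and a perturbed image:
# overlap = topk_intersection(orig_sal, adv_sal)   # below 0.3 for the random-sign baseline
# rho = rank_correlation(orig_sal, adv_sal)
```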
Implications of the Research
Practically, this fragility means that interpretations currently used to justify decisions in critical applications may not be reliable. Users must exercise caution in trusting interpretation methods, especially in contexts where explanations can influence real-world decisions, such as in medicine or financial services.
Theoretically, this research raises questions about the inherent deficiencies in the robustness of interpretability methods. The findings suggest that even if a neural network’s predictions are resilient to adversarial attacks, its interpretations might not be, creating a new dimension of vulnerability.
Future Developments in AI
The paper points to several directions for future research:
- Developing More Robust Interpretation Methods: There's a clear need for interpretation methods that can withstand adversarial perturbations, ensuring reliability not just in predictions but in explanations.
- Regularization Techniques: Theoretical insights from the paper suggest that regularizing the Lipschitz constant of interpretation functions could be a promising approach, hinting at methodologies akin to adversarial training but for interpretability (a rough illustrative probe follows this list).
- Cross-domain Applications: Extending this analysis to domains beyond image data could help in developing universally robust interpretability standards applicable to diverse AI applications.
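As a rough illustration of what the Lipschitz behaviour of an interpretation function looks like in practice, the hypothetical probe below estimates how quickly a saliency map moves under small random input perturbations, a finite-difference stand-in for a local Lipschitz constant. It reuses the saliency_map sketch above and is not a method from the paper.

```python
import torch

def saliency_sensitivity(model, x, label, saliency_fn, eps=1e-2, n_probes=8):
    """Crude local Lipschitz estimate for an interpretation function S:
    max over random directions of ||S(x + delta) - S(x)|| / ||delta||."""
    base = saliency_fn(model, x, label)
    worst = 0.0
    for _ in range(n_probes):
        delta = eps * torch.randn_like(x)
        perturbed = saliency_fn(model, x + delta, label)
        worst = max(worst, (perturbed - base).norm().item() / delta.norm().item())
    return worst
```

A regularizer in the spirit the paper suggests would penalize this quantity, or a differentiable surrogate for it, during training.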
Conclusion
Ghorbani, Abid, and Zou’s paper is a pivotal reminder that as AI systems become more integral to decision-making processes, the robustness of their interpretability must not be neglected. The fragility of neural network interpretations under adversarial perturbations underscores the need for future research focused on developing more reliable and robust interpretation methods, ensuring that neural networks can be trusted not only for their predictions but also for the explanations they provide.