
Interpretable Deep Learning under Fire (1812.00891v3)

Published 3 Dec 2018 in cs.CR and cs.LG

Abstract: Providing explanations for deep neural network (DNN) models is crucial for their use in security-sensitive domains. A plethora of interpretation models have been proposed to help users understand the inner workings of DNNs: how does a DNN arrive at a specific decision for a given input? The improved interpretability is believed to offer a sense of security by involving humans in the decision-making process. Yet, due to its data-driven nature, the interpretability itself is potentially susceptible to malicious manipulations, about which little is known thus far. Here we bridge this gap by conducting the first systematic study on the security of interpretable deep learning systems (IDLSes). We show that existing IDLSes are highly vulnerable to adversarial manipulations. Specifically, we present ADV^2, a new class of attacks that generate adversarial inputs not only misleading target DNNs but also deceiving their coupled interpretation models. Through empirical evaluation against four major types of IDLSes on benchmark datasets and in security-critical applications (e.g., skin cancer diagnosis), we demonstrate that with ADV^2 the adversary is able to arbitrarily designate an input's prediction and interpretation. Further, with both analytical and empirical evidence, we identify the prediction-interpretation gap as one root cause of this vulnerability -- a DNN and its interpretation model are often misaligned, resulting in the possibility of exploiting both models simultaneously. Finally, we explore potential countermeasures against ADV^2, including leveraging its low transferability and incorporating it in an adversarial training framework. Our findings shed light on designing and operating IDLSes in a more secure and informative fashion, leading to several promising research directions.

Citations (161)

Summary

  • The paper introduces ADV^2, a new class of attacks whose adversarial inputs simultaneously mislead both DNN classifiers and their coupled interpretation models.
  • The study shows that such attacks can force misclassification while producing interpretations that closely resemble those of benign inputs, potentially misleading human oversight.
  • The authors identify the prediction-interpretation gap as a key vulnerability and note the surprisingly low transferability of ADV^2 attacks across different interpreters.

Security Vulnerabilities in Interpretable Deep Learning Systems

The paper "Interpretable Deep Learning under Fire" addresses the critical topic of security vulnerabilities in interpretable deep learning systems (IDLSes). In security-sensitive domains, deep neural networks (DNNs) are utilized extensively, yet they are often implemented as black-box models, raising concerns regarding their interpretability and susceptibility to adversarial attacks. This paper presents a systematic paper highlighting the vulnerabilities inherent in IDLSes, revealing potential risks that have been overlooked in some of these systems.

Weaknesses in Interpretable Deep Learning Systems

The authors introduce a novel class of attacks, referred to as ADV^2, which mislead both the DNN classifiers and their coupled interpretation models simultaneously. This dual deception is achieved by generating adversarial inputs crafted to fool both the prediction of the target DNN and the interpretation produced by its coupled interpreter. The central demonstration is that the sense of security promised by IDLSes can be a false assurance: because interpretation models are themselves data-driven, they can be manipulated alongside the classifiers they are meant to explain.
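The summary does not spell out the attack's optimization, but the dual objective can be illustrated with a PGD-style loop that jointly penalizes a prediction loss (toward the adversary's target class) and an interpretation loss (toward a benign-looking attribution map). The sketch below is illustrative only, assuming a differentiable PyTorch classifier `model` and a differentiable interpreter `interpret`; the loss weighting `lam` and the function signatures are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adv2_style_attack(model, interpret, x, target_class, target_map,
                      eps=8/255, alpha=1/255, steps=100, lam=0.01):
    """Illustrative PGD-style loop with a joint prediction/interpretation loss.

    model        -- classifier returning logits
    interpret    -- differentiable interpreter: (input, model) -> attribution map
    target_class -- LongTensor of shape (1,) holding the adversary's target class
    target_map   -- attribution map the adversary wants shown (e.g. the benign map)
    lam          -- assumed weighting between the two loss terms
    """
    x0 = x.detach()
    x_adv = x0.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Term 1: push the prediction toward the adversary's target class.
        loss_prd = F.cross_entropy(model(x_adv), target_class)
        # Term 2: keep the interpretation close to the benign-looking target map.
        loss_int = F.mse_loss(interpret(x_adv, model), target_map)
        grad, = torch.autograd.grad(loss_prd + lam * loss_int, x_adv)
        # Signed-gradient descent step, projected back into the eps-ball around x0.
        x_adv = x_adv.detach() - alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x0 - eps), x0 + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```

In such a formulation, `lam` trades off how strongly the attack prioritizes fooling the classifier versus keeping the displayed interpretation indistinguishable from a benign one.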

Key Findings and Implications

  1. Adversarial Input Manipulation: The paper illustrates that adversarial attacks can force a DNN to misclassify inputs, while concurrently producing interpretations that closely resemble those of benign inputs. This implies that even when IDLSes engage human oversight in the decision-making process, the interpretations provided may not reflect actual model behavior due to underlying adversarial manipulations.
  2. Prediction-Interpretation Gap: The authors identify the prediction-interpretation gap as a fundamental vulnerability. This gap occurs when the inner workings of a DNN and its related interpretation model are misaligned, allowing adversaries to exploit discrepancies between model predictions and interpretations.
  3. Low Transferability Across Interpreters: The paper finds that adversarial inputs crafted against one interpreter transfer surprisingly poorly to others, because different interpretation models capture different, often partial, facets of a DNN's behavior; a sketch of one way to quantify interpretation similarity follows this list.
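To judge whether an adversarial interpretation "resembles" a benign one, or whether an attack crafted for one interpreter still deceives another, attribution maps need to be compared quantitatively. The helpers below sketch two common map-similarity measures (L1 distance and top-k IoU); these are generic choices for illustration, not necessarily the paper's exact metrics.

```python
import torch

def l1_distance(map_a, map_b):
    """Mean absolute difference between two (equally sized) attribution maps."""
    return (map_a - map_b).abs().mean().item()

def topk_iou(map_a, map_b, k=0.1):
    """IoU of the top-k fraction of most-attributed pixels in each map."""
    top = max(1, int(k * map_a.numel()))
    idx_a = set(torch.topk(map_a.flatten(), top).indices.tolist())
    idx_b = set(torch.topk(map_b.flatten(), top).indices.tolist())
    return len(idx_a & idx_b) / len(idx_a | idx_b)
```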

Experimental Analysis

Through empirical evaluation on benchmark datasets across several classifiers and interpreters, the vulnerability of IDLSes becomes apparent. The performance of ADV^2 attacks is explored extensively against gradient-based interpreters such as gradient saliency and representation-guided ones such as CAM. The findings indicate that these adversarial inputs achieve their intended deceptions with high success rates.
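For readers unfamiliar with the representation-guided interpreters mentioned above, the following is a minimal sketch of a CAM-style attribution map. It assumes a network that applies global average pooling before its classifier (as CAM requires) and exposes its final convolutional activations and last linear layer as `model.features` and `model.fc`; these attribute names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cam_map(model, x, class_idx):
    """Class Activation Map: weight the last conv feature maps by the
    classifier weights of the chosen class, then upsample to the input size."""
    with torch.no_grad():
        feats = model.features(x)                              # (1, C, H, W)
        weights = model.fc.weight[class_idx]                    # (C,)
        cam = (weights[:, None, None] * feats[0]).sum(dim=0)    # (H, W)
        cam = F.relu(cam)
        cam = cam / (cam.max() + 1e-8)                          # normalize to [0, 1]
    return F.interpolate(cam[None, None], size=x.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]
```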

Theoretical and Practical Countermeasures

Reflecting on the implications for the AI community and practitioners in relevant fields, several promising research directions are proposed. Ensemble interpretation, which aggregates multiple complementary interpretation models, is suggested as a way to mitigate risk. Additionally, adversarial interpretation distillation (AID), which incorporates adversarial inputs into the training framework, opens a novel avenue for building more robust interpreters and thereby strengthening IDLSes.
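The summary names AID only at a high level. As one hedged illustration of the general idea, the training step below penalizes the discrepancy between an interpreter's map on a benign input and its map on that input's adversarial counterpart; the function names and the L1 consistency loss are assumptions for illustration, not the paper's algorithm.

```python
import torch

def aid_style_step(interpreter, model, optimizer, x, x_adv):
    """One illustrative distillation step: keep the interpreter's output on an
    adversarial input close to its output on the benign original."""
    optimizer.zero_grad()
    with torch.no_grad():
        target_map = interpreter(x, model)       # "teacher" map on the benign input
    adv_map = interpreter(x_adv, model)          # map on the adversarial counterpart
    loss = (adv_map - target_map).abs().mean()   # L1 consistency penalty
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `interpreter` is assumed to be a trainable module optimized by `optimizer`, and `x_adv` an adversarial input generated, for example, by an ADV^2-style attack such as the one sketched earlier.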

Conclusion

This paper emphasizes the necessity of reevaluating current approaches to interpretability in deep learning systems. As AI models permeate security-critical domains, addressing the theoretical and practical challenges posed by deception tactics like ADV^2 underscores the importance of developing resilient interpretation frameworks. This work is a solid push toward more secure AI systems, placing IDLSes at the center of robust design efforts.