Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering (1811.03728v1)

Published 9 Nov 2018 in cs.LG, cs.CR, and stat.ML

Abstract: While ML models are being increasingly trusted to make decisions in different and varying areas, the safety of systems using such models has become an increasing concern. In particular, ML models are often trained on data from potentially untrustworthy sources, providing adversaries with the opportunity to manipulate them by inserting carefully crafted samples into the training set. Recent work has shown that this type of attack, called a poisoning attack, allows adversaries to insert backdoors or trojans into the model, enabling malicious behavior with simple external backdoor triggers at inference time and only a blackbox perspective of the model itself. Detecting this type of attack is challenging because the unexpected behavior occurs only when a backdoor trigger, which is known only to the adversary, is present. Model users, either direct users of training data or users of pre-trained model from a catalog, may not guarantee the safe operation of their ML-based system. In this paper, we propose a novel approach to backdoor detection and removal for neural networks. Through extensive experimental results, we demonstrate its effectiveness for neural networks classifying text and images. To the best of our knowledge, this is the first methodology capable of detecting poisonous data crafted to insert backdoors and repairing the model that does not require a verified and trusted dataset.

Detection and Mitigation of Backdoor Attacks in Deep Neural Networks Using Activation Clustering

The paper "Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering" by Chen et al. presents a novel methodology for identifying and mitigating backdoor attacks in deep neural networks (DNNs). With the growing deployment of ML models in various critical applications, the threat landscape has expanded, necessitating robust defenses against increasingly sophisticated attacks. This paper addresses the specific challenge of detecting training data poisoning designed to implant backdoors in neural networks.

Motivation and Context

As ML models become integral to decision-making in numerous domains, their security becomes paramount, especially when models are trained on datasets from potentially untrustworthy sources. The integrity of models can be compromised by adversaries inserting carefully crafted samples into the training set, known as poisoning attacks. Such attacks can embed backdoors or trojans in the model. This paper introduces the Activation Clustering (AC) method, offering a pioneering approach to detect and neutralize backdoor attacks without the need for a pre-validated and trusted dataset.

Key Contributions

The authors articulate several significant contributions:

  • Novel Detection Methodology: The AC method is the first technique capable of detecting poisoned training data and repairing backdoored DNNs without relying on a verified and trusted dataset.
  • Extensive Evaluation: The robustness and effectiveness of the AC method are demonstrated across multiple datasets, including text and image data.
  • Open-Source Implementation: The method is available as part of the IBM Adversarial Robustness Toolbox, facilitating broader adoption and further research.

Technical Approach

The AC method leverages the insight that, although backdoor and legitimate samples may receive the same label from the compromised network, the internal activation patterns that lead to that decision differ. The method involves the following steps:

  1. Training: Initially, the DNN is trained using the potentially poisoned dataset.
  2. Activation Extraction: Activations from the last hidden layer are recorded, as they capture how the network arrives at its decisions.
  3. Clustering: The activations for each class are reduced to a lower-dimensional space (e.g., via Independent Component Analysis) and then clustered (e.g., with k-means) to separate poisoned from legitimate samples (a minimal sketch of this pipeline appears after this list).
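
A minimal sketch of this pipeline is shown below, under stated assumptions: it presumes a trained Keras classifier `model` whose penultimate layer is named "last_hidden" and integer-labeled training arrays `x_train` and `y_train`. These names, the ICA dimensionality, and the cluster count are illustrative choices, not the paper's reference implementation.

```python
# Sketch of per-class activation clustering (illustrative, not the
# paper's reference code). Assumes integer class labels in y_train.
import numpy as np
from tensorflow import keras
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

def cluster_activations_per_class(model, x_train, y_train,
                                  n_components=10, n_clusters=2):
    # Sub-model that exposes the last hidden layer's activations.
    extractor = keras.Model(inputs=model.input,
                            outputs=model.get_layer("last_hidden").output)
    clusters = {}
    for label in np.unique(y_train):
        idx = np.where(y_train == label)[0]
        acts = extractor.predict(x_train[idx], verbose=0)
        acts = acts.reshape(len(idx), -1)            # flatten per sample
        # Reduce dimensionality before clustering.
        reduced = FastICA(n_components=n_components).fit_transform(acts)
        # Two-way k-means: one cluster is expected to hold any poison.
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
        clusters[label] = (idx, reduced, km.labels_)
    return clusters
```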

Several cluster-analysis techniques are proposed to determine which clusters contain poisoned data and which contain legitimate data:

  • Exclusionary Reclassification: Retrain the model without the data in a suspect cluster; if the cluster was poisoned, the retrained model tends to assign its samples to their true source class rather than the attacker's target label.
  • Relative Size Comparison: Compare cluster sizes within a class; poisoned data typically forms a markedly smaller cluster than the legitimate data sharing the same label.
  • Silhouette Score: Assess how cleanly a class's activations split into two clusters; an unusually high silhouette score indicates a genuinely separate subpopulation, consistent with the presence of poison (a brief sketch of the size and silhouette heuristics appears after this list).
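
The relative-size and silhouette heuristics can be expressed compactly for a single class, reusing the `reduced` activations and two-cluster labels from the previous sketch. The threshold values below are illustrative placeholders, not the paper's tuned settings.

```python
# Illustrative check of the relative-size and silhouette heuristics for
# one class, given `reduced` (dimensionality-reduced activations) and
# `labels` (the corresponding 2-cluster k-means assignments).
import numpy as np
from sklearn.metrics import silhouette_score

def flag_poisoned_cluster(reduced, labels,
                          size_threshold=0.35, silhouette_threshold=0.15):
    sizes = np.bincount(labels) / len(labels)   # fraction of samples per cluster
    smallest = int(np.argmin(sizes))
    # Poisoned data tends to form a markedly smaller cluster...
    small_enough = sizes[smallest] < size_threshold
    # ...and the split tends to be unusually clean when poison is present.
    well_separated = silhouette_score(reduced, labels) > silhouette_threshold
    return smallest if (small_enough and well_separated) else None
```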

Experimental Results

Extensive experiments were conducted on datasets such as MNIST, LISA (traffic sign data), and Rotten Tomatoes (text reviews). The AC method achieved near-perfect detection rates in distinguishing poisoned samples:

  • Accuracy and F1 Scores: For MNIST (10% poisoned), accuracy and F1 scores were close to 100%. Clustering the raw inputs fared significantly worse, underscoring the effectiveness of analyzing activations.
  • Resilience to Multimodal Classes and Multiple Poisons: The method proved robust even when target classes contained diverse subpopulations or multiple sources of poison.

Implications and Future Work

The AC method enhances the security of ML deployments against backdoor attacks by providing a practical, dataset-agnostic solution. It supports safe model deployment in environments where data verification is challenging or infeasible. Future research might explore the efficacy of this method in other ML models and adversarial settings, including generative adversarial frameworks and reinforcement learning. Additionally, refining cluster analysis techniques to further reduce computational overheads and improving repair mechanisms could broaden the method's applicability and efficiency.

In conclusion, the AC method provides a critical advancement in the defense against poisoning attacks, safeguarding the integrity of ML models in increasingly adversarial settings. This paper's contributions lay the groundwork for further innovations in robust ML security protocols.

Authors (8)
  1. Bryant Chen (7 papers)
  2. Wilka Carvalho (9 papers)
  3. Nathalie Baracaldo (34 papers)
  4. Heiko Ludwig (17 papers)
  5. Benjamin Edwards (11 papers)
  6. Taesung Lee (9 papers)
  7. Ian Molloy (11 papers)
  8. Biplav Srivastava (57 papers)
Citations (729)