- The paper demonstrates that backdoored DNNs exhibit heightened sensitivity to adversarial perturbations, enabling effective neuron pruning.
- The introduced Adversarial Neuron Pruning (ANP) method leverages minimal clean data (~1%) to remove vulnerable neurons while preserving overall model performance.
- Experimental results show that ANP significantly reduces the attack success rate across various backdoor attacks, including BadNets, Blend, and input-aware attacks.
An Overview of "Adversarial Neuron Pruning Purifies Backdoored Deep Models"
The paper "Adversarial Neuron Pruning Purifies Backdoored Deep Models" addresses a significant concern in the use of deep neural networks (DNNs) trained via third-party platforms: the risk of backdoor attacks. In these attacks, a DNN exhibits expected behavior on standard inputs but produces pre-determined, incorrect outputs on inputs embedded with a specific trigger. This research presents a novel method termed Adversarial Neuron Pruning (ANP), aimed at mitigating these backdoor vulnerabilities without prior knowledge of the trigger pattern.
Key Contributions and Findings
- Sensitivity of Backdoored Models: The authors observe that backdoored DNNs are unusually sensitive to adversarial perturbations of their neurons: under such perturbations, a backdoored model collapses into predicting the attacker's target label even when no trigger is present. This indicates a strong link between the injected backdoor and neuron-level perturbation sensitivity.
- Adversarial Neuron Pruning (ANP): Leveraging this sensitivity, the paper proposes ANP, a defense that identifies and prunes the most perturbation-sensitive neurons, purging the model of the injected backdoor behavior. The method is shown to be effective with access to only a small amount (around 1%) of clean data, making it feasible under tight resource constraints; a simplified sketch of the underlying mask optimization appears after this list.
- Experimental Validation: Comprehensive experiments show that ANP markedly reduces the attack success rate against several sophisticated backdoor attacks, including BadNets, Blend, and input-aware backdoor attacks, without significant deterioration in performance on clean data (an illustrative evaluation sketch also follows this list).
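At a high level, ANP formulates pruning as a min-max problem: an inner step adversarially perturbs neurons to expose the backdoor, and an outer step learns a per-neuron mask that keeps the model accurate both with and without that perturbation; neurons whose mask collapses toward zero are then pruned. The sketch below is a simplified reading of that idea in PyTorch, not the authors' reference implementation: the `MaskedLinear` wrapper, helper names, and hyperparameter values (eps, alpha, learning rates, pruning threshold) are assumptions, and the paper applies the perturbation and mask at the level of individual neurons throughout the network rather than to a single linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Wraps a linear layer so each output neuron carries a prunable mask in [0, 1]
    and an optional multiplicative adversarial perturbation (1 + delta)."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(base.bias.detach().clone(), requires_grad=False)
        self.mask = nn.Parameter(torch.ones(base.out_features))    # defender's variable
        self.delta = nn.Parameter(torch.zeros(base.out_features))  # adversary's variable
        self.perturb = False  # toggled on for the inner maximization / robust loss

    def forward(self, x):
        scale = self.mask * (1.0 + self.delta) if self.perturb else self.mask
        return F.linear(x, self.weight, self.bias) * scale

def set_perturb(model: nn.Module, on: bool):
    for m in model.modules():
        if isinstance(m, MaskedLinear):
            m.perturb = on

def anp_batch(model, x, y, eps=0.4, alpha=0.2, inner_steps=1, lr_delta=0.4, lr_mask=0.2):
    """One batch of the min-max optimization: perturb neurons adversarially,
    then update the masks so the model stays correct with and without perturbation."""
    masks  = [p for n, p in model.named_parameters() if n.endswith("mask")]
    deltas = [p for n, p in model.named_parameters() if n.endswith("delta")]

    # Inner maximization: find neuron perturbations that most increase the loss.
    set_perturb(model, True)
    for _ in range(inner_steps):
        loss = F.cross_entropy(model(x), y)
        grads = torch.autograd.grad(loss, deltas)
        with torch.no_grad():
            for d, g in zip(deltas, grads):
                d.add_(lr_delta * g.sign()).clamp_(-eps, eps)

    # Outer minimization: update masks against both the perturbed and clean losses.
    loss_rob = F.cross_entropy(model(x), y)   # neurons perturbed
    set_perturb(model, False)
    loss_nat = F.cross_entropy(model(x), y)   # neurons unperturbed
    loss = alpha * loss_nat + (1.0 - alpha) * loss_rob
    grads = torch.autograd.grad(loss, masks)
    with torch.no_grad():
        for m, g in zip(masks, grads):
            m.sub_(lr_mask * g).clamp_(0.0, 1.0)
        for d in deltas:
            d.zero_()  # reset perturbations before the next batch

def prune(model, threshold=0.2):
    """Permanently zero out neurons whose learned mask fell below the threshold."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n.endswith("mask"):
                p.copy_((p > threshold).float())
```

In this reading, only the masks and perturbations are trained while the original weights stay frozen: the defender runs `anp_batch` over the small clean set for a few epochs and then calls `prune` to hard-threshold the masks, which is what makes the method workable with roughly 1% of the training data.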
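To connect the defense to the reported metrics, the next sketch measures clean accuracy and attack success rate (ASR), i.e., the fraction of triggered inputs from non-target classes that the model classifies as the attacker's target label. This mirrors how such experiments are typically evaluated, where the experimenter knows the trigger even though the defender never needs it; the data loader, the `stamp_trigger` helper from the earlier sketch, and `target_label` are assumptions for illustration.

```python
import torch

@torch.no_grad()
def evaluate_defense(model, loader, stamp_trigger, target_label, device="cpu"):
    """Report (clean accuracy, attack success rate) for a possibly-pruned model."""
    model.eval()
    clean_correct, clean_total = 0, 0
    asr_hits, asr_total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        # Clean accuracy on unmodified inputs.
        preds = model(images).argmax(dim=1)
        clean_correct += (preds == labels).sum().item()
        clean_total += labels.numel()
        # ASR on triggered versions of inputs not already in the target class.
        keep = labels != target_label
        if keep.any():
            triggered = torch.stack([stamp_trigger(img, target_label)[0]
                                     for img in images[keep]])
            preds_t = model(triggered).argmax(dim=1)
            asr_hits += (preds_t == target_label).sum().item()
            asr_total += keep.sum().item()
    return clean_correct / clean_total, asr_hits / max(asr_total, 1)
```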
Implications and Future Directions
The research has both practical and academic implications. On the practical side, ANP can strengthen trust in outsourced DNN models by helping remove hidden backdoor behavior before deployment, which is particularly important in safety-critical domains such as autonomous driving and healthcare.
From a theoretical perspective, ANP challenges the assumption that removing a backdoor requires recovering the trigger pattern. Instead, it shifts the focus to an intrinsic vulnerability of backdoored models that can be exposed through neuron sensitivity analysis.
Future work might explore:
- Optimization and Efficiency: Further refinement of the ANP method could improve its efficiency, making it viable for large-scale network architectures or real-time applications.
- Extending to Other Types of Perturbations: Investigating whether similar sensitivity arises under other forms of weight or activation perturbation could generalize ANP's principle beyond the specific neuron perturbations studied in the paper.
- Broader Application Spectrum: While ANP is evaluated primarily on vision models, its principles could be extended to other domains, such as large language models or reinforcement learning agents.
In conclusion, the research presents a practical, data-efficient approach to combating insidious backdoor attacks, advancing both the methodology and the broader philosophy of model purification in artificial intelligence.