
Improving Adversarial Transferability via Neuron Attribution-Based Attacks (2204.00008v1)

Published 31 Mar 2022 in cs.LG and cs.CR

Abstract: Deep neural networks (DNNs) are known to be vulnerable to adversarial examples. It is thus imperative to devise effective attack algorithms to identify the deficiencies of DNNs beforehand in security-sensitive applications. To efficiently tackle the black-box setting where the target model's particulars are unknown, feature-level transfer-based attacks propose to contaminate the intermediate feature outputs of local models, and then directly employ the crafted adversarial samples to attack the target model. Due to the transferability of features, feature-level attacks have shown promise in synthesizing more transferable adversarial samples. However, existing feature-level attacks generally employ inaccurate neuron importance estimations, which deteriorates their transferability. To overcome such pitfalls, in this paper, we propose the Neuron Attribution-based Attack (NAA), which conducts feature-level attacks with more accurate neuron importance estimations. Specifically, we first completely attribute a model's output to each neuron in a middle layer. We then derive an approximation scheme of neuron attribution to tremendously reduce the computation overhead. Finally, we weight neurons based on their attribution results and launch feature-level attacks. Extensive experiments confirm the superiority of our approach to the state-of-the-art benchmarks.

Citations (109)

Summary

  • The paper introduces Neuron Attribution-Based Attacks (NAA), a novel method that uses neuron attribution to estimate neuron importance more accurately, enhancing the transferability of adversarial examples.
  • NAA demonstrates superior performance against existing state-of-the-art feature-level attacks, achieving higher attack success rates on both undefended and defended models in empirical experiments.
  • This research provides a new perspective for evaluating DNN vulnerabilities and potentially aids in model interpretability, offering a transferable attack applicable in black-box settings.

Overview of Improving Adversarial Transferability via Neuron Attribution-Based Attacks

This paper introduces a novel method for increasing the transferability of adversarial attacks on deep neural networks (DNNs) by leveraging neuron attribution. The proposed Neuron Attribution-Based Attack (NAA) focuses on crafting adversarial examples that can more effectively transfer across different models, particularly in black-box settings where model parameters and architectures are not accessible.

DNNs are well known to be vulnerable to adversarial examples: inputs altered by perturbations imperceptible to human observers that nevertheless cause incorrect outputs. The ability to transfer such adversarial examples between different DNN architectures, without issuing any queries to the target model, is of crucial importance in real-world applications. To address the main shortcoming of previous feature-level attacks, namely inaccurate neuron importance estimation, the paper proposes estimating neuron importance more precisely using neuron attribution techniques.

Key Contributions

  1. Neuron Attribution Method: The paper repurposes neuron attribution, originally developed for interpreting model decisions, to estimate neuron importance. By completely attributing the model's output to each neuron in a middle layer, it obtains more accurate importance estimates, and a proposed approximation scheme keeps the computation scalable and efficient.
  2. Feature-Level Attack: Using these importance estimates, NAA weights mid-layer neurons by their attribution scores and optimizes a weighted feature-level loss, which improves the transferability of the generated adversarial examples (see the sketch after this list).
  3. Empirical Validation: Extensive experiments demonstrate that NAA outperforms existing state-of-the-art feature-level attacks in terms of transferability on both undefended and defended models. Notably, NAA demonstrates substantial improvements in attack success rates against adversarially trained models and models with advanced defense mechanisms.
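
To make the attribution-weighted attack concrete, here is a minimal, hedged sketch in PyTorch. It assumes a surrogate model split into a front part (input to mid-layer activations) and a back part (activations to logits), a black-image baseline, and an L∞-bounded iterative attack. The helper names (front, back, attribute_neurons, naa_style_attack), the toy architecture, and the hyperparameters are illustrative rather than taken from the paper's implementation, and the paper's exact attribution weighting (for example, its separate handling of positive and negative attributions) may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy surrogate model split at a "middle layer" (illustrative architecture).
front = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())   # input -> mid-layer activations
back = nn.Sequential(nn.Flatten(), nn.Linear(8 * 32 * 32, 10))    # activations -> logits


def attribute_neurons(x, label, baseline, steps=30):
    """Approximate each mid-layer neuron's contribution to the true-class logit.

    Averages the gradient of the logit w.r.t. the mid-layer activations along a
    straight path from `baseline` to `x`, then multiplies by the activation
    change (an integrated-gradients-style estimate of neuron attribution).
    """
    grad_sum = torch.zeros_like(front(x))
    for m in range(1, steps + 1):
        x_m = baseline + (m / steps) * (x - baseline)
        feats = front(x_m)
        logit = back(feats)[0, label]
        grad_sum += torch.autograd.grad(logit, feats)[0]
    delta = front(x) - front(baseline)              # change in each neuron's activation
    return (delta * grad_sum / steps).detach()      # per-neuron attribution weights


def naa_style_attack(x, label, eps=8 / 255, alpha=2 / 255, iters=10):
    """Iteratively perturb x to suppress the most positively attributed neurons."""
    baseline = torch.zeros_like(x)                  # black-image baseline (an assumption)
    weights = attribute_neurons(x, label, baseline)
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        feats = front(x_adv)
        # Attribution-weighted feature loss: minimizing it pushes down exactly
        # the activations that contributed most to the correct prediction.
        loss = (weights * feats).sum()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)
    return x_adv


# Example on a random 32x32 "image" with label 3 (toy data only).
x = torch.rand(1, 3, 32, 32)
x_adv = naa_style_attack(x, label=3)
print((x_adv - x).abs().max())                      # perturbation stays within the eps budget
```

The key design point this sketch illustrates is that the attribution weights are computed once from the clean input and then held fixed, so each attack iteration only needs a single forward and backward pass through the surrogate, which is what makes the approximation computationally cheap.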

Experimental Results

The experimental results highlight NAA's superiority, with high attack success rates across multiple target models. In particular, NAA surpasses other feature-level attacks such as NRDM, FDA, and FIA in both white-box and black-box settings. When combined with input transformation techniques such as DIM and PIM, the resulting NAA-PD variant further amplifies transferability and remains effective against recent adversarial defenses.
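
For context on how such input transformations are typically combined with feature-level attacks, below is a hedged, shape-preserving sketch of a DIM-style diverse-input transform. The size range and probability are illustrative and scaled to the 32x32 toy inputs of the earlier sketch (the original DIM resizes ImageNet-sized inputs onto a larger padded canvas), and PIM is not shown.

```python
import torch
import torch.nn.functional as F


def diverse_input(x, low=29, high=33, prob=0.7):
    """Shape-preserving DIM-style transform: with probability `prob`, randomly
    resize the image and zero-pad it back to its original spatial size."""
    if torch.rand(1).item() > prob:
        return x                                      # sometimes keep the clean input
    size = torch.randint(low, high, (1,)).item()      # random target size
    resized = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    pad_total = x.shape[-1] - size
    left = torch.randint(0, pad_total + 1, (1,)).item()
    top = torch.randint(0, pad_total + 1, (1,)).item()
    return F.pad(resized, (left, pad_total - left, top, pad_total - top))
```

In the attack loop from the earlier sketch, one would replace `feats = front(x_adv)` with `feats = front(diverse_input(x_adv))`, so each iteration sees a slightly different view of the input and the resulting perturbation overfits less to the surrogate model.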

Implications and Future Directions

This research provides a compelling method for understanding and exploiting the inner workings of DNNs via neuron attribution, a perspective that may lead to more transparent and efficient ways to assess model vulnerabilities. Practically, it strengthens DNN robustness evaluation processes by offering a transferable attack model applicable in scenarios with limited model access.

Theoretically, the framework suggests that neuron attribution can serve as a basis for various tasks beyond adversarial attacks, potentially aiding in model interpretability and debugging. Future work might explore deeper integration with model defense strategies, adjusting neuron attribution techniques under adversarial settings to aid defensive measures.

In summary, this paper contributes a significant advance in adversarial machine learning, deepening the understanding and exploitation of a network's internal representations to craft highly transferable adversarial examples, and marking a step forward in the continuing arms race of adversarial AI research.