Exploring the Space of Black-box Attacks on Deep Neural Networks (1712.09491v1)

Published 27 Dec 2017 in cs.LG, cs.CR, and cs.CV

Abstract: Existing black-box attacks on deep neural networks (DNNs) so far have largely focused on transferability, where an adversarial instance generated for a locally trained model can "transfer" to attack other learning models. In this paper, we propose novel Gradient Estimation black-box attacks for adversaries with query access to the target model's class probabilities, which do not rely on transferability. We also propose strategies to decouple the number of queries required to generate each adversarial sample from the dimensionality of the input. An iterative variant of our attack achieves close to 100% adversarial success rates for both targeted and untargeted attacks on DNNs. We carry out extensive experiments for a thorough comparative evaluation of black-box attacks and show that the proposed Gradient Estimation attacks outperform all transferability based black-box attacks we tested on both MNIST and CIFAR-10 datasets, achieving adversarial success rates similar to well known, state-of-the-art white-box attacks. We also apply the Gradient Estimation attacks successfully against a real-world Content Moderation classifier hosted by Clarifai. Furthermore, we evaluate black-box attacks against state-of-the-art defenses. We show that the Gradient Estimation attacks are very effective even against these defenses.

Authors (4)
  1. Arjun Nitin Bhagoji (25 papers)
  2. Warren He (8 papers)
  3. Bo Li (1107 papers)
  4. Dawn Song (229 papers)
Citations (65)

Summary

Overview of Black-box Attacks on Deep Neural Networks

In the presented paper, the authors offer a thorough exploration of black-box attacks on deep neural networks (DNNs), emphasizing the development of novel Gradient Estimation attacks. These attacks mark a departure from traditional black-box attacks that rely on transferability, illustrating that adversaries can achieve high success rates with mere query access to the target model's class probabilities. The research systematically evaluates these attacks on well-known datasets, notably MNIST and CIFAR-10, and against real-world classifiers such as Clarifai's Content Moderation model. A striking highlight is the formulation of query-reduction strategies that decouple the number of queries from the input dimensionality, thereby substantially improving the attack's feasibility and efficiency.

Key Contributions

  1. Novel Gradient Estimation Attacks: The paper introduces Gradient Estimation black-box attacks that rely only on query access to the target model's class probabilities. These attacks estimate gradients through finite differences without explicit knowledge of the model architecture or training data, approaching the efficacy of white-box attacks (a minimal sketch follows this list).
  2. Query-reduction Strategies: The authors propose query-reduction strategies based on random feature grouping and principal component analysis (PCA). These techniques markedly reduce the number of queries needed for gradient estimation while sustaining adversarial success rates comparable to iterative white-box attacks (see the grouping sketch below).
  3. Evaluation Across Datasets and Models: A comprehensive assessment is conducted of the attacks' effectiveness against state-of-the-art models trained on MNIST and CIFAR-10. The experiments show that the devised attacks outperform existing transferability-based black-box attacks.
  4. Real-world System Attacks: The Gradient Estimation attacks are successfully tested against Clarifai's NSFW and Content Moderation classifiers, demonstrating practical exploit potential. This underscores the vulnerability of deployed systems to intelligently crafted adversarial inputs generated without prior access to training data.
  5. Robustness Against Defenses: Even against modern adversarial defenses (standard adversarial training, ensemble adversarial training, and iterative adversarial training), the Gradient Estimation attacks remain effective. While adversarial training confers some resilience, models hardened with conventional single-step adversarial training remain significantly susceptible to the iterative Gradient Estimation attacks.
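
The finite-difference estimation behind the first contribution can be illustrated with a short sketch. This is a minimal, simplified version rather than the authors' implementation: it assumes a hypothetical helper query_probs(x) that returns the target model's class-probability vector for a single input, and it uses the log-probability of the true class as the loss (the paper also considers a logit-based loss).

```python
import numpy as np

def estimate_gradient_fd(query_probs, x, y, delta=1.0):
    """Two-sided finite-difference estimate of the gradient of log p_y(x).

    Perturbing every input dimension separately costs 2 * d queries for a
    d-dimensional input, which is the cost the query-reduction strategies
    are designed to avoid.
    """
    x_flat = x.ravel().astype(float)
    grad = np.zeros_like(x_flat)
    for i in range(x_flat.size):
        e = np.zeros_like(x_flat)
        e[i] = delta
        p_plus = query_probs((x_flat + e).reshape(x.shape))[y]
        p_minus = query_probs((x_flat - e).reshape(x.shape))[y]
        grad[i] = (np.log(p_plus + 1e-12) - np.log(p_minus + 1e-12)) / (2.0 * delta)
    return grad.reshape(x.shape)

def untargeted_step(query_probs, x, y, eps=0.3):
    """One FGSM-style step with the estimated gradient: move against the
    gradient of log p_y(x) to lower the true class's probability."""
    grad = estimate_gradient_fd(query_probs, x, y)
    return np.clip(x - eps * np.sign(grad), 0.0, 1.0)
```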

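The query-reduction idea from the second contribution can be sketched similarly. Features are randomly partitioned into groups and one two-sided query pair is spent per group, so the query count depends on the number of groups rather than the input dimensionality. The group count and smoothing constant below are illustrative, the PCA-based variant is omitted, and query_probs is the same hypothetical helper as above.

```python
import numpy as np

def estimate_gradient_grouped(query_probs, x, y, num_groups=100, delta=1.0):
    """Finite differences over random groups of features.

    Spends 2 * num_groups queries regardless of the input dimension,
    assigning each group's directional-derivative estimate to all of its
    features.
    """
    x_flat = x.ravel().astype(float)
    grad = np.zeros_like(x_flat)
    groups = np.array_split(np.random.permutation(x_flat.size), num_groups)
    for idx in groups:
        direction = np.zeros_like(x_flat)
        direction[idx] = 1.0 / np.sqrt(len(idx))  # unit vector over the group
        p_plus = query_probs((x_flat + delta * direction).reshape(x.shape))[y]
        p_minus = query_probs((x_flat - delta * direction).reshape(x.shape))[y]
        grad[idx] = (np.log(p_plus + 1e-12) - np.log(p_minus + 1e-12)) / (2.0 * delta)
    return grad.reshape(x.shape)
```
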
Results and Implications

Quantitative results reveal that the iterative Gradient Estimation attacks achieve close to 100% adversarial success rates, both targeted and untargeted, across the evaluated datasets, paralleling the performance of white-box techniques. Notably, the success rates against real-world models are achieved with relatively few queries, as low as approximately 200 per image, affirming the attacks' practical viability (an iterative loop of this kind is sketched below).
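
As a rough illustration of how the iterative variant keeps the query budget low, the grouped estimator from the sketch above can be applied repeatedly with small steps inside an L-infinity ball. The step size, number of steps, and group count below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def iterative_untargeted_attack(query_probs, x, y, eps=0.3, alpha=0.01,
                                steps=10, num_groups=8):
    """PGD-style iterative attack driven by grouped gradient estimates.

    Each step spends 2 * num_groups queries (plus one to check success),
    so the total budget is set by steps and num_groups, not the input size.
    Reuses estimate_gradient_grouped from the previous sketch.
    """
    x_adv = x.astype(float).copy()
    for _ in range(steps):
        grad = estimate_gradient_grouped(query_probs, x_adv, y,
                                         num_groups=num_groups)
        x_adv = x_adv - alpha * np.sign(grad)       # descend on log p_y(x)
        x_adv = np.clip(x_adv, x - eps, x + eps)    # stay within the L_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid pixel range
        if np.argmax(query_probs(x_adv)) != y:      # stop once misclassified
            break
    return x_adv
```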

This research marks a critical step forward in understanding DNN vulnerabilities in black-box settings and carries implications for AI security measures. With adversarial robustness being crucial for AI deployed in sensitive applications, the findings strengthen the case for developing hardened AI defenses against an evolving landscape of attack mechanisms.

Future Directions

The exploration presented establishes groundwork for further work on adaptive learning systems that integrate real-time adversarial detection and mitigation. The findings also motivate further optimization of attack strategies, refining query and computational efficiency, and the design of novel defenses against such gradient-estimation-based methods.

The paper offers meaningful insights into the extent of threats posed by adversarial samples and delineates the dynamic interplay between attacker strategies and defense mechanisms in AI, highlighting the perpetual need for algorithmic evolution to safeguard technological advancements.