Practical Black-Box Attacks against Machine Learning (1602.02697v4)

Published 8 Feb 2016 in cs.CR and cs.LG

Abstract: Machine learning (ML) models, e.g., deep neural networks (DNNs), are vulnerable to adversarial examples: malicious inputs modified to yield erroneous model outputs, while appearing unmodified to human observers. Potential attacks include having malicious content like malware identified as legitimate or controlling vehicle behavior. Yet, all existing adversarial example attacks require knowledge of either the model internals or its training data. We introduce the first practical demonstration of an attacker controlling a remotely hosted DNN with no such knowledge. Indeed, the only capability of our black-box adversary is to observe labels given by the DNN to chosen inputs. Our attack strategy consists in training a local model to substitute for the target DNN, using inputs synthetically generated by an adversary and labeled by the target DNN. We use the local substitute to craft adversarial examples, and find that they are misclassified by the targeted DNN. To perform a real-world and properly-blinded evaluation, we attack a DNN hosted by MetaMind, an online deep learning API. We find that their DNN misclassifies 84.24% of the adversarial examples crafted with our substitute. We demonstrate the general applicability of our strategy to many ML techniques by conducting the same attack against models hosted by Amazon and Google, using logistic regression substitutes. They yield adversarial examples misclassified by Amazon and Google at rates of 96.19% and 88.94%. We also find that this black-box attack strategy is capable of evading defense strategies previously found to make adversarial example crafting harder.

Practical Black-Box Attacks against Machine Learning

The paper "Practical Black-Box Attacks against Machine Learning" by Nicolas Papernot et al. introduces a method for crafting adversarial examples against machine learning models, such as deep neural networks (DNNs), in scenarios where the adversary has no knowledge of the model internals or training data. It addresses a key limitation of prior attacks, which required detailed knowledge of either the model internals or its training data.

Summary of the Methodology

The attack strategy assumes that the adversary can only observe the labels the targeted DNN assigns to chosen inputs. The approach consists of two main steps:

  1. Substitute Model Training: The adversary trains a local substitute model that approximates the target DNN’s decision boundaries. The substitute's training set is synthetic: inputs are selected by a Jacobian-based dataset augmentation heuristic and labeled by querying the target DNN.
  2. Adversarial Sample Crafting: The adversary uses the trained substitute to craft adversarial examples, which are expected to transfer to, and mislead, the target DNN because the two models learn similar decision boundaries (see the sketch after this list).
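The following is a minimal sketch of this two-step loop, not the paper's implementation: a small softmax-regression model stands in for the substitute, `query_oracle` is a hypothetical placeholder for the remote label-only API, augmentation follows the Jacobian-based heuristic S_{rho+1} = S_rho ∪ {x + λ·sgn(J_F[O(x)](x))}, and crafting uses the fast gradient sign method. Parameter values (λ, ε, iteration counts) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class SoftmaxSubstitute:
    """Tiny softmax-regression substitute F(x) = softmax(Wx + b)."""
    def __init__(self, n_features, n_classes, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_classes, n_features))
        self.b = np.zeros(n_classes)
        self.lr = lr

    def probs(self, X):
        return softmax(X @ self.W.T + self.b)

    def fit(self, X, y, epochs=50):
        Y = np.eye(self.b.size)[y]                  # one-hot oracle labels
        for _ in range(epochs):
            P = self.probs(X)
            grad_W = (P - Y).T @ X / len(X)         # cross-entropy gradient wrt W
            grad_b = (P - Y).mean(axis=0)
            self.W -= self.lr * grad_W
            self.b -= self.lr * grad_b

    def jacobian_row(self, x, c):
        """dF_c/dx for a softmax model: F_c * (W_c - sum_j F_j W_j)."""
        p = self.probs(x[None])[0]
        return p[c] * (self.W[c] - p @ self.W)

    def loss_grad_x(self, x, y):
        """Gradient of the cross-entropy loss wrt the input, used for FGSM."""
        p = self.probs(x[None])[0]
        p[y] -= 1.0
        return self.W.T @ p

def jacobian_augmentation(model, X, oracle_labels, lam=0.1):
    """S_{rho+1} = S_rho  U  {x + lam * sign(J_F[oracle(x)](x)) : x in S_rho}."""
    new_points = np.array([
        x + lam * np.sign(model.jacobian_row(x, c))
        for x, c in zip(X, oracle_labels)
    ])
    return np.vstack([X, new_points])

def fgsm(model, x, y, eps=0.25):
    """Fast gradient sign method applied to the local substitute."""
    return np.clip(x + eps * np.sign(model.loss_grad_x(x, y)), 0.0, 1.0)

def query_oracle(X):
    # Hypothetical stand-in for the remote black-box API, which returns
    # only labels for the queried inputs (here: a simple threshold rule).
    return (X.sum(axis=1) > X.shape[1] / 2).astype(int)

X = rng.random((150, 20))                 # small initial substitute set S_0
sub = SoftmaxSubstitute(n_features=20, n_classes=2)
for rho in range(4):                      # substitute training iterations
    y = query_oracle(X)                   # label S_rho via the oracle
    sub.fit(X, y)
    X = jacobian_augmentation(sub, X, y)  # grow the synthetic dataset

x0 = X[0]
x_adv = fgsm(sub, x0, query_oracle(x0[None])[0])
print("oracle label before/after:",
      query_oracle(x0[None])[0], query_oracle(x_adv[None])[0])
```

In the paper the substitute is itself a DNN and the oracle is a production API such as MetaMind's, but the structure of the loop is the same: query, train, augment, then craft on the substitute and rely on transferability.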

Empirical Evaluation

The paper demonstrates the practicality and effectiveness of the method by attacking classifiers remotely hosted by MetaMind, Amazon, and Google. The attack against MetaMind's hosted DNN classifier yielded a misclassification rate of 84.24% for adversarial examples crafted with the local substitute. Similarly high misclassification rates of 96.19% and 88.94% were observed for the models hosted by Amazon and Google, respectively.

Key Findings and Numerical Results

  • General Applicability: The attack strategy transfers across machine learning techniques, not just DNNs. For example, a logistic regression substitute was used to attack logistic regression, SVM, decision tree, and nearest-neighbor targets.
  • Refinements: Periodic step sizes and reservoir sampling during substitute training significantly reduce the number of queries made to the target DNN, making the attack more practical (a short reservoir sampling sketch follows this list).
  • Defense Evasion: Even models hardened with adversarial training or defensive distillation were still successfully attacked by the proposed black-box method.
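As a rough illustration of the reservoir sampling refinement, the sketch below uses standard Algorithm R to keep a fixed-size uniform subset of the newly augmented points, so that only those points are sent to the oracle for labeling in each iteration. The `kappa` name follows the paper's κ parameter; the helper itself is illustrative, not the paper's code.

```python
import random

def reservoir_sample(candidates, kappa, seed=None):
    """Algorithm R: keep a uniform random sample of `kappa` items from a
    stream of candidate points; only these are queried against the oracle."""
    prng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(candidates):
        if i < kappa:
            reservoir.append(x)
        else:
            j = prng.randint(0, i)   # uniform index in [0, i], inclusive
            if j < kappa:
                reservoir[j] = x
    return reservoir

# e.g. label only 200 of the newly augmented points per iteration:
# subset = reservoir_sample(new_points, kappa=200)
```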

Implications and Future Work

The theoretical and practical implications of this research are substantial. The ability to successfully attack black-box models without in-depth knowledge or access to training data indicates a significant vulnerability in current machine learning systems, especially those offered as services by major platforms. This necessitates the exploration of more robust defense mechanisms beyond gradient masking, potentially involving fundamental redesigns in how models are trained and evaluated against adversarial inputs.

Future developments in artificial intelligence will need to account for these vulnerabilities, exploring not only new defensive techniques but also continuously evolving attack strategies. The insights garnered from this paper suggest that defense mechanisms should aim at creating robustness against finite perturbations rather than focusing solely on infinitesimal perturbations.

Overall, this research contributes a practical and scalable method for real-world adversaries to exploit machine learning models, urging the community to advance towards more resilient AI systems.

Authors (6)
  1. Nicolas Papernot (123 papers)
  2. Patrick McDaniel (70 papers)
  3. Ian Goodfellow (54 papers)
  4. Somesh Jha (112 papers)
  5. Z. Berkay Celik (23 papers)
  6. Ananthram Swami (97 papers)
Citations (3,519)