- The paper shows that adversarial examples arise from non-robust features inherent in datasets, rather than being mere anomalies.
- It demonstrates empirically that standard (non-adversarial) training on a dataset restricted to robust features yields models with substantially improved adversarial robustness.
- The study formalizes a framework distinguishing robust from non-robust features, providing insights into adversarial transferability across models.
Overview of "Adversarial Examples Are Not Bugs, They Are Features"
The paper "Adversarial Examples Are Not Bugs, They Are Features" by Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry challenges the prevailing perspective on adversarial examples in machine learning. Rather than viewing these perturbations as anomalous or merely statistical artifacts, the authors posit that adversarial examples stem from the inherent properties of the datasets. They suggest that these examples exploit non-robust features that, although useful for classification, are brittle and incomprehensible to humans.
Key Insights and Claims
- Non-Robust Features: The authors introduce the concept of non-robust features, which are patterns in the data that are predictive but can be manipulated easily by adversaries. These features contrast with robust features that remain stable under perturbations.
- Empirical Demonstrations: By generating new datasets, the authors empirically separate robust and non-robust features. They show that:
- Training on a dataset filtered for robust features leads to classifiers with significantly improved robustness.
- Non-robust features alone can be sufficiently predictive to classify standard test sets accurately.
- Conceptual Framework and Formalization: The authors give formal definitions of useful, robust, and non-robust features (restated below) and use theoretical models to illustrate how prevalent non-robust features are in standard datasets. They argue that the presence of shared non-robust features explains why adversarial examples transfer so consistently across different models.
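The paper's formal definitions can be restated compactly. A feature is any function f mapping inputs to the reals, labels are taken as y ∈ {−1, +1}, and Δ(x) denotes the set of allowed perturbations (this is a simplified paraphrase, not a verbatim quotation of the paper):

```latex
% A feature f is rho-useful (rho > 0) if it correlates with the label in expectation:
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, y \cdot f(x) \,\big] \;\ge\; \rho

% A feature f is gamma-robustly useful if the correlation survives worst-case perturbations:
\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \,\Big] \;\ge\; \gamma
```

A non-robust feature is then one that is ρ-useful for some ρ > 0 but not γ-robustly useful for any γ ≥ 0: it carries genuine predictive signal, yet an adversary restricted to Δ(x) can make it anti-correlated with the true label.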
Experiments and Results
Robust Dataset Construction
- Method: They create a "robust" version of the training set by optimizing each input so that it retains only the features used by a robust classifier (one obtained via adversarial training); a sketch of the construction follows this list.
- Outcome: Models trained on these robust datasets with purely standard training methods retain good standard accuracy and attain non-trivial adversarial robustness.
- Performance: On CIFAR-10, a model trained on the robust dataset achieved 85.4% accuracy, with 21.85% robust accuracy under ℓ2 perturbations of ϵ=0.5.
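A minimal PyTorch sketch of the construction, assuming a pretrained adversarially trained `robust_model` whose penultimate-layer representation is exposed as `robust_model.features` (that attribute, along with the step count and learning rate, is an assumption for illustration, not the paper's released code):

```python
import torch

def robustify(x, x_init, robust_model, steps=1000, lr=0.1):
    """Produce one example of the 'robust' dataset: start from an unrelated
    seed image x_init and optimize it so that its representation under the
    robust model matches that of the original image x. The result keeps x's
    original label."""
    robust_model.eval()
    with torch.no_grad():
        target_rep = robust_model.features(x.unsqueeze(0))   # representation to match

    x_r = x_init.clone().unsqueeze(0).requires_grad_(True)
    opt = torch.optim.SGD([x_r], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = (robust_model.features(x_r) - target_rep).pow(2).sum()  # L2 distance in feature space
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_r.clamp_(0, 1)                                  # stay in valid image range

    return x_r.detach().squeeze(0)
```

Because only the features of an adversarially trained model are matched, the resulting image should carry little of the original's non-robust signal, which is what lets standard training on the new dataset inherit robustness.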
Non-Robust Features
- Method: The authors construct datasets in which the only useful input-label association comes from non-robust features: each input is adversarially perturbed toward a target label (chosen either at random or as a deterministic function of the original label) and then relabeled with that target; a sketch follows this list.
- Outcome: Models trained on these datasets generalize well to the original test set, indicating that non-robust features are inherently predictive.
- Performance: Classifiers trained on these datasets achieve up to 63.3% accuracy on CIFAR-10 and 87.9% on Restricted ImageNet under the standard test setting.
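A hedged sketch of one way to build such an example with an L2-constrained targeted attack (the standard model `std_model`, ε, and step sizes are illustrative assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn.functional as F

def make_nonrobust_example(x, target_label, std_model, eps=0.5, steps=100, step_size=0.1):
    """Perturb x within an L2 ball of radius eps so that a standard classifier
    assigns it target_label, then relabel the result as target_label. For the
    'random' dataset the target is drawn uniformly; for the 'deterministic'
    dataset it is a fixed function of the original label."""
    std_model.eval()
    x0 = x.unsqueeze(0)
    t = torch.tensor([target_label])
    delta = torch.zeros_like(x0, requires_grad=True)

    for _ in range(steps):
        loss = F.cross_entropy(std_model(x0 + delta), t)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= step_size * grad / (grad.norm() + 1e-12)   # descend toward the target class
            norm = delta.norm()
            if norm > eps:                                      # project back into the L2 ball
                delta *= eps / norm

    return (x0 + delta).clamp(0, 1).detach().squeeze(0), target_label
```

To a human the perturbed image still looks like its original class, so the new label is "wrong" by human standards; the fact that models trained on such pairs generalize to the clean test set is the paper's evidence that the non-robust signal is real.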
Adversarial Transferability
- Observation: Architectures that learn the non-robust datasets well (i.e., reach higher test accuracy when trained on them) are also the most susceptible to transfer attacks from the original model, supporting the notion that transferability arises from models picking up the same non-robust features (a minimal transfer probe is sketched after this list).
- Implication: This insight sheds light on why adversarial examples crafted for one model often succeed against another.
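A simple way to probe this is to craft adversarial examples against one model and measure how often they fool another; a hedged sketch (the models, data loader, and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def transfer_success_rate(source_model, target_model, loader, eps=0.5, steps=20, step_size=0.1):
    """Craft untargeted L2 PGD examples against source_model and report the
    fraction that are misclassified by target_model."""
    source_model.eval()
    target_model.eval()
    fooled, total = 0, 0
    for x, y in loader:
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(steps):
            loss = F.cross_entropy(source_model(x + delta), y)
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                g_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
                delta += step_size * grad / g_norm              # ascend the source model's loss
                d_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
                delta *= (eps / d_norm).clamp(max=1.0)          # project into the L2 ball
        with torch.no_grad():
            preds = target_model((x + delta).clamp(0, 1)).argmax(dim=1)
        fooled += (preds != y).sum().item()
        total += y.numel()
    return fooled / total
```

Under the paper's account, this rate should track how much the two models rely on the same non-robust features, which is why architectures that learn the non-robust dataset well are also easy transfer targets.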
Theoretical Contributions
- Robust Features Model:
- In a simple Gaussian classification setting (formalized in the sketch after this list), the authors show that robust training shifts the model away from non-robust directions and toward features that remain predictive under perturbation, bringing it closer to human-aligned features.
- Mathematical formalism is provided to quantify the relationship between adversarial vulnerability and the nature of features learned, demonstrating the misalignment between the intrinsic data geometry and the adversary's perturbation constraints.
- Gradient Interpretability:
- Loss gradients of robustly trained models, taken with respect to the input, align more closely with the direction between class means, making them perceptually meaningful and easier to interpret.
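The theoretical setting can be stated compactly (a simplified paraphrase of the paper's maximum-likelihood formulation; ℓ denotes the Gaussian negative log-likelihood and ε the adversary's budget):

```latex
% Data model: binary labels with Gaussian class-conditionals
y \sim \mathrm{Uniform}\{-1, +1\}, \qquad x \mid y \;\sim\; \mathcal{N}\!\left(y \cdot \mu^{*},\, \Sigma^{*}\right)

% Standard training: maximum likelihood estimation of (mu, Sigma)
\min_{\mu,\, \Sigma} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, \ell(x;\, y \cdot \mu,\, \Sigma) \,\big]

% Robust training: the same objective under worst-case L2 perturbations
\min_{\mu,\, \Sigma} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \max_{\|\delta\|_{2} \le \varepsilon} \ell(x + \delta;\, y \cdot \mu,\, \Sigma) \,\Big]
```

Qualitatively, the standard solution inherits the data's own geometry (the metric induced by Σ*), while the robust objective pulls the learned geometry toward the adversary's ℓ2 ball; the mismatch between the two metrics is what the adversary exploits, and shrinking it is also what makes the robust model's loss gradients line up with the direction between the class means.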
Implications and Future Directions
The paper has significant implications:
- Model Training: Because standard training will exploit any predictive feature, robust or not, robustness has to be built into the training objective itself (e.g., via adversarial training) rather than expected to emerge from standard methods.
- Interpretability: Since models can legitimately rely on predictive features that humans cannot perceive, interpretability methods must account for the role of these non-robust features rather than assume models attend to what humans see.
- Transferability Understanding: Framing transferability as a consequence of shared non-robust features could drive new defensive strategies that focus on which features models learn rather than on the mechanics of individual attacks.
The authors' work leads to a more nuanced understanding of adversarial machine learning. It points out the necessity of involving human priors directly in the learning process to ensure both robustness and interpretability. Future research could focus on formalizing the explicit encoding of such priors and studying their impact on broader AI applications.