- The paper identifies the limitations of single-step adversarial training, showing that models converge to suboptimal minima which weaken defenses.
- The paper introduces Ensemble Adversarial Training, which augments training with adversarial examples from diverse pre-trained models to prevent overfitting.
- Experimental results on ImageNet and MNIST demonstrate significantly improved robustness and reduced error rates against black-box adversarial attacks.
Ensemble Adversarial Training: Attacks and Defenses
The paper "Ensemble Adversarial Training: Attacks and Defenses" tackles key limitations in the robustness of ML models against adversarial attacks. Adversarial examples, which are inputs intentionally perturbed to mislead ML models, pose significant threats to model reliability and security. The authors emphasize that adversarial training, a defense that injects adversarial examples into the training data, becomes fragile when scaled to large models and datasets such as ImageNet. Specifically, adversarially trained models tend to converge to degenerate minima, leaving them with poor robustness against even simple black-box attacks. The paper introduces a new training strategy, Ensemble Adversarial Training, to mitigate these issues.
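As background for the discussion that follows, adversarial training is commonly framed as the robust-optimization problem below. This is the standard min-max formulation rather than a statement of the paper's exact training objective, which follows prior work in mixing clean and adversarial examples during training.

```latex
% Standard robust-optimization view of adversarial training: minimize the
% worst-case loss within an l_infinity ball of radius epsilon around each
% input. Stated as general background, not as the paper's exact objective.
\[
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[ \max_{\|\delta\|_{\infty} \le \epsilon} L(\theta,\, x + \delta,\, y) \Big]
\]
```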
Adversarial Training and Its Limitations
The paper begins by establishing the context around adversarial examples and adversarial training. Models exposed to adversarial examples during training show increased robustness to similar white-box attacks; however, this approach has proven difficult to scale to large datasets such as ImageNet. Prior work has shown that models trained on adversarial examples from fast, single-step methods (e.g., FGSM) learn to produce only weak perturbations against themselves, a form of gradient masking that gives a false sense of security while leaving the models vulnerable to stronger iterative attacks and to black-box attacks.
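For concreteness, a minimal FGSM sketch in PyTorch is shown below; the model, labels, loss, and epsilon are placeholders rather than the paper's exact configuration, and the snippet only illustrates the single-step attack on which this style of adversarial training relies.

```python
# Minimal FGSM sketch (PyTorch). Placeholders: `model`, `x`, `y`, `eps`.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step attack: linearize the loss at x and step eps in the
    sign of the gradient (L-infinity constraint)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).detach()
```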
The authors argue and empirically demonstrate that adversarial training relying on single-step attacks leads to degenerate minima: the loss surface near the data points develops sharp curvature, which invalidates the linear approximation that single-step methods depend on. Consequently, adversarially trained models remain susceptible to black-box attacks crafted on undefended models, and to a new attack, R+FGSM, which prepends a small random perturbation to the input before taking the usual gradient-sign step, stepping out of the sharply curved region where the linear approximation breaks down. These findings highlight the limits of single-step adversarial training for building resilient models.
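The R+FGSM variant can be sketched in the same style; how the budget is split between the random step and the gradient step (the paper reportedly uses alpha = eps/2) is shown here only as an illustrative choice.

```python
# R+FGSM sketch (PyTorch): random step of size alpha, then a gradient-sign
# step using the remaining budget eps - alpha. Placeholders as before.
import torch
import torch.nn.functional as F

def rand_fgsm(model, x, y, eps, alpha):
    # Small random step to escape the sharply curved region around x.
    x_rand = (x + alpha * torch.randn_like(x).sign()).detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_rand), y)
    grad, = torch.autograd.grad(loss, x_rand)
    return (x_rand + (eps - alpha) * grad.sign()).detach()
```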
Ensemble Adversarial Training
In response to these shortcomings, the authors propose Ensemble Adversarial Training. The technique augments the training data with adversarial examples transferred from a diverse set of pre-trained models. Because the perturbations are computed on external, static models, the generation of adversarial examples is decoupled from the model being trained. This decoupling prevents the model from degenerately adapting to the perturbations produced by its own gradients, thereby improving robustness against a range of unseen black-box attacks.
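A schematic training step is sketched below, assuming the fgsm helper from the earlier snippet; the per-batch choice of attack source and the 50/50 clean/adversarial weighting are illustrative assumptions, not the paper's exact schedule.

```python
# Schematic Ensemble Adversarial Training step. Assumes a single-step
# attack function such as the `fgsm` helper sketched earlier.
import random
import torch.nn.functional as F

def ensemble_adv_train_step(model, static_models, optimizer, attack, x, y, eps):
    model.train()
    # Source of this batch's adversarial examples: either the model being
    # trained or one of the static pre-trained models.
    source = random.choice([model] + list(static_models))
    x_adv = attack(source, x, y, eps)
    optimizer.zero_grad()
    # Train on a mix of clean and adversarial inputs (illustrative 50/50 mix).
    loss = 0.5 * (F.cross_entropy(model(x), y) +
                  F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```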
Experimental Validation
The effectiveness of Ensemble Adversarial Training is validated through a comprehensive suite of experiments on the ImageNet and MNIST datasets. On ImageNet, models trained with Ensemble Adversarial Training achieve markedly higher robustness to black-box attacks, losing substantially less accuracy than standard adversarially trained models. For example, an Inception ResNet v2 model trained with Ensemble Adversarial Training shows a significant reduction in error rates against transferred attacks, maintaining robust accuracy even under strong black-box adversarial conditions.
Similar trends are observed on MNIST, where the models show resilience to a range of black-box attacks, with some remaining limitations against highly sophisticated transfer-based attacks. The results underscore the method's ability to generalize across different model architectures and datasets.
Theoretical Implications and Future Directions
The paper provides theoretical grounding by situating Ensemble Adversarial Training within the framework of domain adaptation, yielding generalization bounds on the error against future static black-box adversaries. On the attack side, the authors also analyze transferability through gradient-aligned adversarial subspaces, using a combinatorial construction of orthogonal perturbation directions to estimate the dimensionality of the adversarial subspace around data points; this analysis spans varying threat models and helps explain why adversarial examples transfer so readily between models.
Future work on AI robustness could extend Ensemble Adversarial Training to incorporate more diverse and complex adversarial examples, including those generated by generative models or interactive black-box methods. Moreover, scaling the technique to more complex vision and NLP tasks could solidify its practicality and industry adoption. As adversarial tactics advance, continually evolving defense mechanisms like Ensemble Adversarial Training will play a pivotal role in ensuring the security and reliability of ML systems.
In conclusion, "Ensemble Adversarial Training: Attacks and Defenses" presents a significant advancement in the defensive methodologies against adversarial examples, showcasing empirical success and paving the way for future research in robust AI development.