- The paper introduces a convex outer approximation of the adversarial polytope, enabling provable defense guarantees for ReLU classifiers against norm-bounded perturbations.
- It employs a dual formulation that mirrors backpropagation, allowing efficient computation of tight activation bounds during training.
- Experimental results on MNIST and other datasets show that the method achieves provable robustness; for example, the certified MNIST classifier has less than 5.8% test error against any adversarial attack with $\ell_\infty$ perturbations bounded by $\epsilon = 0.1$.
Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope
In the rapidly evolving field of machine learning, the robustness of deep neural networks against adversarial examples remains a critical issue. The paper "Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope" by Eric Wong and J. Zico Kolter addresses this pressing challenge by proposing a method to train ReLU-based classifiers that are provably robust to norm-bounded adversarial perturbations.
Summary of Key Contributions
The authors offer a method that provides guaranteed robustness through a convex outer approximation of the so-called "adversarial polytope." Their contributions can be summarized as follows:
- Convex Approximation of the Adversarial Polytope:
- The approach constructs a convex outer bound on the set of all final-layer activations reachable by a norm-bounded perturbation of the input. For ReLU networks this set (the "adversarial polytope") is non-convex and difficult to optimize over exactly; replacing it with a convex outer approximation turns the certification problem into a tractable linear program.
- Dual Formulation:
- A significant theoretical contribution is the dual formulation of the optimization problem over this convex relaxation. Intriguingly, the authors show that a feasible dual solution, which already yields a valid robustness bound, can be computed by a feedforward pass through a network closely resembling the backpropagation network, allowing efficient computation within existing deep learning frameworks.
- Efficient Computation of Bounds:
- The relaxation requires elementwise lower and upper bounds on each layer's pre-activations; the method computes these bounds layer by layer with the same dual machinery, keeping certification cheap enough to run inside the training loop (a simplified sketch follows this list).
- Training with Provable Guarantees:
- By minimizing a provable upper bound on the worst-case loss during training (a robust optimization objective), the resulting classifiers are not only empirically more robust but also carry certificates that no adversarial attack within the specified norm bound can exceed the reported robust error.
- Experimental Validation:
- The effectiveness and scalability of the proposed method are demonstrated across several tasks, including MNIST, Fashion-MNIST, Human Activity Recognition (HAR), and Street View House Numbers (SVHN). In particular, on MNIST the authors produce a convolutional network that achieves less than 5.8% test error against any adversarial attack with $\ell_\infty$-bounded perturbations of size $\epsilon = 0.1$.
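To make the bound-propagation idea concrete, here is a minimal sketch in PyTorch. It substitutes plain interval arithmetic for the paper's tighter linear (dual) ReLU relaxation, so the bounds are looser than the authors', but they arise in the same layer-by-layer fashion; the function names (`propagate_bounds`, `certified`) are illustrative and not part of the authors' released code.

```python
import torch
import torch.nn as nn

def propagate_bounds(layers, x, eps):
    """Elementwise lower/upper bounds on each layer's outputs, valid for every
    input in the l_inf ball of radius eps around x (interval arithmetic)."""
    lb, ub = x - eps, x + eps
    for layer in layers:
        if isinstance(layer, nn.Linear):
            center, radius = (lb + ub) / 2, (ub - lb) / 2
            mid = center @ layer.weight.t() + layer.bias
            rad = radius @ layer.weight.abs().t()
            lb, ub = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lb, ub = lb.clamp(min=0), ub.clamp(min=0)
        else:
            raise NotImplementedError(f"unsupported layer: {layer}")
    return lb, ub  # bounds on the final-layer logits

def certified(lb, ub, y):
    """An input is provably robust if the true logit's lower bound exceeds
    every other logit's upper bound over the whole perturbation set."""
    true_lb = lb.gather(1, y.view(-1, 1))
    other_ub = ub.clone()
    other_ub.scatter_(1, y.view(-1, 1), float("-inf"))
    return (true_lb > other_ub.max(dim=1, keepdim=True).values).squeeze(1)

# Toy example: a two-layer ReLU classifier on random "MNIST-shaped" data.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))
x, y = torch.rand(8, 784), torch.randint(0, 10, (8,))
lb, ub = propagate_bounds(list(model), x, eps=0.1)
print(certified(lb, ub, y))  # which of the 8 inputs are provably robust
```

With an untrained network and this loose relaxation, few if any inputs will be certified; the point of the paper's training procedure is to make such certificates hold for most test points.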
Implications and Future Directions
Practical Implications:
- The primary practical implication of this work is that it enables the training of robust classifiers in a more efficient manner compared to existing combinatorial approaches, such as those using SMT solvers or integer programming. By leveraging techniques from convex optimization and duality theory, this approach scales to moderate-sized neural networks and can be trained using standard deep learning tools like stochastic gradient descent.
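As a hedged illustration of how such robust training plugs into a standard SGD loop, the sketch below reuses `propagate_bounds` from the previous example to build a worst-case ("robust") cross-entropy loss: each wrong class receives its upper-bound logit and the true class its lower-bound logit. The authors' released code provides an analogous robust loss built from the tighter dual bounds; the names and synthetic data here are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def robust_cross_entropy(layers, x, y, eps):
    # Cross-entropy on the worst-case logits implied by the bounds: the
    # adversary pushes wrong logits up (upper bound) and the true logit
    # down (lower bound).
    lb, ub = propagate_bounds(layers, x, eps)  # defined in the sketch above
    worst = ub.clone()
    worst.scatter_(1, y.view(-1, 1), lb.gather(1, y.view(-1, 1)))
    return F.cross_entropy(worst, y)

model = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loader = [(torch.rand(32, 784), torch.randint(0, 10, (32,)))
          for _ in range(10)]                  # stand-in for an MNIST DataLoader

for x, y in loader:
    loss = robust_cross_entropy(list(model), x, y, eps=0.1)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Minimizing this loss minimizes an upper bound on the worst-case loss, which is exactly the robust-optimization objective described above, albeit with a much looser bound than the paper's dual formulation provides.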
Theoretical Implications:
- This work bridges a gap between robust optimization and adversarial learning, demonstrating the synergy between these domains. The derivation of the dual network is particularly insightful, showing how adversarial robustness can be efficiently incorporated into the training process without relying on expensive combinatorial methods.
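One way to see why the dual bound resembles a backward pass is the linear (no-hidden-layer) special case, where it has a closed form: for $f(x) = Wx + b$ and an $\ell_\infty$ perturbation of radius $\epsilon$, the worst-case value of $c^\top f(x+\delta)$ is $c^\top(Wx + b) + \epsilon \|W^\top c\|_1$, i.e., the nominal value plus a term built from the transposed weights. The snippet below numerically checks this identity; it is a self-contained illustration, not code from the paper.

```python
import torch

torch.manual_seed(0)
W, b = torch.randn(10, 784), torch.randn(10)
x, eps = torch.rand(784), 0.1
c = torch.zeros(10)
c[3], c[7] = 1.0, -1.0   # linear functional of the logits: logit 3 minus logit 7

# Closed-form worst case over the l_inf ball: c.(Wx + b) + eps * ||W^T c||_1
closed_form = c @ (W @ x + b) + eps * (W.t() @ c).abs().sum()

# The maximizer is delta* = eps * sign(W^T c); it attains the bound exactly,
# and no random perturbation in the ball should exceed it.
delta_star = eps * torch.sign(W.t() @ c)
attained = c @ (W @ (x + delta_star) + b)
best_random = max(c @ (W @ (x + eps * (2 * torch.rand(784) - 1)) + b)
                  for _ in range(1000))

print(closed_form.item(), attained.item(), best_random.item())
```

In the multi-layer case, the paper's dual network generalizes this $W^\top$ computation, passing dual variables backward through each layer (with a relaxation at the ReLUs), which is why the bound can be evaluated at roughly the cost of a backpropagation pass.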
Future Developments:
- Scalability: While the method scales well to moderate-sized networks, further work is required to apply it to very large networks, such as those used in ImageNet classification. Future research could explore techniques like bottleneck layers and random projections to enhance scalability.
- Generalization to Other Norms: Although the paper focuses primarily on the $\ell_\infty$ norm, adapting the methodology to other norms (e.g., $\ell_2$ or Wasserstein distances) could broaden its applicability.
- Broader Adversarial Defense Mechanisms: Beyond norm-bounded perturbations, other forms of adversarial attacks including transformations like rotations and translations should be considered. This generalization would make the defensive framework more robust to a wider variety of real-world perturbations.
- Application to Other ML Tasks: The techniques presented could be extended to other tasks beyond classification, such as regression or other supervised learning tasks, providing robust solutions in broader machine learning contexts.
Conclusion
The paper by Wong and Kolter introduces a sophisticated yet practical framework for enhancing the robustness of neural networks against adversarial attacks. By leveraging convex relaxations and duality, they arrive at a method that is both efficient and accompanied by provable robustness guarantees, marking a significant step toward more secure machine learning systems. As the discourse on AI safety and robustness continues to evolve, the insights provided by this research will likely spur further innovations and refinements in the field.