- The paper introduces a semidefinite programming method to certify neural networks against ℓ∞ adversarial examples.
- It proposes a differentiable certificate that can be optimized jointly with the network parameters during training, acting as a regularizer that promotes robustness.
- Experimental results on MNIST show that an SDP-regularized network attains a certified adversarial test error of at most 35% for ε = 0.1.
Certified Defenses against Adversarial Examples
The paper "Certified Defenses against Adversarial Examples" by Aditi Raghunathan, Jacob Steinhardt, and Percy Liang addresses a critical vulnerability in neural networks—adversarial examples. These are small perturbations to the input data that drastically reduce the accuracy of otherwise high-performing networks. The paper proposes and evaluates a novel approach to create neural networks that are certifiably robust against such adversarial perturbations.
Summary of Contributions
The paper makes several key contributions to the field of adversarial robustness in neural networks:
- Certified Robustness via Semidefinite Relaxation:
- The authors introduce a method based on semidefinite programming (SDP) for generating certificates of robustness for neural networks with one hidden layer. The certificate asserts that no adversarial perturbation bounded by a given ε in the ℓ∞-norm can force the classification error to exceed a certain value (stated schematically after this list).
- Differentiable Certificates:
- The certification method yields a differentiable robustness measure, which allows the network parameters to be optimized jointly with the certificate during training. This creates an adaptive regularizer that encourages robustness against all possible attacks within a specified class.
- Empirical Validation:
- The authors demonstrate their approach on the MNIST dataset. They produce a network that, for ε = 0.1, comes with a certificate guaranteeing that no adversarial perturbation can cause more than 35% test error.
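Concretely, a robustness certificate of this kind can be read as an upper bound that holds uniformly over the perturbation ball. The following schematic statement uses our own notation, not the paper's exact formulation:

```latex
% Schematic form of a robustness certificate: the worst-case 0-1 test error
% over the entire \ell_\infty ball of radius \epsilon is bounded by a quantity
% C(\theta, \epsilon) computable from the trained weights \theta alone.
\[
  \mathbb{E}_{(x,y) \sim \text{test}}
    \Big[ \max_{\|\tilde{x} - x\|_\infty \le \epsilon}
          \mathbf{1}\{\hat{y}_\theta(\tilde{x}) \ne y\} \Big]
  \;\le\; C(\theta, \epsilon).
\]
% For the SDP-trained MNIST network reported in the paper, C(\theta, 0.1)
% evaluates to 0.35, i.e., a certified error of at most 35%.
```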
Detailed Technical Approach
Setup:
The problem of adversarial examples is highlighted by considering small perturbations to inputs that lead to misclassification. The authors focus on ℓ∞-bounded perturbations and aim to break the arms race between attackers and defenders by providing formal guarantees of robustness.
Score-based Classifiers and Attack Model:
The approach applies to general score-based classifiers and considers white-box attackers who may perturb inputs anywhere within an ℓ∞-ball around the original input. The worst-case adversarial example maximizes the margin of an incorrect class score over the correct class score, as formalized below.
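In symbols, writing f^c(x) for the score the classifier assigns to class c and y for the true label, the worst-case attack within budget ε solves the following problem (a schematic restatement in our notation):

```latex
% Worst-case adversarial margin for input x with label y under an \ell_\infty
% attack budget \epsilon; the attack succeeds iff the optimum is positive.
\[
  \max_{y' \ne y} \;\;
  \max_{\|\tilde{x} - x\|_\infty \le \epsilon}
      \; f^{y'}(\tilde{x}) \;-\; f^{y}(\tilde{x}).
\]
% A certificate for (x, y) is any efficiently computable upper bound on this
% optimum; if the bound is non-positive, no attack within the ball can succeed.
```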
Certificate Formulation:
For linear classifiers, robustness is certified by leveraging the ℓ1-norm of the weight differences. Extending to neural networks, the authors bound the maximum ℓ1-norm of the gradient of the margin function over the perturbation ball, relaxing this maximization to a semidefinite program that yields a tractable upper bound on the worst-case loss; both bounds are sketched below.
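The linear case admits a closed-form bound via Hölder's inequality, and the one-hidden-layer case is handled by a semidefinite relaxation whose general shape is sketched below (our notation; the paper's exact matrix M(θ) and scaling constants differ):

```latex
% Linear classifier f^c(x) = w_c^\top x: for any perturbation
% \|\delta\|_\infty \le \epsilon, Hölder's inequality gives
\[
  f^{y'}(x+\delta) - f^{y}(x+\delta)
    \;\le\; f^{y'}(x) - f^{y}(x) \;+\; \epsilon \,\|w_{y'} - w_{y}\|_1 ,
\]
% so certification only requires evaluating an \ell_1 norm of weight
% differences.  One-hidden-layer ReLU network: the worst-case margin is
% rewritten as a non-convex quadratic form over sign-like variables
% z \in \{-1,1\}^d and relaxed by substituting P = z z^\top:
\[
  \max_{z \in \{-1,1\}^d} z^\top M z
    \;\le\; \max_{P \succeq 0,\; \mathrm{diag}(P) \le 1} \langle M, P \rangle ,
\]
% where M = M(\theta) is built from the network weights; the right-hand side
% is a semidefinite program and hence a tractable upper bound.
```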
Optimization Strategy:
The authors propose a dual formulation for efficiently optimizing the robustness certificate during training. The dual formulation circumvents the computational burden of repeatedly solving semidefinite programs: the dual variables give a differentiable upper bound that can be minimized jointly with the network parameters using stochastic gradient methods (a simplified sketch of such a surrogate follows).
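The exact dual variables and update rules are specific to the paper, but the core idea of replacing the inner SDP with a differentiable scalar bound that SGD can penalize can be illustrated as follows. This is a simplified sketch in PyTorch, not the authors' implementation: `M` is a hypothetical symmetric matrix assembled from the network weights, and the bound d·max(λ_max(M), 0) is a looser stand-in for the paper's dual certificate.

```python
# Simplified sketch (not the authors' exact dual): replace the inner SDP
#   max_{P >= 0, diag(P) <= 1}  <M(theta), P>
# with the differentiable scalar bound  d * max(lambda_max(M), 0),
# valid because <M, P> <= lambda_max(M) * trace(P) <= d * lambda_max(M)
# whenever lambda_max(M) >= 0.  This bound can be penalized with SGD.
import torch


def sdp_surrogate(M: torch.Tensor) -> torch.Tensor:
    """Differentiable upper bound on the SDP value for a symmetric matrix M."""
    d = M.shape[0]
    lam_max = torch.linalg.eigvalsh(M)[-1]      # eigenvalues in ascending order
    return d * torch.clamp(lam_max, min=0.0)


def regularized_loss(logits, labels, M, epsilon, reg_weight=1.0):
    """Cross-entropy plus a robustness penalty built from the surrogate bound.

    `M` is a hypothetical symmetric matrix built from the network weights
    (one per class pair in the paper); `epsilon` is the attack budget.
    """
    ce = torch.nn.functional.cross_entropy(logits, labels)
    return ce + reg_weight * epsilon * sdp_surrogate(M)
```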
Experimental Results
The authors conducted extensive experiments on the MNIST dataset to evaluate the proposed method against standard baselines:
- Comparison of Bounds:
- The SDP-based upper bounds were compared to simpler Frobenius-norm and spectral-norm bounds on networks obtained from several training strategies. The SDP bounds were tighter across all baseline networks.
- For example, a naturally trained network had a large gap between the PGD attack's lower bound and the SDP upper bound, indicating the need for specialized training to achieve meaningful robustness.
- Performance on Adversarial Training:
- Networks trained using the SDP regularization (SDP-NN) achieved a certified upper bound of 35% error for ε = 0.1, far lower than the certified bounds obtainable for adversarially trained networks that lack the certificate-based regularizer.
- The SDP-NN network also displayed substantial robustness to strong empirical attacks such as PGD and the Carlini-Wagner attack, reflecting the practical impact of the proposed training method (a minimal PGD sketch follows this list).
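For reference, the PGD attack used for the empirical evaluation can be sketched as follows. This is a minimal, assumed implementation of standard ℓ∞ PGD in PyTorch, not the authors' evaluation code; `model`, the step size `alpha`, and the pixel range [0, 1] are illustrative assumptions.

```python
# Minimal l_inf PGD sketch (assumed setup, not the authors' evaluation code).
# PGD gives an empirical *lower* bound on adversarial error, which the paper
# compares against the certified SDP *upper* bound.
import torch


def pgd_attack(model, x, y, epsilon=0.1, alpha=0.01, steps=40):
    """Iterated gradient ascent on the loss, projected onto the l_inf ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                # ascent step
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)   # project onto the ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv
```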
Implications and Future Work
Practical Implications:
The proposed certification approach provides a robust defense mechanism against adversarial perturbations. Unlike empirical methods, this approach offers provable guarantees, making it applicable in security-sensitive domains such as autonomous driving and financial systems.
Theoretical Implications:
The combined use of semidefinite programming and duality provides a principled framework in which the training objective is itself a provable upper bound on the worst-case loss. The approach bridges the gap between robustness certification and practical training routines.
Future Directions:
While the proposed method is evaluated on two-layer networks (a single hidden layer), extending the framework to deeper architectures remains a crucial direction. Further research could also explore applying these certified defense techniques to larger, more complex datasets (e.g., CIFAR-10, ImageNet) and adapting the method to other norm constraints (ℓ2, ℓ1).
In conclusion, the paper contributes a solid foundation for designing and training certifiably robust neural networks, moving a step closer to resilient AI systems capable of withstanding adversarial manipulations.