- The paper introduces a semidefinite programming method to certify neural networks against ℓ∞ adversarial examples.
- It proposes a differentiable certificate that can be optimized jointly with the network parameters during training, acting as a regularizer that promotes robustness.
- Experimental results on MNIST show that an SDP-regularized network attains a certified adversarial test error of at most 35% for ε = 0.1.
Certified Defenses against Adversarial Examples
The paper "Certified Defenses against Adversarial Examples" by Aditi Raghunathan, Jacob Steinhardt, and Percy Liang addresses a critical vulnerability in neural networks—adversarial examples. These are small perturbations to the input data that drastically reduce the accuracy of otherwise high-performing networks. The paper proposes and evaluates a novel approach to create neural networks that are certifiably robust against such adversarial perturbations.
Summary of Contributions
The paper makes several key contributions to the field of adversarial robustness in neural networks:
- Certified Robustness via Semidefinite Relaxation:
- The authors introduce a method based on semidefinite programming (SDP) for generating certificates of robustness for neural networks with one hidden layer. The certificate asserts that no adversarial perturbation bounded by a given ε in the ℓ∞-norm can force the classification error to exceed a certain value (stated schematically after this list).
- Differentiable Certificates:
- The certification method yields a differentiable robustness measure, which allows the network parameters to be optimized jointly with the certificate during training. This creates an adaptive regularizer that encourages robustness against all possible attacks within a specified class.
- Empirical Validation:
- The authors demonstrate their approach on the MNIST dataset. They produce a network that, for ε = 0.1, comes with a certificate guaranteeing that no adversarial perturbation can cause more than 35% test error.
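Concretely, a robustness certificate of this kind can be read as an upper bound that holds uniformly over the perturbation ball. The following schematic statement uses our own notation, not the paper's exact formulation:

```latex
% Schematic form of a robustness certificate: the worst-case 0-1 test error
% over the entire \ell_\infty ball of radius \epsilon is bounded by a quantity
% C(\theta, \epsilon) computable from the trained weights \theta alone.
\[
  \mathbb{E}_{(x,y) \sim \text{test}}
    \Big[ \max_{\|\tilde{x} - x\|_\infty \le \epsilon}
          \mathbf{1}\{\hat{y}_\theta(\tilde{x}) \ne y\} \Big]
  \;\le\; C(\theta, \epsilon).
\]
% For the SDP-trained MNIST network reported in the paper, C(\theta, 0.1)
% evaluates to 0.35, i.e., a certified error of at most 35%.
```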
Detailed Technical Approach
Setup:
The problem of adversarial examples is highlighted by considering small perturbations to inputs that lead to misclassification. The authors focus on ℓ∞-bounded perturbations and aim to break the arms race between attackers and defenders by providing formal guarantees of robustness.
Score-based Classifiers and Attack Model:
The approach applies to general score-based classifiers and considers white-box attackers who may perturb inputs anywhere within an ℓ∞-ball around the original input. The worst-case adversarial example maximizes the margin of an incorrect class score over the correct class score, as formalized below.
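In symbols, writing f^c(x) for the score the classifier assigns to class c and y for the true label, the worst-case attack within budget ε solves the following problem (a schematic restatement in our notation):

```latex
% Worst-case adversarial margin for input x with label y under an \ell_\infty
% attack budget \epsilon; the attack succeeds iff the optimum is positive.
\[
  \max_{y' \ne y} \;\;
  \max_{\|\tilde{x} - x\|_\infty \le \epsilon}
      \; f^{y'}(\tilde{x}) \;-\; f^{y}(\tilde{x}).
\]
% A certificate for (x, y) is any efficiently computable upper bound on this
% optimum; if the bound is non-positive, no attack within the ball can succeed.
```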
Certificate Formulation:
For linear classifiers, robustness is certified by leveraging the ℓ1-norm of the weight differences. Extending to neural networks, the authors bound the maximum ℓ1-norm of the gradient of the margin function over the perturbation ball, relaxing this maximization to a semidefinite program that yields a tractable upper bound on the worst-case loss; both bounds are sketched below.
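The linear case admits a closed-form bound via Hölder's inequality, and the one-hidden-layer case is handled by a semidefinite relaxation whose general shape is sketched below (our notation; the paper's exact matrix M(θ) and scaling constants differ):

```latex
% Linear classifier f^c(x) = w_c^\top x: for any perturbation
% \|\delta\|_\infty \le \epsilon, Hölder's inequality gives
\[
  f^{y'}(x+\delta) - f^{y}(x+\delta)
    \;\le\; f^{y'}(x) - f^{y}(x) \;+\; \epsilon \,\|w_{y'} - w_{y}\|_1 ,
\]
% so certification only requires evaluating an \ell_1 norm of weight
% differences.  One-hidden-layer ReLU network: the worst-case margin is
% rewritten as a non-convex quadratic form over sign-like variables
% z \in \{-1,1\}^d and relaxed by substituting P = z z^\top:
\[
  \max_{z \in \{-1,1\}^d} z^\top M z
    \;\le\; \max_{P \succeq 0,\; \mathrm{diag}(P) \le 1} \langle M, P \rangle ,
\]
% where M = M(\theta) is built from the network weights; the right-hand side
% is a semidefinite program and hence a tractable upper bound.
```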
Optimization Strategy:
The authors propose a dual formulation for efficiently optimizing the robustness certificate during training. The dual formulation circumvents the computational burden of repeatedly solving semidefinite programs: the dual variables give a differentiable upper bound that can be minimized jointly with the network parameters using stochastic gradient methods (a simplified sketch of such a surrogate follows).
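The exact dual variables and update rules are specific to the paper, but the core idea of replacing the inner SDP with a differentiable scalar bound that SGD can penalize can be illustrated as follows. This is a simplified sketch in PyTorch, not the authors' implementation: `M` is a hypothetical symmetric matrix assembled from the network weights, and the bound d·max(λ_max(M), 0) is a looser stand-in for the paper's dual certificate.

```python
# Simplified sketch (not the authors' exact dual): replace the inner SDP
#   max_{P >= 0, diag(P) <= 1}  <M(theta), P>
# with the differentiable scalar bound  d * max(lambda_max(M), 0),
# valid because <M, P> <= lambda_max(M) * trace(P) <= d * lambda_max(M)
# whenever lambda_max(M) >= 0.  This bound can be penalized with SGD.
import torch


def sdp_surrogate(M: torch.Tensor) -> torch.Tensor:
    """Differentiable upper bound on the SDP value for a symmetric matrix M."""
    d = M.shape[0]
    lam_max = torch.linalg.eigvalsh(M)[-1]      # eigenvalues in ascending order
    return d * torch.clamp(lam_max, min=0.0)


def regularized_loss(logits, labels, M, epsilon, reg_weight=1.0):
    """Cross-entropy plus a robustness penalty built from the surrogate bound.

    `M` is a hypothetical symmetric matrix built from the network weights
    (one per class pair in the paper); `epsilon` is the attack budget.
    """
    ce = torch.nn.functional.cross_entropy(logits, labels)
    return ce + reg_weight * epsilon * sdp_surrogate(M)
```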
Experimental Results
The authors conducted extensive experiments on the MNIST dataset to evaluate the proposed method against standard baselines:
- Comparison of Bounds:
- The SDP-based upper bounds were compared to simpler Frobenius-norm and spectral-norm bounds on networks obtained from several training strategies. The SDP bounds were tighter across all baseline networks.
- For example, a naturally trained network had a large gap between the PGD attack's lower bound and the SDP upper bound, indicating the need for specialized training to achieve meaningful robustness.
- Performance on Adversarial Training:
- Networks trained using the SDP regularization (SDP-NN) achieved a certified upper bound of 35% error for ε = 0.1, far lower than the certified bounds obtainable for adversarially trained networks that lack the certificate-based regularizer.
- The SDP-NN network also displayed substantial robustness to strong empirical attacks such as PGD and the Carlini-Wagner attack, reflecting the practical impact of the proposed training method (a minimal PGD sketch follows this list).
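For reference, the PGD attack used for the empirical evaluation can be sketched as follows. This is a minimal, assumed implementation of standard ℓ∞ PGD in PyTorch, not the authors' evaluation code; `model`, the step size `alpha`, and the pixel range [0, 1] are illustrative assumptions.

```python
# Minimal l_inf PGD sketch (assumed setup, not the authors' evaluation code).
# PGD gives an empirical *lower* bound on adversarial error, which the paper
# compares against the certified SDP *upper* bound.
import torch


def pgd_attack(model, x, y, epsilon=0.1, alpha=0.01, steps=40):
    """Iterated gradient ascent on the loss, projected onto the l_inf ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                # ascent step
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)   # project onto the ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv
```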
Implications and Future Work
Practical Implications:
The proposed certification approach provides a robust defense mechanism against adversarial perturbations. Unlike empirical methods, this approach offers provable guarantees, making it applicable in security-sensitive domains such as autonomous driving and financial systems.
Theoretical Implications:
The combined use of semidefinite programming and duality provides a principled framework in which the training objective is itself a provable upper bound on the worst-case loss. The approach bridges the gap between robustness certification and practical training routines.
Future Directions:
While the proposed method is evaluated on two-layer networks (a single hidden layer), extending the framework to deeper architectures remains a crucial direction. Further research could also explore applying these certified defense techniques to larger, more complex datasets (e.g., CIFAR-10, ImageNet) and adapting the method to other norm constraints (ℓ2, ℓ1).
In conclusion, the paper contributes a solid foundation for designing and training certifiably robust neural networks, moving a step closer to resilient AI systems capable of withstanding adversarial manipulations.