- The paper demonstrates that adversarial logit pairing significantly enhances model robustness by encouraging the logits of a clean image and its adversarially perturbed counterpart to be similar.
- It scales adversarial training to datasets like ImageNet, raising accuracy under PGD white-box attack from 1.5% to 27.9%.
- The study also exposes weaknesses in an existing black-box defense, paving the way for more resilient deep learning models.
Overview of Adversarial Logit Pairing
The paper "Adversarial Logit Pairing" by Kannan, Kurakin, and Goodfellow addresses the vulnerability of deep learning models to adversarial examples, particularly in the field of image classification. The authors propose improved techniques for defending against such adversarial attacks, focusing on a method called logit pairing.
Key Contributions
- Adversarial Training at Scale: The researchers implement state-of-the-art adversarial training methods on the ImageNet dataset and evaluate their effectiveness. This implementation at such a large scale provides insights into the robustness of these models when subjected to adversarial attacks, an area that had not been adequately explored.
- Introduction of Logit Pairing: The paper introduces "logit pairing," a technique that enhances robustness by encouraging the logits of pairs of examples to be similar. Two variants are proposed (minimal sketches of both losses follow this list):
- Adversarial Logit Pairing (ALP): pairs each clean image with its adversarially perturbed counterpart.
- Clean Logit Pairing (CLP): pairs clean examples only, offering competitive robustness at minimal computational cost.
- Performance Enhancement: The paper demonstrates substantial improvements in adversarial robustness across multiple datasets. Notably, ALP achieves a significant increase in accuracy on ImageNet PGD white-box attacks, from 1.5% to 27.9%.
- Challenges Current Defenses: ALP is also shown to damage the then state-of-the-art defense against black-box attacks on ImageNet (Tramèr et al.), reducing its accuracy from 66.6% to 47.1%.
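To make the pairing idea concrete, below is a minimal PyTorch-style sketch of the ALP and CLP objectives. The function names (`alp_loss`, `clp_loss`), the pairing weight `lam`, and the half-batch pairing scheme for CLP are illustrative assumptions rather than the authors' implementation; the adversarial counterparts `x_adv` are assumed to be precomputed, e.g., with PGD as sketched in the next section.

```python
# Illustrative sketch of adversarial and clean logit pairing losses (not the authors' code).
import torch.nn.functional as F

def alp_loss(model, x_clean, x_adv, y, lam=0.5):
    """Adversarial logit pairing: adversarial training loss plus a squared-error
    penalty pulling the logits of each clean/adversarial pair together."""
    logits_clean = model(x_clean)
    logits_adv = model(x_adv)
    # Mixed-minibatch adversarial training term on clean and adversarial examples.
    ce = F.cross_entropy(logits_clean, y) + F.cross_entropy(logits_adv, y)
    # Pairing term: logits of an image and its adversarial counterpart should match.
    pairing = ((logits_clean - logits_adv) ** 2).mean()
    return ce + lam * pairing

def clp_loss(model, x_clean, y, lam=0.5):
    """Clean logit pairing: pair logits of clean examples only, here by splitting
    the minibatch in half and matching examples across the halves (an assumed scheme)."""
    logits = model(x_clean)
    ce = F.cross_entropy(logits, y)
    half = logits.shape[0] // 2
    pairing = ((logits[:half] - logits[half:2 * half]) ** 2).mean()
    return ce + lam * pairing
```

The pairing penalty is written here as a mean squared difference between logits; the exact coefficient and pairing scheme used in the paper may differ.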
Technical Approach
The paper provides a detailed exploration of two threat models—white box and black box—detailing the capabilities of potential adversaries. Adversarial examples in this research are generated using Projected Gradient Descent (PGD), providing a strong attack against which the proposed defenses are evaluated (a minimal PGD sketch is given below).
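As a reference for how such adversarial examples are typically produced, here is a minimal PyTorch sketch of an L-infinity PGD attack. The perturbation budget `epsilon`, step size, and number of steps are illustrative defaults, not the exact settings used in the paper.

```python
# Illustrative L-infinity PGD attack sketch (hyperparameters are assumptions).
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8/255, step_size=2/255, num_steps=10):
    """Generate adversarial examples with projected gradient descent."""
    # Random start inside the epsilon ball, as in the PGD baseline.
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-epsilon, epsilon)
    x_adv = torch.clamp(x_adv, 0.0, 1.0)

    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss along the sign of the gradient.
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Project back into the epsilon ball around the clean input and valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```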
Numerical Results
The experimental results are revealing:
- On the MNIST dataset, ALP increases white-box accuracy from 93.2% to 96.4%.
- On SVHN, the paper reports improvements in adversarial accuracy with ALP compared to traditional adversarial training.
Implications and Future Work
The implications of this research are twofold:
- Practical: Scaling adversarial training to large datasets like ImageNet makes robust training practical for real-world deployments where models face adversarial inputs.
- Theoretical: The introduction of logit pairing suggests new pathways in understanding model robustness, providing an additional prior that regularizes training beyond the standard adversarial training objective.
Future research could explore:
- Extensions of logit pairing to other forms of data and neural network architectures.
- The integration of logit pairing with certified defenses for provable robustness.
Conclusion
The authors present a comprehensive paper that scales adversarial training and introduces a novel method for enhancing robustness. The work sets a precedent for large-scale applications and provides a foundation for developing more resilient machine learning models in adversarial contexts. While promising, the techniques warrant further investigation to certify robustness and explore broader applicability.