PixelDefend: Leveraging Generative Models to Understand and Defend Against Adversarial Examples
The paper "PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples" addresses the pervasive issue of adversarial examples, which are small, often imperceptible perturbations to images that can cause state-of-the-art machine learning models to make erroneous predictions. The authors present an innovative approach leveraging generative models for both detecting and defending against these adversarial examples, specifically by employing PixelCNN to "purify" input images before classification.
Key Hypotheses and Discoveries
The authors hypothesize that adversarial examples, although they deviate only slightly from clean images, lie predominantly in low-probability regions of the training data distribution. They validate this hypothesis empirically with a modern neural density model, PixelCNN, which proves highly sensitive to adversarial perturbations. Key findings include:
- Adversarial examples lie in low-probability regions: across the attack methods and target models studied, the PixelCNN model assigns adversarial examples significantly lower likelihoods than clean images.
- Detection through statistical hypothesis testing: using permutation tests on PixelCNN likelihoods and the resulting p-values, the authors show that adversarial images can be detected effectively (a minimal detection sketch follows this list).
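To make the detection step concrete, here is a minimal sketch of the permutation-test idea, assuming a trained PixelCNN can score images; `pixelcnn_log_likelihood` is a hypothetical stand-in for that scorer, not part of the paper's code. The p-value of a test image is simply the rank of its likelihood among the training images' likelihoods, and a small p-value flags a probable adversarial input.

```python
import numpy as np

def detection_p_value(test_ll, train_lls):
    """Permutation-test style p-value: the fraction of training images whose
    PixelCNN log-likelihood is at most that of the test image. A small value
    means the test image falls in a low-probability region of the training
    distribution and is therefore flagged as likely adversarial."""
    train_lls = np.asarray(train_lls)
    return (np.sum(train_lls <= test_ll) + 1) / (len(train_lls) + 1)

# Hypothetical usage, assuming pixelcnn_log_likelihood wraps a trained PixelCNN:
# train_lls = [pixelcnn_log_likelihood(x) for x in training_images]
# p = detection_p_value(pixelcnn_log_likelihood(test_image), train_lls)
# is_adversarial = p < 0.05
```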
PixelDefend Approach
The central contribution of the paper is PixelDefend, a defense mechanism that purifies an input image by moving it toward higher-probability regions of the training data distribution before handing it to the original, unmodified classifier, which is then more likely to classify it correctly. The following points encapsulate the PixelDefend approach:
- Purification Algorithm: PixelDefend uses a greedy, pixel-by-pixel decoding procedure that approximately maximizes the likelihood of an image under the PixelCNN model while constraining each pixel to remain within an ϵ-ball (in the L∞ sense) around the original image (see the sketch after this list).
- Combination with other defenses: The algorithm does not modify the target classifier and is agnostic to the attacking method, making it complementary to other defensive techniques such as adversarial training.
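A minimal sketch of such a greedy purification pass follows, under stated assumptions: images are 8-bit arrays of shape (H, W, C), and `pixelcnn_probs` is a hypothetical helper returning the trained PixelCNN's 256-way conditional distribution for one channel of one pixel, given the pixels already decoded in raster-scan order.

```python
import numpy as np

def pixel_defend(x, pixelcnn_probs, eps_defend=16):
    """Greedily re-decode each pixel, keeping it within eps_defend of the
    input value, so the result moves toward higher PixelCNN likelihood."""
    x_purified = x.copy()
    height, width, channels = x.shape
    for r in range(height):
        for c in range(width):
            for k in range(channels):
                probs = pixelcnn_probs(x_purified, r, c, k)   # shape (256,)
                lo = max(int(x[r, c, k]) - eps_defend, 0)
                hi = min(int(x[r, c, k]) + eps_defend, 255)
                # restrict the argmax to values inside the eps_defend range
                x_purified[r, c, k] = lo + int(np.argmax(probs[lo:hi + 1]))
    return x_purified
```

Because every pixel is re-decoded conditioned on the already-purified pixels before it, one pass costs one PixelCNN evaluation per pixel, which is the main source of the computational overhead noted later.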
Empirical Results
The authors evaluate PixelDefend thoroughly across two datasets (Fashion MNIST and CIFAR-10) and multiple attack methods, including FGSM, BIM, DeepFool, and Carlini-Wagner (CW). The strong numerical results highlight the effectiveness of PixelDefend (an end-to-end evaluation sketch follows the list):
- On Fashion MNIST, PixelDefend improves the accuracy against the strongest attack from 63% to 84% for a standard classifier (ResNet) and from 76% to 85% for classifiers trained with basic adversarial training.
- On CIFAR-10, PixelDefend's results are even more marked, raising accuracy under the strongest attack from 32% to 70% for standard classifiers.
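As an illustration of how such an evaluation can be assembled, the sketch below generates FGSM adversarial examples, purifies them, and measures accuracy. It is a minimal sketch under stated assumptions: `classifier` is any trained PyTorch image classifier with inputs in [0, 1], and `purify` is a purification routine such as the one sketched above; neither corresponds to code released with the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(classifier, x, y, eps=8 / 255):
    """Untargeted FGSM: a single signed-gradient step of size eps
    (one of the attack methods evaluated in the paper)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

@torch.no_grad()
def accuracy_under_attack(classifier, purify, loader, eps=8 / 255):
    """Accuracy on purified adversarial examples; pass purify=lambda x: x
    to measure the undefended baseline instead."""
    correct = total = 0
    for x, y in loader:
        with torch.enable_grad():  # FGSM needs gradients even inside no_grad
            x_adv = fgsm_attack(classifier, x, y, eps)
        preds = classifier(purify(x_adv)).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```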
Implications and Future Directions
The findings and methodology introduced in this paper have far-reaching implications for AI security, particularly in the field of robust image classification. The demonstration that PixelCNN can both detect and purify adversarial images opens up new avenues for using generative models in adversarial defense strategies. Given PixelDefend’s model-agnostic nature, it can be seamlessly integrated with any existing classifier framework, enhancing its practical utility.
Looking forward, further research could investigate more efficient optimization techniques for image purification, potentially reducing the computational overhead associated with the greedy decoding process. Additionally, exploring other generative models and their utility in adversarial defense could provide deeper insights and more robust algorithms for real-world deployment.
Conclusion
The paper makes significant strides in addressing the problem of adversarial perturbations by introducing PixelDefend, a novel method leveraging generative models for image purification. Through comprehensive empirical evaluation, the authors demonstrate PixelDefend’s efficacy in improving robustness across a variety of attacks and models, providing a valuable contribution to the field of adversarial machine learning defenses.