This paper, "Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey" (Akhtar and Mian, 2018), provides the first comprehensive overview of adversarial attacks on deep learning models, with a particular focus on computer vision applications. The core finding highlighted throughout the survey is that despite achieving remarkable performance, deep neural networks are surprisingly vulnerable to small, often imperceptible, perturbations of their inputs, known as adversarial examples. These perturbations can cause models to make incorrect predictions with high confidence, posing a significant threat to the deployment of deep learning in safety- and security-critical systems like autonomous vehicles, surveillance, and medical imaging.
The survey defines key terminology, such as 'adversarial example,' 'adversarial perturbation,' 'black-box attack' (no knowledge of the target model's parameters/architecture), 'white-box attack' (full knowledge), 'transferability' (an adversarial example fooling multiple models), and 'universal perturbation' (a single perturbation fooling many images). Understanding these terms is fundamental to discussing the landscape of attacks and defenses.
The paper details various adversarial attack methods, broadly categorized by the tasks they target. For image classification, influential early methods include:
- Box-constrained L-BFGS: An iterative optimization approach [Szegedy_2014] that finds minimal ℓ2-norm perturbations that change the classification. It is computationally expensive, but it was the first to demonstrate the phenomenon.
- Fast Gradient Sign Method (FGSM): A one-step method [Goodfellow_2015] that computes the gradient of the loss with respect to the input image and adds a scaled sign of that gradient. It is fast and leverages the perceived linearity of deep networks. The perturbation is η = ε · sign(∇_x J(θ, x, y)), where J is the training loss, x the input, y the label, and ε the perturbation magnitude (a minimal code sketch of FGSM and its iterative extension follows this list).
- Basic Iterative Method (BIM) / Iterative Least-likely Class Method (ILCM): Iterative extensions of FGSM [Kurakin_2016a] that take multiple small steps, often more effective than one-step methods but computationally costlier.
- Jacobian-based Saliency Map Attack (JSMA): An attack [Papernot_2016c] that uses a saliency map derived from the network's Jacobian to iteratively modify a small number of pixels until a targeted misclassification is achieved, useful when only a few pixel changes are acceptable.
- One Pixel Attack: An extreme attack [Su_2017] using differential evolution to find a single pixel change sufficient to fool a classifier, demonstrating vulnerability even under severe constraints.
- Carlini and Wagner (C&W) Attacks: A suite of powerful iterative attacks [Carlini_2016] designed to bypass specific defenses, particularly defensive distillation. They minimize perturbations under the ℓ0, ℓ2, and ℓ∞ norms while ensuring successful misclassification. They are computationally expensive but highly effective.
- DeepFool: An iterative method [DeepFool] that finds minimal ℓ2- or ℓ∞-norm perturbations by repeatedly pushing the image across a linear approximation of the decision boundary.
- Universal Adversarial Perturbations: Demonstrated the existence of image-agnostic perturbations [Uni] that can fool a network on a large fraction of images, implying a shared vulnerability across the input distribution.
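To make the gradient-based attacks above concrete, here is a minimal PyTorch-style sketch of FGSM and its iterative extension BIM. It is an illustration under simple assumptions (untargeted attack, cross-entropy loss, images scaled to [0, 1]); the function names are ours, not the survey's.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    """One FGSM step: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return torch.clamp(x_adv + eps * x_adv.grad.sign(), 0.0, 1.0).detach()

def bim_perturb(model, x, y, eps, alpha, steps):
    """Basic Iterative Method: repeated small FGSM steps of size alpha,
    clipped back into the eps-ball around the original image each time."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = fgsm_perturb(model, x_adv, y, alpha)
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```

The iterative variant corresponds to the survey's description of BIM: many small steps are usually more effective than a single large one, at the cost of extra forward/backward passes.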
The paper also surveys attacks beyond image classification, showing that other computer vision tasks are susceptible. Examples include attacks on autoencoders and generative models [Tabacof_2016a, Kos_2017], recurrent neural networks [Papernot_craft], deep reinforcement learning agents [Lin_2017Tactics, Huang_2017a], semantic segmentation, and object detection [Metzen, Xie_2017]. This suggests that adversarial vulnerability is a general phenomenon affecting various deep learning architectures and applications.
A critical section focuses on the practical implications of adversarial attacks in the real world. While many attacks are demonstrated in 'laboratory settings' on standard datasets, their real-world feasibility is a key concern. The survey highlights experiments demonstrating successful physical attacks:
- Cell-phone camera attacks: Printing adversarial images and showing they fool classifiers when photographed by a phone camera [Kurakin_2016a].
- Road sign attacks: Creating adversarial stickers or posters that cause misclassification of physical stop signs by a road sign classifier [Evtimov_2017]. This directly challenges the assumption that environmental variations would neutralize digital perturbations.
- Generic adversarial 3D objects: Constructing 3D-printed objects (like a turtle) that are consistently misclassified as another object (like a rifle) across various viewpoints and distances [Athalye_2017] using the Expectation Over Transformation (EOT) framework (a minimal sketch of the EOT objective follows this list). This provides strong evidence for the feasibility of physical attacks robust to environmental factors.
- Cyberspace attacks: Demonstrating successful black-box attacks against commercial machine learning services like MetaMind, Amazon, and Google [Papernot_2017b] by training substitute models or using ensemble methods [Liu_2017b].
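The EOT framework mentioned above can be summarized as optimizing a single perturbation against the *expected* loss over a distribution of physical transformations. The sketch below is a loose, hedged rendering of that idea in PyTorch: `sample_transform` is a placeholder for a differentiable transformation pipeline (rotation, scaling, lighting, noise) that we do not specify, and the simple ℓ∞ budget stands in for the perceptual constraint used in the original work.

```python
import torch
import torch.nn.functional as F

def eot_targeted_attack(model, x, target, sample_transform,
                        steps=500, lr=0.01, eps=0.1, mc_samples=8):
    """Sketch of Expectation Over Transformation (EOT): optimize one perturbation
    so the prediction is the target class *in expectation* over random,
    differentiable transformations t ~ T."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Monte-Carlo estimate of E_t[ loss(model(t(x + delta)), target) ]
        loss = sum(
            F.cross_entropy(model(sample_transform()(torch.clamp(x + delta, 0, 1))), target)
            for _ in range(mc_samples)
        ) / mc_samples
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # crude stand-in for the paper's perceptual constraint
    return torch.clamp(x + delta, 0, 1).detach()
```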
The paper discusses the ongoing investigation into why adversarial examples exist. Various hypotheses are presented, including the linearity of deep networks [Goodfellow_2015], the geometry and curvature of decision boundaries [Fawzi_2016, Analysis], inherent prediction uncertainty [Cubuk_2017a], and evolutionary stalling during training [Rozsa_2017c]. There is not yet a full consensus, indicating this is an active area of theoretical research. The observation that higher accuracy often correlates with higher adversarial robustness [Rozsa_2016b] is also noted.
Finally, the survey reviews defenses against adversarial attacks, broadly categorized into:
- Modified training/input:
- Adversarial training: Including adversarial examples alongside clean data during training [Szegedy_2014, Goodfellow_2015, Madry_2017]. This is a widely used technique, but it requires strong adversaries during training to be effective and increases training time (a minimal training-loop sketch appears after this list).
- Data compression: Using techniques like JPEG compression [Dziugaite_2016, Guo_2017] or PCA [Bhagoji_2017] on the input image. While sometimes effective against weaker attacks, it can reduce accuracy on clean images or be bypassed by stronger attacks [Shin_2017].
- Randomization: Applying random resizing or padding to inputs [Xie_2017].
- Modifying networks:
- Gradient regularization/masking: Modifying the training objective to penalize large input gradients or mask sensitive features [Ross_2017, Lyu_2015, Nguyen_2017, Gao_2017DeepCloak].
- Defensive distillation: Training a network using softened probability outputs from another network (or itself) [Papernot_2016]. This initially showed promise but was broken by C&W attacks [Carlini_2016].
- Biologically inspired defenses: Using highly non-linear activations [Nayebi_2017].
- Parseval Networks: Regularizing layers to control the network's Lipschitz constant [Cisse_2017].
- Provable defenses: Methods that offer mathematical guarantees of robustness within a certain perturbation bound, albeit often limited to smaller networks or datasets [Kolter_2017, Certified].
- Detection-only methods: Training subnetworks [Metzen_Ondetect] or using network statistics [Li_ICCV17] or output properties [Lu_2017b] to identify and reject adversarial examples. SafetyNet [Lu_2017b] uses RBF SVMs on ReLU activations; detector subnetworks [Metzen_Ondetect] are trained for binary classification; additional class augmentation [Grosse_2017] trains the model to classify adversarial inputs into a new category.
- Network add-ons:
- Perturbation Rectifying Network (PRN): Appending layers before the target network to attempt to reverse the adversarial perturbation [Akhtar_2017].
- GAN-based defense: Using Generative Adversarial Networks to either train robust classifiers or reconstruct clean images from adversarial ones [GANbased, withGAN].
- Detection-only methods: Using external models for pre-processing or analysis, such as feature squeezing (reducing color depth or applying spatial smoothing) [Squeez, Xu_2017Squueze] or MagNet (detecting points far from the training data manifold) [Magnet]; a small detection sketch in this style also follows the list. However, many of these detection methods have also been shown to be defeatable [Ensemble, MagNetBreak].
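As a concrete reference for the adversarial-training entry above, the following is a minimal PyTorch-style training-loop sketch under common assumptions (an iterative ℓ∞-bounded inner attack, images in [0, 1], an equal mix of clean and adversarial loss). The hyperparameters and function name are illustrative, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8/255, alpha=2/255, steps=7):
    """One epoch of adversarial training: craft an iterative adversarial example
    for each batch, then train on clean and adversarial examples together."""
    model.train()
    for x, y in loader:
        # inner maximization: find a strong perturbation within the eps-ball
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = (x_adv + alpha * grad.sign()).detach()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
        # outer minimization: update the model on both clean and adversarial data
        optimizer.zero_grad()
        loss = 0.5 * (F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y))
        loss.backward()
        optimizer.step()
```

And a correspondingly small sketch of feature-squeezing-style detection: compare the model's softmax output on the original input with its output on a color-depth-reduced copy, and flag inputs whose predictions disagree by more than a threshold tuned on clean data. The bit depth and threshold are illustrative choices.

```python
import torch
import torch.nn.functional as F

def feature_squeeze_detect(model, x, threshold, bits=4):
    """Flag inputs whose predictions change substantially after color-depth reduction."""
    squeezed = torch.round(x * (2 ** bits - 1)) / (2 ** bits - 1)
    p_orig = F.softmax(model(x), dim=1)
    p_squeezed = F.softmax(model(squeezed), dim=1)
    return (p_orig - p_squeezed).abs().sum(dim=1) > threshold  # True = likely adversarial
```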
The survey concludes that adversarial attacks are a real, widespread, and transferable threat to deep learning, extending beyond simple image classification to affect various tasks and even physical systems. While various defenses have been proposed, it's an ongoing arms race, with many defenses being subsequently bypassed by new attack methods. This highlights the need for continued research to develop more robust deep learning models suitable for critical applications. The high level of research activity in this area provides hope for future improvements in robustness.