- The paper introduces a novel Analysis by Synthesis (ABS) method based on class-conditional VAEs to improve adversarial robustness.
- The paper critiques prevailing defenses, showing that they overfit to the L∞ norm and remain vulnerable to L2 and L0 attacks.
- The paper demonstrates that adversarial examples against the proposed model are more semantically aligned with human perception, a step toward more interpretable AI.
An Analysis of Adversarial Robustness in Neural Networks for MNIST
This paper presents a critical evaluation of adversarial robustness in neural networks applied to the MNIST dataset, a domain traditionally considered solved. Despite MNIST's simplicity, the authors argue that adversarial robustness remains an open problem, challenging the assumption that conventional defenses are adequate. They critique existing methods and introduce a new approach grounded in generative models that promises improved adversarial robustness.
Adversarial Vulnerabilities in Established Defenses
The analysis begins by assessing the adversarial resilience of models trained on MNIST. Current leading defenses, notably the adversarial training approach of Madry et al., are scrutinized. The authors find that such models overfit to the L∞ norm under which they were trained, leaving them far less effective against L2 and L0 perturbations. Moreover, the models often classify unrecognizable inputs with high confidence, revealing a disconnect between what the models rely on and human semantic understanding.
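To make the threat-model distinction concrete, the following is a minimal sketch (not the paper's code) of projected gradient descent under L∞ versus L2 constraints. It assumes a differentiable PyTorch classifier `model` taking image batches with pixel values in [0, 1]; a defense tuned only to one projection need not confer robustness under the other.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, steps=40, norm="linf"):
    """Minimal PGD sketch: maximize the loss within an eps-ball of the chosen norm."""
    alpha = 2.5 * eps / steps                      # common step-size heuristic
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            if norm == "linf":
                x_adv = x_adv + alpha * grad.sign()
                x_adv = x + (x_adv - x).clamp(-eps, eps)            # project onto L-inf ball
            else:                                                    # "l2"
                g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
                x_adv = x_adv + alpha * g
                delta = x_adv - x
                d_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
                x_adv = x + delta * (eps / d_norm).clamp(max=1.0)    # project onto L2 ball
            x_adv = x_adv.clamp(0, 1).detach()                       # stay in valid pixel range
    return x_adv
```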
Proposed Methodology: Analysis by Synthesis
To address these shortcomings, the authors propose a new model built on an Analysis by Synthesis (ABS) approach with class-conditional variational autoencoders (VAEs). By modeling the input distribution of each class separately, the model aims to improve both accuracy and robustness and allows bounds on its adversarial robustness to be derived. The architecture uses variational inference during training and optimization-based inference at test time, distinguishing it from conventional feed-forward classifiers.
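The sketch below illustrates the general idea under simplifying assumptions: one decoder per class, a point-estimate optimization over the latent `z`, and a score combining reconstruction and prior terms. The names `decoder`, `decoders`, and `z_dim` are illustrative stand-ins, not the authors' implementation, which optimizes a per-class variational bound.

```python
import torch
import torch.nn.functional as F

def class_score(decoder, x, z_dim=8, steps=50, lr=0.05):
    """Optimization-based inference sketch: find the latent z that best explains x
    under one class-conditional decoder. A simplified stand-in for the per-class
    latent optimization used at test time in ABS."""
    z = torch.zeros(x.shape[0], z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_rec = decoder(z)                                    # Bernoulli means of p(x | z, y)
        rec = F.binary_cross_entropy(x_rec, x, reduction="none").flatten(1).sum(1)
        prior = 0.5 * (z ** 2).sum(1)                         # negative log N(0, I) prior (up to a constant)
        (rec + prior).sum().backward()
        opt.step()
    with torch.no_grad():
        x_rec = decoder(z)
        rec = F.binary_cross_entropy(x_rec, x, reduction="none").flatten(1).sum(1)
        prior = 0.5 * (z ** 2).sum(1)
    return -(rec + prior)                                     # higher = x better explained by this class

def abs_predict(decoders, x):
    """Classify x by the class whose conditional generative model explains it best."""
    scores = torch.stack([class_score(dec, x) for dec in decoders], dim=1)
    return scores.argmax(dim=1)
```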
Empirical Evaluation and Results
The robustness of the proposed model is tested against a diverse set of adversarial attacks, spanning decision-based, score-based, and gradient-based strategies. The ABS model shows improved robustness to L2, L∞, and L0 perturbations compared with existing state-of-the-art methods. Moreover, adversarial examples against ABS models tend to be more semantically meaningful and aligned with human perception, suggesting progress toward robustness that comes with genuine interpretability.
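One common way to aggregate such an evaluation, sketched below under assumptions (the attack pool, the L2 distance metric, and the helper names are illustrative, not the authors' code), is to keep, per sample, the smallest successful perturbation found by any attack and report the median over samples.

```python
import numpy as np

def median_adversarial_distance(attacks, model, xs, ys):
    """Robustness-evaluation sketch: run several attacks per sample, keep the smallest
    perturbation that changes the prediction, and report the median over samples.
    Each attack is a callable returning an adversarial example or None on failure."""
    per_sample = []
    for x, y in zip(xs, ys):
        dists = []
        for attack in attacks:
            x_adv = attack(model, x, y)
            if x_adv is not None:
                dists.append(np.linalg.norm((x_adv - x).ravel()))   # L2 distance
        per_sample.append(min(dists) if dists else np.inf)          # worst case over the attack pool
    return float(np.median(per_sample))
```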
Implications and Future Prospects
The findings advocate for reconsidering the problem of adversarial robustness even on seemingly trivial datasets like MNIST. By highlighting weaknesses in prevailing defenses and demonstrating a promising alternative through ABS, the paper sets a precedent for future research in adversarial machine learning. The introduction of a customizable, generative-model-based method points to potential applications beyond MNIST, encouraging the adoption of similar strategies for more complex datasets.
The authors acknowledge the limitations of current robustness evaluations and invite further scrutiny by releasing their model for external validation, underscoring a collaborative effort toward resilient AI systems. Future work could explore how to scale ABS models to more complex datasets.
In conclusion, this paper enriches the discourse on adversarial robustness, highlighting the limitations of current defenses and contributing an innovative generative alternative. As the field progresses, these findings are significant for developing AI systems that are not only robust but also semantically aligned with human interpretation and understanding.