Insights into Vision Transformers as Robust Learners
The paper "Vision Transformers are Robust Learners" by Sayak Paul and Pin-Yu Chen focuses on analyzing the robustness of Vision Transformers (ViT) compared to state-of-the-art convolutional neural networks (CNNs), specifically the Big Transfer (BiT) models. While Vision Transformers have established themselves as strong contenders in achieving state-of-the-art accuracy in computer vision tasks, this work is among those that delve into the less explored territory of their robustness against common corruptions, adversarial examples, and distributional shifts.
The research sets out to answer critical questions about the inherent robustness of ViTs and to back the robustness claims with empirical evidence. The authors present a comprehensive performance comparison on six diverse ImageNet datasets, assessing ViT's robustness to semantic shifts, common corruptions, and natural adversarial examples, among other distribution shifts.
Quantitative Results
The paper reports strong numerical results emphasizing ViT's superior robustness across the evaluated datasets compared to BiT models with similar parameter counts. For instance, ViT's top-1 accuracy on the ImageNet-A dataset reaches 28.10%, roughly 4.3x higher than that of a comparable BiT variant. ViT also achieves a lower mean corruption error (mCE) on ImageNet-C and a lower mean flip rate (mFR) on ImageNet-P than BiT models and baseline CNN architectures.
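For context, the mean corruption error (mCE) from ImageNet-C averages a model's per-corruption top-1 errors after normalizing them by AlexNet's errors, and the flip rate from ImageNet-P measures how often a model's prediction changes along a sequence of gradually perturbed frames; lower is better for both. The following is a minimal NumPy sketch of these metrics with hypothetical inputs, not the paper's evaluation code.

```python
import numpy as np

def corruption_error(model_errs, alexnet_errs):
    """Per-corruption CE: top-1 errors summed over severities 1-5,
    normalized by AlexNet's errors (ImageNet-C convention)."""
    return sum(model_errs) / sum(alexnet_errs)

def mean_corruption_error(model_errs, alexnet_errs):
    """mCE: average CE over corruption types, expressed as a percentage.
    Inputs are hypothetical dicts: corruption name -> list of 5 error rates."""
    ces = [corruption_error(model_errs[c], alexnet_errs[c]) for c in model_errs]
    return 100.0 * np.mean(ces)

def flip_probability(preds):
    """Flip probability for one ImageNet-P sequence: fraction of consecutive
    frames whose predicted label changes (noise-based perturbations compare
    against the first frame instead). `preds` is a hypothetical label sequence."""
    preds = np.asarray(preds)
    return np.mean(preds[1:] != preds[:-1])
```

The mean flip rate (mFR) reported in the paper additionally normalizes each perturbation's flip probability by AlexNet's and averages across perturbation types, analogous to mCE.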
Robustness Analysis
The authors conducted a series of systematic experiments to investigate why ViTs exhibit enhanced robustness. Key experiments include:
- Attention Mechanisms: The research highlights that self-attention, a core component of Transformers, inherently supports capturing global information, contributing to better contextual understanding and robustness. Visualization techniques such as Attention Rollout and Grad-CAM reinforce this point by showing broader attention spans in ViTs compared to the more localized patterns of CNNs (a minimal sketch of Attention Rollout appears after this list).
- Pre-training on Large Scale Datasets: ViT models pre-trained on larger datasets (e.g., ImageNet-21k) demonstrate superior robustness compared to those pre-trained on ImageNet-1k, underlining the importance of leveraging large-scale pre-training.
- Fourier and Energy Spectrum Analyses: ViTs are less sensitive to high-frequency perturbations and exhibit a smoother loss landscape under adversarial perturbations, suggesting lower sensitivity to adversarial noise (a simplified frequency-analysis sketch follows the Attention Rollout example below).
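Attention Rollout (Abnar & Zuidema, 2020) is an existing visualization technique the authors use rather than one they introduce. The sketch below shows its core computation in NumPy, assuming per-layer attention maps have already been extracted from a ViT; the extraction step is framework-specific and omitted here.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention Rollout: propagate attention through the layers by
    repeatedly composing head-averaged attention matrices, mixing in the
    identity to account for residual connections.

    `attentions`: list of arrays, one per layer, each of shape
    (num_heads, num_tokens, num_tokens) -- assumed to be extracted
    from a ViT beforehand.
    Returns a (num_tokens, num_tokens) map of how strongly each output
    token attends to each input token across the whole network.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for attn in attentions:
        attn = attn.mean(axis=0)                        # average over heads
        attn = attn + np.eye(num_tokens)                # add residual (skip) path
        attn = attn / attn.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = attn @ rollout                        # compose with earlier layers
    return rollout
```

Reshaping the [CLS] token's row of the returned map to the patch grid yields the kind of global attention visualization contrasted with CNN saliency maps in the paper.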
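The paper's Fourier analysis probes how sensitive a model is to perturbations at different frequencies. The sketch below is a simplified, assumed proxy that only quantifies how much of a given perturbation's spectral energy lies above a chosen frequency cutoff; it is not a reproduction of the paper's Fourier heatmaps.

```python
import numpy as np

def high_frequency_energy_fraction(perturbation, radius_frac=0.5):
    """Fraction of a 2D perturbation's spectral energy outside a centered
    low-frequency disk. `radius_frac` is an illustrative cutoff expressed
    as a fraction of the half-spectrum radius (an assumption, not a value
    from the paper)."""
    spectrum = np.fft.fftshift(np.fft.fft2(perturbation))  # center the DC component
    energy = np.abs(spectrum) ** 2
    h, w = energy.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.sqrt((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2)
    cutoff = radius_frac * min(h, w) / 2.0
    return energy[dist > cutoff].sum() / energy.sum()
```

Under this simplified view, a model whose error grows slowly as this fraction increases corresponds to what the paper describes as lower sensitivity to high-frequency noise.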
Implications and Future Directions
The findings in this paper have substantial implications for the design of robust neural network architectures. By illustrating the conditions under which ViTs outperform CNNs in robustness, the paper lays groundwork for exploring the integration of attention mechanisms and large-scale data pre-training in future architectural designs.
Furthermore, the research opens avenues for exploring how pairing ViTs with advanced data augmentation techniques and regularization strategies might further bolster robustness. Additionally, the marked improvement in robustness against adversarial examples makes ViTs promising candidates for deployment in security-critical applications where model reliability is crucial.
Conclusion
The paper provides a thorough analysis of Vision Transformers, revealing their robustness advantages over strong CNN baselines such as BiT, particularly under comparable data and computational regimes. Its systematic approach to experimentation and evaluation yields empirical and qualitative insights into why ViTs are robust, serving to guide future explorations and advancements in resilient computer vision systems.