Analyzing the Robustness of Vision Transformers in Image Classification
The paper "Understanding Robustness of Transformers for Image Classification" conducts a thorough investigation into the robustness of the Vision Transformer (ViT) architecture, comparing it to the conventional ResNet architectures. This paper explores the response of these architectures to perturbations, both in input data and model parameters, to better understand the potential strengths and weaknesses of Transformers in image classification tasks.
The research aims to clarify how Transformers, known primarily for their success in NLP, behave when applied to vision tasks. It examines their robustness across several scenarios: standard image corruptions, real-world distribution shifts, adversarial attacks, and perturbations of the model itself. The properties that emerge when ViT models are pre-trained on large datasets such as JFT-300M are compared with those of similarly trained ResNets, providing insight into their efficacy and scalability.
Key Findings and Methodology
Numerous experiments evaluate how ViT models respond to natural and adversarial perturbations. A broad range of ViT and ResNet models was trained on datasets of increasing scale (ILSVRC-2012, ImageNet-21k, and JFT-300M) and then tested for robustness. Established benchmarks such as ImageNet-C, ImageNet-R, and ImageNet-A were used to measure the models' resilience to known corruptions and distribution shifts; the basic evaluation protocol is sketched below.
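In practice, these benchmarks are evaluated by running a pretrained model over each corruption type and severity and recording top-1 accuracy, from which aggregate scores such as mean corruption error are derived. The following is a minimal sketch of that loop, assuming the `timm` library, a `vit_base_patch16_224` checkpoint, and a typical ImageNet-C directory layout; none of these choices come from the paper itself.

```python
# Minimal sketch: top-1 accuracy of a pretrained ViT on one ImageNet-C
# corruption/severity split. Model name, data path, and batch size are
# illustrative assumptions, not the paper's exact configuration.
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torchvision import datasets
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("vit_base_patch16_224", pretrained=True).to(device).eval()

# Preprocessing that matches the checkpoint's training-time configuration.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

# Assumed ImageNet-C layout: <root>/<corruption>/<severity>/<class>/*.JPEG
dataset = datasets.ImageFolder("imagenet-c/gaussian_noise/3", transform=transform)
loader = DataLoader(dataset, batch_size=64, num_workers=4)

correct, total = 0, 0
with torch.no_grad():
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.size(0)

print(f"top-1 accuracy under gaussian_noise/3: {correct / total:.3f}")
```

The same loop, repeated over every corruption type and severity level, yields the per-corruption accuracies from which benchmark summaries are computed.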
Robustness to Input Perturbations:
- Natural Corruptions (ImageNet-C): ViTs, especially those pre-trained on large datasets, are comparably or more robust than ResNets across a wide array of natural corruptions, and this robustness improves markedly with the size of the pre-training dataset.
- Real-World Distribution Shifts (ImageNet-R): ViT models show improved stability and outperform ResNets under distribution shifts when pre-trained on substantial datasets.
- Adversarial Attacks: Both ViT and ResNet models remain vulnerable to adversarial perturbations, but the perturbations that fool each architecture differ markedly: adversarial examples crafted against one architecture transfer poorly to the other (a PGD-style attack and transfer test are sketched after this list).
- Texture Bias and Spatial Robustness: ViTs configured with smaller patches maintain spatial robustness more effectively than those with larger patches, and ViTs show a notably reduced texture bias compared to ResNets.
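To make the adversarial evaluation concrete, here is a minimal PGD (projected gradient descent) sketch under an L-infinity constraint. It assumes the model accepts inputs in [0, 1] (i.e., any normalization is folded into the model); the epsilon, step size, and iteration count are illustrative defaults rather than the paper's configuration, and the commented transfer test with `vit_model`/`resnet_model` is hypothetical.

```python
# Minimal PGD (L-infinity) sketch for probing adversarial robustness and
# cross-architecture transfer. Epsilon, step size, and steps are illustrative.
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=4/255, alpha=1/255, steps=10):
    """Return adversarial examples within an L-inf ball of radius eps around images."""
    adv = images.clone().detach()
    adv = adv + torch.empty_like(adv).uniform_(-eps, eps)   # random start
    adv = adv.clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()             # ascend the loss
        adv = images + (adv - images).clamp(-eps, eps)       # project back into the ball
        adv = adv.clamp(0, 1).detach()                       # keep valid pixel range
    return adv

# Hypothetical transfer test: craft on a ViT, evaluate on a ResNet (or vice versa).
# adv = pgd_attack(vit_model, images, labels)
# transfer_acc = (resnet_model(adv).argmax(1) == labels).float().mean()
```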
Robustness to Model Perturbations:
- Layer Correlation and Redundancy: A correlation analysis of ViT layer representations indicates a high degree of redundancy, similar to what has been observed in ResNets. The later layers mainly consolidate information into the CLS token rather than updating the patch representations (a representational-similarity probe is sketched after this list).
- Lesion Studies: Selectively removing layers from trained ViT models shows that ViTs tolerate substantial model perturbations; in particular, they are more resilient to the removal of MLP layers than to the removal of self-attention layers (see the lesion sketch after this list).
- Localization of Attention: Restricting the attention mechanism to local neighborhoods, rather than allowing global communication across all patches, has only a minimal impact on accuracy, pointing to opportunities for improving computational efficiency without sacrificing performance.
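The layer-correlation finding can be probed with a representational-similarity metric; linear CKA is one standard choice, though not necessarily the exact measure used in the paper. The sketch below captures each block's output with forward hooks (assuming timm's `VisionTransformer` exposes its transformer blocks as `model.blocks`) and compares the CLS-token representations of consecutive blocks; the random input batch is a stand-in for real images.

```python
# Sketch: pairwise similarity of ViT block outputs via linear CKA.
# A generic redundancy probe, not necessarily the paper's exact metric.
import torch
import timm

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    x = x - x.mean(0, keepdim=True)
    y = y - y.mean(0, keepdim=True)
    xty = (x.T @ y).norm() ** 2
    return (xty / ((x.T @ x).norm() * (y.T @ y).norm())).item()

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
acts = []
hooks = [blk.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
         for blk in model.blocks]            # assumes timm exposes model.blocks

with torch.no_grad():
    model(torch.randn(32, 3, 224, 224))      # stand-in batch; use real images in practice

for h in hooks:
    h.remove()

# Compare CLS-token representations of consecutive blocks; values near 1 suggest redundancy.
cls = [a[:, 0, :] for a in acts]             # (batch, dim) per block
for i in range(len(cls) - 1):
    print(f"block {i} vs {i + 1}: CKA = {linear_cka(cls[i], cls[i + 1]):.3f}")
```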
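A lesion study can be approximated by zeroing out a single residual branch in a trained model and re-measuring accuracy. The sketch below assumes timm's `VisionTransformer`, where each entry of `model.blocks` has `.attn` and `.mlp` submodules that feed residual connections; the `evaluate()` helper is hypothetical (for example, the accuracy loop sketched earlier).

```python
# Sketch of a lesion experiment: disable the MLP (or self-attention) sublayer
# of a single trained ViT block and re-measure accuracy. Assumes timm's
# per-block .attn/.mlp layout; evaluate() is a hypothetical accuracy helper.
import copy
import torch
import torch.nn as nn
import timm

class ZeroOut(nn.Module):
    """Return zeros so the lesioned residual branch contributes nothing (x + 0 = x)."""
    def forward(self, x):
        return torch.zeros_like(x)

base = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

def lesioned(model, block_idx, part):
    """Copy the model and zero out 'mlp' or 'attn' in one block."""
    m = copy.deepcopy(model)
    setattr(m.blocks[block_idx], part, ZeroOut())
    return m

# Hypothetical sweep: accuracy after removing each block's MLP sublayer.
# for i in range(len(base.blocks)):
#     acc = evaluate(lesioned(base, i, "mlp"))   # evaluate() as sketched earlier
#     print(f"block {i} without MLP: {acc:.3f}")
```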
Implications and Future Directions
This comprehensive evaluation of ViT architectures in comparison to ResNets underlines the shifting landscape in image classification, where Transformer-based models are emerging as a viable alternative, especially with sufficient pre-training data. The findings suggest that while the architectural differences lead to distinct vulnerabilities and strengths, the application of Transformers in vision tasks can result in robust models with potential areas for further optimization.
The paper provides a foundation for future work to explore architectural enhancements, particularly in addressing identified vulnerabilities such as adversarial attacks. The potential for pruning redundant elements in the ViT architecture also provides a pathway for developing more efficient and scalable models. This research contributes to an improved understanding of the operational dynamics of Transformers in image classification tasks and lays groundwork for ongoing advancements in AI-driven visual recognition.