Insights on "Are Transformers More Robust Than CNNs?"
The paper "Are Transformers More Robust Than CNNs?" by Yutong Bai et al. provides a rigorous comparative analysis of the robustness characteristics of Transformers and Convolutional Neural Networks (CNNs). This paper addresses the emerging narrative in the computer vision community that Transformers, originally designed for natural language processing, exhibit superior robustness to CNNs when applied to visual recognition tasks.
Methodological Setup
The authors address inconsistencies in prior comparisons by standardizing the experimental settings across CNNs and Transformers. Specifically, they pair models of comparable capacity (ResNet-50 with DeiT-S) and use a unified training setup to ensure a fair evaluation. Their investigation centers on two types of robustness: adversarial robustness and generalization to out-of-distribution samples. A brief capacity check is sketched below.
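To make the capacity matching concrete, the snippet below builds the two models with the timm library and compares their parameter counts. This is an illustrative sketch, not the authors' code; the timm model identifiers and the use of timm at all are assumptions for illustration.

```python
# Illustrative sketch: compare parameter counts of the two capacity-matched models.
# Assumes the timm library; model names are timm identifiers, not the paper's code.
import timm

def count_params(name: str) -> float:
    model = timm.create_model(name, pretrained=False)
    return sum(p.numel() for p in model.parameters()) / 1e6  # in millions

for name in ["resnet50", "deit_small_patch16_224"]:
    print(f"{name}: {count_params(name):.1f}M parameters")
# Expect roughly ~25M for ResNet-50 and ~22M for DeiT-S, i.e., comparable capacity.
```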
Findings on Adversarial Robustness
In evaluating adversarial robustness, the paper first confirms that Transformers and CNNs are similarly vulnerable to adversarial attacks in the vanilla (non-adversarially-trained) setting, with only minor differences across attack methods. Notably:
- Initial Observations: Under AutoAttack, both ResNet-50 and DeiT-S are broken almost completely in the vanilla setting, retaining near-zero robustness even at very small perturbation radii (an evaluation sketch follows this list).
- Adversarial Training Outcomes: Adversarial training substantially strengthens both architectures against perturbation-based attacks, and the adversarially trained ResNet-50 and DeiT-S reach comparable robustness. Intriguingly, simply adopting the GELU activation used in Transformers lets the adversarially trained CNN closely match the Transformer's robustness (see the activation-swap sketch after this list).
- Patch-based Attacks: On patch-based adversarial attacks, data augmentation proves decisive. With strong augmentations such as CutMix, ResNet-50 can match or even outperform DeiT-S, challenging the prior belief of inherent Transformer superiority (a simplified CutMix sketch appears after this list).
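For the vanilla-setting evaluation, a minimal way to run AutoAttack is with the reference autoattack package, as sketched below. The radius, batch size, and data handling here are illustrative placeholders rather than the paper's exact configuration.

```python
# Minimal AutoAttack evaluation sketch (assumes `pip install autoattack`).
# The epsilon, batch size, and data handling are illustrative placeholders.
import torch
from autoattack import AutoAttack

def autoattack_accuracy(model, images, labels, eps=0.001):
    """Return accuracy on AutoAttack adversarial examples at L-inf radius `eps`."""
    model.eval()
    adversary = AutoAttack(model, norm="Linf", eps=eps, version="standard")
    x_adv = adversary.run_standard_evaluation(images, labels, bs=128)
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == labels).float().mean().item()
```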
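The ReLU-to-GELU swap studied in the paper can be approximated in a few lines of PyTorch. The recursive replacement below is a sketch of the idea applied to a torchvision ResNet-50, not the authors' exact implementation.

```python
# Sketch: replace every ReLU in a torchvision ResNet-50 with GELU,
# mimicking the "Transformer-style activation" ablation described above.
import torch.nn as nn
from torchvision.models import resnet50

def relu_to_gelu(module: nn.Module) -> nn.Module:
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.GELU())
        else:
            relu_to_gelu(child)
    return module

resnet50_gelu = relu_to_gelu(resnet50())
```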
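CutMix itself is easy to sketch: paste a random rectangle from a shuffled copy of the batch and mix the labels in proportion to the pasted area. The function below is a simplified, illustrative version with hypothetical hyperparameters, not the exact augmentation pipeline used in the paper.

```python
# Simplified CutMix sketch: paste a random rectangle from a shuffled batch
# and mix labels in proportion to the pasted area. Hyperparameters are illustrative.
import torch

def cutmix(images, labels, alpha=1.0):
    images = images.clone()
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    # Sample a box whose area is roughly (1 - lam) of the image.
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)  # correct for clipping at the border
    return images, labels, labels[perm], lam   # loss = lam*CE(out, y) + (1-lam)*CE(out, y_perm)
```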
Generalization on Out-of-Distribution Samples
The paper substantiates that Transformers outperform CNNs in generalizing to out-of-distribution samples such as ImageNet-A, ImageNet-C, and Stylized-ImageNet, and that this advantage persists across the training setups examined.
- Consistent Robustness: Even without pretraining on large-scale external datasets, Transformers generalize better. Through ablation studies, the authors attribute much of this advantage to the Transformer architecture itself, particularly the self-attention mechanism (a minimal self-attention sketch follows this list).
- Training Recipes and Distillation: The paper examines the effects of training recipes, cross-architecture distillation, and hybrid architectures, and concludes that neither adopting Transformer-style training strategies nor distilling from a Transformer teacher fully bridges the robustness gap between CNNs and Transformers (a distillation-loss sketch is given after this list).
- Model Scaling: Extending the analysis to larger models corroborates the findings: Transformers retain their out-of-distribution advantage over CNNs of comparable or larger capacity.
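To make the self-attention ingredient concrete, the snippet below sketches a single-head scaled dot-product self-attention layer of the kind used inside ViT/DeiT blocks. The single head and the dimensions are simplifications for illustration, not the paper's architecture.

```python
# Minimal single-head self-attention sketch (a simplification of the multi-head
# attention used inside DeiT blocks); dimensions are illustrative.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)             # every token attends to every other token
        return self.proj(attn @ v)
```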
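The cross-architecture distillation experiment can be sketched as training a CNN student against soft labels from a frozen Transformer teacher. The temperature and loss weighting below are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of soft-label distillation from a Transformer teacher to a CNN student.
# Temperature T and weight alpha are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend cross-entropy on true labels with a KL term toward the teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd
```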
Implications and Future Directions
The findings of this paper have significant practical and theoretical implications:
- Practical Insights: The results suggest that Transformers' clearest intrinsic advantage lies in out-of-distribution generalization, whereas properly trained CNNs (e.g., with GELU activations and strong augmentation) can match Transformers in adversarial robustness. For deployed systems, the choice of architecture and training recipe should therefore depend on which failure mode matters most.
- Theoretical Insights: The research provides a foundation to explore architectural components that endow Transformers with their robustness. Understanding these components can drive innovation in hybrid or new architectures that blend the strengths of CNNs and Transformers.
- Future Developments: Given the role of self-attention in out-of-distribution generalization, future models may increasingly build on attention-based designs. Exploring variations of the attention mechanism and their impact on robustness is a promising direction.
In conclusion, Bai et al.'s work offers a methodologically sound, comprehensive comparison of CNNs and Transformers, refining our understanding of their respective robustness characteristics. It resolves ambiguities left by earlier comparisons and paves the way for more robust and generalizable systems in computer vision and beyond.