An Analysis of "Intriguing Properties of Vision Transformers"
The paper "Intriguing Properties of Vision Transformers" presents a comprehensive paper of Vision Transformers (ViTs) and their robustness across various visual tasks. It specifically examines the ViT's performance in handling inconsistencies and perturbations in natural images, comparing these models to convolutional neural networks (CNNs).
Key Findings
The authors explore several significant properties of ViTs:
- Robustness to Occlusions: The paper reveals that ViTs demonstrate superior robustness against severe image occlusions compared to CNNs. Notably, ViTs maintain up to 60% top-1 accuracy on ImageNet even when 80% of the image content is randomly occluded. This resilience is attributed to self-attention, which lets ViTs adjust their effective receptive fields dynamically; a rough patch-occlusion evaluation is sketched after this list.
- Texture and Shape Bias: ViTs rely less on texture cues than CNNs and can encode shape information effectively. When trained on a stylized version of ImageNet, ViTs reach a shape bias comparable to human perception. Introducing a dedicated shape token lets a single ViT model texture and shape simultaneously, enhancing its versatility.
- Adversarial and Natural Perturbations: ViTs surpass CNNs in robustness to adversarial attacks and common corruptions. This robustness is closely tied to the training recipe, underscoring the role of data augmentation.
- Automated Segmentation without Supervision: A notable finding is that ViTs can produce accurate semantic segmentation maps without any pixel-level supervision, a capability attributed to their shape-biased representations (see the attention-based sketch after this list).
- Versatile Off-the-Shelf Feature Extraction: ViT features transfer well off the shelf, yielding significant gains on a range of classification tasks, including few-shot learning scenarios. Concatenating class tokens from multiple blocks of a single ViT forms a feature ensemble that generalizes well across datasets (see the final sketch after this list).
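To make the occlusion result concrete, below is a minimal sketch of a random patch-drop style evaluation. It is an illustration of the idea, not the paper's exact protocol: it assumes the timm library, its pretrained "deit_small_patch16_224" model, 224x224 inputs split into 16x16 patches, and a random tensor standing in for a normalized ImageNet batch.

```python
# Minimal sketch: random patch occlusion ("PatchDrop"-style) evaluation.
# Assumptions (not from the paper's code): timm is installed and provides
# "deit_small_patch16_224"; a random tensor stands in for a real batch.
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()

def random_patch_drop(images, drop_ratio=0.8, patch=16):
    """Zero out a random subset of non-overlapping 16x16 patches per image."""
    b, _, h, w = images.shape
    cols = w // patch
    num_patches = (h // patch) * cols
    num_drop = int(drop_ratio * num_patches)
    occluded = images.clone()
    for i in range(b):
        for idx in torch.randperm(num_patches)[:num_drop].tolist():
            r, c = divmod(idx, cols)
            occluded[i, :, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return occluded

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)            # stand-in for a real batch
    logits = model(random_patch_drop(images, 0.8))  # predictions with 80% occlusion
    top1 = logits.argmax(dim=-1)                    # compare against labels for accuracy
```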
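The segmentation observation can be illustrated by thresholding the class token's attention over the patch tokens. The sketch below is hypothetical: it assumes the same timm DeiT model, that its last block's attention layer exposes a fused `qkv` projection and a `num_heads` attribute, and it uses a random tensor in place of a real image; it is not the paper's exact segmentation procedure.

```python
# Hypothetical sketch: a coarse foreground mask from the class token's attention.
# Assumes timm's DeiT attention layer has a fused `qkv` Linear and `num_heads`.
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()
attn_layer = model.blocks[-1].attn
captured = {}

def grab_qkv(_module, _inputs, output):
    captured["qkv"] = output  # shape (batch, tokens, 3 * embed_dim)

hook = attn_layer.qkv.register_forward_hook(grab_qkv)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in input
hook.remove()

b, n, _ = captured["qkv"].shape
heads = attn_layer.num_heads
q, k, _v = captured["qkv"].reshape(b, n, 3, heads, -1).permute(2, 0, 3, 1, 4)
attn = (q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5).softmax(dim=-1)
cls_to_patches = attn[:, :, 0, 1:].mean(dim=1)      # average heads: (batch, 196)
mask = (cls_to_patches > cls_to_patches.quantile(0.6)).reshape(b, 14, 14)
```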
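Finally, the off-the-shelf feature ensemble can be approximated by collecting the class token after every transformer block and concatenating the results. This, too, is a hedged sketch under the same timm/DeiT assumptions; which blocks to keep, and what classifier consumes the vector (e.g., a linear probe or a nearest-neighbor few-shot evaluator), are design choices rather than details fixed by the paper.

```python
# Hedged sketch: a feature ensemble from a single ViT, built by concatenating
# the class token emitted by each transformer block. Assumes the timm DeiT
# model stores its blocks in `model.blocks`, with the class token at index 0.
import torch
import timm

model = timm.create_model("deit_small_patch16_224", pretrained=True).eval()
cls_tokens = []

def grab_cls(_module, _inputs, output):
    cls_tokens.append(output[:, 0])  # class token from this block: (batch, dim)

hooks = [blk.register_forward_hook(grab_cls) for blk in model.blocks]
with torch.no_grad():
    model(torch.randn(2, 3, 224, 224))  # stand-in batch
for h in hooks:
    h.remove()

ensemble_feature = torch.cat(cls_tokens, dim=-1)  # (batch, num_blocks * dim)
```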
Methodology
The research encompasses a diverse range of experiments across different ViT families, namely ViT, DeiT, and T2T, on 15 vision datasets. A high-performing CNN, ResNet50, serves as the baseline for evaluating robustness and generalization.
Implications and Future Directions
The outcomes of this paper offer meaningful insights into the potential applications of ViTs in areas demanding high levels of robustness and generalizability, such as autonomous vehicles and healthcare. The work suggests that the dynamic nature and flexible receptive fields of ViTs position them as superior alternatives to traditional CNN frameworks for handling complex visual tasks.
In future developments, it would be worthwhile to investigate the integration of separate tokens within ViTs to further harness their potential in modeling diverse cues. Additionally, combining techniques from self-supervised learning and stylized training could broaden the applicability of ViTs, especially in unsupervised segmentation tasks.
This thorough examination of ViTs underscores their adaptability and robustness to occlusions and perturbations, reshaping how visual recognition tasks are approached and paving the way for further advances in artificial intelligence.