
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2010.11929v2)

Published 22 Oct 2020 in cs.CV, cs.AI, and cs.LG

Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Citations (33,045)

Summary

  • The paper introduces the Vision Transformer (ViT) that replaces CNNs with a Transformer-based approach by processing 16x16 image patches.
  • It demonstrates state-of-the-art accuracy, achieving 88.55% on ImageNet and 94.55% on CIFAR-100 after extensive pre-training.
  • The study shows that ViT scales favorably with pre-training data, matching or exceeding state-of-the-art CNNs while requiring substantially less pre-training compute.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Introduction

The paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" introduces the Vision Transformer (ViT), a novel approach to image recognition that leverages the Transformer architecture, traditionally successful in NLP, for computer vision tasks. This approach bypasses convolutional networks entirely, relying on self-attention mechanisms applied to patches of images to perform classification tasks. Figure 1

Figure 1: Model overview - splitting an image into fixed-size patches and processing them using a Transformer encoder.

Methodology

Model Architecture

ViT treats images as sequences of flattened patches fed into a Transformer model, mirroring how token sequences are processed in NLP. The architecture consists of:

  1. Image Patching: Images are split into non-overlapping patches of size 16x16 pixels.
  2. Linear Embedding: Each patch is flattened and linearly projected to a constant latent dimension D.
  3. Sequence Formation: A learnable classification token is prepended, and positional embeddings are added to retain spatial information.
  4. Transformer Encoder: A standard Transformer encoder processes the sequence, leveraging multi-head self-attention and feedforward layers.

This architecture is depicted in Figure 1, providing a straightforward yet scalable model design.
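To make the four steps concrete, below is a minimal PyTorch sketch of the pipeline. It is an illustrative reimplementation, not the authors' code: the defaults follow the ViT-Base/16 configuration (patch size 16, latent dimension D = 768, 12 layers, 12 heads), and details such as dropout, weight initialization, and the final encoder LayerNorm are omitted.

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Sketch of ViT: patchify -> linear embed -> [class] token + position embeddings -> Transformer encoder -> classify."""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # 14 * 14 = 196 patches for a 224px image
        # Steps 1-2: a strided conv is equivalent to splitting into patches and applying a shared linear projection
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Step 3: learnable [class] token and learnable 1D position embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step 4: standard Transformer encoder (pre-norm, GELU MLP) with multi-head self-attention
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                  # classification head on the [class] token

    def forward(self, x):                                        # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)       # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed          # prepend [class] token, add positional information
        x = self.encoder(x)
        return self.head(x[:, 0])                                # logits read off the [class] token

logits = MinimalViT()(torch.randn(2, 3, 224, 224))               # -> shape (2, 1000)
```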

Training and Fine-tuning

ViT models require extensive pre-training on large datasets such as ImageNet-21k or JFT-300M and are then fine-tuned on smaller target datasets, showing impressive transfer capabilities. Fine-tuning is often performed at a higher resolution than pre-training, which lengthens the input patch sequence; the pre-trained position embeddings are adapted to the new sequence length by 2D interpolation according to their location in the original image.
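A hedged sketch of that interpolation step is shown below. The function name (resize_pos_embed) and the use of PyTorch are our own choices; the paper specifies 2D interpolation of the position embeddings on their original patch grid, and bicubic resizing is one common way to realize it.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Interpolate ViT position embeddings when fine-tuning at a higher resolution.

    pos_embed: (1, 1 + old_grid**2, D) -- [class] token embedding followed by patch embeddings.
    Returns:   (1, 1 + new_grid**2, D)
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, old_grid, old_grid): lay the embeddings back onto their 2D patch grid
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    # Resize in 2D so each embedding keeps (approximately) its relative spatial location
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)

# e.g. pre-trained at 224px (14x14 patches), fine-tuned at 384px (24x24 patches)
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), old_grid=14, new_grid=24)
```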

Experimental Results

Performance Analysis

ViT proves competitive, achieving strong results on multiple datasets. For instance, it reaches 88.55% top-1 accuracy on ImageNet and 94.55% on CIFAR-100 when pre-trained on JFT-300M. These results affirm the potential of Transformers in vision when coupled with sufficient pre-training data.

Figure 2: Breakdown of VTAB performance across Natural, Specialized, and Structured task groups.

The Transformer-based approach also scales favorably with increasing pre-training data, outperforming comparably sized CNN baselines while requiring substantially less compute to reach the same accuracy (Figure 2).

Attention Insights

An important aspect investigated is how ViT utilizes attention. Analysis reveals that in the lower layers some heads attend locally while others already integrate information across most of the image, and the mean attention distance grows with depth. This flexibility allows ViT to leverage global image context more readily than CNNs, whose receptive fields grow only gradually with depth.

Figure 3: Initial linear embedding and attention distance analysis show the adaptability and extensive reach of ViT’s attention mechanisms.
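The attention-distance diagnostic can be approximated as follows. This is a hypothetical sketch rather than the authors' analysis code: it weights pairwise patch distances (in pixels) by each head's attention weights and averages over query positions, ignoring the [class] token.

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int, patch_size: int = 16) -> torch.Tensor:
    """Rough per-head 'mean attention distance' over N = grid*grid patch tokens.

    attn: (heads, N, N) attention weights (rows sum to 1), [class] token excluded.
    Returns: (heads,) attention-weighted average image-space distance, in pixels.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_size  # approx. patch centers
    dist = torch.cdist(coords, coords)                       # (N, N) pairwise pixel distances between patches
    return (attn * dist).sum(-1).mean(-1)                    # weight by attention, average over query positions

# e.g. 12 heads over a 14x14 patch grid (uniform attention here, just to show shapes)
n = 14 * 14
print(mean_attention_distance(torch.full((12, n, n), 1.0 / n), grid=14))
```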

Self-Supervision

Preliminary experiments with self-supervised pre-training (masked patch prediction) showed promise but trailed supervised pre-training, suggesting further work is needed to fully harness self-supervision in the vision domain.

Conclusion

ViT showcases the potential of Transformer architectures for computer vision, achieving state-of-the-art performance at lower pre-training cost than comparable CNNs. The research opens a pathway for applying Transformers to other vision tasks such as object detection and segmentation, particularly with refined self-supervised learning strategies. Future work should consider not only scaling up the models but also making better use of pre-training across diverse datasets.
