Overview of "Incorporating Convolution Designs into Visual Transformers"
The paper "Incorporating Convolution Designs into Visual Transformers" introduces a novel architectural framework named Convolution-enhanced image Transformer (CeiT), which aims to harness the strengths of both Convolutional Neural Networks (CNNs) and Transformers to achieve superior performance in visual tasks. The work presents three primary modifications to the traditional Vision Transformer (ViT) model, enhancing its efficacy without necessitating large training datasets or additional supervision from CNN-based teacher models.
The paper addresses the limitations of pure Transformer architectures for visual tasks, notably their reliance on extensive training data to reach competitive performance, by incorporating CNN traits directly into the Transformer framework. It examines why directly transplanting Transformer architectures from NLP to the visual domain falls short and proposes targeted remedies.
Contributions and Modifications
The contributions consist of three strategic modifications to the baseline Transformer model, each illustrated by a brief code sketch after the list:
- Image-to-Tokens (I2T) Module: Unlike the standard ViT, which tokenizes raw image patches directly, CeiT applies a lightweight convolutional stem to extract low-level features before tokenization. This captures elementary image structures, such as edges and corners, and enriches the token representation with only a minimal increase in computation.
- Locally-enhanced Feed-Forward (LeFF) Layer: CeiT replaces the feed-forward network inside each Transformer encoder block with a Locally-enhanced Feed-Forward layer. A depth-wise convolution inserted between the linear projections lets spatially neighboring tokens interact, mimicking the local connectivity of CNNs while self-attention continues to capture long-range dependencies.
- Layer-wise Class Token Attention (LCA): The LCA module aggregates the class tokens produced at different encoder layers to refine the final representation. Attending over these multi-level tokens integrates information from several levels of abstraction and improves classification performance.
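
To make the I2T idea concrete, here is a minimal PyTorch sketch, not the paper's exact configuration: a strided convolution, batch normalization, and max pooling extract low-level features, and a second convolution slices the resulting feature map into patch tokens. The module name `ImageToTokens` and the kernel sizes, channel counts, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageToTokens(nn.Module):
    """Convolutional stem that extracts low-level features before tokenization.

    Illustrative sketch: strided convolution + batch norm + max pooling shrink
    the image, then a second convolution cuts the feature map into patch tokens.
    Kernel sizes and channel counts are assumptions, not the paper's settings.
    """
    def __init__(self, in_chans=3, stem_chans=32, embed_dim=192, patch_size=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, stem_chans, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(stem_chans),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Tokenize the feature map with small patches instead of slicing
        # 16x16 patches from raw pixels.
        self.proj = nn.Conv2d(stem_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.stem(x)                       # (B, stem_chans, H/4, W/4)
        x = self.proj(x)                       # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_tokens, embed_dim)
```

With these example settings, a 224x224 input yields a 14x14 grid of tokens, the same count as ViT's 16x16 raw patches, but each token now carries convolutional low-level features.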
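A similar sketch of the LeFF idea, with the same caveats: patch tokens are expanded by a linear layer, reshaped onto a 2D grid, passed through a depth-wise convolution so neighboring tokens interact, then projected back, while the class token bypasses the convolution. The GELU activation and the omission of the paper's normalization layers are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LocallyEnhancedFeedForward(nn.Module):
    """Feed-forward block with a depth-wise convolution over the patch tokens.

    Illustrative sketch: expand -> reshape to 2D grid -> depth-wise conv ->
    flatten -> project back. The class token skips the convolution.
    """
    def __init__(self, dim=192, expand_ratio=4, kernel_size=3):
        super().__init__()
        hidden = dim * expand_ratio
        self.expand = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)
        self.project = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x):                      # x: (B, 1 + N, dim), token 0 is [CLS]
        cls_tok, patches = x[:, :1], x[:, 1:]
        B, N, _ = patches.shape
        h = w = int(N ** 0.5)                  # assumes a square token grid
        y = self.act(self.expand(patches))     # (B, N, hidden)
        y = y.transpose(1, 2).reshape(B, -1, h, w)
        y = self.act(self.dwconv(y))           # local interaction between tokens
        y = y.flatten(2).transpose(1, 2)       # back to (B, N, hidden)
        y = self.project(y)                    # (B, N, dim)
        return torch.cat([cls_tok, y], dim=1)  # class token passes through unchanged
```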
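Finally, a sketch of the LCA idea: class tokens collected from every encoder layer form a short sequence that one attention-plus-MLP block refines, and the resulting token of the last layer feeds the classifier. Using ordinary self-attention via `nn.MultiheadAttention` over all layer tokens is a simplification of the paper's scheme, so treat the block as illustrative.

```python
import torch
import torch.nn as nn

class LayerwiseClassTokenAttention(nn.Module):
    """Attend over the class tokens produced by every encoder layer.

    Illustrative sketch: the L class tokens are stacked into a short sequence
    and refined by one self-attention + MLP block; the refined token of the
    last layer is used for classification. Depth and head count are assumptions.
    """
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim),
                                 nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, cls_tokens):             # cls_tokens: (B, L, dim), one per layer
        y = self.norm1(cls_tokens)
        y = cls_tokens + self.attn(y, y, y, need_weights=False)[0]
        y = y + self.mlp(self.norm2(y))
        return y[:, -1]                        # refined class token of the final layer
```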
Experimental Results
The empirical efficacy of CeiT is demonstrated across multiple benchmarks. On ImageNet, CeiT models surpass several state-of-the-art CNNs, such as ResNet and EfficientNet, in accuracy and computational efficiency. Notably, CeiT models trained for far fewer epochs reach accuracy comparable to existing ViT-style models, indicating improved training efficiency. When fine-tuned at higher input resolutions, CeiT shows further significant accuracy gains, underscoring its adaptability to different visual recognition tasks.
Implications and Future Directions
The research has considerable implications for future work on visual Transformers. By incorporating convolutional elements into the Transformer architecture, the paper addresses well-known issues of data inefficiency and slow convergence in pure Transformer models. The approach maintains or improves performance while keeping computational costs in check, and it scales to diverse visual datasets without prohibitive dataset sizes or excessive training resources.
Looking forward, the integration of convolutional strategies into Transformer designs could pave the way for more efficient and precise models across broader applications in computer vision. Future studies might explore the adaptation of these principles to other Transformer-based models in vision and beyond, potentially even inspiring hybrid architectures in other domains like video processing or multi-modal learning tasks. The convergence of CNN and Transformer advantages represents a promising paradigm in the ongoing evolution of artificial intelligence and machine learning methodologies.