
CvT: Introducing Convolutions to Vision Transformers (2103.15808v1)

Published 29 Mar 2021 in cs.CV

Abstract: We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.

Convolutional Vision Transformer (CvT): A Comprehensive Overview

The paper "CvT: Introducing Convolutions to Vision Transformers" addresses a significant advancement in the field of image recognition by merging the advantages of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The authors propose a novel architecture termed Convolutional vision Transformer (CvT), which leverages the strengths of both CNNs and Transformers for enhanced image recognition performance and efficiency.

Key Contributions and Architectural Innovations

The CvT architecture introduces two primary modifications to the standard ViT framework:

  1. Hierarchical Transformers with Convolutional Token Embedding: The first layer in each stage of the hierarchical structure is a Convolutional Token Embedding layer, which performs overlapping convolutions on a 2D-reshaped token map. This enables the network to capture local spatial context while progressively decreasing the length of the token sequence and increasing the feature dimensions, similar to the hierarchical design in CNNs (see the first sketch after this list).
  2. Convolutional Projection for Self-Attention: The paper replaces the position-wise linear projection in the Transformer's attention mechanism with a convolutional projection, implemented as a depth-wise separable convolution. This allows the model to capture additional local spatial information and to manage computational complexity by subsampling the key and value matrices (see the second sketch after this list).
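A minimal PyTorch sketch of the convolutional token embedding from item 1 may help make this concrete. The kernel size 7 and stride 4 match the paper's description of the first stage; the class and variable names are our own, illustrative choices:

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Sketch of CvT's convolutional token embedding (names illustrative).

    An overlapping strided convolution maps a 2D feature map to a shorter,
    wider token sequence, unlike ViT's non-overlapping patch split.
    """
    def __init__(self, in_ch=3, embed_dim=64, kernel=7, stride=4):
        super().__init__()
        padding = kernel // 2  # padding makes the patches overlap
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel, stride, padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/stride, W/stride)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H*W, D) token sequence
        return self.norm(x), (H, W)          # keep spatial size for reshaping
```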
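Likewise, a sketch of the convolutional projection from item 2, written as a depth-wise separable convolution (depth-wise conv, BatchNorm, point-wise conv); the stride-2 variant for keys and values is the subsampling mentioned above. Again, the class name and defaults are illustrative rather than the authors' code:

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Sketch of CvT's convolutional projection for one attention input."""
    def __init__(self, dim, kernel=3, stride=1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(dim, dim, kernel, stride, kernel // 2, groups=dim),  # depth-wise
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim, kernel_size=1),                            # point-wise
        )

    def forward(self, tokens, hw):                    # tokens: (B, N, D)
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)  # tokens -> 2D map
        x = self.proj(x)                                # conv projection (maybe strided)
        return x.flatten(2).transpose(1, 2)             # back to (B, N', D)

# Usage sketch: queries keep stride 1; keys/values use stride 2, so their
# sequence shrinks 4x and the attention matrix becomes ~4x cheaper.
# q_proj  = ConvProjection(dim=64, stride=1)
# kv_proj = ConvProjection(dim=64, stride=2)
```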

Evaluations and Performance

The effectiveness of the CvT was evaluated through extensive experiments, primarily on the ImageNet dataset, and results were compared against other state-of-the-art models, including both CNN-based models (e.g., ResNet) and Transformer-based models (e.g., ViT, DeiT). Some key results reported include:

  • ImageNet-1k: The CvT-21 model achieves a top-1 accuracy of 82.5% with only 32M parameters and 7.1 GFLOPs, higher than DeiT-B's 81.8% accuracy while using 63% fewer parameters and 60% fewer FLOPs (a quick check of these reductions follows this list).
  • ImageNet-22k Pretraining: CvT-W24, pre-trained on ImageNet-22k and fine-tuned on ImageNet-1k, attains a top-1 accuracy of 87.7%. This outperforms previous ViT models pre-trained on the same dataset, such as ViT-L/16, by a margin of 2.5%.
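The relative savings in the first bullet can be verified with a quick calculation, assuming DeiT-B's commonly cited 86M parameters and 17.6 GFLOPs (figures from the DeiT paper, not restated above):

```python
# Quick check of CvT-21 (32M params, 7.1 GFLOPs) vs. DeiT-B,
# assuming DeiT-B's commonly cited 86M params and 17.6 GFLOPs.
param_reduction = 1 - 32 / 86    # ~0.63 -> "63% fewer parameters"
flop_reduction = 1 - 7.1 / 17.6  # ~0.60 -> "60% fewer FLOPs"
print(f"{param_reduction:.0%} fewer parameters, {flop_reduction:.0%} fewer FLOPs")
```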

Ablation Studies

The paper methodically explores the impact of different components via detailed ablation studies:

  1. Removal of Positional Embeddings: Because the convolutional layers already capture local spatial information, CvT maintains performance even without explicit positional embeddings, simplifying the model design and improving adaptability to varying input resolutions. For instance, removing positional embeddings does not degrade the performance of CvT-13, whereas DeiT-S incurs a 1.8% drop without them.
  2. Effectiveness of Convolutional Token Embedding: The Convolutional Token Embedding layer is shown to be crucial for performance: it provides a 0.8% improvement in top-1 accuracy on ImageNet compared to the non-overlapping patch embedding used in ViT.
  3. Convolutional Projection Impact: The Convolutional Projection was demonstrated to offer benefits over the position-wise linear projection, especially when applied in all stages of the Transformer blocks. Using strides greater than 1 for the key and value projections improved computational efficiency by reducing FLOPs, with minimal impact on accuracy (a back-of-envelope estimate of this saving follows the list).
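A back-of-envelope estimate (ours, not a figure from the paper) clarifies the FLOP savings from the strided key/value projections. With $n$ tokens of dimension $d$, forming the attention matrix costs on the order of $n^2 d$ operations; subsampling keys and values with stride $s$ reduces their sequence length to $n/s^2$, so the cost becomes

$$n \cdot \frac{n}{s^2} \cdot d \;=\; \frac{n^2 d}{s^2},$$

i.e., a stride of 2 cuts this term roughly fourfold, consistent with the direction of the reported efficiency gains.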

Implications and Future Directions

The proposed CvT architecture presents both theoretical and practical implications. Theoretically, it challenges the homogeneous architecture of pure Transformer models by integrating convolutional operations, thus highlighting the continued relevance of convolutions in vision tasks. Practically, the CvT model’s ability to omit positional embeddings simplifies its application across various image resolutions, making it particularly suitable for tasks requiring variable input sizes.

Future developments might explore further optimizations in the balanced integration of convolutional and transformer elements, potentially leveraging NAS (Neural Architecture Search) techniques to automate and refine architecture design. Additionally, the application of CvT to broader tasks outside image classification, such as object detection and segmentation, would be an intriguing direction to extend the versatility of this hybrid modeling approach.

In conclusion, CvT harnesses the synergies between convolutional networks and transformer architectures, providing a robust, efficient, and adaptable solution for image recognition tasks. The design innovations and compelling performance metrics reported in this paper represent a significant contribution to the ongoing evolution of deep learning models in computer vision.

Authors (7)
  1. Haiping Wu
  2. Bin Xiao
  3. Noel Codella
  4. Mengchen Liu
  5. Xiyang Dai
  6. Lu Yuan
  7. Lei Zhang
Citations (1,710)