Convolutional Neural Networks Meet Vision Transformers: A Technical Overview
This paper introduces CMT, an architecture that combines Convolutional Neural Networks (CNNs) with Vision Transformers (ViTs) to address the limitations of deploying pure transformers for visual tasks such as image classification, object detection, and instance segmentation. The goal is to exploit the long-range dependency modeling of transformers while leveraging CNNs' strength in capturing local information. The authors propose a family of models, termed CMTs, that achieve a better accuracy-efficiency trade-off than existing CNN and transformer models.
Introduction and Background
CNNs have long dominated computer vision thanks to their strong hierarchical feature extraction. Transformers, initially impactful in NLP, are now being adapted to the vision domain and show promise in tasks such as image classification and semantic segmentation. However, challenges persist, including high computational cost and a performance gap relative to comparably sized CNNs. These gaps stem from the transformer's design, which tends to neglect local features and requires substantial compute for high-resolution inputs.
Proposed Approach
The CMT architecture enhances the ViT design by adding a convolutional stem and building hybrid blocks that combine local perception from CNN techniques with the global attention mechanism of transformers. Each block is composed of three components:
- Local Perception Unit (LPU): a depth-wise convolution with a residual connection that captures local information while preserving translation invariance.
- Lightweight Multi-Head Self-Attention (LMHSA): a modified attention mechanism that applies strided depth-wise convolutions to the key and value maps, shrinking their spatial size and thereby the cost of the attention computation.
- Inverted Residual Feed-Forward Network (IRFFN): an expand-then-project feed-forward path with a depth-wise convolution in the middle, similar to the inverted residual blocks used in efficient CNNs, which injects local structure into the feed-forward stage.
These components are stacked in a multi-stage architecture that, like traditional CNNs, progressively downsamples the input while applying the transformers' attention mechanism, producing hierarchical feature maps amenable to dense prediction tasks. A minimal sketch of one hybrid block is shown below.
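The following PyTorch sketch assembles the three components into a single CMT-style block. It is a simplified illustration under assumed settings: the channel count, head count, and reduction ratio are placeholders, and details such as normalization layers and the relative position bias in the attention are omitted.

```python
# Minimal sketch of one CMT-style hybrid block (LPU -> LMHSA -> IRFFN).
# Shapes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class LocalPerceptionUnit(nn.Module):
    """LPU: depth-wise 3x3 convolution with a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                      # x: (B, C, H, W)
        return x + self.dwconv(x)


class LightweightMHSA(nn.Module):
    """LMHSA: keys/values are spatially reduced by a strided depth-wise conv
    before standard multi-head attention, shrinking the attention matrix."""
    def __init__(self, dim, num_heads=4, reduction=2):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction,
                                stride=reduction, groups=dim)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))           # (B, HW, C)
        kv = self.reduce(x).flatten(2).transpose(1, 2)     # (B, hw, C), hw = HW / k^2
        k, v = self.kv(kv).chunk(2, dim=-1)

        def heads(t):                           # split channels into attention heads
            return t.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, HW, hw)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, H * W, C)
        return self.proj(out).transpose(1, 2).reshape(B, C, H, W)


class InvertedResidualFFN(nn.Module):
    """IRFFN: 1x1 expand -> depth-wise 3x3 (with shortcut) -> 1x1 project."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU())
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        h = self.expand(x)
        h = h + self.dwconv(h)                  # local residual, as in inverted residual blocks
        return self.project(h)


class CMTBlock(nn.Module):
    """One hybrid block: LPU, then attention and FFN with residual connections."""
    def __init__(self, dim, num_heads=4, reduction=2):
        super().__init__()
        self.lpu = LocalPerceptionUnit(dim)
        self.attn = LightweightMHSA(dim, num_heads, reduction)
        self.ffn = InvertedResidualFFN(dim)

    def forward(self, x):
        x = self.lpu(x)
        x = x + self.attn(x)
        x = x + self.ffn(x)
        return x


# Usage: a 14x14 feature map with 64 channels passes through one block unchanged in shape.
block = CMTBlock(dim=64)
print(block(torch.randn(1, 64, 14, 14)).shape)   # torch.Size([1, 64, 14, 14])
```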
Experimental Results
The CMT models, notably CMT-S, demonstrate clear improvements over prior architectures. On ImageNet, CMT-S reaches 83.5% top-1 accuracy with about 4.0 billion FLOPs, while requiring roughly 14x fewer FLOPs than DeiT and 2x fewer than EfficientNet at comparable accuracy. Across the reported comparisons, CMT models consistently sit ahead on the accuracy versus computational cost trade-off, on datasets including CIFAR-10, CIFAR-100, Flowers, and COCO.
Implications and Future Directions
The results support the claim that blending CNN and transformer components lets a model handle both global context and local detail more effectively and efficiently, suggesting a promising direction for future architectures in tasks that require spatial awareness. In addition, the proposed compound scaling strategy offers a structured way to grow models in a balanced manner across depth, dimension, and input resolution; a schematic example follows.
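The sketch below illustrates the general idea of compound scaling: a single coefficient jointly scales depth, embedding dimension, and input resolution. The base factors alpha, beta, and gamma here are illustrative placeholders, not the coefficients reported in the paper.

```python
def compound_scale(base_depth, base_dim, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Jointly scale depth, width, and resolution by a single coefficient phi.

    alpha, beta, gamma are assumed base factors for illustration only.
    """
    depth = int(round(base_depth * alpha ** phi))             # blocks per stage
    dim = int(round(base_dim * beta ** phi))                  # embedding width
    resolution = int(round(base_resolution * gamma ** phi))   # input image size
    return depth, dim, resolution


# Scaling an assumed small base configuration with increasing phi:
for phi in (0, 1, 2):
    print(phi, compound_scale(base_depth=2, base_dim=64, base_resolution=160, phi=phi))
```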
Future work could optimize these hybrid architectures for vision tasks beyond classification and detection, for example video understanding or 3D vision. Exploring further integration techniques or alternative network components could also yield more efficient designs.
Conclusion
CMT is a compelling demonstration of the benefits of harmonizing the strengths of CNNs and transformers. The proposed architecture not only advances performance on key benchmarks but also offers a scalable, efficient framework for future vision models, pointing the way toward new developments in architecture design for visual recognition.