CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
The paper "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification" by Chun-Fu (Richard) Chen, Quanfu Fan, and Rameswar Panda presents a novel approach to enhance the performance of vision transformers (ViTs) through multi-scale feature representations. Leveraging the foundational principles of the ViT architecture, the authors propose a dual-branch transformer model capable of processing image patches of varying sizes to generate robust image features, addressing limitations found in existing transformer models and convolutional neural networks (CNNs).
Methodology
The proposed model, termed CrossViT, introduces a dual-branch structure in which small-patch and large-patch tokens are handled by two separate branches of different computational complexity. The branches exchange information through a cross-attention based fusion module, applied multiple times, in which the class (CLS) token of one branch interacts with the patch tokens of the other, enabling effective integration of features across scales.
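To make the dual-branch tokenization concrete, the sketch below shows how two patch embeddings with different patch sizes produce a fine-grained and a coarse token sequence, each prefixed with its own CLS token. This is a minimal PyTorch illustration; the patch sizes, embedding dimensions, and names such as `PatchBranch` are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of dual-branch tokenization (illustrative, not the authors' code).
import torch
import torch.nn as nn

class PatchBranch(nn.Module):
    """One branch: patch embedding plus a learnable CLS token for one patch size."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=192):
        super().__init__()
        # Non-overlapping patch projection, as in standard ViT patch embedding.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # (B, 1, D)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

# Two branches with different (assumed) patch sizes: fine-grained vs. coarse tokens.
small_branch = PatchBranch(patch_size=8,  embed_dim=192)
large_branch = PatchBranch(patch_size=16, embed_dim=384)

x = torch.randn(2, 3, 224, 224)
print(small_branch(x).shape)  # torch.Size([2, 785, 192])  -> 28x28 patches + CLS
print(large_branch(x).shape)  # torch.Size([2, 197, 384])  -> 14x14 patches + CLS
```

In the full model, each branch would be followed by its own stack of transformer encoders before the two token sequences are fused.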
Significant contributions of the paper include:
- Dual-Branch Transformer: Two separate branches process image patches of different sizes, addressing the need for multi-scale feature representation. This dual-branch configuration facilitates learning of complementary features, enhancing the model’s understanding of both fine and coarse details within images.
- Cross-Attention Mechanism: A cross-attention based token fusion module in which the CLS token of one branch serves as the sole query attending to the patch tokens of the other branch. Because only a single query is involved, the module is linear rather than quadratic in the number of tokens, in both computation and memory, unlike conventional self-attention; a minimal sketch of this fusion follows this list.
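The following sketch illustrates why the fusion scales linearly with the number of tokens: only the CLS token forms the query, so the attention map has a single query row rather than one row per token. It is a simplified PyTorch illustration assuming both branches share the same embedding dimension (the paper projects the CLS token between branch dimensions); `CrossAttentionFusion` and its interface are hypothetical, not the authors' implementation.

```python
# Minimal sketch of cross-attention token fusion (single simplified module;
# names and interface are assumptions, not the authors' code).
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse the CLS token of one branch with the tokens of the other branch.

    Only the CLS token acts as the query, so the attention map has one query
    row instead of (N + 1) rows: linear, not quadratic, in the token count.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_token, other_tokens):
        # cls_token: (B, 1, D) from branch A; other_tokens: (B, N+1, D) from branch B.
        kv = torch.cat([cls_token, other_tokens], dim=1)         # keys/values: (B, N+2, D)
        fused, _ = self.attn(query=cls_token, key=kv, value=kv)  # one query row only
        return cls_token + fused                                 # residual back into branch A

fusion = CrossAttentionFusion(dim=192)
cls_a = torch.randn(2, 1, 192)       # CLS token of branch A
tokens_b = torch.randn(2, 197, 192)  # CLS + patch tokens of branch B (same dim assumed)
print(fusion(cls_a, tokens_b).shape)  # torch.Size([2, 1, 192])
```

After fusion, the updated CLS token returns to its own branch, where it redistributes the exchanged information to that branch's patch tokens in subsequent encoder layers.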
Experimental Results
Extensive experimentation underscores the efficacy of the CrossViT approach. Key findings include:
- On the ImageNet1K dataset, CrossViT outperforms recent transformer-based models, including DeiT, by roughly 2% in top-1 accuracy, with only a small to moderate increase in FLOPs and model parameters.
- The model's performance is competitive with efficient CNN models, establishing its utility in the broader landscape of image classification architectures.
Implications and Future Directions
The introduction of multi-scale processing within a transformer framework opens avenues for more nuanced and flexible image classification models. By coupling small-patch and large-patch processing, CrossViT demonstrates that accuracy can be improved meaningfully without an excessive increase in computational overhead.
From a theoretical perspective, the cross-attention module's linear complexity represents a significant innovation, potentially influencing future transformer model designs not only in computer vision but also in other domains where hierarchical feature representation is beneficial.
Practically, the enhanced performance of CrossViT on standard benchmarks suggests immediate applicability in real-world scenarios, particularly where precise image classification is critical. Moreover, the publicly released source code and models make it easy for the academic community to reproduce, extend, and refine the work.
Future developments in AI could explore extending the CrossViT framework to other vision tasks such as object detection, segmentation, and video analysis. Additionally, integrating domain adaptation techniques with multi-scale transformers may yield even greater performance gains across diverse datasets and application areas.
In summary, the CrossViT model presents a methodologically sound and practically effective enhancement to vision transformers. Its dual-branch architecture, coupled with a novel cross-attention mechanism, showcases a balanced approach to improving image classification accuracy and efficiency, contributing valuable insights to the ongoing evolution of transformer-based models in computer vision.