An In-Depth Review of MaxViT: Multi-Axis Vision Transformer
The paper presents "MaxViT," a novel architecture designed to enhance Vision Transformers (ViTs) by addressing the poor scalability of full self-attention when applied to large, high-resolution images. The research introduces a multi-axis attention mechanism that combines blocked local and dilated global attention, enabling both local and global spatial interactions at linear complexity.
Core Contributions
Multi-Axis Attention: The authors introduce a scalable attention mechanism that permits both local and global spatial interactions, which full self-attention has made computationally prohibitive for vision Transformers at high resolutions. The proposed multi-axis attention decomposes attention into blocked local and dilated global components and achieves these interactions with complexity linear in the number of tokens (a rough cost comparison follows these items).
MaxViT Architecture: By interleaving the multi-axis attention with convolutions, the authors build a hierarchical vision backbone, MaxViT. The architecture retains a global receptive field throughout the entire network, including the earlier, high-resolution stages where full self-attention would be prohibitively expensive.
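To make the linear-complexity claim concrete, here is a back-of-the-envelope comparison, a sketch rather than anything from the paper's code, counting the query-key pairs scored per layer by full self-attention versus the two fixed-size multi-axis partitions. The partition size P = 7 follows the paper's default window/grid size; the image sizes are merely illustrative.

```python
# Back-of-the-envelope cost comparison: query-key pairs scored by full
# self-attention versus MaxViT-style block + grid attention with a fixed
# partition size P (assumed P = 7, the paper's default).
P = 7  # fixed window / grid size

def full_attention_pairs(h, w):
    n = h * w
    return n * n                      # every token attends to every token: O(N^2)

def multi_axis_pairs(h, w, p=P):
    n = h * w
    # Each token attends to p*p tokens in its local window (block attention)
    # and to p*p tokens on its dilated grid (grid attention): O(N * p^2),
    # i.e. linear in the number of tokens N.
    return 2 * n * p * p

for side in (28, 56, 112, 224):
    full = full_attention_pairs(side, side)
    axis = multi_axis_pairs(side, side)
    print(f"{side:>3}x{side:<3}  full: {full:>16,}   multi-axis: {axis:>12,}")
```

As the feature map grows, the full-attention count explodes quadratically while the multi-axis count grows in step with the number of tokens, which is why the mechanism remains usable in early, high-resolution stages.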
Numerical Highlights
Image Classification: MaxViT models deliver state-of-the-art performance across multiple settings. For instance, without extra data, MaxViT achieves 86.5% top-1 accuracy on ImageNet-1K, and with ImageNet-21K pre-training, it reaches 88.7%.
Object Detection & Image Generation: Used as a backbone, MaxViT also performs strongly in object detection and image aesthetics assessment, comparing favorably with existing models on these tasks. Furthermore, it demonstrates strong generative modeling capability on ImageNet, showcasing its versatility as a general-purpose vision module.
Architectural and Training Details
Each MaxViT block stacks an MBConv layer with sequential block (local window) and grid (sparse global) attention, so the network benefits from convolutional inductive biases while handling short- and long-range spatial dependencies efficiently. The careful integration of these components lets MaxViT maintain competitive accuracy-to-computation and accuracy-to-parameter trade-offs against existing models such as CoAtNet and the Swin Transformer.
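The sketch below illustrates that ordering (MBConv, then block attention, then grid attention) in simplified PyTorch with einops for the partitioning. It is my own minimal reconstruction, not the authors' implementation: the squeeze-excitation step inside MBConv, relative position biases, stochastic depth, and downsampling are omitted, and all layer sizes and helper names are assumptions for illustration.

```python
# Minimal sketch of one MaxViT-style block (assumed simplification, not the
# reference implementation): MBConv -> block (window) attention -> grid attention.
import torch
import torch.nn as nn
from einops import rearrange

class MBConv(nn.Module):
    """Inverted-bottleneck convolution: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )
    def forward(self, x):                 # x: (B, C, H, W)
        return x + self.net(x)            # residual connection

class PartitionedAttention(nn.Module):
    """Self-attention within p x p local windows (block) or a p x p dilated grid."""
    def __init__(self, dim, heads=4, p=7, grid=False):
        super().__init__()
        self.p, self.grid = p, grid
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):                 # x: (B, H, W, C); H, W divisible by p
        B, H, W, C = x.shape
        p = self.p
        if self.grid:                      # dilated grid: tokens spaced H//p apart
            t = rearrange(x, 'b (g1 h) (g2 w) c -> (b h w) (g1 g2) c', g1=p, g2=p)
        else:                              # local block: contiguous p x p windows
            t = rearrange(x, 'b (h p1) (w p2) c -> (b h w) (p1 p2) c', p1=p, p2=p)
        out, _ = self.attn(self.norm(t), self.norm(t), self.norm(t))
        if self.grid:
            out = rearrange(out, '(b h w) (g1 g2) c -> b (g1 h) (g2 w) c',
                            b=B, h=H // p, w=W // p, g1=p, g2=p)
        else:
            out = rearrange(out, '(b h w) (p1 p2) c -> b (h p1) (w p2) c',
                            b=B, h=H // p, w=W // p, p1=p, p2=p)
        return x + out                    # residual connection

class MaxViTBlock(nn.Module):
    """MBConv followed by block attention and grid attention, in that order."""
    def __init__(self, dim, p=7):
        super().__init__()
        self.mbconv = MBConv(dim)
        self.block_attn = PartitionedAttention(dim, p=p, grid=False)
        self.grid_attn = PartitionedAttention(dim, p=p, grid=True)
    def forward(self, x):                 # x: (B, C, H, W)
        x = self.mbconv(x)
        x = x.permute(0, 2, 3, 1)          # channels-last for attention
        x = self.grid_attn(self.block_attn(x))
        return x.permute(0, 3, 1, 2)       # back to channels-first

# Usage example: a 56x56 feature map is divisible by the 7x7 partition size.
block = MaxViTBlock(dim=64)
y = block(torch.randn(1, 64, 56, 56))     # -> (1, 64, 56, 56)
```

The key design point the sketch preserves is that both attention steps operate on fixed-size token groups regardless of feature-map size, which is what keeps the cost linear while the grid step still mixes information across the whole image.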
Implications and Speculation
Theoretical Implications: The MaxViT approach challenges the existing paradigms of balancing local and global attention in transformer architectures. Its linear complexity offers a promising direction for further exploration, potentially influencing designs in both vision and language processing domains.
Practical Applications: By blending convolutions with efficient attention mechanisms, MaxViT simplifies the deployment of Vision Transformers in tasks demanding high-resolution and real-time processing, thereby broadening their practical applicability.
Future Directions: This research hints at promising future extensions, such as the adaptation of the multi-axis approach beyond standard 2D vision tasks, for instance in video processing or multi-modal data integration. Additionally, the integration of sparse global attention opens avenues for exploring more computationally efficient architectures in resource-constrained environments.
In conclusion, MaxViT presents a significant advancement in Vision Transformer design, providing a versatile, scalable, and efficient solution for complex visual processing tasks. Its architectural innovations and impressive performance across varied benchmarks reinforce its potential as a foundational model in vision research.