An Overview of Scale-Aware Modulation Transformer (SMT)
The paper "Scale-Aware Modulation Meet Transformer" introduces a novel approach in computer vision, specifically focusing on the architecture and efficacy of Vision Transformers (ViT). The authors propose the Scale-Aware Modulation Transformer (SMT), a state-of-the-art model that integrates the strengths of convolutional neural networks (CNNs) and transformers to address several computer vision tasks more effectively and efficiently.
Key Features of SMT
The key contribution of SMT lies in its use of convolutional modulation within a transformer framework. Two novel modules form the core of SMT's architecture: Multi-Head Mixed Convolution (MHMC) and Scale-Aware Aggregation (SAA); a minimal code sketch of both follows the list below.
- Multi-Head Mixed Convolution (MHMC): This module enlarges the receptive field and captures multi-scale features by splitting channels into heads and applying convolutions with different kernel sizes to each head. Because this convolutional modulation stands in for self-attention, it avoids the quadratic complexity that conventional self-attention incurs with respect to image resolution.
- Scale-Aware Aggregation (SAA): SAA is highlighted for its lightweight yet effective architecture, which supports the efficient fusion of information across different heads, thereby enhancing convolutional modulation with minimal computational overhead.
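To make the modulation design concrete, the following is a minimal PyTorch-style sketch of how MHMC and SAA can fit together: channels are split into heads, each head is processed by a depthwise convolution with a different kernel size, the heads are fused by lightweight pointwise convolutions, and the result modulates a linear value projection. The kernel sizes, head count, and aggregation layout here are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of scale-aware modulation (MHMC + SAA); details are assumptions.
import torch
import torch.nn as nn

class MultiHeadMixedConv(nn.Module):
    """Split channels into heads; each head uses a depthwise convolution with a
    different kernel size, so each head sees a different receptive field."""
    def __init__(self, dim, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert dim % len(kernel_sizes) == 0
        self.head_dim = dim // len(kernel_sizes)
        self.convs = nn.ModuleList([
            nn.Conv2d(self.head_dim, self.head_dim, k, padding=k // 2,
                      groups=self.head_dim)  # depthwise conv per head
            for k in kernel_sizes
        ])

    def forward(self, x):  # x: (B, C, H, W)
        heads = torch.split(x, self.head_dim, dim=1)
        return torch.cat([conv(h) for conv, h in zip(self.convs, heads)], dim=1)

class ScaleAwareAggregation(nn.Module):
    """Lightweight fusion of the per-head multi-scale features using
    pointwise (1x1) convolutions."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fuse = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return self.fuse(x)

class ScaleAwareModulation(nn.Module):
    """Modulate a linear value projection with the aggregated multi-scale map."""
    def __init__(self, dim):
        super().__init__()
        self.v = nn.Conv2d(dim, dim, 1)
        self.mhmc = MultiHeadMixedConv(dim)
        self.saa = ScaleAwareAggregation(dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return self.proj(self.saa(self.mhmc(x)) * self.v(x))

x = torch.randn(1, 64, 56, 56)
print(ScaleAwareModulation(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```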
In addition to these modules, the authors introduce an Evolutionary Hybrid Network (EHN) structure. EHN is crafted to simulate a gradual transition from local dependencies at shallower network levels to global dependencies at deeper levels. This architecture allows for significant computational efficiency while maintaining robust performance across various downstream visual tasks.
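As a rough illustration of the evolutionary hybrid idea, the sketch below builds a per-stage block schedule in which shallow stages use only modulation blocks, a deeper stage mixes in self-attention blocks toward its end, and the final stage is fully attention-based. The stage depths and mixing ratios are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of an evolutionary hybrid stacking pattern (local -> global).
def build_stage(depth, attention_ratio):
    """Return a list of block names; attention blocks are placed toward the
    end of the stage so local modelling precedes global modelling."""
    n_attn = round(depth * attention_ratio)
    return ["ModulationBlock"] * (depth - n_attn) + ["AttentionBlock"] * n_attn

# Illustrative 4-stage layout (depths and ratios are assumptions):
stages = [
    build_stage(depth=2, attention_ratio=0.0),   # stage 1: purely local
    build_stage(depth=2, attention_ratio=0.0),   # stage 2: purely local
    build_stage(depth=8, attention_ratio=0.5),   # stage 3: local-to-global mix
    build_stage(depth=2, attention_ratio=1.0),   # stage 4: purely global
]
for i, s in enumerate(stages, 1):
    print(f"stage {i}: {s}")
```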
Numerical Results and Performance
SMT is demonstrated to outperform current state-of-the-art models across several visual benchmarks. On ImageNet-1K, SMT achieves top-1 accuracies of 82.2% and 84.3% with models of 11.5M parameters / 2.4 GFLOPs and 32M parameters / 7.7 GFLOPs, respectively. After pretraining on the larger ImageNet-22K dataset and fine-tuning, accuracy reaches as high as 88.1%. SMT also consistently surpasses the Swin Transformer on downstream tasks, improving object detection on COCO with Mask R-CNN by 4.2 mAP and semantic segmentation on ADE20K with UPerNet by 2.0 mIoU.
Implications and Future Work
The implications of this research span both theoretical understanding and practical application. SMT reduces computational cost while improving accuracy on visual recognition tasks, and the modulation techniques established by the MHMC and SAA modules could also inspire hybrid models in domains beyond computer vision.
Future work could explore further scaling and adaptation of SMT for more diverse datasets and complex tasks, potentially pushing the boundaries of efficiency and performance in AI models. Investigating the applicability of these scale-aware techniques in other AI-driven domains such as natural language processing and robotics may uncover additional avenues for enhancement and crossover innovations.
In sum, SMT represents a significant step forward in the development of hybrid architectures combining deep learning's convolutional capabilities with the attention mechanisms of transformers, promising improvements in both computational efficiency and task performance in the field of computer vision.