Overview of MixFormer: Mixing Features Across Windows and Dimensions
This paper presents MixFormer, an efficient Vision Transformer (ViT) design built on local-window self-attention. While effective, local-window self-attention has two weaknesses: non-overlapping windows restrict the receptive field, and weights shared across the channel dimension limit channel-wise modeling capacity. MixFormer addresses both through two key innovations: a parallel design that pairs window attention with depth-wise convolution, and bi-directional interactions between the two branches.
Key Contributions
- Parallel Design: MixFormer runs local-window self-attention and depth-wise convolution in parallel rather than sequentially, modeling intra-window and cross-window relations simultaneously. This expands the effective receptive field and strengthens feature mixing, and proves more effective than earlier sequential combinations (see the first sketch after this list).
- Bi-directional Interactions: Channel and spatial interactions are introduced between the two branches to compensate for each one's weak dimension: the convolution branch, whose weights differ per channel, supplies channel-wise cues to the attention branch, while the attention branch, which is dynamic over spatial positions, supplies spatial cues to the convolution branch. Each path thereby receives complementary information (see the second sketch).
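To make the parallel design concrete, here is a minimal PyTorch sketch of a block that runs window attention and depth-wise convolution side by side on the same input and fuses their outputs. The class name, the concatenation-based fusion, and hyper-parameters such as the window size and head count are illustrative assumptions; the paper's actual Mixing Block differs in details such as how channels are allocated between the branches.

```python
import torch
import torch.nn as nn

class ParallelMixingBlock(nn.Module):
    """Illustrative sketch (not the authors' code): local-window
    self-attention and depth-wise convolution run in parallel, then
    their outputs are fused by a pointwise projection."""
    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.window_size = window_size
        # Branch 1: window-based multi-head self-attention (intra-window relations).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: depth-wise 3x3 convolution (local mixing that crosses window borders).
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Fusion: concatenate both branches, then project back to `dim` channels.
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W); H and W divisible by window_size
        B, C, H, W = x.shape
        ws = self.window_size
        # Attention branch: partition the feature map into non-overlapping windows.
        win = x.view(B, C, H // ws, ws, W // ws, ws)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        attn_out, _ = self.attn(win, win, win)          # attention within each window
        attn_out = attn_out.reshape(B, H // ws, W // ws, ws, ws, C)
        attn_out = attn_out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        # Convolution branch: applied to the same input, in parallel.
        conv_out = self.dwconv(x)
        # Fused output mixes intra-window and cross-window information.
        return self.proj(torch.cat([attn_out, conv_out], dim=1))
```

Because the convolution sees the original features rather than the attention branch's output, the two branches contribute independent views that the fusion step combines; this is the intuition behind the parallel design outperforming sequential stacking.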
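The bi-directional interactions can likewise be sketched as two lightweight gates exchanged between the branches: a squeeze-and-excitation-style channel gate computed from the convolution branch, and a spatial gate computed from the attention branch. The reduction ratio, activation choices, and where exactly the gates are applied are assumptions made for illustration.

```python
import torch.nn as nn

class BiDirectionalInteraction(nn.Module):
    """Illustrative sketch of the two interaction paths. Exact layer
    sizes and gate placement are assumptions, not the paper's code."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        # Channel interaction (conv branch -> attention branch): depth-wise
        # convolution has distinct per-channel weights, so it supplies the
        # channel-wise cues that weight-shared window attention lacks.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, dim, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial interaction (attention branch -> conv branch): attention is
        # dynamic over spatial positions, so it supplies the per-location cues
        # that a spatially weight-shared depth-wise kernel lacks.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, attn_feat, conv_feat):  # both (B, C, H, W)
        c_gate = self.channel_gate(conv_feat)   # (B, C, 1, 1) channel weights
        s_gate = self.spatial_gate(attn_feat)   # (B, 1, H, W) spatial mask
        return attn_feat * c_gate, conv_feat * s_gate
```

Each gate flows in only one direction, from the branch that is strong in a given dimension to the branch that is weak in it, which is what makes the interactions complementary rather than redundant.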
Experimental Evidence
MixFormer demonstrates substantial improvements across various tasks:
- Image Classification: On ImageNet-1K, MixFormer achieves accuracy on par with EfficientNet while outperforming RegNet and Swin Transformer, at notably lower computational cost.
- Dense Prediction Tasks: The model is evaluated on five tasks, including MS COCO and ADE20k, consistently surpassing alternatives such as Swin Transformer by significant margins while maintaining computational efficiency. For instance, on MS COCO with Mask R-CNN, MixFormer-B4 surpasses Swin-T by 2.9 in box mAP and 2.1 in mask mAP.
Theoretical and Practical Implications
By processing features in parallel branches and letting those branches exchange information, MixFormer addresses the inherent limitations of local-window attention and offers a template for designing efficient networks with stronger representational capacity. The approach generalizes beyond image classification, proving effective in semantic and instance segmentation, which indicates broad applicability.
Future Directions
Future work could apply these ideas to global self-attention models, potentially bringing similar benefits. Additionally, neural architecture search (NAS) could further optimize MixFormer's design, uncovering configurations that maximize performance under specific tasks or computational budgets.
Overall, MixFormer introduces a compelling framework for building more efficient and capable vision transformers, with potential implications across a variety of AI applications.