
MixFormer: Mixing Features across Windows and Dimensions

Published 6 Apr 2022 in cs.CV (arXiv:2204.02557v2)

Abstract: While local-window self-attention performs notably in vision tasks, it suffers from limited receptive field and weak modeling capability issues. This is mainly because it performs self-attention within non-overlapped windows and shares weights on the channel dimension. We propose MixFormer to find a solution. First, we combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive fields. Second, we propose bi-directional interactions across branches to provide complementary clues in the channel and spatial dimensions. These two designs are integrated to achieve efficient feature mixing among windows and dimensions. Our MixFormer provides competitive results on image classification with EfficientNet and shows better results than RegNet and Swin Transformer. Performance in downstream tasks outperforms its alternatives by significant margins with less computational costs in 5 dense prediction tasks on MS COCO, ADE20k, and LVIS. Code is available at \url{https://github.com/PaddlePaddle/PaddleClas}.

Citations (89)

Summary

  • The paper presents a parallel design that integrates local-window self-attention with depth-wise convolution, expanding the receptive field for improved feature integration.
  • It implements bi-directional channel and spatial interactions to overcome limitations in traditional attention mechanisms.
  • Experimental results demonstrate enhanced accuracy and efficiency on ImageNet, MS COCO, and ADE20k across diverse vision tasks.

Overview of MixFormer: Mixing Features Across Windows and Dimensions

This paper presents MixFormer, an architecture designed to improve the efficiency and performance of Vision Transformers (ViTs) on vision tasks. Traditional local-window self-attention, while effective, is limited by non-overlapping windows and weights shared across the channel dimension, leading to restricted receptive fields and weakened modeling capability. MixFormer addresses these challenges through two key innovations: a parallel design incorporating depth-wise convolution, and bi-directional interactions between the two branches.

Key Contributions

  1. Parallel Design: MixFormer combines local-window self-attention with depth-wise convolution in parallel, effectively expanding the receptive field and enabling richer feature mixing. This allows intra-window and cross-window relations to be modeled simultaneously, proving more effective than prior sequential designs.
  2. Bi-directional Interactions: The paper introduces channel and spatial interactions to overcome the weak modeling capacity in the channel and spatial dimensions, respectively. This dual-path strategy enhances the representation ability by providing complementary information between local-window self-attention and depth-wise convolution.
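The two contributions above can be sketched in a single block. The following PyTorch module is a simplified, illustrative reconstruction, not the authors' implementation: names such as `MixingBlock`, `channel_gate`, and `spatial_gate` are assumptions, and the exact placement of the interactions (e.g. which tensors the gates modulate) is hedged here rather than taken from the released code.

```python
# Hedged sketch of a MixFormer-style mixing block. All module and method
# names are illustrative; see the official PaddleClas repo for the real code.
import torch
import torch.nn as nn


class MixingBlock(nn.Module):
    """Parallel local-window self-attention + depth-wise convolution,
    with bi-directional channel/spatial interactions (simplified)."""

    def __init__(self, dim: int, window_size: int = 7, num_heads: int = 4):
        super().__init__()
        self.window_size = window_size
        # Branch 1: local-window self-attention (intra-window relations).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: depth-wise conv (cross-window, per-channel local mixing).
        self.dwconv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Channel interaction: conv branch produces a per-channel gate
        # that re-weights the attention branch (assumed form).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid(),
        )
        # Spatial interaction: attention branch produces a per-pixel gate
        # that re-weights the conv branch (assumed form).
        self.spatial_gate = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, 1)  # fuse the two branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W divisible by window_size.
        B, C, H, W = x.shape
        conv_out = self.dwconv(x)

        # Partition into non-overlapping windows, attend within each window.
        ws = self.window_size
        windows = (x.view(B, C, H // ws, ws, W // ws, ws)
                     .permute(0, 2, 4, 3, 5, 1)
                     .reshape(-1, ws * ws, C))          # (B*nW, ws*ws, C)
        attn_out, _ = self.attn(windows, windows, windows)
        attn_out = (attn_out.view(B, H // ws, W // ws, ws, ws, C)
                            .permute(0, 5, 1, 3, 2, 4)
                            .reshape(B, C, H, W))

        # Bi-directional interactions: each branch supplies the clue the
        # other lacks (channel clue for attention, spatial clue for conv).
        attn_out = attn_out * self.channel_gate(conv_out)
        conv_out = conv_out * self.spatial_gate(attn_out)

        return self.proj(torch.cat([attn_out, conv_out], dim=1))
```

A shape-preserving forward pass, e.g. `MixingBlock(32, window_size=4)(torch.randn(2, 32, 16, 16))`, returns a `(2, 32, 16, 16)` tensor, so the block can drop into a stage of a hierarchical backbone in place of a plain window-attention block.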

Experimental Evidence

MixFormer demonstrates substantial improvements across various tasks:

  • Image Classification: On the ImageNet-1K dataset, MixFormer achieves competitive accuracy, matching EfficientNet’s performance and outperforming RegNet and Swin Transformer, with a notable reduction in computational costs.
  • Dense Prediction Tasks: The model is evaluated on five tasks spanning MS COCO, ADE20k, and LVIS, consistently surpassing alternatives such as Swin Transformer by significant margins while maintaining computational efficiency. For instance, on MS COCO with Mask R-CNN, MixFormer-B4 surpasses Swin-T by 2.9 points in box mAP and 2.1 points in mask mAP.

Theoretical and Practical Implications

The integration of parallel feature processing and enhanced interaction across branches not only addresses the inherent limitations of local-window attention but also sets a precedent for designing efficient networks with improved representational capacity. This methodology has shown promise beyond image classification, proving effective in semantic segmentation and instance segmentation tasks, indicating its general applicability.

Future Directions

Future exploration could investigate the application of these concepts to global self-attention models, potentially bringing similar benefits. Additionally, automated architecture search techniques like NAS could further optimize MixFormer’s design, potentially uncovering configurations that maximize performance across specific tasks or computational constraints.

Overall, MixFormer introduces a compelling framework for building more efficient and capable vision transformers, with potential implications across a variety of AI applications.
