AS-MLP: An Axial Shifted MLP Architecture for Vision
The paper presents the Axial Shifted MLP (AS-MLP) architecture, an innovative approach to computer vision within the MLP-based architecture framework. Unlike the traditional MLP-Mixer, which models global spatial feature interactions through matrix transposition, AS-MLP emphasizes local feature interaction by axially shifting the channels of the feature map in the horizontal and vertical directions. This design captures local dependencies in a manner akin to convolutional networks, addressing a critical limitation of previous MLP-based architectures.
Key Contributions
- Axial Shifted MLP (AS-MLP): AS-MLP introduces an axial channel-shifting mechanism that combines horizontal and vertical shifts to gather features from different axial directions. This design gives AS-MLP local receptive fields comparable to those of CNN architectures while retaining the simplicity of an MLP-based design.
- Performance: AS-MLP achieves a Top-1 accuracy of 83.3% on ImageNet-1K with 88 million parameters and 15.2 GFLOPs. This outperforms all contemporaneous MLP-based architectures and rivals transformer-based architectures such as the Swin Transformer, indicating AS-MLP's efficacy in capturing local context efficiently.
- Downstream Tasks: AS-MLP is notably the first MLP-based architecture to be applied successfully to downstream vision tasks such as object detection and semantic segmentation, setting strong baselines in mAP on COCO and multi-scale (MS) mIoU on ADE20K.
Detailed Examination
Architecture Innovations: The essence of the AS-MLP architecture lies in its axial shifts: the channels of the feature map are split into groups, and each group is shifted by a different offset along the horizontal or vertical spatial axis before channel mixing. This integrates local feature interactions and achieves the desired locality without the fixed window partitions typical of transformers, such as the Swin Transformer's window-based attention.
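The mechanism is compact enough to sketch directly. The following PyTorch-style snippet is a minimal illustration, not the authors' released implementation; the function name, default shift size, and zero-padding scheme are assumptions made for clarity:

```python
import torch
import torch.nn.functional as F


def axial_shift(x, shift_size=5, dim=2):
    """Shift channel groups of a feature map along one spatial axis.

    A minimal sketch of the axial shift idea: channels are split into
    `shift_size` groups and each group is displaced by a different offset
    along the chosen axis (dim=2 vertical, dim=3 horizontal), with zero
    padding at the borders.
    """
    B, C, H, W = x.shape
    pad = shift_size // 2
    # Zero-pad the shifted axis so displaced groups keep valid borders.
    pad_spec = (0, 0, pad, pad) if dim == 2 else (pad, pad, 0, 0)
    x = F.pad(x, pad_spec)
    # Split channels into groups and shift each group by a different offset.
    groups = torch.chunk(x, shift_size, dim=1)
    shifted = [torch.roll(g, offset, dims=dim)
               for g, offset in zip(groups, range(-pad, pad + 1))]
    x = torch.cat(shifted, dim=1)
    # Crop back to the original spatial size.
    return x[:, :, pad:pad + H, :] if dim == 2 else x[..., pad:pad + W]
```

Shifting the channel groups horizontally and vertically and then mixing channels with a pointwise MLP gives every spatial position access to a small cross-shaped neighborhood, which is how AS-MLP emulates a local receptive field without convolutions or attention.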
Technical Implementation: Each AS-MLP block consists of normalization layers, axial shift operations, MLP layers, and residual connections. By using a parallel connection in the AS-MLP block, features from the horizontal and vertical shifts are processed jointly to enhance local dependency capture, a noteworthy progression over the MLP-Mixer's global token-mixing MLP.
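A minimal sketch of such a block, reusing the axial_shift helper above, might look as follows; the 1x1 channel projections and the MLP expansion ratio are illustrative assumptions rather than the paper's exact configuration:

```python
import torch.nn as nn


class ASMLPBlock(nn.Module):
    """Sketch of an AS-MLP-style block: LayerNorm, parallel horizontal and
    vertical axial-shift branches, a channel MLP, and residual connections.
    Uses the `axial_shift` helper defined above; layer shapes are illustrative.
    """

    def __init__(self, dim, shift_size=5, mlp_ratio=4):
        super().__init__()
        self.shift_size = shift_size
        self.norm1 = nn.LayerNorm(dim)
        # Pointwise channel projections applied after each axial shift.
        self.proj_h = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_v = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                      # x: (B, H, W, C)
        shortcut = x
        x = self.norm1(x).permute(0, 3, 1, 2)  # -> (B, C, H, W)
        h = self.proj_h(axial_shift(x, self.shift_size, dim=3))  # horizontal branch
        v = self.proj_v(axial_shift(x, self.shift_size, dim=2))  # vertical branch
        x = (h + v).permute(0, 2, 3, 1)        # parallel connection: sum the branches
        x = shortcut + x                       # residual around the axial-shift stage
        x = x + self.mlp(self.norm2(x))        # residual around the channel MLP
        return x
```

As a usage example, ASMLPBlock(dim=96) applied to a tensor of shape (1, 56, 56, 96) returns a tensor of the same shape, so blocks can be stacked without changing the spatial resolution.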
Comparative Analysis: AS-MLP's design contrasts with CNNs, transformers, and the MLP-Mixer. Instead of convolutions, AS-MLP combines axial shifts with channel mixing to emulate local receptive fields without convolution layers. Against transformers, AS-MLP avoids fixed-window partitions, improving adaptability and efficiency. It also avoids the MLP-Mixer's dependence on a fixed spatial resolution, which makes it applicable to the varying input sizes encountered in downstream tasks.
Empirical Results: The experimental results highlight AS-MLP's computational efficiency and adaptability. Its transferability to tasks like object detection and semantic segmentation underlines its practical utility, evidenced by competitive results against transformer-based methods on standard benchmarks. Moreover, in mobile settings, AS-MLP outperforms Swin Transformer variants in accuracy with fewer parameters.
Practical and Theoretical Implications
Practical Implications: The ability to model local dependencies efficiently matters for a range of practical applications, from real-time image processing on mobile devices to other resource-constrained environments where MLP-based architectures may offer significant advantages over traditional CNNs and emerging transformers.
Theoretical Implications: The results contribute to ongoing discussions about the generalization capabilities of MLPs in vision tasks, challenging existing paradigms centered on convolutional and attention mechanisms. AS-MLP also suggests new directions in MLP-based architecture design, particularly in how local and global feature interactions can be balanced and optimized.
Future Directions
The application of AS-MLP across an array of vision tasks invites further exploration, including extending axial shift strategies to other modalities such as language, and pursuing further optimizations in computational and energy efficiency. The architectural shift AS-MLP represents could also inspire broader designs that bridge MLP-based approaches with insights from transformers and CNNs.
In conclusion, AS-MLP marks a significant step in leveraging axial spatial shifts within MLP frameworks, harnessing local feature interactions efficiently. Its potential extends beyond its current scope, inviting further research and development in the evolving landscape of AI architectures.