AS-MLP: An Axial Shifted MLP Architecture for Vision
The paper presents the Axial Shifted MLP (AS-MLP) architecture, an innovative approach to computer vision within the MLP-based architecture framework. Unlike the traditional MLP-Mixer, which models global spatial feature interactions through matrix transposition, AS-MLP emphasizes local feature interaction by axially shifting the channels of the feature map in the horizontal and vertical directions. This design captures local dependencies in a manner akin to convolutional networks, addressing a critical limitation of previous MLP-based architectures.
Key Contributions
- Axial Shifted MLP (AS-MLP): AS-MLP introduces an axial channel-shifting mechanism that combines horizontal and vertical shifts to gather features from different axial directions. This design gives AS-MLP local receptive fields comparable to those of CNN architectures while retaining the simplicity of an MLP-based design.
- Performance: AS-MLP achieves a Top-1 accuracy of 83.3% on ImageNet-1K with 88 million parameters and 15.2 GFLOPs. This outperforms all contemporaneous MLP-based architectures and rivals transformer-based architectures such as the Swin Transformer, indicating AS-MLP's efficacy in capturing local context efficiently.
- Downstream Tasks: AS-MLP is notably the first MLP-based architecture to be applied successfully to downstream vision tasks such as object detection and semantic segmentation, setting strong baselines in mAP on COCO and multi-scale (MS) mIoU on ADE20K.
Detailed Examination
Architecture Innovations: The essence of the AS-MLP architecture lies in its axial shifts: the channels of the feature map are split into groups, and each group is shifted by a different offset along the horizontal or vertical spatial axis before channel mixing. This integrates local feature interactions and achieves the desired locality without the fixed window partitions typical of transformers, such as the Swin Transformer's window-based attention.
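The mechanism is compact enough to sketch directly. The following PyTorch-style snippet is a minimal illustration, not the authors' released implementation; the function name, default shift size, and zero-padding scheme are assumptions made for clarity:

```python
import torch
import torch.nn.functional as F


def axial_shift(x, shift_size=5, dim=2):
    """Shift channel groups of a feature map along one spatial axis.

    A minimal sketch of the axial shift idea: channels are split into
    `shift_size` groups and each group is displaced by a different offset
    along the chosen axis (dim=2 vertical, dim=3 horizontal), with zero
    padding at the borders.
    """
    B, C, H, W = x.shape
    pad = shift_size // 2
    # Zero-pad the shifted axis so displaced groups keep valid borders.
    pad_spec = (0, 0, pad, pad) if dim == 2 else (pad, pad, 0, 0)
    x = F.pad(x, pad_spec)
    # Split channels into groups and shift each group by a different offset.
    groups = torch.chunk(x, shift_size, dim=1)
    shifted = [torch.roll(g, offset, dims=dim)
               for g, offset in zip(groups, range(-pad, pad + 1))]
    x = torch.cat(shifted, dim=1)
    # Crop back to the original spatial size.
    return x[:, :, pad:pad + H, :] if dim == 2 else x[..., pad:pad + W]
```

Shifting the channel groups horizontally and vertically and then mixing channels with a pointwise MLP gives every spatial position access to a small cross-shaped neighborhood, which is how AS-MLP emulates a local receptive field without convolutions or attention.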
Technical Implementation: Each AS-MLP block consists of normalization layers, axial shift operations, MLP layers, and residual connections. By using a parallel connection in the AS-MLP block, features from the horizontal and vertical shifts are processed jointly to enhance local dependency capture, a noteworthy progression over the MLP-Mixer's global token-mixing MLP.
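A minimal sketch of such a block, reusing the axial_shift helper above, might look as follows; the 1x1 channel projections and the MLP expansion ratio are illustrative assumptions rather than the paper's exact configuration:

```python
import torch.nn as nn


class ASMLPBlock(nn.Module):
    """Sketch of an AS-MLP-style block: LayerNorm, parallel horizontal and
    vertical axial-shift branches, a channel MLP, and residual connections.
    Uses the `axial_shift` helper defined above; layer shapes are illustrative.
    """

    def __init__(self, dim, shift_size=5, mlp_ratio=4):
        super().__init__()
        self.shift_size = shift_size
        self.norm1 = nn.LayerNorm(dim)
        # Pointwise channel projections applied after each axial shift.
        self.proj_h = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_v = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                      # x: (B, H, W, C)
        shortcut = x
        x = self.norm1(x).permute(0, 3, 1, 2)  # -> (B, C, H, W)
        h = self.proj_h(axial_shift(x, self.shift_size, dim=3))  # horizontal branch
        v = self.proj_v(axial_shift(x, self.shift_size, dim=2))  # vertical branch
        x = (h + v).permute(0, 2, 3, 1)        # parallel connection: sum the branches
        x = shortcut + x                       # residual around the axial-shift stage
        x = x + self.mlp(self.norm2(x))        # residual around the channel MLP
        return x
```

As a usage example, ASMLPBlock(dim=96) applied to a tensor of shape (1, 56, 56, 96) returns a tensor of the same shape, so blocks can be stacked without changing the spatial resolution.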
Comparative Analysis: AS-MLP's design contrasts with CNNs, transformers, and the MLP-Mixer. Instead of convolutions, AS-MLP combines axial shifts with channel mixing to emulate local receptive fields without convolution layers. Against transformers, AS-MLP avoids fixed-window partitions, improving adaptability and efficiency. It also avoids the MLP-Mixer's dependence on a fixed spatial resolution, which makes it applicable to the varying input sizes encountered in downstream tasks.
Empirical Results: The experimental results highlight AS-MLP's computational efficiency and adaptability. Its transferability to tasks like object detection and semantic segmentation underlines its practical utility, evidenced by competitive results against transformer-based methods on standard benchmarks. Moreover, in mobile settings, AS-MLP outperforms Swin Transformer variants in accuracy with fewer parameters.
Practical and Theoretical Implications
Practical Implications: The ability to model local dependencies efficiently matters for a range of practical applications, from real-time image processing on mobile devices to other resource-constrained environments where MLP-based architectures may offer significant advantages over traditional CNNs and emerging transformers.
Theoretical Implications: The results contribute to ongoing discussions about the generalization capabilities of MLPs in vision tasks, challenging existing paradigms centered on convolutional and attention mechanisms. AS-MLP also suggests new directions in MLP-based architecture design, particularly in how local and global feature interactions can be balanced and optimized.
Future Directions
The application of AS-MLP across an array of vision tasks invites further exploration, including extending axial shift strategies to other modalities such as language, and pursuing further optimizations in computational and energy efficiency. The architectural shift AS-MLP represents could also inspire broader designs that bridge MLP-based approaches with insights from transformers and CNNs.
In conclusion, AS-MLP marks a significant step in leveraging axial spatial shifts within MLP frameworks, harnessing local feature interactions efficiently. Its potential extends beyond its current scope, inviting further research and development in the evolving landscape of AI architectures.