
S$^2$-MLP: Spatial-Shift MLP Architecture for Vision (2106.07477v2)

Published 14 Jun 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet1K and ImageNet21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global reception field and spatial-specific configuration. But the global reception field and the spatial-specific property make token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$2$-MLP). Different from MLP-Mixer, our S$2$-MLP only contains channel-mixing MLP. We utilize a spatial-shift operation for communications between patches. It has a local reception field and is spatial-agnostic. It is parameter-free and efficient for computation. The proposed S$2$-MLP attains higher recognition accuracy than MLP-Mixer when training on ImageNet-1K dataset. Meanwhile, S$2$-MLP accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.

Overview of S$^2$-MLP: Spatial-Shift MLP Architecture for Vision

S$^2$-MLP presents a nuanced approach to employing pure multi-layer perceptron (MLP) architectures for vision tasks, introducing a spatial-shift operation to enable communication between non-overlapping patches within an image. The model addresses a limitation of prior MLP-based architectures such as MLP-Mixer, which struggle to match the performance of convolutional neural networks (CNNs) and Vision Transformers (ViTs) on medium-scale datasets like ImageNet-1K.

Key Contributions and Findings

The core innovation in S$^2$-MLP is its spatial-shift operation, which enables channel-wise interaction between adjacent patches and thereby removes the need for the token-mixing MLP, a component that has shown susceptibility to overfitting on medium-scale datasets. The spatial-shift operation is parameter-free and computationally efficient; because it has a local receptive field and is spatial-agnostic, it further reduces the risk of overfitting.
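The spatial-shift idea can be sketched in a few lines of NumPy. The four-way channel grouping follows the paper's description (each quarter of the channels is shifted one patch along one of the four directions); the exact border handling here — leaving the vacated border rows/columns at their original values — is an assumption of this sketch, not a claim about the reference implementation:

```python
import numpy as np

def spatial_shift(x):
    """Parameter-free spatial shift over a (H, W, C) feature map.

    Channels are split into 4 groups; each group is shifted by one
    patch along one direction (right, left, down, up). No weights
    are involved, so the operation adds zero parameters.
    """
    h, w, c = x.shape
    g = c // 4
    out = x.copy()
    # Group 0: shift right along the width axis.
    out[:, 1:, :g] = x[:, :w - 1, :g]
    # Group 1: shift left.
    out[:, :w - 1, g:2 * g] = x[:, 1:, g:2 * g]
    # Group 2: shift down along the height axis.
    out[1:, :, 2 * g:3 * g] = x[:h - 1, :, 2 * g:3 * g]
    # Group 3: shift up.
    out[:h - 1, :, 3 * g:4 * g] = x[1:, :, 3 * g:4 * g]
    return out
```

After this shift, a plain channel-mixing MLP at each patch already sees features from its four neighbors, which is how cross-patch communication is achieved without a token-mixing MLP.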

When evaluated on the ImageNet-1K dataset, S$^2$-MLP demonstrated higher recognition accuracy than MLP-Mixer and comparable performance to ViT, with a simpler architecture and reduced computational overhead, indicating its efficiency and practicality for real-world applications.

Implications and Future Directions

The S$^2$-MLP architecture, by reducing parameter count and computational complexity, represents a significant stride toward more efficient model configurations without sacrificing performance. Its parameter-free spatial-shift mechanism may serve as a foundational component in future MLP-based architectures, encouraging further exploration of efficient mechanisms for spatial information aggregation.

Moreover, the exploration of relationships between depthwise convolution, the spatial-shift operation, and token-mixing MLP offers intriguing insights into potential hybrid architectures that could capitalize on the strengths of each approach while mitigating their weaknesses. As AI research continues pushing toward optimizing model efficiency and accuracy, S$^2$-MLP's principles might inform novel model design strategies, especially for resource-constrained applications.

Theoretical Underpinnings and Practical Considerations

The paper highlights the equivalence of the spatial-shift operation to a depthwise convolution with fixed kernel weights, an observation that may guide theoretical advancements concerning spatially localized feature integration in neural networks. Additionally, the efficiency gains attributed to the local operation of spatial-shifts suggest practical avenues for deploying high-performance vision models on edge computing devices.
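This equivalence can be made concrete with a small sketch: a depthwise 3×3 convolution whose per-channel kernel is a fixed one-hot matrix simply selects one neighboring pixel, which is exactly a one-pixel shift (with zeros at the border under zero padding). The helper name `depthwise_conv3x3` and the naive loop implementation below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Naive 'same' depthwise 3x3 convolution with zero padding.

    x       : (H, W, C) feature map.
    kernels : (3, 3, C) array, one fixed kernel per channel.
    """
    h, w, c = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Accumulate each kernel tap over the whole map at once.
            out += xp[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out

# A one-hot kernel at position (1, 0) picks each pixel's left
# neighbor, i.e. it shifts every channel one pixel to the right.
shift_right = np.zeros((3, 3, 4))
shift_right[1, 0, :] = 1.0
```

Because such kernels are fixed and one-hot, they contribute no learnable parameters, which is the sense in which the spatial shift is a frozen, zero-cost special case of depthwise convolution.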

In spanning theoretical contributions and tangible improvements in architectural efficiency, S$^2$-MLP points toward a next generation of simpler, data-efficient vision architectures, addressing the complications faced by conventional MLP-based models and offering a robust path forward for MLP architectures in vision applications.

Authors (5)
  1. Tan Yu
  2. Xu Li
  3. Yunfeng Cai
  4. Mingming Sun
  5. Ping Li
Citations (173)