S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision (2108.01072v1)

Published 2 Aug 2021 in cs.CV

Abstract: Recently, MLP-based vision backbones have emerged. MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition compared with CNNs and vision Transformers. Among them, spatial-shift MLP (S$^2$-MLP), adopting a straightforward spatial-shift operation, achieves better performance than pioneering works such as MLP-Mixer and ResMLP. More recently, using smaller patches with a pyramid structure, Vision Permutator (ViP) and Global Filter Network (GFNet) achieve better performance than S$^2$-MLP. In this paper, we improve the S$^2$-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts. We conduct different spatial-shift operations on the split parts, then exploit the split-attention operation to fuse them. Moreover, like its counterparts, we adopt smaller-scale patches and a pyramid structure to boost image recognition accuracy. We term the improved spatial-shift MLP vision backbone S$^2$-MLPv2. Using 55M parameters, our medium-scale model, S$^2$-MLPv2-Medium, achieves an $83.6\%$ top-1 accuracy on the ImageNet-1K benchmark using $224\times 224$ images, without self-attention or external training data.
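
The abstract outlines the core block: expand the channels, split the expanded feature map into three parts, apply different spatial-shift operations to two of them, and fuse the parts with split attention. Below is a minimal PyTorch sketch of such a block, based only on this description; the 3x expansion ratio, the two shift-direction orders, the dim//4 bottleneck in the split-attention MLP, and the circular boundary handling via torch.roll are all assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch of an S^2-MLPv2-style block, inferred from the abstract.
# Hypothetical choices (not from the paper's code): 3x channel expansion,
# the shift-direction orders, the dim//4 split-attention bottleneck, and
# wrap-around shifting via torch.roll.
import torch
import torch.nn as nn


def spatial_shift(x: torch.Tensor, order) -> torch.Tensor:
    """Shift four equal channel groups of x (B, H, W, C) by one pixel each,
    in the directions listed in `order`. torch.roll wraps at the border,
    which differs slightly from a copy-based shift at image edges."""
    b, h, w, c = x.shape
    assert c % 4 == 0, "channels must split evenly into four groups"
    g = c // 4
    # direction -> (shift amount, tensor dim): dim 1 is H, dim 2 is W
    moves = {"down": (1, 1), "up": (-1, 1), "right": (1, 2), "left": (-1, 2)}
    parts = [
        torch.roll(x[..., i * g:(i + 1) * g], moves[d][0], dims=moves[d][1])
        for i, d in enumerate(order)
    ]
    return torch.cat(parts, dim=-1)


class S2MLPv2Block(nn.Module):
    """Expand channels 3x, split into three parts, spatially shift two of
    them in different direction orders, then fuse with split attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, dim * 3)          # channel expansion
        self.attn_mlp = nn.Sequential(                 # split-attention gate
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim * 3)
        )
        self.proj = nn.Linear(dim, dim)                # project back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, H, W, C)
        b, h, w, c = x.shape
        x = self.expand(x)                                          # (B,H,W,3C)
        x1 = spatial_shift(x[..., :c], ["down", "up", "right", "left"])
        x2 = spatial_shift(x[..., c:2 * c], ["right", "left", "down", "up"])
        x3 = x[..., 2 * c:]                                         # unshifted
        parts = torch.stack([x1, x2, x3], dim=1)                    # (B,3,H,W,C)
        # Split attention: pool a global descriptor, softmax over the three
        # parts per channel, and take the weighted sum of the parts.
        gap = parts.sum(dim=1).mean(dim=(1, 2))                     # (B, C)
        attn = self.attn_mlp(gap).view(b, 3, c).softmax(dim=1)      # (B, 3, C)
        out = (parts * attn[:, :, None, None, :]).sum(dim=1)        # (B,H,W,C)
        return self.proj(out)
```

A quick shape check with hypothetical sizes: `S2MLPv2Block(dim=64)` applied to a `(2, 14, 14, 64)` tensor returns a `(2, 14, 14, 64)` tensor, so the block can be stacked per pyramid stage as the abstract suggests.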

Authors (5)
  1. Tan Yu (17 papers)
  2. Xu Li (126 papers)
  3. Yunfeng Cai (27 papers)
  4. Mingming Sun (28 papers)
  5. Ping Li (421 papers)
Citations (48)
