An Analysis of Sparse MLP for Image Recognition
The paper "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?" explores the burgeoning field of attention-free neural networks for image recognition, specifically scrutinizing the necessity of self-attention mechanisms, as used in transformers, for achieving high accuracy in visual tasks. The authors present an alternative architecture named sMLPNet, which employs a novel sparse MLP (sMLP) for efficient token-mixing in image data processing.
Architecture and Methodology
In sMLPNet, the traditional self-attention module is replaced with a sparse MLP (sMLP) module designed to capture global dependencies while significantly reducing parameter count and computational cost. The architecture draws on prior MLP-based models, with key modifications that reduce the risk of overfitting. These modifications introduce weight sharing and sparse connectivity within the MLP: the 2D grid of image tokens is mixed by one-dimensional MLPs applied along the horizontal and vertical axes, with weights shared across rows and columns. The design also incorporates principles long established in computer vision, such as locality bias and pyramid (multi-scale) processing. A sketch of this token-mixing scheme follows.
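To make the sparse token-mixing idea concrete, below is a minimal PyTorch sketch assuming a (B, C, H, W) tensor layout. The three-branch structure and the 1x1 fusion convolution are illustrative assumptions, not a reproduction of the authors' exact implementation.

```python
import torch
import torch.nn as nn


class SparseMLP(nn.Module):
    """Sketch of sparse token mixing: two 1D MLPs, one along rows, one along columns.

    Each 1D MLP is shared across all rows (or columns) and all channels, so the
    token-mixing parameters scale with H + W rather than (H * W)^2 as in a
    dense token-mixing MLP.
    """

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # shared across every row and channel
        self.mix_h = nn.Linear(height, height)  # shared across every column and channel
        # Fuse the identity, horizontal, and vertical branches with a 1x1 conv.
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        horizontal = self.mix_w(x)                                   # mix tokens along W
        vertical = self.mix_h(x.transpose(2, 3)).transpose(2, 3)     # mix tokens along H
        return self.fuse(torch.cat([x, horizontal, vertical], dim=1))
```

Because the same linear layer is reused for every row (and every column), the module keeps a global receptive field across each axis while its parameter budget stays far below that of a fully connected token-mixing layer.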
sMLPNet begins with a standard patch-partitioning step that splits the input image into non-overlapping patches, which are embedded as tokens. The network then stacks token-mixing and channel-mixing modules across multiple stages, following the pyramid structure typically found in CNN architectures, with spatial resolution reduced between stages. An essential part of the design is the depth-wise convolution layer that strengthens local feature extraction, operating alongside the proposed sparse MLP, which handles global dependencies.
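The following is a hedged sketch of how a single block might combine these pieces, reusing the SparseMLP module sketched above. The residual arrangement, normalization choice, and channel-MLP expansion ratio are assumptions made for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class SMLPBlock(nn.Module):
    """Illustrative block: depth-wise conv (local) + sparse MLP (global) + channel MLP."""

    def __init__(self, channels: int, height: int, width: int, mlp_ratio: int = 4):
        super().__init__()
        # Depth-wise 3x3 convolution captures local spatial features.
        self.local = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, groups=channels)
        # Sparse MLP (defined in the previous sketch) models global dependencies.
        self.global_mix = SparseMLP(channels, height, width)
        self.norm = nn.BatchNorm2d(channels)
        # Per-token channel mixing, implemented with 1x1 convolutions.
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels * mlp_ratio, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.global_mix(self.local(x))   # token mixing: local, then global
        x = x + self.channel_mlp(self.norm(x))   # channel mixing
        return x
```

Stacking such blocks within each stage, and reducing spatial resolution while increasing channel width between stages, yields the pyramid layout described above.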
Empirical Evaluation
Evaluation on the ImageNet-1K dataset shows that sMLPNet achieves 81.9% top-1 accuracy with only 24M parameters. When scaled to 66M parameters, its accuracy is on par with that of the Swin Transformer, a leading attention-based architecture. Particularly noteworthy is that sMLPNet reaches this level of accuracy without any self-attention, suggesting that such mechanisms may not be indispensable for vision models.
Performance and Implications
The results obtained with sMLPNet carry significant implications for the AI research landscape, both theoretically and practically. From a practical perspective, the reduction in parameters and computational demands makes high-performance image recognition more accessible, which is advantageous in resource-constrained settings or when fast inference is required. From a theoretical standpoint, questioning the necessity of self-attention may open new avenues for designing machine learning architectures for tasks beyond computer vision.
The implications are substantial in terms of improving model scalability and applicability to larger or more complex datasets without succumbing to overfitting. Additionally, sMLPNet challenges conventional thinking about architectural dependencies, pushing researchers to reevaluate the underlying components critical to building robust and efficient models.
Future Directions
The framework laid out by the authors for attention-free networks heralds exciting potential developments, including reevaluating the design principles of globally contextual models. Future investigations might pivot toward further refining fusion methods in sMLP modules, exploring variants of locality modeling, or adapting similar approaches to broader tasks such as semantic segmentation and object detection. Overcoming challenges associated with MLP-like architectures, such as handling inputs of varying resolutions, could also stimulate advancements in versatile network design.
In conclusion, this paper presents a compelling argument against the assumed centrality of self-attention in image recognition, demonstrating through sMLPNet's architecture that alternative paths can lead to state-of-the-art performance.