
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2108.13002v2)

Published 30 Aug 2021 in cs.CV

Abstract: Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.

Authors (6)
  1. Yucheng Zhao (28 papers)
  2. Guangting Wang (11 papers)
  3. Chuanxin Tang (13 papers)
  4. Chong Luo (58 papers)
  5. Wenjun Zeng (130 papers)
  6. Zheng-Jun Zha (144 papers)
Citations (59)

Summary

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

The paper titled "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP" undertakes a comprehensive analysis of convolutional neural networks (CNNs), Transformers, and multi-layer perceptrons (MLPs) within the field of computer vision. The research endeavors to offer empirical insights into the comparative performance and behaviors of these architectures under a unified experimental framework called SPACH.

Framework and Methodology

The SPACH framework is designed to minimize discrepancies introduced by differing architectural designs, ensuring a fair comparison between CNNs, Transformers, and MLPs. Key to the framework is its modular design: the spatial processing module, instantiated as convolution, self-attention, or MLP, can be interchanged without altering any other part of the pipeline. This standardization allows a direct assessment of the spatial and channel processing abilities inherent to each architecture.
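The separation can be sketched as a block that applies an interchangeable spatial mixer followed by a fixed per-token channel MLP. This is an illustrative sketch of the design principle only; function and variable names here are assumptions, not the authors' actual code.

```python
import numpy as np

def channel_mlp(x, w1, w2):
    # x: (tokens, channels); a two-layer MLP applied to each token independently,
    # shared by every SPACH variant
    return np.maximum(x @ w1, 0.0) @ w2

def spatial_shift_mix(x):
    # stand-in spatial mixer: average each token with its neighbour;
    # in the paper this slot holds a convolution, attention, or MLP module
    return 0.5 * (x + np.roll(x, 1, axis=0))

def spach_block(x, spatial_mixer, w1, w2):
    x = x + spatial_mixer(x)        # spatial processing (interchangeable)
    x = x + channel_mlp(x, w1, w2)  # channel processing (fixed)
    return x

rng = np.random.default_rng(0)
tokens, c = 16, 8
x = rng.standard_normal((tokens, c))
w1 = rng.standard_normal((c, 4 * c)) * 0.1
w2 = rng.standard_normal((4 * c, c)) * 0.1
y = spach_block(x, spatial_shift_mix, w1, w2)
print(y.shape)  # (16, 8)
```

Because only `spatial_mixer` changes between variants, any performance difference can be attributed to the spatial-mixing structure itself.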

Core Findings

  1. Performance Parity at Moderate Scale: The paper finds that at moderate scales, all three architectures can yield competitive performance. However, when scaling up the network size, each architecture exhibits unique behaviors, revealing their inherent strengths and limitations.
  2. Importance of Multi-Stage Design: The research underscores the overlooked importance of multi-stage design in Transformers and MLPs, traditionally utilized in CNNs. This design strategy consistently outperforms single-stage configurations across all model scales analyzed.
  3. Local Modeling Efficiency: Another discovery is the critical role of local modeling. Light-weight depth-wise convolutions, when used as a local modeling mechanism, significantly enhance performance, bringing models on par with those using more computationally intensive structures.
  4. Overfitting in MLPs: While MLPs can achieve impressive results with smaller models, they are notably prone to overfitting as model size increases. This represents a significant hurdle for MLPs in achieving State-of-the-Art (SOTA) performance.
  5. Complementary Nature of Convolution and Transformer: The paper highlights the complementary features of CNNs and Transformers. CNNs demonstrate superior generalization capability, making them ideal for lightweight models, while Transformers excel with larger capacity models, suggesting a consideration of hybrid approaches.
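The multi-stage design mentioned in finding 2 follows the CNN-backbone convention of shrinking spatial resolution while widening channels from stage to stage. The sketch below illustrates this schedule with a patch-merging style reduction; the 2x-downsample / 2x-widen schedule is an assumption for illustration, not the paper's exact configuration.

```python
import numpy as np

def downsample(x):
    # merge adjacent token pairs: halve the token count,
    # double the channel width (patch-merging style)
    t, c = x.shape
    return x.reshape(t // 2, 2 * c)

x = np.zeros((64, 8))        # stage 1: 64 tokens, 8 channels
shapes = [x.shape]
for _ in range(3):           # three stage transitions
    x = downsample(x)
    shapes.append(x.shape)
print(shapes)  # [(64, 8), (32, 16), (16, 32), (8, 64)]
```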

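The appeal of depth-wise convolution as a local modeling mechanism (finding 3) is that each channel is filtered independently with its own small kernel, so cost grows linearly with channels rather than quadratically. A minimal 1-D sketch, with shapes and names assumed for illustration:

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    # x: (length, channels); kernels: (k, channels), one k-tap filter per channel
    k, c = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad along the token axis
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        # each channel sees only its own column of the kernel bank
        out[i] = np.sum(xp[i:i + k] * kernels, axis=0)
    return out

x = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, 2 channels
kernels = np.full((3, 2), 1.0 / 3.0)          # a 3-tap mean filter per channel
y = depthwise_conv1d(x, kernels)
print(y[1])  # [2. 3.] -- the mean of tokens 0..2, per channel
```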
Hybrid Model Proposal

In light of their findings, the authors propose hybrid models that incorporate both convolutional and Transformer layers. These models perform strongly: the Hybrid-MS-S+ model reaches 83.9% top-1 accuracy on ImageNet with 63M parameters and 12.3G FLOPs, competitive with current SOTA models that employ more intricate architectures.
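The hybrid idea can be sketched as convolution-style local mixing in early stages (where generalization matters most) followed by attention-style global mixing in later stages (where capacity matters most). This illustrates the design principle only, not the actual Hybrid-MS-S+ architecture; all names and stage counts are assumptions.

```python
import numpy as np

def local_mix(x):
    # neighbour averaging stands in for a convolutional block
    return (np.roll(x, -1, axis=0) + x + np.roll(x, 1, axis=0)) / 3.0

def global_mix(x):
    # single-head self-attention stands in for a Transformer block
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over tokens
    return weights @ x

def hybrid_forward(x, n_conv_stages=2, n_attn_stages=2):
    for _ in range(n_conv_stages):   # early stages: local modeling
        x = x + local_mix(x)
    for _ in range(n_attn_stages):   # late stages: global modeling
        x = x + global_mix(x)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
y = hybrid_forward(x)
print(y.shape)  # (16, 8)
```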

Implications and Future Directions

The research contributes significantly to the ongoing discourse on network architecture within computer vision. Practically, the findings suggest revisiting multi-stage designs and incorporating local modeling techniques even in non-convolutional architectures like Transformers and MLPs. Theoretically, the work opens intriguing questions regarding the optimal structure combination in hybrid models and the potential for new architectures beyond CNNs, Transformers, and MLPs.

This paper serves as a valuable resource for researchers aiming to objectively evaluate or improve upon current deep learning architectures. The results, derived from a methodical and standardized experimental setup, provide clear insights that guide both the design of future models and the understanding of existing ones. The empirical evidence supports the notion that leveraging the unique strengths of CNNs, Transformers, and MLPs—either individually or in hybrid configurations—can foster the development of more efficient and capable vision systems.
