
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Published 30 Aug 2021 in cs.CV (arXiv:2108.13002v2)

Abstract: Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.

Summary

  • The paper establishes SPACH, a unified framework that enables fair performance comparisons among CNNs, Transformers, and MLPs.
  • It demonstrates that multi-stage design and local modeling significantly improve performance across various network scales.
  • Hybrid architectures that merge convolution and Transformer layers achieve competitive results, reaching 83.9% top-1 accuracy on ImageNet.

The study titled "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP" undertakes a comprehensive analysis of convolutional neural networks (CNNs), Transformers, and multi-layer perceptrons (MLPs) in computer vision. The research offers empirical insight into the comparative performance and behavior of these architectures under a unified experimental framework called SPACH.

Framework and Methodology

The SPACH framework was designed to minimize discrepancies introduced by differing architectural conventions, ensuring a fair comparison between CNNs, Transformers, and MLPs. Key to the framework is its modular design: the network structure under test is swapped in as the spatial processing module, while the rest of the pipeline, including channel processing, is left unchanged. This standardization allows a direct assessment of the spatial and channel processing abilities inherent to each architecture.
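
As a rough illustration, the sketch below shows what such a block might look like in PyTorch: an interchangeable spatial mixing module followed by a channel MLP that is shared by all variants. This is a minimal sketch under assumptions, not the microsoft/SPACH implementation; the names `MixingBlock` and `SelfAttention` are illustrative only.

```python
import torch
import torch.nn as nn

class MixingBlock(nn.Module):
    """Illustrative SPACH-style block: an interchangeable spatial mixing module
    plus a shared channel MLP, each in a pre-norm residual branch.
    Sketch of the idea only, not the microsoft/SPACH code."""

    def __init__(self, dim, spatial_mixing, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_mixing = spatial_mixing          # convolution, attention, or MLP over tokens
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(             # channel processing, identical for all variants
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                             # x: (batch, tokens, dim)
        x = x + self.spatial_mixing(self.norm1(x))
        x = x + self.channel_mlp(self.norm2(x))
        return x

class SelfAttention(nn.Module):
    """Thin wrapper so multi-head self-attention accepts a single input tensor."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

# Swapping the spatial mixing module switches the structure under test:
block = MixingBlock(dim=96, spatial_mixing=SelfAttention(96, num_heads=3))
print(block(torch.randn(2, 196, 96)).shape)           # torch.Size([2, 196, 96])
```

A convolutional or MLP-based token mixer would slot into the same `spatial_mixing` argument, which is the sense in which SPACH isolates the structural choice from everything else.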

Core Findings

  1. Performance Parity at Moderate Scale: The study finds that at moderate scales, all three architectures can yield competitive performance. However, when scaling up the network size, each architecture exhibits unique behaviors, revealing their inherent strengths and limitations.
  2. Importance of Multi-Stage Design: The research underscores the often-overlooked value, for Transformers and MLPs, of the multi-stage design long standard in CNNs. This design strategy consistently outperforms single-stage configurations at every model scale analyzed.
  3. Local Modeling Efficiency: Another discovery is the critical role of local modeling. Lightweight depth-wise convolutions, used as a local modeling mechanism, significantly enhance performance, bringing models on par with those built from more computationally intensive structures (a minimal sketch follows this list).
  4. Overfitting in MLPs: While MLPs can achieve impressive results with smaller models, they are notably prone to overfitting as model size increases. This represents a significant hurdle for MLPs in achieving State-of-the-Art (SOTA) performance.
  5. Complementary Nature of Convolution and Transformer: The study highlights the complementary strengths of CNNs and Transformers. CNNs show superior generalization, making them well suited to lightweight models, while Transformers make better use of large model capacity, motivating hybrid designs.
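
To make the local modeling point concrete, here is a hedged PyTorch sketch (the class name and the exact placement inside a block are assumptions, not the paper's code): a 3x3 depth-wise convolution over the patch grid costs only nine weights per channel yet injects the locality prior that the study finds so effective.

```python
import torch
import torch.nn as nn

class LocalBypass(nn.Module):
    """Lightweight local modeling: a 3x3 depth-wise convolution over the patch grid.
    Illustrative sketch; adds only 9 weights (plus a bias) per channel."""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                             # x: (batch, tokens, dim), square grid assumed
        b, n, d = x.shape
        h = w = int(n ** 0.5)
        grid = x.transpose(1, 2).reshape(b, d, h, w)  # tokens -> 2D feature map
        grid = self.dwconv(grid)                      # mix each channel with its 3x3 neighborhood
        return grid.flatten(2).transpose(1, 2)        # back to (batch, tokens, dim)

x = torch.randn(2, 196, 96)                           # 14x14 patches, 96 channels
print(LocalBypass(96)(x).shape)                       # torch.Size([2, 196, 96])
```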

Hybrid Model Proposal

In light of these findings, the authors propose hybrid models that combine convolutional and Transformer layers. The resulting Hybrid-MS-S+ model reaches 83.9% top-1 accuracy on ImageNet with 63M parameters and 12.3G FLOPs, which is competitive with current SOTA models that employ more intricate architectures.
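
The PyTorch sketch below shows one plausible way to lay out such a hybrid, multi-stage backbone, with convolutional blocks in the early, high-resolution stages and Transformer blocks in the later, high-capacity stages. The stage depths, widths, and block definitions are placeholders for illustration; the actual Hybrid-MS-S+ configuration is specified in the paper and repository.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Stand-in convolutional block: depth-wise + point-wise convolution with a residual."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
        )

    def forward(self, x):                             # x: (batch, channels, h, w)
        return x + self.body(x)

class TransformerBlock(nn.Module):
    """Stand-in Transformer block: self-attention over the flattened token grid."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                             # x: (batch, channels, h, w)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)              # (batch, tokens, channels)
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]
        return t.transpose(1, 2).reshape(b, c, h, w)

def stage(block, in_dim, out_dim, depth):
    """One stage: 2x spatial downsampling followed by `depth` identical blocks."""
    layers = [nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)]
    layers += [block(out_dim) for _ in range(depth)]
    return nn.Sequential(*layers)

# Placeholder depths/widths: convolution early (better generalization when small),
# Transformer late (better use of capacity), arranged as a multi-stage pyramid.
backbone = nn.Sequential(
    stage(ConvBlock,          3,  64, depth=2),
    stage(ConvBlock,         64, 128, depth=2),
    stage(TransformerBlock, 128, 256, depth=6),
    stage(TransformerBlock, 256, 512, depth=2),
)
print(backbone(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 512, 14, 14])
```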

Implications and Future Directions

The research contributes significantly to the ongoing discourse on network architecture within computer vision. Practically, the findings suggest revisiting multi-stage designs and incorporating local modeling techniques even in non-convolutional architectures like Transformers and MLPs. Theoretically, the work opens intriguing questions regarding the optimal structure combination in hybrid models and the potential for new architectures beyond CNNs, Transformers, and MLPs.

This paper serves as a valuable resource for researchers aiming to objectively evaluate or improve upon current deep learning architectures. The results, derived from a methodical and standardized experimental setup, provide clear insights that guide both the design of future models and the understanding of existing ones. The empirical evidence supports the notion that leveraging the unique strengths of CNNs, Transformers, and MLPs—either individually or in hybrid configurations—can foster the development of more efficient and capable vision systems.

