An Analysis of "ConvMLP: Hierarchical Convolutional MLPs for Vision"
The paper "ConvMLP: Hierarchical Convolutional MLPs for Vision" presents ConvMLP, a novel architecture aimed at improving the effectiveness and versatility of MLP-based models in visual recognition. While MLP-based approaches, particularly MLP-Mixer and its variants, have made significant advances, these models typically rely on spatial MLPs, which are constrained to fixed input dimensions and carry heavy computational loads. ConvMLP proposes a hierarchical design combining convolutional layers and MLPs, focusing on overcoming these drawbacks and on effectively transferring learned representations to downstream tasks such as object detection and semantic segmentation.
Core Contributions
ConvMLP introduces several architectural innovations that distinguish it from existing MLP-based models:
- Hierarchical Convolutional Structure: ConvMLP employs a hierarchical stage-wise design that integrates convolutional and MLP layers. This approach facilitates capturing spatial information more effectively than traditional MLPs, making ConvMLP suitable for arbitrary input dimensions—a critical requirement for downstream tasks. The architecture's effectiveness is evidenced by its performance, achieving 76.8% top-1 accuracy on ImageNet-1k with merely 9 million parameters and 2.4 GMACs.
- Conv-MLP Blocks and Convolutional Downsampling: The implementation of Conv-MLP blocks, which consist of channel MLPs interspersed with 3x3 depth-wise convolutions, adds spatial interaction capabilities. The architecture further employs convolutional stages and downsampling methods that enhance spatial feature extraction without the computational burden inherent in traditional spatial MLPs.
- Scalable and Transferable Design: ConvMLP is scalable by adjusting the depth and width of its convolution and MLP stages, allowing it to adapt to various computational budgets while retaining competitive performance. The architecture also demonstrates strong transferability to tasks beyond classification, as showcased by its robust performance on MS COCO and ADE20K benchmarks for object detection and semantic segmentation.
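The two properties above, channel MLPs that are independent of spatial size and 3x3 depth-wise convolutions for local spatial mixing, can be illustrated with a minimal pure-Python sketch. The layer ordering (channel MLP, depthwise conv, channel MLP) and all weight shapes here are illustrative assumptions, not the paper's reference implementation; tensors are nested lists indexed [channel][row][col].

```python
def channel_mlp(x, w, b):
    """Pointwise (1x1) linear map over channels; independent of H and W."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w[o][i] * x[i][r][c] for i in range(c_in)) + b[o]
              for c in range(wd)] for r in range(h)] for o in range(len(w))]

def depthwise_conv3x3(x, k):
    """3x3 depth-wise convolution with zero padding; one kernel per channel."""
    cn, h, wd = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * wd for _ in range(h)] for _ in range(cn)]
    for ch in range(cn):
        for r in range(h):
            for c in range(wd):
                s = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < wd:
                            s += k[ch][dr + 1][dc + 1] * x[ch][rr][cc]
                out[ch][r][c] = s
    return out

def conv_mlp_block(x, w1, b1, k, w2, b2):
    """Channel MLP -> 3x3 depth-wise conv -> channel MLP (hedged sketch)."""
    return channel_mlp(depthwise_conv3x3(channel_mlp(x, w1, b1), k), w2, b2)

C, HID = 2, 3  # toy channel widths, chosen for illustration
w1 = [[0.1] * C for _ in range(HID)]; b1 = [0.0] * HID
w2 = [[0.1] * HID for _ in range(C)]; b2 = [0.0] * C
k = [[[0.0] * 3, [0.0, 1.0, 0.0], [0.0] * 3] for _ in range(HID)]  # identity kernels

# The same weights apply to any spatial size -- a spatial MLP could not do
# this, since its weight matrix dimension is tied to the token count H*W.
for h, w in [(4, 4), (7, 5)]:
    x = [[[1.0] * w for _ in range(h)] for _ in range(C)]
    y = conv_mlp_block(x, w1, b1, k, w2, b2)
    print(len(y), len(y[0]), len(y[0][0]))  # channel count and H, W preserved
```

Because none of the weight shapes reference H or W, the block accepts arbitrary input resolutions, which is the property that makes transfer to detection and segmentation straightforward.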
Numerical and Comparative Insights
The paper's results show that ConvMLP-S achieves a stronger accuracy-to-computation ratio (Acc/GMACs) than comparable models such as ResMLP-S12 and CycleMLP-B1, with similar trends across model size categories. ConvMLP maintains competitive accuracy with fewer parameters and lower computational cost than its peers, making it a viable option for real-world deployments where resource constraints are prevalent.
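The Acc/GMACs metric is a simple ratio, and the ConvMLP-S figures quoted earlier (76.8% top-1 at 2.4 GMACs) are enough to reproduce its value; comparisons to other models would need their respective published numbers.

```python
def acc_per_gmacs(top1, gmacs):
    """Accuracy-to-computation ratio (Acc/GMACs)."""
    return top1 / gmacs

# ConvMLP-S figures from the paper: 76.8% top-1 accuracy at 2.4 GMACs.
convmlp_s_ratio = acc_per_gmacs(76.8, 2.4)
print(round(convmlp_s_ratio, 2))  # 32.0
```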
Implications and Future Directions
The introduction of ConvMLP marks a practical advancement in making MLP-based architectures more applicable to diverse visual tasks, notably facilitating their use in dynamic environments requiring variable input sizes. By co-designing convolutions and MLPs, ConvMLP sets a foundation for further exploration into hybrid architectures that leverage the strengths of both strategies.
Considering the effectiveness demonstrated in visual recognition and downstream tasks, future research may extend ConvMLP's principles to other domains, potentially combining it with attention mechanisms for enhanced contextual understanding or exploring its application to video analysis. Additionally, future work could investigate further optimizations that reduce computational overhead while maintaining or improving accuracy.
In summary, "ConvMLP: Hierarchical Convolutional MLPs for Vision" contributes significantly to the ongoing evolution of MLP-based models in computer vision, proposing a versatile and scalable framework to address their previous shortcomings. Its demonstrated effectiveness across multiple benchmarks opens the door to broader applications and further innovations in the field.