Essay on "RepViT: Revisiting Mobile CNN From ViT Perspective"
Introduction
The paper entitled "RepViT: Revisiting Mobile CNN From ViT Perspective" proposes a significant refinement of lightweight Convolutional Neural Networks (CNNs) by integrating architectural insights drawn from Vision Transformers (ViTs). Lightweight ViTs have demonstrated strong performance and low latency, particularly on mobile devices where resources are constrained. However, the architectural differences between lightweight ViTs and lightweight CNNs, from block structure down to macro and micro design choices, had not been thoroughly examined. This paper attempts to bridge that gap and optimize CNNs for mobile deployment.
Methodology
The authors start from the well-known MobileNetV3 architecture and progressively incorporate design elements from lightweight ViTs, measuring mobile latency at each step. This systematic enhancement yields a new family of pure CNNs named RepViT. Key modifications include:
- Block Design: The authors separate the spatial token mixer and the channel mixer within each block, mirroring the MetaFormer structure underlying ViTs. Structural re-parameterization lets the token mixer use a multi-branch depthwise convolution during training while collapsing to a single convolution at inference, removing the extra latency (sketched in the first code example after this list).
- Macro Design: The paper adopts a more efficient stem built from stacked convolutions, deeper downsampling layers, and a simpler classifier, all of which reduce latency while maintaining or improving accuracy (see the stem sketch below).
- Micro Adjustments: Detailed choices include the exclusive use of simple 3×3 convolutions and a cross-block placement strategy for squeeze-and-excitation (SE) layers (see the placement sketch below).
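Since the block design is central to the paper, a minimal PyTorch sketch may help. All names here (RepDWConvSketch, RepViTBlockSketch), the expansion ratio, and the GELU activation are illustrative assumptions rather than the authors' code; the fuse method shows the general RepVGG-style conv-BN folding behind structural re-parameterization, not necessarily the exact branches RepViT uses.

```python
import torch
import torch.nn as nn


class RepDWConvSketch(nn.Module):
    """Training-time token mixer: a depthwise 3x3 conv plus a BN identity
    branch, foldable into a single conv for inference."""

    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.bn = nn.BatchNorm2d(dim)  # identity branch (BN only)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x) + self.bn(x)

    @torch.no_grad()
    def fuse(self) -> nn.Conv2d:
        """Structural re-parameterization: fold the BN branch into the conv."""
        dim = self.conv.in_channels
        std = (self.bn.running_var + self.bn.eps).sqrt()
        scale = self.bn.weight / std  # gamma / sqrt(var + eps), per channel
        # The BN identity branch expressed as a depthwise 3x3 kernel:
        # its per-channel scale sits at the kernel center, zeros elsewhere.
        id_kernel = torch.zeros_like(self.conv.weight)  # shape (dim, 1, 3, 3)
        id_kernel[:, 0, 1, 1] = scale
        fused = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        fused.weight.copy_(self.conv.weight + id_kernel)
        fused.bias.copy_(self.conv.bias + self.bn.bias
                         - self.bn.running_mean * scale)
        return fused


class SqueezeExcite(nn.Module):
    """Standard squeeze-and-excitation: global pooling drives channel gating."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)


class RepViTBlockSketch(nn.Module):
    """Spatial (token) mixing and channel mixing kept in separate residual
    sub-blocks, MetaFormer-style."""

    def __init__(self, dim: int, expansion: int = 2, use_se: bool = True):
        super().__init__()
        self.token_mixer = RepDWConvSketch(dim)
        self.se = SqueezeExcite(dim) if use_se else nn.Identity()
        hidden = dim * expansion
        self.channel_mixer = nn.Sequential(  # pointwise FFN over channels
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.se(self.token_mixer(x))  # spatial mixing, residual
        return x + self.channel_mixer(x)      # channel mixing, residual
```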
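The stem idea can likewise be sketched. The channel width of 48 and the GELU activation are assumptions for illustration; the actual widths vary across RepViT variants.

```python
import torch.nn as nn


def make_stem(in_ch: int = 3, out_ch: int = 48) -> nn.Sequential:
    """Two stacked stride-2 3x3 convolutions give a 4x spatial reduction
    with a small, regular compute pattern that maps well to mobile hardware."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch // 2),
        nn.GELU(),
        nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.GELU(),
    )
```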
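Finally, the cross-block SE placement can be illustrated by enabling SE only in alternating blocks of a stage, amortizing its latency cost. This reuses RepViTBlockSketch from the block-design sketch above; the dimensions and depth are arbitrary.

```python
import torch.nn as nn


def make_stage(dim: int, depth: int) -> nn.Sequential:
    """Enable SE in every other block rather than in all of them."""
    return nn.Sequential(
        *[RepViTBlockSketch(dim, use_se=(i % 2 == 0)) for i in range(depth)]
    )


stage = make_stage(dim=96, depth=4)  # SE active in blocks 0 and 2 only
```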
Experimental Results
The research demonstrates that RepViT models outperform state-of-the-art lightweight models in both accuracy and latency on mobile devices, across vision tasks including image classification, object detection, and instance segmentation. Notably, RepViT achieves over 80% top-1 accuracy on ImageNet at just 1.0 ms latency on an iPhone 12, which the authors report as a first for lightweight models.
Implications and Future Directions
The theoretical implication of this paper is that CNN and ViT architectural elements can be combined productively, yielding models well suited to mobile environments. Practically, RepViT promises stronger performance for real-time applications on mobile platforms without sacrificing computational efficiency.
Integrating the RepViT backbone with the Segment Anything Model (SAM) further establishes its efficacy, delivering nearly 10× faster inference than prior efficient variants such as MobileSAM.
Future research might focus on further optimizing the interaction between token and channel mixers, or on exploring different re-parameterization techniques to improve both performance and deployment efficiency. The paper's methods may also inspire new hybrid architectures that leverage the strengths of CNNs and ViTs more comprehensively.
Conclusion
This paper exemplifies how architectural insights from Vision Transformers can be adapted to enhance lightweight CNNs, yielding substantial improvements in both accuracy and latency. The work serves as a robust baseline for future lightweight models tailored to edge deployment, contributing to the broader computer vision and mobile computing landscape.