
EfficientFormer: Vision Transformers at MobileNet Speed (2206.01191v5)

Published 2 Jun 2022 in cs.CV

Abstract: Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.

EfficientFormer: Vision Transformers at MobileNet Speed

EfficientFormer is a novel architecture presented to address the computational challenges of deploying Vision Transformers (ViT) in resource-constrained environments, specifically mobile devices. The architecture strategically redesigns the network to achieve speed comparable to lightweight convolutional neural networks (CNNs) such as MobileNet, while maintaining competitive performance on computer vision tasks.

Key Contributions

EfficientFormer introduces a dimension-consistent design paradigm for vision transformers, avoiding design choices that inflate latency. Unlike previous approaches, which often resort to hybrid models incorporating MobileNet blocks, EfficientFormer remains true to a "pure transformer" structure. The architecture combines 4D blocks, which operate on CNN-style feature maps, with 3D blocks, which operate on token sequences using self-attention, enabling straightforward implementation and efficient execution across different hardware platforms; a minimal sketch of the two block types follows.
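To make the 4D/3D split concrete, here is a minimal PyTorch sketch of the two block types. It follows the description above (pooling-based local token mixing on 4D feature maps, multi-head self-attention on 3D token sequences); the layer-scale initialization, expansion ratio, and head count are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class Block4D(nn.Module):
    """4D block: operates on CNN-style feature maps of shape [B, C, H, W]."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        # Local token mixing via average pooling instead of attention.
        self.token_mixer = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        hidden = dim * mlp_ratio
        # Channel MLP implemented with 1x1 convolutions and BatchNorm.
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),
        )
        # Learnable per-channel residual scaling; the 1e-5 init is an assumption.
        self.ls1 = nn.Parameter(1e-5 * torch.ones(dim))
        self.ls2 = nn.Parameter(1e-5 * torch.ones(dim))

    def forward(self, x):
        x = x + self.ls1.view(1, -1, 1, 1) * (self.token_mixer(x) - x)
        x = x + self.ls2.view(1, -1, 1, 1) * self.mlp(x)
        return x


class Block3D(nn.Module):
    """3D block: operates on ViT-style token sequences of shape [B, N, C]."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

Keeping the 4D blocks in the early, high-resolution stages and reserving the 3D attention blocks for the low-resolution tail avoids repeated reshape/permute operations and keeps the quadratic cost of attention confined to short token sequences.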

The proposed architecture leverages a newly designed patch embedding that replaces the large-kernel patchify convolution with a more efficient convolutional stem. This not only accelerates inference but also avoids operations that are poorly supported by current mobile compilers. EfficientFormer further integrates a latency-driven slimming method to tailor the network architecture to a given computational budget. This method slims a pre-trained supernet under the guidance of measured on-device latency, rather than merely reducing MACs or parameter count.
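A hedged sketch of the two ideas above may help: a convolutional stem built from two stride-2 3x3 convolutions (in place of a single large-kernel, large-stride patchify convolution), and a toy greedy loop illustrating latency-driven selection of a sub-network under a latency budget. The channel widths and the `measure_latency`, `evaluate`, and `supernet.extract` helpers are hypothetical placeholders, and the loop is a simplification rather than the authors' exact search procedure.

```python
import torch.nn as nn


def conv_stem(in_chans: int = 3, embed_dim: int = 48) -> nn.Sequential:
    """Two 3x3, stride-2 convolutions with BatchNorm and ReLU (4x downsampling)."""
    return nn.Sequential(
        nn.Conv2d(in_chans, embed_dim // 2, 3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim),
        nn.ReLU(inplace=True),
    )


def latency_driven_slimming(supernet, candidates, latency_budget_ms, measure_latency, evaluate):
    """Pick the most accurate sub-network that meets an on-device latency budget.

    `candidates` are sub-network configurations derived from the pre-trained
    supernet; `measure_latency` runs the compiled sub-network on the target
    device and returns milliseconds; `evaluate` returns validation accuracy.
    All three are hypothetical helpers used for illustration only.
    """
    best_cfg, best_acc = None, float("-inf")
    for cfg in candidates:
        subnet = supernet.extract(cfg)              # hypothetical supernet API
        if measure_latency(subnet) > latency_budget_ms:
            continue                                # reject: exceeds the latency budget
        acc = evaluate(subnet)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg
```

The point of the loop is that the acceptance criterion is latency measured on the deployment target (e.g., a CoreML-compiled model on an iPhone), not a proxy such as MACs or parameter count.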

Numerical Results and Implications

The efficiency of EfficientFormer is validated through extensive experiments. EfficientFormer-L1, tailored for mobile applications, reaches 79.2% ImageNet-1K top-1 accuracy at a latency of 1.6 ms on an iPhone 12, matching the latency of MobileNetV2×1.4 (1.6 ms, 74.7% top-1) while delivering 4.5 percentage points higher accuracy. EfficientFormer-L7, the largest model, pushes accuracy to 83.3% at a latency of 7.0 ms. These models maintain high accuracy at significantly reduced inference times, showing that transformers can run efficiently under constrained conditions without sacrificing predictive accuracy.

Theoretical and Practical Insights

EfficientFormer exemplifies an important theoretical insight: the choice and consistency of dimension processing within network design can drastically influence latency without impairing the model’s capability to leverage the benefits of attention mechanisms. Such design paradigms present new opportunities for developing models optimized for edge devices, expanding the applicability of vision transformers to scenarios requiring low latency and high-throughput processing.

From a practical standpoint, the EfficientFormer architecture provides a robust framework for future developments in mobile-based AI applications, offering a model that reconciles the often conflicting demands of speed and precision. As edge computing continues to evolve, the strategies and insights derived from EfficientFormer will likely inform both industry and academic endeavors in optimizing AI models for diverse operating environments.

Future Developments and Speculations

The research presented opens doors for several exciting future directions in AI. There is potential for exploring more adaptive methods for integrating attention mechanisms that can dynamically adjust to varying computational graphs. Moreover, the latency-aware slimming approach may be further refined or adapted using emerging neural architecture search techniques or through hardware-specific optimizations.

EfficientFormer not only provides a significant advance in transformer efficiency but also narrows the gap between CNNs and the versatile ViTs, suggesting a promising horizon for deploying vision tasks across an expanded range of platforms, from mobile devices to distributed IoT networks. As AI models become more pervasive, the deployment of efficient transformers like EfficientFormer will play a crucial role in ensuring robust, real-time processing capabilities are feasible even in resource-limited environments.

Authors (8)
  1. Yanyu Li
  2. Geng Yuan
  3. Yang Wen
  4. Ju Hu
  5. Georgios Evangelidis
  6. Sergey Tulyakov
  7. Yanzhi Wang
  8. Jian Ren
Citations (282)