MobileNetV4 -- Universal Models for the Mobile Ecosystem (2404.10518v2)

Published 16 Apr 2024 in cs.CV

Abstract: We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms.


Summary

  • The paper presents MobileNetV4's Universal Inverted Bottleneck (UIB) and Mobile MQA, which boost model efficiency across varied mobile platforms.
  • It employs an optimized neural architecture search (NAS) strategy to reliably discover robust, mostly Pareto optimal architectures.
  • Results include 87% top-1 ImageNet accuracy and a 3.8ms runtime on Pixel 8 EdgeTPU, underscoring its practical impact on mobile deployment.

MobileNetV4: Enhancements for Optimal Mobile Ecosystem Deployment

Introduction to MobileNetV4

The latest addition to the MobileNets series, MobileNetV4 (MNv4), introduces architectural innovations that address the trade-off between efficiency and accuracy on mobile devices. The key advances are the Universal Inverted Bottleneck (UIB), a Mobile Multi-Query Attention (Mobile MQA) block tailored for mobile accelerators, and an improved neural architecture search (NAS) recipe. Together, UIB and Mobile MQA are pivotal in achieving a universally efficient architecture that is mostly Pareto optimal across diverse mobile platforms, including CPUs, DSPs, GPUs, and specialized accelerators such as the Apple Neural Engine and Google Pixel EdgeTPU.

Key Contributions

  • Universal Inverted Bottleneck (UIB): The UIB evolves the Inverted Bottleneck block by unifying it with ConvNext-style and Feed Forward Network (FFN) structures, plus a novel Extra Depthwise (ExtraDW) variant. The block offers flexible spatial and channel mixing, the option to enlarge the receptive field, and improved computational efficiency (a sketch follows this list).
  • Mobile MQA: A novel attention block that delivers a 39% inference speedup on mobile accelerators. It shares keys and values across all attention heads, significantly raising operational intensity, which is crucial for performance on mobile devices (a sketch of the shared key/value mechanism also follows).
  • Optimized Neural Architecture Search (NAS): A refined NAS recipe improves MNv4's search effectiveness. A two-stage coarse- and fine-grained search, combined with an offline distilled dataset, makes the discovery of robust architectures more reliable and efficient (illustrated by the toy two-phase sketch below).
  • Mostly Pareto Optimal Performance: MNv4 achieves mostly Pareto optimal performance across a wide range of hardware, establishing a new benchmark for multi-platform deployment without platform-specific tuning.
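
To make the UIB structure concrete, here is a minimal PyTorch sketch (our illustration, not the authors' released code): two optional depthwise convolutions around the pointwise expansion select among the four variants. The kernel size, normalization, activation, and the omission of strides are illustrative assumptions.

```python
import torch.nn as nn

class UIB(nn.Module):
    """Minimal Universal Inverted Bottleneck sketch (strides omitted)."""

    def __init__(self, in_ch, out_ch, expand=4, kernel=3,
                 start_dw=True, mid_dw=True):
        super().__init__()
        mid_ch = in_ch * expand

        def dw(ch):  # depthwise conv + norm
            return nn.Sequential(
                nn.Conv2d(ch, ch, kernel, padding=kernel // 2,
                          groups=ch, bias=False),
                nn.BatchNorm2d(ch))

        def pw(cin, cout, act):  # pointwise conv + norm, optional activation
            layers = [nn.Conv2d(cin, cout, 1, bias=False),
                      nn.BatchNorm2d(cout)]
            if act:
                layers.append(nn.ReLU())
            return nn.Sequential(*layers)

        self.block = nn.Sequential(
            dw(in_ch) if start_dw else nn.Identity(),  # optional starting DW
            pw(in_ch, mid_ch, act=True),               # expansion
            dw(mid_ch) if mid_dw else nn.Identity(),   # optional middle DW
            pw(mid_ch, out_ch, act=False))             # linear projection
        self.residual = in_ch == out_ch

    def forward(self, x):
        y = self.block(x)
        return x + y if self.residual else y

# The four variants named in the paper:
#   UIB(c, c, start_dw=False, mid_dw=True)   # Inverted Bottleneck (IB)
#   UIB(c, c, start_dw=True,  mid_dw=False)  # ConvNext-like
#   UIB(c, c, start_dw=False, mid_dw=False)  # FFN
#   UIB(c, c, start_dw=True,  mid_dw=True)   # ExtraDW
```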
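Mobile MQA builds on multi-query attention, in which all query heads share a single key/value head, shrinking the key/value tensors that must move through memory and thereby raising operational intensity. The sketch below shows only that shared-K/V core on flattened tokens; the paper's further mobile-specific optimizations are omitted, and all names here are our own.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention sketch: num_heads query heads, one shared
    key/value head. Input is a flattened token sequence of shape (B, N, dim)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.q = nn.Linear(dim, dim, bias=False)          # per-head queries
        self.kv = nn.Linear(dim, 2 * self.d, bias=False)  # one shared K and V
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, N, _ = x.shape
        q = self.q(x).view(B, N, self.h, self.d).transpose(1, 2)  # (B, h, N, d)
        k, v = self.kv(x).chunk(2, dim=-1)                        # (B, N, d) each
        attn = (q @ k.unsqueeze(1).transpose(-1, -2)) * self.d ** -0.5
        attn = attn.softmax(dim=-1)                               # (B, h, N, N)
        y = attn @ v.unsqueeze(1)             # shared V broadcast over heads
        return self.out(y.transpose(1, 2).reshape(B, N, -1))

# Quick shape check:
#   x = torch.randn(2, 196, 64)
#   MultiQueryAttention(64)(x).shape  -> torch.Size([2, 196, 64])
```

Compared with standard multi-head attention, the key/value projections shrink by a factor of num_heads, which is where the memory-traffic savings come from.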
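The coarse-to-fine idea can be pictured with the toy sketch below. Everything here is a hypothetical stand-in: the search spaces, random sampler, and scoring function do not reflect the paper's actual NAS pipeline or its distillation-based evaluation; the sketch only illustrates the two-phase structure.

```python
import random

def coarse_search(score_fn, widths, depths, n_samples=64):
    """Phase 1: fix the UIB variants, search the macro shape (widths/depths)."""
    pool = [{"width": random.choice(widths), "depth": random.choice(depths)}
            for _ in range(n_samples)]
    return max(pool, key=score_fn)

def fine_search(score_fn, macro, uib_variants, n_samples=64):
    """Phase 2: freeze the macro shape, search per-block UIB variants."""
    pool = [{**macro, "blocks": [random.choice(uib_variants)
                                 for _ in range(macro["depth"])]}
            for _ in range(n_samples)]
    return max(pool, key=score_fn)

if __name__ == "__main__":
    score = lambda cfg: -abs(cfg["width"] - 96)  # placeholder objective
    macro = coarse_search(score, widths=[64, 96, 128], depths=[2, 3, 4])
    best = fine_search(score, macro, ["IB", "ConvNext", "FFN", "ExtraDW"])
    print(best)
```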

Results and Implications

Improved Hardware Efficiency

MNv4 models demonstrate exceptional efficiency across hardware. Specifically, the MNv4-Hybrid-Large model achieves 87% top-1 accuracy on ImageNet-1K with a runtime of just 3.8ms on the Pixel 8 EdgeTPU. This performance reflects the universality of the design and benefits practitioners who deploy on a variety of mobile platforms without extensive per-device customization.

Benchmarks Across Devices

The MNv4 models were rigorously benchmarked across the major mobile processing environments and traced mostly Pareto optimal accuracy/latency curves in almost all hardware scenarios tested. This consistency across hardware types such as CPUs, DSPs, GPUs, and specialized accelerators underscores MNv4's broad applicability in the mobile ecosystem.

Future Outlook

Building on the insights gained from MNv4, future research could further optimize the UIB and Mobile MQA components to improve efficiency and accuracy. Extending the NAS methodology to incorporate emergent hardware capabilities could likewise sustain the development of highly efficient mobile-specific models. As mobile devices diversify and their computing capabilities grow, a focus on universal model performance will remain paramount.

In conclusion, MobileNetV4's introduction of the UIB block, Mobile MQA, and a refined NAS recipe represents a significant step forward in the design of neural network architectures for mobile devices. Its mostly Pareto optimal performance across diverse hardware platforms not only broadens its applicability but also sets a new standard for mobile neural network efficiency.
