
MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory (2310.16898v3)

Published 25 Oct 2023 in cs.CV

Abstract: Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes a significant contribution to ecological AI. Conventional methods have successfully enabled convolutional neural network inference on high-resolution images on microcontrollers, but a framework for vision transformers, which achieve state-of-the-art performance in many vision applications, remains unexplored. In this paper, we propose a hardware-algorithm co-optimization method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, in which we jointly design the transformer architecture and construct the inference operator library to fit the memory constraint. More specifically, we generalize one-shot network architecture search (NAS) to discover the architecture with the highest task performance under the memory budget of the microcontroller, enlarging the existing search space of vision transformers with low-rank decomposition dimensions and patch resolution for memory reduction. For the construction of the inference operator library, we schedule the memory buffer during inference through operator integration, patch embedding decomposition, and token overwriting, allowing the buffer to be fully utilized to adapt to the forward pass of the vision transformer. Experimental results demonstrate that MCUFormer achieves 73.62% top-1 accuracy on ImageNet image classification with 320KB of memory on an STM32F746 microcontroller. Code is available at https://github.com/liangyn22/MCUFormer.
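One of the memory-reduction techniques the abstract mentions is low-rank decomposition of the transformer's weight matrices, where the decomposition rank becomes part of the architecture search space. A minimal NumPy sketch of the idea (not the authors' implementation; the matrix sizes and rank are illustrative assumptions) replaces a dense weight matrix with two smaller factors via truncated SVD, roughly halving the parameter memory at the chosen rank:

```python
import numpy as np

def low_rank_decompose(W, rank):
    """Approximate W (d_out x d_in) with two rank-limited factors A @ B.
    The rank controls the memory/accuracy trade-off that a NAS could search over."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # d_out x rank, singular values folded in
    B = Vt[:rank, :]             # rank x d_in
    return A, B

# Illustrative layer size (a small ViT embedding dimension); rank chosen arbitrarily.
rng = np.random.default_rng(0)
W = rng.standard_normal((192, 192))
A, B = low_rank_decompose(W, rank=48)

# Parameters drop from d_out*d_in to rank*(d_out + d_in).
full_params = W.size            # 192 * 192 = 36864
low_params = A.size + B.size    # 48 * (192 + 192) = 18432
print(full_params, low_params)
```

At inference time the layer computes `(A @ (B @ x))` instead of `W @ x`, so the activation buffer between the two factors is only `rank` elements per token, which is what makes the decomposition attractive under a hard SRAM budget.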

Authors (6)
  1. Yinan Liang
  2. Ziwei Wang
  3. Xiuwei Xu
  4. Yansong Tang
  5. Jie Zhou
  6. Jiwen Lu
Citations (5)