Optimizing the Deployment of Tiny Transformers on Low-Power MCUs (2404.02945v1)

Published 3 Apr 2024 in cs.LG, cs.AI, cs.DC, and cs.PF

Abstract: Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNN, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, the early approaches in this direction are mostly ad-hoc, platform, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations into the crucial attention block. A novel MHSA inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting ARM and RISC-V ISA, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention can reduce the runtime by 1.53x, and number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14ms and energy consumption of 4.92 micro-joules, 2.32x lower than the SotA PULP-NN library on the same platform.

Optimizing the Deployment of Tiny Transformers on Low-Power Microcontrollers

Introduction to the Framework

The recent surge in deploying Transformer models for edge computing applications emphasizes the need for efficient implementation strategies, especially on low-power microcontroller units (MCUs). This paper introduces a comprehensive framework for deploying encoder-based Tiny Transformers across multiple commercial MCUs. The key contributions include a novel library of optimized kernels targeting the efficient execution of the Multi-Head Self-Attention (MHSA) mechanism, which is fundamental to Transformer architectures. Additionally, the work presents a Fused-Weight Self-Attention (FWSA) inference schedule and a Depth-First Tiling (DFT) scheme aimed at minimizing the memory footprint and computational overhead of MHSA operations.

Attention on Edge

The efficient execution of Transformer models on MCUs faces unique challenges, primarily due to the demanding memory and computation requirements of the attention mechanism. This paper's approach modifies traditional attention computations by introducing fused-weight and depth-first tiling strategies to mitigate these challenges.
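
To make the memory problem concrete, a plain single-head attention step materializes an S x S score matrix before the softmax, and this intermediate is what dominates the peak memory on cache-less MCUs. The sketch below is a minimal NumPy illustration with made-up shapes, not the paper's MCU kernels.

```python
import numpy as np

# Baseline (single-head) attention, written to expose the S x S score
# matrix that dominates peak memory; shapes are illustrative only.
S, E = 64, 32                          # sequence length, embedding size
rng = np.random.default_rng(0)
X = rng.standard_normal((S, E))
W_q, W_k, W_v = (rng.standard_normal((E, E)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # linear projections
scores = Q @ K.T / np.sqrt(E)          # S x S attention map: the memory peak
scores -= scores.max(axis=1, keepdims=True)
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)
out = attn @ V                         # S x E output
print(scores.shape)                    # (64, 64): grows quadratically with S
```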

The proposed Fused-Weight Self-Attention (FWSA) method reduces the number of operations and parameters by fusing the query and key linear-projection weights offline into a single matrix. The approach is particularly beneficial for models with a smaller embedding size (E), where it yields a clear reduction in both latency and memory requirements.
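
The fusion works because the query and key projections only ever appear in the product (X W_q)(X W_k)^T = X (W_q W_k^T) X^T, so the matrix W_q W_k^T can be precomputed offline and one projection disappears at inference time. The minimal NumPy sketch below (illustrative shapes, floating point rather than the paper's MCU kernels) checks that equivalence.

```python
import numpy as np

# Fused-weight self-attention scores: fold W_q and W_k into one matrix
# offline, so inference performs one projection instead of two.
S, E = 64, 32
rng = np.random.default_rng(0)
X = rng.standard_normal((S, E))
W_q = rng.standard_normal((E, E))
W_k = rng.standard_normal((E, E))

scores_standard = (X @ W_q) @ (X @ W_k).T   # two projections at run time
W_qk = W_q @ W_k.T                          # fused once, offline
scores_fused = X @ (W_qk @ X.T)             # one projection at run time

assert np.allclose(scores_standard, scores_fused)
```

With the query and key weights folded into one matrix, the attention block stores three projection matrices instead of four, consistent with the roughly 25% parameter reduction and 1.53x runtime reduction reported in the paper.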

The Depth-First Tiling (DFT) method addresses the high memory footprint of the attention-map computation by executing it piecewise, never materializing the entire matrix in memory. This technique reduces the memory peak by up to 6.19x, which is particularly valuable on cache-less MCU devices.
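
One way to picture the depth-first schedule is to process the attention map in blocks of query rows, so that only a T x S slice of scores is ever live and the full S x S matrix is never stored. The sketch below is an illustrative NumPy rendering of that idea under simplified assumptions (single head, floating point), not the paper's kernel implementation for multi-head attention on MCU memory hierarchies.

```python
import numpy as np

def attention_depth_first(Q, K, V, tile_rows=8):
    """Compute softmax(Q K^T / sqrt(E)) V one block of query rows at a time,
    keeping only a (tile_rows x S) slice of the attention map in memory."""
    S, E = Q.shape
    out = np.empty((S, V.shape[1]))
    for r0 in range(0, S, tile_rows):
        r1 = min(r0 + tile_rows, S)
        scores = (Q[r0:r1] @ K.T) / np.sqrt(E)         # partial attention map
        scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        out[r0:r1] = probs @ V                         # finished output rows
    return out

# The tiled result matches the non-tiled computation, while the peak size of
# the live score slice drops from S*S to tile_rows*S.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
full = (Q @ K.T) / np.sqrt(32)
full -= full.max(axis=1, keepdims=True)
ref = (np.exp(full) / np.exp(full).sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_depth_first(Q, K, V), ref)
```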

Qualitative and Quantitative Enhancements

The paper reports a comprehensive evaluation of the proposed framework on a range of MCUs based on ARM and RISC-V Instruction Set Architectures (ISAs), showing substantial improvements over state-of-the-art (SotA) libraries. On average, latency is 4.79x lower than with ARM's CMSIS-NN library and 2.0x lower than with the RISC-V PULP-NN library.

A series of micro-benchmarks on the MHSA and FWSA operations highlights the scalability of performance across various input dimensions and the efficiency of parallel execution on multi-core platforms. The ablation study quantifies the individual contributions of the FWSA and DFT optimizations to reducing runtime and memory requirements.

Practical Implications and Future Directions

This research enhances the deployment flexibility and efficiency of Tiny Transformers across a spectrum of IoT endpoints. The framework's ability to mitigate memory and computational bottlenecks enables advanced on-device inference within strict power and performance constraints.

This paper lays the groundwork for future research in the optimization of Transformer models for edge computing. Future work may explore the extension of these optimizations to other Transformer variants and the automatic generation of optimized tiling strategies based on model and hardware profiles.

The open-source availability of this framework encourages further community engagement and development, potentially expanding its applicability and improving the robustness of Tiny Transformers on low-power MCUs.

Authors (5)
  1. Victor J. B. Jung (7 papers)
  2. Alessio Burrello (52 papers)
  3. Moritz Scherer (12 papers)
  4. Francesco Conti (67 papers)
  5. Luca Benini (362 papers)