ME-ViT: A Single-Load Memory-Efficient FPGA Accelerator for Vision Transformers (2402.09709v1)

Published 15 Feb 2024 in eess.IV

Abstract: Vision Transformers (ViTs) have emerged as a state-of-the-art solution for object classification tasks. However, their computational demands and high parameter count make them unsuitable for real-time inference, prompting the need for efficient hardware implementations. Existing hardware accelerators for ViTs suffer from frequent off-chip memory access, which limits the achievable throughput to the available memory bandwidth. In devices with a high compute-to-communication ratio (e.g., edge FPGAs with limited bandwidth), off-chip memory access imposes a severe bottleneck on overall throughput. This work proposes ME-ViT, a novel Memory-Efficient FPGA accelerator for ViT inference that minimizes memory traffic. ME-ViT is built around a single-load policy: model parameters are loaded only once, intermediate results are stored on-chip, and all operations are implemented in a single processing element. To achieve this, we design a memory-efficient processing element (ME-PE) that performs the key operations of ViT inference on the same architecture by reusing multi-purpose buffers. We also integrate the Softmax and LayerNorm functions into the ME-PE, minimizing stalls between matrix multiplications. We evaluate ME-ViT on systolic array sizes of 32 and 16, achieving up to 9.22× and 17.89× overall improvements in memory bandwidth, and a 2.16× improvement in throughput per DSP for both designs over state-of-the-art ViT accelerators on FPGA. ME-ViT achieves a power efficiency improvement of up to 4.00× (1.03×) over a GPU (FPGA) baseline. ME-ViT enables up to 5 ME-PE instantiations on a Xilinx Alveo U200, achieving a 5.10× improvement in throughput over the state-of-the-art FPGA baseline and a 5.85× (1.51×) improvement in power efficiency over the GPU (FPGA) baseline.
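
The single-load policy can be pictured at a functional level: for each encoder block, the weight matrices are fetched from off-chip memory exactly once, and every intermediate result (attention scores, residuals, MLP activations) stays in on-chip buffers that are reused between matrix multiplications. The Python sketch below only illustrates that dataflow and is not the authors' implementation; the names OnChipBuffer and run_encoder_block are invented for this example, and GELU is approximated by ReLU for brevity.

import numpy as np

class OnChipBuffer:
    """Stand-in for a multi-purpose on-chip buffer: written and re-read by
    successive operations without touching off-chip memory."""
    def __init__(self):
        self.data = None
    def write(self, x):
        self.data = x
    def read(self):
        return self.data

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layernorm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def run_encoder_block(tokens, weights, buf):
    """Process one ViT encoder block with a single off-chip weight load:
    'weights' is read once; all intermediates live in 'buf'."""
    Wq, Wk, Wv, Wo, W1, W2 = weights            # single load of model parameters
    buf.write(layernorm(tokens))                # intermediate kept on chip
    x = buf.read()
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # QKV projections
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    buf.write(tokens + attn @ Wo)               # residual, kept on chip
    x = layernorm(buf.read())
    mlp = np.maximum(x @ W1, 0.0) @ W2          # MLP (ReLU stands in for GELU)
    return buf.read() + mlp                     # block output, still on chip

if __name__ == "__main__":
    d, n = 64, 16
    rng = np.random.default_rng(0)
    ws = tuple(rng.standard_normal(s) * 0.02
               for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)])
    out = run_encoder_block(rng.standard_normal((n, d)), ws, OnChipBuffer())
    print(out.shape)  # (16, 64)

In this toy schedule the only off-chip traffic per block is the one-time weight read, which is the property the ME-PE's multi-purpose buffers are designed to preserve in hardware.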

Authors (3)
  1. Kyle Marino (1 paper)
  2. Pengmiao Zhang (7 papers)
  3. Viktor Prasanna (76 papers)
Citations (8)
