ME-ViT: A Single-Load Memory-Efficient FPGA Accelerator for Vision Transformers (2402.09709v1)
Abstract: Vision Transformers (ViTs) have emerged as a state-of-the-art solution for object classification tasks. However, their computational demands and high parameter count make them unsuitable for real-time inference, prompting the need for efficient hardware implementations. Existing hardware accelerators for ViTs suffer from frequent off-chip memory access, restricting the achievable throughput by memory bandwidth. In devices with a high compute-to-communication ratio (e.g., edge FPGAs with limited bandwidth), off-chip memory access imposes a severe bottleneck on overall throughput. This work proposes ME-ViT, a novel Memory-Efficient FPGA accelerator for ViT inference that minimizes memory traffic. ME-ViT is designed around a *single-load policy*: model parameters are loaded only once, intermediate results are stored on-chip, and all operations are implemented in a single processing element. To achieve this goal, we design a memory-efficient processing element (ME-PE) that processes the key operations of ViT inference on the same architecture through the reuse of *multi-purpose buffers*. We also integrate the Softmax and LayerNorm functions into the ME-PE, minimizing stalls between matrix multiplications. We evaluate ME-ViT at systolic array sizes of 32 and 16, achieving up to a 9.22× and a 17.89× overall improvement in memory bandwidth, respectively, and a 2.16× improvement in throughput per DSP for both designs over state-of-the-art ViT accelerators on FPGA. ME-ViT achieves a power-efficiency improvement of up to 4.00× (1.03×) over a GPU (FPGA) baseline. ME-ViT enables up to five ME-PE instantiations on a Xilinx Alveo U200, achieving a 5.10× improvement in throughput over the state-of-the-art FPGA baseline and a 5.85× (1.51×) improvement in power efficiency over the GPU (FPGA) baseline.
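To make the single-load policy concrete, the following Python sketch is a toy traffic model (our own illustration, not the authors' implementation) for one ViT-Base encoder layer. It contrasts a naive schedule, in which every matrix multiplication reloads its weights and spills its result off-chip, with a single-load schedule, in which parameters cross the memory boundary exactly once and intermediates stay in on-chip buffers. The dimensions, byte widths, and traffic counters are illustrative assumptions; the paper's measured gains additionally depend on the baseline accelerator's tiling and reuse.

```python
# Toy off-chip traffic model for one ViT-Base encoder layer.
# This is an illustrative sketch, not ME-ViT's actual interface or RTL.

EMB, TOKENS, MLP = 768, 197, 3072   # ViT-Base embedding dim, tokens, MLP dim
BYTES = 1                            # int8 operands

def layer_param_bytes(emb=EMB, mlp=MLP):
    """Parameter footprint of one encoder layer: QKV projections,
    attention output projection, and two MLP matrices
    (biases and LayerNorm parameters omitted for simplicity)."""
    qkv   = 3 * emb * emb
    proj  = emb * emb
    mlp_w = emb * mlp + mlp * emb
    return (qkv + proj + mlp_w) * BYTES

def activation_bytes(emb=EMB, tokens=TOKENS, mlp=MLP):
    """Intermediate tensors produced inside one layer
    (single attention head shown for brevity)."""
    qkv_out = 3 * tokens * emb
    attn    = tokens * tokens
    mlp_mid = tokens * mlp
    return (qkv_out + attn + mlp_mid) * BYTES

# Naive schedule: weights are reloaded per matmul and every intermediate
# crosses the memory boundary twice (one write, one read back).
naive_traffic = layer_param_bytes() + 2 * activation_bytes()

# Single-load policy: parameters are loaded once; intermediates never
# leave the on-chip multi-purpose buffers.
single_load_traffic = layer_param_bytes()

print(f"naive off-chip traffic:       {naive_traffic / 1e6:.2f} MB/layer")
print(f"single-load off-chip traffic: {single_load_traffic / 1e6:.2f} MB/layer")
print(f"reduction: {naive_traffic / single_load_traffic:.2f}x")
```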
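The claim that integrating Softmax and LayerNorm into the ME-PE minimizes stalls rests on both operations being row-wise: each output row of a matrix multiplication can be normalized as it drains from the systolic array, rather than stalling until the full result matrix is resident. The NumPy sketch below illustrates this row-streaming fusion; `stream_rows` and the fused generator names are hypothetical, chosen for illustration rather than taken from the paper's hardware design.

```python
import numpy as np

def stream_rows(a, b):
    """Yield output rows of a @ b one at a time, mimicking a systolic
    array draining its result row by row."""
    for row in a:
        yield row @ b

def fused_softmax_rows(a, b):
    """Numerically stable softmax applied to each row as it drains,
    overlapping normalization with the remaining matmul work."""
    for row in stream_rows(a, b):
        e = np.exp(row - row.max())
        yield e / e.sum()

def fused_layernorm_rows(a, b, gamma, beta, eps=1e-5):
    """Row-wise LayerNorm fused with the matmul drain in the same way."""
    for row in stream_rows(a, b):
        mu, var = row.mean(), row.var()
        yield gamma * (row - mu) / np.sqrt(var + eps) + beta

# Usage: fuse softmax with a small Q @ K^T product.
rng = np.random.default_rng(0)
q, k_t = rng.standard_normal((4, 8)), rng.standard_normal((8, 8))
attn = np.stack(list(fused_softmax_rows(q, k_t)))
assert np.allclose(attn.sum(axis=1), 1.0)  # each softmax row sums to 1
```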
Authors: Kyle Marino, Pengmiao Zhang, Viktor Prasanna