
Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer (2405.03882v3)

Published 6 May 2024 in cs.CV and cs.AI

Abstract: Motivated by the huge success of Transformers in the field of NLP, Vision Transformers (ViTs) have been rapidly developed and achieved remarkable performance in various computer vision tasks. However, their huge model sizes and intensive computations hinder ViTs' deployment on embedded devices, calling for effective model compression methods such as quantization. Unfortunately, due to the existence of hardware-unfriendly and quantization-sensitive non-linear operations, particularly Softmax, it is non-trivial to completely quantize all operations in ViTs, yielding either significant accuracy drops or non-negligible hardware costs. In response to the challenges associated with standard ViTs, we turn our attention to the quantization and acceleration of efficient ViTs, which not only eliminate the troublesome Softmax but also integrate linear attention with low computational complexity, and propose Trio-ViT accordingly. Specifically, at the algorithm level, we develop a tailored post-training quantization engine that takes the unique activation distributions of Softmax-free efficient ViTs into full consideration, aiming to boost quantization accuracy. Furthermore, at the hardware level, we build an accelerator dedicated to the specific Convolution-Transformer hybrid architecture of efficient ViTs, thereby enhancing hardware efficiency. Extensive experimental results consistently prove the effectiveness of our Trio-ViT framework. In particular, we gain up to 3.6×, 5.0×, and 7.3× higher FPS under comparable accuracy over state-of-the-art ViT accelerators, as well as 6.0×, 1.5×, and 2.1× higher DSP efficiency. Code is available at https://github.com/shihuihong214/Trio-ViT.
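As a rough illustration of the two ingredients the abstract refers to, the sketch below combines a generic ReLU-kernel linear attention (Softmax-free, with O(N·d²) rather than O(N²·d) cost) with a minimal symmetric uniform post-training quantizer in NumPy. This is a hedged toy under assumed simplifications, not the paper's tailored PTQ engine or the exact attention used in efficient ViTs; all function names here are hypothetical.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Softmax-free linear attention with a ReLU feature map (toy sketch).

    The key/value summary phi(K)^T V is computed once, so the cost is
    O(N * d^2) instead of the O(N^2 * d) of softmax attention.
    q, k, v: arrays of shape (N, d).
    """
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)   # ReLU kernel feature map
    kv = phi_k.T @ v                                        # (d, d) key/value summary
    z = phi_q @ phi_k.sum(axis=0, keepdims=True).T + eps    # (N, 1) normalizer
    return (phi_q @ kv) / z

def quantize_uniform(x, n_bits=8):
    """Minimal symmetric uniform PTQ: calibrate a scale from the observed
    range, then fake-quantize to n_bits signed integers."""
    scale = max(np.abs(x).max() / (2 ** (n_bits - 1) - 1), 1e-12)
    x_int = np.clip(np.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return x_int * scale  # dequantized ("fake-quantized") values

# Toy usage: quantize activations feeding the Softmax-free attention block.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((196, 64)) for _ in range(3))
out_fp = relu_linear_attention(q, k, v)
out_q = relu_linear_attention(*(quantize_uniform(t) for t in (q, k, v)))
print("mean abs error:", np.abs(out_fp - out_q).mean())
```

The point of the sketch is structural: once Softmax is replaced by a kernelized (here ReLU) normalization, every remaining operation is a matrix multiply or elementwise ReLU, which is exactly the kind of computation that uniform integer quantization and DSP-based accelerators handle well.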

Authors (4)
  1. Huihong Shi (18 papers)
  2. Haikuo Shao (6 papers)
  3. Wendong Mao (13 papers)
  4. Zhongfeng Wang (50 papers)