Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision (2307.00331v2)

Published 1 Jul 2023 in cs.LG, cs.AI, and cs.CV

Abstract: Despite the outstanding performance of transformers in both language and vision tasks, their growing computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter overhead, quantization is frequently studied as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, low-bit quantization applications remain limited and underexplored. In this paper, we attribute the difficulty of low-bit quantization-aware training of transformers to their unique variation behaviors, which differ significantly from those of ConvNets. Based on comprehensive quantitative analysis, we observe variation in three hierarchies: varying module quantization sensitivities, outliers in static weight and activation distributions, and oscillation in dynamic parameter fluctuations. These variations bring instability to quantization-aware training (QAT) and degrade performance. We explore best practices to alleviate the influence of variation during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. We extensively verify that our scheme alleviates the variation and improves the performance of transformers across various models and tasks. Our solution substantially improves 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% on ImageNet-1K and 1.4% on GLUE over previous state-of-the-art methods. Code and models are available at https://github.com/HuangOwen/Quantization-Variation.
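The paper's actual variation-aware scheme is available in the linked repository; the sketch below only illustrates, under stated assumptions, the two standard ingredients the abstract refers to: uniform fake quantization with a straight-through estimator (the usual QAT mechanism) and a simple per-step metric for the weight oscillation behavior the authors analyze. The function names (`fake_quantize`, `oscillation_rate`) and the crude scale initialization are illustrative assumptions, not the authors' code.

```python
import torch

def fake_quantize(w, scale, n_bits=2):
    """Uniform symmetric fake quantization with a straight-through estimator (STE)."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. +1 for 2-bit symmetric
    qmin = -(2 ** (n_bits - 1))           # e.g. -2
    q = torch.clamp(torch.round(w / scale), qmin, qmax)
    w_q = q * scale
    # STE: forward pass uses the quantized value, backward passes gradients through unchanged.
    return w + (w_q - w).detach()

def oscillation_rate(prev_q, curr_q):
    """Fraction of weights whose quantized integer value flipped between two training steps,
    a simple proxy for the oscillation behavior discussed in the paper."""
    return (prev_q != curr_q).float().mean().item()

# Toy usage: quantize a weight tensor and track how many entries flip after one update.
w = torch.randn(128, 64, requires_grad=True)
scale = w.detach().abs().mean() * 2       # crude scale initialization (assumption)
prev_q = torch.round(w.detach() / scale).clamp(-2, 1)

loss = fake_quantize(w, scale, n_bits=2).pow(2).mean()
loss.backward()
with torch.no_grad():
    w -= 0.1 * w.grad                     # one plain SGD step
curr_q = torch.round(w.detach() / scale).clamp(-2, 1)
print(f"oscillation rate after one step: {oscillation_rate(prev_q, curr_q):.4f}")
```

In low-bit QAT, latent weights sitting near a quantization threshold can flip between adjacent integer levels on successive updates; tracking a metric like the one above is one simple way to observe the instability that the variation-aware scheme is designed to suppress.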

Authors (4)
  1. Xijie Huang (26 papers)
  2. Zhiqiang Shen (172 papers)
  3. Kwang-Ting Cheng (96 papers)
  4. Pingcheng Dong (8 papers)
Citations (11)