
FrameQuant: Flexible Low-Bit Quantization for Transformers (2403.06082v2)

Published 10 Mar 2024 in cs.LG and cs.CL

Abstract: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so serving such models is expensive, often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains. The code is available at https://github.com/vsingh-group/FrameQuant
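The repository linked above contains the authors' implementation. As a rough illustration of the core idea only (quantizing weights in a redundant frame representation rather than in the original weight space, then reconstructing with the dual transform), a minimal NumPy sketch might look like the following. The random Parseval frame, the redundancy factor, and the simple symmetric 2-bit quantizer here are illustrative assumptions, not the paper's fusion-frame construction or its closed-form denoising filters.

```python
import numpy as np

def random_tight_frame(d, n, seed=0):
    """Return an n x d analysis matrix with orthonormal columns (a Parseval
    frame for R^d when n >= d). Illustrative only; FrameQuant builds
    structured fusion frames instead."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    Q, _ = np.linalg.qr(A)      # Q: n x d, Q.T @ Q = I_d
    return Q                    # analysis operator F; synthesis is F.T

def quantize_2bit(x):
    """Symmetric uniform 2-bit quantizer with levels {-1.5,-0.5,0.5,1.5}*scale."""
    scale = np.abs(x).max() / 1.5 + 1e-12
    q = np.clip(np.round(x / scale - 0.5), -2, 1) + 0.5
    return q * scale

def frame_domain_quantize(W, redundancy=1.25):
    """Quantize W in the frame-coefficient domain and map back.

    Quantization error is injected on the redundant coefficients, so the
    linear reconstruction F.T @ (.) acts as a simple noise-averaging step."""
    d, _ = W.shape
    n = int(np.ceil(redundancy * d))
    F = random_tight_frame(d, n)
    coeffs = F @ W                     # analysis: lift to redundant space
    coeffs_q = quantize_2bit(coeffs)   # "add noise" in frame coefficients
    return F.T @ coeffs_q              # synthesis with the Parseval dual

if __name__ == "__main__":
    W = np.random.default_rng(1).standard_normal((256, 64))
    err_frame = np.linalg.norm(frame_domain_quantize(W) - W) / np.linalg.norm(W)
    err_plain = np.linalg.norm(quantize_2bit(W) - W) / np.linalg.norm(W)
    print(f"relative error, frame-domain 2-bit: {err_frame:.3f}")
    print(f"relative error, direct 2-bit:       {err_plain:.3f}")
```

The sketch compares reconstruction error when the same 2-bit quantizer is applied directly to the weights versus in a modestly redundant frame representation; the actual method layers the fusion-frame structure, calibration, and denoising on top of this basic picture.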

