FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization (2402.17985v1)

Published 28 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have demonstrated state-of-the-art performance across various tasks, but their inference latency and large GPU memory consumption hinder deployment. Recent work has made efficient attempts to quantize LLMs, yet inference with large batch sizes or long sequences remains compute-bound. Fine-grained quantization methods achieve low-bit quantization for LLMs but still require the FP16 data type for linear-layer computation, which is time-consuming for large batch sizes or long sequences. In this paper, we introduce FlattenQuant, a method that significantly reduces the maximum value of a tensor by flattening its large channels, achieving low-bit per-tensor quantization with minimal accuracy loss. Our experiments show that FlattenQuant can directly use 4 bits for 48.29% of the linear-layer computation in LLMs, with the remaining layers using 8 bits. The 4-bit matrix multiplication introduced by FlattenQuant effectively addresses the compute bound caused by large matrix computation. Our work achieves up to 2$\times$ speedup and 2.3$\times$ memory reduction for LLMs with negligible loss in accuracy.
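
As a rough illustration of the idea in the abstract, the sketch below flattens outlier activation channels (splitting each high-magnitude channel into several scaled-down copies and replicating the matching weight rows so the matrix product is unchanged) and then applies symmetric per-tensor quantization. This is a hedged reconstruction based only on the abstract: the function names, threshold, and splitting rule are assumptions for illustration, not the paper's actual algorithm or code.

```python
import torch

def flatten_channels(x, w, threshold):
    """Hypothetical sketch of channel flattening before per-tensor quantization.

    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features)
    Channels of x whose absolute maximum exceeds `threshold` are split into
    several scaled-down copies, and the matching rows of w are replicated so
    that x_flat @ w_flat equals x @ w.
    """
    new_cols, new_rows = [], []
    for c in range(x.shape[1]):
        col, row = x[:, c], w[c, :]
        # number of pieces needed so each piece's max falls below the threshold
        k = max(1, int(torch.ceil(col.abs().max() / threshold).item()))
        for _ in range(k):
            new_cols.append(col / k)   # scale the activation channel down
            new_rows.append(row)       # replicate the weight row to preserve the product
    return torch.stack(new_cols, dim=1), torch.stack(new_rows, dim=0)

def per_tensor_quant(t, n_bits=4):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max() / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

# Toy check: flattening preserves the matmul result while shrinking the
# per-tensor maximum that the 4-bit scale has to cover.
x = torch.randn(8, 16)
x[:, 3] *= 50                      # simulate one outlier channel
w = torch.randn(16, 32)
x_f, w_f = flatten_channels(x, w, threshold=6.0)
assert torch.allclose(x_f @ w_f, x @ w, atol=1e-4)
```

The point of the per-tensor scale is that the resulting GEMM can run entirely in low-bit integer arithmetic, which is what relieves the compute bound for large-batch or long-sequence inference described in the abstract.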

Authors (5)
  1. Yi Zhang (994 papers)
  2. Fei Yang (110 papers)
  3. Shuang Peng (11 papers)
  4. Fangyu Wang (5 papers)
  5. Aimin Pan (3 papers)
Citations (1)