Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization (2311.16442v4)

Published 28 Nov 2023 in cs.LG and cs.DC

Abstract: LLMs have demonstrated impressive abilities in various domains, but their inference cost is high. Many previous studies exploit quantization to reduce LLM inference cost by lowering latency and memory consumption. Applying single-precision 2-bit weight quantization brings >3% accuracy loss, so state-of-the-art methods use mixed-precision quantization for LLMs (e.g., Llama2-7b) to improve accuracy. However, challenges remain: (1) uneven weight distribution within a weight matrix, (2) large speed degradation from adding sparse outliers, and (3) time-consuming dequantization operations on GPUs. To tackle these challenges and enable fast and efficient LLM inference on GPUs, we propose the following techniques: (1) intra-weight mixed-precision quantization, (2) exclusive 2-bit sparse outliers with minimum speed degradation, and (3) asynchronous dequantization. We conduct extensive experiments on different model families (e.g., Llama3) and model sizes. We achieve 2.91 bits per weight, counting all scales/zeros, for different models with negligible loss. As a result, with our 2/4/16 mixed-precision quantization for each weight matrix and asynchronous dequantization during inference, our design achieves a 1.74x end-to-end speedup for Llama2-7b over the original model, and we reduce runtime cost and total cost by up to 2.53x and 2.29x, respectively, while requiring fewer GPUs.
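As a rough illustration of the intra-weight mixed-precision idea described above (most weight groups quantized to 2 bits, range-heavy groups promoted to 4 bits, plus a small set of FP16 sparse outliers), here is a minimal NumPy sketch. The group size, the 25% high-precision fraction, the 0.5% outlier fraction, and all function names are illustrative assumptions, not the paper's implementation; in particular, the paper's GPU kernels with asynchronous dequantization are not modeled here.

```python
# Minimal sketch (not the authors' code): group-wise mixed 2/4-bit weight
# quantization with FP16 sparse outliers. All thresholds are illustrative.
import numpy as np

def quantize_dequantize(g, bits):
    """Asymmetric uniform quantization followed by dequantization,
    applied independently per group along the last axis."""
    qmax = (1 << bits) - 1
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    return q * scale + lo

def mixed_precision_quantize(W, group_size=128, hi_prec_frac=0.25, outlier_frac=0.005):
    """Sketch of intra-matrix 2/4-bit quantization with FP16 sparse outliers.
    Returns the dequantized matrix and an estimate of average bits per weight."""
    rows, cols = W.shape
    assert cols % group_size == 0, "cols must be a multiple of group_size"
    W = W.astype(np.float32).copy()

    # 1) Sparse FP16 outliers: keep the largest-magnitude weights unquantized.
    k = max(1, int(outlier_frac * W.size))
    flat = W.ravel()
    out_idx = np.argpartition(np.abs(flat), -k)[-k:]
    out_val = flat[out_idx].copy()
    flat[out_idx] = 0.0  # quantize only the remaining (inlier) weights

    # 2) Intra-matrix mixed precision: groups with the widest value range get
    #    4 bits, the rest get 2 bits (hi_prec_frac is an illustrative knob).
    G = W.reshape(rows, cols // group_size, group_size)
    group_range = G.max(axis=-1) - G.min(axis=-1)
    use4 = group_range >= np.quantile(group_range, 1.0 - hi_prec_frac)
    deq = np.where(use4[..., None],
                   quantize_dequantize(G, 4),
                   quantize_dequantize(G, 2))

    W_hat = deq.reshape(rows, cols)
    W_hat.ravel()[out_idx] = out_val  # splice the FP16 outliers back in

    # Storage estimate: 2/4-bit codes + one FP16 scale and zero per group
    # + FP16 outlier values (outlier index storage is ignored here).
    code_bits = int(np.where(use4, 4, 2).sum()) * group_size
    meta_bits = use4.size * 2 * 16
    avg_bits = (code_bits + meta_bits + k * 16) / W.size
    return W_hat, avg_bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 1024)).astype(np.float32)
    W_hat, avg_bits = mixed_precision_quantize(W)
    print(f"average bits/weight: {avg_bits:.2f}, "
          f"reconstruction MSE: {np.mean((W - W_hat) ** 2):.5f}")
```

With these illustrative settings, the storage estimate comes out near 2.8 bits per weight, in the same ballpark as the 2.91 bits per weight (including all scales/zeros) reported in the abstract.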

Authors (7)
  1. Jinhao Li (21 papers)
  2. Shiyao Li (17 papers)
  3. Jiaming Xu (86 papers)
  4. Shan Huang (69 papers)
  5. Yaoxiu Lian (3 papers)
  6. Jun Liu (606 papers)
  7. Guohao Dai (51 papers)
Citations (2)