AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (2306.00978v5)

Published 1 Jun 2023 in cs.CL

Abstract: LLMs have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mixed-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

Activation-aware Weight Quantization: Enhancing LLM Compression and Deployment Efficiency

Overview of Activation-aware Weight Quantization (AWQ) Approach

Activation-aware Weight Quantization (AWQ) represents a significant advance in the quantization of LLMs, which are pivotal in natural language understanding and generation tasks. The core innovation of AWQ is a low-bit, weight-only quantization method that scales selected weight channels based on activation statistics rather than on the weight magnitudes themselves. This approach is grounded in the observation that protecting a small subset of salient weights, identified from the activation distribution and scaled up before quantization, greatly reduces quantization error.
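
As a concrete illustration of the activation-aware selection, the following minimal sketch (a hypothetical helper, not the authors' code) picks the most salient input channels of a linear layer by the mean absolute value of their calibration activations rather than by weight magnitude:

```python
# Minimal sketch: select ~1% "salient" input channels of a linear layer
# using offline activation statistics from a small calibration set.
import torch

@torch.no_grad()
def find_salient_channels(weight: torch.Tensor,
                          calib_activations: torch.Tensor,
                          frac: float = 0.01) -> torch.Tensor:
    """weight: [out_features, in_features]; calib_activations: [n_tokens, in_features]."""
    # Per-input-channel importance = mean |activation| over calibration tokens,
    # not the magnitude of the weights themselves.
    act_scale = calib_activations.abs().mean(dim=0)        # [in_features]
    k = max(1, int(frac * weight.shape[1]))
    return torch.topk(act_scale, k).indices                # channels to protect
```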

Key Findings and Methodological Insights

  • AWQ is premised on the observation that weights differ in their impact on model performance: protecting as little as 0.1% to 1% of salient weights substantially reduces quantization error.
  • Rather than keeping those weights in a hardware-unfriendly mixed-precision format, the method applies per-channel scaling derived from activation statistics to protect them (a minimal sketch of this scaling search follows the list).
  • Unlike prior methods that may depend on backpropagation or reconstruction, AWQ requires no such processes, thus preserving generalization across diverse domains and modalities without overfitting to the calibration set.
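
The sketch below is a hedged illustration of the scaling search described above, under simplifying assumptions: a simulated round-to-nearest, group-wise quantizer stands in for real INT4 kernels, and the function names (`pseudo_quantize`, `search_awq_scales`) are illustrative rather than the released AWQ code. The idea is that the per-channel scale is derived from activation statistics as s = s_x^α, with α chosen by a small grid search that minimizes the layer's output error after quantization; at inference the inverse scale is folded into the preceding operator, so the transformation is mathematically equivalent.

```python
import torch

@torch.no_grad()
def pseudo_quantize(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulated round-to-nearest, group-wise asymmetric quantization
    (assumes in_features is a multiple of group_size)."""
    shape = w.shape
    w = w.reshape(-1, group_size)
    w_max, w_min = w.amax(dim=1, keepdim=True), w.amin(dim=1, keepdim=True)
    scales = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
    zeros = (-w_min / scales).round()
    q = torch.clamp((w / scales).round() + zeros, 0, 2 ** n_bits - 1)
    return ((q - zeros) * scales).reshape(shape)

@torch.no_grad()
def search_awq_scales(weight: torch.Tensor, calib_x: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    """Pick per-input-channel scales s = act_scale**alpha that minimize the
    layer output error once W * diag(s) is quantized."""
    act_scale = calib_x.abs().mean(dim=0)                  # [in_features]
    ref_out = calib_x @ weight.t()                         # full-precision reference
    best_err, best_s = float("inf"), torch.ones_like(act_scale)
    for i in range(n_grid + 1):
        alpha = i / n_grid
        s = act_scale.clamp(min=1e-4) ** alpha
        s = s / (s.max() * s.min()).sqrt()                 # keep scales balanced
        w_q = pseudo_quantize(weight * s)                  # quantize the scaled weight
        err = ((calib_x / s) @ w_q.t() - ref_out).pow(2).mean().item()
        if err < best_err:
            best_err, best_s = err, s
    return best_s
```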

Comparative Performance and Results

  • Empirical evaluations demonstrate AWQ's superior performance over existing quantization approaches across a variety of benchmarks, including language modeling and domain-specific tasks such as coding and math.
  • Combined with the accompanying TinyChat framework, AWQ achieves speedups of more than 3× over the Huggingface FP16 implementation on desktop and mobile GPUs, and its 4-bit weights allow much larger models to run on hardware with limited memory capacity (a rough estimate follows this list).
  • The method's strengths extend to instruction-tuned LMs and, notably, to multi-modal LMs, marking the first application of this style of low-bit quantization to such models.
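
To make the memory argument concrete, the back-of-the-envelope estimate below (assuming a group size of 128 with one FP16 scale and one FP16 zero point per group, and ignoring activations and the KV cache) shows why 70B-class models become deployable on devices with tens of gigabytes of memory once the weights are 4-bit:

```python
# Rough weight-memory estimate; quantization metadata overhead is an assumption.
def weight_memory_gb(n_params: float, bits: int, group_size: int = 0) -> float:
    total_bytes = n_params * bits / 8                      # packed weight payload
    if group_size:                                         # FP16 scale + zero per group
        total_bytes += (n_params / group_size) * 2 * 2
    return total_bytes / 1e9

print(f"Llama-2-70B FP16: {weight_memory_gb(70e9, 16):.0f} GB")                 # ~140 GB
print(f"Llama-2-70B INT4: {weight_memory_gb(70e9, 4, group_size=128):.0f} GB")  # ~37 GB
```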

Implications and Future Directions

The introduction and implementation of AWQ have multiple implications for the field of AI and LLM research:

  • It addresses a critical challenge in the deployment of LLMs, making them more accessible and efficient for real-world applications, especially on edge devices.
  • The method’s hardware efficiency and broad applicability suggest a wider adoption in serving solutions and platforms, potentially becoming a standard approach in LLM quantization.
  • Future research may explore further optimizations in scaling techniques and extend the approach to a wider array of model architectures and tasks.

System Implementation and Deployment

An efficient and flexible inference framework, TinyChat, accompanies the AWQ approach, translating its theoretical benefits into practical performance gains. The framework leverages kernel fusion and platform-aware weight packing to realize these speedups in 4-bit LLM inference. The successful deployment of models as large as the 70B-parameter Llama-2 on constrained devices such as the NVIDIA Jetson Orin exemplifies the real-world applicability of AWQ.
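
As a rough illustration of what platform-aware weight packing involves, the NumPy sketch below only shows the data-layout idea (two 4-bit codes per byte, unpacked on the fly); TinyChat's actual GPU kernels fuse this unpacking and dequantization into the matrix multiplication, which this sketch does not attempt to reproduce.

```python
# Illustrative 4-bit packing/unpacking; layout choices here are assumptions.
import numpy as np

def pack_int4(codes: np.ndarray) -> np.ndarray:
    """codes: uint8 array of 4-bit values (0..15), even number of elements."""
    codes = codes.reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << 4)).astype(np.uint8)   # low nibble first

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return np.stack([low, high], axis=1).reshape(-1)

codes = np.random.randint(0, 16, size=16, dtype=np.uint8)
assert np.array_equal(unpack_int4(pack_int4(codes)), codes)      # round-trip check
```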

Conclusion

Activation-aware Weight Quantization (AWQ) represents a significant step forward in the efficient quantization of LLMs. Through its focus on weight saliency and activation-guided per-channel scaling, AWQ preserves model quality after low-bit quantization where prior methods degrade it. Its applicability across a wide range of LLMs and tasks, coupled with substantial speed and memory gains, positions AWQ as a pivotal advance in making powerful AI models practical for on-device and resource-constrained deployment.

Authors (10)
  1. Ji Lin (47 papers)
  2. Jiaming Tang (8 papers)
  3. Haotian Tang (28 papers)
  4. Shang Yang (12 papers)
  5. Xingyu Dang (3 papers)
  6. Chuang Gan (195 papers)
  7. Song Han (155 papers)
  8. Wei-Ming Chen (25 papers)
  9. Wei-Chen Wang (11 papers)
  10. Guangxuan Xiao (16 papers)
Citations (299)