OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models (2306.02272v4)

Published 4 Jun 2023 in cs.CL

Abstract: LLMs with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize an LLM's footprint through low-precision representation. OWQ prioritizes a small subset of structured weights that are sensitive to quantization, storing them in high precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning method for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization. The source code is available at https://github.com/xvyaward/owq

Advancing LLM Efficiency and Adaptability with Outlier-aware Weight Quantization and Weak Column Tuning

Introduction

Deploying LLMs effectively in real-world applications remains a formidable challenge due to their extensive memory and computation requirements. Recent advances in weight quantization, such as the OPTQ approach, have made strides toward alleviating these issues by compressing models to manageable sizes without a significant loss in performance. This paper introduces Outlier-aware Weight Quantization (OWQ), which builds on these precedents while adding a crucial innovation: it minimizes the memory footprint of LLMs by preserving a small, structured set of quantization-sensitive weights in high precision.

Outlier-aware Weight Quantization (OWQ)

OWQ identifies and preserves the subset of weights that are particularly susceptible to quality degradation under quantization, referred to as "weak columns." By exempting these columns from aggressive quantization, OWQ markedly reduces overall quantization error and preserves model quality even at extremely low precision (e.g., an effective 3.1 bits). Extensive empirical evaluation shows that OWQ improves considerably on previous state-of-the-art quantization methods, including the widely used OPTQ, particularly in fine-tuning and inference efficiency.
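
To make the idea concrete, the following minimal PyTorch sketch ranks the input columns of a linear layer by a Hessian-diagonal-times-error proxy for quantization sensitivity, keeps the top-ranked "weak" columns in full precision, and applies simple round-to-nearest quantization to the rest. The function name, the per-column scaling, and the calibration-data shapes are illustrative assumptions; the paper's actual procedure pairs its sensitivity criterion with an OPTQ-style error-compensating solver rather than plain rounding.

```python
import torch

def owq_quantize_layer(W, X, n_bits=3, n_weak=8):
    """Quantize one linear layer, keeping the most sensitive columns in FP.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations.
    Returns the mixed-precision weights, the weak-column indices, and the
    full-precision weak columns.
    """
    # Diagonal of the proxy Hessian H = X^T X: how strongly quantization
    # error in each input column is amplified on the calibration data.
    hess_diag = (X.float() ** 2).sum(dim=0)                  # (in_features,)

    # Simple per-column round-to-nearest quantization and its error.
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().amax(dim=0, keepdim=True) / qmax         # per-column scale
    W_q = torch.clamp((W / scale).round(), -qmax - 1, qmax) * scale
    col_err = ((W - W_q) ** 2).sum(dim=0)                    # (in_features,)

    # Sensitivity proxy: Hessian diagonal times squared quantization error.
    sensitivity = hess_diag * col_err
    weak_idx = torch.topk(sensitivity, n_weak).indices       # "weak columns"

    # Keep the weak columns at full precision, quantize everything else.
    W_mixed = W_q.clone()
    W_mixed[:, weak_idx] = W[:, weak_idx]
    return W_mixed, weak_idx, W[:, weak_idx].clone()

# Toy usage on random data.
W = torch.randn(128, 512)
X = torch.randn(32, 512)
W_mixed, weak_idx, weak_cols = owq_quantize_layer(W, X)
```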

Weak Column Tuning (WCT)

An integral advancement presented in this paper is Weak Column Tuning (WCT), a parameter-efficient fine-tuning scheme compatible with OWQ-optimized models. WCT updates only the high-precision weak columns identified during the OWQ process, balancing adaptability to task-specific distribution shifts against minimal memory and compute overhead. Compared with leading parameter-efficient fine-tuning approaches such as QLoRA, this yields strong task performance at lower memory cost, underscoring the dual advantage of OWQ in memory efficiency and task adaptability.
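
A minimal sketch of how such a layer could be wired up, assuming the quantized dense weights have already been dequantized into a frozen tensor: only the weak columns are registered as trainable parameters, so gradients and optimizer state cover a tiny slice of the layer. The class name and constructor signature are hypothetical, not the layout of the reference implementation.

```python
import torch
import torch.nn as nn

class OWQLinearWCT(nn.Module):
    """Linear layer with a frozen quantized dense part and trainable weak columns."""

    def __init__(self, W_quantized, weak_idx, weak_cols, bias=None):
        super().__init__()
        # Frozen dense weights with the weak columns zeroed out so their
        # contribution is not counted twice in the forward pass.
        W_frozen = W_quantized.clone()
        W_frozen[:, weak_idx] = 0.0
        self.register_buffer("W_frozen", W_frozen)
        self.register_buffer("weak_idx", weak_idx)
        # The only trainable weights: the high-precision weak columns.
        self.weak_cols = nn.Parameter(weak_cols.clone())
        self.bias = nn.Parameter(bias.clone()) if bias is not None else None

    def forward(self, x):
        y = x @ self.W_frozen.t()                            # frozen dense path
        y = y + x[..., self.weak_idx] @ self.weak_cols.t()   # trainable path
        return y if self.bias is None else y + self.bias

# Toy usage: only the weak columns receive gradients.
W_q = torch.randn(128, 512)           # stand-in for dequantized 3-bit weights
weak_idx = torch.tensor([3, 17, 42])  # stand-in for selected weak columns
layer = OWQLinearWCT(W_q, weak_idx, W_q[:, weak_idx])
opt = torch.optim.Adam(layer.parameters(), lr=1e-4)
```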

Experimental Validation

The advantages of OWQ and WCT over existing methods are validated extensively across a variety of benchmarks and model configurations. Models quantized to an effective 3.01 bits with OWQ achieve near-equivalent performance to 4-bit models produced by conventional techniques, a significant gain in quantization efficiency. In addition, WCT applied to pre-quantized models outperforms existing parameter-efficient tuning methods in both memory footprint and task-specific performance.
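
As a rough sanity check on such bit budgets (assuming the effective bit-width is a simple weighted average over weights and ignoring scale and index metadata, which the paper's exact accounting may include), keeping well under 0.1% of the weights in FP16 on top of a 3-bit base already lands near 3.01 bits; the 0.08% ratio below is illustrative, not the paper's setting.

```python
def effective_bits(base_bits, weak_ratio, weak_bits=16):
    """Average bits per weight when a fraction `weak_ratio` of the weights
    is stored at `weak_bits` and the rest at `base_bits` (metadata such as
    per-group scales is ignored in this back-of-the-envelope estimate)."""
    return base_bits * (1 - weak_ratio) + weak_bits * weak_ratio

print(effective_bits(3, 0.0008))  # ~3.01: a 3-bit base plus ~0.08% FP16 weights
```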

Future Directions

While the current instantiation of OWQ and WCT marks a substantial step forward in the practical deployment of LLMs, it also opens several avenues for future research. Exploring dynamically adaptive quantization schemes that can respond to variable task demands and model configurations could further enhance the versatility and efficiency of LLM deployments. Furthermore, integrating OWQ and WCT principles with emerging LLM architectures could catalyze the development of even more robust, adaptive, and efficient models suitable for a broader range of applications.

Conclusion

The OWQ technique, when coupled with the WCT scheme, represents a significant advancement in the optimization of LLMs for practical deployment. By addressing the challenge of maintaining model quality in extremely low-precision quantization scenarios and introducing an efficient mechanism for task-specific fine-tuning, this research paves the way for wider adoption and application of LLMs across diverse computational settings. The profound implications for both the theoretical understanding of model quantization and the practical deployment of LLMs warrant further investigation into this promising domain.

Authors (5)
  1. Changhun Lee
  2. Jungyu Jin
  3. Taesu Kim
  4. Hyungjun Kim
  5. Eunhyeok Park