SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models (2410.03750v1)
Abstract: Large pre-trained models (LPMs), such as LLMs, have become ubiquitous and are employed in many applications. These models are often adapted to a desired domain or downstream task through a fine-tuning stage. This paper proposes SQFT, an end-to-end solution for low-precision sparse parameter-efficient fine-tuning of LPMs, allowing for effective model manipulation in resource-constrained environments. Additionally, an innovative strategy enables the merging of sparse weights with low-rank adapters without losing sparsity or accuracy, overcoming the limitations of previous approaches. SQFT also addresses the challenge of having quantized weights and adapters with different numerical precisions, enabling merging in the desired numerical format without sacrificing accuracy. Experiments across multiple adaptation scenarios, models, and sparsity levels demonstrate the effectiveness of SQFT. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
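The abstract's central claim, that a low-rank adapter can be folded into a sparse base weight without destroying its sparsity pattern, can be illustrated with a small sketch. The snippet below is a minimal illustration and not SQFT's actual implementation: it assumes the dense low-rank product B @ A is masked with the base weight's sparsity pattern before being added to W, so pruned entries remain zero after the merge. The function and parameter names (merge_lora_preserving_sparsity, alpha) are hypothetical.

```python
import torch

def merge_lora_preserving_sparsity(W: torch.Tensor,
                                    A: torch.Tensor,
                                    B: torch.Tensor,
                                    alpha: float = 1.0) -> torch.Tensor:
    """Fold a low-rank update B @ A into a sparse base weight W while
    keeping W's sparsity pattern (hypothetical sketch, not SQFT's code)."""
    mask = (W != 0).to(W.dtype)      # sparsity pattern of the base weight
    delta = alpha * (B @ A)          # dense low-rank update
    return W + mask * delta          # masked merge: pruned entries stay zero

# Tiny usage example with random tensors and ~50% unstructured sparsity.
if __name__ == "__main__":
    d_out, d_in, r = 8, 16, 2
    W = torch.randn(d_out, d_in)
    W[torch.rand_like(W) < 0.5] = 0.0
    A, B = torch.randn(r, d_in), torch.randn(d_out, r)
    W_merged = merge_lora_preserving_sparsity(W, A, B)
    assert torch.all(W_merged[W == 0] == 0)  # sparsity pattern preserved
```

A plain merge (W + B @ A) would fill in the pruned positions and lose the sparsity that makes inference cheap, which is the limitation of prior approaches the abstract refers to; the masked merge keeps the zero structure at the cost of constraining the adapter update to the surviving weights.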