QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources (2310.07147v1)
Abstract: LLMs have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pre-trained models on downstream datasets provides further significant performance gains, but the process has been challenging due to its extraordinary resource requirements. Existing efforts therefore focus on parameter-efficient fine-tuning, which unfortunately fails to capitalize on the powerful potential of full-parameter fine-tuning. In this work, we propose QFT, a novel Quantized Full-parameter Tuning framework for LLMs that enables memory-efficient fine-tuning without harming performance. Our framework incorporates two novel ideas: (i) we adopt the efficient Lion optimizer, which tracks only the momentum and applies update magnitudes that are consistent across parameters, an inherent advantage for robust quantization; and (ii) we quantize all model states and store them as integer values, and present a gradient flow and parameter update scheme for the quantized weights. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance; for example, tuning a LLaMA-7B model requires under 30 GB of memory, which a single A6000 GPU can provide.
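The role of the Lion optimizer can be illustrated with a minimal sketch (PyTorch, assuming a simple symmetric per-tensor int8 quantizer and illustrative hyperparameters; the function names and quantization scheme below are hypothetical and do not reproduce QFT's actual gradient-flow or weight-quantization design). Because Lion's update is the sign of a momentum/gradient blend, every parameter moves by the same magnitude, which is why storing the optimizer state in coarse integer form is comparatively safe:

```python
# Minimal sketch, NOT the paper's implementation: a Lion-style step with the
# momentum state held in int8 between updates.
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor int8 quantization (illustrative; not QFT's exact scheme)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

def lion_step_quantized(param, grad, m_q, m_scale,
                        lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update; the momentum is dequantized, used, and re-quantized."""
    m = dequantize_int8(m_q, m_scale)                    # recover momentum
    update = torch.sign(beta1 * m + (1 - beta1) * grad)  # sign-based direction
    param -= lr * (update + weight_decay * param)        # uniform +/- lr step (plus decoupled decay)
    m = beta2 * m + (1 - beta2) * grad                   # momentum update
    m_q, m_scale = quantize_int8(m)                      # store the state as int8 again
    return param, m_q, m_scale

# Toy usage on a random tensor
p = torch.randn(1024)
g = torch.randn(1024)
m_q, m_scale = quantize_int8(torch.zeros_like(p))
p, m_q, m_scale = lion_step_quantized(p, g, m_q, m_scale)
```

In the full framework, the weights and gradient flow are also handled in quantized integer form, which the sketch above does not attempt to capture.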
Authors: Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer