L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models (2402.04902v5)

Published 7 Feb 2024 in cs.LG and cs.CL

Abstract: Due to the high memory and computational costs associated with LLMs, model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically apply post-training quantization (PTQ) to pre-trained LLMs, followed by PEFT to recover the lost accuracy. However, this approach is limited in how much of the lost accuracy it can recover. In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q significantly reduces QAT's memory overhead, making its training cost comparable to LoRA, while preserving the advantage of QAT in producing fully quantized LLMs with high accuracy. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in 4-bit and 3-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA and Mistral models with instructional datasets, we showcase L4Q's capabilities in language tasks and few-shot learning.
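The contrast the abstract draws between decoupled schemes (PTQ first, then PEFT) and a combined scheme can be illustrated with a toy example. The sketch below is not the L4Q implementation: it assumes uniform symmetric round-to-nearest quantization with a single per-tensor scale, uses random stand-in weights, and skips training entirely (a real QAT setup would also need a gradient approximation such as a straight-through estimator). All names (`quantize`, `matmul_flat`, `on_grid`, the matrix sizes) are hypothetical. It shows only one point: adding a full-precision low-rank update to already quantized weights leaves the effective weights off the quantization grid, whereas quantizing the merged weights keeps them on it.

```python
import random


def quantize(weights, bits):
    """Uniform symmetric round-to-nearest quantization of a flat list of floats.

    Returns (integer codes, scale). One per-tensor scale is an assumption
    made for this sketch, not a detail taken from the abstract.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def matmul_flat(B, A):
    """Low-rank product B @ A, returned flattened in row-major order."""
    rows, rank, cols = len(B), len(A), len(A[0])
    return [
        sum(B[i][k] * A[k][j] for k in range(rank))
        for i in range(rows)
        for j in range(cols)
    ]


def on_grid(values, scale, tol=1e-9):
    """True if every value is an integer multiple of `scale` (within tol)."""
    return all(abs(v / scale - round(v / scale)) < tol for v in values)


random.seed(0)
OUT, IN, RANK, BITS = 4, 6, 2, 4  # toy sizes, chosen only for illustration

W = [random.gauss(0, 1) for _ in range(OUT * IN)]  # stand-in pre-trained weights
B = [[random.gauss(0, 0.1) for _ in range(RANK)] for _ in range(OUT)]
A = [[random.gauss(0, 0.1) for _ in range(IN)] for _ in range(RANK)]
delta = matmul_flat(B, A)  # stand-in LoRA update

# Decoupled: quantize W first, then add the full-precision LoRA update.
codes_w, scale_w = quantize(W, BITS)
decoupled = [c * scale_w + d for c, d in zip(codes_w, delta)]

# Combined: merge W and the LoRA update, then quantize the result.
merged = [w + d for w, d in zip(W, delta)]
codes_m, scale_m = quantize(merged, BITS)
joint = [c * scale_m for c in codes_m]

print("decoupled weights on the 4-bit grid:", on_grid(decoupled, scale_w))
print("combined weights on the 4-bit grid: ", on_grid(joint, scale_m))
```

In the decoupled case the low-rank update has to be kept in higher precision at inference time, or re-quantized afterwards at some further accuracy cost; in the combined case the stored integer codes and scale fully describe the layer.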


Authors (3)

Hyesung Jeon
Yulhwa Kim
Jae-Joon Kim


