Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth (2505.03802v3)

Published 2 May 2025 in cs.LG and cs.AI

Abstract: QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLMs). Recently, methods that use SVD-based iterative updates to initialize the LoRA matrices so that they absorb quantization error have generally failed to deliver consistent performance gains. Dynamic mixed precision is a natural way to keep improving the fine-tuning performance of quantized models, but previous methods often optimize the low-rank subspace or the quantization components separately, without considering their synergy. To address this, we propose QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search each layer's quantization configuration and the rank of its low-rank space, thereby continuously improving model performance. Rather than minimizing quantization error, QR-Adaptor treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared to state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.
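To make the idea of a gradient-free, per-layer search over (bitwidth, rank) pairs concrete, here is a minimal Python sketch. It is not the paper's implementation: it uses simple random search as a stand-in for QR-Adaptor's optimizer, and the helpers `memory_cost` and `evaluate_on_calibration` (which would quantize each layer, attach a LoRA adapter of the chosen rank, briefly fine-tune on calibration data, and return downstream accuracy) are hypothetical placeholders.

```python
# Hedged sketch: joint per-layer search over quantization bitwidth and LoRA rank,
# guided by downstream performance under a memory budget. Random search stands in
# for the paper's gradient-free optimizer; helper functions are placeholders.
import random
from typing import Dict, Tuple

BITWIDTHS = [2, 3, 4, 8]   # candidate per-layer quantization precisions (bits)
RANKS = [4, 8, 16, 32]     # candidate per-layer LoRA ranks

Config = Dict[str, Tuple[int, int]]  # layer name -> (bitwidth, rank)


def memory_cost(config: Config, layer_params: Dict[str, int]) -> float:
    """Crude memory estimate in bytes: quantized base weights plus fp16 LoRA adapters."""
    total = 0.0
    for name, (bits, rank) in config.items():
        n = layer_params[name]                 # number of parameters in this layer
        total += n * bits / 8                  # quantized weights
        total += 2 * rank * (n ** 0.5) * 2     # rough A/B adapter estimate, fp16
    return total


def evaluate_on_calibration(config: Config) -> float:
    """Placeholder score. A real implementation would quantize each layer to its
    bitwidth, attach a LoRA adapter of the chosen rank, fine-tune briefly on
    calibration data, and return task accuracy (e.g., on a GSM8K dev split)."""
    return random.random()  # stand-in value so the sketch runs end to end


def random_search(layer_params: Dict[str, int],
                  memory_budget: float,
                  n_trials: int = 50) -> Config:
    """Gradient-free search: sample per-layer (bitwidth, rank) assignments and
    keep the best-scoring configuration that fits within the memory budget."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: (random.choice(BITWIDTHS), random.choice(RANKS))
               for name in layer_params}
        if memory_cost(cfg, layer_params) > memory_budget:
            continue  # reject configurations that exceed the memory budget
        score = evaluate_on_calibration(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg


if __name__ == "__main__":
    # Toy example: two layers with made-up parameter counts and budget.
    layers = {"layer.0": 4_000_000, "layer.1": 4_000_000}
    print(random_search(layers, memory_budget=6_000_000))
```

The key point the sketch illustrates is that the objective is measured downstream accuracy under a memory constraint, not per-layer quantization error, which matches the abstract's framing of precision and rank allocation as a discrete optimization problem.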

Authors (6)
  1. Changhai Zhou (7 papers)
  2. Shijie Han (8 papers)
  3. Shiyang Zhang (10 papers)
  4. Yuhua Zhou (8 papers)
  5. Weizhong Zhang (40 papers)
  6. Cheng Jin (76 papers)