Sparse is Enough in Fine-tuning Pre-trained Large Language Models (2312.11875v3)
Abstract: With the prevalence of the pre-training-fine-tuning paradigm, how to efficiently adapt pre-trained models to downstream tasks has become an intriguing question. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and is widely applied, its underlying principles remain unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of the prior distribution that leads to a tighter bound on the generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and the quasi-sparsity of the gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.
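For orientation, the PAC-Bayesian argument can be summarized with a standard McAllester-style bound; the form below is a generic illustration, and the paper's exact statement, assumptions, and notation may differ.

```latex
% Standard McAllester-style PAC-Bayes bound (illustrative; not the paper's
% exact statement). Q is the posterior over fine-tuned parameters, P a prior
% centered at the pre-trained weights, n the number of training samples,
% and \delta the confidence level.
\[
\mathbb{E}_{\theta \sim Q}\, L(\theta)
\;\le\;
\mathbb{E}_{\theta \sim Q}\, \hat{L}(\theta)
\;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln \frac{2\sqrt{n}}{\delta}}{2n}} .
\]
% Intuition: if fine-tuning only needs a sparse increment on top of the
% pre-trained weights, Q stays close to the pre-training-induced prior P,
% so KL(Q||P) is small and the bound is correspondingly tighter.
```

The algorithmic idea is to update only a sparse subset of parameter entries selected by gradient magnitude on the downstream task. The following is a minimal PyTorch-style sketch of that general recipe, assuming top-k selection over accumulated squared gradients and a gradient-masking step before each optimizer update; it is not the authors' implementation (see the linked repository for the official code), and `keep_ratio`, `n_batches`, and the helper names are illustrative choices.

```python
import torch

def compute_sparse_masks(model, data_loader, loss_fn, keep_ratio=0.01, n_batches=4):
    """Estimate per-parameter gradient magnitudes on a few downstream batches
    and keep only the top `keep_ratio` fraction of entries in each tensor.
    (Illustrative selection rule; the official SIFT code may differ.)"""
    grad_sq = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None and n in grad_sq:
                grad_sq[n] += p.grad.detach() ** 2
    masks = {}
    for n, g in grad_sq.items():
        k = max(1, int(keep_ratio * g.numel()))
        threshold = g.flatten().topk(k).values.min()
        masks[n] = (g >= threshold).float()
    return masks

def apply_masks_to_grads(model, masks):
    """Zero out gradients of unselected entries so the optimizer only
    updates the sparse increment."""
    for n, p in model.named_parameters():
        if p.grad is not None and n in masks:
            p.grad.mul_(masks[n])

# Usage sketch: select the sparse subset once, then fine-tune as usual.
# masks = compute_sparse_masks(model, train_loader, loss_fn)
# for inputs, targets in train_loader:
#     optimizer.zero_grad()
#     loss_fn(model(inputs), targets).backward()
#     apply_masks_to_grads(model, masks)
#     optimizer.step()
```

Because only the masked entries receive nonzero updates, the fine-tuned model stays within a sparse increment of the pre-trained weights, which is the property the PAC-Bayesian view above relies on.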
Authors: Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du