Sparse is Enough in Fine-tuning Pre-trained Large Language Models (2312.11875v3)

Published 19 Dec 2023 in cs.LG, cs.AI, and cs.CL

Abstract: With the prevalence of the pre-training-then-fine-tuning paradigm, how to efficiently adapt a pre-trained model to downstream tasks has become an intriguing issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated its effectiveness and is widely applied, its underlying principles remain unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of the prior distribution that leads to a tighter bound on the generalization error. We validate this shift from the perspectives of oscillation in the loss landscape and quasi-sparsity in the gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE benchmark and instruction tuning. The code is accessible at https://github.com/song-wx/SIFT/.
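
The abstract's argument hinges on the PAC-Bayesian generalization error bound. For reference, a standard McAllester-style form of the bound is sketched below; this is the textbook statement, not necessarily the exact variant derived in the paper. Here P is the prior (intuitively, the pre-trained initialization), Q the posterior reached by fine-tuning, L and \hat{L}_S the population and empirical losses on an n-sample training set S, and \delta the confidence level.

```latex
% Standard McAllester-style PAC-Bayes bound (textbook form; the paper's exact
% variant may differ). With probability at least 1 - \delta over the draw of S,
% simultaneously for all posteriors Q:
\[
  \mathbb{E}_{h \sim Q}\big[L(h)\big]
  \;\le\;
  \mathbb{E}_{h \sim Q}\big[\hat{L}_S(h)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\]
% Pre-training can be read as shifting the prior P closer to good posteriors Q,
% shrinking the KL term and hence tightening the bound -- the intuition the
% abstract appeals to.
```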

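To make the sparse fine-tuning idea concrete, here is a minimal PyTorch-style sketch of gradient-magnitude-based sparse updates: a mask over the highest-magnitude gradient coordinates is estimated on a few calibration batches, and only those coordinates are updated afterwards. This is an illustration of the general technique, not the authors' implementation; the function names, the `keep_ratio` parameter, and the calibration procedure are hypothetical, and the official code is in the linked repository.

```python
# Illustrative sketch of gradient-magnitude-based sparse fine-tuning.
# Not the official SIFT code (see https://github.com/song-wx/SIFT/ for that).
import torch


def estimate_masks(model, loss_fn, calib_batches, keep_ratio=0.01):
    """Keep the top `keep_ratio` fraction of entries per parameter tensor,
    ranked by gradient magnitude accumulated over a few calibration batches."""
    grads = {n: torch.zeros_like(p)
             for n, p in model.named_parameters() if p.requires_grad}
    for inputs, targets in calib_batches:
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if n in grads and p.grad is not None:
                grads[n] += p.grad.abs()
    masks = {}
    for n, g in grads.items():
        k = max(1, int(keep_ratio * g.numel()))
        # Threshold at the k-th largest accumulated gradient magnitude.
        threshold = g.flatten().kthvalue(g.numel() - k + 1).values
        masks[n] = (g >= threshold).float()
    return masks


def sparse_finetune_step(model, loss_fn, optimizer, batch, masks):
    """One optimizer step in which only the masked coordinates are allowed to move."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is not None and n in masks:
                p.grad.mul_(masks[n])  # zero gradients outside the sparse support
    optimizer.step()
```

With a plain SGD optimizer, masking the gradient keeps the unselected weights exactly frozen; a stateful optimizer such as AdamW would also need its weight-decay and momentum terms masked. A practical implementation would store only the selected increments in a sparse format rather than the dense masks used here, which are kept only to keep the sketch short.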
Authors (5)
  1. Weixi Song (3 papers)
  2. Zuchao Li (76 papers)
  3. Lefei Zhang (64 papers)
  4. Hai Zhao (227 papers)
  5. Bo Du (264 papers)
Citations (3)
