Sparse is Enough in Fine-tuning Pre-trained Large Language Models (2312.11875v3)

Published 19 Dec 2023 in cs.LG, cs.AI, and cs.CL

Abstract: With the prevalence of the pre-training-then-fine-tuning paradigm, how to efficiently adapt a pre-trained model to downstream tasks has become an intriguing question. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and is widely applied, its underlying principles remain unclear. In this paper, we adopt the PAC-Bayesian generalization error bound, viewing pre-training as a shift of the prior distribution that leads to a tighter bound on the generalization error. We validate this shift from the perspectives of oscillations in the loss landscape and quasi-sparsity in the gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE benchmark and instruction tuning. The code is accessible at https://github.com/song-wx/SIFT/.
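The abstract makes two technical claims that are easier to see concretely. First, the PAC-Bayesian argument: in a McAllester-style PAC-Bayes bound, the term that pre-training can influence is the KL divergence between the fine-tuned posterior Q and the prior P, so a pre-trained prior that already sits near good downstream solutions tightens the bound. The form below is a common textbook statement, not necessarily the exact bound used in the paper.

```latex
% A standard McAllester-style PAC-Bayes bound (one common form).
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% for every posterior Q over parameters and a prior P fixed before seeing S:
L(Q) \;\le\; \hat{L}_S(Q) \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
% Pre-training can be read as moving P closer to the eventual Q,
% shrinking KL(Q || P) and hence the bound.
```

Second, the quasi-sparsity observation motivates updating only the coordinates with the largest gradient magnitudes. Below is a minimal PyTorch sketch of that idea; it is not the official SIFT implementation (see the linked repository), and the function names, the calibration pass, and the top-k ratio `sparsity` are illustrative assumptions.

```python
# Minimal sketch of gradient-based sparse fine-tuning (SIFT-style idea).
# Assumptions: a calibration DataLoader, a loss function, and a dense
# per-parameter mask; a dedicated implementation may use sparse storage
# or hooks instead.
import torch


def build_sparse_masks(model, calib_loader, loss_fn, sparsity=0.01, device="cpu"):
    """Accumulate gradient magnitudes over a few calibration batches and
    keep the top `sparsity` fraction of entries in each parameter tensor."""
    model.to(device).train()
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for inputs, targets in calib_loader:
        model.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.abs()
    masks = {}
    for n, g in scores.items():
        k = max(1, int(sparsity * g.numel()))
        # Value of the k-th largest score; entries at or above it are kept.
        threshold = g.flatten().kthvalue(g.numel() - k + 1).values
        masks[n] = (g >= threshold).to(g.dtype)
    return masks


def mask_gradients(model, masks):
    """Call after loss.backward() and before optimizer.step() so that only
    the selected coordinates receive nonzero updates."""
    for n, p in model.named_parameters():
        if p.grad is not None and n in masks:
            p.grad.mul_(masks[n])
```

In a fine-tuning loop, the masks would be built once from a handful of batches and `mask_gradients` called on every step; dense masks like these reduce which weights change but not the memory footprint, which is where a purpose-built sparse implementation would differ.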
