Targeted Efficient Fine-tuning: Optimizing Parameter Updates with Data-Driven Sample Selection (2403.08484v2)

Published 13 Mar 2024 in cs.CL

Abstract: Fine-tuning all parameters of LLMs is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by fine-tuning only a subset of parameters, and most PEFT methods center on selecting or introducing the set of parameters to be tuned. However, few methods consider the impact of data samples on parameter selection. A representative data-driven method is the FISH Mask approach, which randomly selects a portion of data samples as the basis for choosing parameters; such random sample selection may fail to identify optimal parameters under unstable data distributions. In this work, we introduce a data-centric approach and propose the Iterative Range Decreasing (IRD) algorithm to optimize the sample-parameter pair selection in FISH Mask. IRD iteratively refines the selection by identifying subsets of samples and parameters exhibiting higher Fisher information. We demonstrate the effectiveness and rationality of the proposed strategy through experiments on the GLUE benchmark. Experimental results show that our strategy optimizes parameter selection and achieves preferable performance over several typical baseline methods.
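To make the FISH Mask selection and the IRD refinement described in the abstract concrete, below is a minimal, self-contained PyTorch sketch, not the authors' implementation: it uses a toy classifier and random data in place of an LLM and GLUE, and the helper names (empirical_fisher, per_sample_scores, topk_mask), the 5% keep ratio, and the sample-halving schedule are illustrative assumptions. It approximates Fisher information by the diagonal empirical Fisher (mean squared gradients), keeps the highest-scoring parameters as a sparse mask, and then iteratively shrinks the sample set toward the samples contributing the most Fisher mass to that mask.

```python
# Illustrative sketch (not the paper's code) of FISH-Mask-style parameter
# selection plus an IRD-like sample refinement loop. The model, data, keep
# ratio, and halving schedule are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: a small classifier and random "labelled samples".
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
X, y = torch.randn(64, 16), torch.randint(0, 2, (64,))

def empirical_fisher(model, xs, ys):
    """Diagonal empirical Fisher: mean squared gradient per parameter."""
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for x, t in zip(xs, ys):
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), t.unsqueeze(0))
        loss.backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2
    return [f / len(xs) for f in fisher]

def per_sample_scores(model, xs, ys, mask):
    """Score each sample by the Fisher mass it places on the masked parameters."""
    scores = []
    for x, t in zip(xs, ys):
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), t.unsqueeze(0))
        loss.backward()
        s = sum(((p.grad.detach() ** 2) * m).sum()
                for p, m in zip(model.parameters(), mask))
        scores.append(s.item())
    return torch.tensor(scores)

def topk_mask(fisher, keep_ratio=0.05):
    """Keep the keep_ratio fraction of parameters with the largest Fisher values."""
    flat = torch.cat([f.flatten() for f in fisher])
    k = max(1, int(keep_ratio * flat.numel()))
    thresh = torch.topk(flat, k).values.min()
    return [(f >= thresh).float() for f in fisher]

# IRD-like loop (assumed form): alternately recompute the parameter mask from
# the current sample subset, then shrink the subset to the highest-scoring samples.
idx = torch.arange(len(X))
for _ in range(3):
    fisher = empirical_fisher(model, X[idx], y[idx])
    mask = topk_mask(fisher, keep_ratio=0.05)
    scores = per_sample_scores(model, X[idx], y[idx], mask)
    keep = max(4, len(idx) // 2)                 # decrease the sample range
    idx = idx[torch.topk(scores, keep).indices]

print("final sample subset size:", len(idx))
print("parameters kept in mask:", int(sum(m.sum() for m in mask)))
```

In a FISH-Mask-style fine-tuning run, only the parameters retained by the final mask would then receive gradient updates, with all other parameters frozen.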

References (38)
  1. Shun-ichi Amari. 1996. Neural learning in structured parameter spaces-natural riemannian gradient. Advances in neural information processing systems, 9.
  2. Stronger generalization bounds for deep nets via a compression approach. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 254–263. PMLR.
  3. Language models are few-shot learners. In NeurIPS.
  4. Semeval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. CoRR, abs/1708.00055.
  5. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics.
  6. Sparse low-rank adaptation of pre-trained language models. CoRR, abs/2311.11696.
  7. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mac. Intell., 5(3):220–235.
  8. William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP@IJCNLP. Asian Federation of Natural Language Processing.
  9. Ronald A Fisher. 1922. On the mathematical foundations of theoretical statistics. Philosophical transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 222(594-604):309–368.
  10. Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135.
  11. Parameter-efficient transfer learning with diff pruning. In ACL/IJCNLP (1), pages 4884–4896. Association for Computational Linguistics.
  12. Parameter-efficient transfer learning for NLP. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
  13. LoRA: Low-rank adaptation of large language models. In ICLR. OpenReview.net.
  14. Scaling laws for neural language models. CoRR, abs/2001.08361.
  15. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).
  16. The winograd schema challenge. In KR. AAAI Press.
  17. Measuring the intrinsic dimension of objective landscapes. In ICLR (Poster). OpenReview.net.
  18. Scaling down to scale up: A guide to parameter-efficient fine-tuning. CoRR, abs/2303.15647.
  19. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In NeurIPS.
  20. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In ACL (2), pages 61–68. Association for Computational Linguistics.
  21. Rethinking parameter counting in deep models: Effective dimensionality revisited. CoRR, abs/2003.02139.
  22. A kernel-based view of language model fine-tuning. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 23610–23641. PMLR.
  23. Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451.
  24. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419.
  25. Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
  26. Razvan Pascanu and Yoshua Bengio. 2014. Revisiting natural gradient for deep networks. In ICLR.
  27. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  28. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392. The Association for Computational Linguistics.
  29. Roger Ratcliff. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285.
  30. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642. ACL.
  31. Training neural networks with fixed sparse masks. In NeurIPS, pages 24193–24205.
  32. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In EACL, pages 3266–3279. Association for Computational Linguistics.
  33. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR (Poster). OpenReview.net.
  34. Neural network acceptability judgments. CoRR, abs/1805.12471.
  35. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pages 1112–1122. Association for Computational Linguistics.
  36. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL (2), pages 1–9. Association for Computational Linguistics.
  37. Masking as an efficient alternative to finetuning for pretrained language models. In EMNLP (1), pages 2226–2241. Association for Computational Linguistics.
  38. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. CoRR, abs/2302.10198.
