
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method (2402.17193v1)

Published 27 Feb 2024 in cs.CL and cs.LG

Abstract: While LLMs often adopt finetuning to unlock their capabilities for downstream applications, our understanding on the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning -- full-model tuning (FMT) and parameter efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods.

Exploring the Dynamics of LLM Finetuning Across Scalable Parameters

Introduction to Finetuning Scaling in LLMs

In the rapidly evolving landscape of NLP, leveraging pretrained LLMs for downstream applications has become the norm, capitalizing on the in-context learning and emergent capabilities of models like GPT-4 and PaLM 2. Despite these advances, a systematic understanding of how various factors, particularly model size, pretraining data size, new finetuning parameters, and finetuning data size, influence the effectiveness of finetuning methods remains limited. This gap forms the crux of the investigation, which focuses on two finetuning approaches: Full-Model Tuning (FMT) and Parameter-Efficient Tuning (PET), the latter comprising methods such as prompt tuning and Low-Rank Adaptation (LoRA).
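To make the contrast between the two finetuning families concrete, the sketch below compares how many parameters each approach actually trains for a single linear layer. This is generic PyTorch, not the paper's implementation; the layer sizes, LoRA rank, and prompt length are hypothetical placeholders.

```python
# Illustrative sketch (hypothetical sizes): trainable-parameter counts under
# full-model tuning (FMT), LoRA, and prompt tuning for one linear layer.
import torch
import torch.nn as nn

d_model, d_ff = 1024, 4096

# FMT: every weight in the layer is updated during finetuning.
fmt_layer = nn.Linear(d_model, d_ff)
fmt_params = sum(p.numel() for p in fmt_layer.parameters())

class LoRALinear(nn.Module):
    """LoRA: freeze the pretrained weight W and learn a low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

lora_layer = LoRALinear(nn.Linear(d_model, d_ff), r=8)
lora_params = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)

# Prompt tuning: the only new parameters are `prompt_len` soft-prompt vectors
# prepended to the input embeddings; the LLM itself stays frozen.
prompt_len = 20
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.01)

print(f"FMT trainable params (one layer):  {fmt_params:,}")
print(f"LoRA trainable params (one layer): {lora_params:,}")
print(f"Prompt-tuning trainable params:    {soft_prompt.numel():,}")
```

The orders-of-magnitude gap in trainable parameters is what motivates studying PET's scaling behavior separately from FMT's.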

Methodology and Experimentation

The research conducts a thorough analysis across multiple dimensions, involving LLM model sizes from 1B to 16B parameters and finetuning tasks including bilingual machine translation and multilingual summarization. The essence of this exploration is captured in a proposed multiplicative joint scaling law that relates finetuning data size to each of the other scaling factors under study (a sketch of its general form follows the list below), highlighting:

  • The relative impact of scaling LLM models versus pretraining data on finetuning efficiency.
  • The limited effectiveness of scaling PET parameters.
  • Task and data dependency in the selection of optimal finetuning methods.
  • Stronger zero-shot generalization to closely related tasks under PET than under FMT.
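The abstract characterizes this relationship as a power-based multiplicative joint scaling law. As a sketch of the general form such a law takes (the symbols below are illustrative rather than quoted from the paper), with D_f the finetuning data size and X any one of the other factors (model size, pretraining data size, or PET parameter count):

```latex
% Sketch of a multiplicative joint power law: the coefficient A, the
% irreducible-loss term E, and the exponents alpha and beta are fitted
% separately for each task and finetuning method (FMT, prompt tuning, LoRA).
\[
  \hat{\mathcal{L}}(X, D_f) \;=\; A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} \;+\; E
\]
```

Under such a form, comparing the fitted exponents across choices of X is what supports statements like "model scaling helps finetuning more than pretraining-data scaling".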

Key Observations and Findings

The analysis brings forth several intriguing findings:

  • Scaling the LLM model size benefits finetuning performance significantly more than scaling the pretraining data, underlining the importance of model capacity; a generic sketch of how such scaling exponents can be fitted follows this list.
  • For PET parameter scaling, neither increasing the prompt length nor the LoRA rank yields substantial gains, though LoRA exhibits better training stability.
  • The paper corroborates the task- and data-dependent nature of optimal finetuning method selection, arguing against a one-size-fits-all approach.
  • PET methods, particularly when finetuning data is scarce, show a stronger propensity for zero-shot generalization, a key consideration for tasks where model flexibility matters.
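As an illustration of how the exponents behind findings like these can be estimated, the sketch below fits the joint power law from the previous section to a synthetic grid of (X, D_f, loss) observations with SciPy. It is a generic least-squares fit, not the paper's fitting procedure, and all numbers are made up.

```python
# Generic sketch: fit L(X, Df) = A * X**(-alpha) * Df**(-beta) + E to observed
# finetuning losses. The grid below is synthetic; in practice (X, Df, loss)
# would come from finetuning runs at different scales.
import numpy as np
from scipy.optimize import curve_fit

def joint_law(xdf, A, alpha, beta, E):
    X, Df = xdf
    return A * X**(-alpha) * Df**(-beta) + E

rng = np.random.default_rng(0)
X = np.repeat([1.0, 2.0, 4.0, 8.0, 16.0], 6)           # e.g. model size in billions
Df = np.tile([1e3, 3e3, 1e4, 3e4, 1e5, 3e5], 5)        # finetuning examples
loss = joint_law((X, Df), A=5.0, alpha=0.3, beta=0.1, E=1.2)
loss = loss + rng.normal(scale=0.01, size=loss.shape)  # observation noise

params, _ = curve_fit(joint_law, (X, Df), loss, p0=[1.0, 0.5, 0.5, 1.0], maxfev=10000)
A, alpha, beta, E = params
print(f"A={A:.2f}, alpha={alpha:.3f}, beta={beta:.3f}, E={E:.2f}")
```

Repeating such a fit with X set to pretraining data size or PET parameter count, and comparing the resulting exponents, is the kind of analysis that underlies the comparisons above.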

Future Trajectories and Theoretical Implications

This investigation opens several avenues for future research, notably extending these findings to multimodal LLMs and understanding the impact of finetuning data quality. The proposed data-dependent joint scaling law enriches our theoretical understanding of finetuning dynamics in LLMs, laying the groundwork for more optimized, task-specific application of these models.

Concluding Remarks

This examination underscores the nuanced interplay between model size, data size, and finetuning method in improving LLM performance on downstream tasks. By dissecting these relationships, the paper offers insights for navigating the complexities of LLM finetuning that are likely to inform future NLP research and application strategies.

Authors (4)
  1. Biao Zhang (76 papers)
  2. Zhongtao Liu (6 papers)
  3. Colin Cherry (38 papers)
  4. Orhan Firat (80 papers)