
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method (2402.17193v1)

Published 27 Feb 2024 in cs.CL and cs.LG

Abstract: While LLMs often adopt finetuning to unlock their capabilities for downstream applications, our understanding on the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size and finetuning data size, affect the finetuning performance. We consider two types of finetuning -- full-model tuning (FMT) and parameter efficient tuning (PET, including prompt tuning and LoRA), and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each other scaling factor; 2) LLM finetuning benefits more from LLM model scaling than pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning data-dependent. We hope our findings could shed light on understanding, selecting and developing LLM finetuning methods.

Exploring the Dynamics of LLM Finetuning Across Scalable Parameters

Introduction to Finetuning Scaling in LLMs

In the rapidly evolving landscape of NLP, leveraging pretrained LLMs for downstream applications has become the norm, capitalizing on the in-context learning and emergent capabilities of models like GPT-4 and PaLM 2. Despite these advances, a systematic understanding of how various factors, particularly model size, pretraining data size, new finetuning parameters, and finetuning data size, influence the effectiveness of finetuning methods remains limited. This gap forms the crux of the investigation, which focuses on two finetuning approaches: Full-Model Tuning (FMT) and Parameter-Efficient Tuning (PET), the latter comprising methods such as prompt tuning and Low-Rank Adaptation (LoRA).
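To make the contrast between the two finetuning families concrete, the sketch below compares how many parameters each approach actually trains for a single linear layer. This is generic PyTorch, not the paper's implementation; the layer sizes, LoRA rank, and prompt length are hypothetical placeholders.

```python
# Illustrative sketch (hypothetical sizes): trainable-parameter counts under
# full-model tuning (FMT), LoRA, and prompt tuning for one linear layer.
import torch
import torch.nn as nn

d_model, d_ff = 1024, 4096

# FMT: every weight in the layer is updated during finetuning.
fmt_layer = nn.Linear(d_model, d_ff)
fmt_params = sum(p.numel() for p in fmt_layer.parameters())

class LoRALinear(nn.Module):
    """LoRA: freeze the pretrained weight W and learn a low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

lora_layer = LoRALinear(nn.Linear(d_model, d_ff), r=8)
lora_params = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)

# Prompt tuning: the only new parameters are `prompt_len` soft-prompt vectors
# prepended to the input embeddings; the LLM itself stays frozen.
prompt_len = 20
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.01)

print(f"FMT trainable params (one layer):  {fmt_params:,}")
print(f"LoRA trainable params (one layer): {lora_params:,}")
print(f"Prompt-tuning trainable params:    {soft_prompt.numel():,}")
```

The orders-of-magnitude gap in trainable parameters is what motivates studying PET's scaling behavior separately from FMT's.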

Methodology and Experimentation

The research conducts a thorough analysis across multiple dimensions, involving LLM model sizes from 1B to 16B parameters and finetuning tasks including bilingual machine translation and multilingual summarization. The essence of this exploration is captured in a proposed multiplicative joint scaling law that relates finetuning data size to each of the other scaling factors under study (a sketch of its general form follows the list below), highlighting:

  • The relative impact of scaling LLM models versus pretraining data on finetuning efficiency.
  • The limited effectiveness of scaling PET parameters.
  • Task and data dependency in the selection of optimal finetuning methods.
  • Stronger zero-shot generalization to closely related tasks under PET than under FMT.
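The abstract characterizes this relationship as a power-based multiplicative joint scaling law. As a sketch of the general form such a law takes (the symbols below are illustrative rather than quoted from the paper), with D_f the finetuning data size and X any one of the other factors (model size, pretraining data size, or PET parameter count):

```latex
% Sketch of a multiplicative joint power law: the coefficient A, the
% irreducible-loss term E, and the exponents alpha and beta are fitted
% separately for each task and finetuning method (FMT, prompt tuning, LoRA).
\[
  \hat{\mathcal{L}}(X, D_f) \;=\; A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} \;+\; E
\]
```

Under such a form, comparing the fitted exponents across choices of X is what supports statements like "model scaling helps finetuning more than pretraining-data scaling".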

Key Observations and Findings

The analysis brings forth several intriguing findings:

  • Scaling the LLM model size benefits finetuning performance significantly more than scaling the pretraining data, underlining the importance of model capacity; a generic sketch of how such scaling exponents can be fitted follows this list.
  • For PET parameter scaling, neither increasing the prompt length nor the LoRA rank yields substantial gains, though LoRA exhibits better training stability.
  • The paper corroborates the task- and data-dependent nature of optimal finetuning method selection, arguing against a one-size-fits-all approach.
  • PET methods, particularly when finetuning data is scarce, show a stronger propensity for zero-shot generalization, a key consideration for tasks where model flexibility matters.
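As an illustration of how the exponents behind findings like these can be estimated, the sketch below fits the joint power law from the previous section to a synthetic grid of (X, D_f, loss) observations with SciPy. It is a generic least-squares fit, not the paper's fitting procedure, and all numbers are made up.

```python
# Generic sketch: fit L(X, Df) = A * X**(-alpha) * Df**(-beta) + E to observed
# finetuning losses. The grid below is synthetic; in practice (X, Df, loss)
# would come from finetuning runs at different scales.
import numpy as np
from scipy.optimize import curve_fit

def joint_law(xdf, A, alpha, beta, E):
    X, Df = xdf
    return A * X**(-alpha) * Df**(-beta) + E

rng = np.random.default_rng(0)
X = np.repeat([1.0, 2.0, 4.0, 8.0, 16.0], 6)           # e.g. model size in billions
Df = np.tile([1e3, 3e3, 1e4, 3e4, 1e5, 3e5], 5)        # finetuning examples
loss = joint_law((X, Df), A=5.0, alpha=0.3, beta=0.1, E=1.2)
loss = loss + rng.normal(scale=0.01, size=loss.shape)  # observation noise

params, _ = curve_fit(joint_law, (X, Df), loss, p0=[1.0, 0.5, 0.5, 1.0], maxfev=10000)
A, alpha, beta, E = params
print(f"A={A:.2f}, alpha={alpha:.3f}, beta={beta:.3f}, E={E:.2f}")
```

Repeating such a fit with X set to pretraining data size or PET parameter count, and comparing the resulting exponents, is the kind of analysis that underlies the comparisons above.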

Future Trajectories and Theoretical Implications

This investigation opens several avenues for future research, notably extending these findings to multimodal LLMs and understanding the impact of finetuning data quality. The proposed data-dependent joint scaling law enriches our theoretical understanding of finetuning dynamics in LLMs, laying the groundwork for more optimized, task-specific application of these models.

Concluding Remarks

This examination underscores the nuanced interplay between model size, data size, and finetuning method in improving LLM performance on downstream tasks. By dissecting these relationships, the paper offers insights for navigating the complexities of LLM finetuning that are likely to inform future NLP research and application strategies.

Authors (4)
  1. Biao Zhang (76 papers)
  2. Zhongtao Liu (6 papers)
  3. Colin Cherry (38 papers)
  4. Orhan Firat (80 papers)