Scaling Laws for Fine-Tuning

Updated 19 August 2025
  • Scaling laws for fine-tuning are mathematically defined rules that link model size, data volume, and compute to predictable performance gains on downstream tasks.
  • These laws account for effective data transfer, phase transitions, and domain composition to optimize resource allocation while mitigating issues like catastrophic forgetting.
  • Empirical studies validate these principles across diverse architectures and domains, enabling accurate performance forecasting and informed model-data trade-offs.

Scaling laws for fine-tuning describe the predictable mathematical relationships that govern how neural model performance on downstream tasks improves with increasing model size, fine-tuning data volume, compute allocation, and—in multi-domain or transfer scenarios—the composition and alignment of fine-tuning datasets. Unlike scaling laws for pre-training, which typically focus on model and dataset size, fine-tuning scaling laws must explicitly account for the pre-trained foundation, data transfer dynamics, phase transitions in improvement, domain mixture composition, catastrophic forgetting, data poisoning susceptibility, complexity-driven task bottlenecks, and practical resource constraints. Theoretical and empirical studies across a diverse landscape of neural architectures (transformers, image models, speech models, mixture-of-experts, etc.) consistently reveal that fine-tuning improvements usually adhere to power-law or rectified power-law forms, but these can be modulated or even limited by factors distinct from those present during pre-training.

1. Mathematical Forms of Scaling Laws in Fine-Tuning

Fine-tuning scaling laws extend and generalize the classical power-law forms derived from pre-training. For generative and discriminative models across image, language, multimodal, and mathematical domains, loss L is universally expressed as a sum of an irreducible term (usually data entropy or inherent error) plus a reducible power-law term dependent on model size N, data size D, or compute C:

L(x) = L_\infty + (x_0/x)^{\alpha_x}

where x stands for N, C, or D; L_∞ quantifies the entropy ("irreducible loss"); and α_x is a domain- and task-dependent scaling exponent (Henighan et al., 2020).
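As a concrete illustration, the following minimal sketch fits this functional form to pilot measurements and extrapolates to a larger scale. The data points, initial guesses, and resulting constants are all hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical pilot measurements: model sizes (parameters) vs. observed loss.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss = np.array([4.10, 3.62, 3.21, 2.88, 2.63])

def scaling_law(x, L_inf, x0, alpha):
    """L(x) = L_inf + (x0 / x)**alpha -- irreducible plus reducible power-law term."""
    return L_inf + (x0 / x) ** alpha

# p0 keeps the optimizer in a sensible basin; real fits should span wider scales.
params, _ = curve_fit(scaling_law, N, loss, p0=(2.0, 1e6, 0.1), maxfev=20000)
L_inf, x0, alpha = params
print(f"fitted irreducible loss ≈ {L_inf:.2f}, exponent ≈ {alpha:.3f}")
print(f"extrapolated loss at N=1e9: {scaling_law(1e9, *params):.2f}")
```

In practice such extrapolations should be validated against held-out scales before being trusted (see Section 6).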

For transfer and fine-tuning, the scaling law introduces "effective data transferred":

D_t = k (D_F)^{\alpha} N^{\beta}

where D_F is the fine-tuning (target) dataset size, N is the model parameter count, and the exponents α and β capture task alignment and model generality (Hernandez et al., 2021). The observed loss is then largely determined by D_F + D_t in the data regime of interest, so larger models and closer pre-training alignment lower the task-specific data requirement.
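A minimal sketch of this accounting; the constants k, alpha, and beta below are purely illustrative placeholders, not values from any published fit:

```python
def effective_data_transferred(D_F: float, N: float,
                               k: float = 2e3, alpha: float = 0.2, beta: float = 0.4) -> float:
    """D_t = k * D_F**alpha * N**beta; all constants here are illustrative."""
    return k * D_F ** alpha * N ** beta

# Larger models convert the same fine-tuning set into more "effective" data.
D_F = 1e5
for N in [1e8, 1e9, 1e10]:
    D_t = effective_data_transferred(D_F, N)
    print(f"N={N:.0e}: D_F + D_t ≈ {D_F + D_t:.3e} effective examples")
```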

Recent works introduce rectified scaling laws for fine-tuning, incorporating a "pre-learned data size" D_l from the pre-trained model and capturing phase transitions:

L(D) = \frac{B}{D_l + D^{\beta}} + E

This form explains the transition from a slow "pre-power" phase (dominated by D_l) to a classic power-law regime (when D ≫ D_l), ensuring accurate prediction even when only small fine-tuning datasets are available (Lin et al., 4 Feb 2024).
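The two-phase behavior is easy to see numerically. A minimal sketch, where the constants B, D_l, beta, and E are invented for the demonstration:

```python
import numpy as np

def rectified_loss(D, B=1e4, D_l=1e4, beta=0.7, E=1.5):
    """L(D) = B / (D_l + D**beta) + E, with illustrative constants."""
    return B / (D_l + D ** beta) + E

# Loss barely moves while D**beta << D_l (pre-power phase), then falls
# along a classic power law once D**beta dominates the denominator.
for D in np.logspace(2, 7, 6):
    print(f"D={D:>10.0e}  L={rectified_loss(D):.4f}")
```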

In multitask or transfer settings:

L(p, f) = (A \cdot p^{-\alpha} + G) \cdot f^{-\beta} + E

where p is the number of pre-training steps/tokens, f is the fine-tuning dataset size, and G is the transfer gap, i.e., the residual loss due to domain mismatch, which sets a lower bound on transfer efficiency (Barnett, 30 Aug 2024). More general forms also incorporate domain mixture weights h as

L(N, D, h) = E + \frac{1}{\sum_{i=1}^{k} C_i h_i^{\gamma_i} + A/N^{\alpha} + B/D^{\beta}}

enabling prediction of optimal training mixtures (Shukor et al., 12 Jul 2025).
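Given fitted constants, the optimal mixture can be found by direct numerical optimization over the probability simplex. A minimal sketch, in which every constant (C_i, γ_i, A, B, E, α, β) is a hypothetical placeholder rather than a published fit:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative constants for a 3-domain mixture law (all values hypothetical).
C = np.array([2.0, 1.2, 0.6])
gamma = np.array([0.5, 0.4, 0.3])
A, B, E, alpha, beta = 400.0, 4e5, 1.7, 0.34, 0.28
N, D = 1e9, 1e10  # fixed model and data scale

def mixture_loss(h):
    """L(N, D, h) = E + 1 / (sum_i C_i h_i**gamma_i + A/N**alpha + B/D**beta)."""
    return E + 1.0 / (np.sum(C * h ** gamma) + A / N ** alpha + B / D ** beta)

# Optimize over the simplex: h_i >= 0 and sum(h) = 1.
cons = ({"type": "eq", "fun": lambda h: h.sum() - 1.0},)
res = minimize(mixture_loss, x0=np.full(3, 1 / 3), bounds=[(0, 1)] * 3, constraints=cons)
print("optimal mixture weights:", np.round(res.x, 3))
```

In practice the constants would first be estimated from a grid of small-scale training runs before the optimized weights are trusted at scale.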

2. Transfer, Effective Data, and Domain Alignment

A critical insight is that the benefit from pre-training is quantifiable as "effective data transferred", governed by a power law in both model size N and fine-tuning data D_F:

D_t = k \, D_F^{\alpha} N^{\beta}

with D_t dominating in the low-data regime. The exponents α and β directly reflect the degree of distributional alignment and model generality. As α decreases (i.e., as the pre-training and fine-tuning distributions grow closer), pre-training provides a stronger data-multiplicative benefit. Scaling laws thus formalize the notion that larger and better-aligned models "need less" task-specific fine-tuning data for comparable performance (Hernandez et al., 2021, Barnett, 30 Aug 2024).

However, transfer gaps G can impose fundamental limits in the form of irreducible downstream loss, even with unlimited pre-training. When pre-training and target tasks are misaligned, scaling up pre-training yields diminishing returns, and fine-tuning data collection becomes comparatively more critical (Barnett, 30 Aug 2024, Isik et al., 6 Feb 2024). This is particularly well illustrated in machine translation scaling, where only models pretrained on distributionally matching languages exhibit monotonic and predictable BLEU score improvements (Isik et al., 6 Feb 2024).

3. Multidimensional Scaling: Compute, Data, Model Size, and Mixture Composition

Scaling laws provide practical guidance for optimizing model and training design under resource constraints. The compute-optimal allocation for training, given fixed total compute C, satisfies

N_{\text{opt}} \propto C^{\beta} \quad \text{with} \quad \beta \approx 0.7

Sublinear dataset scaling with model size (D ~ N^0.4) is thus optimal in this regime (Henighan et al., 2020). For fine-tuning, data composition, not just total tokens, is crucial: accounting for the number of examples N and their mean length L yields a dataset volume V = N × L whose contribution to downstream performance differs for the same total token count under different sampling strategies (Lagasse et al., 9 May 2025).
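A minimal sketch of this allocation rule under the common C ≈ 6ND FLOP estimate for transformer training; the proportionality constant k_N below is invented and would need calibrating against a real training setup:

```python
def compute_optimal_allocation(C: float, k_N: float = 2e-5, beta: float = 0.7):
    """Split a compute budget C (FLOPs) between parameters and tokens.

    N_opt ∝ C**beta with beta ≈ 0.7; combined with C ≈ 6*N*D this implies
    D ∝ N**(0.3/0.7) ≈ N**0.4, the sublinear data scaling quoted above.
    k_N is a hypothetical calibration constant.
    """
    N_opt = k_N * C ** beta
    D_opt = C / (6 * N_opt)
    return N_opt, D_opt

for C in [1e20, 1e21, 1e22]:
    N, D = compute_optimal_allocation(C)
    print(f"C={C:.0e} FLOPs -> N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```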

In Mixture-of-Experts (MoE) and multitask models, scaling laws incorporate new architectural hyperparameters such as granularity G:

L(N, D, G) = c + \left[ (g/G^{\gamma}) + a \right] / N^{\alpha} + b/D^{\beta}

Fine-grained tuning (raising G) often yields more efficient adaptation than traditional expert configurations (Krajewski et al., 12 Feb 2024).

Scaling laws for optimal data mixtures enable analytic determination of the best domain weights h for any target task and compute budget, moving beyond trial-and-error for pretraining and fine-tuning mixtures (Shukor et al., 12 Jul 2025).

4. Empirical Findings: Phase Transitions, Forgetting, Data Poisoning, and Complexity

Empirical studies reveal that fine-tuning scaling laws are not universally monotonic or linear. Notable phenomena include:

  • Pre-power and Power Phases: Rectified scaling laws account for an initial regime where improvements are slow (controlled by pre-learned data size), transitioning to rapid power-law scaling as more fine-tuning data is added. This two-phase behavior is pronounced in low-data fine-tuning settings (Lin et al., 4 Feb 2024, Sengupta et al., 17 Feb 2025).
  • Forgetting and Catastrophic Interference: Fine-tuning LLMs, especially with parameter-efficient methods (e.g., LoRA), produces a strong inverse linear relationship between downstream loss and forgetting of pretrained capabilities (knowledge, reasoning, safety), regardless of the number of fresh parameters trained. Both forgetting and fine-tuning loss scale as shifted power laws in parameter count and update steps, quantifying "catastrophic forgetting" as an intrinsic scaling phenomenon (Kalajdzievski, 11 Jan 2024); a minimal numerical sketch of this coupling appears after this list.
  • Scaling Laws and Data Poisoning: As model size increases, vulnerability to data poisoning and jailbreak tuning scales upward, with larger models learning harmful behaviors from minimal poisoned data at a faster rate. Regression analyses establish a positive scaling relationship between log-parameters and post-attack harmfulness scores, even at very low poisoning rates (Bowen et al., 6 Aug 2024).
  • Complexity-Driven Limits: Scaling laws can be governed by task-intrinsic complexity, as shown in combinatorial optimization (e.g., Traveling Salesman Problem). Fixed-capacity models exhibit superlinear increases in suboptimality with respect to problem size and complexity, revealing predictable, irreducible performance gaps that cannot be overcome by fine-tuning alone when the problem's solution or representation space scales "too fast" (Weissman et al., 15 Jun 2025).
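To make the inverse-linear coupling concrete, the sketch below generates fine-tuning loss and forgetting from two shifted power laws sharing an exponent. All constants are invented for illustration, and the shared-exponent assumption is what forces the exactly linear relation here:

```python
import numpy as np

def shifted_power_law(t, A, t0, alpha, E):
    """L(t) = A * (t + t0)**(-alpha) + E  (assumed shifted power-law form)."""
    return A * (t + t0) ** (-alpha) + E

steps = np.logspace(1, 5, 9)
ft_loss = shifted_power_law(steps, A=8.0, t0=50.0, alpha=0.3, E=1.2)   # falls with training
forget = shifted_power_law(steps, A=-3.0, t0=50.0, alpha=0.3, E=4.0)   # rises with training

# The two traces are affinely related, echoing the reported inverse-linear coupling.
corr = np.corrcoef(ft_loss, forget)[0, 1]
print(f"correlation(ft_loss, forgetting) = {corr:.3f}")  # ≈ -1.0
```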

5. Domain-Specific Scaling Laws and Practical Implications

Scaling law forms and exponents differ across domains:

  • In contrastive (CLIP) and multimodal models, zero-shot accuracy and retrieval scale as clean power-laws in compute and model size, but the actual effect sizes and scaling trends strongly depend on the particular data distribution and the alignment between source and target domains (Cherti et al., 2022).
  • In speech recognition (RescoreBERT), normalized word error rates improve as a joint power-law in both fine-tuning data and model size for pre-trained models, but only as a function of data size for non-pretrained models—quantifying the data-multiplicative effects of transfer (Gu et al., 2023).
  • For synthetic data, scaling laws track those observed for real data, but improvements plateau earlier, and the number of tokens required to saturate performance is lower for large models (Qin et al., 25 Mar 2025).
  • In multi-task power system modeling, scenario generalization follows a power law with data size, is largely insensitive to strong scaling of parameter count, and remains robust in multi-task environments—demonstrating predictability for practical limits of fine-tuning in highly structured, domain-specific applications (Liu et al., 25 Mar 2025).

Table: Prototypical Fine-Tuning Scaling Laws Across Domains

| Domain/Setting | Scaling Law Formulation | Notes/Interpretation |
| --- | --- | --- |
| Generic generative/fine-tuning | L(x) = L_\infty + (x_0/x)^{\alpha_x} | Loss approaches irreducible entropy; α_x: scaling exponent |
| Data transfer (low-data) | D_t = k D_F^{\alpha} N^{\beta} | Effective data from pretraining; α: proximity, β: generality |
| Rectified fine-tuning | L(D) = B/(D_l + D^{\beta}) + E | Accounts for pre-power phase transition and pre-learned data |
| Mixture-of-Experts | L(N, D, G) = c + [(g/G^{\gamma}) + a]/N^{\alpha} + b/D^{\beta} | Efficiency scales with granularity |
| Transfer with gap | L(p, f) = (A p^{-\alpha} + G) f^{-\beta} + E | G: transfer gap sets lower bound |
| Data mixture optimization | L(N, D, h) = E + 1/(\sum_i C_i h_i^{\gamma_i} + A/N^{\alpha} + B/D^{\beta}) | Mixture- and scale-optimal fine-tuning/pretraining |

6. Predictive Use, Model Selection, and Limitations

Scaling laws for fine-tuning enable:

  • Performance Forecasting: From small-scale pilot runs, extrapolation using established scaling exponents can accurately predict large-scale model performance in both language and vision tasks, provided the power-law fits are robust (goodness-of-fit R^2 ≫ 0.9) (Ivgi et al., 2022).
  • Model/Data Trade-offs: Law-backed analytic frameworks allow prediction of how to optimally allocate resources between increasing model size, expanding dataset size, collecting better-aligned data, or modifying domain mixtures to achieve a desired downstream loss under a fixed budget (Henighan et al., 2020, Shukor et al., 12 Jul 2025).
  • Selection Algorithms: The Accept-then-Stop (AtS) algorithm demonstrates that the structure of the pre-power and power-law phases enables informed model selection (choose the candidate with the lowest extrapolated full-dataset loss) at orders-of-magnitude lower resource cost (Lin et al., 4 Feb 2024); a sketch of this selection idea follows this list.
  • Caveats and Deviations: Not all domains, architectures, or objectives exhibit perfect adherence to simple scaling law forms. Phase transitions, irreducible domain gaps, catastrophic forgetting, compositionality bottlenecks, and problem complexity saturation are persistent challenges—suggesting that practitioners must calibrate scaling laws locally and monitor phase transitions during fine-tuning and deployment (Sengupta et al., 17 Feb 2025).
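A minimal sketch of the selection idea: fit the rectified law on pilot subsets, extrapolate each candidate to the full dataset size, and keep the best. This simplifies the actual AtS procedure, and the pilot losses and initial guesses below are hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

def rectified_law(D, B, D_l, beta, E):
    """L(D) = B / (D_l + D**beta) + E."""
    return B / (D_l + D ** beta) + E

def extrapolated_full_loss(D_pilot, loss_pilot, D_full):
    """Fit the rectified law on pilot subsets and extrapolate to the full dataset."""
    p, _ = curve_fit(rectified_law, D_pilot, loss_pilot,
                     p0=(1e3, 1e3, 0.5, 1.0), maxfev=50000,
                     bounds=([0, 0, 0, 0], [np.inf] * 4))
    return rectified_law(D_full, *p)

# Hypothetical pilot losses for two candidate checkpoints on growing subsets.
D = np.array([1e2, 3e2, 1e3, 3e3, 1e4])
candidates = {
    "model_a": np.array([2.90, 2.84, 2.70, 2.45, 2.10]),
    "model_b": np.array([2.70, 2.68, 2.62, 2.52, 2.40]),
}
preds = {m: extrapolated_full_loss(D, y, D_full=1e6) for m, y in candidates.items()}
print("selected:", min(preds, key=preds.get), preds)  # pick the lowest extrapolated loss
```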

7. Data Quality, Annotation, and Ethical Considerations

Data quality and composition have scaling effects beyond simple size. For low-resource environments, the scaling law can be used as a yardstick for annotation quality: if model performance does not increase with model size for a given dataset, annotation revisions may be required (Kong, 5 May 2024). The use of scaling law trends as both an annotation diagnostic and fine-tuning target metric provides a principled approach for robust dataset creation, especially under constraints of privacy, funding, and compute.

Fine-tuning scaling laws also highlight critical ethical risks, notably increased susceptibility to harmful behavior acquisition and safety degradation as models scale in size (Kalajdzievski, 11 Jan 2024, Bowen et al., 6 Aug 2024), necessitating rigorous data curation, red-teaming, and safety benchmarking throughout large-scale fine-tuning.


In summary, scaling laws for fine-tuning provide a quantitative theoretical and empirical foundation for predicting, optimizing, and understanding the adaptation of large neural models to downstream tasks. The functional forms of these laws, together with their domain- and data-specific exponents, support principled model/data budgeting, optimal mixture selection, safety evaluation, and resource-efficient model selection. At the same time, deviations arising from forgetting, data quality issues, and complexity bottlenecks underscore the need for adaptive, empirically validated fine-tuning protocols as model scales and application landscapes continue to evolve.