Fine-Tuning Scaling Laws

Updated 5 November 2025

Fine-tuning scaling laws are mathematical models that define the relationship between compute, model size, and fine-tuning data in boosting downstream task performance.
They integrate factors like dataset composition, transfer gap, and specialized architectures (e.g., Mixture-of-Experts) to inform optimal resource allocation.
These laws highlight tradeoffs such as catastrophic forgetting and emphasize careful hyperparameter tuning and robust defense measures.

Fine-tuning scaling laws quantify and predict how model performance improves as a function of resource allocation—principally compute, model size, and fine-tuning data—during the adaptation of pretrained models to downstream tasks. Historically, scaling laws were developed for pretraining, but recent research demonstrates the critical necessity of refined laws for fine-tuning under practical constraints. These laws now increasingly account for dataset composition, model architecture (including specialist methods like Mixture of Experts), transfer dynamics, robustness, and catastrophic forgetting. A consensus emerges that effective fine-tuning cannot be understood via naive token counting alone: nuanced variables such as data composition, pretraining-finetuning distribution alignment, task mixture optimization, and parameter isolation all demonstrably shift scaling regimes and optimal strategies.

1. Mathematical Formulations and Data Composition in Fine-Tuning Scaling Laws

Traditional scaling laws for fine-tuning relate accuracy (or another downstream metric) to some function of model size ( $M$ ), total number of fine-tuning tokens ( $T$ ), and occasionally compute. Recent advances, exemplified by the introduction of a data composition-aware scaling law (Lagasse et al., 9 May 2025), assert that treating dataset volume as an explicit product of the number of examples ( $N$ ) and average token length per example ( $L$ ) yields more precise predictions: $\text{Accuracy} = A\, V^{\beta} M^{\gamma} + E, \qquad V = N \cdot L$ where $A,\,\beta,\,\gamma,\,E$ are empirically fitted constants. This formulation makes explicit that total tokens alone are insufficient: holding $V$ (and thus the compute budget) fixed, it is possible to achieve higher performance by optimizing $N$ and $L$ , e.g., preferring a larger number of shorter examples versus fewer longer ones. Empirical ablation confirms that both $N$ and $L$ independently govern fine-tuning efficiency—even when $V$ is constant—demonstrating the violation of "token equivalence" commonly assumed in pre-Chinchilla scaling.

This principle generalizes to other domains: for example, in fine-tuning transformer-based foundation models for power system intelligence (Liu et al., 25 Mar 2025), a power-law scaling is observed between dataset size (demonstrations, number of scenarios) and generalization performance. However, model size scaling saturates rapidly—data scaling dominates.

2. Transfer Scaling, Domain Alignment, and the Transfer Gap

Fine-tuning scaling laws involving transfer learning require explicit modeling of (a) pretraining quantity, (b) alignment between pretraining and fine-tuning distributions, and (c) the "transfer gap." Empirical studies demonstrate that the benefit of pretraining, measured as "effective data transferred," is well-described by a power law in both fine-tuning dataset size and model parameter count (Hernandez et al., 2021): $D_T = k \, (D_F)^\alpha N^\beta$ Here, $D_T$ represents the equivalent data in the fine-tuning distribution that would deliver the same loss as a pretrained, fine-tuned model.

Extension to more general transfer settings introduces the transfer gap term $G$ , the irreducible difference between attainable loss after infinite pretraining versus optimal downstream performance (Barnett, 30 Aug 2024): $L(p, f) = (A \cdot p^{-\alpha} + G)\, f^{-\beta} + E$ $G$ dominates when pretraining and fine-tuning distributions are misaligned. Low $G$ means pretraining is highly efficient; high $G$ implies additional fine-tuning data is essential and pretraining alone cannot compensate. Optimization of data allocation must thus be grounded in empirical measurement of these exponents and the transfer gap per domain.

3. Mixture, Modularity, and Data Mixture Scaling

Scaling laws for fine-tuning with heterogeneous or multi-domain data demonstrate that both scale and composition of the data mixture affect downstream loss in a structured fashion (Shukor et al., 12 Jul 2025). The general scaling law is: $\mathcal{L}(N, D, h) = E + \frac{1}{\sum_{i=1}^k C_i h_i^{\gamma_i}} + \frac{A}{N^\alpha} + \frac{B}{D^{\beta}}$ where $h\in\Delta_k$ are the mixture weights across $k$ domains. A "joint" variant models parameter and data scaling as explicit functions of the mixture, enabling prediction and optimization of domain weights to minimize downstream loss under fixed budgets. This approach is substantially more accurate than per-domain or naive combined metrics and allows for principled selection of $h^\ast$ via convex optimization, with the property that mixture optimality can vary with scale (model size, token budget). Small-scale experiments suffice to fit all parameters needed for robust extrapolation to larger $N, D, h$ .

For specialist architectures such as Mixture-of-Experts, additional hyperparameters (e.g., granularity, number of active experts) must be modeled directly. Scaling laws then take the form (Krajewski et al., 12 Feb 2024): $\mathcal{L}(N, D, G) = c + \left(\frac{g}{G^\gamma} + a\right)\frac{1}{N^\alpha} + \frac{b}{D^\beta}$ indicating fine-tuning (and pretraining) efficiency is maximized by non-standard expert sizes and high granularity, especially at large compute.

4. Task Dependency, Alignment, and Predictive Robustness

The emergence and reliability of fine-tuning scaling laws are highly task-dependent (Ivgi et al., 2022). Clean power-law relationships commonly arise on tasks that closely track the pretraining objective and have sufficient data (e.g., QNLI, SQuAD 1.1), while tasks with poor alignment, limited data, or evaluation metric mismatch (e.g., MRPC, poorly aligned machine translation) may not yield monotonic trends. For downstream metrics such as BLEU or COMET (machine translation), scaling with pretraining data is only monotonic and well fit by log- or power-law functions if pretraining and fine-tuning data are closely aligned (Isik et al., 6 Feb 2024). Cross-entropy often improves monotonically regardless, but should not be used in isolation, as it can "decouple" from task metrics: increasing pretraining on misaligned data can lower cross-entropy but hurt BLEU/COMET.

Practical protocol requires early diagnostics for scaling law fit on small-scale runs; if monotonic curves and clean fits are not observed, further compute on pretraining is unlikely to yield gains. For high-resource settings with large fine-tuning sets, pretraining brings minimal additional benefit.

5. Catastrophic Forgetting, Safety, and Failure Modes in Fine-Tuning Scaling

Parameter-efficient fine-tuning (e.g., LoRA) does not immunize against predictable, monotonic "catastrophic forgetting" of pretraining capabilities. Both fine-tuning performance and forgetting loss obey shifted power-law relationships with the number of parameters updated and gradient steps (Kalajdzievski, 11 Jan 2024): $\mathcal{L}_\text{f}(P,N) = -c_{\mathrm{ft}}c_{\mathrm{f,ft}} \left[ \left(\frac{a_{\mathrm{f}}{P}\right)^{\alpha_{\mathrm{f}}} + \left(\frac{b_{\mathrm{f}}{N}\right)^{\beta_{\mathrm{f}}} \right]^\rho + s_{\mathrm{f,ft}} - c_{\mathrm{f,ft}}s_{\mathrm{ft}}$ Forgetting is linearly related to new-task loss at convergence, independent of the parameter count or early stopping, and affects reasoning, factual knowledge, and alignment/safety guardrails. Fine-tuning thus introduces an inherent, quantifiable tradeoff.

Susceptibility to data poisoning attacks and harmful behavior insertion scales monotonically with model size, holding even at low poisoning rates (as little as 0.5% (Bowen et al., 6 Aug 2024)). Larger models both absorb harmful policies more efficiently and are harder to "clean," requiring rigorous curation and defense measures as scale increases.

6. Optimizer and Schedule Scaling in Fine-Tuning Laws

Reliable extrapolation and benchmarking of fine-tuning scaling laws require per-scale (model size) tuning of optimizer hyperparameters, including learning rate, batch size, and, at small batch sizes, the AdamW $\beta_2$ parameter (Porian et al., 27 Jun 2024). Legacy scaling laws (Kaplan et al. 2020) misallocated optimal ratio of tokens to parameters due to fixed hyperparameters, warmup over-provisioning, or miscounted output head FLOPs; correcting for these factors resolves discrepancies with the Chinchilla (Hoffmann et al. 2022) law and yields universal scaling recommendations: $N^*(C) \propto C^{1/2}, \quad D^*(C) \propto C^{1/2}, \quad \rho^*(C) = \frac{D^*(C)}{N^*(C)} \approx \text{const}$ Empirically, optimal batch size scales $\propto N^{0.5}$ and learning rate scales $\propto N^{-0.16}$ .

For fine-tuning, naive application of pretraining hyperparameters can break downstream scaling laws and reduce practical efficiency.

7. Real-World Implications and Design Guidance

Practical fine-tuning and foundation model deployment demand explicit modeling of dataset composition ( $N$ , $L$ ), transfer gap, mixture weights, and domain alignment. Benchmarking should report not only token count but example count and length distributions, subsampling strategies, and task alignment.

For resource-constrained adaptation, optimal performance at a prescribed compute or token budget is achieved by maximizing the number of high-quality, short examples within budget, using per-scale optimizer tuning and, when necessary, modular architectures with tuned mixture granularity.

Catastrophic forgetting and data poisoning susceptibility are predictable and monotonic: scaling up models without commensurate improvements in annotation rigor, quality control, and defense for both data and method will exacerbate alignment, safety, and generalization pathologies. In transfer learning settings, practitioners should measure the transfer gap $G$ per target application and allocate resources in proportion to these fitted values.

Scaling law identification and fitting at small scale extrapolates robustly to larger scale and reduces the need for grid search, but only when all controlling variables are included and distributions are aligned. Failure to do so leads to misallocation and suboptimal transfer.