
Rectified Scaling Laws in LLM Fine-Tuning

Updated 5 November 2025
  • Rectified Scaling Laws are a refined framework that integrates a pre-learned data size term to capture both the pre-power and power-law phases in fine-tuning.
  • The approach improves prediction accuracy, as shown by lower RMSD scores, and enables resource-efficient model selection methods such as the Accept then Stop (AtS) algorithm.
  • It guides strategic allocation of data and compute resources in both natural and synthetic dataset scenarios, optimizing LLM training and performance forecasting.

Rectified scaling laws describe a mathematical framework for predicting and analyzing the fine-tuning performance of large pre-trained models, particularly LLMs, as a function of additional data size. They refine classic power-law approaches by introducing an explicit "pre-learned data size" term that captures the effect of prior knowledge acquired during pre-training. This yields a two-phase scaling curve, consisting of a previously uncharacterized "pre-power phase" followed by the standard power-law regime, and provides a more accurate and robust basis for model selection and data scaling strategies across both natural and synthetic datasets.

1. Discovery and Mathematical Definition

Rectified scaling laws were introduced to address the empirical observation that fine-tuning loss as a function of data size does not follow the single-phase log-log behavior predicted by conventional scaling laws. Classical forms, such as

\hat{L}(D) = \frac{B}{D^\beta} + E

essentially describe a single-phase decay, failing to account for complex regimes observed at small data sizes after pre-training (Lin et al., 4 Feb 2024). Empirically, when plotting fine-tuning loss versus data size on log-log axes, two distinct phases emerge:

  • A pre-power phase at small data sizes, in which the log-log slope of the loss starts near zero and steepens with data size before the curve transitions;
  • The familiar power-law phase at larger data sizes, with an approximately linear decrease on log-log axes.

The rectified scaling law introduces the "pre-learned data size" $D_l$, representing the effective downstream-task-equivalent data learned during pre-training:

\hat{L}(D) = \frac{B}{D_l + D^\beta} + E

where:

  • $D$ is the fine-tuning data size,
  • $D_l$ is the pre-learned data size,
  • $B$, $\beta$, $E$ are fitted parameters (task/model-dependent).

This formulation allows the loss curve to exhibit an inflection: a gradually decreasing slope (pre-power) giving way to the linear (power) phase. Theoretical analysis shows that the second derivative of $\hat{L}(\ln D)$ changes sign only if the law includes $D_l$, making a phase transition possible. The empirical Root Mean Square Deviation (RMSD) between predicted and observed curves improves sharply (avg. 0.007 for rectified vs. 0.036 for vanilla) (Lin et al., 4 Feb 2024).
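
To make this concrete, here is a minimal fitting sketch in Python (assuming NumPy and SciPy are available). It fits both the vanilla and rectified forms to hypothetical loss measurements and compares their RMSD; the generating parameters and data sizes are invented for illustration, not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def vanilla_law(D, B, beta, E):
    """Classical power law: L(D) = B / D**beta + E."""
    return B / D**beta + E

def rectified_law(D, B, beta, E, Dl):
    """Rectified law: L(D) = B / (Dl + D**beta) + E."""
    return B / (Dl + D**beta) + E

# Hypothetical fine-tuning losses measured at increasing data sizes.
D = np.array([16, 64, 256, 1024, 4096, 16384, 65536], dtype=float)
rng = np.random.default_rng(0)
L_obs = rectified_law(D, B=120.0, beta=0.55, E=1.2, Dl=40.0) + rng.normal(0, 0.01, D.size)

# Fit both forms; bounds keep all parameters non-negative.
p_van, _ = curve_fit(vanilla_law, D, L_obs, p0=[10, 0.5, 1.0],
                     bounds=(0, np.inf), maxfev=10_000)
p_rec, _ = curve_fit(rectified_law, D, L_obs, p0=[10, 0.5, 1.0, 1.0],
                     bounds=(0, np.inf), maxfev=10_000)

rmsd = lambda pred: float(np.sqrt(np.mean((pred - L_obs) ** 2)))
print("vanilla RMSD:  ", rmsd(vanilla_law(D, *p_van)))
print("rectified RMSD:", rmsd(rectified_law(D, *p_rec)))
```

On data generated by the rectified law, the rectified fit should recover a substantially lower RMSD, mirroring the gap between the reported averages above.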

2. Modeling Fine-Tuning and Phase Transition

The inclusion of $D_l$ enables modeling of transfer learning scenarios, where a model is pre-trained on heterogeneous, large-scale data and then fine-tuned on a smaller downstream set. The pre-power phase dominates when $D \ll D_l^{1/\beta}$; in this regime, performance improvement from early data is governed by the model's prior knowledge rather than by the typical power-law dynamics.

Phase transition behavior is statistically and empirically demonstrated:

  • The derivative of $\hat{L}(D)$ with respect to $\log D$ first decreases, then flattens, indicating a transition out of the pre-power phase.
  • Curves fitted with the rectified law not only match the full scaling range but also reliably predict extrapolated future performance.

This is critical in practical settings, as real-world fine-tuning commonly occurs at data sizes straddling both regimes.
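
The two phases can be seen directly in the slope of the fitted curve. The short numerical check below (a sketch using the same illustrative parameters as the fitting example in Section 1) computes the log-log derivative of the reducible loss, which starts near zero in the pre-power phase and approaches $-\beta$ in the power phase.

```python
import numpy as np

# Illustrative parameters, matching the fitting sketch in Section 1.
B, beta, E, Dl = 120.0, 0.55, 1.2, 40.0

lnD = np.linspace(np.log(1.0), np.log(1e6), 400)
L = B / (Dl + np.exp(lnD) ** beta) + E

# Slope of the reducible loss on log-log axes: d log(L - E) / d log D.
slope = np.gradient(np.log(L - E), lnD)
print(f"slope at smallest D: {slope[1]:+.3f}  (near 0: pre-power phase)")
print(f"slope at largest D:  {slope[-2]:+.3f}  (near -beta: power phase)")
```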

3. Implications for Model Selection and Resource Allocation

Rectified scaling laws underpin new algorithms for resource-efficient LLM selection, notably the Accept then Stop (AtS) algorithm (Lin et al., 4 Feb 2024). AtS proceeds as follows:

  1. Iterative fine-tuning of candidate models on successively smaller data subsets.
  2. Monitoring the log-log loss trend; stopping further data reduction when deviation exceeds a threshold:

I_{stop} \triangleq \frac{|\log \hat{L} - f(\log \hat{D})|}{\sigma}

where $f$ is the linear fit and $\sigma$ is the residual standard deviation; reduction stops when $I_{stop} > \delta$.

  3. Extrapolation from the accepted (power-phase) points predicts full-data performance.

Empirically, AtS achieves high Pearson correlation between predicted and realized fine-tuning outcomes using only 1/256–1/512 of the available data. Selection accuracy remains above 95% on diverse NLP benchmarks, with resource consumption reduced by 100–500× versus traditional grid-search or exhaustive fine-tuning (Lin et al., 4 Feb 2024).
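
A minimal sketch of the stopping rule is given below; the threshold value, the seed size, and the function interface are assumptions chosen for illustration rather than details from the paper.

```python
import numpy as np

def accept_then_stop(log_D, log_L, delta=3.0, min_points=3):
    """AtS sketch. log_D, log_L: NumPy arrays sorted by descending data size."""
    accepted = list(range(min_points))              # seed the linear fit
    for i in range(min_points, len(log_D)):
        x, y = log_D[accepted], log_L[accepted]
        a, b = np.polyfit(x, y, 1)                  # f(log D) = a*log D + b
        sigma = (y - (a * x + b)).std() or 1e-12    # residual standard deviation
        I_stop = abs(log_L[i] - (a * log_D[i] + b)) / sigma
        if I_stop > delta:                          # point has left the power phase
            break
        accepted.append(i)
    # Refit on the accepted (power-phase) points for extrapolation.
    a, b = np.polyfit(log_D[accepted], log_L[accepted], 1)
    return accepted, (a, b)
```

The returned fit can then be evaluated at the full data size, log L ≈ a · log D_full + b, to predict and rank each candidate model's full-data performance.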

4. Extensions to Synthetic Data and Data Complexity

Extensive experiments on synthetic data (e.g., with the SynthLLM framework) show that the rectified scaling law precisely fits the loss curves for large-scale synthetic datasets (Qin et al., 25 Mar 2025). For a model of a given size, performance initially improves in accordance with the rectified scaling law, then saturates at a data/model-dependent plateau. Notably:

  • Larger models reach their saturation threshold with fewer synthetic tokens; e.g., an 8B model saturates at ≈1T tokens, whereas a 3B model requires ≈4T;
  • Performance improvements plateau near 300B tokens for both 3B and 8B models;
  • Synthetic data presents a scalable and predictable alternative to web-scraped corpora, as validated by empirical fits.

In parallel, scaling law parameters themselves are sensitive to data complexity, measured via gzip-compressibility (Pandey, 26 May 2024). The functional form

L(N, D, H) = E(H) + \frac{A(H)}{N^{\alpha(H)}} + \frac{B(H)}{D^{\beta(H)}}

with all parameters linear in complexity $H$ (the gzip compression ratio), demonstrates that scaling optima shift as data becomes more or less complex. For less compressible data (high entropy), optimal compute allocation shifts toward larger datasets rather than larger models.
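
As a concrete illustration of the complexity measure, the sketch below computes H as a gzip compression ratio; how text samples are drawn from a corpus is omitted here and would be an implementation choice.

```python
import gzip
import os

def gzip_complexity(text: str) -> float:
    """Complexity H as gzip compression ratio: compressed bytes / raw bytes.
    Higher values mean less compressible, i.e. higher-entropy data."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

print(gzip_complexity("the cat sat on the mat. " * 200))  # repetitive text: low H
print(gzip_complexity(os.urandom(2000).hex()))            # near-random text: high H
```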

5. Broader Theoretical Context and Relationship to Classic Scaling Laws

Rectified scaling laws generalize and extend the predictive framework established by classic scaling studies (e.g., Kaplan et al., Chinchilla), which assume a monotonic, single-phase decay. Theoretical models (Maloney et al., 2022) show that such power-law scaling arises from a power-law spectrum in the data covariance, and breaking points appear when the data's intrinsic dimension is saturated, a phenomenon naturally modeled by the $D_l$ parameter in rectified laws.

Rectified scaling laws also interface with optimal data mixture scaling, where loss scaling functions of both mixture and scale allow for analytical computation of mixture weights and extrapolation to large-model, large-data regimes (Shukor et al., 12 Jul 2025). In all these cases, the addition of structural prior-knowledge or dataset complexity variables to the scaling law formula yields a more expressive and predictive tool for model training and deployment.

6. Practical Impact and Future Directions

Mathematically robust and empirically validated, rectified scaling laws guide principled selection and curation of models and data under compute or data constraints. Their adoption enables:

  • Rapid, resource-constrained model selection without needing to fine-tune every candidate on the full dataset;
  • Forecasting data requirements for desired performance targets, especially when high-quality organic data is unavailable;
  • Strategic allocation of compute across model scale and data size as a function of data complexity;
  • Rational benchmark design and dataset engineering, especially for long-range deployment of synthetic or curated corpora.

The identification of pre-power and power phases challenges the field to design scaling diagnostics and meta-learning procedures sensitive to initial conditions and pre-trained knowledge. A plausible implication is the eventual refinement of model pre-training objectives and synthetic data generators, with the aim of optimizing $D_l$ for downstream adaptation.


Summary Table: Rectified Scaling Law and Regimes

| Scaling Law Formula | Key Regime/Role | Main Parameter Addition |
|---|---|---|
| $\hat{L}(D) = \frac{B}{D^\beta} + E$ | Classical power law; fine-tuning with abundant data | None |
| $\hat{L}(D) = \frac{B}{D_l + D^\beta} + E$ | Rectified law; models with prior knowledge, small to moderate data | Pre-learned data size ($D_l$) |

In conclusion, rectified scaling laws constitute a mathematically justified and empirically established framework for fine-tuning and continual training of large models, providing resource-efficient, predictable, and robust solutions to selection and scaling problems in the modern era of data- and compute-intensive AI.
