Additive Scaling Law: Principles & Applications

Updated 26 March 2026

Additive scaling laws define system responses as the sum of independent power-law contributions, capturing saturation and irreducible noise floors.
They are applied across machine learning, quantized regression, physical aggregation, and financial modeling to offer interpretable parameter control.
Their tractable calibration and clear attribution of resource limitations enable counterfactual reasoning in complex, scaling systems.

The additive scaling law refers to a class of quantitative relationships in which the system-level response variable is expressed as a sum (or sum-like combination) of separate contributions, each obeying an explicit power-law or analytic form in the relevant scaling parameters. This paradigm emerges across domains including stochastic analysis, aggregation kinetics, quantitative linguistics, machine learning, financial mathematics, and statistical mechanics, and is distinct from multiplicative, purely self-similar, or fully interacting scaling regimes. Additive scaling laws offer tractability, independent control over different scale-determining factors, and sharply interpretable limits, often capturing saturation, irreducible error, or external noise floors. This article provides a technical overview of the formulation, derivation, calibration, and functional implications of additive scaling laws in representative settings.

1. Fundamental Formulation and Mathematical Structure

In the canonical form, an additive scaling law for a response variable $L$ associated with scaling variables $(x_1, \ldots, x_n)$ is expressed as

$L(x_1,\ldots,x_n) = \sum_{i=1}^n A_i x_i^{-\alpha_i} + L_{\infty}$

where each term represents the contribution of one limiting resource (often model size, data size, number of experts, etc.), $A_i$ are scale coefficients, $\alpha_i$ are scaling exponents, and $L_{\infty}$ is an irreducible floor. Variants may adopt nested power means (generalized means), logarithmic terms, or grouped symbolic expressions, but retain a strictly additive structure at the top level (Lin et al., 27 Jul 2025, Su et al., 2024, Droppo et al., 2021).
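As a minimal sketch of the canonical form, the function below evaluates the sum of power-law terms plus the floor. The function name and all coefficient values are hypothetical and chosen only for illustration; they are not fitted to any data set.

```python
from typing import Sequence


def additive_law(x: Sequence[float], A: Sequence[float],
                 alpha: Sequence[float], L_inf: float) -> float:
    """Evaluate L = sum_i A_i * x_i**(-alpha_i) + L_inf."""
    if not (len(x) == len(A) == len(alpha)):
        raise ValueError("x, A and alpha must have the same length")
    return sum(a * xi ** (-al) for a, xi, al in zip(A, x, alpha)) + L_inf


# Hypothetical coefficients for two resources (illustration only).
A = [400.0, 410.0]
alpha = [0.34, 0.28]
L_inf = 1.7

loss_small = additive_law([1e8, 1e9], A, alpha, L_inf)
loss_large = additive_law([1e11, 1e12], A, alpha, L_inf)
print(loss_small, loss_large)  # both values stay above the floor L_inf
```

With positive exponents, every term shrinks as its resource grows, so the response approaches $L_{\infty}$ from above but never crosses it.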

For instance, in large-scale neural network pretraining, the test loss can be written with explicit additive power-law terms in model size $N$, training steps $S$, and batch size $B$ (Su et al., 2024): $L(N,S,B) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S}\right)^{\alpha_S} \left(1+\frac{B(L)}{B}\right)^{\alpha_S}$ where $N_c$, $S_c$, $\alpha_N$, and $\alpha_S$ are fitted constants. The critical batch size $B(L)$, which encodes the optimal batch size for minimizing loss, depends on the loss itself, so the relation is implicit in $L$.
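Because $B(L)$ appears on the right-hand side, evaluating this law means solving for $L$. The sketch below does so by bisection, under the assumption that $B(L)$ is non-increasing in $L$, which makes the solution unique. The power-law form used for `b_crit` and every constant are hypothetical placeholders, not values taken from this article.

```python
from typing import Callable


def law_rhs(L: float, N: float, S: float, B: float, N_c: float, S_c: float,
            alpha_N: float, alpha_S: float,
            b_crit: Callable[[float], float]) -> float:
    """Right-hand side of the law, evaluated with critical batch size b_crit(L)."""
    model_term = (N_c / N) ** alpha_N
    step_term = (S_c / S) ** alpha_S * (1.0 + b_crit(L) / B) ** alpha_S
    return model_term + step_term


def solve_loss(tol: float = 1e-12, max_iter: int = 200, **kw) -> float:
    """Solve L = law_rhs(L) by bisection (assumes b_crit is non-increasing in L)."""
    lo = (kw["N_c"] / kw["N"]) ** kw["alpha_N"]  # model term alone: rhs(lo) > lo
    hi = law_rhs(lo, **kw)                       # rhs is non-increasing: rhs(hi) <= hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if law_rhs(mid, **kw) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical constants and an assumed form B(L) = B_star / L**(1 / alpha_B).
params = dict(
    N=1e9, S=1e5, B=1e6,
    N_c=8.8e13, S_c=2.1e3, alpha_N=0.076, alpha_S=0.76,
    b_crit=lambda L: 2.0e8 / L ** (1.0 / 0.21),
)
L_hat = solve_loss(**params)
print(L_hat)
```

Bisection is used here rather than fixed-point iteration because it converges whenever the monotonicity assumption holds, regardless of how steeply $B(L)$ varies.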

In quantized high-dimensional regression, the additive scaling law for population risk incorporates explicit additive error due to quantization: $\mathbb{E}\bigl[R(\bar v)\bigr] \approx R^* + A M_{\mathrm{eff}}^{-(a-1)} + B N_{\mathrm{eff}}^{-(a-1)/a} + E + C\delta$ where $M_{\mathrm{eff}}, N_{\mathrm{eff}}$ are effective model and sample sizes accounting for quantization, and $C\delta$ is a signal-independent additive floor set by the quantization variance (Zhang et al., 22 Feb 2026).

Such explicit additive decompositions contrast with multiplicative or intertwined scaling laws, permitting sharp attribution of system limitations to individual factors.
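The attribution described here can be sketched by computing each term's share of the total response. The dictionary keys, the helper name `term_shares`, and the numerical values are assumptions made for illustration only.

```python
from typing import Dict


def term_shares(x: Dict[str, float], A: Dict[str, float],
                alpha: Dict[str, float], L_inf: float) -> Dict[str, float]:
    """Return each term's fraction of the total response, including the floor."""
    terms = {name: A[name] * x[name] ** (-alpha[name]) for name in x}
    terms["floor"] = L_inf  # assumes no resource is itself named "floor"
    total = sum(terms.values())
    return {name: value / total for name, value in terms.items()}


# Hypothetical coefficients (illustration only).
shares = term_shares(
    x={"N": 1e9, "D": 1e10},
    A={"N": 400.0, "D": 410.0},
    alpha={"N": 0.34, "D": 0.28},
    L_inf=1.7,
)
reducible = {k: v for k, v in shares.items() if k != "floor"}
bottleneck = max(reducible, key=reducible.get)  # largest reducible term
print(shares, bottleneck)
```

A multiplicative law offers no comparable breakdown, since its factors cannot be separated into shares that sum to the total.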

2. Domain-Specific Manifestations

2.1 Machine Learning and Large-Scale Models

Additive scaling laws in neural network pretraining, fine-tuning, and model selection appear as sums of power laws in model size, data volume, and other architectural variables. For example, additive laws accurately predict test loss, perplexity, or accuracy for transformer-based models with up to tens of billions of parameters as confirmed empirically and theoretically (Su et al., 2024, Droppo et al., 2021, Lin et al., 27 Jul 2025). The additive structure defines:

A nonzero irreducible loss floor $L_\infty$.
Power-law decay in loss due to increasing data ($D$) and model parameters ($N$), with exponents fitted on small models but robustly transferring to large regimes.
Saturation phenomena, where loss cannot be reduced below $L_\infty$ regardless of orders-of-magnitude increases in $N$ or $D$.

Automated scaling law discovery frameworks (e.g., EvoSLD (Lin et al., 27 Jul 2025)) systematically recover additive laws by searching over symbolic expressions parameterized by group-wise response data and control variables, finding that parsimonious additive sum-of-monomial structures outperform more complex alternatives.

2.2 Quantized Regression and Additive Noise

In high-dimensional linear regression with quantized arithmetic, additive scaling laws capture the irreducible noise floor imposed by quantization and its impact on effective model dimensionality. For additive quantization, the model-agnostic floor and parameter shrinkage are quantified precisely via spectral decay rates of the data covariance and bit precision (Zhang et al., 22 Feb 2026). Specifically, additive quantization yields effective shrinkage: $M_{\mathrm{eff}} \approx M \left(\delta M^{a} + 1\right)^{-1/(a-1)}$ and an explicit additive term $C\delta$ in the scaling law for risk.
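The shrinkage formula can be evaluated directly. The sketch below assumes $a > 1$, so that the exponent $-1/(a-1)$ is well defined and negative; the function name and the sample values of $M$, $a$, and $\delta$ are hypothetical.

```python
def effective_model_size(M: float, delta: float, a: float) -> float:
    """M_eff = M * (delta * M**a + 1) ** (-1 / (a - 1)), as stated above."""
    if a <= 1.0:
        raise ValueError("this sketch assumes a > 1")
    return M * (delta * M ** a + 1.0) ** (-1.0 / (a - 1.0))


# Hypothetical model size and decay parameter (illustration only).
M, a = 1000.0, 2.0
for delta in (0.0, 1e-8, 1e-6, 1e-4):
    print(delta, effective_model_size(M, delta, a))
```

With $\delta = 0$ the formula returns $M$ unchanged, and any $\delta > 0$ gives $M_{\mathrm{eff}} < M$.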

2.3 Aggregation Kinetics and Physical Systems

Additive rules in aggregation models (single-monomer transfer kinetics) produce scaling laws for temporal evolution of cluster size distributions. The master equation with additive rules admits scaling solutions of the form $f(n,t) = t^{-\alpha} F(n/t^\beta)$. Two limiting geometries (pile-up and wall) exhibit distinct scaling exponents and behavior, determined solely by the additive rules and boundary conditions. Scale-free exponential (wall regime) or reflected Gaussian (pile-up regime) distributions emerge without recourse to nonlinearities, clarifying universality classes in aggregation phenomena (Gordienko, 2011).
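A scaling solution of this form implies a data collapse: plotting $t^{\alpha} f(n,t)$ against $n/t^{\beta}$ maps curves from different times onto the single function $F$. The sketch below illustrates this with an exponential $F$. The exponents $\alpha = 1$, $\beta = 1/2$ and the helper names are hypothetical and are not the values derived for either geometry.

```python
import math
from typing import Callable, List, Tuple


def cluster_density(n: float, t: float, alpha: float, beta: float,
                    F: Callable[[float], float]) -> float:
    """Scaling form f(n, t) = t**(-alpha) * F(n / t**beta)."""
    return t ** (-alpha) * F(n / t ** beta)


def collapse(ns: List[float], fs: List[float], t: float,
             alpha: float, beta: float) -> List[Tuple[float, float]]:
    """Map (n, f) pairs at time t to rescaled (n / t**beta, t**alpha * f) pairs."""
    return [(n / t ** beta, t ** alpha * f) for n, f in zip(ns, fs)]


# Hypothetical exponents and scaling function (illustration only).
alpha, beta = 1.0, 0.5
F = lambda x: math.exp(-x)

curves = {}
for t in (100.0, 10_000.0):
    ns = [x * t ** beta for x in (0.5, 1.0, 2.0)]  # same rescaled positions
    fs = [cluster_density(n, t, alpha, beta, F) for n in ns]
    curves[t] = collapse(ns, fs, t, alpha, beta)

print(curves[100.0])
print(curves[10_000.0])  # matches the first curve up to rounding
```

In practice the exponents are unknown, and the quality of the collapse under trial values of $\alpha$ and $\beta$ is one way to estimate them.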

3. Generalized Stochastic Process Limits

Scaling limits of additive functionals in Markov processes, particle systems, or interacting diffusions show convergence under scaling to Lévy subordinators. Abstractly, one considers scaled additive functionals $A_n(t)$ and proves, under fast-mixing and equicontinuity conditions, weak convergence to a nondecreasing Lévy process $A(\cdot)$, with a Laplace exponent determined by the scaling of the underlying processes (Taillefumier et al., 2024, Gonçalves et al., 2011). The Lévy–Khinchine representation controls the jump and drift structure:

$\mathbb{E}[e^{-\mu A(t)}] = e^{-t \Phi(\mu)}$

with

$\Phi(\mu) = d \mu + \int_{0}^{\infty} (1 - e^{-\mu x}) \, \Pi(dx)$

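where $d \ge 0$ is the drift coefficient and $\Pi$ is the Lévy measure governing the jumps. This representation links the scaling law of the additive functional to process-specific structural parameters.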
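As a worked numerical check of the Laplace exponent, the sketch below integrates $\Phi(\mu)$ for one concrete subordinator chosen purely for illustration: a drift plus compound Poisson jumps with exponentially distributed sizes, for which $\Pi(dx) = \lambda \theta e^{-\theta x}\,dx$ and $\Phi(\mu) = d\mu + \lambda\mu/(\theta+\mu)$ in closed form. The midpoint rule, the truncation point `x_max`, and all parameter values are assumptions of this sketch.

```python
import math
from typing import Callable


def laplace_exponent(mu: float, drift: float,
                     levy_density: Callable[[float], float],
                     x_max: float, n_steps: int = 200_000) -> float:
    """Phi(mu) = drift*mu + int_0^inf (1 - exp(-mu*x)) Pi(dx), by the midpoint rule.

    Assumes the Levy measure has a density with negligible mass beyond x_max.
    """
    h = x_max / n_steps
    integral = 0.0
    for k in range(n_steps):
        x = (k + 0.5) * h
        integral += (1.0 - math.exp(-mu * x)) * levy_density(x)
    return drift * mu + integral * h


# Hypothetical parameters: drift d, jump rate lam, exponential jump-size rate theta.
d, lam, theta = 0.1, 2.0, 3.0
mu, t = 1.5, 2.0

phi_numeric = laplace_exponent(
    mu, d, lambda x: lam * theta * math.exp(-theta * x), x_max=20.0
)
phi_closed = d * mu + lam * mu / (theta + mu)
laplace_transform = math.exp(-t * phi_numeric)  # E[exp(-mu * A(t))]
print(phi_numeric, phi_closed, laplace_transform)
```

Lévy measures with infinite activity near zero would need a finer treatment of the origin than this uniform grid provides.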

Scaling limits yield non-Gaussian, non-self-similar scaling behaviors, crucial for doubly-stochastic models (e.g., collective synchrony in neuroscience) and occupation-time distributions in interacting particle systems (Taillefumier et al., 2024, Gonçalves et al., 2011).

4. Statistical Linguistics and Additive Frequency Scaling

Additive scaling laws are prominent in quantitative linguistics, where the distribution of absolute word frequencies $D_L(n)$ in a text of length $L$ (here $L$ denotes text length, not a loss) collapses when the frequency $n$ is rescaled by the text length: $D_L(n) = \frac{1}{L V_L} g\left(\frac{n}{L}\right)$ for a scaling function $g$ independent of $L$, with $V_L$ the vocabulary size (Font-Clos et al., 2013). For lemmatized texts, $g(x)$ is often a double power law, and the single scaling parameter (text length $L$) is sufficient for collapse across disparate scales, unifying Zipf and Heaps laws as manifestations of the same underlying additive scaling relation.
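The rescaling can be sketched for a tokenized text as follows. The sketch assumes $D_L(n)$ is the fraction of vocabulary entries that occur exactly $n$ times, so that $L V_L D_L(n)$ estimates $g(n/L)$; the function name and the toy input are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def rescaled_frequency_distribution(tokens: List[str]) -> List[Tuple[float, float]]:
    """Return (n / L, L * V_L * D_L(n)) pairs, i.e. empirical estimates of g(n / L)."""
    L = len(tokens)
    word_counts = Counter(tokens)               # word -> absolute frequency n
    V_L = len(word_counts)                      # vocabulary size
    types_per_n = Counter(word_counts.values()) # n -> number of words with frequency n
    points = []
    for n, n_types in sorted(types_per_n.items()):
        D = n_types / V_L                       # D_L(n)
        points.append((n / L, L * V_L * D))
    return points


# Toy input (illustration only); real analyses use texts of many thousands of tokens.
tokens = "a a a b b c".split()
points = rescaled_frequency_distribution(tokens)
print(points)
```

Applying the function to prefixes of one text with different lengths and overlaying the resulting points is a direct test of whether a single $g$ describes all of them.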

5. Additive Scaling in Financial Models

Additive scaling laws underpin the construction of additive process models, particularly additive normal tempered stable (ATS) processes for equity volatility surfaces (Azzone et al., 2021, Azzone et al., 2019). In the ATS framework, key model parameters such as jump intensity $k_t$ and skew $\eta_t$ obey independent power-law scaling in time to maturity $t$, with exponents calibrated to market data: $k_t = \bar{k} t^{\beta} \quad \text{and} \quad \eta_t = \bar{\eta} t^{\delta}$ with empirical findings $\beta \approx 1,\, \delta \approx -1/2$ reproducing the $t^{-1/2}$ decay of implied volatility skew observed in markets. The model remains additive and allows the time structure of smiles and skews to be captured with just two independent exponents (Azzone et al., 2021).
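A short sketch of the two power-law term structures follows. The prefactors $\bar{k}$ and $\bar{\eta}$ are hypothetical placeholders, and the default exponents are set to the approximate empirical values quoted above.

```python
from typing import Tuple


def ats_parameters(t: float, k_bar: float, eta_bar: float,
                   beta: float = 1.0, delta: float = -0.5) -> Tuple[float, float]:
    """Return (k_t, eta_t) with k_t = k_bar * t**beta and eta_t = eta_bar * t**delta."""
    if t <= 0:
        raise ValueError("time to maturity must be positive")
    return k_bar * t ** beta, eta_bar * t ** delta


# Hypothetical prefactors (illustration only).
k_bar, eta_bar = 1.0, 2.0
for t in (0.25, 1.0, 4.0):
    print(t, ats_parameters(t, k_bar, eta_bar))
```

With $\delta = -1/2$, quadrupling the maturity halves $\eta_t$, which is the square-root decay pattern described in the text.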

6. Calibration, Prediction, and Structural Implications

The methodology of additive scaling law analysis prescribes:

Fitting scaling exponents and constants using small models or data subsets (for neural scaling laws: linear regression in log-space over converged loss datasets (Su et al., 2024)).
Validating extrapolation by empirically verifying the sum-like law up to the largest accessible scale (e.g., LLMs up to 33B parameters (Su et al., 2024), acoustic models across two orders of magnitude (Droppo et al., 2021)).
Comparing quantitatively against alternative hypothesis classes (symbolic regression, multiplicative laws); such comparisons show additive sum-of-powers models fitting real-world grouped data sets better, with significant reductions in normalized mean squared error and robustness to overfitting (Lin et al., 27 Jul 2025).
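The fit-then-extrapolate workflow above can be sketched for a single power-law term using ordinary least squares in log-space. The sketch assumes the floor $L_\infty$ is already known (fitting it jointly would need a nonlinear optimizer) and uses noise-free synthetic data generated from hypothetical coefficients.

```python
import math
from typing import List, Tuple


def fit_power_law(xs: List[float], losses: List[float],
                  L_inf: float = 0.0) -> Tuple[float, float]:
    """Fit loss - L_inf = A * x**(-alpha) by least squares on logs; return (A, alpha)."""
    u = [math.log(x) for x in xs]
    v = [math.log(loss - L_inf) for loss in losses]  # requires loss > L_inf
    n = len(u)
    u_mean, v_mean = sum(u) / n, sum(v) / n
    slope = (sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))
             / sum((a - u_mean) ** 2 for a in u))
    intercept = v_mean - slope * u_mean
    return math.exp(intercept), -slope


# Synthetic "small-scale" measurements from hypothetical true coefficients.
A_true, alpha_true, L_inf = 400.0, 0.34, 1.7
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [A_true * x ** (-alpha_true) + L_inf for x in small_sizes]

A_fit, alpha_fit = fit_power_law(small_sizes, small_losses, L_inf)

# Extrapolate to a much larger scale and compare with the generating law.
x_large = 3.3e10
predicted = A_fit * x_large ** (-alpha_fit) + L_inf
actual = A_true * x_large ** (-alpha_true) + L_inf
print(alpha_fit, predicted, actual)
```

On real measurements the residual between `predicted` and the observed large-scale loss is the validation signal; here it is near zero only because the data are synthetic and noise-free.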

Additive forms are mechanistically motivated by the independent bottleneck effect: when one resource (model size, data, bits of precision, training steps) is scarce relative to the others, its term dominates the sum, and scaling the remaining resources cannot push the response below that term plus the floor. The inclusion of an irreducible floor, explicit additive quantization error, or data-vs-model trade-off constants directly encodes physical, statistical, or resource-constrained regimes inaccessible to strictly multiplicative or self-similar laws.

7. Limitations, Extensions, and Interpretability

While additive scaling laws are broadly successful across domains, predictive accuracy has typically been established only within the validated scaling range (e.g., up to 33B parameters for transformers, hours to thousands of hours for acoustic models, or specific time horizons in quantitative finance) (Su et al., 2024, Droppo et al., 2021, Azzone et al., 2021). Extrapolation to novel architectural regimes (e.g., mixture-of-experts, very long contexts, or extreme quantization) may violate key assumptions, requiring re-estimation of exponents or the introduction of correction terms. For finite data or non-converged optimization, early behavior may deviate from the additive form, and tuning of learning rates or process-specific constants is necessary for reliable prediction.

Additive scaling laws offer unique interpretability: each term distinctly identifies the contribution and saturation behavior of a single resource or parameter, and the law enables tractable, group-wise counterfactual reasoning and optimization over hyperparameters or experimental design (Lin et al., 27 Jul 2025, Droppo et al., 2021, Su et al., 2024). In stochastic process and physical systems formulations, additive scaling is linked to non-self-similar scaling limits, Lévy subordinators, and new universality classes. In linguistic and aggregation settings, single-scale additive laws expose the surprisingly simple mechanisms underlying complex emergent distributions.
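The counterfactual reasoning mentioned above has a simple form under an additive law: the change in response from scaling one resource depends only on that resource's own term, because the other terms and the floor cancel in the difference. The sketch below checks this numerically; helper names and coefficients are hypothetical.

```python
from typing import Dict


def additive_loss(x: Dict[str, float], A: Dict[str, float],
                  alpha: Dict[str, float], L_inf: float) -> float:
    """L = sum over resources of A * x**(-alpha), plus the floor."""
    return sum(A[k] * x[k] ** (-alpha[k]) for k in x) + L_inf


def gain_from_scaling(x: Dict[str, float], A: Dict[str, float],
                      alpha: Dict[str, float], name: str,
                      factor: float = 2.0) -> float:
    """Loss reduction from multiplying resource `name` by `factor`, others fixed."""
    term = A[name] * x[name] ** (-alpha[name])
    return term * (1.0 - factor ** (-alpha[name]))


# Hypothetical coefficients (illustration only).
x = {"N": 1e9, "D": 1e10}
A = {"N": 400.0, "D": 410.0}
alpha = {"N": 0.34, "D": 0.28}
L_inf = 1.7

gains = {name: gain_from_scaling(x, A, alpha, name) for name in x}
best = max(gains, key=gains.get)  # resource whose doubling helps most
print(gains, best)
```

Comparing the entries of `gains` answers a question such as "would doubling the data or doubling the model help more?" without re-evaluating the full law.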