Loss Scaling: Empirical Laws & Optimization
- Loss scaling is a quantitative framework that uses empirical power laws to describe how error metrics decrease with increased data, model complexity, and compute resources.
- It distinguishes between irreducible data noise and reducible error components, enabling precise resource allocation and hyperparameter optimization.
- Applied across deep learning, scientific computing, and quantum systems, loss scaling guides strategies like adaptive loss scaling and optimal compute partitioning for improved convergence.
Loss scaling refers to a spectrum of quantitative strategies and empirical laws describing how loss—broadly construed as an error metric or objective in learning, inference, or physical systems—changes under systematic variation of parameters such as data, model complexity, compute budget, training schedule, or resource allocation. In contemporary deep learning and related fields, loss scaling encompasses (1) power-law and joint scaling laws that predict loss trajectories as a function of model/data/compute, (2) mechanisms for adjusting or balancing loss components (notably in multi-objective optimization and mixed-precision numerics), and (3) resource-allocation policies driven by empirical loss scaling exponents. The concept is also foundational to benchmarking, hyperparameter optimization, and practical deployment decisions in machine learning, scientific computing, and experimental physics.
1. Structural Decomposition and Empirical Scaling Laws
Loss scaling in prediction tasks admits a precise decomposition into irreducible and reducible components. Given a classifier with predicted posterior $\hat{p}(y \mid x)$ and the true (oracle) posterior $p(y \mid x)$, the expected loss (cross-entropy risk) decomposes as

$$\mathcal{L} = \underbrace{H(Y \mid X)}_{\text{aleatoric}} + \underbrace{\mathbb{E}_x\!\left[D_{\mathrm{KL}}\big(p(\cdot \mid x)\,\|\,\hat{p}(\cdot \mid x)\big)\right]}_{\text{epistemic}},$$

where the aleatoric term is the data-intrinsic entropy floor and the epistemic term is the reducible gap. Empirically, the epistemic component follows a resource-dependent power law, $\mathcal{L}_{\mathrm{epi}}(N) = a\,N^{-\alpha}$, where $N$ is dataset size, $a$ is a fitted constant, and $\alpha$ is the scaling exponent characteristic of architecture and data regime. Notably, the total loss $\mathcal{L}$ may appear to plateau at the aleatoric floor, but $\mathcal{L}_{\mathrm{epi}}$ continues to diminish as data accrues, a crucial phenomenon for quantifying ongoing learning and optimal resource planning. For instance, on AFHQ and ImageNet-64, ResNet-50 attains a distinct fitted exponent $\alpha$ on each dataset, while MobileNetV3 reaches a higher $\alpha$ on AFHQ but with an offset constant $a$ much larger than ResNet-50's, reflecting rapid but less data-efficient convergence (Khorasani et al., 30 Jan 2026).
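As a concrete illustration, the sketch below computes the aleatoric/epistemic split for a batch of predicted and oracle posteriors and fits the exponent $\alpha$ in log space; the dataset sizes and "measurements" are synthetic placeholders, not values from the cited study.

```python
# Minimal sketch (not from the cited paper): decompose cross-entropy risk into
# aleatoric + epistemic parts, then fit eps(N) = a * N**(-alpha) in log space.
import numpy as np

def cross_entropy_decomposition(p_true, p_hat, eps=1e-12):
    """Expected CE = H(Y|X) (aleatoric) + E[KL(p || p_hat)] (epistemic).
    p_true, p_hat: (n_examples, n_classes) arrays of posteriors."""
    aleatoric = -np.sum(p_true * np.log(p_true + eps), axis=1).mean()
    epistemic = np.sum(p_true * np.log((p_true + eps) / (p_hat + eps)), axis=1).mean()
    return aleatoric, epistemic

def fit_power_law(N, eps_N):
    """Fit eps(N) = a * N**(-alpha) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(N), np.log(eps_N), 1)
    return np.exp(intercept), -slope  # (a, alpha)

# Synthetic epistemic-gap measurements at several dataset sizes:
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
eps_N = 2.0 * N ** -0.3                      # hypothetical clean power law
a, alpha = fit_power_law(N, eps_N)
print(f"a = {a:.2f}, alpha = {alpha:.2f}")   # recovers a = 2.0, alpha = 0.3
```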
2. Power-Law Exponents and Practical Scaling Laws Across Domains
Loss scaling adopts a universal power-law or shifted power-law form across model types and domains. For neural material models, validation loss $\mathcal{L}$ obeys

$$\mathcal{L}(x) = \mathcal{L}_\infty + a\,x^{-\alpha},$$

where the resource $x$ is varied over data points, parameter count, or total FLOPs. For EquiformerV2, fitted exponents are reported along each of these axes, and parameter scaling provides greater loss reduction per resource than dataset scaling (Trikha et al., 26 Sep 2025). Similarly, in LLMs, joint scaling laws of the canonical form

$$\mathcal{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

(graduating to two- or three-term joint fits) trace the full loss surface as a function of model size $N$, data $D$, and total tokens/steps/compute $C$. These exponents ($\alpha$, $\beta$) capture the given domain's fundamental data and model efficiency (Su et al., 2024).
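A pilot-run fit of the canonical two-term law can be done directly with `scipy.optimize.curve_fit`. In the sketch below the grid of model/data sizes is synthetic and the ground-truth coefficients are loosely modeled on published LLM fits, not taken from the cited papers.

```python
# Sketch: fit a two-term joint scaling law L(N, D) = E + A*N**(-alpha) + B*D**(-beta)
# to (model size, dataset size, loss) observations. All numbers are placeholders.
import numpy as np
from scipy.optimize import curve_fit

def joint_law(X, E, A, alpha, B, beta):
    N, D = X
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Hypothetical pilot-run grid: model sizes N (params), token counts D, losses L.
N = np.array([1e7, 1e7, 1e8, 1e8, 1e9, 1e9])
D = np.array([1e9, 1e10, 1e9, 1e10, 1e10, 1e11])
L = joint_law((N, D), 1.69, 406.4, 0.34, 410.7, 0.28)  # synthetic ground truth

popt, _ = curve_fit(joint_law, (N, D), L,
                    p0=[1.5, 300.0, 0.3, 300.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"E={E:.2f}, alpha={alpha:.3f}, beta={beta:.3f}")
```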
Mixture-of-Experts (MoE) models and dense Transformers share qualitatively identical scaling laws, with MoEs exhibiting greater benefit from model-scale increases at a fixed compute budget, as captured by the fitted allocation exponents: the MoE exponents favor allocating relatively more of the budget to model scale than the dense-model exponents do, and 8-expert MoEs achieve ∼5–10% lower test loss per FLOP than dense models under equal compute (Wang et al., 2024).
3. Loss Scaling in Mixed-Precision and Multi-Objective Settings
Loss scaling appears in mixed-precision training as a mechanism to preserve gradient fidelity. Standard global loss scaling multiplies the loss by a single scalar $s$ before the backward pass, preventing gradient underflow in FP16 or lower-precision arithmetic, and requires careful tuning to avoid overflow. Adaptive Loss Scaling (ALS) replaces this with per-layer scalars $s_\ell$ computed from local gradient statistics, maintaining a uniform underflow probability across layers. ALS thus removes the need for hyperparameter search, controls local numerical stability, and empirically yields slightly superior or at least comparable accuracy to the best static loss scale in both classification and detection tasks, at minimal overhead (Zhao et al., 2019).
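For the global baseline, the mechanics fit in a few lines. The sketch below is a generic dynamic loss-scaling loop in PyTorch, not the per-layer ALS scheme itself; in practice one would use `torch.cuda.amp.GradScaler`, and the 2000-step growth interval is a conventional but arbitrary choice.

```python
# Minimal sketch of global dynamic loss scaling for FP16 training.
import torch

def scaled_step(model, optimizer, loss, state):
    """state: dict with 'scale' (float) and 'good_steps' (int)."""
    optimizer.zero_grad()
    (loss * state["scale"]).backward()           # scale loss -> scaled gradients
    grads_finite = all(
        p.grad is None or torch.isfinite(p.grad).all()
        for p in model.parameters()
    )
    if grads_finite:
        for p in model.parameters():             # unscale before the update
            if p.grad is not None:
                p.grad.div_(state["scale"])
        optimizer.step()
        state["good_steps"] += 1
        if state["good_steps"] % 2000 == 0:      # grow scale after a stable run
            state["scale"] *= 2.0
    else:                                        # overflow: skip step, shrink scale
        state["scale"] /= 2.0
        state["good_steps"] = 0
```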
In multi-task and multi-scale PDE inversion, explicit network scaling separates the unit-shape (order-of-unity) neural parameter regime from the final output, allowing explicit per-task or per-component scaling of loss and derivatives. Dynamic scaling, using physics-driven weights based on order-of-magnitude estimates of each residual term, outperforms both GradNorm (which equalizes gradient norms) and SoftAdapt (which adapts weights using per-task loss drop) on accuracy and convergence stability across tasks with strongly varying intrinsic scales (Xu et al., 2024).
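One plausible reading of such physics-driven weighting, sketched below, divides each residual term by a detached order-of-magnitude estimate so every component enters the total loss at order unity; the term names and the choice of estimator are illustrative, not the cited paper's exact scheme.

```python
# Sketch: dynamic per-term weighting from order-of-magnitude residual estimates.
import torch

def weighted_total_loss(residuals):
    """residuals: dict mapping term name -> tensor of raw residuals."""
    total = 0.0
    for name, r in residuals.items():
        # Magnitude estimate, detached so the weight carries no gradient.
        scale = r.detach().abs().mean().clamp_min(1e-8)
        total = total + (r / scale).pow(2).mean()
    return total

# e.g. total = weighted_total_loss({"pde": pde_res, "bc": bc_res, "data": data_res})
```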
Stochastic loss scaling randomly samples weightings for composite loss components at each step, introducing diversity that can facilitate escaping poorly conditioned minima in highly non-convex landscapes. Fixed- or annealed-variance schedules (e.g., a constant sampling variance, or one decayed over training) can yield modest improvements in convergence and final error for physics-informed neural networks (Mills et al., 2022).
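A minimal sketch of such stochastic reweighting, with a hypothetical log-normal perturbation and an optional exponential anneal (the `sigma0` and `tau` values are placeholders, not from the paper):

```python
# Sketch: randomly reweight composite loss terms each step.
import math
import torch

def stochastic_loss(loss_terms, step, sigma0=0.5, tau=None):
    """loss_terms: list of scalar loss tensors. Returns randomly reweighted sum."""
    sigma = sigma0 if tau is None else sigma0 * math.exp(-step / tau)  # anneal
    weights = torch.exp(sigma * torch.randn(len(loss_terms)))          # log-normal
    return sum(w * l for w, l in zip(weights, loss_terms))
```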
4. Loss Scaling in Physical and Quantum Systems
Beyond machine learning, loss scaling characterizes system performance and physical loss in the presence of geometric, material, or probabilistic constraints. In superconducting microwave resonators, two-level-system (TLS) dielectric loss scales as a power of the geometric dimension, $\delta_{\mathrm{TLS}} \propto w^{-\beta}$ for a characteristic feature size $w$ and fitted exponent $\beta$, and the total loss contribution is determined by both the filling factor of each region and its intrinsic loss tangent. Accurate scaling predictions require 3D Maxwell–London simulation and detailed knowledge of device geometry and materials (Niepce et al., 2019).
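A back-of-envelope version of the filling-factor-weighted sum, $1/Q_{\mathrm{TLS}} = \sum_i F_i \tan\delta_i$, is shown below; the region names and numbers are illustrative, and real predictions need the simulated field distributions to obtain the $F_i$.

```python
# Sketch: TLS-limited quality factor from per-region filling factors and loss
# tangents. All values are hypothetical placeholders.
filling_factors = {"substrate-air": 2e-3, "metal-substrate": 5e-3, "metal-air": 1e-3}
loss_tangents   = {"substrate-air": 1e-3, "metal-substrate": 3e-3, "metal-air": 2e-3}

inv_Q_tls = sum(filling_factors[r] * loss_tangents[r] for r in filling_factors)
print(f"TLS-limited quality factor ~ {1.0 / inv_Q_tls:.2e}")
```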
In optical quantum networks, the rate-loss scaling law governs entanglement distribution: direct transmission yields a rate $R = O(\eta)$ in the channel transmittance $\eta$, but polarization–photon-number hybrid sources with single-click entanglement swapping achieve $R = O(\sqrt{\eta})$, matching the scaling of a one-hop quantum repeater but without the necessity of quantum memories. This $O(\sqrt{\eta})$ scaling extends practical link distances and enhances network throughput (Shimizu et al., 20 Jul 2025).
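The practical gap between the two scalings is easy to tabulate. The sketch below assumes a typical 0.2 dB/km telecom fiber, so $\eta = 10^{-0.2 L / 10}$ at link length $L$ km; the attenuation constant and distances are illustrative, and only the relative scaling (not absolute rates) is meaningful here.

```python
# Sketch: rate-loss scaling, direct transmission R = O(eta) vs. R = O(sqrt(eta)).
import numpy as np

L_km = np.array([50, 100, 200, 400])
eta = 10 ** (-0.2 * L_km / 10)          # 0.2 dB/km fiber transmittance

for L, e in zip(L_km, eta):
    print(f"L={L:>3d} km   direct ~ {e:.2e}   one-swap ~ {np.sqrt(e):.2e}")
```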
5. Loss Scaling, Allocation Strategies, and Predictive Utility
Empirical scaling exponents and constants enable quantitative planning and trade-off analysis for model training and inference. Given a loss scaling law $\mathcal{L}(x) = \mathcal{L}_\infty + a\,x^{-\alpha}$, the resource requirement for a target loss $\mathcal{L}^{*} > \mathcal{L}_\infty$ is $x^{*} = \left(a / (\mathcal{L}^{*} - \mathcal{L}_\infty)\right)^{1/\alpha}$. For joint scaling laws, optimal compute allocation (i.e., how to split a fixed budget between model scale and dataset size) is dictated by the relative exponents $\alpha$ and $\beta$ and the fitted prefactors, with closed-form optimality conditions derived in terms of log-linear allocation (Su et al., 2024, Wang et al., 2024).
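Inverting the shifted power law gives the closed form directly; the fitted constants in this sketch are placeholders, not values from the cited fits.

```python
# Sketch: resource x* required to reach a target loss under
# L(x) = L_inf + a * x**(-alpha), i.e. x* = (a / (L* - L_inf))**(1/alpha).
def required_resource(target_loss, L_inf, a, alpha):
    if target_loss <= L_inf:
        raise ValueError("target below the irreducible floor is unreachable")
    return (a / (target_loss - L_inf)) ** (1.0 / alpha)

# Hypothetical fit (loss in nats vs. parameter count):
print(f"params needed: {required_resource(1.9, L_inf=1.7, a=400.0, alpha=0.34):.3e}")
```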
Loss-to-loss scaling laws provide high-confidence transferability predictions between pretraining and downstream task performance; empirically, the relationship is a shifted power law between source and target losses,

$$\mathcal{L}_{\mathrm{tgt}} = K\,(\mathcal{L}_{\mathrm{src}} - E_{\mathrm{src}})^{\kappa} + E_{\mathrm{tgt}},$$

with the fitted coefficients determined almost entirely by the pretraining data distribution and tokenizer. Architectural and optimization hyperparameters affect only the subleading terms, focusing the practitioner's leverage on data selection and curation (Mayilvahanan et al., 17 Feb 2025).
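Applying such a fit is a one-liner once the coefficients are estimated; all numeric values below are hypothetical.

```python
# Sketch: predict target-domain loss from source (pretraining) loss via the
# shifted power law above. K, kappa, E_src, E_tgt come from a small pilot fit.
def loss_to_loss(L_src, K, kappa, E_src, E_tgt):
    return K * (L_src - E_src) ** kappa + E_tgt

print(loss_to_loss(L_src=2.4, K=1.1, kappa=0.9, E_src=1.7, E_tgt=0.8))
```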
6. Limitations, Regime Change, and Interpretative Cautions
All reported loss scaling laws exhibit strong empirical validity within the training, model, and domain ranges tested. However, exponents and prefactors may shift under regime transitions, for example when entering non-monotonic learning in parameter-sparse models, overfitting at ultra-large data scales, or saturating at dataset-intrinsic entropy floors. Because fitted constants can vary by up to an order of magnitude under minor technical changes (context window, tokenizer, domain), resource allocation and out-of-domain prediction require renewed coefficient estimation on a regime-matched pilot run (Su et al., 2024, Trikha et al., 26 Sep 2025).
Mitigating loss deceleration and destructive interference (zero-sum learning)—which present as transitions in scaling slope or as nonstationarities in per-example gradient alignment—remains an active area, with optimizer schedule, gradient-surgery, and per-task weighting as proposed interventions (Mircea et al., 5 Jun 2025).
7. Broader Impact and Universal Relevance
Loss scaling unifies disparate phenomena across fields—deep learning, quantum networking, resonator physics, and computational PDEs—by exposing the fundamental exponents and allocation rules that dictate progress with increased resources. Accurate knowledge of scaling laws enables not only rational experiment design, resource budgeting, and deployment but also insight into the underlying limitations of any supervised, unsupervised, or physical inference task. The predictive reliability and transferability of loss scaling laws position them as central to both theoretical analysis and practical engineering of large-scale systems (Khorasani et al., 30 Jan 2026, Trikha et al., 26 Sep 2025, Wang et al., 2024, Zhao et al., 2019, Xu et al., 2024, Su et al., 2024).