Unified Loss Scaling Law for QAT
- The paper introduces a unified loss scaling law that mathematically integrates model size, training tokens, quantization granularity, and bit-width to predict quantization errors.
- It demonstrates how gradient estimation and loss regularization via controlled noise and adaptive scaling lead to stable convergence and flatter minima.
- The framework enables precise compute allocation across full-precision and quantized training, guiding optimal resource planning and QAT configurations.
Quantization-aware training (QAT) is a method for enabling high-fidelity model quantization by incorporating quantization effects within the optimization loop. As deep learning models and deployment platforms exhibit massive diversity in architecture, computational budget, and quantization regime, the search for a unified loss scaling law in QAT has become central to both theory and practice. The current landscape, established in recent literature, describes unified loss scaling laws that integrate model size, training data, quantization granularity, bit-width, and compute allocation with procedures for stable and performant optimization. These laws now inform optimal QAT configurations, resource allocation, and estimator design.
1. Theoretical Formulation of the Unified Loss Scaling Law
Recent advances have produced explicit mathematical formulations for QAT loss scaling, taking the form

$$\mathcal{L}_{\mathrm{QAT}}(N, D_{\mathrm{FP}}, D_{\mathrm{QAT}}, B) \;=\; \mathcal{L}_{\mathrm{FP}}(N, D) \;+\; \delta_{\mathrm{QAT}}(N, D_{\mathrm{QAT}}, B, G),$$

where $D = D_{\mathrm{FP}} + D_{\mathrm{QAT}}$ is the total token budget and:
- $N$: model size (parameter count)
- $D_{\mathrm{FP}}$, $D_{\mathrm{QAT}}$: numbers of tokens for full-precision vs. quantized training
- $B$: quantization bit-width
- $\mathcal{L}_{\mathrm{FP}}(N, D)$: scaling law for full-precision loss, usually following a "Chinchilla-like" law
- $\delta_{\mathrm{QAT}}$: the quantization-induced penalty term, capturing the additional error that arises from both quantization and the allocation of compute to QAT (Dremov et al., 26 Sep 2025).
A typical instantiated penalty model is

$$\delta(N, D, G) \;=\; k \cdot \frac{D^{\gamma_D}\,(\ln G)^{\gamma_G}}{N^{\gamma_N}},$$

where $G$ is the quantization group size, $D$ the number of training tokens, and $\gamma_N$, $\gamma_D$, $\gamma_G$ are scaling exponents.
The response of the quantized loss to this set of control variables is governed by universal trends:
- Quantization error decreases with model size ($N^{-\gamma_N}$).
- Quantization error increases with number of training tokens ($D^{\gamma_D}$).
- Quantization error increases with coarser quantization (higher $G$).
- Bit-width affects the effective capacity and the quantization error, often in a non-linear but predictable way.
This law supports compositionality, allowing for integration of further compression factors such as sparsity, and defines scaling in terms of a "capacity-equivalent" parameter count $N_{\mathrm{eff}} = C(R)\cdot N$, where $C(R)$ is a representation capacity function quantifying the parameter efficiency of the quantization format $R$ as a function of, for example, the Gaussian mean squared error (GMSE) of the format (Panferov et al., 2 Jun 2025).
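Below is a minimal numerical sketch of this composite law in Python. The coefficient values are purely illustrative placeholders, not the fitted values from the cited papers; the sketch only shows how a Chinchilla-style $\mathcal{L}_{\mathrm{FP}}$ composes with the penalty term above.

```python
import numpy as np

# Illustrative-only coefficients: real values are fitted per model family and
# quantization format; none of these numbers come from the cited papers.
E, A, alpha, B_fp, beta = 1.7, 400.0, 0.34, 410.0, 0.28   # Chinchilla-style L_FP terms
k, gamma_N, gamma_D, gamma_G = 0.1, 0.40, 0.10, 0.50      # penalty coefficient and exponents

def loss_fp(N, D):
    """Full-precision 'Chinchilla-like' loss L_FP(N, D)."""
    return E + A / N**alpha + B_fp / D**beta

def qat_penalty(N, D, G):
    """delta(N, D, G): shrinks with model size, grows with tokens and coarser groups."""
    return k * D**gamma_D * np.log(G)**gamma_G / N**gamma_N

def loss_qat(N, D, G):
    """Unified QAT loss: full-precision law plus the quantization-induced penalty."""
    return loss_fp(N, D) + qat_penalty(N, D, G)

# Example: 1B parameters, 100B tokens, group size 128 vs. 32 (finer groups -> smaller penalty).
print(loss_qat(1e9, 1e11, 128), loss_qat(1e9, 1e11, 32))
```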
2. Gradient Estimation, Loss Regularization, and Stability
Gradient approximation under quantization is a central technical challenge. Traditional straight-through estimators (STE) offer a biased, locally constant gradient; more contemporary schemes modulate this with learnable scales and controlled noise:
$$\hat{w} \;=\; q(w) \;+\; \lambda(t)\,\mathrm{sg}\big(w - q(w)\big)\cdot u, \qquad u \sim \mathcal{U}(0,1),$$

where $\lambda(t)$ modulates exponentially decaying quantization-error-aware noise, and $\mathrm{sg}(\cdot)$ indicates a stop-gradient operation to limit bias (Wang et al., 2022).
Simultaneously, the quantization step size may be learned, providing direct control over the scaling of task loss gradients through quantizers (e.g., LSQ). Additive noise, particularly when proportional to quantization error and decaying in variance, introduces an implicit curvature term that encourages optimization towards flatter minima, demonstrably improving generalization and convergence stability.
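A minimal PyTorch-style sketch of this combination follows. It is not the exact estimator from Wang et al. (2022) or the full LSQ rule, but it shows straight-through rounding with a learnable step size plus quantization-error-proportional noise whose scale decays over training; `noise0` and `decay` are illustrative hyperparameters.

```python
import math
import torch

def fake_quant_lsq_noisy(w, step, bits, step_idx, noise0=0.5, decay=1e-4):
    """Fake-quantize w with a learnable step size (LSQ-style), adding exponentially
    decaying noise proportional to the quantization error."""
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    w_s = torch.clamp(w / step, qmin, qmax)
    # Straight-through rounding: forward uses round(w_s), backward treats it as identity.
    w_q = (torch.round(w_s) - w_s).detach() + w_s
    w_hat = w_q * step                          # gradients reach both w and the step size
    # Quantization-error-aware noise; stop-gradient keeps it from biasing the gradient.
    err = (w_hat - w).detach()
    lam = noise0 * math.exp(-decay * step_idx)  # exponentially decaying noise scale
    return w_hat + lam * err * torch.rand_like(w)

# Usage: step would typically be a positive nn.Parameter trained jointly with w;
# the LSQ gradient-rescaling term is omitted here for brevity.
w = torch.randn(256, requires_grad=True)
step = torch.tensor(0.05, requires_grad=True)
w_hat = fake_quant_lsq_noisy(w, step, bits=4, step_idx=1000)
```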
Loss regularization can be extended to include convex, piecewise-affine penalties (e.g., PARQ), which enforce parameter clustering at quantizer values via proximal gradient methods, providing last-iterate convergence with explicit control of quantization regularization strength (Jin et al., 19 Mar 2025).
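A simplified sketch of such a proximal step is given below, assuming the penalty is the distance to the nearest quantizer level; PARQ's actual piecewise-affine penalty and its annealing of the regularization strength are richer than this.

```python
import torch

def prox_nearest_level(w, step, tau):
    """Proximal map of tau * |w - q(w)|, with q(w) the nearest quantizer level:
    shrink each weight by tau toward the grid, snapping once it is within tau."""
    q = torch.round(w / step) * step       # nearest quantization level
    r = w - q                              # residual to the grid
    return q + torch.sign(r) * torch.clamp(r.abs() - tau, min=0.0)

def proximal_gradient_step(w, grad, lr, step, lam):
    """One iteration: gradient step on the task loss, then prox of the
    quantization-attraction penalty with strength lam."""
    return prox_nearest_level(w - lr * grad, step, lr * lam)
```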
3. Factorization by Model, Data, and Quantization Regimes
A generic unified law for quantization error and loss is given by

$$\delta(N, D, G) \;=\; k \cdot \frac{D^{\gamma_D}\,(\ln G)^{\gamma_G}}{N^{\gamma_N}}.$$

This encodes major empirical findings:
- Cohorts of model sizes (from 74M to 973M parameters): loss drops with $N$ (scaling exponent $\gamma_N$)
- Increased data drives up error ($\propto D^{\gamma_D}$), a non-intuitive effect distinct from full-precision scaling behavior
- Quantization group size changes error by a logarithmic factor ($(\ln G)^{\gamma_G}$), reflecting discrete grouping structure impacting effective quantization granularity
Error can further be decomposed into weight and activation quantization contributions, which may dominate depending on data/model regime:
- For large datasets, weight quantization can become the error bottleneck.
- For small models or at typical compute scales, activation quantization (especially with outlier-dominated projections such as FC2 input) controls loss floor (Chen et al., 20 May 2025).
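The sketch below illustrates this decomposition under the factorized form above, with purely hypothetical coefficients for the weight and activation terms; comparing the two terms identifies the current bottleneck.

```python
import numpy as np

def delta_term(N, D, G, k, gN, gD, gG):
    """Factorized error: grows with tokens D, shrinks with size N, logarithmic in group size G."""
    return k * D**gD * np.log(G)**gG / N**gN

# Hypothetical coefficients for the weight and activation contributions (not fitted values).
N, D, G = 5e8, 1e11, 128
delta_w = delta_term(N, D, G, k=0.08, gN=0.40, gD=0.12, gG=0.50)
delta_a = delta_term(N, D, G, k=0.15, gN=0.35, gD=0.05, gG=0.60)
bottleneck = "weight quantization" if delta_w > delta_a else "activation quantization"
print(f"delta_w={delta_w:.4f}, delta_a={delta_a:.4f}, bottleneck: {bottleneck}")
```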
4. Compute Allocation and the Tokens-per-Parameter-Byte Principle
A major practical outcome is the discovery that allocation of compute between FP pretraining and QAT is not fixed, but should scale with available training tokens and model size. The optimal QAT fraction depends on the tokens-per-parameter-byte statistic

$$\rho \;=\; \frac{D}{N \cdot B / 8},$$

with the optimal QAT token fraction $f^{*}_{\mathrm{QAT}}$ fitted as an increasing function of $\rho$. This law predicts that with larger compute budgets (large $\rho$), the optimal QAT fraction increases, contrary to prior beliefs that a fixed 10% QAT is optimal. The scaling relationship unifies regime selection across bit-widths, model sizes, and data volumes, supporting compute-optimal training strategies (Dremov et al., 26 Sep 2025).
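A small sketch of this allocation rule follows, with a hypothetical power-law fit standing in for the paper's fitted function (the constants `a` and `b` are illustrative only).

```python
def tokens_per_param_byte(D_total, N, bits):
    """rho = training tokens / quantized model footprint in bytes (N * bits / 8)."""
    return D_total / (N * bits / 8.0)

def optimal_qat_fraction(D_total, N, bits, a=0.05, b=0.35):
    """Hypothetical increasing fit f*_QAT(rho); a and b are illustrative, not fitted values."""
    rho = tokens_per_param_byte(D_total, N, bits)
    return min(1.0, a * rho ** b)

# Example: 1B parameters, 4-bit target, 100B total training tokens.
f_qat = optimal_qat_fraction(100e9, 1e9, 4)
qat_tokens = f_qat * 100e9   # tokens for the QAT phase; the remainder goes to FP pretraining
```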
A related implication is that cooldown & QAT fusion—integrating learning rate decay and QAT without redundant full-precision updates—can eliminate wasted compute, further aligning practice to the theoretical scaling law.
5. Unification Across Quantization Formats and Representation Capacity
A universal capacity metric enables comparisons of quantized, sparse, or sparse-quantized formats. Empirically, the scaling law becomes

$$\mathcal{L}(N, D; R) \;\approx\; \mathcal{L}_{\mathrm{FP}}\big(N_{\mathrm{eff}}, D\big), \qquad N_{\mathrm{eff}} = C(R)\cdot N,$$

where the capacity $C(R)$ is fitted as a function of the Gaussian MSE of the representation $R$. These capacity-based formulations robustly predict performance across scalar-, vector-, low-bit, and mixed-precision quantizations, as well as hybrid sparsity-quantization settings (Panferov et al., 2 Jun 2025; Frantar et al., 23 Feb 2025). Representation capacity is multiplicative across compression factors, enabling joint optimization and resource planning.
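A sketch of capacity-based planning, using hypothetical capacity values in place of GMSE-derived fits:

```python
def effective_params(N, capacities):
    """N_eff = N * prod(C_i): capacities compose multiplicatively across compression factors."""
    n_eff = N
    for c in capacities:
        n_eff *= c
    return n_eff

# Hypothetical capacities; in the cited work C(R) is fitted from the format's GMSE.
C_int4   = 0.85   # e.g. a 4-bit scalar quantization format
C_sparse = 0.70   # e.g. a 2:4 structured-sparsity pattern
N_eff = effective_params(1e9, [C_int4, C_sparse])
# N_eff can be substituted for N in the full-precision scaling law to predict the
# compressed model's loss.
```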
6. Practical Implications, Comparisons, and Open Problems
Key implications and guidance from the unified loss scaling law:
- QAT error reductions are maximized by increasing model size, reducing quantization group size, and dynamically allocating more compute to QAT as overall budget increases.
- For mixed-precision or groupwise quantization, error control can be achieved by identifying layer-specific bottlenecks (e.g., outlier-prone FC2 projections) and applying high-precision selectively.
- Loss regularization via controlled noise, dynamic scaling, and piecewise penalties can be tuned to produce flatter minima and robust performance.
- Compute planning (i.e., partitioning FP and QAT phases) can be predicted accurately using the tokens-per-parameter-byte scaling.
- Capacity metrics allow principled "apples-to-apples" selection among quantization formats.
Remaining open directions include extension to adaptive optimizers (beyond SGD with momentum), optimal group assignment in groupwise quantization, further compositional rules for multi-factor compression, and system-specific adaptation for diverse deployment hardware (Wang et al., 2022, Chen et al., 20 May 2025, Panferov et al., 2 Jun 2025).
7. Summary Table: Principal Factors and Loss Scaling Law
Factor | Scaling Law Role | Implications for QAT Planning |
---|---|---|
Model size ($N$) | Error decreases as $N^{-\gamma_N}$ | Larger models tolerate lower bits |
Training tokens ($D$) | Error increases as $D^{\gamma_D}$ | More data may worsen quant error |
Group size ($G$) | Error grows as $(\ln G)^{\gamma_G}$ | Finer quantization → lower error |
Bit-width ($B$) | Sets capacity and tokens-per-param-byte | Governs both error and resource use |
Representation capacity ($C(R)$) | Modifies effective $N$ ($N_{\mathrm{eff}} = C(R)\cdot N$) | Universal cross-format comparability |
QAT/FP compute fraction | Predicted by tokens-per-parameter-byte $\rho$ | Allocate QAT fraction per scaling |
This unified scaling framework encodes the major empirical and theoretical advances in QAT, offering a predictive model for quantization performance and optimal resource allocation under diverse regimes. Its adoption enables principled model design, efficient compute usage, and systematic exploration of novel quantization-aware training methods.