
Quantized Training Scaling Laws

Updated 12 November 2025
  • Quantized training scaling laws are a framework that predicts loss reduction as a power-law function of model parameters, training tokens, and quantization precision.
  • They unify various compression techniques by replacing raw parameter count with an effective capacity measure, enabling precise performance trade-off analysis under quantization-induced error.
  • Empirical models, including those for quantization-induced degradation and optimal QAT allocation, guide the design of efficient low-precision large language models.

Quantized training scaling laws describe the predictable interplay between model size, data volume, resource constraints, and the effects of quantization on neural network performance. Modern research has established a suite of unified, empirically validated scaling laws that account for quantization-induced error, the structure of quantization schemes, and the allocation of training compute in quantized regimes. These laws enable precise predictions of cross-entropy loss, effective parameter scaling, performance trade-offs under precision constraints, and the emergence (and limitations) of low-precision models in large-scale machine learning.

1. Quantization Hypotheses and Classical Scaling Laws

The foundational quantized training scaling law extends neural scaling theory by attributing the power-law decrease of loss to the sequential acquisition of discrete “quanta” of knowledge or skill. The quantization model is built on three mathematical hypotheses (Michaud et al., 2023):

  • Discreteness (QH1): Model loss depends solely on which quanta (indexed by use frequency) have been learned.
  • Ordered Learning (QH2): A model that has learned $n$ quanta has acquired the $n$ most frequently required abilities; the remaining quanta are not yet learned.
  • Zipfian Use Frequencies (QH3): The probability that an example uses the $k$-th quantum follows a Zipf law:

$$p_k = \frac{1}{\zeta(\alpha+1)}\, k^{-(\alpha+1)}, \qquad \alpha > 0$$

where $\zeta$ is the Riemann zeta function.

Aggregate test loss after learning $n$ quanta is then
$$L_n \simeq a + \frac{b-a}{\alpha\,\zeta(\alpha+1)}\, n^{-\alpha}$$
where $a$ and $b$ are the losses after and before learning each quantum, respectively.

For a model with $N$ parameters (each quantum requiring capacity $C$), $n \approx N/C$, yielding
$$L(N) - L_\infty \propto N^{-\alpha}.$$
This formalizes the power-law scaling of loss with model size and extends to data scaling and optimization-step scaling.
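
A minimal numerical sketch of this model is given below: it compares the exact expected loss under QH1–QH3 (summing the Zipfian quanta frequencies directly) against the closed-form power law. The values of $a$, $b$, and $\alpha$ are illustrative placeholders, not fitted constants.

```python
# Minimal sketch of the quantization model (QH1-QH3): exact expected loss from the
# Zipfian quanta frequencies versus the closed-form power law. The values of a, b,
# and alpha are illustrative placeholders, not fitted constants.
import numpy as np
from scipy.special import zeta

a, b, alpha = 0.5, 3.0, 0.4   # loss after / before learning a quantum, and the Zipf exponent

def zipf_probs(k_max: int) -> np.ndarray:
    """p_k = k^-(alpha+1) / zeta(alpha+1), the use frequency of the k-th quantum (QH3)."""
    k = np.arange(1, k_max + 1, dtype=float)
    return k ** -(alpha + 1.0) / zeta(alpha + 1.0)

def loss_exact(n: int, k_max: int = 10**6) -> float:
    """Loss is a on the n most frequent quanta and b elsewhere (QH1 + QH2)."""
    p = zipf_probs(k_max)
    tail = 1.0 - p.sum()                       # mass beyond the truncation, still unlearned
    return a * p[:n].sum() + b * (p[n:].sum() + tail)

def loss_powerlaw(n: int) -> float:
    """Closed form L_n ~ a + (b - a) / (alpha * zeta(alpha + 1)) * n^-alpha."""
    return a + (b - a) / (alpha * zeta(alpha + 1.0)) * n ** -alpha

for n in (10, 100, 1000):
    print(n, loss_exact(n), loss_powerlaw(n))
```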

2. Unified Scaling Laws for Quantized and Compressed Training

Recent work has generalized the above to encompass arbitrary model compression, including quantization, sparsity, and their compositions. The key principle is to replace the dense parameter count $N$ with an effective parameter count $N_{\rm eff}$, determined by a capacity factor $\rho(R)$ reflecting the representational efficiency of compression format $R$ (Panferov et al., 2 Jun 2025, Frantar et al., 23 Feb 2025):
$$\mathrm{Loss}(N, D; R) = A\, (N\,\rho(R))^{-\alpha} + B\, D^{-\beta} + E$$
where:

  • $A$, $B$, $\alpha$, $\beta$, $E > 0$ are task- and domain-specific constants;
  • $\rho(R) \in (0, 1]$ quantifies representational capacity.

For uniform $b$-bit scalar quantization:
$$N_{\rm eff}(b) = m(b) \cdot N, \qquad m(b) \simeq 1 - K\, 2^{-\gamma b}$$
with $K \simeq 0.96$ and $\gamma \simeq 0.84$, giving $m(4) \approx 0.92$ and $m(1) \approx 0.47$ (Frantar et al., 23 Feb 2025).

Joint sparse-quantization and vector quantization compose multiplicatively:
$$\rho_{R_1 \circ R_2} = \rho_{R_1} \cdot \rho_{R_2}$$

A capacity mapping $\rho(R)$ can be fit empirically from the root-mean-square error incurred when compressing Gaussian data, providing a unified axis for comparing disparate formats (Panferov et al., 2 Jun 2025).
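
The sketch below puts these pieces together: the bit-width multiplier $m(b)$ with the rounded constants quoted above, multiplicative composition of capacity factors, and a Chinchilla-style loss with $N$ replaced by $N\rho(R)$. The loss constants $A$, $B$, $\alpha$, $\beta$, $E$ are illustrative placeholders, not fits from the cited papers.

```python
# Sketch of the unified law Loss(N, D; R) = A * (N * rho(R))^-alpha + B * D^-beta + E.
# K and GAMMA are the rounded values quoted above; the loss constants A, B, alpha,
# beta, E are illustrative placeholders, not fits from the papers.

K, GAMMA = 0.96, 0.84  # bit-width multiplier constants (Frantar et al., 23 Feb 2025)

def capacity_scalar_quant(bits: int) -> float:
    """rho for uniform b-bit scalar quantization: m(b) = 1 - K * 2^(-gamma * b)."""
    return 1.0 - K * 2.0 ** (-GAMMA * bits)

def capacity_composed(*rhos: float) -> float:
    """Composed formats (e.g. sparsity + quantization) multiply their capacity factors."""
    out = 1.0
    for r in rhos:
        out *= r
    return out

def loss(n_params: float, tokens: float, rho: float,
         A: float = 406.4, alpha: float = 0.34,
         B: float = 410.7, beta: float = 0.28, E: float = 1.69) -> float:
    """Chinchilla-style loss with the dense N replaced by the effective count N * rho(R)."""
    return A * (n_params * rho) ** -alpha + B * tokens ** -beta + E

print(capacity_scalar_quant(4))   # ~0.91 with the rounded constants (the text quotes ~0.92)
print(capacity_scalar_quant(1))   # ~0.46 (the text quotes ~0.47)
print(loss(7e9, 2e12, capacity_scalar_quant(4)))                          # 4-bit weights
print(loss(7e9, 2e12, capacity_composed(capacity_scalar_quant(4), 0.9)))  # 4-bit plus a second format
```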

3. Explicit Power-Law Formulas: Effects of Tokens, Model Size, and Bit Width

Direct post-training quantization introduces quantization-induced degradation (QiD), with a precisely measured scaling for low-bit LLM checkpoints (Ouyang et al., 26 Nov 2024):
$$\Delta_{q\mathrm{Loss}}(N, T, b) \approx k\, \frac{T^{\beta}}{N^{\alpha}\, b^{\gamma}}$$
with empirical fits
$$k = 0.017, \quad \alpha = 0.2261, \quad \beta = 0.5251, \quad \gamma = 5.4967$$
where $N$ is the number of non-embedding model parameters, $T$ the number of training tokens, and $b$ the quantization bit width.

Key empirical consequences:

  • QiD grows steeply with $T$ for fixed $N, b$ ($\beta \approx 0.53$);
  • QiD shrinks slightly with $N$ ($\alpha \approx 0.23$);
  • QiD falls off rapidly with $b$ ($\gamma \approx 5.5$).

At $T = 10^{14}$ tokens, even 4-bit models with $N = 7 \times 10^{10}$ suffer $\Delta_q \sim 0.66$ nats, indicating severe limits as LLMs become more fully trained. Undertrained checkpoints exhibit far less degradation under quantization; hence QiD can serve as a practical proxy for training completeness.
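
A short sketch of the QiD law with the fitted constants quoted above; it reproduces the $\sim 0.66$ nat example and shows how much milder the degradation is for an undertrained checkpoint of the same model.

```python
# Sketch of the QiD law Delta_qLoss(N, T, b) ~ k * T^beta / (N^alpha * b^gamma)
# with the fitted constants quoted above (Ouyang et al., 26 Nov 2024).
K_FIT, ALPHA, BETA, GAMMA = 0.017, 0.2261, 0.5251, 5.4967

def qid_nats(n_params: float, tokens: float, bits: int) -> float:
    """Quantization-induced degradation, in nats, for a post-training-quantized checkpoint."""
    return K_FIT * tokens ** BETA / (n_params ** ALPHA * bits ** GAMMA)

# Reproduces the example in the text: N = 7e10, T = 1e14 tokens, 4-bit quantization.
print(qid_nats(7e10, 1e14, 4))   # ~0.66 nats
# The same model at an undertrained checkpoint (1e12 tokens) degrades far less:
print(qid_nats(7e10, 1e12, 4))   # ~0.06 nats
```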

4. Floating-Point Quantization: Structure, Critical Data Size, and Compute-Optimality

Precision-structured floating-point quantization reveals further refinements (Sun et al., 5 Jan 2025). The validation loss under floating-point quantization is modeled as
$$L(N, D, E, M, B) = \frac{n}{N^{\alpha}} + \frac{d}{D^{\beta}} + \epsilon + \frac{D^{\beta}}{N^{\alpha}} \cdot \frac{\log_2 B}{\gamma\, (E + 0.5)^{\delta} (M + 0.5)^{\nu}}$$
where $N$ is the parameter count in millions, $D$ the token count in billions, $(E, M)$ the exponent and mantissa bit counts, and $B$ the block size used for scaling.

Exponent bits contribute more to loss reduction than mantissa bits. The optimal split for a given total bit width $P = E + M + 1$ is
$$M_{\mathrm{opt}} = \frac{\nu P}{\delta + \nu} - 0.5, \qquad E_{\mathrm{opt}} = \frac{\delta P}{\delta + \nu} - 0.5$$
yielding E4M3 as the optimal FP8 layout and E2M1 as the optimal FP4 layout.

A critical data size $D_{\mathrm{crit}}$ emerges beyond which the quantization penalty dominates and further training data increases loss:
$$D_{\mathrm{crit}} = \left[ \frac{d\, \gamma\, N^{\alpha} (E+0.5)^{\delta} (M+0.5)^{\nu}}{\log_2 B} \right]^{1/(2\beta)}$$
For lower-precision settings, $D_{\mathrm{crit}}$ can fall within realistic dataset sizes, yielding U-shaped loss curves.
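
The sketch below implements the optimal exponent/mantissa split and the critical-data-size formula. The sensitivity exponents $\delta$ and $\nu$ are hypothetical placeholders chosen only so that the split reproduces the E4M3 and E2M1 layouts stated above; the remaining fitted constants must be supplied by the caller rather than being reproduced here.

```python
# Sketch of the exponent/mantissa split and critical data size from the floating-point law.
# DELTA and NU are hypothetical sensitivity exponents chosen only so that the split
# reproduces the E4M3 / E2M1 layouts stated above; the paper's own fitted values differ.
import math

DELTA, NU = 0.72, 0.56   # placeholder exponents for (E + 0.5) and (M + 0.5)

def optimal_split(total_bits: int) -> tuple[int, int]:
    """E_opt = delta*P/(delta+nu) - 0.5, M_opt = nu*P/(delta+nu) - 0.5, rounded to integers."""
    e_opt = DELTA * total_bits / (DELTA + NU) - 0.5
    m_opt = NU * total_bits / (DELTA + NU) - 0.5
    return round(e_opt), round(m_opt)

print(optimal_split(8))   # (4, 3): E4M3 for FP8
print(optimal_split(4))   # (2, 1): E2M1 for FP4

def critical_data_size(n_millions: float, e_bits: int, m_bits: int, block_size: int,
                       d: float, gamma: float, alpha: float, beta: float) -> float:
    """D_crit = [d * gamma * N^alpha * (E+0.5)^delta * (M+0.5)^nu / log2(B)]^(1/(2*beta)).
    The fitted constants d, gamma, alpha, beta must be supplied (not reproduced here);
    N is in millions and the returned D in billions of tokens, matching the law above."""
    numerator = d * gamma * n_millions ** alpha * (e_bits + 0.5) ** DELTA * (m_bits + 0.5) ** NU
    return (numerator / math.log2(block_size)) ** (1.0 / (2.0 * beta))
```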

For a fixed compute budget $C \approx k\, P\, N\, D$, the precision $P$ that optimizes cost-performance lies in the 4–8 bit range across a wide set of configurations.

5. Quantization-Aware Training: Scaling Laws, Bottlenecks, and Mixed-Precision

Quantization-aware training (QAT) is characterized by a distinct scaling law for the quantization error (the loss gap relative to full precision) (Chen et al., 20 May 2025):
$$\epsilon_{\mathrm{QAT}}(N, T, G) = A\, N^{-\alpha}\, T^{\beta}\, [\log_2 G]^{\gamma}$$
where $N$ is the parameter count, $T$ the number of training tokens, and $G$ the quantization group size. Fitted parameters (for W4A4) are $A = 0.1582$, $\alpha = 0.2186$, $\beta = 0.0745$, $\gamma = 0.7779$.

  • Larger $N$ reduces $\epsilon_{\mathrm{QAT}}$ ($\alpha > 0$).
  • Increasing $T$ increases $\epsilon_{\mathrm{QAT}}$, implying that quantized models lag their full-precision counterparts as data grows.
  • Coarser quantization groupings (larger $G$) severely deteriorate accuracy; the effect is dominant in activation quantization.

Decomposition into weights and activations shows that activation error is strongly sensitive to GG due to outliers, particularly in the FC2 layer. Applying mixed precision (e.g., using 8-bit only for the FC2-Proj input) greatly mitigates this granularity-induced error.

Given a target loss penalty $\epsilon_0$, the required $N$ can be computed as
$$N \geq \left[ \frac{A\, T^{\beta} (\log_2 G)^{\gamma}}{\epsilon_0} \right]^{1/\alpha}$$
allowing direct trade-off planning between $N$, $T$, and $G$.
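
A sketch of the QAT error law and the implied minimum model size for a target penalty $\epsilon_0$, using the W4A4 constants quoted above; the example values for tokens, group size, and $\epsilon_0$ are illustrative.

```python
# Sketch of the QAT error law and the model size needed for a target penalty,
# using the W4A4 fit quoted above (Chen et al., 20 May 2025). Example inputs are illustrative.
import math

A_QAT, ALPHA, BETA, GAMMA = 0.1582, 0.2186, 0.0745, 0.7779

def qat_error(n_params: float, tokens: float, group_size: int) -> float:
    """epsilon_QAT(N, T, G) = A * N^-alpha * T^beta * (log2 G)^gamma."""
    return A_QAT * n_params ** -ALPHA * tokens ** BETA * math.log2(group_size) ** GAMMA

def min_params_for_target(eps0: float, tokens: float, group_size: int) -> float:
    """Smallest N with epsilon_QAT <= eps0: N >= [A * T^beta * (log2 G)^gamma / eps0]^(1/alpha)."""
    return (A_QAT * tokens ** BETA * math.log2(group_size) ** GAMMA / eps0) ** (1.0 / ALPHA)

print(qat_error(7e9, 1e12, 128))              # ~0.04 nats for a 7B W4A4 model with G = 128
print(min_params_for_target(0.02, 1e12, 128)) # ~1.6e11 params to keep the penalty under 0.02 nats
```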

6. Compute-Optimality and Allocation in QAT Regimes

Optimally dividing training between a full-precision (FP) phase and a QAT phase leads to nontrivial but predictable scaling (Dremov et al., 26 Sep 2025). The key variable is the tokens-per-parameter-byte statistic
$$S = \frac{D_{\mathrm{tot}}}{N B}$$
where $D_{\mathrm{tot}}$ is the total number of training tokens, $N$ the parameter count, and $B$ the QAT bit width.

The compute-optimal fraction of training allocated to QAT, $f^* = D_{\mathrm{QAT}}/D_{\mathrm{tot}}$, is given by
$$f^*(S) = \exp\left(\frac{\ln S}{\ln S + a}\right), \qquad a = 6.7297$$
The final loss is modeled as
$$L(N, D_{\mathrm{FP}}, D_{\mathrm{QAT}}, B) = L_{\mathrm{Chin}}(N, D_{\mathrm{tot}}) + P_{\mathrm{QAT}}(N, D_{\mathrm{FP}}, D_{\mathrm{QAT}}, B)$$
where the quantization penalty $P_{\mathrm{QAT}}$ has terms for irreducible error, pure-QAT adaptation, and FP–QAT interaction. The law achieves $R^2 > 0.98$ and MAE $< 0.1$ across four distinct bit widths and multiple model sizes.

With this framework, it is possible to derive closed-form allocations of training to the QAT phase under compute and memory budgets, and to determine the best bit width for a fixed deployment constraint.

7. Practical Design Guidance and Phase Boundary Implications

  • Parameter Multipliers: Effective model size under quantization scales as $N \cdot \mathrm{eff}(b)$ for weight-only quantization or $N \cdot f(b_w, b_a)$ for weight-activation pairs, with empirical efficiency falling sharply below 4 bits.
  • Critical Regimes: As LLMs surpass $10^{13}$–$10^{14}$ training tokens, post hoc low-bit quantization becomes impractical except for extremely large models, unless QAT is performed during training.
  • Phase Diagrams: Bit-width/size trade-offs permit Pareto-optimal configurations under storage or compute constraints (e.g. 4–8 bit floating-point formats often optimal for training and inference).
  • Mixed-Precision: Employ mixed-precision only in layers with observed heavy-tailed activation distributions (notably FC2), to mitigate error concentration without excessive memory overhead.
  • QAT Planning: Allocate QAT phase proportionally to tokens-per-parameter-byte for maximal performance; early full-precision followed by late QAT, or a fusion scheme with staged learning-rate decay, minimizes loss at fixed compute.
  • Unified Capacity Axis: The “capacity” metric, related to the mean-squared-error of the representation, provides a universal axis for model comparison and training prescription across quantized, sparse, and jointly-compressed formats.

These scaling laws provide a rigorous, predictive formalism for evaluating, designing, and training quantized and compressed LLMs under realistic compute and deployment constraints.
