
Quantized Training Scaling Laws

Updated 12 November 2025
  • Quantized training scaling laws are a framework that predicts loss reduction as a power-law function of model parameters, training tokens, and quantization precision.
  • They unify various compression techniques by replacing raw parameter count with an effective capacity measure, enabling precise performance trade-off analysis under quantization-induced error.
  • Empirical models, including those for quantization-induced degradation and optimal QAT allocation, guide the design of efficient low-precision large language models.

Quantized training scaling laws describe the predictable interplay between model size, data volume, resource constraints, and the effects of quantization on neural network performance. Modern research has established a suite of unified, empirically validated scaling laws that account for quantization-induced error, the structure of quantization schemes, and the allocation of training compute in quantized regimes. These laws enable precise predictions of cross-entropy loss, effective parameter scaling, performance trade-offs under precision constraints, and the emergence (and limitations) of low-precision models in large-scale machine learning.

1. Quantization Hypotheses and Classical Scaling Laws

The foundational quantized training scaling law extends neural scaling theory by attributing the power-law decrease of loss to the sequential acquisition of discrete “quanta” of knowledge or skill. The quantization model is built on three mathematical hypotheses (Michaud et al., 2023):

  • Discreteness (QH1): Model loss depends solely on which quanta (indexed by use frequency) have been learned.
  • Ordered Learning (QH2): A model that has learned $n$ quanta has acquired the $n$ most frequently required abilities; the remaining quanta are not yet learned.
  • Zipfian Use Frequencies (QH3): The probability that an example uses the $k$-th quantum follows a Zipf law:

$$p_k = \frac{1}{\zeta(\alpha+1)}\, k^{-(\alpha+1)}, \qquad \alpha > 0$$

where $\zeta$ is the Riemann zeta function.

Aggregate test loss after learning $n$ quanta is then
$$L_n \simeq a + \frac{b-a}{\alpha\,\zeta(\alpha+1)}\, n^{-\alpha}$$
where $a$ and $b$ are the losses after and before learning each quantum, respectively.

For a model with $N$ parameters (each quantum requiring capacity $C$), $n \approx N/C$, yielding
$$L(N) - L_\infty \propto N^{-\alpha}.$$
This formalizes the power-law scaling of loss with model size and extends to data scaling and optimization-step scaling.
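
A minimal numerical sketch of this model is given below: it compares the exact expected loss under QH1–QH3 (summing the Zipfian quanta frequencies directly) against the closed-form power law. The values of $a$, $b$, and $\alpha$ are illustrative placeholders, not fitted constants.

```python
# Minimal sketch of the quantization model (QH1-QH3): exact expected loss from the
# Zipfian quanta frequencies versus the closed-form power law. The values of a, b,
# and alpha are illustrative placeholders, not fitted constants.
import numpy as np
from scipy.special import zeta

a, b, alpha = 0.5, 3.0, 0.4   # loss after / before learning a quantum, and the Zipf exponent

def zipf_probs(k_max: int) -> np.ndarray:
    """p_k = k^-(alpha+1) / zeta(alpha+1), the use frequency of the k-th quantum (QH3)."""
    k = np.arange(1, k_max + 1, dtype=float)
    return k ** -(alpha + 1.0) / zeta(alpha + 1.0)

def loss_exact(n: int, k_max: int = 10**6) -> float:
    """Loss is a on the n most frequent quanta and b elsewhere (QH1 + QH2)."""
    p = zipf_probs(k_max)
    tail = 1.0 - p.sum()                       # mass beyond the truncation, still unlearned
    return a * p[:n].sum() + b * (p[n:].sum() + tail)

def loss_powerlaw(n: int) -> float:
    """Closed form L_n ~ a + (b - a) / (alpha * zeta(alpha + 1)) * n^-alpha."""
    return a + (b - a) / (alpha * zeta(alpha + 1.0)) * n ** -alpha

for n in (10, 100, 1000):
    print(n, loss_exact(n), loss_powerlaw(n))
```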

2. Unified Scaling Laws for Quantized and Compressed Training

Recent work has generalized the above to encompass arbitrary model compression, including quantization, sparsity, and their compositions. The key principle is to replace the dense parameter count $N$ with an effective parameter count $N_{\rm eff}$, determined by a capacity factor $\rho(R)$ reflecting the representational efficiency of compression format $R$ (Panferov et al., 2 Jun 2025, Frantar et al., 23 Feb 2025):
$$\mathrm{Loss}(N, D; R) = A\, (N\,\rho(R))^{-\alpha} + B\, D^{-\beta} + E$$
where:

  • $A$, $B$, $\alpha$, $\beta$, $E > 0$ are task- and domain-specific constants;
  • $\rho(R) \in (0, 1]$ quantifies representational capacity.

For uniform $b$-bit scalar quantization:
$$N_{\rm eff}(b) = m(b) \cdot N, \qquad m(b) \simeq 1 - K\, 2^{-\gamma b}$$
with $K \simeq 0.96$ and $\gamma \simeq 0.84$, giving $m(4) \approx 0.92$ and $m(1) \approx 0.47$ (Frantar et al., 23 Feb 2025).

Joint sparse-quantization and vector quantization compose multiplicatively:
$$\rho_{R_1 \circ R_2} = \rho_{R_1} \cdot \rho_{R_2}$$

A capacity mapping $\rho(R)$ can be fit empirically from the root-mean-square error incurred when compressing Gaussian data, providing a unified axis for comparing disparate formats (Panferov et al., 2 Jun 2025).
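
The sketch below puts these pieces together: the bit-width multiplier $m(b)$ with the rounded constants quoted above, multiplicative composition of capacity factors, and a Chinchilla-style loss with $N$ replaced by $N\rho(R)$. The loss constants $A$, $B$, $\alpha$, $\beta$, $E$ are illustrative placeholders, not fits from the cited papers.

```python
# Sketch of the unified law Loss(N, D; R) = A * (N * rho(R))^-alpha + B * D^-beta + E.
# K and GAMMA are the rounded values quoted above; the loss constants A, B, alpha,
# beta, E are illustrative placeholders, not fits from the papers.

K, GAMMA = 0.96, 0.84  # bit-width multiplier constants (Frantar et al., 23 Feb 2025)

def capacity_scalar_quant(bits: int) -> float:
    """rho for uniform b-bit scalar quantization: m(b) = 1 - K * 2^(-gamma * b)."""
    return 1.0 - K * 2.0 ** (-GAMMA * bits)

def capacity_composed(*rhos: float) -> float:
    """Composed formats (e.g. sparsity + quantization) multiply their capacity factors."""
    out = 1.0
    for r in rhos:
        out *= r
    return out

def loss(n_params: float, tokens: float, rho: float,
         A: float = 406.4, alpha: float = 0.34,
         B: float = 410.7, beta: float = 0.28, E: float = 1.69) -> float:
    """Chinchilla-style loss with the dense N replaced by the effective count N * rho(R)."""
    return A * (n_params * rho) ** -alpha + B * tokens ** -beta + E

print(capacity_scalar_quant(4))   # ~0.91 with the rounded constants (the text quotes ~0.92)
print(capacity_scalar_quant(1))   # ~0.46 (the text quotes ~0.47)
print(loss(7e9, 2e12, capacity_scalar_quant(4)))                          # 4-bit weights
print(loss(7e9, 2e12, capacity_composed(capacity_scalar_quant(4), 0.9)))  # 4-bit plus a second format
```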

3. Explicit Power-Law Formulas: Effects of Tokens, Model Size, and Bit Width

Direct post-training quantization introduces quantization-induced degradation (QiD), with a precisely measured scaling for low-bit LLM checkpoints (Ouyang et al., 26 Nov 2024):
$$\Delta_{q\mathrm{Loss}}(N, T, b) \approx k\, \frac{T^{\beta}}{N^{\alpha}\, b^{\gamma}}$$
with empirical fits
$$k = 0.017, \quad \alpha = 0.2261, \quad \beta = 0.5251, \quad \gamma = 5.4967$$
where $N$ is the number of non-embedding model parameters, $T$ the number of training tokens, and $b$ the quantization bit width.

Key empirical consequences:

  • QiD grows steeply with $T$ for fixed $N, b$ ($\beta \approx 0.53$);
  • QiD shrinks slightly with $N$ ($\alpha \approx 0.23$);
  • QiD falls off rapidly with $b$ ($\gamma \approx 5.5$).

At $T = 10^{14}$ tokens, even 4-bit models with $N = 7 \times 10^{10}$ suffer $\Delta_q \sim 0.66$ nats, indicating severe limits as LLMs become more fully trained. Undertrained checkpoints exhibit far less degradation under quantization; hence QiD can serve as a practical proxy for training completeness.
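
A short sketch of the QiD law with the fitted constants quoted above; it reproduces the $\sim 0.66$ nat example and shows how much milder the degradation is for an undertrained checkpoint of the same model.

```python
# Sketch of the QiD law Delta_qLoss(N, T, b) ~ k * T^beta / (N^alpha * b^gamma)
# with the fitted constants quoted above (Ouyang et al., 26 Nov 2024).
K_FIT, ALPHA, BETA, GAMMA = 0.017, 0.2261, 0.5251, 5.4967

def qid_nats(n_params: float, tokens: float, bits: int) -> float:
    """Quantization-induced degradation, in nats, for a post-training-quantized checkpoint."""
    return K_FIT * tokens ** BETA / (n_params ** ALPHA * bits ** GAMMA)

# Reproduces the example in the text: N = 7e10, T = 1e14 tokens, 4-bit quantization.
print(qid_nats(7e10, 1e14, 4))   # ~0.66 nats
# The same model at an undertrained checkpoint (1e12 tokens) degrades far less:
print(qid_nats(7e10, 1e12, 4))   # ~0.06 nats
```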

4. Floating-Point Quantization: Structure, Critical Data Size, and Compute-Optimality

Precision-structured floating-point quantization reveals further refinements (Sun et al., 5 Jan 2025). The validation loss under floating-point quantization is modeled as
$$L(N, D, E, M, B) = \frac{n}{N^{\alpha}} + \frac{d}{D^{\beta}} + \epsilon + \frac{D^{\beta}}{N^{\alpha}} \cdot \frac{\log_2 B}{\gamma\, (E + 0.5)^{\delta} (M + 0.5)^{\nu}}$$
where $N$ is the parameter count in millions, $D$ the token count in billions, $(E, M)$ the exponent and mantissa bit counts, and $B$ the block size used for scaling.

Exponent bits contribute more to loss reduction than mantissa bits. The optimal split for a given total bit width $P = E + M + 1$ is
$$M_{\mathrm{opt}} = \frac{\nu P}{\delta + \nu} - 0.5, \qquad E_{\mathrm{opt}} = \frac{\delta P}{\delta + \nu} - 0.5$$
yielding E4M3 as the optimal FP8 layout and E2M1 as the optimal FP4 layout.

A critical data size $D_{\mathrm{crit}}$ emerges beyond which the quantization penalty dominates and further training data increases loss:
$$D_{\mathrm{crit}} = \left[ \frac{d\, \gamma\, N^{\alpha} (E+0.5)^{\delta} (M+0.5)^{\nu}}{\log_2 B} \right]^{1/(2\beta)}$$
For lower-precision settings, $D_{\mathrm{crit}}$ can fall within realistic dataset sizes, yielding U-shaped loss curves.
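
The sketch below implements the optimal exponent/mantissa split and the critical-data-size formula. The sensitivity exponents $\delta$ and $\nu$ are hypothetical placeholders chosen only so that the split reproduces the E4M3 and E2M1 layouts stated above; the remaining fitted constants must be supplied by the caller rather than being reproduced here.

```python
# Sketch of the exponent/mantissa split and critical data size from the floating-point law.
# DELTA and NU are hypothetical sensitivity exponents chosen only so that the split
# reproduces the E4M3 / E2M1 layouts stated above; the paper's own fitted values differ.
import math

DELTA, NU = 0.72, 0.56   # placeholder exponents for (E + 0.5) and (M + 0.5)

def optimal_split(total_bits: int) -> tuple[int, int]:
    """E_opt = delta*P/(delta+nu) - 0.5, M_opt = nu*P/(delta+nu) - 0.5, rounded to integers."""
    e_opt = DELTA * total_bits / (DELTA + NU) - 0.5
    m_opt = NU * total_bits / (DELTA + NU) - 0.5
    return round(e_opt), round(m_opt)

print(optimal_split(8))   # (4, 3): E4M3 for FP8
print(optimal_split(4))   # (2, 1): E2M1 for FP4

def critical_data_size(n_millions: float, e_bits: int, m_bits: int, block_size: int,
                       d: float, gamma: float, alpha: float, beta: float) -> float:
    """D_crit = [d * gamma * N^alpha * (E+0.5)^delta * (M+0.5)^nu / log2(B)]^(1/(2*beta)).
    The fitted constants d, gamma, alpha, beta must be supplied (not reproduced here);
    N is in millions and the returned D in billions of tokens, matching the law above."""
    numerator = d * gamma * n_millions ** alpha * (e_bits + 0.5) ** DELTA * (m_bits + 0.5) ** NU
    return (numerator / math.log2(block_size)) ** (1.0 / (2.0 * beta))
```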

For a fixed compute budget $C \approx k\, P\, N\, D$, the precision $P$ that optimizes cost-performance lies in the 4–8 bit range across a wide set of configurations.

5. Quantization-Aware Training: Scaling Laws, Bottlenecks, and Mixed-Precision

Quantization-aware training (QAT) is characterized by a distinct scaling law for the quantization error (the loss gap relative to full precision) (Chen et al., 20 May 2025):
$$\epsilon_{\mathrm{QAT}}(N, T, G) = A\, N^{-\alpha}\, T^{\beta}\, [\log_2 G]^{\gamma}$$
where $N$ is the parameter count, $T$ the number of training tokens, and $G$ the quantization group size. Fitted parameters (for W4A4) are $A = 0.1582$, $\alpha = 0.2186$, $\beta = 0.0745$, $\gamma = 0.7779$.

  • Larger $N$ reduces $\epsilon_{\mathrm{QAT}}$ ($\alpha > 0$).
  • Increasing $T$ increases $\epsilon_{\mathrm{QAT}}$, implying that quantized models lag their full-precision counterparts as data grows.
  • Coarser quantization groupings (larger $G$) severely deteriorate accuracy; the effect is dominant in activation quantization.

Decomposition into weights and activations shows that activation error is strongly sensitive to GG due to outliers, particularly in the FC2 layer. Applying mixed precision (e.g., using 8-bit only for the FC2-Proj input) greatly mitigates this granularity-induced error.

Given a target loss penalty $\epsilon_0$, the required $N$ can be computed as
$$N \geq \left[ \frac{A\, T^{\beta} (\log_2 G)^{\gamma}}{\epsilon_0} \right]^{1/\alpha}$$
allowing direct trade-off planning between $N$, $T$, and $G$.
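
A sketch of the QAT error law and the implied minimum model size for a target penalty $\epsilon_0$, using the W4A4 constants quoted above; the example values for tokens, group size, and $\epsilon_0$ are illustrative.

```python
# Sketch of the QAT error law and the model size needed for a target penalty,
# using the W4A4 fit quoted above (Chen et al., 20 May 2025). Example inputs are illustrative.
import math

A_QAT, ALPHA, BETA, GAMMA = 0.1582, 0.2186, 0.0745, 0.7779

def qat_error(n_params: float, tokens: float, group_size: int) -> float:
    """epsilon_QAT(N, T, G) = A * N^-alpha * T^beta * (log2 G)^gamma."""
    return A_QAT * n_params ** -ALPHA * tokens ** BETA * math.log2(group_size) ** GAMMA

def min_params_for_target(eps0: float, tokens: float, group_size: int) -> float:
    """Smallest N with epsilon_QAT <= eps0: N >= [A * T^beta * (log2 G)^gamma / eps0]^(1/alpha)."""
    return (A_QAT * tokens ** BETA * math.log2(group_size) ** GAMMA / eps0) ** (1.0 / ALPHA)

print(qat_error(7e9, 1e12, 128))              # ~0.04 nats for a 7B W4A4 model with G = 128
print(min_params_for_target(0.02, 1e12, 128)) # ~1.6e11 params to keep the penalty under 0.02 nats
```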

6. Compute-Optimality and Allocation in QAT Regimes

Optimally dividing training between a full-precision (FP) phase and a QAT phase leads to nontrivial but predictable scaling (Dremov et al., 26 Sep 2025). The key variable is the tokens-per-parameter-byte statistic
$$S = \frac{D_{\mathrm{tot}}}{N B}$$
where $D_{\mathrm{tot}}$ is the total number of training tokens, $N$ the parameter count, and $B$ the QAT bit width.

The compute-optimal fraction of training allocated to QAT, $f^* = D_{\mathrm{QAT}}/D_{\mathrm{tot}}$, is given by
$$f^*(S) = \exp\left(\frac{\ln S}{\ln S + a}\right), \qquad a = 6.7297$$
The final loss is modeled as
$$L(N, D_{\mathrm{FP}}, D_{\mathrm{QAT}}, B) = L_{\mathrm{Chin}}(N, D_{\mathrm{tot}}) + P_{\mathrm{QAT}}(N, D_{\mathrm{FP}}, D_{\mathrm{QAT}}, B)$$
where the quantization penalty $P_{\mathrm{QAT}}$ has terms for irreducible error, pure-QAT adaptation, and FP–QAT interaction. The law achieves $R^2 > 0.98$ and MAE $< 0.1$ across four distinct bit widths and multiple model sizes.

With this framework, it is possible to derive closed-form allocations of training to the QAT phase under compute and memory budgets, and to determine the best bit width for a fixed deployment constraint.

7. Practical Design Guidance and Phase Boundary Implications

  • Parameter Multipliers: Effective model size under quantization scales as $N \cdot \mathrm{eff}(b)$ for weight-only quantization or $N \cdot f(b_w, b_a)$ for weight-activation pairs, with empirical efficiency falling sharply below 4 bits.
  • Critical Regimes: As LLMs surpass $10^{13}$–$10^{14}$ training tokens, post hoc low-bit quantization becomes impractical except for extremely large models, unless QAT is performed during training.
  • Phase Diagrams: Bit-width/size trade-offs permit Pareto-optimal configurations under storage or compute constraints (e.g. 4–8 bit floating-point formats often optimal for training and inference).
  • Mixed-Precision: Employ mixed-precision only in layers with observed heavy-tailed activation distributions (notably FC2), to mitigate error concentration without excessive memory overhead.
  • QAT Planning: Allocate QAT phase proportionally to tokens-per-parameter-byte for maximal performance; early full-precision followed by late QAT, or a fusion scheme with staged learning-rate decay, minimizes loss at fixed compute.
  • Unified Capacity Axis: The “capacity” metric, related to the mean-squared-error of the representation, provides a universal axis for model comparison and training prescription across quantized, sparse, and jointly-compressed formats.

These scaling laws provide a rigorous, predictive formalism for evaluating, designing, and training quantized and compressed LLMs under realistic compute and deployment constraints.
