
Relationship Between Optimal QAT Fraction and Pretraining Precision

Determine how the optimal quantization-aware training (QAT) fraction, i.e., the share of the total training-token budget allocated to the QAT phase that follows full-precision pretraining in a two-stage pipeline, depends on the floating-point precision used during pretraining (e.g., bfloat16, FP8, or FP4), and characterize this dependence across precisions for decoder-only transformer language models whose QAT is resumed from full-precision checkpoints.
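To make the quantity in question concrete, one natural formalization is sketched below; the notation is illustrative and not taken verbatim from the paper. Writing D_FP and D_QAT for the tokens spent in the full-precision and QAT phases, the QAT fraction and the optimum being asked about are

```latex
f_{\mathrm{QAT}} = \frac{D_{\mathrm{QAT}}}{D_{\mathrm{FP}} + D_{\mathrm{QAT}}},
\qquad
f_{\mathrm{QAT}}^{*}(p) = \arg\min_{f_{\mathrm{QAT}} \in [0,1]}
  \mathcal{L}\!\left(N,\ D,\ f_{\mathrm{QAT}},\ b \,\middle|\, \text{pretraining precision } p\right),
```

where N is the model size, D = D_FP + D_QAT is the total token budget, b is the QAT bit width, and p ranges over pretraining precisions such as bfloat16, FP8, or FP4. The open question is how f*_QAT(p) varies with p.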


Background

The paper studies how to optimally divide training compute between a full-precision pretraining phase and a quantization-aware training (QAT) phase for LLMs, proposing both an empirical optimal allocation rule and a unified loss scaling law that predicts final loss as a function of model size, token counts for FP and QAT, and QAT bit width.

All experiments in the paper use bfloat16 automatic mixed precision during pretraining, and the presented results show that the optimal QAT fraction increases with the tokens-per-parameter-byte statistic. The authors note that emerging lower-precision pretraining (e.g., FP8 and FP4) may change this relationship, but the dependence of the optimal QAT fraction on pretraining precision has not yet been established.
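As a concrete illustration of the quantities involved, the minimal Python sketch below computes the tokens-per-parameter-byte statistic and the two-stage token split for a candidate QAT fraction. The definition used here (total training tokens divided by the model's parameter memory in bytes at the QAT bit width) and all numbers are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch (not the paper's code): illustrates the quantities discussed above.
# Assumption: "tokens per parameter byte" is total training tokens divided by the
# model's parameter memory in bytes at the QAT bit width; the paper's exact
# definition may differ.

def tokens_per_parameter_byte(total_tokens: float, num_params: float, qat_bits: int) -> float:
    param_bytes = num_params * qat_bits / 8  # model size in bytes at the QAT precision
    return total_tokens / param_bytes

def split_token_budget(total_tokens: float, qat_fraction: float) -> tuple[float, float]:
    """Two-stage pipeline: full-precision pretraining followed by QAT."""
    d_qat = qat_fraction * total_tokens
    d_fp = total_tokens - d_qat
    return d_fp, d_qat

if __name__ == "__main__":
    N = 1e9        # hypothetical 1B-parameter model
    D = 100e9      # hypothetical 100B-token budget
    bits = 4       # hypothetical QAT bit width
    print(tokens_per_parameter_byte(D, N, bits))    # tokens per parameter byte
    print(split_token_budget(D, qat_fraction=0.3))  # FP vs. QAT token split
```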

References

First, the relationship between the optimal QAT fraction and pretraining precision remains unknown. This direction is especially interesting with the emergence of 8-bit floating-point training (Peng et al., 2023) and even 4-bit training (Zhou et al., 2025b; Wang et al., 2025).

Compute-Optimal Quantization-Aware Training (arXiv:2509.22935, Dremov et al., 26 Sep 2025), Section 8 (Future Work)