Relationship Between Optimal QAT Fraction and Pretraining Precision
Determine how the optimal quantization-aware training (QAT) fraction, defined as the fraction of total training tokens allocated to the QAT phase after full-precision pretraining in a two-stage pipeline, depends on the floating-point precision used during the pretraining stage (e.g., bfloat16, FP8, or FP4). Characterize this relationship across precisions in the setting of decoder-only transformer language models where QAT resumes from a full-precision checkpoint.
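For concreteness, a minimal sketch of how the two-stage token-budget split could be expressed; the function names, the 1T-token budget, and the 10% QAT fraction below are illustrative assumptions, not values from the source.

```python
# Illustrative sketch of the two-stage budget split; names and numbers are
# hypothetical, chosen only to make the definition of the QAT fraction concrete.

def split_token_budget(total_tokens: int, qat_fraction: float) -> tuple[int, int]:
    """Split a fixed training budget between full-precision pretraining and QAT.

    qat_fraction is the fraction of *total* training tokens spent in the
    quantization-aware training phase after resuming from the full-precision
    checkpoint.
    """
    qat_tokens = int(round(total_tokens * qat_fraction))
    pretrain_tokens = total_tokens - qat_tokens
    return pretrain_tokens, qat_tokens


def two_stage_training(total_tokens: int, qat_fraction: float, pretrain_precision: str) -> None:
    """Outline of the two-stage pipeline: full-precision pretraining, then QAT."""
    pretrain_tokens, qat_tokens = split_token_budget(total_tokens, qat_fraction)
    # Stage 1: pretrain in the chosen floating-point precision (e.g., bfloat16, FP8, FP4).
    print(f"Stage 1: pretrain for {pretrain_tokens:,} tokens in {pretrain_precision}")
    # Stage 2: resume from the stage-1 checkpoint and run quantization-aware training.
    print(f"Stage 2: QAT for {qat_tokens:,} tokens resumed from the stage-1 checkpoint")


if __name__ == "__main__":
    # Hypothetical example: 1T-token budget, 10% of tokens allocated to QAT.
    two_stage_training(total_tokens=10**12, qat_fraction=0.10, pretrain_precision="bfloat16")
```

The open question is how the best choice of `qat_fraction` shifts as `pretrain_precision` changes.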
References
The relationship between the optimal QAT fraction and pretraining precision remains unknown. This direction is especially interesting given the emergence of 8-bit floating-point training (Peng et al., 2023) and even 4-bit training (Zhou et al., 2025b; Wang et al., 2025).