
FP4 All the Way: Fully Quantized Training of LLMs (2505.19115v1)

Published 25 May 2025 in cs.LG and cs.AI

Abstract: We demonstrate, for the first time, fully quantized training (FQT) of LLMs using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients on datasets up to 200 billion tokens. We extensively investigate key design choices for FP4, including block sizes, scaling formats, and rounding methods. Our analysis shows that the NVFP4 format, where each block of 16 FP4 values (E2M1) shares a scale represented in E4M3, provides optimal results. We use stochastic rounding for backward and update passes and round-to-nearest for the forward pass to enhance stability. Additionally, we identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. The resulting FP4-trained model achieves downstream task performance comparable to a standard BF16 baseline, confirming that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is supplied in https://github.com/Anonymous1252022/fp4-all-the-way .

Fully Quantized Training of LLMs Using FP4 Precision

The paper "FP4 All the Way: Fully Quantized Training of LLMs" presents a novel approach to the training of LLMs by utilizing fully quantized precision through the adoption of 4-bit floating-point (FP4) data formats. This paper marks the first established instance of deploying FP4 precision for weights, activations, and gradients across extensive datasets, focusing on up to 200 billion tokens. The authors meticulously analyze and optimize key design aspects such as block sizes, scaling formats, and rounding methods to facilitate stable and effective training in FP4 format.

Key Contributions and Findings

  1. FP4 Format Optimization: The research identifies the NVFP4 format, in which each block of 16 FP4 values (E2M1) shares a scale stored in E4M3, as the most effective configuration. It outperforms alternatives such as MXFP4, where larger blocks of FP4 values share a coarser power-of-two (E8M0) scale. The choice of block size and scale encoding directly affects both training stability and accuracy, ultimately confirming NVIDIA's hardware design choices (a block-quantization sketch follows this list).
  2. Split Rounding Strategy: The authors apply stochastic rounding in the backward and update passes and round-to-nearest in the forward pass. This split keeps gradient quantization unbiased in expectation while leaving the forward pass deterministic, which enhances training stability and prevents quantization noise from prematurely degrading the model's accuracy (see the rounding sketch below).
  3. Precision Transition Analysis: A critical threshold is identified both theoretically and empirically: quantized training becomes ineffective once the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise. To address this, the authors switch to higher precision during the later stages of training, using Quantization Aware Finetuning (QAF) to recover performance and ensure convergence (an illustrative check is sketched below).
  4. End-to-End FP4 Training Across Multiple Hardware Units: The authors successfully perform complete FP4 training of an LLM with 7 billion parameters on 256 Intel Gaudi2 accelerators. The FP4-trained model matches the downstream task performance of a standard BF16 baseline, demonstrating practical viability for large-scale applications.
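
To make the NVFP4 layout concrete, here is a minimal NumPy sketch of block quantization under the assumptions above: blocks of 16 values, the E2M1 magnitude grid, and a per-block scale chosen so the block maximum maps onto the largest FP4 magnitude. The helper name `quantize_nvfp4` and the float-valued scale are illustrative simplifications, not the authors' reference implementation (which stores the scale in E4M3).

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1 (the sign bit is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16          # NVFP4 block size
FP4_MAX = 6.0       # largest E2M1 magnitude


def quantize_nvfp4(x: np.ndarray) -> np.ndarray:
    """Fake-quantize a 1-D array (length divisible by 16) to an NVFP4-like format,
    returning the dequantized values so the rounding error can be inspected."""
    blocks = x.reshape(-1, BLOCK)
    # Per-block scale: map each block's maximum magnitude to FP4_MAX.
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP4_MAX
    scale = np.where(scale == 0, 1.0, scale)                  # avoid division by zero
    scaled = blocks / scale
    # Round each magnitude to the nearest E2M1 grid point, then restore the sign.
    mag = np.abs(scaled)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scale).reshape(x.shape)


x = np.random.randn(64).astype(np.float32)
print("max abs error:", np.abs(x - quantize_nvfp4(x)).max())
```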
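
The split rounding rule from item 2 can be illustrated as follows: values are rounded onto the signed E2M1 grid either deterministically (round-to-nearest, as in the forward pass) or stochastically (as in the backward and update passes), which makes the rounding unbiased in expectation. This is a sketch with hypothetical helper names; per-block scaling is omitted for brevity.

```python
import numpy as np

# Signed FP4 (E2M1) grid, scaling omitted for simplicity.
E2M1_GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                      0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_nearest(x):
    # Deterministic rounding: pick the closest grid point (forward pass).
    idx = np.abs(x[..., None] - E2M1_GRID).argmin(axis=-1)
    return E2M1_GRID[idx]


def round_stochastic(x, rng=np.random.default_rng()):
    # Stochastic rounding: pick the upper neighbour with probability proportional
    # to how close x is to it, so E[round(x)] == x (backward/update passes).
    xc = np.clip(x, E2M1_GRID[0], E2M1_GRID[-1])
    hi = np.clip(np.searchsorted(E2M1_GRID, xc), 1, len(E2M1_GRID) - 1)
    lo = hi - 1
    low, high = E2M1_GRID[lo], E2M1_GRID[hi]
    p_up = (xc - low) / np.maximum(high - low, 1e-12)
    return np.where(rng.random(xc.shape) < p_up, high, low)


g = np.full(100_000, 0.7)               # sits between grid points 0.5 and 1.0
print(round_nearest(g).mean())          # 0.5  (systematically biased downward)
print(round_stochastic(g).mean())       # ~0.7 (unbiased on average)
```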
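
The $\sqrt{3}$ criterion from item 3 can be phrased as a simple runtime check. The sketch below assumes the quantization noise is measured as the norm of the difference between the full-precision gradient and its quantized counterpart; the paper's exact criterion and implementation may differ, and the helper name is hypothetical.

```python
import numpy as np


def should_switch_to_high_precision(grad: np.ndarray, grad_q: np.ndarray) -> bool:
    """grad: full-precision gradient; grad_q: its FP4-quantized counterpart."""
    noise_norm = np.linalg.norm(grad_q - grad)   # magnitude of the quantization noise
    return np.linalg.norm(grad) < np.sqrt(3.0) * noise_norm


# Early in training, gradients are large relative to the rounding noise.
g_early = np.random.randn(1024)
print(should_switch_to_high_precision(g_early, g_early + 0.05 * np.random.randn(1024)))  # False

# Late in training, gradients have shrunk and the same noise level dominates.
g_late = 0.02 * np.random.randn(1024)
print(should_switch_to_high_precision(g_late, g_late + 0.05 * np.random.randn(1024)))    # True
```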

Implications for Future Research

This paper extends the frontier for practical implementations of FQT in LLMs, potentially setting a new paradigm for efficient LLM training that reduces computational cost and enhances resource utilization. The adoption of FP4 precision marks a significant step in the evolution of training methodologies, simultaneously promising improvements in throughput and energy efficiency.

However, current hardware support for native FP4 execution remains limited, which makes it difficult to assess the real-world speedup potential until accelerators that natively support FP4 mature. Future hardware with optimized native FP4 matrix operations could translate these precision savings into concrete gains in processing speed.

Conclusion

The findings present a compelling case for adopting FP4 precision in LLM training. Through targeted investigation of precision configurations and rounding techniques, the paper provides foundational insights into managing quantization noise effectively. These contributions make FP4 quantization practical in large-scale settings without sacrificing accuracy, paving the way for new research directions and more efficient training of AI systems.

Authors (4)
  1. Brian Chmiel (15 papers)
  2. Maxim Fishman (5 papers)
  3. Ron Banner (20 papers)
  4. Daniel Soudry (76 papers)