Quartet: Native FP4 Training Can Be Optimal for Large Language Models (2505.14669v2)
Abstract: Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. NVIDIA's recent Blackwell architecture facilitates such very low-precision operations through FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
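As a loose illustration of the kind of low-precision linear-layer computation the abstract refers to, below is a minimal sketch of a simulated ("fake-quantized") FP4 forward pass in PyTorch. It assumes a simple round-to-nearest E2M1 quantizer with per-row absmax scaling; the helper names (`fake_quant_fp4`, `fp4_linear`) and the scaling scheme are illustrative assumptions, not the Quartet algorithm or its Blackwell kernels, which execute the matrix multiplications natively in FP4 hardware rather than dequantizing back to full precision.

```python
# Illustrative sketch only: simulated FP4 (E2M1) quantization of a linear layer,
# assuming round-to-nearest with per-row absmax scales. Not the Quartet method.
import torch

# The 8 non-negative values representable in FP4 (E2M1); negatives mirror them.
_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize `x` to simulated FP4 with per-row absmax scaling."""
    # Scale each row so its largest magnitude maps onto the largest FP4 value (6).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / _E2M1_GRID[-1]
    xs = (x / scale).abs()
    # Round each scaled magnitude to the nearest representable E2M1 value.
    midpoints = (_E2M1_GRID[:-1] + _E2M1_GRID[1:]) / 2
    idx = torch.bucketize(xs, midpoints)
    return torch.sign(x) * _E2M1_GRID[idx] * scale

def fp4_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Forward pass of a linear layer with both operands fake-quantized to FP4."""
    return fake_quant_fp4(x) @ fake_quant_fp4(w).T

# Example: compare against the full-precision result.
x = torch.randn(4, 64)
w = torch.randn(128, 64)
err = (fp4_linear(x, w) - x @ w.T).abs().mean()
print(f"mean abs error of simulated-FP4 matmul: {err.item():.4f}")
```

In actual FP4 training, the quantized operands would stay in a packed 4-bit format and the accumulation would run on FP4 tensor cores, which is where the throughput and energy gains cited in the abstract come from.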
- Roberto L. Castro
- Andrei Panferov
- Soroush Tabesh
- Oliver Sieberling
- Jiale Chen
- Mahdi Nikdan
- Saleh Ashkboos
- Dan Alistarh