
Quartet: Native FP4 Training Can Be Optimal for Large Language Models (2505.14669v2)

Published 20 May 2025 in cs.LG

Abstract: Training LLMs directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. To this end, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
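As a rough illustration of what "all the major computations (i.e., linear layers) in low precision" means, the sketch below fake-quantizes both activations and weights to the FP4 E2M1 grid before the matrix multiply. This is a minimal NumPy simulation under assumed choices (per-tensor scaling, round-to-nearest), not the paper's Quartet recipe or its Blackwell CUDA kernels.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_fp4(x, grid=FP4_E2M1_GRID):
    """Fake-quantize a tensor to FP4 E2M1 with a per-tensor scale.

    The scale maps the largest magnitude in `x` onto the largest
    representable FP4 value; each entry is then rounded to the nearest
    grid point (round-to-nearest) and dequantized back to FP32.
    """
    scale = np.abs(x).max() / grid[-1] + 1e-12
    scaled = x / scale
    # Find the nearest representable FP4 magnitude for each entry.
    idx = np.argmin(np.abs(np.abs(scaled)[..., None] - grid), axis=-1)
    q = grid[idx] * np.sign(scaled)
    return q * scale


def fp4_linear(x, w):
    """Simulated low-precision linear layer: both activations and weights
    are quantized to FP4 before the matmul; accumulation stays in FP32."""
    return quantize_fp4(x) @ quantize_fp4(w).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 16)).astype(np.float32)  # activations
    w = rng.standard_normal((8, 16)).astype(np.float32)  # weight matrix
    y_fp32 = x @ w.T
    y_fp4 = fp4_linear(x, w)
    print("relative error:", np.linalg.norm(y_fp4 - y_fp32) / np.linalg.norm(y_fp32))
```

On real Blackwell hardware the quantized operands would stay in packed FP4 and the matmul would run in the tensor cores; the simulation above only mimics the rounding error introduced by that choice.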

Authors (8)
  1. Roberto L. Castro (7 papers)
  2. Andrei Panferov (7 papers)
  3. Soroush Tabesh (7 papers)
  4. Oliver Sieberling (6 papers)
  5. Jiale Chen (43 papers)
  6. Mahdi Nikdan (7 papers)
  7. Saleh Ashkboos (20 papers)
  8. Dan Alistarh (133 papers)