QuEST: Stable Training of LLMs with 1-Bit Weights and Activations (2502.05003v2)

Published 7 Feb 2025 in cs.LG

Abstract: One approach to reducing the massive costs of LLMs is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bits weights and activations. We advance this state-of-the-art via a new method called QuEST, for which we demonstrate optimality at 4-bits and stable convergence as low as 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.

Stable Training of LLMs with QuEST

The paper "QuEST: Stable Training of LLMs with 1-Bit Weights and Activations" presents a novel method, QuEST, for training LLMs in low-precision numerical formats, focusing specifically on using 1-bit weights and activations. The authors tackle the challenge of computational efficiency and model compression, which are critical issues given the increasing scale of LLMs.

Overview

Quantized representations in neural networks are a common approach to reduce training and inference costs. While post-training quantization is widely used, quantization-aware training (QAT) promises more compact and accurate models by directly incorporating quantization in the training phase. Historically, the bit-width for quantization in QAT has hovered around 8-bit for weights and activations to maintain acceptable accuracy levels.

QuEST advances this field by enabling stable training of LLMs with precision as low as 1 bit for both weights and activations, while remaining accuracy-competitive with FP16/BF16 baselines at 4-bit precision. The result is a significant reduction in model size and memory requirements.

Key Contributions

  1. Improved Quantization Process: QuEST quantizes weights and activations by first applying a Hadamard normalization, which reshapes their continuous distributions into a form that a uniform grid handles well, and then fitting the quantization scale to be MSE-optimal. This makes quantization both fast and accurate even at very low bit-widths (see the first sketch after this list).
  2. Trust Gradient Estimator: The paper proposes a trust gradient estimator that distinguishes reliable from unreliable gradient components using the per-element quantization error. The estimator is built around explicitly minimizing the mismatch between the noisy gradient computed over quantized states and the unknown full-precision gradient, which improves convergence stability at low bit-widths (see the second sketch after this list).
  3. Experimental Validation: Experiments on Llama-type architectures show stable scaling behavior across the full range of hardware-supported precisions, yielding new scaling laws for low-precision training. QuEST models trained at 4-bit precision can match or exceed the accuracy of BF16 models while using substantially less memory, demonstrating practical viability on hardware that supports low-precision operations.
  4. Implementation and Efficiency: The paper provides GPU kernel support demonstrating that models produced by QuEST can be executed efficiently on current hardware, reducing inference costs without compromising accuracy. The authors make their code available, facilitating adoption and further research.
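
The quantization step in contribution 1 can be sketched as follows. This is a minimal NumPy/SciPy illustration under stated assumptions: the function names hadamard_normalize and mse_optimal_quantize, and the simple grid search over clipping points, are illustrative and are not the released repository's API.

```python
# Sketch: Hadamard rotation followed by an MSE-optimal uniform quantizer.
# hadamard_normalize / mse_optimal_quantize are illustrative names, not QuEST's API.
import numpy as np
from scipy.linalg import hadamard

def hadamard_normalize(x: np.ndarray) -> np.ndarray:
    """Rotate rows of x (shape (n, d), d a power of two) with an orthonormal
    Hadamard matrix; this spreads outliers and makes the value distribution
    closer to Gaussian, which suits a uniform quantization grid."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)
    return x @ H

def mse_optimal_quantize(x: np.ndarray, bits: int, n_candidates: int = 64):
    """Symmetric uniform quantization with a scale chosen by grid search to
    minimize mean-squared reconstruction error. (A 1-bit quantizer would use a
    sign-based grid instead; omitted here for brevity.)"""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    best_scale, best_err = max_abs / qmax, np.inf
    for frac in np.linspace(0.1, 1.0, n_candidates):   # candidate clipping points
        scale = frac * max_abs / qmax
        q = np.clip(np.round(x / scale), -qmax, qmax)
        err = np.mean((q * scale - x) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale
    q = np.clip(np.round(x / best_scale), -qmax, qmax)
    return q.astype(np.int8), best_scale

# Toy usage: 4-bit quantization of a random weight block.
w = np.random.randn(16, 128).astype(np.float32)
w_rot = hadamard_normalize(w)
q, scale = mse_optimal_quantize(w_rot, bits=4)
w_hat = q * scale   # dequantized values, still in the rotated basis
print("relative MSE:", np.mean((w_hat - w_rot) ** 2) / np.mean(w_rot ** 2))
```

Because the Hadamard matrix is orthogonal, the rotation preserves the mean-squared error measured in the original basis and can be inverted exactly after dequantization.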
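
The trust gradient estimator in contribution 2 can be illustrated with a small PyTorch autograd function. The masking rule below, which passes the gradient only where the per-element quantization error is small, is a simplified stand-in for the paper's estimator; the class name TrustQuant and the trust_threshold parameter are assumptions for illustration.

```python
# Sketch: a "trust"-style straight-through backward that suppresses the gradient
# where the quantization error is large. TrustQuant / trust_threshold are
# illustrative names, and the masking rule simplifies the paper's estimator.
import torch

class TrustQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, bits, trust_threshold):
        qmax = 2 ** (bits - 1) - 1
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        x_hat = q * scale                          # dequantized forward value
        err = (x_hat - x).abs()                    # per-element quantization error
        # Trust the gradient only where quantization changed the value little.
        ctx.save_for_backward(err <= trust_threshold * scale)
        return x_hat

    @staticmethod
    def backward(ctx, grad_out):
        (trust_mask,) = ctx.saved_tensors
        # Pass the incoming gradient through trusted elements only; untrusted
        # positions receive zero instead of a noisy straight-through estimate.
        return grad_out * trust_mask, None, None, None

# Toy usage: 4-bit fake quantization of activations with a masked backward pass.
x = torch.randn(8, 64, requires_grad=True)
scale = x.detach().abs().max() / (2 ** 3 - 1)      # naive scale; QuEST fits it MSE-optimally
y = TrustQuant.apply(x, scale, 4, 0.5)
y.sum().backward()
print("fraction of gradient entries kept:", x.grad.ne(0).float().mean().item())
```

Compared with a plain straight-through estimator, which propagates the gradient through every element regardless of quantization error, suppressing the least reliable components is what the paper credits for stable convergence at very low bit-widths.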

Implications and Future Directions

By lowering the computational and memory demands of training and inference, QuEST could significantly influence how large-scale models are trained and deployed, making LLMs accessible on a wider range of hardware, including devices with limited processing capability.

QuEST’s results indicate that the practical lower bound on precision for quantized training is lower than previously anticipated, inspiring further exploration of alternative quantization strategies and training dynamics for LLMs.

Looking forward, potential future developments include validating QuEST on larger LLMs beyond 800M parameters and exploring its adaptability to other neural network architectures, such as those used in vision and reinforcement learning. Additionally, building on QuEST’s extension to sparse representations could further improve compression by combining sparsity with quantization.

Overall, the QuEST framework establishes a new frontier for the efficient and stable training of LLMs using very low-bit quantization, offering a promising avenue for both academic inquiry and practical application.

Authors (6)
  1. Andrei Panferov (7 papers)
  2. Jiale Chen (43 papers)
  3. Soroush Tabesh (7 papers)
  4. Roberto L. Castro (7 papers)
  5. Mahdi Nikdan (7 papers)
  6. Dan Alistarh (133 papers)