FlatQuant: Flatness Matters for LLM Quantization (2410.09426v1)

Published 12 Oct 2024 in cs.CL and cs.LG

Abstract: Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with the equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still remain steep and outspread. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach to enhance flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments show that FlatQuant sets up a new state-of-the-art quantization benchmark. For instance, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x of QuaRot to merely 0.07x, bringing up to 2.3x speedup for prefill and 1.7x speedup for decoding, respectively. Code is available at: https://github.com/ruikangliu/FlatQuant

Overview

The paper "FlatQuant: Flatness Matters for LLM Quantization" (Sun et al., 12 Oct 2024 ) introduces a novel post-training quantization (PTQ) framework designed to address the persistent issues caused by outlier distributions in both weights and activations in LLMs. By emphasizing the importance of flattening these distributions, the proposed method seeks to reduce quantization error when using equally spaced quantization levels. Rather than relying solely on pre-quantization transformations, such as per-channel scaling or Hadamard transforms, FlatQuant operates as a post-training approach that learns optimal, layer-specific affine transformations. This methodology results in significantly improved quantization accuracy and lower inference latency compared to state-of-the-art approaches.

Methodology

Learned Affine Transformations

A core innovation in FlatQuant is the optimization of an invertible affine transformation for each linear layer. For a given linear operation, expressed as:

Y = XW^T,

the approach seeks an optimal invertible transformation matrix P such that the quantized operation minimizes the quantization error:

P^* = \arg\min_P \|Y - Q(XP) \cdot Q(P^{-1}W^T)\|_F^2,

where Q(·) denotes the quantization function. This formulation counteracts the steep, outlier-dominated distributions by learning, for each layer, a transformation that promotes “flatness” in the weight and activation distributions.
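The per-layer calibration objective can be illustrated with a minimal PyTorch sketch. Everything below (the symmetric fake quantizer, the straight-through estimator, the optimizer settings, and the toy shapes) is an illustrative assumption rather than the authors' implementation; it only shows how a learnable P can be optimized against the quantization error of one linear layer.

```python
import torch

def fake_quant(x, n_bits=4):
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator so gradients can reach the transform P. (Illustrative;
    # FlatQuant's actual quantizer granularity and settings may differ.)
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = (x / scale).round().clamp(-qmax, qmax) * scale
    return x + (q - x).detach()

# Toy calibration data for one linear layer Y = X W^T (shapes are made up).
torch.manual_seed(0)
X = torch.randn(128, 64)     # activations: tokens x in_features
W = torch.randn(256, 64)     # weights: out_features x in_features
Y = X @ W.T                  # full-precision reference output

# Learnable invertible transform P, initialized to the identity.
P = torch.nn.Parameter(torch.eye(64))
opt = torch.optim.Adam([P], lr=1e-3)

for step in range(200):
    P_inv = torch.linalg.inv(P)
    Xq = fake_quant(X @ P)             # quantize transformed activations
    Wq = fake_quant(P_inv @ W.T)       # quantize inversely transformed weights
    loss = (Y - Xq @ Wq).pow(2).sum()  # per-layer quantization error
    opt.zero_grad()
    loss.backward()
    opt.step()
```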

Kronecker Decomposition

To address the computational and memory overhead of storing a full transformation matrix for each layer, FlatQuant utilizes Kronecker decomposition. The transformation matrix P is decomposed as:

P = P_1 \otimes P_2,

with P_1 and P_2 being smaller invertible matrices. This decomposition not only reduces the number of learnable parameters but also lessens the computational burden during both calibration and inference, while still allowing quantization errors to be back-propagated effectively and keeping the dimensions of the two factors balanced.
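The saving comes from never materializing the full d × d matrix: applying P_1 ⊗ P_2 to a hidden vector reduces to two small matrix multiplications on a reshaped view. The sketch below checks this identity numerically with hypothetical factor sizes n1 and n2 (chosen here for illustration, not taken from the paper).

```python
import torch

torch.manual_seed(0)
n1, n2 = 8, 16                                     # hypothetical split of hidden size d = n1 * n2
P1 = torch.randn(n1, n1, dtype=torch.double)
P2 = torch.randn(n2, n2, dtype=torch.double)
X = torch.randn(32, n1 * n2, dtype=torch.double)   # a batch of activation rows

# Naive path: materialize the full d x d transform P = P1 (kron) P2.
P_full = torch.kron(P1, P2)
Y_naive = X @ P_full

# Efficient path: reshape each row to (n1, n2) and apply only the small
# factors, using the row-major identity  x @ (P1 kron P2) = vec(P1^T X_mat P2).
Xr = X.reshape(-1, n1, n2)
Y_fast = torch.einsum('ij,bjk,kl->bil', P1.T, Xr, P2).reshape(-1, n1 * n2)

assert torch.allclose(Y_naive, Y_fast)
```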

Per-Channel Scaling and Clipping Thresholds

FlatQuant further incorporates learnable per-channel scaling vectors to harmonize the variance between weights and activations. This is critical in managing the impact of outliers prior to the affine transformation. Additionally, learnable clipping thresholds (α_w and α_a) are applied to ensure that extreme values, even after applying the affine transformations, do not adversely affect the quantization process. These parameters, calibrated with a modest set of calibration data, help in maintaining a tight distribution that is resilient to quantization-induced accuracy degradation.
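A hedged sketch of these two ingredients is given below. The sigmoid/exp parameterizations, the toy layer sizes, and the per-tensor clipping granularity are assumptions made for illustration only; they are not claimed to match the paper's exact formulation. In FlatQuant these parameters are calibrated jointly with the affine transforms.

```python
import torch

def clipped_fake_quant(x, alpha, n_bits=4):
    # Clip to a learnable fraction of the max magnitude, then fake-quantize.
    # The sigmoid parameterization of the clipping ratio is an assumption
    # of this sketch, not necessarily the paper's exact form.
    qmax = 2 ** (n_bits - 1) - 1
    bound = torch.sigmoid(alpha) * x.abs().max()
    xc = torch.max(torch.min(x, bound), -bound)
    scale = bound / qmax
    q = (xc / scale).round().clamp(-qmax, qmax) * scale
    return x + (q - x).detach()                    # straight-through estimator

d_in, d_out = 64, 256                              # toy layer sizes
X = torch.randn(128, d_in)
W = torch.randn(d_out, d_in)

# Per-channel scaling s balances activation and weight magnitudes:
# activations are divided by s and weight columns multiplied by s,
# so X @ W.T is preserved exactly before quantization.
log_s = torch.nn.Parameter(torch.zeros(d_in))
alpha_a = torch.nn.Parameter(torch.tensor(4.0))    # activation clipping threshold
alpha_w = torch.nn.Parameter(torch.tensor(4.0))    # weight clipping threshold

s = log_s.exp()
Xq = clipped_fake_quant(X / s, alpha_a)
Wq = clipped_fake_quant(W * s, alpha_w)
Y_hat = Xq @ Wq.T                                  # approximates X @ W.T
```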

Efficient Kernel Fusion

To mitigate the latency overhead typically introduced by pre-quantization transformations, the authors fuse the Kronecker-factored affine transformation and the quantization step into a single custom kernel. Implemented in OpenAI Triton, this fused operator loads the small transformation matrices into SRAM, performs the required matrix multiplications and quantization on-chip, and writes back only the final results. This design minimizes global-memory traffic, facilitating significant speed improvements during both the prefill and decoding phases.
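For reference, the unfused sequence of operations that this kernel collapses can be written in a few lines of plain PyTorch. This is only a functional sketch of the computation, not the Triton kernel itself: the real fused operator keeps the intermediates in SRAM, whereas this version round-trips them through global memory.

```python
import torch

def transform_and_quantize(x, P1, P2, n_bits=4):
    """Unfused reference for the work the fused kernel performs in one pass:
    apply the Kronecker-factored transform, then quantize each token row."""
    n1, n2 = P1.shape[0], P2.shape[0]
    qmax = 2 ** (n_bits - 1) - 1

    # Step 1: online affine transform via the two small Kronecker factors.
    xr = x.reshape(x.shape[0], n1, n2)
    xt = torch.einsum('ij,bjk,kl->bil', P1.T, xr, P2).reshape(x.shape[0], -1)

    # Step 2: per-token symmetric quantization into int4-range values.
    scale = xt.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    x_int = (xt / scale).round().clamp(-qmax, qmax).to(torch.int8)
    return x_int, scale          # consumed by the subsequent low-bit GEMM

# Example with toy sizes: 32 activation rows of width 8 * 16 = 128.
x = torch.randn(32, 8 * 16)
x_int, scale = transform_and_quantize(x, torch.randn(8, 8), torch.randn(16, 16))
```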

Experimental Evaluation

Accuracy and Performance Benchmarks

The experimental results presented in the paper are quite compelling with respect to both quantization error and inference speed:

  • Quantization Accuracy: When applying W4A4 quantization on the LLaMA-3-70B model, FlatQuant achieves an accuracy drop of less than 1%, which is particularly noteworthy given the high sensitivity of LLMs to quantization errors. This performance exceeds that of comparable methods such as SpinQuant by a margin of 7.5%.
  • Zero-Shot QA: The method also shows strong performance on zero-shot tasks across various QA benchmarks (ARC-Challenge, LAMBADA, etc.), reducing the gap between quantized models and FP16 baselines.

Inference Latency Improvements

  • Prefill and Decoding Speed: By fusing operations into a unified kernel, FlatQuant drastically reduces the latency overhead often incurred by pre-quantization transformations. Specifically, it reduces the additional runtime from 0.26x (as noted for QuaRot) to just 0.07x, resulting in up to a 2.3× speedup in prefill and a 1.7× speedup in decoding.
  • Memory Efficiency: The use of Kronecker decomposition plays a significant role in lowering both computational and memory requirements, making the method viable for deployment in resource-constrained environments.

Discussion and Implications

The FlatQuant approach underlines the importance of “flatness” in quantization strategies. By directly targeting and reducing the steepness of weight and activation distributions, the method facilitates more effective quantization even in low-bit regimes (e.g., W4A4). The framework’s reliance on learnable affine transformations, efficient matrix decompositions, and fused kernel operations renders it not only effective in terms of accuracy preservation but also highly efficient for practical deployment.

Strong numerical results reinforce the practicality of the approach—particularly the sub-1% accuracy drop in aggressive quantization scenarios, combined with notable speedups in inference—making it well-suited for real-world applications where both performance and latency are critical trade-offs.

Furthermore, the methodology is versatile enough to be extended to other quantization settings (e.g., weight-only quantization and KV cache quantization) with minimal performance degradation. For practitioners, these characteristics could lead to significant improvements in deploying LLMs on limited hardware without sacrificing model responsiveness or accuracy.

Conclusion

FlatQuant presents a sophisticated and highly practical framework for LLM quantization that directly tackles the challenge of outlier-induced quantization errors by enforcing flat distributions through learned affine transformations. Its incorporation of Kronecker decomposition minimizes overhead, while the use of fused kernels ensures negligible latency impact. The method sets a new benchmark in low-bit quantization for LLMs, making it an attractive option for both academic research and real-world deployment scenarios where inference speed and model accuracy are paramount.

Authors (13)
  1. Yuxuan Sun (79 papers)
  2. Ruikang Liu (5 papers)
  3. Haoli Bai (24 papers)
  4. Han Bao (77 papers)
  5. Kang Zhao (59 papers)
  6. Yuening Li (19 papers)
  7. Jiaxin Hu (10 papers)
  8. Xianzhi Yu (16 papers)
  9. Lu Hou (50 papers)
  10. Chun Yuan (127 papers)
  11. Xin Jiang (242 papers)
  12. Wulong Liu (38 papers)
  13. Jun Yao (36 papers)