
SpinQuant: LLM quantization with learned rotations (2405.16406v3)

Published 26 May 2024 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of LLMs, but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot.

SpinQuant: LLM Quantization with Learned Rotations

Introduction

The paper "SpinQuant: LLM Quantization with Learned Rotations" addresses the challenges associated with quantizing LLMs for efficient deployment. Notably, it explores the application of learned rotation matrices in post-training quantization (PTQ) to improve quantization performance while minimizing degradation in reasoning accuracy.

Approach

The conventional quantization process is complicated by outliers, which stretch the effective quantization range and inflate reconstruction error. This work posits that these issues can be mitigated by applying rotations to activations and weights before quantization. The authors propose SpinQuant, a method that optimizes the rotation matrices with Cayley stochastic gradient descent (Cayley SGD), an optimization technique designed for orthonormal matrices.
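
To make this concrete, below is a minimal, self-contained sketch (not the authors' implementation) of the two ingredients described here: a straight-through fake quantizer and a Cayley-transform update that keeps the rotation matrix orthogonal while it is trained to reduce the quantization error of a toy linear layer. Names such as fake_quant and cayley_step, the 4-bit per-tensor quantizer, and the synthetic outlier weights are illustrative assumptions.

```python
import torch

def fake_quant(x, n_bits=4):
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().detach() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (q - x).detach()            # forward: quantized values, backward: identity

def cayley_step(R, grad, lr):
    """One Cayley-transform update that keeps R orthogonal (Wen-Yin descent curve)."""
    n = R.shape[0]
    A = grad @ R.T - R @ grad.T            # skew-symmetric direction built from the gradient
    I = torch.eye(n, dtype=R.dtype)
    return torch.linalg.solve(I + 0.5 * lr * A, I - 0.5 * lr * A) @ R

torch.manual_seed(0)
d = 64
W = torch.randn(d, d)
W[0] *= 20.0                               # synthetic outlier row, mimicking LLM weight outliers
X = torch.randn(256, d)                    # toy calibration activations
y_ref = X @ W                              # full-precision reference output

R = torch.eye(d)                           # start from the identity rotation
for step in range(300):
    R_ = R.clone().requires_grad_(True)
    W_rot = R_.T @ W                       # fold the rotation into the weight
    y_q = (X @ R_) @ fake_quant(W_rot)     # rotated activations times quantized rotated weight
    loss = (y_q - y_ref).pow(2).mean()     # match the full-precision output
    loss.backward()
    R = cayley_step(R.detach(), R_.grad, lr=1e-2)

print("orthogonality error:", (R.T @ R - torch.eye(d)).abs().max().item())
```

Because the Cayley transform of a skew-symmetric matrix is orthogonal, every update stays on the rotation manifold, so no separate re-orthogonalization step is needed.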

Numerical Results

Empirically, SpinQuant demonstrates substantial improvements over existing methods such as LLM-QAT and SmoothQuant. On the LLaMA-2 7B model, with 4-bit quantization of weights, activations, and the KV cache, SpinQuant narrows the accuracy gap to the full-precision model to merely 2.9 points, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. It also outperforms the concurrent QuaRot, which relies on random rather than learned rotations: on the hard-to-quantize LLaMA-3 8B model, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. For larger models such as LLaMA-3 70B, SpinQuant likewise narrows the gap, by 5.0 points over prior state-of-the-art techniques.

Implications and Future Directions

From a theoretical perspective, this research introduces rotational invariance to LLM quantization: the optimization space is expanded to rotations that leave the full-precision network's outputs unchanged while making its weights and activations easier to quantize. Practically, this advance can significantly reduce the cost of inference in server-side applications and on small-scale devices alike, given the much smaller model footprint that efficient quantization enables.
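
As a small illustration of this invariance (a sketch that uses a random orthogonal rotation rather than a learned one), the snippet below checks that inserting R and its transpose around a linear layer leaves the full-precision output unchanged, while spreading a single outlier activation channel across many channels; the outlier_ratio metric is an illustrative choice, not from the paper.

```python
import torch

torch.manual_seed(0)
d, n = 64, 512
X = torch.randn(n, d)
X[:, 3] *= 25.0                            # one heavy outlier channel, as often seen in LLM activations
W = torch.randn(d, d) / d ** 0.5

# Random orthogonal rotation via QR (a stand-in for a learned or Hadamard rotation).
R, _ = torch.linalg.qr(torch.randn(d, d))

# Rotational invariance: (X R)(R^T W) equals X W up to floating-point error.
y_plain = X @ W
y_rot = (X @ R) @ (R.T @ W)
print("max output difference:", (y_plain - y_rot).abs().max().item())

# Outlier metric: largest per-channel magnitude relative to the mean per-channel magnitude.
def outlier_ratio(x):
    col_max = x.abs().max(dim=0).values
    return (col_max.max() / col_max.mean()).item()

print("outlier ratio before rotation:", outlier_ratio(X))
print("outlier ratio after rotation: ", outlier_ratio(X @ R))
```

A lower ratio means the activation values are spread more uniformly across channels, so a single quantization scale wastes fewer levels on rare outliers.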

Future research could examine the interaction between rotation matrices and specific network layers, refining the optimization to better exploit architectural characteristics of LLMs. Moreover, since the paper raises the possibility of reducing outliers at their source before any rotation is applied, such mitigation strategies are another avenue worth investigating.

Conclusion

SpinQuant is a notable advance in LLM quantization. By combining post-training techniques with learned rotations, it bridges much of the gap between full-precision and low-bit quantized models, providing an effective way to deploy state-of-the-art models with reduced compute and memory while preserving accuracy. The use of Cayley optimization to refine the rotation matrices is central to this balance of performance and resource efficiency and sets a new benchmark for LLM quantization.

Authors (9)
  1. Zechun Liu (48 papers)
  2. Changsheng Zhao (17 papers)
  3. Igor Fedorov (24 papers)
  4. Bilge Soran (7 papers)
  5. Dhruv Choudhary (16 papers)
  6. Raghuraman Krishnamoorthi (29 papers)
  7. Vikas Chandra (74 papers)
  8. Yuandong Tian (128 papers)
  9. Tijmen Blankevoort (37 papers)
Citations (28)