
SpinQuant: Advanced PTQ for Transformer LLMs

Updated 4 March 2026
  • SpinQuant is an advanced post-training quantization method that uses learned rotation matrices to reduce quantization-induced errors in transformer LLMs.
  • It strategically inserts orthonormal rotations to disperse outliers, align data distributions closer to Gaussianity, and preserve full-precision performance.
  • Empirical results show that SpinQuant significantly narrows the accuracy gap for models like LLaMA-2 and LLaMA-3, enabling efficient low-bit inference without retraining the model weights.

SpinQuant is an advanced post-training quantization (PTQ) methodology designed for LLMs that addresses the performance degradation caused by quantization outliers through the use of learned rotation matrices. By strategically inserting and optimizing orthonormal rotations within transformer architectures, SpinQuant preserves full-precision output equivalence while substantially reducing quantization-induced errors in both weights and activations. This rotation-based approach aims to align data distributions closer to Gaussianity, mitigate outlier effects, and thus enable highly efficient int8 or even 4-bit inference for large transformer models with minimal loss in downstream accuracy. SpinQuant’s core techniques revolve around the parameterization, learning, and integration of rotation matrices within the LLM quantization workflow, and have demonstrated state-of-the-art results when compared to both random rotations and prior PTQ/QAT baselines (Liu et al., 2024).

1. Quantization Challenges in LLMs

State-of-the-art transformer-based LLMs are frequently quantized for deployment to reduce memory, computational cost, and power use. PTQ methods, which quantize a fully trained model without additional gradient updates to the weights, are particularly appealing for their simplicity. However, such methods are often undermined by extreme outliers (rare, large-magnitude values in the model’s weights or activations), which dominate the quantization scale and leave too few levels for the bulk of the values, causing severe information loss. The kurtosis $\kappa$ of activation distributions in LLMs can exceed 200, in contrast to the Gaussian case, which has $\kappa \approx 3$ (Liu et al., 2024). This non-Gaussianity makes straightforward quantization suboptimal.
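The effect of outliers on the quantization scale can be illustrated with a small NumPy sketch. The data, the outlier magnitude of 100, and the helper names `kurtosis`, `quantize_symmetric`, and `quant_mse` are assumptions made for illustration; they are not taken from Liu et al. (2024).

```python
import numpy as np

def kurtosis(x):
    """Pearson kurtosis E[(x - mu)^4] / sigma^4; close to 3 for Gaussian data."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean()
    return float(np.mean(centered ** 4) / np.mean(centered ** 2) ** 2)

def quantize_symmetric(x, bits=4):
    """Round-to-nearest with one scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def quant_mse(x, bits=4):
    """Mean squared error introduced by quantize_symmetric."""
    return float(np.mean((x - quantize_symmetric(x, bits)) ** 2))

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(4096)
with_outliers = gaussian.copy()
with_outliers[:4] = 100.0  # hypothetical outliers, chosen for illustration

for name, x in [("gaussian", gaussian), ("with outliers", with_outliers)]:
    print(f"{name}: kurtosis={kurtosis(x):.1f}, 4-bit mse={quant_mse(x):.4f}")
```

With only four outliers among 4096 values, the scale grows so much that most of the remaining values round to zero, which is the failure mode described above.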

2. Rotation-Based Outlier Mitigation

SpinQuant leverages the rotation-invariance property of transformer architectures: for any orthonormal matrix $R$ (i.e., $R^\top R = I$), one can insert $R^\top R$ in front of a linear mapping, absorbing $R^\top$ into the weights and applying $R$ to the input, while maintaining equivalence in full-precision computation. Specifically:

$$y = W x = W (R^\top R) x = (W R^\top)(R x)$$

A suitable choice of $R$ can “mix” the axes and disperse outliers across dimensions, rendering the input distributions more Gaussian-like and facilitating lower quantization error. Empirically, applying a random rotation can already narrow the quantization-performance gap, but the variance over different random choices is substantial: up to 13 percentage points in zero-shot accuracy on certain LLM benchmarks (Liu et al., 2024). Consequently, SpinQuant adopts a learning-based approach to optimize $R$ for quantization fidelity.
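The identity above can be checked numerically. The following is a minimal sketch with a random orthonormal matrix obtained from a QR decomposition; the dimensions and the outlier value of 50 are arbitrary assumptions, and the rotation is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((16, d))
x = rng.standard_normal(d)
x[3] = 50.0  # hypothetical outlier channel

# Random orthonormal R: the Q factor of a square Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

y_ref = W @ x                # original mapping
y_rot = (W @ R.T) @ (R @ x)  # rotated weights applied to rotated input

print("outputs match:", np.allclose(y_ref, y_rot))
print("max |x|   :", np.max(np.abs(x)))
print("max |R x| :", np.max(np.abs(R @ x)))
```

The output is unchanged in full precision, while the largest entry of the rotated input is much smaller than the original outlier because its energy is spread over all coordinates.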

3. Parameterization and Learning of Rotation Matrices

SpinQuant introduces several learnable rotation matrices:

  • $R_1$: a full residual-stream rotation ($D_{token} \times D_{token}$) operating on the transformer’s token-wise representations.
  • $R_2$: applied per attention head, with dimension $D_{head} \times D_{head}$, for multi-head attention value and output pairs.
  • $R_3$, $R_4$: fast (Hadamard) rotations for the KV-cache and MLP inner activations, optionally used for additional outlier suppression.
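Hadamard rotations are considered fast because they can be applied in $O(n \log n)$ additions and subtractions instead of a dense matrix product. The sketch below uses the Sylvester construction and a textbook fast Walsh-Hadamard transform; the function names `hadamard` and `fwht` are illustrative and do not refer to the SpinQuant codebase.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix, scaled to be orthonormal."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fwht(x):
    """Fast Walsh-Hadamard transform of a 1-D array; equals hadamard(n) @ x."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[0]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        blocks = x.reshape(-1, 2, h)   # split each run of 2h into two halves
        x = np.stack([blocks[:, 0] + blocks[:, 1],
                      blocks[:, 0] - blocks[:, 1]], axis=1).reshape(n)
        h *= 2
    return x / np.sqrt(n)

one_hot = np.zeros(8)
one_hot[5] = 8.0                       # a single hypothetical outlier
print(fwht(one_hot))                   # every entry has magnitude 8 / sqrt(8)
```

A single outlier is spread evenly across all coordinates, which is the outlier-suppression behaviour attributed to $R_3$ and $R_4$.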

$R_1$ and $R_2$ are constrained to lie on the Stiefel manifold, i.e., they are strictly orthonormal and thus preserve inner products. Optimization is performed by fixing pretrained weights and calibrating $R_1$, $R_2$ on a small data collection such as WikiText2, minimizing the end-task loss (e.g., cross-entropy) under quantized inference. The optimization uses an efficient variant of Stochastic Gradient Descent (Cayley-SGD) that preserves orthogonality:

$$R \leftarrow \left(I - \frac{\eta}{2} Y\right)^{-1} \left(I + \frac{\eta}{2} Y\right) R$$

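where $\eta$ is the learning rate and $Y$ is a skew-symmetric matrix computed from the gradient of the loss with respect to $R$. Because the Cayley transform of a skew-symmetric matrix is orthogonal, the update keeps $R$ orthonormal. $R$ can be initialized as a random Hadamard or random orthogonal matrix; after Cayley-SGD, results are robust to the initial choice (Liu et al., 2024).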
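A minimal sketch of the Cayley update on a toy problem is given below. The choice $Y = R G^\top - G R^\top$ (with $G$ the Euclidean gradient), the toy loss, the step size, and the function name `cayley_step` are assumptions made for illustration; the exact definition of $Y$ and any momentum terms used by SpinQuant are not reproduced here.

```python
import numpy as np

def cayley_step(R, G, lr):
    """One update R <- (I - lr/2 Y)^{-1} (I + lr/2 Y) R.

    Y = R G^T - G R^T is skew-symmetric; this sign convention is an
    illustrative choice under which a small step decreases the loss.
    """
    I = np.eye(R.shape[0])
    Y = R @ G.T - G @ R.T
    return np.linalg.solve(I - 0.5 * lr * Y, (I + 0.5 * lr * Y) @ R)

rng = np.random.default_rng(0)
n = 8
R, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthonormal start
T, _ = np.linalg.qr(rng.standard_normal((n, n)))  # hypothetical target rotation

def loss(R):
    """Toy objective standing in for the quantized task loss."""
    return 0.5 * np.sum((R - T) ** 2)

loss_start = loss(R)
for _ in range(100):
    G = R - T                      # gradient of the toy loss w.r.t. R
    R = cayley_step(R, G, lr=0.2)
loss_end = loss(R)
orth_err = np.max(np.abs(R.T @ R - np.eye(n)))

print(f"loss {loss_start:.3f} -> {loss_end:.3f}, max |R^T R - I| = {orth_err:.1e}")
```

The loss decreases while $R^\top R$ stays at the identity up to floating-point error, without any explicit re-orthogonalization step.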

4. SpinQuant Quantization Pipeline

The quantization workflow under SpinQuant involves learning and merging the appropriate rotations, quantizing weights/activations with high-precision methods, and optionally integrating lightweight online rotations:

  1. Rotation Learning
    • Optimize $R_1$ and $R_2$ over a dataset (typically $N = 800$ calibration examples) using up to $T = 100$ Cayley-SGD iterations.
  2. Weight Absorption
    • Merge the learned rotations into the existing weights, e.g., $W_{resid} \leftarrow W_{resid} R_1^\top$, $W_{attn\text{-}out} \leftarrow R_2^\top W_{attn\text{-}out}$.
  3. 4-Bit Quantization
    • Quantize the merged weights to 4 bits, using backend schemes such as GPTQ or RTN.
    • Optionally, insert fast Hadamard rotations ($R_3$, $R_4$) inside the inference pipeline for MLPs and KV-cache.
  4. Inference
    • No additional architectural changes are required beyond replacing the quantized (rotated) weights.
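The absorption and quantization steps can be sketched on a single linear layer. This is a toy illustration under stated assumptions: a random orthonormal matrix stands in for the learned rotation, per-tensor symmetric round-to-nearest (the hypothetical helper `rtn`) stands in for the quantization backend, and the activations contain one artificial outlier channel.

```python
import numpy as np

def rtn(t, bits=4):
    """Per-tensor symmetric round-to-nearest fake quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 128, 64, 256
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
X = rng.standard_normal((d_in, n_tokens))
X[0] = 20.0  # one hypothetical outlier channel with a constant large value

# Stand-in for a learned rotation: a random orthonormal matrix.
R, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))

Y_fp = W @ X                       # full-precision reference
W_rot = W @ R.T                    # weight absorption: W <- W R^T
X_rot = R @ X                      # rotated activations seen by the layer

Y_plain = rtn(W) @ rtn(X)          # 4-bit weights and activations, no rotation
Y_rot = rtn(W_rot) @ rtn(X_rot)    # 4-bit weights and activations, rotated

err_plain = np.linalg.norm(Y_fp - Y_plain) / np.linalg.norm(Y_fp)
err_rot = np.linalg.norm(Y_fp - Y_rot) / np.linalg.norm(Y_fp)
print(f"relative output error: plain={err_plain:.3f}, rotated={err_rot:.3f}")
```

In this toy setting the rotated layer reproduces the full-precision output more closely, because the outlier channel no longer dictates the activation scale. The margin shown here is not indicative of results on real models.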

Common hyperparameters include the size of the calibration set, the learning rate schedule for Cayley-SGD, number of optimization iterations, and clipping ratios for activations or KV-cache, although the empirical sensitivity to most of these is minor [Table 17 in (Liu et al., 2024)]. Asymmetric min-max quantization of activations and weights performs slightly better than symmetric variants [Table 16].
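The difference between symmetric and asymmetric min-max quantization can be sketched as follows. The helper names and the non-negative test data are assumptions chosen to make the mechanism visible; the size of the effect here is much larger than the slight improvement reported for SpinQuant.

```python
import numpy as np

def quant_symmetric(x, bits=4):
    """Symmetric min-max: the grid is centred on zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def quant_asymmetric(x, bits=4):
    """Asymmetric min-max: the grid spans [min(x), max(x)] via a zero point."""
    levels = 2 ** bits - 1
    lo, hi = float(np.min(x)), float(np.max(x))
    scale = (hi - lo) / levels
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal(4096))  # hypothetical skewed, non-negative values

mse_sym = float(np.mean((x - quant_symmetric(x)) ** 2))
mse_asym = float(np.mean((x - quant_asymmetric(x)) ** 2))
print(f"symmetric mse={mse_sym:.4f}, asymmetric mse={mse_asym:.4f}")
```

For one-sided data the symmetric grid wastes the negative half of its levels, whereas the asymmetric grid uses all of them over the observed range.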

5. Empirical Results and Comparative Analysis

SpinQuant achieves state-of-the-art quantization results across major open-source LLMs such as LLaMA-2 7B and LLaMA-3 8B:

Model        Precision   FP     SmoothQuant   LLM-QAT   QuaRot   SpinQuant$_{had}$
LLaMA-2 7B   W4A4KV4     66.9   39.0          44.9      58.6     64.0
LLaMA-3 8B   W4A4KV4     69.6   –             –         63.3     65.5

  • For LLaMA-2 7B with 4-bit weights, activations, and KV-cache (W4A4KV4), SpinQuant reduces the accuracy gap to full precision to just 2.9 points, compared with 27.9 points for SmoothQuant, 22.0 points for LLM-QAT, and 8.3 points for QuaRot.
  • For LLaMA-3 8B, SpinQuant lowers the gap by up to 45.1% relative to the prior best (QuaRot), indicating particular suitability for difficult-to-quantize models (Liu et al., 2024).

Ablation studies indicate:

  • Learned rotations outperform random ones, narrowing the performance variance and boosting absolute task accuracy by up to 2.5 points (even more so when combined with the optional Hadamard rotations $R_3$, $R_4$).
  • Performance is robust to both the initialization of $R$ and calibration set size; saturation occurs within $T \approx 100$ SGD iterations.
  • Layerwise analysis shows SpinQuant prioritizes layers with the poorest initial quantization SNR, as measured by

$$\text{SNR (dB)} = 10 \log_{10}\left( \frac{\|f_{FP}(X)\|^2}{\|f_{FP}(X) - f_Q(X)\|^2} \right)$$

Here $f_{FP}(X)$ and $f_Q(X)$ denote the full-precision and quantized outputs for input $X$. SNR improves from −2.9 dB (no rotation) to +6.8 dB (learned rotation) (Liu et al., 2024).
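The SNR definition translates directly into code. The function name `snr_db` and the example arrays are illustrative assumptions; the first argument plays the role of the full-precision output and the second the quantized output.

```python
import numpy as np

def snr_db(y_fp, y_q):
    """Signal-to-quantization-noise ratio in dB between two outputs."""
    y_fp = np.asarray(y_fp, dtype=np.float64)
    y_q = np.asarray(y_q, dtype=np.float64)
    signal = np.sum(y_fp ** 2)           # ||f_FP(X)||^2
    noise = np.sum((y_fp - y_q) ** 2)    # ||f_FP(X) - f_Q(X)||^2
    return float(10.0 * np.log10(signal / noise))

y_fp = np.ones(100)
print(snr_db(y_fp, 0.9 * y_fp))      # error is 10% of the signal amplitude
print(snr_db(y_fp, np.zeros(100)))   # error energy equals signal energy
print(snr_db(y_fp, -y_fp))           # error energy exceeds signal energy
```

A negative value, such as the −2.9 dB reported without rotation, means the quantization error carries more energy than the signal itself.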

6. Analysis of Mechanisms and Theoretical Justification

SpinQuant’s efficacy derives from the rotation-invariance of deep transformer architectures, enabling insertion of learnable orthogonal transformations that do not change the full-precision output but significantly regularize the input distributions for quantization. By dispersing outlier values and reducing kurtosis, the method achieves quantized model behavior that better retains functional fidelity to the original network. The use of Cayley-SGD ensures the optimization of rotation matrices over the Stiefel manifold is both mathematically sound and efficient at the scale demanded by modern LLMs. This framework is agnostic to the underlying weight quantization scheme (e.g., it works with GPTQ and RTN) and integrates efficiently as an extension atop standard PTQ pipelines.

7. Context, Limitations, and Future Directions

SpinQuant is distinct in its use of learnable rotations optimized specifically for quantization, outperforming prior approaches based on unlearned rotations (QuaRot), activation-to-weight scaling (SmoothQuant), and quantization-aware training (LLM-QAT) (Liu et al., 2024). The method introduces negligible inference overhead, as rotations can typically be merged into model weights or optionally implemented with fast Hadamard transforms. A practical implication is that it enables 4-bit W/A/KV quantization on architectures and tasks previously considered impractical for such low-precision deployment.

A plausible implication is that similar rotation-learning strategies may generalize to other neural quantization contexts beyond LLMs. The approach does not fundamentally alter network expressivity or require retraining, and calibration can be performed on small datasets, suggesting wide applicability. Limitations include dependence on the representativeness of the calibration samples and the computational load of rotation matrix optimization for extremely large models, although empirical results indicate rapid convergence.

SpinQuant represents a significant advancement in practical transformer model quantization, especially for server- and edge-grade LLMs where efficiency and accuracy must be tightly balanced (Liu et al., 2024).

References (1)
