QuaRot: Orthogonal Rotation in LLM Quantization
- QuaRot is a quantization scheme that applies fixed Hadamard rotations to disperse activation outliers, enabling efficient low-bit (notably 4-bit) inference in LLMs.
- It preprocesses weights, activations, and the KV cache uniformly, ensuring that 4-bit quantization retains over 99% of full-precision accuracy.
- Empirical results show minimal perplexity increase (<0.5 points) on benchmarks, highlighting QuaRot's practicality for edge and large-scale deployments.
QuaRot refers to a class of quantization schemes for LLMs that use orthogonal rotations—primarily discrete Hadamard transforms—to preprocess weights, activations, and the key-value (KV) cache before quantization. The central aim is to suppress activation outliers, enabling uniform low-bit quantization (notably 4-bit) across the entire model without sacrificing computational accuracy or requiring mixed-precision retention. QuaRot methods are deployed to maximize memory efficiency and inference speed, making end-to-end low-bit quantization feasible for even very large transformer models.
1. Principles and Methodology
The core mechanism of QuaRot is computational invariance through orthogonal rotations, typically implemented via Hadamard matrices. Given a weight matrix $W$ and activation vector $x$, an orthogonal rotation $Q$ satisfies

$$Q Q^\top = Q^\top Q = I, \qquad (xQ)(Q^\top W) = xW.$$

Applying $Q$ to weights and/or activations before quantization rotates the data so that high-magnitude outlier channels are dispersed, minimizing their individual quantized ranges. This process, sometimes called “incoherence processing,” is norm-preserving and output-invariant in architectures with normalization layers such as RMSNorm, which commute with orthogonal rotations (once any learnable scale is fused into adjacent weights):

$$\mathrm{RMSNorm}(xQ) = \mathrm{RMSNorm}(x)\,Q.$$

Rotations can be “fused” into the network weights (when possible) or applied as a standalone transformation. In transformers, this fusing can be performed efficiently, since Hadamard matrices admit fast multiplication, and is typically applied at the boundaries of blocks, attention heads, and cache storage so that all intermediate matrix multiplications are performed in low-precision INT4 arithmetic. Head-wise rotations are efficiently supported through block-Hadamard or Kronecker-structured matrices.
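As a concrete illustration, the following NumPy sketch checks the invariance identity above on synthetic data and shows how the rotation spreads an injected outlier channel; the `hadamard` helper and the synthetic data are illustrative assumptions, not part of the QuaRot codebase.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

d = 256                              # hidden dimension (power of two)
Q = hadamard(d) / np.sqrt(d)         # orthonormal: Q @ Q.T == I

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # dense layer weights
x = rng.normal(size=(1, d))          # one activation row
x[:, 7] += 50.0                      # inject an outlier channel

# Computational invariance: rotating activations and counter-rotating
# weights leaves the layer output unchanged up to round-off.
assert np.allclose(x @ W, (x @ Q) @ (Q.T @ W))

# Outlier dispersion: the rotated activation has a much smaller peak value,
# so a uniform quantizer needs far less dynamic range.
print("max |x|  :", np.abs(x).max())        # ~50
print("max |x Q|:", np.abs(x @ Q).max())    # roughly an order of magnitude smaller
```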
2. Outlier Suppression and Computational Invariance
In LLMs, statistically rare but extremely large weights or activations, known as outliers, can dominate the distribution, forcing uniform quantizers to allocate more dynamic range and thus significantly increasing quantization error. QuaRot disperses these outliers through rotation, bringing the empirical channel distribution closer to a zero-mean, Gaussian-like (kurtosis ≈ 3) profile. This prevents any individual channel from dominating the dynamic range, permitting a substantially smaller quantization step size and improving accuracy.
Hadamard rotations are attractive because:
- They are strictly orthogonal, fast to implement, and inherently low-overhead.
- For $n$ channels, a normalized Hadamard matrix has entries of uniform magnitude $1/\sqrt{n}$, achieving the optimal worst-case coherence $\mu = 1/\sqrt{n}$.
- When composed or randomized, they confound the alignment between weight and data axes, further promoting uniform error distribution.
Crucially, because the rotations are computationally invariant, the network’s logical function is exactly preserved (modulo numerical round-off), so long as the full computation path is rotated and restored correspondingly.
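To make the coherence and kurtosis claims concrete, the hedged NumPy/SciPy sketch below builds a randomized Hadamard rotation (a Hadamard matrix composed with a random sign diagonal, as commonly used in QuaRot-style pipelines) and measures how it flattens heavy-tailed synthetic activations; the data and helper names are illustrative only.

```python
import numpy as np
from scipy.stats import kurtosis

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

d, tokens = 256, 4096
rng = np.random.default_rng(0)

H = hadamard(d) / np.sqrt(d)                     # entries are exactly +/- 1/sqrt(d)
Q = H @ np.diag(rng.choice([-1.0, 1.0], d))      # randomized Hadamard, still orthogonal

# Heavy-tailed synthetic activations with a few dominant outlier channels.
X = rng.standard_t(df=3, size=(tokens, d))
X[:, [3, 17, 42]] *= 30.0

print("max |Q_ij|           :", np.abs(Q).max(), f"(= 1/sqrt({d}))")
print("excess kurtosis pre  :", kurtosis(X, axis=None))       # large, heavy tails
print("excess kurtosis post :", kurtosis(X @ Q, axis=None))   # near 0, i.e. kurtosis ~ 3
print("dynamic range pre/post:", np.abs(X).max(), np.abs(X @ Q).max())
```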
3. Empirical Performance and Qualitative Impact
Extensive evaluations demonstrate that QuaRot enables 4-bit quantization (W4A4K4) of all weights, activations, and KV cache in models up to LLaMA2-70B with minimal accuracy loss:
- On WikiText-2, perplexity increases are limited to under 0.5 points.
- Zero-shot task performance is preserved at over 99% of the full-precision baseline for multiple language understanding tasks (PIQA, WinoGrande, HellaSwag, LAMBADA, ARC).
- At 6- or 8-bit precision, quantization with simple round-to-nearest (RTN) is effectively lossless, eliminating the quantization-induced accuracy drop entirely.
- All channels are quantized at the same low precision, obviating the need to specially preserve high-magnitude outlier channels at higher bit-widths.
These performance characteristics are particularly important for efficient deployment on edge devices and consumer hardware. QuaRot’s open-source implementation enables reproduction and further optimization in industrial or academic settings.
4. Comparison to Contemporary Methods
QuaRot’s approach contrasts with previous and subsequent quantization schemes in several respects:
| Method | Rotation Type | Adaptivity | Outlier Handling | Performance (W4A4K4) |
|---|---|---|---|---|
| QuaRot | Fixed Hadamard | No | Incoherence by rotation | 99% of baseline accuracy |
| SpinQuant | Learned orthogonal | Yes | Optimized for each layer | Reduces the FP gap by up to 45% vs. QuaRot |
| BASE-Q | Hadamard + PCA + bias/subscale | Block-wise | Explicit bias correction and asymmetric scaling | 50% tighter to FP than QuaRot |
| BiSup | None (post-rotational) | N/A | Vertical and horizontal error suppression, low-rank compensation | Up to 2x lower perplexity in challenging setups |
| ButterflyQuant | Learnable butterfly | Layer-specific | Outlier suppression with O(n log n) complexity | 30% lower perplexity (2-bit) vs. QuaRot |
QuaRot is primarily limited by the fixed, layer-agnostic nature of the Hadamard rotation, which fails to address layer-specific outlier structures, as evidenced by the roughly 30% perplexity reduction obtained when it is replaced by learnable butterfly rotations in ButterflyQuant (Xu et al., 11 Sep 2025). Additionally, methods like BASE-Q further reduce rounding and clipping errors by addressing residual mean misalignment and energy loss in the tails through bias correction and asymmetric scaling (He et al., 26 May 2025).
5. Implementation Details and Engineering Considerations
QuaRot can be deployed as:
- Preprocessing: Fusing the Hadamard rotation into model weights prior to quantization.
- Online: Applying the rotations at inference time when structure prevents full fusion (e.g., non-linear activations, residual streams).
Key steps in code include:
- For a weight matrix $W$ that consumes rotated inputs, apply $W \leftarrow Q^\top W$; the matching rotation of the activations is either performed online or fused into the preceding layer's output weights as $W_{\text{prev}} \leftarrow W_{\text{prev}} Q$.
- In the attention mechanism, rotate the value and output projection matrices per head with a head-wise Hadamard matrix $Q_h$: $W_v \leftarrow W_v Q_h$, $W_o \leftarrow Q_h^\top W_o$.
- All quantization is then applied uniformly via round-to-nearest, $\hat{W} = s \cdot \mathrm{clamp}\!\big(\mathrm{round}(W/s),\, -2^{b-1},\, 2^{b-1}-1\big)$, with the scale $s$ determined by the min/max (or per-group) statistics; a minimal sketch of these steps is given below.
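The following NumPy sketch of the fuse-then-quantize recipe is illustrative only: `quantize_rtn` and the synthetic data are assumptions made here for clarity, and the actual QuaRot implementation executes the resulting INT4 matrix multiplications with dedicated GPU kernels rather than dequantizing as done below.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def quantize_rtn(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-output-channel round-to-nearest quantization (returns dequantized weights)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=0, keepdims=True) / qmax     # per-column scale from min/max stats
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale

d = 256
rng = np.random.default_rng(0)
Q = hadamard(d) / np.sqrt(d)

W = rng.normal(size=(d, d))
W[5, :] *= 40.0                       # one outlier input channel inflates every column's range
x = rng.normal(size=(64, d))
y_ref = x @ W                         # full-precision reference

# Plain 4-bit RTN on the raw weights.
y_plain = x @ quantize_rtn(W)

# QuaRot-style: fuse the rotation into the weights (W <- Q^T W), quantize,
# and rotate the activations (online, or fused into the previous layer).
y_rot = (x @ Q) @ quantize_rtn(Q.T @ W)

rel = lambda y: np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)
print("relative error, plain RTN  :", rel(y_plain))
print("relative error, rotated RTN:", rel(y_rot))
```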
Open-source code is provided at https://github.com/spcl/QuaRot for use and customization.
Compute overhead is minimal because Hadamard operations involve only ±1 entries, admit an $O(n \log n)$ fast transform, and batch well on modern hardware. Merging rotations into weights eliminates most runtime cost except where online rotation is necessary.
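Where fusion is blocked (e.g., after non-linear activations or for the KV cache, per the online mode above), the rotation can be applied at runtime with a fast Walsh-Hadamard transform. The sketch below is a plain NumPy reference implementation, assuming a power-of-two dimension; it is not the optimized kernel shipped with the repository.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    Uses O(n log n) additions/subtractions per vector, equivalent to
    multiplying by H_n / sqrt(n), with H_n the Sylvester Hadamard matrix."""
    x = x.copy()
    n = x.shape[-1]
    assert n & (n - 1) == 0, "dimension must be a power of two"
    h = 1
    while h < n:
        # View the last axis as blocks of size 2h and butterfly the two halves.
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :].copy(), y[..., 1, :].copy()
        y[..., 0, :] = a + b
        y[..., 1, :] = a - b
        h *= 2
    return x / np.sqrt(n)

# Sanity check against an explicit (symmetric) Sylvester Hadamard matrix.
d = 8
H = np.ones((1, 1))
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
v = np.random.default_rng(0).normal(size=(3, d))
assert np.allclose(fwht(v), v @ (H / np.sqrt(d)))
```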
6. Limitations, Extensions, and Future Directions
Limitations:
- Fixed Hadamard transforms do not adapt to layer-wise statistical differences, which can limit outlier suppression, especially in architectures with significant heterogeneity (e.g., Mamba models (Xu et al., 23 Jan 2025)).
- Residual channel mean misalignment and nontrivial clipping error remain unsolved within QuaRot, but are mitigated in BASE-Q and other blockwise or learnable rotation adaptations.
- Very aggressive quantization (≤3 bits) remains susceptible to accuracy loss even with QuaRot; layer-adaptive or learnable transformations may be required in this regime.
Extensions:
- SpinQuant (Liu et al., 26 May 2024) and ButterflyQuant (Xu et al., 11 Sep 2025) introduce learned, layer-adaptive orthogonal transforms that optimize rotation matrices for quantization loss, yielding significant accuracy gains over QuaRot.
- BASE-Q adds bias correction and asymmetric scaling to further suppress rounding and clipping noise while maintaining practical applicability through blockwise optimization (He et al., 26 May 2025).
- MambaQuant extends the paradigm to state-space sequence models, replacing Hadamard transforms with KLT-enhanced rotations and smoothing operations (Xu et al., 23 Jan 2025).
- Model expansion, i.e., enlarging the model after pre-training (Franco et al., 21 Mar 2025), combined with Hadamard incoherence further increases the quantization nullspace, providing an additional degree of freedom to absorb quantization error without significant retraining.
7. Impact and Significance in LLM Quantization
QuaRot has defined a robust and efficient paradigm for end-to-end full-precision to low-bit quantization in transformer LLMs, greatly increasing the practical deployability and memory efficiency of multi-billion parameter models via computationally invariant outlier suppression. It has provided the foundation for the subsequent evolution of quantization techniques, including learnable rotations, blockwise adaptations, and model expansion. Empirical evidence across multiple domains demonstrates near-lossless accuracy in most 4-bit quantization settings for contemporary LLMs, and the methodological framework underpins state-of-the-art solutions in the field. Open-source availability ensures continued progress and adoption throughout the research and engineering communities.