QuaRot: Orthogonal Rotation in LLM Quantization

Updated 12 September 2025
  • QuaRot is a quantization scheme that applies fixed Hadamard rotations to disperse activation outliers, achieving efficient low-bit precision in LLMs.
  • It preprocesses weights, activations, and the KV cache uniformly, ensuring that 4-bit quantization retains over 99% of full-precision accuracy.
  • Empirical results show minimal perplexity increase (<0.5 points) on benchmarks, highlighting QuaRot's practicality for edge and large-scale deployments.

QuaRot refers to a class of quantization schemes for LLMs that use orthogonal rotations—primarily discrete Hadamard transforms—to preprocess weights, activations, and the key-value (KV) cache before quantization. The central aim is to suppress activation outliers, enabling uniform low-bit quantization (notably 4-bit) across the entire model without sacrificing computational accuracy or requiring mixed-precision retention. QuaRot methods are deployed to maximize memory efficiency and inference speed, making end-to-end low-bit quantization feasible for even very large transformer models.

1. Principles and Methodology

The core mechanism of QuaRot is computational invariance under orthogonal rotations, typically implemented with Hadamard matrices. Given a weight matrix $W$ and activation vector $x$, any orthogonal rotation $Q$ satisfies

$$Wx = (WQ^\top)(Qx).$$

Applying $Q$ to weights and/or activations before quantization rotates the data so that high-magnitude outlier channels are dispersed, minimizing their individual quantized ranges. This process, sometimes called “incoherence processing,” is norm-preserving, and in architectures with RMSNorm the rotation commutes with the normalization (with the learned scale absorbed into the adjacent weights):

$$\operatorname{RMSNorm}(Qx) = Q\,\operatorname{RMSNorm}(x),$$

so the network output is unchanged. Rotations can be fused into the network weights where possible, or applied as standalone online transformations. In transformers, fusing is efficient because Hadamard matrices admit fast multiplication, and rotations are typically applied at the boundaries of blocks, attention heads, and KV-cache storage so that all intermediate matrix multiplications run in low-precision INT4 arithmetic. Head-wise rotations are supported efficiently through block-Hadamard or Kronecker-structured matrices.
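
As a minimal numerical sketch (illustrative NumPy only, not the QuaRot kernels; `random_hadamard` is a made-up helper), the invariance $Wx = (WQ^\top)(Qx)$ can be checked directly with a sign-randomized Hadamard rotation:

```python
import numpy as np
from scipy.linalg import hadamard  # Sylvester Hadamard matrices (n must be a power of two)

def random_hadamard(n, seed=0):
    """Randomized orthogonal rotation Q = H_n @ diag(signs) / sqrt(n)."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    return (hadamard(n) * signs) / np.sqrt(n)

n = 128
rng = np.random.default_rng(1)
W = rng.standard_normal((n, n))   # a weight matrix
x = rng.standard_normal(n)        # an activation vector

Q = random_hadamard(n)
# Computational invariance: rotating weights and activations leaves the output unchanged.
y_ref = W @ x
y_rot = (W @ Q.T) @ (Q @ x)
print(np.allclose(y_ref, y_rot))  # True, up to floating-point round-off
```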

2. Outlier Suppression and Computational Invariance

In LLMs, statistically rare but extremely large weights or activations—known as outliers—can dominate the distribution, forcing uniform quantizers to allocate more dynamic range and thus significantly increase quantization error. QuaRot disperses these outliers through rotation, aligning the empirical channel distribution closer to a zero-mean, Gaussian-like (kurtosis ≈ 3) profile. This suppresses any individual channel from dominating, facilitating a substantially lower quantization step size and improving accuracy.
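
To make the dispersion concrete (a hedged, synthetic example: the outlier channel index and scale are arbitrary), rotating activations that contain a single dominant channel flattens the per-tensor dynamic range and pushes the distribution toward a Gaussian-like profile:

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import kurtosis

n = 256
rng = np.random.default_rng(0)
X = rng.standard_normal((4096, n))
X[:, 7] *= 50.0                      # inject a synthetic outlier channel, mimicking LLM activations

Q = hadamard(n) / np.sqrt(n)         # orthogonal Hadamard rotation
X_rot = X @ Q.T                      # rotate each activation vector: x -> Qx

print("max |x| before:", np.abs(X).max())      # dominated by the outlier channel
print("max |x| after: ", np.abs(X_rot).max())  # outlier energy spread over all channels
print("kurtosis before:", kurtosis(X.ravel(), fisher=False))      # heavy-tailed
print("kurtosis after: ", kurtosis(X_rot.ravel(), fisher=False))  # ~3, Gaussian-like
```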

Hadamard rotations are attractive because:

  • They are strictly orthogonal, fast to implement, and inherently low-overhead.
  • For $n$ channels, they achieve the optimal worst-case coherence $\mu = 1/\sqrt{n}$ (a short derivation follows this list).
  • When composed or randomized, they confound the alignment between weight and data axes, further promoting uniform error distribution.
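
For reference, the $1/\sqrt{n}$ coherence follows from the standard Sylvester construction (textbook material, not specific to the QuaRot paper):

```latex
% Sylvester construction and the 1/sqrt(n) coherence of a normalized Hadamard rotation
\[
  H_1 = (1), \qquad
  H_{2n} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}, \qquad
  Q_n = \tfrac{1}{\sqrt{n}}\, H_n .
\]
% Every entry of Q_n has magnitude 1/sqrt(n), so
\[
  \mu(Q_n) = \max_{i,j}\, \bigl|\langle e_i,\, (Q_n)_{:,j} \rangle\bigr| = \tfrac{1}{\sqrt{n}},
\]
% which attains the lower bound mu >= 1/sqrt(n) that holds for any orthogonal matrix,
% since each unit-norm column must spread its mass across n coordinates.
```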

Crucially, because the rotations are computationally invariant, the network’s logical function is exactly preserved (modulo numerical round-off), so long as the full computation path is rotated and restored correspondingly.

3. Empirical Performance and Qualitative Impact

Extensive evaluations demonstrate that QuaRot enables 4-bit quantization (W4A4K4) of all weights, activations, and KV cache in models up to LLaMA2-70B with minimal accuracy loss:

  • On WikiText-2, perplexity increases are limited to $\leq 0.47$ points.
  • Zero-shot task performance is preserved at $>99\%$ of the full-precision baseline for multiple language understanding tasks (PIQA, WinoGrande, HellaSwag, LAMBADA, ARC).
  • At 6- or 8-bit precision, lossless results are achieved with simple round-to-nearest (RTN) alone, eliminating the quantization-induced accuracy drop entirely.
  • All channels are quantized at the same low precision, obviating the need to specially preserve high-magnitude outlier channels at higher bit-widths.

These performance characteristics are particularly important for efficient deployment on edge devices and consumer hardware. QuaRot’s open-source implementation enables reproduction and further optimization in industrial or academic settings.

4. Comparison to Contemporary Methods

QuaRot’s approach contrasts with previous and subsequent quantization schemes in several respects:

| Method | Rotation Type | Adaptivity | Outlier Handling | Performance (W4A4K4) |
|---|---|---|---|---|
| QuaRot | Fixed Hadamard | No | Incoherence by rotation | 99% of baseline accuracy |
| SpinQuant | Learned orthogonal | Yes | Optimized for each layer | Reduces FP gap by up to 45% vs. QuaRot |
| BASE-Q | Hadamard + PCA + bias/subscale | Block-wise | Explicit bias correction and asymmetric scaling | 50% tighter to FP than QuaRot |
| BiSup | None (post-rotational) | N/A | Vertical + horizontal error suppression, low-rank compensation | 2x lower perplexity in challenging setups |
| ButterflyQuant | Learnable butterfly | Layer-specific | Outlier suppression with O(n log n) complexity | 30% lower perplexity (2-bit) vs. QuaRot |

QuaRot is primarily limited by the fixed, layer-agnostic nature of the Hadamard rotation, which fails to address layer-specific outlier structures, as evidenced by the roughly 30% perplexity reduction obtained when the fixed rotation is replaced by learnable butterfly rotations in ButterflyQuant (Xu et al., 11 Sep 2025). Additionally, methods such as BASE-Q further reduce rounding and clipping errors by addressing residual mean misalignment and energy loss in the tails through bias correction and asymmetric scaling (He et al., 26 May 2025).

5. Implementation Details and Engineering Considerations

QuaRot can be deployed as:

  • Preprocessing: Fusing the Hadamard rotation into model weights prior to quantization.
  • Online: Applying the rotations at inference time when structure prevents full fusion (e.g., non-linear activation, residual streams).

Key steps in code include:

  • For a weight matrix $W$, fuse the rotation and the normalization scale into the weights: $W_{\mathrm{mod}} = Q^{\top} \cdot \operatorname{diag}(\alpha) \cdot W$.
  • In the attention mechanism, rotate the value and output projections per head: $W_v^{(h)} \leftarrow W_v^{(h)} \cdot (\otimes H_{d_h})$ and $W_{\mathrm{out}}^{(h)} \leftarrow (\otimes H_{d_h}) \cdot W_{\mathrm{out}}^{(h)}$, where $\otimes H_{d_h}$ denotes the head-wise (Kronecker/block-Hadamard) rotation.
  • Quantization is then applied uniformly as $Q(x) = \operatorname{round}(x/s) \times s$, with the scale $s$ determined by min/max or per-group statistics. (A minimal sketch of these steps appears below.)
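
The following is a hedged sketch of these steps in NumPy (illustrative only; `fuse_rotation` and `quantize_rtn` are hypothetical helpers, and a per-tensor max-abs scale stands in for the min/max or per-group statistics used in practice):

```python
import numpy as np
from scipy.linalg import hadamard

def fuse_rotation(W, alpha, Q):
    """Fold the normalization scale alpha and rotation Q into the weights:
    W_mod = Q^T @ diag(alpha) @ W (conventions vary with tensor layout)."""
    return Q.T @ np.diag(alpha) @ W

def quantize_rtn(x, bits=4):
    """Symmetric round-to-nearest quantization with a per-tensor max-abs scale."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    return np.round(x / s) * s

n = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((n, n))
alpha = np.ones(n)                 # RMSNorm scale, set to 1 here for illustration
Q = hadamard(n) / np.sqrt(n)       # fixed Hadamard rotation

W_mod = fuse_rotation(W, alpha, Q)
W_q = quantize_rtn(W_mod, bits=4)
print("quantization MSE:", float(np.mean((W_mod - W_q) ** 2)))
```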

Open-source code is provided at https://github.com/spcl/QuaRot for use and customization.

Compute overhead is minimal due to the binary ($\pm 1$) entries of Hadamard matrices and a batch-friendly execution schedule. Merging rotations into the weights eliminates most runtime cost except where online rotation is necessary.
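
The low cost of the remaining online rotations comes from the butterfly structure of the Hadamard transform, which needs only additions and subtractions. A minimal reference implementation (illustrative NumPy, not the fused kernels shipped with the released code):

```python
import numpy as np
from scipy.linalg import hadamard

def fwht(x):
    """Iterative fast Walsh-Hadamard transform: O(n log n) additions/subtractions.
    Dividing by sqrt(n) makes the transform orthogonal; n must be a power of two."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

x = np.random.default_rng(0).standard_normal(8)
assert np.allclose(fwht(x), hadamard(8) @ x / np.sqrt(8))  # matches the dense Hadamard product
```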

6. Limitations, Extensions, and Future Directions

Limitations:

  • Fixed Hadamard transforms do not adapt to layer-wise statistical differences, which can limit outlier suppression, especially in architectures with significant heterogeneity (e.g., Mamba models (Xu et al., 23 Jan 2025)).
  • Residual channel mean misalignment and nontrivial clipping error remain unsolved within QuaRot, but are mitigated in BASE-Q and other blockwise or learnable rotation adaptations.
  • Highly aggressive quantization (≤3 bits) remains susceptible to accuracy loss even with QuaRot; layer-adaptive or learnable transformations may be required in this regime.

Extensions:

  • SpinQuant (Liu et al., 26 May 2024) and ButterflyQuant (Xu et al., 11 Sep 2025) introduce learned, layer-adaptive orthogonal transforms that optimize rotation matrices for quantization loss, yielding significant accuracy gains over QuaRot.
  • BASE-Q adds bias correction and asymmetric scaling to further suppress rounding and clipping noise while maintaining practical applicability through blockwise optimization (He et al., 26 May 2025).
  • MambaQuant extends the paradigm to state-space sequence models, replacing Hadamard transforms with KLT-enhanced rotations and smoothing operations (Xu et al., 23 Jan 2025).
  • Expanding model size after pre-training, i.e., model expansion (Franco et al., 21 Mar 2025), combined with Hadamard incoherence further enlarges the quantization nullspace, providing an additional degree of freedom to absorb quantization error without significant retraining.

7. Impact and Significance in LLM Quantization

QuaRot has defined a robust and efficient paradigm for end-to-end low-bit quantization of full-precision transformer LLMs, greatly increasing the practical deployability and memory efficiency of multi-billion-parameter models via computationally invariant outlier suppression. It has provided the foundation for the subsequent evolution of quantization techniques, including learnable rotations, blockwise adaptations, and model expansion. Empirical evidence across multiple domains demonstrates near-lossless accuracy in most 4-bit quantization settings for contemporary LLMs, and the methodological framework underpins state-of-the-art solutions in the field. Open-source availability ensures continued progress and adoption throughout the research and engineering communities.