ButterflyQuant: Adaptive 2-bit Quantization

Updated 12 September 2025
  • ButterflyQuant is a quantization method that introduces learnable, structured orthogonal butterfly transforms to suppress outliers in LLM activation statistics for 2-bit quantization.
  • It employs a sparse, hierarchical product of differentiable Givens rotations that enables rapid convergence with minimal calibration data and maintains O(n log n) computational efficiency.
  • Empirical results on LLaMA-2-7B show state-of-the-art 2-bit perplexity (15.4 vs. 22.1 for fixed-rotation baselines) while calibration completes in minutes, supporting practical deployment.

ButterflyQuant is a quantization method designed for ultra-low-bit deployment of LLMs, such as LLaMA-2-7B, that leverages learnable orthogonal butterfly transforms to suppress outliers prior to quantization. It contrasts with previous approaches that employ fixed orthogonal rotations, achieving both superior outlier mitigation and substantial computational efficiency. The method harnesses a structured, parameter-efficient implementation, yielding state-of-the-art perplexity for 2-bit quantization with minimal calibration data and rapid convergence.

1. Motivation and Context

LLMs require immense memory, and for deployment on consumer hardware, model weights and activations must often be quantized to extremely low precision such as 2 bits. This process is impeded by catastrophic accuracy drops due to outlier values in layerwise activations, which are not effectively addressed by naive or fixed-rotation approaches. Rotation-based quantization methods such as QuIP and QuaRot attempt to balance activation distributions via pre-quantization orthogonal transformations—traditionally fixed Hadamard rotations—but these cannot adapt to the highly variable outlier structure encountered in different transformer layers.

ButterflyQuant replaces these fixed, discrete transformations with learnable, layer-specific, structured orthogonal matrices constructed from butterfly patterns of Givens rotations, thereby directly addressing distributional heterogeneity and facilitating accurate low-bit quantization.

2. Orthogonal Butterfly Transforms: Structure and Learnability

ButterflyQuant parameterizes its orthogonal transforms using a sparse, hierarchical product of 2×2 Givens rotations organized in a butterfly factorization:

$$Q = \prod_{i=1}^{\log_2 n} B_i$$

where each factor $B_i$ mixes $n/2$ pairs of vector elements in parallel. Each 2×2 block is a Givens rotation:

$$G(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

The $\theta$ parameters are continuous and directly trainable via gradient descent. In contrast to Hadamard matrices with entries in $\{+1, -1\}$, these rotations are differentiable, so every transform $Q$ can be smoothly optimized for the particular outlier structure of a given transformer layer. Despite this flexibility, the total number of learnable parameters is just $(n \log_2 n)/2$, and applying the transform costs $O(n \log n)$, orders of magnitude less than dense orthogonalization schemes.
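The factorization can be sketched compactly in PyTorch. The class below is an illustrative reconstruction rather than the authors' code: the stage ordering, pairing scheme, and zero-angle initialization are assumptions, but it reproduces the $(n \log_2 n)/2$ learnable angles and the $O(n \log n)$ application cost described above.

```python
import torch
import torch.nn as nn

class ButterflyRotation(nn.Module):
    """Orthogonal transform Q = B_{log2 n} ... B_1 built from learnable 2x2
    Givens rotations arranged in an FFT-style butterfly pattern.
    Parameters: (n/2) * log2(n) angles; applying Q costs O(n log n)."""

    def __init__(self, n: int):
        super().__init__()
        assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
        self.n = n
        self.num_stages = n.bit_length() - 1          # log2(n)
        # One angle per pair per stage; zero init makes Q start as the identity.
        self.theta = nn.Parameter(torch.zeros(self.num_stages, n // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply Q along the last dimension of x, one butterfly stage at a time."""
        shape = x.shape
        for s in range(self.num_stages):
            stride = 1 << s                           # distance between paired indices
            x = x.reshape(*shape[:-1], self.n // (2 * stride), 2, stride)
            a, b = x[..., 0, :], x[..., 1, :]
            cos = torch.cos(self.theta[s]).reshape(self.n // (2 * stride), stride)
            sin = torch.sin(self.theta[s]).reshape(self.n // (2 * stride), stride)
            # 2x2 Givens rotation applied to every (a, b) pair in parallel.
            x = torch.stack((cos * a - sin * b, sin * a + cos * b), dim=-2)
            x = x.reshape(shape)
        return x
```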

Orthogonality is enforced by construction, guaranteeing that $Q^T Q = I$ and thus preserving both activation norms and computational invariance:

$$\mathbf{y} = \mathbf{W}\mathbf{x} = (\mathbf{W}Q^T)(Q\mathbf{x})$$

This property ensures that the transformed, quantized computations are functionally equivalent to those of the untransformed model, modulo quantization error.
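Continuing that sketch (and assuming the ButterflyRotation class defined above), a short numerical check illustrates both properties: the constructed transform is orthogonal, and rotating weights and activations together leaves the layer output unchanged up to floating-point error. The toy shapes are arbitrary.

```python
import torch

torch.manual_seed(0)
n = 8
rot = ButterflyRotation(n)                    # class from the sketch above
with torch.no_grad():
    rot.theta.uniform_(-0.5, 0.5)             # arbitrary non-trivial angles

W = torch.randn(16, n)                        # toy weight matrix (out_features, n)
x = torch.randn(4, n)                         # toy activation batch

y_ref = x @ W.T                               # original computation  y = Wx
y_rot = rot(x) @ rot(W).T                     # rotated computation  (WQ^T)(Qx)
print(torch.allclose(y_ref, y_rot, atol=1e-5))            # True

Q = rot(torch.eye(n))                         # rows are rotated basis vectors
print(torch.allclose(Q @ Q.T, torch.eye(n), atol=1e-5))   # True: orthogonal
```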

3. Quantization Workflow and Uniformity Regularization

The ButterflyQuant quantization pipeline, applied per layer, consists of the following steps:

  1. Apply Learnable Rotation: For each layer, compute the rotated weights and activations:

$$\mathbf{W}' = \mathbf{W}Q^T, \quad \mathbf{x}' = Q\mathbf{x}$$

  2. Quantize: Apply standard uniform low-bit quantization to $\mathbf{W}'$ and $\mathbf{x}'$, as sketched below.
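A minimal sketch of these two steps, assuming the ButterflyRotation class from Section 2 and a plain per-tensor min-max quantizer; production pipelines would use per-group scales and the paper's exact quantizer, which are not reproduced here.

```python
import torch

def uniform_quantize(t: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Min-max uniform quantization to 2**bits levels, returned dequantized.
    Per-tensor scaling is used here purely for illustration."""
    levels = 2 ** bits - 1
    t_min, t_max = t.min(), t.max()
    scale = (t_max - t_min).clamp(min=1e-8) / levels
    q = torch.round((t - t_min) / scale).clamp(0, levels)
    return q * scale + t_min

torch.manual_seed(0)
n, out_features = 8, 16
rot = ButterflyRotation(n)                    # per-layer rotation (Section 2 sketch)
with torch.no_grad():
    rot.theta.uniform_(-0.5, 0.5)             # stand-in for calibrated angles

W = torch.randn(out_features, n)
x = torch.randn(4, n)

W_rot, x_rot = rot(W), rot(x)                 # step 1: W' = WQ^T, x' = Qx
W_q, x_q = uniform_quantize(W_rot), uniform_quantize(x_rot)   # step 2: low-bit quantization
y_approx = x_q @ W_q.T                        # approximates Wx up to quantization error
```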

A central innovation is layer-adaptive learning of the butterfly parameters using a small calibration set (128 samples). The parameter update objective combines a reconstruction loss with a uniformity regularization term on quantized activations:

$$\mathcal{L} = \mathcal{L}_\text{recon} + \lambda_\text{uniform}\, D_{KL}\!\left(P_\text{bins}(Qx)\,\|\,\mathcal{U}\right)$$

where $D_{KL}$ denotes the KL divergence and $\mathcal{U}$ is the uniform distribution over quantization bins.

This regularization pushes the transformed post-rotation distribution to more evenly exploit available quantization levels, directly reducing quantization error and the ill effects of outlier activations.
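One differentiable way to realize this regularizer is a soft histogram over the quantization bins, as sketched below; the Gaussian-style soft binning, sharpness constant, and bin placement are illustrative assumptions rather than the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def uniformity_loss(x_rot: torch.Tensor, bits: int = 2, sharpness: float = 4.0) -> torch.Tensor:
    """D_KL(P_bins(Qx) || U): soft histogram of rotated activations over the
    2**bits quantization bins versus the uniform distribution. Soft binning
    keeps the term differentiable w.r.t. the rotation angles."""
    n_bins = 2 ** bits
    lo, hi = x_rot.min().item(), x_rot.max().item()
    width = (hi - lo) / n_bins + 1e-8
    centers = torch.linspace(lo + width / 2, hi - width / 2, n_bins, device=x_rot.device)
    # Soft assignment of each value to the nearest bin centers.
    logits = -sharpness * ((x_rot.reshape(-1, 1) - centers) / width) ** 2
    p_bins = F.softmax(logits, dim=-1).mean(dim=0)        # empirical bin distribution
    u = 1.0 / n_bins
    return torch.sum(p_bins * torch.log(p_bins / u + 1e-8))

# Evenly spread values incur a smaller penalty than a sharply peaked distribution.
print(uniformity_loss(torch.linspace(-1.0, 1.0, 1024)))
print(uniformity_loss(torch.randn(1024) * 0.05))

# Calibration then minimizes  L_recon + lambda_uniform * uniformity_loss(rot(x))
# over the butterfly angles, with L_recon e.g. an MSE between quantized and
# full-precision layer outputs (lambda_uniform is a hyperparameter).
```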

4. Comparative Performance and Calibration Efficiency

On the LLaMA-2-7B model, 2-bit quantization with ButterflyQuant yields a perplexity score of 15.4, a substantial improvement relative to QuaRot’s 22.1 under the same quantization regime. Calibration for these results requires only 128 samples (e.g., from WikiText-2) and completes within minutes on a single GPU. Approximately 86% of the performance gain manifests within the first 200 calibration iterations, establishing practicality for production deployment.

These results validate the hypothesis that transformer layers require rotation matrices tailored to their unique activation statistics; a one-size-fits-all rotation (e.g., fixed Hadamard) is insufficient for extreme quantization.

| Method | Rotation Type | Perplexity @ 2-bit (LLaMA-2-7B) | Calibration Effort |
|---|---|---|---|
| QuaRot | Fixed Hadamard | 22.1 | None (preset) |
| ButterflyQuant | Adaptive butterfly | 15.4 | 128 samples, minutes on a single GPU |

5. Theoretical Guarantees and Practical Implications

The orthogonality of the butterfly transform preserves inner products and guarantees invariance under the change of basis, permitting quantization without loss of model expressivity under ideal conditions. For large $n$, the structured form attains worst-case coherence comparable to a Hadamard transform, with the crucial added benefits of learnability and layer specificity. The method remains $O(n \log n)$ in both parameter count and compute, maintaining memory and speed efficiency even for the largest LLMs.

A plausible implication is that these properties generalize to a wide class of structured, orthogonal transforms beyond the butterfly pattern and may lay groundwork for universally adaptive quantization frameworks.

6. Significance Within the Quantization Landscape

ButterflyQuant advances the state of low-bit LLM quantization by simultaneously mitigating outlier-induced accuracy loss and ensuring practical hardware performance. Its adaptivity directly addresses the heterogeneity of transformer layer activation distributions, previously an unmet need in fixed-rotation quantization methods. The structured parameterization enables tractable optimization even for models with billions of parameters.

Recent empirical results on LLaMA-2-7B demonstrate both the quantitative performance benefits (substantial perplexity reduction at 2 bits) and qualitative improvements in deployability—marked by minimal calibration cost and robustness across diverse transformer architectures.

7. Summary

ButterflyQuant introduces a learnable, structured orthogonal transformation, built as a butterfly factorization of Givens rotations, for outlier suppression in low-bit LLM quantization. This design enables gradient-based learning for each layer, whereas prior work with fixed, discrete rotations cannot adapt to the varied outlier patterns throughout the model stack. Its $O(n \log n)$ computational and parameter complexity allows scaling to massive models, and calibration converges rapidly. Empirical results show significantly improved perplexity over fixed-rotation competitors for 2-bit deployment. ButterflyQuant thereby provides a theoretically principled and practically efficient solution to the ultra-low-bit quantization problem for modern LLMs (Xu et al., 11 Sep 2025).

References (1)