BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs (2504.18415v2)

Published 25 Apr 2025 in cs.CL and cs.LG

Abstract: Efficient deployment of 1-bit LLMs is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.

This paper, "BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs" (Wang et al., 25 Apr 2025 ), introduces a novel framework to address the challenge of quantizing activations in 1-bit LLMs to low bit-widths, specifically 4 bits, for improved inference efficiency on emerging hardware.

The core problem tackled is that while 1.58-bit (ternary: -1, 0, 1) weights, as used in BitNet b1.58, significantly reduce memory footprint and bandwidth requirements, these models still typically rely on 8-bit activations. This prevents full utilization of hardware designed for 4-bit computation, shifting the bottleneck from memory bandwidth to compute. Aggressively quantizing activations to 4 bits is difficult because intermediate states within the model (the inputs to the attention output projection and the FFN down projection) often have non-Gaussian distributions with significant outliers, which are challenging for low-bit fixed-point representations.
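To make the outlier issue concrete, the following toy NumPy experiment (constructed here for illustration, not taken from the paper) quantizes a Gaussian activation vector with a naive symmetric absmax 4-bit quantizer, with and without a single injected outlier:

```python
import numpy as np

def naive_absmax_int4(x, eps=1e-6):
    """Naive symmetric absmax 4-bit quantization, returned in dequantized form."""
    scale = np.abs(x).max() + eps
    q = np.clip(np.round(x / scale * 7), -8, 7)
    return q / 7 * scale

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)   # Gaussian-like activations
x_out = x.copy()
x_out[0] = 60.0                                # inject a single large outlier

for name, v in [("gaussian", x), ("with outlier", x_out)]:
    mse = np.mean((v - naive_absmax_int4(v)) ** 2)
    print(f"{name:>13}: quantization MSE = {mse:.4f}")
# The outlier stretches the quantization scale, so most values collapse onto a
# few levels and the error grows by well over an order of magnitude.
```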

BitNet v2 proposes to enable native 4-bit activations across the entire model (except potentially the input/output embeddings). The key innovation is a new module, H-BitLinear, which replaces the standard linear layers for the attention output projection ($\mathbf{W}_\text{o}$) and the FFN down projection ($\mathbf{W}_\text{down}$). The H-BitLinear layer applies an online Hadamard transformation to the activations *before* they are quantized.

The Hadamard transformation $\mathbf{H}_m$ is a $2^m \times 2^m$ orthogonal matrix constructible via a recursive formula. For an input vector $\mathbf{X}$ of size $n = 2^m$, the transformation is $\text{Hadamard}(\mathbf{X}) = \mathbf{H}_m \mathbf{X}$. The paper uses a fast Hadamard transform implementation with $\mathcal{O}(n \log n)$ complexity. The purpose of this transformation is to strategically reshape the distribution of the intermediate states: while the inputs to the attention and FFN layers are often naturally Gaussian-like, the intermediate states exhibit sharp distributions with numerous outliers. The Hadamard transformation smooths these sharp distributions, making them more amenable to low-bit quantization, as illustrated by the activation distribution plots in the paper.
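For intuition, below is a minimal NumPy sketch of the fast Walsh-Hadamard transform realizing the $\mathcal{O}(n \log n)$ recursion; the function name `fwht` and the orthonormal $1/\sqrt{n}$ scaling are choices made for this illustration, not details prescribed by the paper.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis, O(n log n), n = 2^m.

    Uses the recursion H_m = [[H_{m-1},  H_{m-1}],
                              [H_{m-1}, -H_{m-1}]]
    as log2(n) butterfly passes, scaled by 1/sqrt(n) to be orthonormal.
    """
    x = np.array(x, dtype=np.float32)          # work on a copy
    n = x.shape[-1]
    assert n > 0 and (n & (n - 1)) == 0, "size must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

# Orthogonality check: the orthonormal transform is its own inverse.
v = np.random.randn(4, 16).astype(np.float32)
assert np.allclose(fwht(fwht(v)), v, atol=1e-5)
```

Because the orthonormal transform is its own inverse, applying it again in the backward pass (as the paper does for the gradients) costs the same $\mathcal{O}(n \log n)$.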
The quantization scheme for BitNet v2 involves:

  • Weights: 1.58-bit ternary quantization to $\{-1, 0, 1\}$ with a per-tensor absolute-mean (absmean) scaling factor:

    $$\text{Q}_{w}(\mathbf{W}) = \alpha\,\text{RoundClip}\!\left(\frac{\mathbf{W}}{\alpha+\epsilon}, -1, 1\right), \quad \alpha = \text{mean}(|\mathbf{W}|)$$

  • 8-bit activations (used for the initial training stage and for comparison): per-token absmax quantization:

    $$\text{Q}_{\text{INT8}}(\mathbf{X}) = \frac{\gamma}{127}\,\text{RoundClip}\!\left(\frac{127}{\gamma+\epsilon}\mathbf{X}, -128, 127\right), \quad \gamma = \max(|\mathbf{X}|)$$

  • 4-bit activations (for efficient inference): per-token absmean quantization:

    $$\text{Q}_{\text{INT4}}(\mathbf{X}) = \frac{\beta}{\sqrt{7}}\,\text{RoundClip}\!\left(\frac{\sqrt{7}}{\beta+\epsilon}\mathbf{X}, -8, 7\right), \quad \beta = \text{mean}(|\mathbf{X}|)$$

The computation within the H-BitLinear layers (specifically $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$) is formulated as $\mathbf{Y} = \text{Q}_{w}(\mathbf{W}) \cdot \text{Q}_{\text{INT8/4}}(\mathbf{X}_r)$, where $\mathbf{X}_r = \text{Hadamard}(\text{LN}(\mathbf{X}))$ and LN is layer normalization. For the other layers (e.g., $\mathbf{W}_\text{qkv}$, $\mathbf{W}_\text{up,gate}$), the Hadamard transformation is not applied, since their inputs already exhibit quantization-friendly distributions.

BitNet v2 employs a two-stage training strategy. First, the model is trained from scratch with 1.58-bit weights and 8-bit activations for the bulk of the training tokens (e.g., 95B). It then undergoes a continue-training phase with native 4-bit activations for all linear layers (except the embeddings) on a smaller number of tokens (e.g., 5B), reusing the optimizer states. Training uses the straight-through estimator (STE) for gradient approximation and mixed-precision updates of full-precision latent weights. The backward pass through the Hadamard transformation exploits its orthogonality by applying the same transformation to the gradients:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \text{Hadamard}\!\left(\frac{\partial \mathcal{L}}{\partial\,\text{Hadamard}(\mathbf{X})}\right)$$
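Putting the pieces together, the following NumPy sketch shows the quantizers and an H-BitLinear forward pass in fake-quantization form; the helper names, the parameter-free layer norm, and the dense `scipy.linalg.hadamard` matrix standing in for the paper's fast transform are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import hadamard

def round_clip(x, lo, hi):
    return np.clip(np.round(x), lo, hi)

def quant_weight_ternary(W, eps=1e-6):
    # Q_w(W) = alpha * RoundClip(W / (alpha + eps), -1, 1), alpha = mean(|W|)
    alpha = np.mean(np.abs(W))
    return alpha * round_clip(W / (alpha + eps), -1, 1)

def quant_act_int8(X, eps=1e-6):
    # Per-token absmax: gamma = max(|X|) along the feature axis
    gamma = np.max(np.abs(X), axis=-1, keepdims=True)
    return gamma / 127 * round_clip(127 / (gamma + eps) * X, -128, 127)

def quant_act_int4(X, eps=1e-6):
    # Per-token absmean: beta = mean(|X|) along the feature axis
    beta = np.mean(np.abs(X), axis=-1, keepdims=True)
    return beta / np.sqrt(7) * round_clip(np.sqrt(7) / (beta + eps) * X, -8, 7)

def layer_norm(X, eps=1e-6):
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def h_bitlinear(X, W, a_bits=4):
    """Toy H-BitLinear forward: Y = Q_w(W) @ Q_INT8/4(Hadamard(LN(X)))."""
    d = X.shape[-1]
    H = hadamard(d).astype(np.float32) / np.sqrt(d)   # orthonormal Hadamard matrix
    Xr = layer_norm(X) @ H                            # X_r = Hadamard(LN(X))
    Xq = quant_act_int4(Xr) if a_bits == 4 else quant_act_int8(Xr)
    return Xq @ quant_weight_ternary(W).T

# Example: a batch of 4 token vectors through a hypothetical 256 -> 256 projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 256)).astype(np.float32)
W = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
print(h_bitlinear(X, W, a_bits=4).shape)  # (4, 256)
```

In training, the RoundClip in each quantizer would sit behind a straight-through estimator, so rounding applies in the forward pass while gradients flow through as if it were the identity.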

Experimental results demonstrate the effectiveness of BitNet v2:

  • BitNet v2 trained with 8-bit activations (BitNet v2 (a8)) achieves performance comparable to or slightly better than BitNet b1.58 (W1.58A8), indicating that the insertion of the Hadamard transformation is not detrimental.
  • The 4-bit activation variant (BitNet v2 (a4)), obtained after continue-training, shows minimal performance degradation compared to its 8-bit counterpart and performs comparably to BitNet a4.8 (W1.58A4/A8 hybrid with sparsification) on perplexity and downstream tasks. Crucially, BitNet v2 (a4) uses dense 4-bit computations, making it more efficient for batched inference on hardware supporting native 4-bit operations compared to methods involving sparsification.
  • BitNet v2 (a4) significantly outperforms post-training quantization methods like SpinQuant and QuaRot when applied to BitNet b1.58 to achieve W1.58A4.
  • Ablation studies confirm that the Hadamard transformation is necessary for stable training with low-bit activations for intermediate states and that applying it only to activations is sufficient.
  • The QKV cache states can also be quantized to 3-bit or 4-bit in BitNet v2 with marginal performance impact.

In summary, BitNet v2 provides a practical architecture and training methodology to realize the efficiency benefits of 4-bit activations in 1-bit LLMs by incorporating a Hadamard transformation layer to condition intermediate state distributions for low-bit quantization. This enables significant memory and computational savings, particularly for batched inference on modern hardware.

Authors (3)
  1. Hongyu Wang (104 papers)
  2. Shuming Ma (83 papers)
  3. Furu Wei (291 papers)