BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
(2504.18415v2)
Published 25 Apr 2025 in cs.CL and cs.LG
Abstract: Efficient deployment of 1-bit LLMs is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.
This paper, "BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs" (Wang et al., 25 Apr 2025), introduces a novel framework to address the challenge of quantizing activations in 1-bit LLMs to low bit-widths, specifically 4 bits, for improved inference efficiency on emerging hardware.
The core problem tackled is that while 1.58-bit (ternary: -1, 0, 1) weights, as used in BitNet b1.58, dramatically reduce memory bandwidth requirements, the models still typically rely on 8-bit activations. This prevents full utilization of hardware with native 4-bit compute and shifts the inference bottleneck from memory bandwidth to computation. Aggressively quantizing activations to 4 bits is difficult because the intermediate states within LLMs (the inputs to the attention output projection and the FFN down projection) often have non-Gaussian distributions with significant outliers, which are challenging to represent with low-bit fixed-point formats.
BitNet v2 proposes to enable native 4-bit activations across the entire model (except potentially input/output embeddings). The key innovation is a new module, H-BitLinear, which replaces the standard linear layers for the attention output projection ($\mathbf{W}_\text{o}$) and the FFN down projection ($\mathbf{W}_\text{down}$). The H-BitLinear layer applies an online Hadamard transformation to the activations *before* they are quantized.

The Hadamard transformation $\mathbf{H}_m$ is a $2^m \times 2^m$ orthogonal matrix constructible via a recursive formula. For an input vector $\mathbf{X}$ of size $n = 2^m$, the transformation is $\text{Hadamard}(\mathbf{X}) = \mathbf{H}_m \mathbf{X}$. The paper utilizes a fast Hadamard transform implementation with $\mathcal{O}(n \log n)$ complexity. The purpose of this transformation is to strategically reshape the distribution of the intermediate states: while the inputs to the attention and FFN layers are often naturally Gaussian-like, the intermediate states exhibit sharp distributions with numerous outliers. The Hadamard transformation smooths these sharp distributions, making them more amenable to low-bit quantization, as illustrated by the activation distribution plots in the paper.

The quantization scheme for BitNet v2 involves:

- **Weights:** 1.58-bit ternary quantization to $\{-1, 0, 1\}$ using a per-tensor absolute-mean scaling factor:

$$\text{Q}_{w}(\mathbf{W}) = \alpha\,\text{RoundClip}\!\left(\frac{\mathbf{W}}{\alpha+\epsilon}, -1, 1\right), \quad \alpha = \text{mean}(|\mathbf{W}|)$$

- **Activations:**
  - 8-bit activations (for initial training and comparison) use per-token absmax quantization:

  $$\text{Q}_{\text{INT8}}(\mathbf{X}) = \frac{\gamma}{127}\,\text{RoundClip}\!\left(\frac{127}{\gamma+\epsilon}\mathbf{X}, -128, 127\right), \quad \gamma = \max(|\mathbf{X}|)$$

  - 4-bit activations (for efficient inference) use per-token absmean quantization:

  $$\text{Q}_{\text{INT4}}(\mathbf{X}) = \frac{\beta}{\sqrt{7}}\,\text{RoundClip}\!\left(\frac{\sqrt{7}}{\beta+\epsilon}\mathbf{X}, -8, 7\right), \quad \beta = \text{mean}(|\mathbf{X}|)$$

The computation within the H-BitLinear layers (specifically $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$) is formulated as $\mathbf{Y} = \text{Q}_{w}(\mathbf{W}) \cdot \text{Q}_{\text{INT8/4}}(\mathbf{X}_r)$, where $\mathbf{X}_r = \text{Hadamard}(\text{LN}(\mathbf{X}))$ and LN denotes layer normalization. For the other linear layers (e.g., $\mathbf{W}_\text{qkv}$, $\mathbf{W}_\text{up,gate}$), the Hadamard transformation is not applied, as their inputs already exhibit well-behaved distributions.

BitNet v2 employs a two-stage training strategy. The model is first trained from scratch with 1.58-bit weights and 8-bit activations for the bulk of the training tokens (e.g., 95B). It then undergoes a continue-training phase with native 4-bit activations for all linear layers (except embeddings) for a much smaller number of tokens (e.g., 5B), reusing the optimizer states. Training uses the straight-through estimator (STE) for gradient approximation and mixed-precision updates with full-precision latent weights. The backward pass through the Hadamard transformation exploits its orthogonality: the gradient is obtained by applying the same transformation, $\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \text{Hadamard}\!\left(\frac{\partial \mathcal{L}}{\partial\,\text{Hadamard}(\mathbf{X})}\right)$.
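To make the forward computation concrete, the PyTorch sketch below implements the three quantizers, a normalized fast Walsh-Hadamard transform, and a fake-quantized H-BitLinear forward pass. It is a minimal sketch under stated assumptions: the hidden size is a power of two, the Hadamard transform is scaled by $1/\sqrt{n}$ to keep it orthogonal, and the `HBitLinear` class, its LayerNorm placement, and the omission of the straight-through estimator are illustrative choices rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def round_clip(x, a, b):
    # RoundClip(x, a, b): round to the nearest integer, then clip to [a, b].
    return torch.clamp(torch.round(x), a, b)

def quant_weight_ternary(w, eps=1e-5):
    # Q_w(W) = alpha * RoundClip(W / (alpha + eps), -1, 1), alpha = mean(|W|)
    alpha = w.abs().mean()
    return alpha * round_clip(w / (alpha + eps), -1, 1)

def quant_act_int8(x, eps=1e-5):
    # Per-token absmax: Q_INT8(X) = (gamma/127) * RoundClip(127 X / (gamma+eps), -128, 127)
    gamma = x.abs().amax(dim=-1, keepdim=True)
    return (gamma / 127) * round_clip(127 / (gamma + eps) * x, -128, 127)

def quant_act_int4(x, eps=1e-5):
    # Per-token absmean: Q_INT4(X) = (beta/sqrt(7)) * RoundClip(sqrt(7) X / (beta+eps), -8, 7)
    beta = x.abs().mean(dim=-1, keepdim=True)
    s = 7.0 ** 0.5
    return (beta / s) * round_clip(s / (beta + eps) * x, -8, 7)

def hadamard(x):
    # Fast Walsh-Hadamard transform over the last dimension, O(n log n).
    # Scaled by 1/sqrt(n) so the transform is orthogonal (an assumption here),
    # which makes the backward pass another application of the same transform.
    shape, n = x.shape, x.shape[-1]
    assert n & (n - 1) == 0, "hidden size must be a power of two"
    y, h = x.reshape(-1, n), 1
    while h < n:
        y = y.view(-1, n // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = torch.stack((a + b, a - b), dim=2).reshape(-1, n)
        h *= 2
    return (y / n ** 0.5).view(shape)

class HBitLinear(torch.nn.Module):
    """Sketch of H-BitLinear: LN -> online Hadamard -> activation quant -> ternary matmul."""
    def __init__(self, in_features, out_features, a_bits=4):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.norm = torch.nn.LayerNorm(in_features, elementwise_affine=False)
        self.a_bits = a_bits

    def forward(self, x):
        x_r = hadamard(self.norm(x))  # X_r = Hadamard(LN(X))
        q_x = quant_act_int4(x_r) if self.a_bits == 4 else quant_act_int8(x_r)
        q_w = quant_weight_ternary(self.weight)
        # During training, each quantizer would be wrapped with a straight-through
        # estimator, e.g. x + (q(x) - x).detach(); omitted here for clarity.
        return F.linear(q_x, q_w)

# Example: a W_o-style projection on a batch of token activations.
layer = HBitLinear(256, 256, a_bits=4)
y = layer(torch.randn(2, 16, 256))
print(y.shape)  # torch.Size([2, 16, 256])
```

In a real deployment the ternary weights and 4-bit activations would be packed into integer formats and multiplied with low-bit kernels; the fake-quantization above only mirrors the arithmetic of the training-time formulation.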
Experimental results demonstrate the effectiveness of BitNet v2:
- BitNet v2 trained with 8-bit activations (BitNet v2 (a8)) achieves performance comparable to, or slightly better than, BitNet b1.58 (W1.58A8), indicating that inserting the Hadamard transformation is not detrimental.
- The 4-bit activation variant (BitNet v2 (a4)), obtained after continue-training, shows minimal performance degradation relative to its 8-bit counterpart and performs comparably to BitNet a4.8 (a W1.58A4/A8 hybrid with sparsification) on perplexity and downstream tasks. Crucially, BitNet v2 (a4) uses dense 4-bit computations, making it more efficient than sparsification-based methods for batched inference on hardware with native 4-bit support.
- BitNet v2 (a4) significantly outperforms post-training quantization methods such as SpinQuant and QuaRot when they are applied to BitNet b1.58 to reach W1.58A4.
- Ablation studies confirm that the Hadamard transformation is necessary for stable training with low-bit intermediate-state activations, and that applying it to the activations alone is sufficient.
- The QKV cache states can also be quantized to 3 or 4 bits in BitNet v2 with marginal performance impact; a hedged sketch of one possible cache quantizer follows this list.
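The summary does not spell out the cache quantizer, so the following is a hypothetical sketch: it simply reuses the per-token absmean recipe (and the `round_clip` helper from the sketch above) with a configurable bit-width, generalizing the INT4 formula's $\sqrt{7}$ scaling to the symmetric range $[-2^{b-1}, 2^{b-1}-1]$. The function name and the exact scaling choice are assumptions, not the paper's cache quantization scheme.

```python
def quant_cache(x, bits=4, eps=1e-5):
    # Hypothetical low-bit QKV-cache quantizer patterned on the absmean
    # activation formula: symmetric integer range [-2^(b-1), 2^(b-1) - 1].
    qmax = 2 ** (bits - 1) - 1        # 7 for 4-bit, 3 for 3-bit
    qmin = -(2 ** (bits - 1))         # -8 for 4-bit, -4 for 3-bit
    beta = x.abs().mean(dim=-1, keepdim=True)
    s = qmax ** 0.5                   # generalizes the sqrt(7) factor (assumption)
    return (beta / s) * round_clip(s / (beta + eps) * x, qmin, qmax)

# Example: fake-quantize cached key states of shape (batch, heads, seq, head_dim).
k_cache = torch.randn(1, 8, 128, 64)
k_q3 = quant_cache(k_cache, bits=3)
```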
In summary, BitNet v2 provides a practical architecture and training methodology to realize the efficiency benefits of 4-bit activations in 1-bit LLMs by incorporating a Hadamard transformation layer to condition intermediate state distributions for low-bit quantization. This enables significant memory and computational savings, particularly for batched inference on modern hardware.