This technical report introduces BitNet b1.58 2B4T, the first open-source LLM with 2 billion parameters trained natively with 1.58-bit weights (Ma et al., 16 Apr 2025). The primary motivation is to address the significant computational resource demands (memory, energy, latency) of traditional full-precision LLMs, which hinder their deployment on edge devices and in resource-constrained environments. Unlike post-training quantization (PTQ) methods that can degrade performance, or previous smaller-scale native 1-bit models, BitNet b1.58 2B4T aims to achieve performance comparable to leading full-precision models of similar size while offering substantial efficiency benefits. The model weights and dedicated inference code for GPU and CPU are released to encourage further research and adoption.
Architecture:
The model architecture builds upon the standard Transformer but replaces torch.nn.Linear layers with custom BitLinear layers, central to the BitNet approach (Wang et al., 2023; Ma et al., 27 Feb 2024). Key architectural features include (a minimal code sketch follows this list):
- Weight Quantization: Weights are quantized to ternary values {-1, 0, +1} (representing 1.58 bits) during the forward pass using an absolute mean (absmean) quantization scheme.
- Activation Quantization: Activations are quantized to 8-bit integers using an absolute maximum (absmax) scheme applied per token.
- Normalization: SubLN normalization [22xx.xxxxx] is used for improved training stability.
- Activation Function: Squared ReLU (ReLU²) is used in the Feed-Forward Network (FFN) sub-layers instead of SwiGLU, potentially enhancing sparsity and computational properties (Wang et al., 15 Jul 2024; Wang et al., 7 Nov 2024).
- Positional Embeddings: Rotary Position Embeddings (RoPE) [24xx.xxxxx] are employed.
- Bias Removal: All bias terms are removed from linear and normalization layers, similar to LLaMA (Touvron et al., 2023).
- Tokenizer: Uses the LLaMA 3 tokenizer (Dubey et al., 31 Jul 2024), a byte-level BPE with a 128,256-token vocabulary.
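As referenced above, the following is a minimal PyTorch sketch of a BitLinear layer combining absmean ternary weight quantization with per-token absmax 8-bit activation quantization, plus a squared-ReLU FFN built from it. The straight-through-estimator formulation, scaling conventions, and initialization are illustrative assumptions; the released implementation may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def absmean_weight_quant(w: torch.Tensor):
    # Ternary (1.58-bit) quantization: scale by the mean absolute value,
    # then round each weight to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale


def absmax_activation_quant(x: torch.Tensor):
    # Per-token 8-bit quantization: each token vector gets its own scale
    # based on its absolute maximum.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    x_q = (x / scale).round().clamp(-128, 127)
    return x_q, scale


class BitLinear(nn.Module):
    """Drop-in replacement for nn.Linear with W1.58A8 fake quantization (no bias)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q, w_scale = absmean_weight_quant(self.weight)
        x_q, x_scale = absmax_activation_quant(x)
        # Straight-through estimator: quantized values in the forward pass,
        # full-precision gradients in the backward pass.
        w = self.weight + (w_q * w_scale - self.weight).detach()
        x = x + (x_q * x_scale - x).detach()
        return F.linear(x, w)


class BitFFN(nn.Module):
    """FFN sub-layer using squared ReLU (ReLU²) instead of a gated SwiGLU."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.up = BitLinear(dim, hidden_dim)
        self.down = BitLinear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)) ** 2)
```

At inference time, the dequantized multiplications above are replaced by the packed ternary/INT8 kernels described under Inference Implementation below.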
Training:
Training followed a three-phase process: large-scale pre-training on 4 trillion tokens, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO).
- Pre-training:
- Data: Utilized public text/code (e.g., DCLM [24xx.xxxxx], FineWeb-EDU [24xx.xxxxx]) and synthetic math data.
- Learning Rate: Employed a two-stage cosine decay schedule: a high initial LR followed by a sharp decay into a lower-LR cooldown phase (sketched after this sub-list). This exploits the observed stability of 1-bit training.
- Weight Decay: Used a cosine schedule peaking at 0.1 during stage 1, then set to zero during stage 2.
- Data Strategy: Processed bulk web data in stage 1 and higher-quality curated data during the stage 2 cooldown.
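The report describes the shape of these schedules rather than exact values, so the peak learning rate, stage split, and floor values below are illustrative assumptions; the sketch shows how such a two-stage LR and weight-decay schedule could be wired up in PyTorch.

```python
import math
import torch

TOTAL_STEPS = 100_000      # illustrative; not the report's actual step count
STAGE1_FRAC = 0.75         # illustrative split between stage 1 and the cooldown
PEAK_LR = 1.2e-3           # illustrative peak learning rate
STAGE1_FLOOR = 0.3         # illustrative: stage 1 decays to 30% of the peak LR
COOLDOWN_SCALE = 0.1       # illustrative: stage-2 LR starts at 10% of the peak


def lr_multiplier(step: int) -> float:
    """Stage 1: cosine decay from the peak to a floor; then a sharp drop into a
    low-LR cosine cooldown for stage 2."""
    stage1_steps = int(TOTAL_STEPS * STAGE1_FRAC)
    if step < stage1_steps:
        progress = step / max(stage1_steps, 1)
        return STAGE1_FLOOR + (1.0 - STAGE1_FLOOR) * 0.5 * (1.0 + math.cos(math.pi * progress))
    progress = (step - stage1_steps) / max(TOTAL_STEPS - stage1_steps, 1)
    return COOLDOWN_SCALE * 0.5 * (1.0 + math.cos(math.pi * progress))


def weight_decay_at(step: int) -> float:
    """Cosine-shaped weight decay peaking at 0.1 during stage 1, then zero for the
    stage-2 cooldown (the exact stage-1 shape is an assumption)."""
    stage1_steps = int(TOTAL_STEPS * STAGE1_FRAC)
    if step >= stage1_steps:
        return 0.0
    return 0.1 * 0.5 * (1.0 - math.cos(2 * math.pi * step / max(stage1_steps, 1)))


model = torch.nn.Linear(16, 16)   # stand-in for the actual Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)

# Inside the training loop, after each optimizer step:
#   for group in optimizer.param_groups:
#       group["weight_decay"] = weight_decay_at(step)
#   scheduler.step()
```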
- Supervised Fine-tuning (SFT):
- Data: Leveraged public instruction/conversation datasets (WildChat [24xx.xxxxx], LMSYS-Chat-1M [24xx.xxxxx], WizardLM [24xx.xxxxx], SlimOrca [23xx.xxxxx]) and synthetic data (GLAN (Li et al., 20 Feb 2024), MathScale [24xx.xxxxx]).
- Optimization: Used summation instead of mean reduction for the loss (see the sketch after this sub-list), which improved convergence. Fine-tuning also required a relatively larger learning rate and more epochs than typical full-precision fine-tuning.
- Chat Template: A specific format (`<|begin_of_text|>System:...<|eot_id|>User:...<|eot_id|>Assistant:...<|eot_id|>...`) was used.
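A minimal sketch of the loss-reduction detail noted above; the -100 label-masking convention for prompt tokens is an assumed convention, not something stated in the report.

```python
import torch
import torch.nn.functional as F


def sft_loss_sum_reduction(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy summed over the batch rather than averaged.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len), with -100
    marking positions excluded from the loss (e.g., prompt tokens) -- an assumed
    masking convention.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
        reduction="sum",   # summation instead of the usual mean reduction
    )
```

The only change relative to a standard SFT objective is `reduction="sum"`.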
- Direct Preference Optimization (DPO):
- Goal: Align model outputs with human preferences for helpfulness and safety without needing a separate reward model.
- Data: Used public preference datasets UltraFeedback [24xx.xxxxx] and MagPie (Xu et al., 12 Jun 2024).
- Details: Trained for 2 epochs with a DPO beta of 0.1 (loss sketched below). Employed Liger kernels (Hsu et al., 14 Oct 2024) for training efficiency.
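The DPO objective itself is the standard pairwise preference loss; below is a minimal sketch using the beta value quoted above. How the sequence log-probabilities are computed under the policy and the frozen reference model is omitted and would be an implementation assumption.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss on sequence log-probabilities under the policy and a
    frozen reference model; beta = 0.1 matches the value quoted above."""
    # Implicit rewards are the beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```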
Evaluation:
BitNet b1.58 2B4T was evaluated on benchmarks covering language understanding, reasoning, knowledge, math/code, and conversation, comparing it against:
- Leading open-weight full-precision LLMs (1B-2B parameters).
- INT4 post-training quantized (PTQ) versions of Qwen2.5 1.5B (Qwen et al., 19 Dec 2024).
- Other native and PTQ 1-bit models.
Key Findings:
- Efficiency: Shows dramatically lower non-embedding memory footprint (0.4GB vs. 1.4GB-4.8GB for competitors), estimated energy consumption (0.028J vs. 0.186J-0.649J), and CPU latency (29ms vs. 41ms-124ms per token).
- Performance vs. Full Precision: Achieves performance on par with or exceeding state-of-the-art full-precision models of similar size on several benchmarks (e.g., ARC-Challenge, GSM8K, WinoGrande). Its overall average performance is highly competitive.
- Performance vs. PTQ: Outperforms INT4 quantized versions of Qwen2.5 1.5B (GPTQ [23xx.xxxxx], AWQ (Lin et al., 2023)) while using significantly less memory, suggesting native 1-bit training offers a better efficiency-performance trade-off than standard INT4 PTQ.
- Performance vs. Other 1-bit Models: Substantially outperforms existing smaller native 1-bit models and even larger models (7B, 8B) that were post-training quantized to 1.58 bits.
Inference Implementation:
Since standard libraries lack optimized kernels for the W1.58A8 (1.58-bit weight, 8-bit activation) format, custom implementations were developed and open-sourced:
- GPU: A custom CUDA kernel packs four ternary weights into a single int8 for storage in HBM, loads the packed data into faster on-chip memory (SRAM), unpacks it, and then performs the matrix multiplication with the 8-bit activations (see the packing sketch after this list). This strategy, detailed in the Ladder framework [23xx.xxxxx], optimizes memory bandwidth.
- CPU: The bitnet.cpp library (Wang et al., 17 Feb 2025) provides an official C++ reference implementation with optimized kernels for efficient and accurate CPU inference.
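As a rough illustration of the packing idea described for the GPU path, here is a PyTorch sketch that maps four ternary weights into one byte and back; the specific bit layout and unpacking order are assumptions, not the released kernel's actual format.

```python
import torch


def pack_ternary(w_q: torch.Tensor) -> torch.Tensor:
    """Pack four ternary weights ({-1, 0, +1}) into one uint8.

    Each weight is mapped to a 2-bit code (0, 1, 2 for -1, 0, +1), and four codes
    occupy consecutive 2-bit fields. The layout is illustrative only.
    """
    assert w_q.numel() % 4 == 0
    codes = w_q.flatten().to(torch.int32) + 1          # {-1,0,+1} -> {0,1,2}
    codes = codes.view(-1, 4)
    packed = codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)
    return packed.to(torch.uint8)


def unpack_ternary(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_ternary: recover the four ternary values from each byte."""
    p = packed.to(torch.int32)
    codes = torch.stack([(p >> s) & 0x3 for s in (0, 2, 4, 6)], dim=-1)
    return (codes - 1).flatten().to(torch.int8)         # {0,1,2} -> {-1,0,+1}


# Round-trip check on a small example.
w = torch.tensor([-1, 0, 1, 1, 0, 0, -1, 1], dtype=torch.int8)
assert torch.equal(unpack_ternary(pack_ternary(w)), w)
```

In the actual kernel, the packed bytes reside in HBM, are unpacked after being loaded into SRAM, and the recovered ternary values then enter the matrix multiplication with the 8-bit activations; the sketch only shows the packing arithmetic.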
Conclusion and Future Work:
BitNet b1.58 2B4T demonstrates that native 1-bit training at scale can produce LLMs that are both highly efficient and competitive in performance with full-precision counterparts. This opens possibilities for deploying powerful AI on resource-limited devices. Future research directions include exploring scaling laws for larger 1-bit models, hardware co-design, extending context length, adding multilingual/multimodal capabilities, and deepening the theoretical understanding of 1-bit training.