BitNet b1.58 2B4T Technical Report (2504.12285v1)

Published 16 Apr 2025 in cs.CL and cs.LG

Abstract: We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit LLM at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.

This technical report introduces BitNet b1.58 2B4T, the first open-source LLM with 2 billion parameters trained natively with 1.58-bit weights (Ma et al., 16 Apr 2025 ). The primary motivation is to address the significant computational resource demands (memory, energy, latency) of traditional full-precision LLMs, which hinder their deployment on edge devices and in resource-constrained environments. Unlike post-training quantization (PTQ) methods that can degrade performance, or previous smaller-scale native 1-bit models, BitNet b1.58 2B4T aims to achieve performance comparable to leading full-precision models of similar size while offering substantial efficiency benefits. The model weights and dedicated inference code for GPU and CPU are released to encourage further research and adoption.

Architecture:

The model architecture builds upon the standard Transformer but replaces torch.nn.Linear layers with custom BitLinear layers, central to the BitNet approach (Wang et al., 2023 , Ma et al., 27 Feb 2024 ). Key architectural features include:

  • Weight Quantization: Weights are quantized to ternary values {-1, 0, +1} (≈1.58 bits per weight) during the forward pass using an absolute mean (absmean) quantization scheme.
  • Activation Quantization: Activations are quantized to 8-bit integers using an absolute maximum (absmax) scheme applied per token (both schemes are sketched in code after this list).
  • Normalization: SubLN normalization [22xx.xxxxx] is used for improved training stability.
  • Activation Function: Squared ReLU (ReLU²) is used in the Feed-Forward Network (FFN) sub-layers instead of SwiGLU, potentially enhancing sparsity and computational properties (Wang et al., 15 Jul 2024 , Wang et al., 7 Nov 2024 ).
  • Positional Embeddings: Rotary Position Embeddings (RoPE) [24xx.xxxxx] are employed.
  • Bias Removal: All bias terms are removed from linear and normalization layers, similar to LLaMA (Touvron et al., 2023 ).
  • Tokenizer: Uses the LLaMA 3 tokenizer (Dubey et al., 31 Jul 2024 ), a byte-level BPE with a 128,256 token vocabulary.
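
To make the BitLinear forward pass concrete, here is a minimal PyTorch-style sketch of the two quantization schemes above (absmean ternary weights, per-token absmax 8-bit activations) with a straight-through estimator for training. It is an illustration written for this summary, not the released implementation, and it omits the SubLN normalization that precedes quantization in the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def absmean_weight_quant(w: torch.Tensor) -> torch.Tensor:
    """Quantize weights to ternary {-1, 0, +1}, scaled by the per-tensor absmean."""
    gamma = w.abs().mean().clamp(min=1e-5)           # absmean scale
    return (w / gamma).round().clamp(-1, 1) * gamma  # ternary values, rescaled

def absmax_activation_quant(x: torch.Tensor) -> torch.Tensor:
    """Quantize activations to 8-bit integers with a per-token absmax scale."""
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class BitLinear(nn.Linear):
    """Drop-in replacement for torch.nn.Linear with W1.58A8 fake quantization.

    The straight-through estimator (x + (q - x).detach()) lets gradients
    flow through the non-differentiable rounding during training.
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q = x + (absmax_activation_quant(x) - x).detach()
        w_q = self.weight + (absmean_weight_quant(self.weight) - self.weight).detach()
        return F.linear(x_q, w_q)  # bias terms are removed in BitNet b1.58
```

A layer constructed as BitLinear(in_features, out_features, bias=False) would then stand in for the corresponding torch.nn.Linear in each Transformer block.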

Training:

The model was trained on 4 trillion tokens using a three-phase process: pre-training, supervised fine-tuning (SFT), and direct preference optimization (DPO).

  1. Pre-training:
    • Data: Utilized public text/code (e.g., DCLM [24xx.xxxxx], FineWeb-EDU [24xx.xxxxx]) and synthetic math data.
    • Learning Rate: Employed a two-stage schedule: a cosine decay from a high initial LR, followed by a sharp mid-training drop into a lower-LR cooldown phase (a schedule of this shape is sketched after this list). This exploits the observed training stability of 1-bit models.
    • Weight Decay: Used a cosine schedule peaking at 0.1 during stage 1, then set to zero during stage 2.
    • Data Strategy: Processed bulk web data in stage 1 and higher-quality curated data during the stage 2 cooldown.
  2. Supervised Fine-tuning (SFT):
    • Data: Leveraged public instruction/conversation datasets (WildChat [24xx.xxxxx], LMSYS-Chat-1M [24xx.xxxxx], WizardLM [24xx.xxxxx], SlimOrca [23xx.xxxxx]) and synthetic data (GLAN (Li et al., 20 Feb 2024 ), MathScale [24xx.xxxxx]).
    • Optimization: Used summation rather than mean reduction for the loss, which improved convergence. SFT also required a relatively larger learning rate and more epochs than typical full-precision fine-tuning.
    • Chat Template: A specific format (<|begin_of_text|>System:...<|eot_id|>User:...<|eot_id|>Assistant:...<|eot_id|>...) was used.
  3. Direct Preference Optimization (DPO):
    • Goal: Align model outputs with human preferences for helpfulness and safety without needing a separate reward model.
    • Data: Used public preference datasets UltraFeedback [24xx.xxxxx] and MagPie (Xu et al., 12 Jun 2024 ).
    • Details: Trained for 2 epochs with a learning rate of 2×10⁻⁷ and a DPO beta of 0.1. Employed Liger kernels (Hsu et al., 14 Oct 2024 ) for optimization.
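
As an illustration of the two-stage learning-rate idea from the pre-training phase, here is a minimal Python sketch. The stage boundary, peak, and cooldown values are assumptions chosen for illustration, not the paper's hyperparameters.

```python
import math

def cosine(start: float, end: float, frac: float) -> float:
    """Cosine interpolation from `start` (frac = 0) down to `end` (frac = 1)."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * frac))

def two_stage_lr(step: int, total_steps: int,
                 stage1_frac: float = 0.75,      # assumed stage boundary
                 peak_lr: float = 2.4e-3,        # illustrative peak LR
                 stage1_end_lr: float = 2.4e-4,  # illustrative end of stage 1
                 cooldown_lr: float = 1.0e-4,    # illustrative cooldown peak
                 final_lr: float = 1.0e-5) -> float:
    """Stage 1: cosine decay from a high peak LR over the bulk web data.
    Stage 2: abrupt drop, then a lower-LR cosine cooldown on curated data."""
    stage1_steps = int(total_steps * stage1_frac)
    if step < stage1_steps:
        return cosine(peak_lr, stage1_end_lr, step / max(stage1_steps, 1))
    frac = (step - stage1_steps) / max(total_steps - stage1_steps, 1)
    return cosine(cooldown_lr, final_lr, frac)
```

The weight-decay schedule described above could be handled analogously: a cosine ramp peaking at 0.1 over stage 1, then held at zero during stage 2.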

Evaluation:

BitNet b1.58 2B4T was evaluated on benchmarks covering language understanding, reasoning, knowledge, math/code, and conversation, comparing it against:

  • Leading open-weight full-precision LLMs (1B-2B parameters).
  • INT4 post-training quantized (PTQ) versions of Qwen2.5 1.5B (Qwen et al., 19 Dec 2024 ).
  • Other native and PTQ 1-bit models.

Key Findings:

  • Efficiency: Shows a dramatically lower non-embedding memory footprint (0.4 GB vs. 1.4 GB-4.8 GB for competitors), estimated energy consumption (0.028 J vs. 0.186 J-0.649 J), and CPU decoding latency (29 ms vs. 41 ms-124 ms per token); a back-of-envelope check of the memory figure follows this list.
  • Performance vs. Full Precision: Achieves performance on par with or exceeding state-of-the-art full-precision models of similar size on several benchmarks (e.g., ARC-Challenge, GSM8K, WinoGrande). Its overall average performance is highly competitive.
  • Performance vs. PTQ: Outperforms INT4 quantized versions of Qwen2.5 1.5B (GPTQ [23xx.xxxxx], AWQ (Lin et al., 2023 )) while using significantly less memory, suggesting native 1-bit training offers a better efficiency-performance trade-off than standard INT4 PTQ.
  • Performance vs. Other 1-bit Models: Substantially outperforms existing smaller native 1-bit models and even larger models (7B, 8B) that were post-training quantized to 1.58 bits.
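
As a rough consistency check on the memory figure (an illustration for this summary, assuming roughly 2×10⁹ non-embedding weights stored at about 1.58 bits each): 2×10⁹ weights × 1.58 bits ÷ 8 bits/byte ≈ 0.4×10⁹ bytes ≈ 0.4 GB, compared with roughly 2×10⁹ weights × 2 bytes ≈ 4 GB for the same weights stored in BF16.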

Inference Implementation:

Since standard libraries lack optimized kernels for the W1.58A8 (1.58-bit weight, 8-bit activation) format, custom implementations were developed and open-sourced:

  • GPU: A custom CUDA kernel was created. It packs four ternary weights into a single int8 value for storage in HBM, loads the packed data into faster on-chip memory (SRAM), unpacks it, and then performs matrix multiplication with the 8-bit activations. This strategy, detailed in the Ladder framework [23xx.xxxxx], reduces memory-bandwidth pressure (a simplified packing sketch follows this list).
  • CPU: The bitnet.cpp library (Wang et al., 17 Feb 2025 ) provides an official C++ reference implementation with optimized kernels for efficient and accurate CPU inference.
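
To illustrate the weight-packing idea above, here is a simplified NumPy sketch written for this summary; the 2-bit code assignment and memory layout are arbitrary illustrative choices and do not reflect the released CUDA kernel.

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary weights {-1, 0, +1} into bytes, four 2-bit codes per byte.

    Code assignment (illustrative): 0 -> 0b00, +1 -> 0b01, -1 -> 0b10."""
    assert weights.size % 4 == 0
    codes = np.where(weights < 0, 2, weights).astype(np.uint8)  # -1 -> 2, 0 -> 0, +1 -> 1
    w = codes.reshape(-1, 4)
    return (w[:, 0] | (w[:, 1] << 2) | (w[:, 2] << 4) | (w[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Recover the ternary weights from the packed byte representation."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).astype(np.int8)
    return np.where(codes == 2, -1, codes).reshape(-1)

# Round trip: storage shrinks to one quarter of one-int8-per-weight.
w = np.random.randint(-1, 2, size=16).astype(np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w)), w)
```

Packing four weights per byte quarters the bytes that must be moved from HBM per weight; the actual kernel unpacks in on-chip SRAM immediately before the matrix multiplication with the 8-bit activations.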

Conclusion and Future Work:

BitNet b1.58 2B4T demonstrates that native 1-bit training at scale can produce LLMs that are both highly efficient and competitive in performance with full-precision counterparts. This opens possibilities for deploying powerful AI on resource-limited devices. Future research directions include exploring scaling laws for larger 1-bit models, hardware co-design, extending context length, adding multilingual/multimodal capabilities, and deepening the theoretical understanding of 1-bit training.

Authors (8)
  1. Shuming Ma (83 papers)
  2. Hongyu Wang (104 papers)
  3. Shaohan Huang (79 papers)
  4. Xingxing Zhang (65 papers)
  5. Ying Hu (121 papers)
  6. Ting Song (9 papers)
  7. Yan Xia (169 papers)
  8. Furu Wei (291 papers)

HackerNews

  1. BitNet b1.58 2B4T Technical Report (111 points, 30 comments)