BitNet b1.58: Ternary Quantization Model
- The paper introduces a quantization-aware training paradigm that constrains weights to {-1, 0, +1}, achieving an average of 1.58 bits per parameter.
- It replaces full-precision layers with custom BitLinear modules and uses straight-through estimators for efficient gradient updates on non-differentiable quantizers.
- Empirical results show that doubling the hidden size in small 1.58-bit models restores 16-bit perplexity, and that the quantized models remain efficient and robust for deployment on resource-constrained hardware.
BitNet b1.58 is a quantization-aware training (QAT) paradigm in which all trainable weights are constrained to the ternary set $\{-1, 0, +1\}$, yielding an average representation of $\log_2 3 \approx 1.58$ bits per parameter. This architecture and methodology have been shown to achieve state-of-the-art performance in LLMs and, as explored in more recent work, in small-scale language and vision models. The core mechanism involves tightly integrated ternary quantization of weights and integer quantization of activations, realized via custom “BitLinear” layers and straight-through estimators (STE) for non-differentiable quantizers. The BitNet b1.58 approach, including its “median” variant for enhanced robustness, enables efficient deployment of deep learning models on resource-constrained hardware while maintaining predictive fidelity comparable to traditional 16-bit or 32-bit floating-point networks (Nielsen et al., 24 Jun 2024).
1. Mathematical Formalism of BitNet b1.58 Quantization
BitNet b1.58 replaces conventional full-precision affine layers $y = Wx + b$ with BitLinear layers that apply quantization to both activations and weights.
Activation quantization (default $b$-bit, e.g., $b = 8$):
- LayerNorm is first applied: $\tilde{x} = \mathrm{LN}(x)$.
- Each element is rescaled: $\hat{x} = \tilde{x} \cdot \frac{Q_b}{\gamma}$, with $\gamma = \lVert \tilde{x} \rVert_\infty$ and $Q_b = 2^{b-1}$.
- Quantization function: $\tilde{x}_q = \mathrm{Clip}\big(\mathrm{round}(\hat{x}),\, -Q_b + \epsilon,\, Q_b - \epsilon\big)$.
Weight quantization (ternary per layer, mean or median scaling):
- Retain a 16-bit shadow copy $W$ during training; the forward pass uses the ternary weights $W_q$ scaled by a per-layer statistic $\beta$, where $\beta$ can be either the mean ($\beta = \mathrm{mean}(|W|)$) or the median ($\beta = \mathrm{median}(|W|)$) of the absolute weight values.
- Quantized weight: $W_q = \mathrm{RoundClip}\big(\tfrac{W}{\beta + \epsilon},\, -1,\, 1\big)$, with $\mathrm{RoundClip}(x, a, b) = \max\big(a, \min(b, \mathrm{round}(x))\big)$.
- During the backward pass, $\beta$ and $\gamma$ are detached and gradient estimation proceeds via the straight-through estimator: $\frac{\partial \mathcal{L}}{\partial W} \approx \frac{\partial \mathcal{L}}{\partial W_q}$.
This quantization regime results in a strict ternary weight assignment for all trainable parameters, and integer activations in forward computation (Nielsen et al., 24 Jun 2024).
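A minimal PyTorch sketch of these two quantizers, assuming per-tensor scaling and the detach convention for $\beta$ and $\gamma$ described above; the function names are illustrative and not the authors' API.

```python
import torch

def quantize_activations(x: torch.Tensor, bits: int = 8, eps: float = 1e-5):
    """Absmax-quantize (LayerNorm'd) activations to signed `bits`-bit integer levels."""
    Qb = 2 ** (bits - 1)
    gamma = x.abs().max().clamp(min=eps).detach()              # per-tensor scale, detached
    x_q = (x * Qb / gamma).round().clamp(-Qb + eps, Qb - eps)  # round, then clip
    return x_q, gamma

def quantize_weights(w: torch.Tensor, scaling: str = "mean", eps: float = 1e-5):
    """Ternarize weights to {-1, 0, +1} with mean- or median-of-|w| scaling."""
    stat = w.abs().mean() if scaling == "mean" else w.abs().median()
    beta = stat.clamp(min=eps).detach()                        # scale is detached (no gradient)
    w_q = (w / (beta + eps)).round().clamp(-1, 1)              # RoundClip to the ternary set
    return w_q, beta
```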
2. Model Architectures in BitNet b1.58
BitNet b1.58 is adaptable to both language and vision tasks through systematic integration of BitLinear modules in place of full-precision fully connected layers.
Small LLMs (SLMs):
- Four-layer “Mistral-like” Transformer architectures with hidden dimensions of 32, 64, 128, and 256, yielding parameter counts of roughly 6M, 12M, 24M, and 48M (see the table in Section 4).
- Empirically, ternary quantization reduces effective capacity, requiring approximately double the hidden size to match 16-bit perplexity metrics.
Small Vision Models:
- Fully connected classifiers for MNIST (100K parameters).
- Small convolutional architectures for CIFAR-10 (2.1M) and CIFAR-100 (2.2M), with shared convolutional backbones and parameter count adjusted only at the final classification head.
Integration Strategy:
- Each fully connected/linear layer replaced by BitLinear.
- Embedding layers and normalization typically remain full-precision.
The “median” scaling variant is optionally used as a robustness hyperparameter for weight quantization (Nielsen et al., 24 Jun 2024).
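As a concrete illustration of this integration strategy, the following hedged PyTorch sketch implements a BitLinear drop-in for `nn.Linear`, combining the activation and weight quantizers of Section 1 with a detach-based straight-through estimator. The implementation details (LayerNorm placement, $\epsilon$, scale handling) are a reconstruction from the formalism above, not the reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """nn.Linear drop-in with ternary weights and b-bit activations (QAT sketch)."""

    def __init__(self, in_features, out_features, bias=True, bits=8, scaling="mean"):
        super().__init__(in_features, out_features, bias)
        self.bits, self.scaling, self.eps = bits, scaling, 1e-5
        self.norm = nn.LayerNorm(in_features)

    def forward(self, x):
        Qb = 2 ** (self.bits - 1)

        # Activation path: LayerNorm -> absmax rescale -> round/clip, STE via detach.
        x = self.norm(x)
        gamma = x.abs().max().clamp(min=self.eps).detach()
        x_q = (x * Qb / gamma).round().clamp(-Qb + self.eps, Qb - self.eps)
        x_q = x + (x_q - x).detach()

        # Weight path: mean- or median-scaled ternarization of the shadow weights.
        w = self.weight
        stat = w.abs().mean() if self.scaling == "mean" else w.abs().median()
        beta = stat.clamp(min=self.eps).detach()
        w_q = (w / (beta + self.eps)).round().clamp(-1, 1)
        w_q = w + (w_q - w).detach()

        # Integer-style matmul followed by dequantization with the combined scale.
        y = F.linear(x_q, w_q) * (beta * gamma / Qb)
        return y if self.bias is None else y + self.bias
```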
3. Training Protocols and Hyperparameters
Language:
- Tokenization: byte-pair encoding (BPE).
- Dataset: 135M tokens; 10 training epochs (1.35B token updates).
- Adam optimizer, batch size 128.
Vision:
- MNIST: 60K train, 10K test.
- CIFAR-10/100: 50K train, 10K test.
- Learning rates in $\{10^{-2}, 10^{-3}, 10^{-4}\}$ and weight decay in $\{0, 0.01, 0.05\}$, selected per configuration (see the tables in Section 4).
Quantization-Aware Training Flow:
- Apply LayerNorm, then quantize activations to $b$-bit integer levels.
- Quantize the 16-bit shadow weights to $\{-1, 0, +1\}$.
- Perform integer matrix multiplication.
- Dequantize the result with the combined scale $\beta \gamma / Q_b$.
- Backward update via STE on the shadow weights (a minimal training-step sketch follows at the end of this subsection).
No additional regularization (e.g., dropout, label smoothing) is applied beyond weight decay (Nielsen et al., 24 Jun 2024).
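A compact sketch of one quantization-aware training step under this protocol, assuming the linear layers have been replaced by a BitLinear-style module (such as the sketch in Section 2); the wiring in the comments mirrors the hyperparameters above and is illustrative only.

```python
import torch

def qat_train_step(model, batch, optimizer, loss_fn):
    """One quantization-aware training step: quantized forward, STE backward, FP update."""
    x, y = batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass runs on ternary weights / integer-level activations
    loss.backward()               # STE routes gradients back to the full-precision shadow weights
    optimizer.step()              # the optimizer updates the shadow copy, never the ternary values
    return loss.item()

# Example wiring (hypothetical model built from BitLinear-style layers):
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
#   loss_fn = torch.nn.CrossEntropyLoss()
```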
4. Empirical Analysis: Performance, Robustness, and Scaling Laws
Small LLMs (SLMs)
| Hidden Size | Params | Bits | lr | wd | Perplexity |
|---|---|---|---|---|---|
| 32 | 6M | 16-bit | 1e-3 | 0.00 | 77.8 |
| 32 | 6M | 1.58-mean | 1e-2 | 0.00 | 130.1 |
| 32 | 6M | 1.58-median | 1e-2 | 0.00 | 116.6 |
| 64 | 12M | 16-bit | 1e-3 | 0.00 | 36.7 |
| 128 | 24M | 16-bit | 1e-3 | 0.05 | 21.4 |
| 256 | 48M | 16-bit | 1e-3 | 0.00 | 16.6 |
| 128 | 24M | 1.58-mean | 1e-3 | 0.05 | 36.3 |
| 256 | 48M | 1.58-mean | 1e-3 | 0.05 | 27.1 |
| 128 | 24M | 1.58-median | 1e-2 | 0.00 | 42.3 |
| 256 | 48M | 1.58-median | 1e-2 | 0.05 | 63.8 |
- Doubling the hidden size in 1.58-bit models approximately restores 16-bit performance (e.g., 36.7 perplexity for the 16-bit model at hidden size 64 vs. 36.3 for the 1.58-mean model at hidden size 128).
- Median scaling offers improved convergence stability at high learning rates, sometimes requiring more training steps at low rates (Nielsen et al., 24 Jun 2024).
Small Vision Models
| Dataset | Params | Bits | lr | wd | Accuracy (%) |
|---|---|---|---|---|---|
| MNIST | 100K | 16-bit | 1e-3 | 0.00 | 96.93 |
| MNIST | 100K | 1.58-mean | 1e-3 | 0.00 | 96.01 |
| MNIST | 100K | 1.58-median | 1e-3 | 0.00 | 95.80 |
| CIFAR-10 | 2.1M | 16-bit | 1e-3 | 0.00 | 70.06 |
| CIFAR-10 | 2.1M | 1.58-mean | 1e-4 | 0.05 | 71.47 |
| CIFAR-10 | 2.1M | 1.58-median | 1e-4 | 0.00 | 71.21 |
| CIFAR-100 | 2.2M | 16-bit | 1e-3 | 0.00 | 36.62 |
| CIFAR-100 | 2.2M | 1.58-mean | 1e-4 | 0.01 | 41.57 |
| CIFAR-100 | 2.2M | 1.58-median | 1e-4 | 0.01 | 42.27 |
- On CIFAR benchmarks, b1.58 often surpasses full-precision baselines by 1–5 percentage points.
- For MNIST, the accuracy gap is about 1 percentage point.
- Low learning rates (around $10^{-4}$) are optimal even when training from scratch, in contrast to earlier large-model studies that call for larger learning rates in QAT.
- Ternary quantization confers increased robustness to moderate weight decay (up to 5%), where 16-bit models may collapse, indicating an inherent regularization effect (Nielsen et al., 24 Jun 2024).
5. Deployment, Hardware, and Practical Guidelines
Memory and Compute:
- BitNet b1.58 models use only $\{-1, 0, +1\}$ weights and integer activations, enabling multiplication-free kernels (add/subtract or zero-skip); a minimal kernel sketch follows this list.
- Model size reduction: moving from $16$-bit to $1.58$-bit weights yields approximately $10\times$ compression ($16 / 1.58 \approx 10.1$).
- Integer formats facilitate efficient inference on CPUs, GPUs, and custom ASICs.
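The add/subtract structure of such a kernel can be illustrated as follows. This is a readability-oriented PyTorch sketch rather than an optimized integer kernel, and the scale handling assumes the $\beta \gamma / Q_b$ dequantization from Section 1.

```python
import torch

def ternary_matvec(w_q: torch.Tensor, x_q: torch.Tensor, beta: float, gamma: float, Qb: int):
    """w_q: (out, in) ternary matrix in {-1, 0, +1}; x_q: (in,) integer-valued activations."""
    pos = (x_q * (w_q == 1)).sum(dim=1)       # accumulate activations where the weight is +1
    neg = (x_q * (w_q == -1)).sum(dim=1)      # accumulate activations where the weight is -1
    return (pos - neg) * (beta * gamma / Qb)  # zero weights are skipped; rescale at the end
```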
Best Practices for Adapting BitNet b1.58:
- Treat mean vs. median scaling as a tunable hyperparameter.
- Retune the learning rate, typically within the $10^{-4}$–$10^{-2}$ range; do not reuse full-precision defaults (an illustrative search grid follows this list).
- Employ modest weight decay (up to $0.05$); quantization itself provides regularization.
- For small LLMs, double the hidden size to recover capacity.
- Gradient updates must employ a straight-through estimator applied to 16-bit shadow weights.
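These guidelines can be summarized as a small search grid; the values below are drawn from the configurations reported in Section 4 and are illustrative defaults rather than prescriptions.

```python
# Illustrative hyperparameter grid for adapting BitNet b1.58 to a new small model
# (values taken from the ranges explored in Section 4; not an official recommendation).
bitnet_b158_grid = {
    "weight_scaling": ["mean", "median"],   # treat the scaling statistic as tunable
    "learning_rate": [1e-4, 1e-3, 1e-2],    # retune; do not reuse full-precision defaults
    "weight_decay": [0.0, 0.01, 0.05],      # modest decay; ternarization already regularizes
    "hidden_size_multiplier": 2,            # for small LLMs, double the hidden size
    "activation_bits": 8,                   # default b-bit activation quantization
}
```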
Applicability:
- BitNet b1.58 Reloaded demonstrates state-of-the-art perplexity with sub-50M-parameter models (given roughly $2\times$ the hidden size), and strong performance on sub-3M-parameter vision tasks.
- This quantization-aware training regime is tractable for small models, directly enabling energy-efficient, low-memory inference suitable for edge or resource-limited deployment (Nielsen et al., 24 Jun 2024).
6. Interpretative Notes and Implications
The successful transfer of 1.58-bit QAT into the regime of small language and vision models signals that extreme low-bit quantization, with proper scaling, is not restricted to overparameterized, large-scale LLMs. The median-based scaling variant augments stability for small networks and severe learning rate schedules, while the general regime acts as an implicit regularizer—sometimes outperforming full-precision counterparts. This suggests that low-bit quantization may serve both compression and generalization objectives for a range of architectures.
A plausible implication is that future hardware-software co-designs will increasingly target ternary/integer arithmetic as a fundamental substrate, leveraging optimized inference kernels to realize the theoretical memory, energy, and latency reductions empirically validated in BitNet b1.58 studies.
7. References
- BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks (Nielsen et al., 24 Jun 2024)