1.58-bit Quantization Techniques in Deep Learning
- 1.58-bit quantization is a technique that maps neural network weights to {-1, 0, +1} using scaling and rounding, achieving an effective bit-width of approximately 1.585 bits.
- It leverages quantization-aware training (with the straight-through estimator and optimizers such as AdamW) as well as post-training quantization to maintain near full-precision accuracy across architectures.
- Practical implementations in transformers, CNNs, and TTS demonstrate significant storage reductions and computational efficiency, making it ideal for resource-limited applications.
A 1.58-bit quantization technique refers to weight quantization schemes in which each parameter is ternarized to one of three discrete levels, commonly $-1$, $0$, or $+1$, achieving an effective model bit-width of $\log_2 3 \approx 1.58$ bits per weight. Such ultra-low-bitwidth quantization dramatically reduces model size, memory bandwidth, and compute requirements, while preserving model accuracy near full-precision levels across a broad range of deep learning architectures, including transformers, CNNs, GNNs, and specialized models for text, vision, and speech domains.
1. Formal Definition and Quantization Functions
The canonical 1.58-bit quantization maps each floating-point weight $w$ to the set $\{-1, 0, +1\}$ via a scale factor and a rounding/clipping operation. The most basic formulation is:

$$\hat{w} = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{w}{\gamma + \epsilon}\right), -1, +1\right),$$

where the scaling factor $\gamma$ is typically the layerwise mean or median of the absolute weight values:

$$\gamma = \frac{1}{n}\sum_{i=1}^{n} \lvert w_i \rvert,$$

with $n$ the number of weights in the layer. Extensions may use robustified statistics (e.g., channelwise or blockwise means/medians) or learned, tensor-specific scales. In practice, weights are stored as small signed integers and a single floating-point scale factor per layer or block.
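A minimal NumPy sketch of this absmean-style ternarization (function names and the toy tensor are illustrative, not drawn from any specific codebase):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Ternarize a weight tensor to {-1, 0, +1} with a layerwise absmean scale."""
    gamma = np.mean(np.abs(w)) + eps          # layerwise scale: mean of |w|
    q = np.clip(np.round(w / gamma), -1, 1)   # ternary codes in {-1, 0, +1}
    return q.astype(np.int8), gamma           # store int8 codes + one fp scale

def dequantize(q: np.ndarray, gamma: float):
    """Reconstruct approximate weights for computation or inspection."""
    return q.astype(np.float32) * gamma

w = np.random.randn(256, 256).astype(np.float32)
q, gamma = ternary_quantize(w)
print(q.min(), q.max(), gamma)  # codes in {-1, 0, 1}, a single scale per layer
```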
This ternary quantizer yields three possible values per weight, whose empirical Shannon entropy under optimized training is approximately $1.58$ bits per weight (Ma et al., 2024, Nielsen et al., 2024, Nielsen et al., 2024).
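As a quick sanity check on the effective bit-width, the empirical entropy of a ternary code tensor can be computed directly; the snippet below is a self-contained illustration in which uniform random codes stand in for trained weights:

```python
import numpy as np

def ternary_entropy(q: np.ndarray) -> float:
    """Empirical Shannon entropy (bits per weight) of a ternary code tensor."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A uniform ternary distribution attains the log2(3) ~ 1.585-bit maximum;
# trained models typically land close to, but below, this bound.
codes = np.random.choice([-1, 0, 1], size=10_000)
print(ternary_entropy(codes))
```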
Post-training quantization and quantization-aware training (QAT) both employ this mapping, but QAT additionally uses STE-based gradient flows to update the underlying shadow weights.
2. Training Schemes and Optimization Algorithms
Quantization-aware training: Most 1.58-bit methods maintain full-precision "shadow" weights for optimization. The forward pass uses quantized weights, and the backward pass applies the straight-through estimator (STE): the gradient is passed through unchanged, $\partial \mathcal{L}/\partial w \approx \partial \mathcal{L}/\partial \hat{w}$, if $w$ lies within the quantization (clipping) range, and is zero otherwise (Ma et al., 2024, Nielsen et al., 2024, Nielsen et al., 2024). Optimizers are typically AdamW with careful learning-rate tuning (smaller rates for small models, higher for LLMs) and no or only mild weight regularization.
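The following PyTorch sketch illustrates this QAT pattern, with a full-precision shadow weight, absmean ternarization in the forward pass, and an identity-style STE in the backward pass; it is a simplified illustration rather than the BitNet reference implementation, and the layer name, initialization, and learning rate are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer with full-precision shadow weights and a ternary forward pass."""
    def __init__(self, in_features, out_features, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.eps = eps

    def forward(self, x):
        w = self.weight
        gamma = w.abs().mean() + self.eps
        w_q = torch.clamp(torch.round(w / gamma), -1, 1) * gamma
        # Straight-through estimator: forward uses w_q, backward flows into w.
        # (A simplified identity STE; clip-based variants zero gradients outside the range.)
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste)

layer = TernaryLinear(64, 32)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)
loss = layer(torch.randn(8, 64)).sum()
loss.backward()          # gradients reach the shadow weights via the STE
opt.step()
```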
Post-training quantization (PTQ): PTQ methods such as AdaRound, BRECQ, and OBC adapt scale and zero-point per layer/channel to minimize reconstruction loss between quantized and original layer outputs. These can be adapted directly to a ternary scheme by restricting the quantization levels to $\{-1, 0, +1\}$, using per-channel scaling, and potentially learning rounding parameters for each weight (Zhang et al., 17 Dec 2025).
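As an illustration of the reconstruction-loss idea adapted to ternary levels, the sketch below grid-searches a per-output-channel scale to minimize layer-output error on calibration data; actual AdaRound/BRECQ/OBC pipelines use more sophisticated objectives and learned rounding, so this is only a schematic stand-in:

```python
import numpy as np

def ptq_ternary_per_channel(W, X_calib, n_grid=50):
    """Pick per-output-channel scales minimizing ||W X - W_q X||^2 on calibration data.

    W: (out, in) float weights; X_calib: (in, samples) calibration activations.
    """
    W_q = np.zeros_like(W)
    scales = np.zeros(W.shape[0])
    ref = W @ X_calib                                    # full-precision layer output
    for c in range(W.shape[0]):
        w = W[c]
        base = np.abs(w).mean() + 1e-8
        best_err, best = np.inf, (np.zeros_like(w), base)
        for s in np.linspace(0.5 * base, 2.0 * base, n_grid):  # grid around absmean
            q = np.clip(np.round(w / s), -1, 1)
            err = np.sum((ref[c] - s * (q @ X_calib)) ** 2)
            if err < best_err:
                best_err, best = err, (q, s)
        W_q[c], scales[c] = best[0] * best[1], best[1]
    return W_q, scales

W = np.random.randn(16, 64)
X = np.random.randn(64, 128)
W_q, s = ptq_ternary_per_channel(W, X)
print(np.linalg.norm(W @ X - W_q @ X) / np.linalg.norm(W @ X))  # relative output error
```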
Advanced schemes: ParetoQ introduces a Stretched Elastic Quantizer (SEQ) with a learned scale, supporting unified experimentation from binary to 4-bit settings and analytical scaling-law studies (Liu et al., 4 Feb 2025). HESTIA adopts a differentiable softmax-based quantizer, annealed according to a Hessian-guided schedule, to preserve smooth gradients early in quantized training and harden assignments later (Wang et al., 28 Jan 2026).
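The softmax-based quantizer can be pictured, very roughly, as a temperature-controlled soft assignment to the three levels that hardens as the temperature is annealed; the sketch below is an assumed illustration of that general pattern and omits the Hessian-guided schedule, so it should not be read as HESTIA's exact formulation:

```python
import torch

def soft_ternary(w, scale, tau):
    """Soft assignment of weights to levels {-1, 0, +1} via a softmax over
    negative squared distances; tau -> 0 hardens toward true ternarization."""
    levels = torch.tensor([-1.0, 0.0, 1.0], device=w.device)
    d = (w.unsqueeze(-1) / scale - levels) ** 2   # (..., 3) distances to each level
    p = torch.softmax(-d / tau, dim=-1)           # soft level probabilities
    return (p * levels).sum(-1) * scale           # expected (soft) quantized weight

w = torch.randn(4, 4)
for tau in (1.0, 0.1, 0.01):                      # a simple annealing schedule
    print(tau, soft_ternary(w, w.abs().mean(), tau))
```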
3. Practical Implementations and Model Architectures
The 1.58-bit paradigm is realized across a wide spectrum of architectures:
- Transformer-based LLMs: BitNet b1.58 and its derivatives apply ternary quantization to all major weight matrices (attention QKV, MLP, output heads) with RMSNorm and STE, reaching or exceeding the accuracy of 16-bit LLaMA and Mistral models at equivalent scale (Ma et al., 2024, Nielsen et al., 2024).
- CNNs & MLPs: Encoder-only, encoder-decoder, and MLP-based models for classification match or outperform 16/32-bit counterparts when scaling width for expressiveness (Nielsen et al., 2024).
- Text-to-Speech (TTS) and Vision Transformers: BitTTS applies 1.58-bit QAT combined with a weight-indexing scheme that packs five ternary weights into a single byte (see the packing sketch after this list), yielding $7.6$ MB models with minimal MOS loss (RTF and synthesis quality close to full precision), while FLUX achieves a $7.7\times$ reduction in model storage on T2I pipelines via post-training ternarization and kernel fusion (Kawamura et al., 4 Jun 2025, Yang et al., 2024).
- KV-cache and VideoLLMs: 1.58-bit quantization of the KV cache (value cache) combined with per-channel assignments and semantic token protection enables substantial compression of inference memory with negligible performance drop (Tao et al., 20 Mar 2025).
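The weight-indexing idea mentioned above works because $3^5 = 243 \le 256$, so five ternary codes fit into one byte; a minimal packing/unpacking sketch (not the BitTTS implementation) is:

```python
import numpy as np

def pack5(ternary):
    """Pack a flat array of ternary codes {-1, 0, +1} into one byte per 5 weights."""
    t = (np.asarray(ternary) + 1).astype(np.uint8)        # map {-1,0,+1} -> {0,1,2}
    t = np.pad(t, (0, (-len(t)) % 5))                     # pad to a multiple of 5
    groups = t.reshape(-1, 5)
    radix = 3 ** np.arange(5, dtype=np.uint16)            # base-3 digit weights
    return (groups * radix).sum(axis=1).astype(np.uint8)  # values in [0, 242]

def unpack5(packed, n):
    """Inverse of pack5, recovering n ternary codes in {-1, 0, +1}."""
    vals = packed.astype(np.int32)[:, None]
    digits = (vals // 3 ** np.arange(5)) % 3
    return digits.reshape(-1)[:n] - 1

codes = np.random.choice([-1, 0, 1], size=17)
assert np.array_equal(unpack5(pack5(codes), 17), codes)
```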
Table: Core Quantization Functions

| Method | Quantization Function | Scale Type |
|---|---|---|
| BitNet b1.58 | $\hat{w} = \operatorname{clip}(\operatorname{round}(w/\gamma), -1, +1)$ | $\gamma$ = layerwise mean of $\lvert w \rvert$ |
| SDQ-LLM | Sigma-Delta ternarization per column (at the chosen OSR) | None (pre-processing: Hadamard) |
| ParetoQ | SEQ (Stretched Elastic Quantizer) with learned scale; see main text | Learnable per-tensor |
| PTQ (AdaRound) | Learned rounding restricted to ternary levels | Per-channel min/max |
| HESTIA | Differentiable softmax-based quantizer with annealed hardness | Hessian-guided, per-tensor |
4. Empirical Results and Comparative Performance
Extensive evaluations demonstrate that 1.58-bit quantization (QAT-trained from scratch or via fine-tuning) maintains near-parity with full-precision baselines across multiple tasks and scales:
- LLM Perplexity and Accuracy: For LLaMA and OLMo architectures with up to $8$B parameters, BitNet b1.58 achieves validation perplexity within $0.1$ of FP16, and on some tasks, even surpasses FP16 accuracy (Ma et al., 2024, Nielsen et al., 2024, Liu et al., 4 Feb 2025).
- Text & Vision Tasks: On CIFAR-10/100 and standard NLP benchmarks, ternary (b1.58) models achieve accuracy on par with their full-precision counterparts. For text-to-image (FLUX), a substantial inference-memory reduction is realized with only a $1$–$2$ point metric drop (Yang et al., 2024).
- Ablation and Scaling Studies: Doubling hidden size in small LMs or vision models compensates for ternary capacity loss at minimal overhead (Nielsen et al., 2024). For encoder–decoder models, b1.58 sometimes outperforms full-precision with no capacity increase (Nielsen et al., 2024).
Best practices further include knowledge distillation, additional layer-wise normalization (extra RMSNorm), and gradual quantization schedules ($\lambda$-schedules) to stabilize convergence (Steinmetz et al., 12 May 2025).
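One common way to realize such a gradual schedule is to blend full-precision and ternarized weights with a ramped coefficient $\lambda$; the sketch below illustrates that general idea under a simple linear ramp, which may differ from the exact schedule used in the cited work:

```python
import torch

def lambda_schedule(step, warmup_steps):
    """Ramp lambda from 0 (pure FP weights) to 1 (pure ternary) over warmup."""
    return min(1.0, step / max(1, warmup_steps))

def blended_weight(w, step, warmup_steps, eps=1e-8):
    lam = lambda_schedule(step, warmup_steps)
    gamma = w.abs().mean() + eps
    w_q = torch.clamp(torch.round(w / gamma), -1, 1) * gamma
    w_q = w + (w_q - w).detach()              # STE so gradients still reach w
    return (1 - lam) * w + lam * w_q          # gradually hand over to the quantizer

w = torch.randn(8, 8, requires_grad=True)
for step in (0, 500, 1000):
    print(step, blended_weight(w, step, warmup_steps=1000).std().item())
```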
5. Hardware Realizations and Computational Advantages
1.58-bit quantization enables highly efficient hardware implementations:
- Representation: $3$ states per weight allow $\log_2 3 \approx 1.58$ bits of entropy per parameter. Weight packing (e.g., grouping $5$ ternaries into a byte) and entropy coding can approach this theoretical minimum (Kawamura et al., 4 Jun 2025).
- Accelerators: The BitROM CiROM architecture stores two ternary weights per transistor, achieving $20.8$ TOPS/W (65 nm) and high on-die weight-storage density. The computation pipeline eliminates multiply units in favor of conditional add/sub and zero-skipping accumulators (Zhang et al., 10 Sep 2025, Ma et al., 2024).
- Kernels: Custom GPU/ASIC kernels realize the storage and RAM reductions reported for FLUX, while dedicated ternary matmul logic delivers roughly $4\times$ or greater theoretical speedups over full-precision GEMMs (Yang et al., 2024, Nielsen et al., 2024).
- Inference Efficiency: Multiplication becomes sign-tested addition/subtraction (for $\pm 1$ weights, with zeros skipped), supporting aggressive pipelining and bit-packed storage; see the sketch below.
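A scalar reference of this multiplication-free inner loop is shown below; production kernels instead operate on bit-packed words with SIMD or dedicated logic, so this is purely illustrative:

```python
import numpy as np

def ternary_matvec(q, scales, x):
    """y = diag(scales) @ (Q @ x) computed with only adds/subs and zero-skipping.

    q: (out, in) int8 codes in {-1, 0, +1}; scales: (out,) per-row fp scales.
    """
    out = np.zeros(q.shape[0], dtype=np.float32)
    for r in range(q.shape[0]):
        acc = 0.0
        for c in range(q.shape[1]):
            t = q[r, c]
            if t == 1:       # sign-tested add
                acc += x[c]
            elif t == -1:    # sign-tested subtract
                acc -= x[c]
            # t == 0: skipped entirely (zero-skipping)
        out[r] = scales[r] * acc
    return out

q = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
x = np.random.randn(8).astype(np.float32)
s = np.full(4, 0.7, dtype=np.float32)
assert np.allclose(ternary_matvec(q, s, x), s * (q @ x), atol=1e-5)
```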
6. Extensions, Limitations, and Applications
- Mixed-Precision and Hybrid Schemes: 1.58-bit backbones can be coupled to a low-rank FP16 correction path (Hybrid Gated Flow) that recovers much of the quality gap at minimal overhead (total $1.68$ bits/weight) (Pizzo, 5 Feb 2026); see the sketch after this list.
- Scaling Laws: Dedicated analyses in ParetoQ and BitNet b1.58 confirm a distinct scaling law in the ternary regime; the information capacity yields an accuracy–size Pareto frontier often superior to 2- and 4-bit baselines (Ma et al., 2024, Liu et al., 4 Feb 2025).
- QAT Transitions: Continual pre-training with early 16-to-1.58 bit transitions outperforms training from scratch at 1.58 bits on LLM benchmarks (Nielsen et al., 17 Feb 2025).
- Limitations: In small LMs, hidden sizes must be inflated (roughly doubled) to achieve comparable PPL; for vision, fine-grained textures can degrade at ultra-low bit-widths. Some architectural choices (RMSNorm, bias-free linear layers) enhance stability (Nielsen et al., 2024, Steinmetz et al., 12 May 2025, Yang et al., 2024).
- Applications: Edge LLMs, on-device TTS, real-time text-to-image, privacy-preserving DNNs, and resource-limited inference are among the principal domains benefiting from 1.58-bit quantization (Kawamura et al., 4 Jun 2025, Yang et al., 2024, Zhang et al., 17 Dec 2025).
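The hybrid scheme in the first bullet above can be sketched as a ternary main path plus a small low-rank full-precision branch; the gating and exact parameterization of Hybrid Gated Flow are not reproduced here, so the structure and names below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryPlusLowRank(nn.Module):
    """Ternary main weight path with a rank-r full-precision correction branch."""
    def __init__(self, d_in, d_out, rank=8, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)  # shadow weights
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.02)        # low-rank factors
        self.B = nn.Parameter(torch.zeros(rank, d_out))
        self.eps = eps

    def forward(self, x):
        w = self.weight
        gamma = w.abs().mean() + self.eps
        w_q = torch.clamp(torch.round(w / gamma), -1, 1) * gamma     # ternary path
        return F.linear(x, w + (w_q - w).detach()) + (x @ self.A) @ self.B

y = TernaryPlusLowRank(64, 32)(torch.randn(4, 64))
print(y.shape)  # torch.Size([4, 32])
```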
7. Privacy and Security Implications
Aggressive 1.58-bit PTQ reduces vulnerability to membership inference attacks by up to an order of magnitude relative to FP16, indicating possible benefits for privacy-by-design (Zhang et al., 17 Dec 2025). Adjusting the final or input layer to higher bit-widths restores accuracy with partial retention of privacy gains, allowing fine-grained control along the privacy–utility spectrum.
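A sketch of the kind of layer-selective policy this implies, ternarizing most layers while keeping input/output layers at higher precision (module names and the FP16 choice are illustrative):

```python
import numpy as np

def quantize_model(named_weights, keep_fp=("embed", "lm_head")):
    """Ternarize all layers except those whose names match the keep_fp patterns."""
    out = {}
    for name, w in named_weights.items():
        if any(k in name for k in keep_fp):
            out[name] = ("fp16", w.astype(np.float16))   # privacy/accuracy trade-off knob
        else:
            gamma = np.abs(w).mean() + 1e-8
            q = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
            out[name] = ("ternary", (q, gamma))
    return out

weights = {"embed.weight": np.random.randn(100, 16),
           "block0.mlp.weight": np.random.randn(16, 16),
           "lm_head.weight": np.random.randn(16, 100)}
print({k: v[0] for k, v in quantize_model(weights).items()})
```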
For a comprehensive set of empirical benchmarks, detailed algorithms, and scaling law investigations, see (Ma et al., 2024, Nielsen et al., 2024, Steinmetz et al., 12 May 2025, Kawamura et al., 4 Jun 2025, Zhang et al., 10 Sep 2025, Tao et al., 20 Mar 2025, Yang et al., 2024, Xia et al., 27 Sep 2025, Pizzo, 5 Feb 2026, Wang et al., 28 Jan 2026, Nielsen et al., 2024, Zhang et al., 17 Dec 2025, Liu et al., 4 Feb 2025), and (Nielsen et al., 17 Feb 2025).