1.58-bit Quantization Techniques in Deep Learning

Updated 10 February 2026
  • 1.58-bit quantization is a technique that maps neural network weights to {-1, 0, +1} using scaling and rounding, achieving an effective bit-width of approximately 1.585 bits.
  • It leverages quantization-aware training (with the straight-through estimator and optimizers such as AdamW) or post-training quantization to maintain near full-precision accuracy across architectures.
  • Practical implementations in transformers, CNNs, and TTS demonstrate significant storage reductions and computational efficiency, making it ideal for resource-limited applications.

A 1.58-bit quantization technique refers to weight quantization schemes in which each parameter is ternarized to one of three discrete levels—commonly $-1$, $0$, or $+1$—achieving an effective model bit-width of $\log_2 3 \approx 1.585$ bits per weight. Such ultra-low-bitwidth quantization dramatically reduces model size, memory bandwidth, and compute requirements, while preserving model accuracy near full-precision levels across a broad range of deep learning architectures, including transformers, CNNs, GNNs, and specialized models for text, vision, and speech domains.

1. Formal Definition and Quantization Functions

The canonical 1.58-bit quantization maps each floating-point weight $w$ to the set $\{-1, 0, +1\}$ via a scale factor and a rounding/clipping operation. The most basic formulation is:

$q(w) = \mathrm{clip}(\mathrm{round}(w / \gamma), -1, +1)$

where the scaling factor $\gamma$ is typically the layerwise mean or median of the absolute weight values:

$\gamma = \frac{1}{N} \sum_{i=1}^N |w_i|$

with $N$ the number of weights in the layer. Extensions may use robustified statistics (e.g., channelwise or blockwise means/medians) or learned, tensor-specific scales. In practice, weights are stored as small signed integers and a single floating-point scale factor per layer or block.
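
A minimal NumPy sketch of this mapping and the accompanying storage scheme (the function names are ours, for illustration only):

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to ternary codes plus one floating-point scale."""
    gamma = np.abs(w).mean() + eps                       # layerwise absmean scale
    codes = np.clip(np.round(w / gamma), -1, 1).astype(np.int8)
    return codes, float(gamma)                           # int8 codes + single scale

def dequantize(codes: np.ndarray, gamma: float) -> np.ndarray:
    """Reconstruct the effective weights used in the forward pass."""
    return codes.astype(np.float32) * gamma
```

Storing the codes as small signed integers together with a single float scale per layer or block is exactly the storage pattern described above.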

This ternary quantizer yields three possible values per weight, whose empirical Shannon entropy under optimized training is approximately $\log_2 3 \approx 1.585$ bits per weight (Ma et al., 2024, Nielsen et al., 2024, Nielsen et al., 2024).
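
Since a ternary code can take at most three values, its Shannon entropy is bounded by the uniform case (a standard fact, stated here for completeness):

$H = -\sum_{q \in \{-1,0,+1\}} p(q)\,\log_2 p(q) \;\le\; \log_2 3 \approx 1.585 \text{ bits},$

with equality when $p(-1) = p(0) = p(+1) = \tfrac{1}{3}$; sparser solutions (many zero weights) therefore admit further compression under entropy coding.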

Post-training quantization and quantization-aware training (QAT) both employ this mapping, but QAT additionally uses STE-based gradient flows to update the underlying shadow weights.

2. Training Schemes and Optimization Algorithms

Quantization-aware training: Most 1.58-bit methods maintain full-precision "shadow" weights for optimization. The forward pass uses quantized weights, and the backward pass applies the straight-through estimator (STE): $\partial L/\partial w \approx \partial L/\partial w_q$ when $w$ lies within the clipping range, and zero otherwise (Ma et al., 2024, Nielsen et al., 2024, Nielsen et al., 2024). Optimizers are typically AdamW with careful tuning of learning rates ($10^{-4}$ to $10^{-3}$ for small models, higher for LLMs) and $\ell_2$ or mild $\ell_1$ regularization.
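
A minimal PyTorch-style sketch of this shadow-weight/STE pattern (class and function names are ours; real implementations add RMSNorm, activation quantization, and bit-packed storage):

```python
import torch
import torch.nn as nn

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Absmean ternarization: round w/gamma into {-1, 0, +1}, then rescale."""
    gamma = w.abs().mean().clamp(min=eps)            # layerwise scale
    return (w / gamma).round().clamp(-1, 1) * gamma

class TernaryLinear(nn.Module):
    """Linear layer whose forward pass uses 1.58-bit weights via the STE."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight                                  # full-precision "shadow" weights
        w_q = w + (ternary_quantize(w) - w).detach()     # forward: quantized; backward: identity
        return x @ w_q.t()
```

The `detach` trick makes the forward pass see the quantized weights while gradients flow to the shadow weights unchanged, which is the STE behavior described above.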

Post-training quantization (PTQ): PTQ methods such as AdaRound, BRECQ, and OBC adapt scale and zero-point per layer/channel to minimize reconstruction loss between quantized and original layer outputs. These can be adapted directly to a ternary scheme by using $K=3$ quantization levels and per-channel scaling, and optionally by learning rounding parameters for each weight (Zhang et al., 17 Dec 2025).
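
A simplified illustration of one ternary PTQ step, assuming a per-output-channel scale chosen by grid search to minimize weight reconstruction error (a weaker proxy than the layer-output objectives used by AdaRound/BRECQ/OBC; the function name is ours):

```python
import numpy as np

def fit_ternary_scales(W: np.ndarray, n_grid: int = 64) -> np.ndarray:
    """Per-output-channel scale s minimizing ||w - s * clip(round(w/s), -1, 1)||^2."""
    scales = np.empty(W.shape[0])
    for c, w in enumerate(W):                            # W: [out_channels, in_features]
        w_max = max(float(np.abs(w).max()), 1e-8)
        best_err, best_s = np.inf, w_max
        for s in np.linspace(0.05, 1.0, n_grid) * w_max:  # candidate scales
            q = np.clip(np.round(w / s), -1, 1)
            err = float(np.sum((w - s * q) ** 2))
            if err < best_err:
                best_err, best_s = err, s
        scales[c] = best_s
    return scales
```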

Advanced schemes: ParetoQ introduces a Stretched Elastic Quantizer (SEQ) with a learned scale, supporting unified experimentation from binary to 4-bit settings and analytical scaling-law studies (Liu et al., 4 Feb 2025). HESTIA adopts a differentiable softmax-based quantizer, annealed according to a Hessian-guided schedule, to preserve smooth gradients early in quantized training and harden assignments later (Wang et al., 28 Jan 2026).

3. Practical Implementations and Model Architectures

The 1.58-bit paradigm is realized across a wide spectrum of architectures:

  • Transformer-based LLMs: BitNet b1.58 and its derivatives apply ternary quantization to all major weight matrices (attention QKV, MLP, output heads) with RMSNorm and STE, reaching or exceeding the accuracy of 16-bit LLaMA and Mistral models at equivalent scale (Ma et al., 2024, Nielsen et al., 2024).
  • CNNs & MLPs: Encoder-only, encoder-decoder, and MLP-based models for classification match or outperform 16/32-bit counterparts when scaling width for expressiveness (Nielsen et al., 2024).
  • Text-to-Speech (TTS) and Vision Transformers: BitTTS applies 1.58-bit QAT combined with a weight-indexing scheme that packs five ternary weights into a single byte (see the packing sketch after this list), yielding $7.6$ MB models with minimal MOS loss (RTF and synthesis quality close to full precision), while FLUX achieves $7.7\times$ model storage reductions on T2I pipelines via post-training ternarization and kernel fusion (Kawamura et al., 4 Jun 2025, Yang et al., 2024).
  • KV-cache and VideoLLMs: 1.58-bit quantization of KV caches (value) combined with per-channel assignments and semantic token protection enables up to $10\times$ compression of inference memory with negligible performance drop (Tao et al., 20 Mar 2025).
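
One way to realize five-weights-per-byte packing is base-3 encoding, since $3^5 = 243 \le 256$; the sketch below is our illustration, not necessarily the exact indexing scheme of BitTTS:

```python
import numpy as np

def pack_ternary(codes: np.ndarray) -> np.ndarray:
    """Pack ternary codes in {-1, 0, +1} into bytes, 5 codes per byte (3^5 = 243)."""
    digits = (codes.astype(np.int16) + 1).ravel()        # map {-1,0,+1} -> {0,1,2}
    pad = (-len(digits)) % 5                             # pad to a multiple of 5
    digits = np.concatenate([digits, np.zeros(pad, dtype=np.int16)])
    weights = 3 ** np.arange(5, dtype=np.int16)          # base-3 place values
    return (digits.reshape(-1, 5) * weights).sum(axis=1).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary; returns the first n codes."""
    vals = packed.astype(np.int16)[:, None] // (3 ** np.arange(5)) % 3
    return (vals.ravel()[:n] - 1).astype(np.int8)
```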

Table: Core Quantization Functions

| Method | Quantization Function | Scale Type |
| --- | --- | --- |
| BitNet b1.58 | $q(w) = \mathrm{clip}(\mathrm{round}(w/\gamma), -1, 1)$ | $\gamma$ = layerwise mean of $\lvert w \rvert$ |
| SDQ-LLM | $\Sigma\Delta$ ternary $Q(x_n)$ at $\mathrm{OSR} \cdot d$ per column | None (pre-processing: Hadamard) |
| ParetoQ | SEQ: $Q_{\mathrm{SEQ}}$ with learned $\alpha$ (see main text) | Learnable per-tensor |
| PTQ (AdaRound) | $\hat{W} = s \cdot (\mathrm{clamp}(\lfloor W/s \rceil + z,\, 0,\, 2) - z)$ | Per-channel min/max |
| HESTIA | $\mathcal{H}(w;\tau) = \gamma \sum_{q \in \{-1,0,1\}} q \cdot \pi_\tau(q \mid w)$ | Hessian-guided, per-tensor |

4. Empirical Results and Comparative Performance

Extensive evaluations demonstrate that 1.58-bit quantization (QAT-trained from scratch or via fine-tuning) maintains near-parity with full-precision baselines across multiple tasks and scales:

  • LLM Perplexity and Accuracy: For LLaMA and OLMo architectures with up to $8$B parameters, BitNet b1.58 achieves validation perplexity within $0.1$ of FP16, and on some tasks, even surpasses FP16 accuracy (Ma et al., 2024, Nielsen et al., 2024, Liu et al., 4 Feb 2025).
  • Text & Vision Tasks: On CIFAR-10/100 and standard NLP benchmarks, ternary (b1.58) models achieve $98$–$100\%$ of full-precision accuracy. For text-to-image (FLUX), a $5.1\times$ inference memory reduction is realized with only a $1$–$2$ point metric drop (Yang et al., 2024).
  • Ablation and Scaling Studies: Doubling hidden size in small LMs or vision models compensates for ternary capacity loss at minimal overhead (Nielsen et al., 2024). For encoder–decoder models, b1.58 sometimes outperforms full-precision with no capacity increase (Nielsen et al., 2024).

Best practices further include knowledge distillation, additional layer-wise normalization (an extra RMSNorm), and gradual quantization schedules ($\lambda$-schedules) to stabilize convergence (Steinmetz et al., 12 May 2025).
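
One common form of such a gradual $\lambda$-schedule blends full-precision and ternarized weights during training, ramping $\lambda$ from 0 to 1; this is a generic sketch and not necessarily the exact formulation of the cited work:

```python
import torch

def blended_ternary_weights(w: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Return (1 - lam) * w + lam * ternarize(w), with an STE-style backward pass."""
    lam = min(1.0, step / max(1, total_steps))            # ramp 0 -> 1 over training
    gamma = w.abs().mean().clamp(min=1e-8)                # layerwise absmean scale
    w_q = (w / gamma).round().clamp(-1, 1) * gamma        # ternarized weights
    # Forward value equals the blend; gradients flow to w as if the blend were identity.
    return w + lam * (w_q - w).detach()
```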

5. Hardware Realizations and Computational Advantages

1.58-bit quantization enables highly efficient hardware implementations:

  • Representation: $3$ states per weight allow $\approx 1.585$ bits of entropy per parameter. Weight packing (e.g., grouping $5$ ternaries into a byte) and entropy coding can reach this theoretical minimum (Kawamura et al., 4 Jun 2025).
  • Accelerators: The BitROM CiROM architecture stores two ternary weights per transistor, achieves $20.8$ TOPS/W (65 nm), and $4\,967$ kB/mm$^2$ density. The computation pipeline eliminates multiply units in favor of conditional add/sub and zero-skipping accumulators (Zhang et al., 10 Sep 2025, Ma et al., 2024).
  • Kernels: Custom GPU/ASIC kernels realize $7.7\times$ storage and $5.1\times$ RAM reductions (FLUX), while dedicated ternary matmul logic delivers $4$–$8\times$ theoretical speedups over GEMMs (Yang et al., 2024, Nielsen et al., 2024).
  • Inference Efficiency: Multiplication becomes sign-tested addition/subtraction (for $W_{ij} \in \{-1, 0, 1\}$), supporting aggressive pipelining and bit-packed storage.
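
A minimal Python sketch of this multiplication-free inner product (illustrative only; real kernels operate on bit-packed codes and vectorize the add/sub and zero-skip logic):

```python
import numpy as np

def ternary_matvec(codes: np.ndarray, gamma: float, x: np.ndarray) -> np.ndarray:
    """Compute y = (gamma * codes) @ x using only adds/subtracts on the weight side.

    codes: [out, in] int8 entries in {-1, 0, +1}  (the stored ternary weights)
    gamma: layerwise float scale
    x:     [in] float activations
    """
    y = np.empty(codes.shape[0], dtype=np.float32)
    for i, row in enumerate(codes):
        # Sign-tested accumulation: zeros are skipped, no per-element multiplications.
        y[i] = gamma * (x[row == 1].sum() - x[row == -1].sum())
    return y
```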

6. Extensions, Limitations, and Applications

  • Mixed-Precision and Hybrid Schemes: 1.58-bit backbones can be coupled to low-rank FP16 correction (Hybrid Gated Flow) for recovery of up to $55\%$ of the quality gap at minimal overhead (total $1.68$ bits/weight) (Pizzo, 5 Feb 2026).
  • Scaling Laws: Dedicated analyses in ParetoQ and BitNet b1.58 confirm a new scaling law in the ternary regime; the information capacity $N_{\text{eff}} = N \times \frac{1.58}{16}$ yields an accuracy–size Pareto frontier often superior to 2- and 4-bit baselines (Ma et al., 2024, Liu et al., 4 Feb 2025); a worked example follows this list.
  • QAT Transitions: Continual pre-training with early 16-to-1.58 bit transitions outperforms training from scratch at 1.58 bits on LLM benchmarks (Nielsen et al., 17 Feb 2025).
  • Limitations: In small LMs, hidden sizes must be inflated ($\sim 2\times$) to achieve comparable PPL; for vision, fine-grained textures can degrade in ultra-low bits. Some architectural choices (RMSNorm, bias-free linears) enhance stability (Nielsen et al., 2024, Steinmetz et al., 12 May 2025, Yang et al., 2024).
  • Applications: Edge LLMs, on-device TTS, real-time text-to-image, privacy-preserving DNNs, and resource-limited inference are among the principal domains benefiting from 1.58-bit quantization (Kawamura et al., 4 Jun 2025, Yang et al., 2024, Zhang et al., 17 Dec 2025).
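
As a worked instance of the scaling-law expression in the list above (our arithmetic, taking $N = 8 \times 10^{9}$ as in the $8$B-parameter models of Section 4):

$N_{\text{eff}} = 8 \times 10^{9} \times \frac{1.58}{16} \approx 7.9 \times 10^{8},$

i.e., the ternary model carries roughly the raw information capacity of a $0.79$B-parameter FP16 model while retaining the full parameter count.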

7. Privacy and Security Implications

Aggressive 1.58-bit PTQ reduces vulnerability to membership inference attacks by up to an order of magnitude relative to FP16, indicating possible benefits for privacy-by-design (Zhang et al., 17 Dec 2025). Adjusting the final or input layer to higher bit-widths restores accuracy with partial retention of privacy gains, allowing fine-grained control along the privacy–utility spectrum.


For a comprehensive set of empirical benchmarks, detailed algorithms, and scaling law investigations, see (Ma et al., 2024, Nielsen et al., 2024, Steinmetz et al., 12 May 2025, Kawamura et al., 4 Jun 2025, Zhang et al., 10 Sep 2025, Tao et al., 20 Mar 2025, Yang et al., 2024, Xia et al., 27 Sep 2025, Pizzo, 5 Feb 2026, Wang et al., 28 Jan 2026, Nielsen et al., 2024, Zhang et al., 17 Dec 2025, Liu et al., 4 Feb 2025), and (Nielsen et al., 17 Feb 2025).
