Ternary Quantization in Neural Networks

Updated 2 April 2026

Ternary quantization is a technique that maps real-valued network weights to {-1, 0, +1} using thresholding and scaling to optimize model efficiency and performance.
It employs methods like threshold-based assignment, scaling, and cosine similarity maximization to minimize quantization error while enhancing hardware compatibility.
The method supports both post-training and quantization-aware training, offering a balanced trade-off between compression, speed, and accuracy.

A ternary quantization scheme reduces neural network weights (and sometimes activations) to the discrete alphabet {−1, 0, +1}, typically with an attached positive scale or set of scales, enabling efficient storage, computation, and deployment on constrained hardware. It balances the competing goals of model size, computational efficiency, and accuracy by adding an explicit zero state to the weight alphabet, which binary quantization lacks and which higher-precision schemes do not exploit as efficiently.

1. Quantization Principles and Mathematical Formulation

Ternary quantization maps a real-valued weight vector $W \in \mathbb{R}^n$ to a ternary vector $T \in \{-1, 0, +1\}^n$, optionally applying a scaling factor $\alpha > 0$ per layer, channel, or block. The core problem is to choose, for each weight (or block), which of the three values it is quantized to, together with the corresponding scale(s), so as to minimize the quantization error under some measure (e.g., Euclidean, cosine, or task-specific loss).

Several formulations are commonly used:

Threshold-based quantization: Use a symmetric threshold $\Delta>0$ (possibly learned or analytically set) to assign ternary values (Li et al., 2016, Zhu et al., 2016, Liu et al., 2022, Xu et al., 2022)

$q(w; \Delta) = \begin{cases} +1, & w > +\Delta \\ 0, & |w| \le \Delta \\ -1, & w < -\Delta \end{cases}$

Scaling: After quantization, a scale $\alpha$ matches the magnitude of the original weights. A typical closed form is

$\alpha = \frac{1}{|I_{\Delta}|} \sum_{i \in I_{\Delta}} |w_i|$

where $I_{\Delta} = \{i : |w_i| > \Delta\}$ (Li et al., 2016).

Optimal threshold and scaling: Joint minimization of quantization error with respect to $T$ and $\alpha$ via iterative or closed-form methods, including support and mass equalization (Yvinec et al., 2023) and truncated Gaussian statistics (He et al., 2018).
Cosine similarity maximization: As in the Target None-retraining Ternary (TNT) method, the ternary vector $T^*$ is chosen such that

$T^* = \arg\max_{T \in \{-1, 0, +1\}^n} \frac{W^\top T}{\|W\|_2 \, \|T\|_2}$

leading to an efficient $O(n \log n)$ sort-and-scan algorithm (Zhang et al., 2019).
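The sort-and-scan idea behind cosine-similarity maximization can be sketched in a few lines. This is an illustrative sketch, not the TNT reference implementation: the function name `ternarize_max_cosine` is hypothetical, and it relies on the observation that, for a fixed number of nonzeros k, the best ternary vector puts matching signs on the k largest-magnitude weights, so only n candidates need to be scanned after sorting.

```python
import math

def ternarize_max_cosine(w):
    """Return (t, cos): a ternary vector t maximizing cosine similarity with w."""
    n = len(w)
    norm_w = math.sqrt(sum(v * v for v in w))
    if norm_w == 0.0:
        return [0] * n, 0.0
    # Sort indices by decreasing magnitude: O(n log n).
    order = sorted(range(n), key=lambda i: abs(w[i]), reverse=True)
    best_k, best_cos, prefix = 0, -1.0, 0.0
    # Scan: with nonzeros on the k largest |w_i|, cos = prefix_k / (||w|| * sqrt(k)).
    for k, i in enumerate(order, start=1):
        prefix += abs(w[i])
        cos = prefix / (norm_w * math.sqrt(k))
        if cos > best_cos:
            best_k, best_cos = k, cos
    t = [0] * n
    for i in order[:best_k]:
        t[i] = 1 if w[i] > 0 else -1
    return t, best_cos

t, cos = ternarize_max_cosine([0.9, -0.8, 0.05, 0.0, -0.1])
```

For this small input, the two large weights keep their signs and the three small ones are zeroed. The scale is not needed for the cosine objective itself and can be fitted afterwards.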

The quantization function is often applied per channel, per filter, or per group for convolutional layers, and per row or block for linear layers, with blockwise scales used in certain architectures to enable hardware-friendly packing (Huang et al., 12 Jan 2026, Yoon, 30 Mar 2026).
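As a minimal sketch of threshold-based ternarization with one scale per row, the following applies the threshold rule and the mean-magnitude scale to each row of a small matrix. The helper names and the default `delta_ratio=0.7` (a threshold set to 0.7 times the mean absolute weight, a heuristic associated with ternary weight networks) are assumptions for illustration and are not prescribed by the text above.

```python
def ternarize_row(w, delta_ratio=0.7):
    """Ternarize one row: returns (t, alpha, delta)."""
    if not w:
        return [], 0.0, 0.0
    # Assumed heuristic: threshold proportional to the mean absolute weight.
    delta = delta_ratio * sum(abs(v) for v in w) / len(w)
    # q(w; delta): +1 above delta, -1 below -delta, 0 otherwise.
    t = [1 if v > delta else -1 if v < -delta else 0 for v in w]
    # alpha is the mean magnitude of the weights that survived the threshold.
    kept = [abs(v) for v in w if abs(v) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return t, alpha, delta

def ternarize_per_row(matrix, delta_ratio=0.7):
    """Apply ternarize_row independently to each row (one scale per row)."""
    return [ternarize_row(row, delta_ratio) for row in matrix]

weights = [[0.5, -0.2, 0.05, -0.9],
           [0.01, 0.02, -0.03, 0.0]]
for t, alpha, delta in ternarize_per_row(weights):
    dequantized = [alpha * ti for ti in t]  # what inference would effectively use
```

Keeping a separate scale per row lets rows with very different magnitudes share the same ternary alphabet. Per-channel or per-block variants only change how the weights are grouped before calling the row function.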

2. Algorithms and Training Methodologies

The deployment of ternary quantized networks can be achieved via several pipelines, including post-training quantization (PTQ), quantization-aware training (QAT), or hybrid methods.

PTQ: Weights are quantized after full-precision training, using closed-form or data-driven selection of thresholds and scales. TNT achieves this without retraining by maximizing cosine similarity, while MQuant optimizes mass equalization for mean-square loss (Zhang et al., 2019, Yvinec et al., 2023).
QAT: The quantization operation is inserted into the forward pass during training, and the network is trained or fine-tuned with a straight-through estimator (STE) for non-differentiable operations, often augmented with regularization. Hyperspherical and loss-aware regularizers are applied to align weight directions prior to hard ternarization, substantially closing the accuracy gap to full precision (Liu et al., 2022, Liu et al., 2022).
Learnable thresholds and soft quantization: Approaches such as Soft Threshold Ternary Networks (STTN) replace hard thresholds with learnable, smooth parameterizations (e.g., sums of shifted sign or sigmoid functions), substantially improving trainability and accuracy (Xu et al., 2022).
Simultaneous quantizer and weight optimization: Some frameworks integrate the quantization operator's thresholds as trainable variables, co-optimized with weights via surrogate gradients. The truncated Gaussian approach yields analytic updates for both (He et al., 2018).
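To illustrate how a hard threshold can be replaced by a smooth one, the sketch below writes the hard ternary function as a sum of two shifted sign functions and then swaps each sign for a sigmoid. This is a generic illustration under assumed names (`soft_ternary`, `temperature`), not the exact parameterization used by STTN or any other cited method.

```python
import math

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def _sign(v):
    return (v > 0) - (v < 0)

def hard_ternary(w, delta):
    """Sum of two shifted signs; equals q(w; delta) away from |w| == delta."""
    return 0.5 * (_sign(w - delta) + _sign(w + delta))

def soft_ternary(w, delta, temperature=0.05):
    """Smooth surrogate: differentiable in both w and delta."""
    k = 1.0 / temperature  # smaller temperature -> closer to the hard function
    return _sigmoid(k * (w - delta)) + _sigmoid(k * (w + delta)) - 1.0

values = [soft_ternary(w, delta=0.5) for w in (-2.0, 0.0, 0.5, 2.0)]
```

Because the soft version has a nonzero derivative with respect to `delta`, the threshold can in principle be trained by gradient descent. Lowering the temperature during training moves the surrogate toward the hard function.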

A schematic training algorithm in QAT typically consists of:

Forward: Quantize real-valued weights, optionally compute or learn the threshold $\Delta$ and scale $\alpha$, and propagate quantized weights.
Backward: Use STE for threshold and sign functions, possibly with custom gradient rescaling (e.g., simulating sigmoid derivatives) (Liu et al., 2022).
Update: Apply SGD, Adam, or custom signSGD (for ternary-adapted layers) (Chen et al., 24 May 2025).
Optional regularization terms enforce closeness of real-valued and quantized weights (e.g., a norm-based penalty (Sundaram et al., 2024)), angular similarity (Liu et al., 2022), or alignment with the quantization grid (Chen et al., 24 May 2025).
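The forward, backward, and update steps above can be sketched for a single linear neuron with a squared-error loss. This is a simplified illustration with hypothetical names (`qat_step`, `delta_ratio`): it uses plain SGD, treats the scale and threshold as constants in the backward pass, and omits the regularizers and gradient rescaling mentioned above.

```python
def ternarize(w, delta_ratio=0.7):
    """Hard ternarization with a mean-magnitude scale (assumed heuristic threshold)."""
    delta = delta_ratio * sum(abs(v) for v in w) / len(w)
    t = [1 if v > delta else -1 if v < -delta else 0 for v in w]
    kept = [abs(v) for v in w if abs(v) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return t, alpha

def qat_step(w, x, y, lr=0.1):
    """One SGD step on 0.5 * (w_q . x - y)^2 with a straight-through estimator."""
    # Forward: the prediction uses the quantized weights, not the latent ones.
    t, alpha = ternarize(w)
    w_q = [alpha * ti for ti in t]
    err = sum(q * xi for q, xi in zip(w_q, x)) - y
    # Backward: gradient with respect to the quantized weights.
    grad_w_q = [err * xi for xi in x]
    # STE: treat d(w_q)/d(w) as the identity, so latent weights receive grad_w_q.
    new_w = [wi - lr * g for wi, g in zip(w, grad_w_q)]
    return new_w, 0.5 * err * err

w = [0.5, -0.2, 0.05, -0.9]          # latent full-precision weights
for x, y in [([1, 0, 0, 0], 0.0), ([0, 1, 0, 0], 1.0)]:
    w, loss = qat_step(w, x, y)
```

Note that the second sample updates a latent weight whose ternary value is currently 0. This is the point of keeping full-precision latent weights: a weight can drift across the threshold and change its ternary assignment over the course of training.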

3. Hardware and Compression Considerations

Ternary quantization is particularly compelling for resource-constrained inference due to dramatic reductions in model size, memory traffic, and computational cost:

Model size: Packing 2 bits per weight yields 16× compression over FP32; advanced schemes (block sparsity, interleaved encoding) achieve 1.25–1.67 bits/weight (Huang et al., 12 Jan 2026, Wang et al., 17 Feb 2025).
Sparse storage: Block-structured sparsity (e.g., 3:4 pattern per block) enables regular packing for fast lookup and SIMD alignment (Huang et al., 12 Jan 2026), and hardware such as TOM’s ROM/SRAM hybrid accelerator leverages the high zero rate for extreme density (Guan et al., 24 Feb 2026).
MAC reduction: At inference, multiplications with ternary weights reduce to accumulations, sign flips, and conditional skipping of zeros. System kernels exploit INT2 representations and element-level LUTs to accelerate mixed-precision matrix-matrix multiplications, achieving speedups of roughly 2×–6× over FP baselines on real edge devices (Wang et al., 17 Feb 2025).
Blockwise quantization: Advanced methods (e.g., ITQ3_S) pre-rotate blocks via FWHT to Gaussianize distributions, ternarize in the transformed domain, and invert with high fidelity in the backend, optimizing for both compression and throughput (Yoon, 30 Mar 2026).
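A minimal sketch of the 2-bit packing and multiplication-free accumulation described above is given below. The code assignment (00 for 0, 01 for +1, 10 for −1), the little-endian placement within each byte, and the function names are assumptions chosen for illustration. Production kernels use SIMD and lookup tables rather than a Python loop.

```python
def pack_ternary(t):
    """Pack ternary values into bytes, four weights per byte, 2 bits each."""
    codes = {0: 0b00, 1: 0b01, -1: 0b10}   # assumed encoding; 0b11 is unused
    out = bytearray((len(t) + 3) // 4)
    for i, v in enumerate(t):
        out[i // 4] |= codes[v] << (2 * (i % 4))
    return bytes(out)

def ternary_dot(packed, n, x, alpha=1.0):
    """Dot product of n packed ternary weights with x using only add/subtract."""
    acc = 0.0
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 0b01:
            acc += x[i]        # weight +1: accumulate
        elif code == 0b10:
            acc -= x[i]        # weight -1: sign flip, then accumulate
        # code 0b00: weight 0, skipped entirely
    return alpha * acc         # a single multiply by the scale per output

t = [1, -1, 0, 1, 0, -1]
packed = pack_ternary(t)
result = ternary_dot(packed, len(t), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], alpha=0.5)
```

With four weights per byte, a tensor stored as FP32 (4 bytes per weight) shrinks by the 16× factor quoted above, up to padding in the final byte.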

A summary table of representative bit-widths and model sizes:

Approach	Bits/Weight	Key Hardware Aspect	Achieved Compression
Standard Ternary	2.0	Bit-packed INT2	up to 16×
Sherry	1.25	3:4 sparse, 5-bit	25%–50% better than 2b
Bitnet.cpp TL2	1.67	Element-wise 3-group	1.67b/w, fast GEMM
ITQ3_S	3.125	FWHT, 96B/256 block	3.125b/w, high-fidelity
Adaptive Binary/Ternary	1–2	Per-layer schedule	up to 32× (mixed)

Empirical speed and memory gains range from 3–6× on CPUs to more than 10× on bandwidth- or memory-bound accelerators (Wang et al., 17 Feb 2025, Huang et al., 12 Jan 2026, Guan et al., 24 Feb 2026).
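The fractional bit-widths in the table can be reproduced with simple counting, as in the sketch below. The group encodings are assumptions made for illustration: three weights sharing one 5-bit code for the 1.67 figure, and a "3:4" pattern read as exactly one zero in every group of four weights for the 1.25 figure. The actual memory layouts of the cited systems may differ.

```python
import math

def bits_per_weight(num_patterns, group_size):
    """Bits per weight when each group of weights gets one fixed-width code."""
    code_bits = (num_patterns - 1).bit_length()   # ceil(log2(num_patterns))
    return code_bits / group_size

# One ternary weight per code: 3 patterns need 2 bits.
plain = bits_per_weight(3, 1)

# Three ternary weights per code: 3**3 = 27 patterns fit in 5 bits.
grouped = bits_per_weight(3 ** 3, 3)

# Assumed 3:4 sparsity: choose the zero position (4 ways), then 3 signs (2**3 ways).
sparse_patterns = math.comb(4, 1) * 2 ** 3
sparse = bits_per_weight(sparse_patterns, 4)

# Information-theoretic floor for unconstrained ternary weights.
entropy_bound = math.log2(3)

compression_vs_fp32 = {name: 32 / b for name, b in
                       [("plain", plain), ("grouped", grouped), ("sparse", sparse)]}
```

Under these assumptions the sparse pattern count is exactly 32, so a 5-bit code wastes nothing. It also falls below the log2(3) floor only because the sparsity constraint removes patterns from the alphabet.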

4. Accuracy, Trade-offs, and Applications

Ternary quantization yields modest accuracy degradation compared to full-precision baselines when thresholds and scales are jointly optimized and QAT or regularized PTQ is used:

Classification: TWN achieves ~0.3% drop on MNIST and CIFAR-10; ~3–4% on ImageNet (ResNet-18) for full ternarization (Li et al., 2016, Zhang et al., 2019). Advanced QAT approaches reduce the ImageNet gap to 1–2% (ResNet-18 HLA: 68.6% vs FP 69.8%) (Liu et al., 2022).
LLMs: BitNet b1.58 ternary models achieve performance close to 4-bit and mixed-precision LLMs, with 3–6× speed and memory savings (Wang et al., 17 Feb 2025, Sundaram et al., 2024, Huang et al., 12 Jan 2026).
Diffusion and Multimodal Models: TerDiT demonstrates tunable ternary quantization for diffusion transformers, with ≤2% gap in FID/IS on ImageNet (Lu et al., 2024). LLaVaOLMoBitNet1B applies ternary weights to multimodal LLMs with competitive performance and high efficiency (Sundaram et al., 2024).

Application domains span vision, NLP, diffusion models, wireless coding (Neu et al., 2019), and on-device intelligence with co-designed accelerators (Guan et al., 24 Feb 2026).

5. Recent Innovations and Variants

Several notable advances and variants have emerged:

Block Sparse/Hardware-Aligned Ternary: Sherry’s 3:4 block pattern and 5-bit group encoding match hardware word sizes for efficient end-to-end LLM inference (Huang et al., 12 Jan 2026).
Lossless Ternary Adaptation: LoTA-QAF allows lossless merging of ternary adapters into N-bit backbone quantized weights by grid-constrained addition, enabling true low-bit QLoRA (Chen et al., 24 May 2025).
Soft and Learnable Thresholding: STTN and FATNN replace fixed thresholds with learnable, data-driven soft boundaries, outperforming hard Δ approaches by up to 4% on ImageNet (Xu et al., 2022).
Hyperspherical Regularization: Both HLA (Liu et al., 2022) and HQ (Liu et al., 2022) encourage angular alignment of high-precision and ternary weights, reducing STE bias and closing the quantization gap without complex Hessian-aware methods.
Hybrid and Adaptive Bit Allocation: Smart Quantization dynamically chooses per-layer binary vs ternary quantization in a single pass, matching or exceeding fixed-scheme baselines (Razani et al., 2019).
Rotation-domain Preprocessing: ITQ3_S applies FWHT to remove heavy tails and allow more precise ternarization under a fixed bit budget, with mathematically bounded loss and CUDA-fused dequantization (Yoon, 30 Mar 2026).
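The rotate, ternarize, and invert pipeline can be sketched with an orthonormal fast Walsh–Hadamard transform, which is its own inverse. This is a generic illustration and not the ITQ3_S algorithm: the block size, the threshold heuristic, and all function names are assumptions, and whether the rotation reduces error depends on the weight distribution.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y, h = list(x), 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)                  # makes the transform self-inverse
    return [v * scale for v in y]

def ternarize(w, delta_ratio=0.7):
    """Threshold ternarization with a mean-magnitude scale (assumed heuristic)."""
    delta = delta_ratio * sum(abs(v) for v in w) / len(w)
    t = [1 if v > delta else -1 if v < -delta else 0 for v in w]
    kept = [abs(v) for v in w if abs(v) > delta]
    return t, (sum(kept) / len(kept) if kept else 0.0)

def rotate_ternarize(block):
    """Ternarize in the rotated domain, then map back to the weight domain."""
    t, alpha = ternarize(fwht(block))
    reconstruction = fwht([alpha * ti for ti in t])   # inverse rotation
    return t, alpha, reconstruction

block = [3.0, 0.2, -0.1, 0.4, -0.3, 0.1, 0.0, -0.2]   # one outlier, small rest
t, alpha, approx = rotate_ternarize(block)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, approx)))
```

Only the ternary codes and the scale are stored. The inverse rotation is applied at dequantization time, which is why fusing it into the matrix-multiply kernel matters for throughput.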

6. Practical Implementation and Use Cases

Practical deployments rely on the following:

Packing and memory alignment for dense hardware arrays (Huang et al., 12 Jan 2026, Yoon, 30 Mar 2026).
Fine-tuning on quantized weights (QAT or QLoRA) with lossless adapter merging (Chen et al., 24 May 2025, Guan et al., 24 Feb 2026).
Compatibility with standard software/hardware stacks, as in Bitnet.cpp (Wang et al., 17 Feb 2025), with fast mpGEMM and element-wise LUT acceleration.
Toolkit and code resources: Public frameworks available for many recent ternary architectures (e.g., BitNet, LLaVaOLMoBitNet1B, Sherry, TerDiT) (Sundaram et al., 2024, Huang et al., 12 Jan 2026, Lu et al., 2024).

7. Theoretical and Practical Implications

Ternary quantization occupies the regime between binary and multi-bit quantization, offering:

A large configuration space: a 3×3 filter has $3^{9} = 19683$ ternary patterns vs $2^{9} = 512$ binary (Li et al., 2016).
Theoretical global optimality in cosine-similarity maximization (TNT) (Zhang et al., 2019).
Strong trade-off flexibility: single-pass adaptivity, soft thresholding, block/group sparsity, and regularization all improve the Pareto front of memory–compute–accuracy.
Hardware–algorithm co-design: fine-grained block sparsity and ternary pattern matching in logic synthesis, memory layouts, and pipelined MPUs (Guan et al., 24 Feb 2026, Huang et al., 12 Jan 2026).

Ternary quantization underpins state-of-the-art deployments of edge LLMs, vision systems, on-device multimodal models, and efficient polar code decoders (Neu et al., 2019), meeting stringent real-time and energy constraints without severe accuracy loss. Ongoing work addresses optimal code assignment, adaptive gradient estimation, and hardware–software co-optimization for broader adoption and improved fidelity.