Ternary Quantization in Neural Networks
- Ternary quantization is a process that maps real-valued parameters to a discrete set of three values {–α, 0, +α}, enabling efficient model compression and reduced energy consumption.
- It utilizes threshold-based operators and adaptive scaling methods to minimize quantization error and maintain accuracy across neural network architectures.
- This approach achieves significant storage reduction and faster inference on hardware, making it ideal for edge deployment and resource-constrained learning scenarios.
Ternary quantization refers to the process of mapping real-valued parameters (typically neural network weights or activations) to a discrete set of three values $\{-\alpha, 0, +\alpha\}$, where $\alpha > 0$ is a scaling factor. This approach seeks to achieve substantial reductions in model size, memory footprint, and inference energy while maintaining acceptable accuracy in deep learning and other signal-processing systems. Ternary quantization has become a critical methodology for model compression, efficient hardware deployment, and resource-constrained learning scenarios.
1. Mathematical Formulation and Fundamental Operators
Ternary quantization operates by projecting each real-valued scalar to one of three levels. The canonical hard-threshold operator is defined as
$$Q_\Delta(w) = \begin{cases} +\alpha, & w > \Delta,\\ 0, & |w| \le \Delta,\\ -\alpha, & w < -\Delta, \end{cases}$$
where $\Delta > 0$ is a quantization threshold, and $\alpha$ is either fixed or optimized via least-squares or direct learning (Li et al., 2016, Zhu et al., 2016, Liu et al., 2023).
The scaling factor is typically computed for a given set of weights $\{w_i\}$ as
$$\alpha^* = \frac{1}{|I_\Delta|} \sum_{i \in I_\Delta} |w_i|, \qquad I_\Delta = \{\, i : |w_i| > \Delta \,\},$$
minimizing the Euclidean quantization error $\|W - \alpha T\|_2^2$ between the full-precision weights $W$ and their ternary counterpart $T \in \{-1, 0, +1\}^n$ (Li et al., 2016, Zhang et al., 2019).
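As a concrete illustration of the hard-threshold operator and closed-form scale above, here is a minimal NumPy sketch; the heuristic threshold $\Delta \approx 0.7\cdot\mathbb{E}[|w|]$ follows the TWN recipe (Li et al., 2016), while the function name and per-tensor (rather than per-filter) application are our simplifications for illustration.

```python
import numpy as np

def ternarize_twn(w: np.ndarray):
    """Hard-threshold ternarization with a closed-form scale (TWN-style sketch).

    Returns the assignments in {-1, 0, +1}, the scale alpha, and the threshold
    delta; alpha * assignments approximates w in the least-squares sense.
    """
    # Heuristic threshold from TWN: delta ~= 0.7 * mean(|w|).
    delta = 0.7 * np.mean(np.abs(w))

    # Hard assignment to the three levels.
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0

    # Closed-form scale: mean magnitude of the weights that survive the threshold.
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return t, alpha, delta

# Example: quantize a random filter bank (per-tensor here for simplicity;
# TWN applies the same rule per filter).
w = np.random.randn(3, 3, 64, 64).astype(np.float32)
t, alpha, delta = ternarize_twn(w)
w_hat = alpha * t  # ternary approximation of w
```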
Advanced operators include:
- Support equalization (TQuant): Thresholds are chosen so that the three quantization bins split the dynamic range into equal-length intervals (Yvinec et al., 2023).
- Mass equalization (MQuant): Thresholds are set so each bin contains equal probability mass under a reference distribution, minimizing mean squared error (Yvinec et al., 2023).
- Soft-assignment projections: Sigmoid/tanh/Gumbel-softmax relaxations enable differentiable training, improving gradient behavior at quantization boundaries (Liu et al., 2023).
For ternarizing activations, the operator is typically analogous, $Q_{\Delta_a}(x) = \alpha_a \cdot \operatorname{sign}(x) \cdot \mathbb{1}[\,|x| > \Delta_a\,]$, with the threshold $\Delta_a$ and scale $\alpha_a$ chosen analogously from the activation statistics (Xu et al., 2022, Li et al., 2019).
2. Representative Algorithms and Training Strategies
Notable ternary quantization algorithms include:
Ternary Weight Networks (TWN): Direct thresholding and closed-form scale computation per filter, with STE for backward propagation. Achieves roughly 16× compression while staying within 2–3% top-1 accuracy of the full-precision baseline on ImageNet (Li et al., 2016).
Trained Ternary Quantization (TTQ): Jointly learns positive and negative scaling factors and assignments with a fixed threshold; employs STE and per-layer learnable scales. Outperforms prior ternary methods and, in some settings (ResNet-32/44/56 on CIFAR-10), even the full-precision baseline (Zhu et al., 2016); a minimal sketch of this learnable-scale scheme appears after this list.
Soft Threshold Ternary Networks (STTN): Abandons hard thresholding in favor of a dual-binary kernel decomposition, enabling "soft" ternarization of both weights and activations and automatic interval learning, yielding new state-of-the-art accuracy for full-ternary ResNet-18 (68.2% top-1 ImageNet) (Xu et al., 2022).
Hyperspherical Quantization (HQ/HLATQ): Hyperspherical constraints during pre-training, iterative pruning, and loss-aware regularization minimize angular discrepancy before ternary quantization, thereby mitigating gradient bias and enabling 30–50× compression with minimal accuracy drop (Liu et al., 2022, Liu et al., 2022).
Fine-Grained Quantization (FGQ): Groups weights into blocks that share a scaling factor and threshold, dramatically reducing the number of multiplications in inference pipelines and supporting sub-8-bit full-network quantization (Mellempudi et al., 2017).
Adaptive Binary-Ternary (Smart Quantization, SQ): Per-layer learned regularization adaptively determines whether a layer should be binary or ternary, optimizing the trade-off between memory saving and accuracy (Razani et al., 2019).
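To make the learnable-scale scheme concrete (cf. TTQ above), the following hedged PyTorch sketch keeps a full-precision latent weight, assigns it to the three levels with a fixed threshold, scales the positive and negative levels with learnable factors, and uses the straight-through estimator in the backward pass; the class and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer with TTQ-style ternary weights (illustrative sketch)."""

    def __init__(self, in_features: int, out_features: int, threshold: float = 0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.w_p = nn.Parameter(torch.tensor(1.0))   # learnable scale for the +1 level
        self.w_n = nn.Parameter(torch.tensor(1.0))   # learnable scale for the -1 level
        self.threshold = threshold                   # fixed assignment threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        pos = (w > self.threshold).float()           # weights assigned to +w_p
        neg = (w < -self.threshold).float()          # weights assigned to -w_n
        w_q = self.w_p * pos - self.w_n * neg        # ternary weight tensor
        # Straight-through trick: the forward pass uses w_q, while the backward
        # pass sends dL/dw_q unchanged to the latent full-precision weight (STE)
        # and, through w_q, aggregates gradients onto the scales w_p and w_n.
        w_eff = w_q + w - w.detach()
        return F.linear(x, w_eff)

layer = TernaryLinear(128, 64)
out = layer(torch.randn(8, 128))
out.sum().backward()   # gradients reach layer.weight, layer.w_p, and layer.w_n
```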
3. Optimization Methods and Backward Propagation
Optimization during ternary quantized training relies chiefly on the straight-through estimator (STE), which for hard assignments treats the quantizer as the identity in the backward pass, $\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial \hat w}$ with $\hat w = Q_\Delta(w)$ (Li et al., 2016, Zhu et al., 2016, Liu et al., 2023). STE enables gradient flow through the non-differentiable quantization step but introduces bias, which is mitigated in hyperspherical or soft-threshold schemes (Liu et al., 2022, Xu et al., 2022).
Other approaches:
- Proximal-gradient (ProxQuant): Iterative optimization using a regularizer enforcing proximity to the ternary grid (Liu et al., 2023).
- ADMM/Alternating minimization: Separates continuous and discrete variables, alternately projecting onto the ternary set and optimizing the loss (Liu et al., 2023).
- Temperature-based soft quantization: Gradually sharpens relaxed quantizers during training, improving assignment fidelity (Liu et al., 2023, Liu et al., 2022).
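As a minimal illustration of the temperature idea (a generic relaxation, not any single paper's exact formulation), the sketch below smooths the hard-threshold operator with two tanh terms; as the temperature shrinks, the soft quantizer approaches the hard ternary map from Section 1.

```python
import torch

def soft_ternary(w: torch.Tensor, alpha: float, delta: float, temperature: float) -> torch.Tensor:
    """Differentiable relaxation of the hard ternary quantizer.

    As temperature -> 0 this converges to:
        +alpha for w > delta,   0 for |w| <= delta,   -alpha for w < -delta.
    """
    return 0.5 * alpha * (torch.tanh((w - delta) / temperature)
                          + torch.tanh((w + delta) / temperature))

# During training the temperature is annealed so the relaxed quantizer
# gradually sharpens toward the hard assignment.
w = torch.randn(5, requires_grad=True)
for T in [1.0, 0.3, 0.1, 0.03]:
    w_q = soft_ternary(w, alpha=1.0, delta=0.5, temperature=T)
    loss = (w_q ** 2).sum()   # placeholder loss; gradients flow through the tanh terms
    loss.backward()
    w.grad = None             # reset before the next (sharper) temperature
```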
Backpropagation updates treat scaling factors (and sometimes thresholds) as learnable network parameters, with gradients computed by aggregation over the assigned sets, e.g.
$$\frac{\partial \mathcal{L}}{\partial W_l^p} = \sum_{i \in I_l^p} \frac{\partial \mathcal{L}}{\partial \hat w_l^{(i)}}, \qquad \frac{\partial \mathcal{L}}{\partial W_l^n} = \sum_{i \in I_l^n} \frac{\partial \mathcal{L}}{\partial \hat w_l^{(i)}},$$
where $I_l^p$ and $I_l^n$ index the weights of layer $l$ assigned to the positive and negative levels (Zhu et al., 2016, Li et al., 2016).
Regularization terms, such as cosine-similarity in TNT (Zhang et al., 2019) or hyperspherical alignment loss (Liu et al., 2022), further improve quantization fidelity.
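A simple way to instantiate such a penalty, offered as a generic sketch rather than the exact TNT or hyperspherical loss, is to penalize the angle between the latent full-precision weights and their ternary projection:

```python
import torch
import torch.nn.functional as F

def cosine_alignment_penalty(w: torch.Tensor, w_q: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Penalty that grows as the ternary weights rotate away from the
    full-precision weights: lam * (1 - cos(w, w_q))."""
    cos = F.cosine_similarity(w.flatten(), w_q.flatten(), dim=0)
    return lam * (1.0 - cos)

# Added to the task loss during quantization-aware training:
#   loss = task_loss + cosine_alignment_penalty(w, w_q)
```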
4. Hardware Implications and Inference Efficiency
Ternary quantization yields substantial advantages for hardware deployment:
- Storage reduction: Each weight in $\{-\alpha, 0, +\alpha\}$ can be encoded in 2 bits, reducing model size by up to roughly 16× relative to 32-bit floating point (Li et al., 2016, Zhu et al., 2016, Mellempudi et al., 2017).
- Computation: Multiplies in MACs are replaced by conditional sign-operations and additions; skip-zero masks induce sparsity, reducing both memory bandwidth and energy (Chen et al., 2020, Gope et al., 2019, Zhu et al., 2016).
- Specialized kernels: Bitwise engines exploit simple encodings and popcount primitives for dot-products; FATNN, for example, reduces the number of bit-operations required per ternary inner product (Chen et al., 2020). A minimal multiply-free dot-product sketch appears at the end of this section.
- Group-wise scaling: FGQ and hybrid filter-bank designs allow block-wise computation, reducing multiply load by 75–99% and enabling high-throughput fixed-point pipelines (Mellempudi et al., 2017, Gope et al., 2019).
- Edge deployment: Ternary LLMs (BitNet b1.58, LLaVaOLMoBitnet1B) with lookup-table and scaled int2 kernels realize inference speedups over FP16 and substantially smaller weight storage (Wang et al., 17 Feb 2025, Sundaram et al., 2024).
Energy consumption per operation can be reduced by a large factor compared to full-precision inference on custom hardware (FPGA/ASIC) (Li et al., 2019).
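To make the multiply-free arithmetic concrete, the sketch below (an illustration of the general idea, not a kernel from the cited papers) treats the ternary assignments as positive/negative masks and accumulates a dot product using only additions and subtractions, followed by a single multiplication by the shared scale.

```python
import numpy as np

def ternary_dot(acts: np.ndarray, assignments: np.ndarray, alpha: float) -> float:
    """Dot product <acts, alpha * assignments> without multiplications
    inside the accumulation.

    assignments holds values in {-1, 0, +1}; on real hardware the two masks
    below would be packed 2 bits per weight and combined with popcount-style
    primitives.
    """
    pos_mask = assignments > 0   # weights quantized to +alpha
    neg_mask = assignments < 0   # weights quantized to -alpha
    # Zero weights are simply skipped, which is where the sparsity savings come from.
    acc = acts[pos_mask].sum() - acts[neg_mask].sum()
    return alpha * acc           # one multiply by the shared scale

acts = np.random.randn(1024).astype(np.float32)
assignments = np.random.choice([-1, 0, 1], size=1024).astype(np.int8)
out = ternary_dot(acts, assignments, alpha=0.07)
# Matches the dense reference: np.dot(acts, 0.07 * assignments)
```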
5. Empirical Performance and Accuracy Trade-offs
Across benchmarks, ternary quantization can achieve performance close to full precision:
- ImageNet: TWN/TTQ incur only a few percent top-1 accuracy drop (ResNet-18); HQ narrows the gap further while delivering 30–50× compression (Liu et al., 2022, Zhu et al., 2016, Li et al., 2016).
- Edge LLMs: BitNet b1.58 and LLaVaOLMoBitnet1B provide memory savings of roughly 10× or more, at the cost of absolute drops of ten or more points on QA benchmarks, while remaining competitive on VQA tasks (Sundaram et al., 2024, Wang et al., 17 Feb 2025).
- MobileNets: Per-layer hybrid filter banks halve model size and energy with little accuracy loss (Gope et al., 2019).
- CIFAR-10 / MNIST: TTQ and TWN often match or beat full-precision for medium-depth networks (Zhu et al., 2016, Li et al., 2016, Zhang et al., 2019).
Statistical analysis indicates that, for certain sparse feature spaces, ternary quantization can improve feature discrimination and classification accuracy over unquantized data, providing “free” denoising and signal selection (Lu et al., 18 Apr 2025, Lu et al., 2022).
Typical compression ratios range from roughly 16× (2-bit weight encoding) to 30–50× when combined with pruning, with inference speedups of 2× or more, and accuracy drops contingent on the architecture, depth, and quantization methodology (Li et al., 2016, Mellempudi et al., 2017, Liu et al., 2023).
6. Extensions, Variants, and Limitations
Several extensions and refinements exist:
- Mixed Precision / Adaptive Depth: Smart Quantization adaptively determines per-layer binary or ternary depth, balancing memory savings and accuracy (Razani et al., 2019).
- Group-wise, per-channel, or per-filter scaling: Improves representation and accuracy in heterogeneous layers (Liu et al., 2023, Gope et al., 2019).
- Non-retraining post-hoc quantization (TNT): Rapid, theoretically optimal ternary mapping by cosine similarity without retraining, at the cost of some accuracy for large networks (Zhang et al., 2019).
- Hyperspherical methods: Regularization prior to quantization improves gradient matching and mitigates bias (Liu et al., 2022, Liu et al., 2022).
- Federated learning: FTTQ and T-FedAvg exploit ternary quantization for ultra-low-cost communication, with theoretical unbiasedness and reduced weight divergence (Xu et al., 2020).
- Fine-Grained Quantization: Enables almost full-precision accuracy at extreme speedups by block-wise scaling (Mellempudi et al., 2017).
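As a hedged sketch of such group-wise scaling (the function name, group size, and the 0.7 threshold heuristic borrowed from TWN are our choices for illustration), the code below splits a weight tensor into fixed-size blocks, each with its own threshold and scale:

```python
import numpy as np

def ternarize_groupwise(w: np.ndarray, group_size: int = 64):
    """Block-wise ternarization: each group of `group_size` weights gets its
    own threshold and scale, as in fine-grained / per-group schemes."""
    flat = w.reshape(-1)
    pad = (-len(flat)) % group_size                  # pad so groups divide evenly
    flat = np.pad(flat, (0, pad))
    groups = flat.reshape(-1, group_size)

    deltas = 0.7 * np.abs(groups).mean(axis=1, keepdims=True)    # per-group threshold
    t = np.where(groups > deltas, 1.0, np.where(groups < -deltas, -1.0, 0.0))

    mask = np.abs(groups) > deltas
    counts = mask.sum(axis=1, keepdims=True)
    # Per-group scale: mean magnitude of surviving weights (0 if a group is all zeros).
    alphas = np.where(counts > 0,
                      (np.abs(groups) * mask).sum(axis=1, keepdims=True)
                      / np.maximum(counts, 1),
                      0.0)
    w_hat = (alphas * t).reshape(-1)[:w.size].reshape(w.shape)
    return w_hat, t, alphas

w = np.random.randn(256, 256).astype(np.float32)
w_hat, t, alphas = ternarize_groupwise(w, group_size=64)
```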
Limitations typically relate to:
- Accuracy degradation in large or high-capacity models at extreme compression levels without fine-tuning (Zhang et al., 2019, Mellempudi et al., 2017).
- Hardware execution dependency: Realized speedup/memory savings depend on software kernels and accelerator logic (e.g., SIMD, DSP, bitwise LUT) (Wang et al., 17 Feb 2025, Chen et al., 2020).
- STE-induced bias, especially near quantizer boundaries, unless mitigated by regularization or sophisticated projection (Liu et al., 2022, Liu et al., 2022, Xu et al., 2022).
- Ternary networks generally outperform binary for comparable compression ratios but require more memory and compute than strict binarization (Li et al., 2016, Razani et al., 2019).
7. Recent Directions and Practical Guidelines
Contemporary research pursues:
- Multimodal ternary LLM deployment: Optimizing training, quantization, and inference for edge devices and multimodal input scenarios (Sundaram et al., 2024, Wang et al., 17 Feb 2025).
- Hard vs. soft thresholding trade-off: STTN and hyperspherical models show learned interval adaptation can close the quantization-accuracy gap (Xu et al., 2022, Liu et al., 2022).
- Statistical operator design: Support and mass equalization operators establish strong baselines for quantization-aware training (QAT), post-training quantization (PTQ), and data-free quantization (DFQ), outperforming naive rounding in deep networks (Yvinec et al., 2023).
Recommended best practices include:
- Pretraining a full-precision model and retaining a full-precision copy of the weights during ternary SGD or QAT (Liu et al., 2023, Zhu et al., 2016).
- Layer-wise or group-wise choice of scaling factors and thresholds, optimized for local statistics (Yvinec et al., 2023, Mellempudi et al., 2017, Gope et al., 2019).
- Monitoring the fraction of weights assigned to zero to avoid output collapse or excessive sparsity (Liu et al., 2023); a minimal monitoring snippet follows this list.
- Preferring ternary over binary quantization for tasks that need accuracy close to full precision or that benefit from exploitable sparsity (Li et al., 2016, Zhu et al., 2016).
- Empirically tuning quantization parameters for each architecture and dataset; leveraging hyperspherical regularization when feasible (Liu et al., 2022, Liu et al., 2022).
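For the zero-assignment monitoring practice above, a minimal check (ours, not from the cited work) can log the fraction of weights mapped to the zero level per layer:

```python
import torch

def zero_fraction(ternary_weights: torch.Tensor) -> float:
    """Fraction of weights assigned to the zero level; values near 1.0 signal
    that the threshold is too aggressive and the layer may collapse."""
    return (ternary_weights == 0).float().mean().item()

# Example: warn if a layer becomes almost entirely zero.
w_q = torch.tensor([0.0, 0.5, -0.5, 0.0, 0.0])
if zero_fraction(w_q) > 0.95:
    print("Warning: excessive sparsity; consider lowering the threshold.")
```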
Ternary quantization provides a robust, flexible framework for neural compression, federated learning, multimodal LLM deployment, and efficient edge inference, offering a favorable balance between bit-width, memory savings, inference throughput, and task accuracy across a range of settings (Liu et al., 2023, Wang et al., 17 Feb 2025, Liu et al., 2022, Sundaram et al., 2024).