Ternary Quantization in Neural Networks

Updated 7 January 2026
  • Ternary quantization is a process that maps real-valued parameters to a discrete set of three values {–α, 0, +α}, enabling efficient model compression and reduced energy consumption.
  • It utilizes threshold-based operators and adaptive scaling methods to minimize quantization error and maintain accuracy across neural network architectures.
  • This approach achieves significant storage reduction and faster inference on hardware, making it ideal for edge deployment and resource-constrained learning scenarios.

Ternary quantization refers to the process of mapping real-valued parameters (typically neural network weights or activations) to a discrete set of three values, $\{-\alpha, 0, +\alpha\}$, where $\alpha > 0$ is a scaling factor. This approach seeks to achieve substantial reductions in model size, memory footprint, and inference energy while maintaining acceptable accuracy in deep learning and other signal-processing systems. Ternary quantization has become a critical methodology for model compression, efficient hardware deployment, and resource-constrained learning scenarios.

1. Mathematical Formulation and Fundamental Operators

Ternary quantization operates by projecting each real-valued scalar $w$ to one of three levels. The canonical hard-threshold operator is defined as

$$Q(w) = \begin{cases} +\alpha & w > \Delta, \\ 0 & |w| \le \Delta, \\ -\alpha & w < -\Delta, \end{cases}$$

where $\Delta \ge 0$ is a quantization threshold, and $\alpha > 0$ is either fixed or optimized via least-squares or direct learning (Li et al., 2016, Zhu et al., 2016, Liu et al., 2023).

The scaling factor is typically computed for a given set of weights as

$$\alpha^* = \frac{\sum_{i:|w_i| > \Delta} |w_i|}{\#\{i : |w_i| > \Delta\}},$$

minimizing the Euclidean quantization error between the full-precision and ternary weights (Li et al., 2016, Zhang et al., 2019).
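
A minimal NumPy sketch of the hard-threshold operator and the closed-form scale above; the function name and example data are illustrative and not taken from any cited implementation.

```python
import numpy as np

def ternarize(w: np.ndarray, delta: float):
    """Hard-threshold ternarization: returns the quantized weights and the
    least-squares-optimal scale alpha for the resulting assignment."""
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0

    # alpha* = mean of |w_i| over the assigned (non-zero) positions
    assigned = np.abs(w) > delta
    alpha = np.abs(w[assigned]).mean() if assigned.any() else 0.0
    return alpha * t, alpha

# Example: quantize a random weight vector with a fixed threshold
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000)
w_q, alpha = ternarize(w, delta=0.05)
print(alpha, np.mean(w_q == 0))   # learned scale and resulting sparsity
print(np.sum((w - w_q) ** 2))     # Euclidean quantization error
```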

Advanced operators include:

  • Support equalization (TQuant): Thresholds are chosen so that the three quantization bins split the dynamic range into equal-length intervals (Yvinec et al., 2023).
  • Mass equalization (MQuant): Thresholds are set so each bin contains equal probability mass under a reference distribution, minimizing mean squared error (Yvinec et al., 2023); both threshold rules are sketched in code after this list.
  • Soft-assignment projections: Sigmoid/tanh/Gumbel-softmax relaxations enable differentiable training, improving gradient behavior at quantization boundaries (Liu et al., 2023).
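
The two equalization rules can be illustrated with the simplified sketch below, where the empirical weight distribution stands in for the reference distribution; this is a simplified reading of the TQuant/MQuant operators of Yvinec et al. (2023), not their exact procedure.

```python
import numpy as np

def support_equalized_threshold(w: np.ndarray) -> float:
    """Split the symmetric dynamic range [-max|w|, max|w|] into three
    equal-length intervals; the inner bin is [-Delta, Delta]."""
    return float(np.abs(w).max() / 3.0)

def mass_equalized_threshold(w: np.ndarray) -> float:
    """Choose Delta so roughly one third of the (empirical) mass falls in
    each bin; for a symmetric distribution this is the 1/3 quantile of |w|."""
    return float(np.quantile(np.abs(w), 1.0 / 3.0))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=10_000)
print(support_equalized_threshold(w), mass_equalized_threshold(w))
```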

For ternarizing activations, the operator is typically similar: $A_i^t = \mathrm{sign}(A_i) \cdot \mathbf{1}_{|A_i| > \Delta_a}$, with $\Delta_a$ chosen analogously for activations (Xu et al., 2022, Li et al., 2019).

2. Representative Algorithms and Training Strategies

Notable ternary quantization algorithms include:

Ternary Weight Networks (TWN): Direct thresholding and closed-form scale computation per filter, with the straight-through estimator (STE) for backward propagation. Achieves $16\times$ compression while staying within 2–3% top-1 accuracy of full-precision baselines on ImageNet (Li et al., 2016).
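
As an illustration of the per-filter procedure, the sketch below applies the threshold heuristic $\Delta \approx 0.7\,\mathbb{E}[|w|]$ reported for TWN to each output filter of a convolutional weight tensor; the tensor shape and function name are assumptions made for this example.

```python
import numpy as np

def twn_ternarize_per_filter(W: np.ndarray):
    """Per-filter TWN-style ternarization for a conv weight tensor of shape
    (out_channels, in_channels, kH, kW). Returns the ternary tensor and the
    per-filter scales."""
    out_channels = W.shape[0]
    W_flat = W.reshape(out_channels, -1)

    # Threshold heuristic from Li et al. (2016): Delta ~= 0.7 * E[|w|] per filter
    delta = 0.7 * np.abs(W_flat).mean(axis=1, keepdims=True)
    t = np.where(W_flat > delta, 1.0, np.where(W_flat < -delta, -1.0, 0.0))

    # Closed-form scale: mean |w| over each filter's non-zero assignments
    assigned = np.abs(W_flat) > delta
    counts = assigned.sum(axis=1)
    alpha = (np.abs(W_flat) * assigned).sum(axis=1) / np.maximum(counts, 1)

    W_t = (alpha[:, None] * t).reshape(W.shape)
    return W_t, alpha

W = np.random.default_rng(1).normal(scale=0.05, size=(8, 3, 3, 3))
W_t, alpha = twn_ternarize_per_filter(W)
print(alpha.shape, np.mean(W_t == 0))  # one scale per filter, plus sparsity
```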

Trained Ternary Quantization (TTQ): Jointly learns positive and negative scaling factors and assignments with a fixed threshold; employs STE and per-layer learnable scales. Outperforms prior ternary methods and, in some settings, even full-precision baselines (ResNet-32/44/56 on CIFAR-10) (Zhu et al., 2016).

Soft Threshold Ternary Networks (STTN): Abandons hard thresholding in favor of a dual-binary kernel decomposition, enabling "soft" ternarization of both weights and activations and automatic interval learning, yielding new state-of-the-art accuracy for full-ternary ResNet-18 (68.2% top-1 ImageNet) (Xu et al., 2022).
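
The decomposition idea can be illustrated as follows: a ternary tensor splits into two binary tensors whose average reproduces it, so a ternary inner product reduces to two binary (XNOR/popcount-friendly) inner products. This is a schematic of the general identity, not STTN's exact training formulation.

```python
import numpy as np

def dual_binary_decompose(t: np.ndarray):
    """Split a ternary tensor t in {-1, 0, +1} into two binary tensors
    b1, b2 in {-1, +1} with t = (b1 + b2) / 2."""
    b1 = np.where(t >= 0, 1.0, -1.0)
    b2 = np.where(t > 0, 1.0, -1.0)
    return b1, b2

t = np.array([-1.0, 0.0, 1.0, 0.0, -1.0])
x = np.array([0.3, -1.2, 0.7, 0.5, -0.4])
b1, b2 = dual_binary_decompose(t)

assert np.allclose(t, (b1 + b2) / 2)
# Ternary inner product as the average of two binary inner products
assert np.isclose(t @ x, 0.5 * (b1 @ x + b2 @ x))
```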

Hyperspherical Quantization (HQ/HLATQ): Hyperspherical constraints during pre-training, iterative pruning, and loss-aware regularization minimize angular discrepancy before ternary quantization, thereby mitigating gradient bias and enabling $30$–$50\times$ compression with minimal accuracy drop (Liu et al., 2022, Liu et al., 2022).

Fine-Grained Quantization (FGQ): Groups weights into blocks that share a scaling factor and threshold, dramatically reducing the number of multiplications in inference pipelines and supporting sub-8-bit full-network quantization (Mellempudi et al., 2017).
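
A sketch of group-wise ternarization in the spirit of FGQ: weights are partitioned into fixed-size blocks, each with its own threshold and scale. The block size and threshold rule here are illustrative choices, not those of Mellempudi et al. (2017).

```python
import numpy as np

def blockwise_ternarize(w: np.ndarray, block_size: int = 64):
    """Ternarize a flat weight vector block by block, with one threshold
    and one scale per block (padding the tail block with zeros)."""
    n = w.size
    pad = (-n) % block_size
    blocks = np.pad(w, (0, pad)).reshape(-1, block_size)

    # Illustrative per-block threshold: 0.7 * mean |w| within the block
    delta = 0.7 * np.abs(blocks).mean(axis=1, keepdims=True)
    t = np.where(blocks > delta, 1.0, np.where(blocks < -delta, -1.0, 0.0))

    assigned = np.abs(blocks) > delta
    alpha = (np.abs(blocks) * assigned).sum(axis=1) / np.maximum(assigned.sum(axis=1), 1)

    w_q = (alpha[:, None] * t).reshape(-1)[:n]
    return w_q, alpha

w = np.random.default_rng(2).normal(scale=0.02, size=1000)
w_q, alpha = blockwise_ternarize(w, block_size=64)
print(alpha.shape)  # one scale per 64-weight block
```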

Adaptive Binary-Ternary (Smart Quantization, SQ): Per-layer learned regularization adaptively determines whether a layer should be binary or ternary, optimizing the trade-off between memory saving and accuracy (Razani et al., 2019).

3. Optimization Methods and Backward Propagation

Optimization during ternary quantized training relies chiefly on the straight-through estimator (STE), defined as $\frac{\partial Q(w)}{\partial w} \approx \mathbf{1}_{|w| \le 1}$ for hard assignments (Li et al., 2016, Zhu et al., 2016, Liu et al., 2023). STE enables gradient flow through non-differentiable quantization steps but introduces bias, which is mitigated in hyperspherical or soft-threshold schemes (Liu et al., 2022, Xu et al., 2022).
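
A minimal PyTorch sketch of the hard quantizer with an STE backward pass; the clipping interval $[-1, 1]$ follows the definition above, and the class and argument names are illustrative.

```python
import torch

class TernaryQuantSTE(torch.autograd.Function):
    """Hard ternary quantizer Q(w) with a straight-through backward pass."""

    @staticmethod
    def forward(ctx, w, delta, alpha):
        ctx.save_for_backward(w)
        q = torch.zeros_like(w)
        q = torch.where(w > delta, alpha * torch.ones_like(w), q)
        q = torch.where(w < -delta, -alpha * torch.ones_like(w), q)
        return q

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # STE: dQ/dw ~= 1 for |w| <= 1, 0 otherwise; no gradients for delta/alpha here
        mask = (w.abs() <= 1.0).to(grad_output.dtype)
        return grad_output * mask, None, None

w = (torch.randn(16) * 0.5).requires_grad_()
q = TernaryQuantSTE.apply(w, 0.2, 0.4)
q.sum().backward()
print(torch.unique(q), w.grad)
```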

Other approaches:

  • Proximal-gradient (ProxQuant): Iterative optimization using a regularizer enforcing proximity to the ternary grid (Liu et al., 2023).
  • ADMM/Alternating minimization: Separates continuous and discrete variables, alternately projecting onto the ternary set and optimizing the loss (Liu et al., 2023).
  • Temperature-based soft quantization: Gradually sharpens relaxed quantizers during training, improving assignment fidelity (Liu et al., 2023, Liu et al., 2022); a minimal relaxation sketch follows this list.
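
One way to realize such a relaxation is a difference of two sigmoids that sharpens toward the hard operator as the temperature goes to zero; this particular parameterization is an illustrative choice rather than the formulation of any single cited paper.

```python
import torch

def soft_ternary(w, alpha=1.0, delta=0.05, temperature=0.1):
    """Differentiable relaxation of the hard ternary quantizer.
    As temperature -> 0 this approaches alpha * sign(w) * 1[|w| > delta]."""
    pos = torch.sigmoid((w - delta) / temperature)
    neg = torch.sigmoid((-w - delta) / temperature)
    return alpha * (pos - neg)  # gradients are non-zero everywhere

w = torch.linspace(-0.3, 0.3, 7, requires_grad=True)
for T in (0.2, 0.05, 0.01):
    q = soft_ternary(w, temperature=T)
    print(T, q.detach())  # values converge to {-1, 0, +1} as T shrinks
```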

Backpropagation updates treat scaling factors (and sometimes thresholds) as learnable network parameters, with gradients computed by aggregation over assigned sets: $\frac{\partial L}{\partial \alpha} = \sum_{i: |w_i|>\Delta} \frac{\partial L}{\partial \hat w_i}$ (Zhu et al., 2016, Li et al., 2016).
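
The aggregation can be checked directly with automatic differentiation. In the single-shared-scale case below the sum carries the sign $t_i$ of each assignment; with separate positive and negative scales, as in TTQ, it reduces to the per-sign aggregation given above. The weights and loss are illustrative.

```python
import torch

w = torch.tensor([0.8, -0.5, 0.05, 0.3, -0.9])
delta = 0.1
alpha = torch.tensor(0.6, requires_grad=True)

t = torch.where(w > delta, torch.ones_like(w),
                torch.where(w < -delta, -torch.ones_like(w), torch.zeros_like(w)))
w_hat = alpha * t                  # quantized weights, differentiable in alpha
loss = (w_hat ** 2).sum()          # stand-in loss
loss.backward()

# Autograd gradient vs. the aggregation sum_i t_i * dL/dw_hat_i over assigned i
manual = (2 * w_hat.detach() * t).sum()
print(alpha.grad, manual)
```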

Regularization terms, such as cosine-similarity in TNT (Zhang et al., 2019) or hyperspherical alignment loss (Liu et al., 2022), further improve quantization fidelity.

4. Hardware Implications and Inference Efficiency

Ternary quantization yields substantial advantages for hardware deployment:

  • Storage reduction: Each weight in $\{-1, 0, +1\}$ can be encoded in 2 bits, reducing model size by up to $16\times$ (Li et al., 2016, Zhu et al., 2016, Mellempudi et al., 2017); a bit-packing sketch follows this list.
  • Computation: Multiplies in MACs are replaced by conditional sign-operations and additions; skip-zero masks induce sparsity, reducing both memory bandwidth and energy (Chen et al., 2020, Gope et al., 2019, Zhu et al., 2016).
  • Specialized kernels: Bitwise engines exploit simple encodings and popcount primitives for dot-products, e.g., FATNN's reduction from $O(4N)$ to $O(2N)$ bit-ops per inner product (Chen et al., 2020).
  • Group-wise scaling: FGQ and hybrid filter-bank designs allow block-wise computation, reducing multiply load by 75–99% and enabling high-throughput fixed-point pipelines (Mellempudi et al., 2017, Gope et al., 2019).
  • Edge deployment: Ternary LLMs (BitNet b1.58, LLaVaOLMoBitnet1B) with lookup-table and scaled int2 kernels realize $6.3\times$ inference speedup over FP16 and $10\times$ smaller weights for LLMs (Wang et al., 17 Feb 2025, Sundaram et al., 2024).
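
To make the storage figure concrete, the sketch below packs four ternary values into one byte using a 2-bit code; the specific encoding (0 -> 00, +1 -> 01, -1 -> 10) is an arbitrary illustrative choice, not any particular kernel's format.

```python
import numpy as np

_DECODE = np.array([0, 1, -1, 0], dtype=np.int8)   # index by 2-bit code

def pack_ternary(t: np.ndarray) -> np.ndarray:
    """Pack a 1-D array of {-1, 0, +1} into bytes, four values per byte."""
    codes = np.where(t > 0, 0b01, np.where(t < 0, 0b10, 0b00)).astype(np.uint8)
    pad = (-codes.size) % 4
    codes = np.pad(codes, (0, pad)).reshape(-1, 4)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return (codes << shifts).sum(axis=1).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary; n is the original number of values."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return _DECODE[codes.reshape(-1)[:n]]

t = np.random.default_rng(3).integers(-1, 2, size=1000).astype(np.int8)
packed = pack_ternary(t)
assert np.array_equal(unpack_ternary(packed, t.size), t)
print(t.nbytes, packed.nbytes)   # 1000 bytes of int8 -> 250 packed bytes
```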

Energy consumption per operation can be reduced by up to $46\times$ compared to full-precision inference on custom hardware (FPGA/ASIC) (Li et al., 2019).

5. Empirical Performance and Accuracy Trade-offs

Across benchmarks, ternary quantization can achieve performance close to that of full-precision models.

Statistical analysis indicates that, for certain sparse feature spaces, ternary quantization can improve feature discrimination and classification accuracy over unquantized data, providing “free” denoising and signal selection (Lu et al., 18 Apr 2025, Lu et al., 2022).

Typical compression ratios are around $16\times$, with inference speedups of $2$–$15\times$; accuracy drops are contingent on the architecture, depth, and quantization methodology (Li et al., 2016, Mellempudi et al., 2017, Liu et al., 2023).

6. Extensions, Variants, and Limitations

Several extensions and refinements exist:

  • Mixed Precision / Adaptive Depth: Smart Quantization adaptively determines per-layer binary or ternary depth, balancing memory savings and accuracy (Razani et al., 2019).
  • Group-wise, per-channel, or per-filter scaling: Improves representation and accuracy in heterogeneous layers (Liu et al., 2023, Gope et al., 2019).
  • Non-retraining post-hoc quantization (TNT): Rapid, theoretically optimal ternary mapping by cosine similarity without retraining, at the cost of some accuracy for large networks (Zhang et al., 2019).
  • Hyperspherical methods: Regularization prior to quantization improves gradient matching and mitigates bias (Liu et al., 2022, Liu et al., 2022).
  • Federated learning: FTTQ and T-FedAvg exploit ternary quantization for ultra-low-cost communication, with theoretical unbiasedness and reduced weight divergence (Xu et al., 2020).
  • Fine-Grained Quantization: Enables almost full-precision accuracy at extreme speedups by block-wise scaling (Mellempudi et al., 2017).

Limitations typically relate to:

  • Accuracy degradation on deeper or more demanding architectures, where the drop depends strongly on depth and quantization methodology.
  • Gradient bias introduced by STE-based training through the non-differentiable quantizer.
  • The accuracy cost of post-hoc (non-retraining) quantization for large networks.
  • Reliance on specialized kernels or hardware support to translate 2-bit encodings and sparsity into real speedups.

7. Recent Directions and Practical Guidelines

Contemporary research pursues:

  • Multimodal ternary LLM deployment: Optimizing training, quantization, and inference for edge devices and multimodal input scenarios (Sundaram et al., 2024, Wang et al., 17 Feb 2025).
  • Hard vs. soft thresholding trade-off: STTN and hyperspherical models show learned interval adaptation can close the quantization-accuracy gap (Xu et al., 2022, Liu et al., 2022).
  • Statistical operator design: Support and mass equalization operators establish strong QAT/PTQ/DFQ baselines, outperforming naive rounding in deep networks (Yvinec et al., 2023).

Recommended best practices include:

  • Using per-channel, per-filter, or group-wise scaling in heterogeneous layers rather than a single global scale.
  • Preferring learned thresholds or soft, temperature-based assignments over naive hard thresholding when training deeper networks.
  • Applying regularization before quantization (e.g., hyperspherical or cosine-similarity objectives) to reduce gradient bias and quantization error.
  • Choosing quantization-aware training when retraining is feasible, and statistically designed operators such as support or mass equalization for post-training or data-free settings.

Ternary quantization provides a robust, flexible framework for neural compression, federated learning, multimodal LLM deployment, and efficient edge inference, offering a favorable balance between bit-width, memory savings, inference throughput, and task accuracy under a range of settings (Liu et al., 2023, Wang et al., 17 Feb 2025, Liu et al., 2022, Sundaram et al., 2024).
