Ternary Quantization in Neural Networks
- Ternary quantization is a process that maps real-valued parameters to a discrete set of three values {–α, 0, +α}, enabling efficient model compression and reduced energy consumption.
- It utilizes threshold-based operators and adaptive scaling methods to minimize quantization error and maintain accuracy across neural network architectures.
- This approach achieves significant storage reduction and faster inference on hardware, making it ideal for edge deployment and resource-constrained learning scenarios.
Ternary quantization refers to the process of mapping real-valued parameters (typically neural network weights or activations) to a discrete set of three values $\{-\alpha, 0, +\alpha\}$, where $\alpha > 0$ is a scaling factor. This approach seeks to achieve substantial reductions in model size, memory footprint, and inference energy while maintaining acceptable accuracy in deep learning and other signal-processing systems. Ternary quantization has become a critical methodology for model compression, efficient hardware deployment, and resource-constrained learning scenarios.
1. Mathematical Formulation and Fundamental Operators
Ternary quantization operates by projecting each real-valued scalar to one of three levels. The canonical hard-threshold operator is defined as
$$Q_\Delta(w) = \begin{cases} +\alpha, & w > \Delta,\\ 0, & |w| \le \Delta,\\ -\alpha, & w < -\Delta, \end{cases}$$
where $\Delta > 0$ is a quantization threshold, and $\alpha$ is either fixed or optimized via least-squares or direct learning (Li et al., 2016, Zhu et al., 2016, Liu et al., 2023).
The scaling factor is typically computed for a given set of weights $\{w_i\}$ as
$$\alpha^* = \frac{1}{|I_\Delta|} \sum_{i \in I_\Delta} |w_i|, \qquad I_\Delta = \{\, i : |w_i| > \Delta \,\},$$
minimizing the Euclidean quantization error $\|W - \alpha T\|_2^2$ between the full-precision weights $W$ and their ternary counterpart $T \in \{-1, 0, +1\}^n$ (Li et al., 2016, Zhang et al., 2019).
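As a concrete illustration of the hard-threshold operator and closed-form scale above, here is a minimal NumPy sketch; the heuristic threshold $\Delta \approx 0.7\cdot\mathbb{E}[|w|]$ follows the TWN recipe (Li et al., 2016), while the function name and per-tensor (rather than per-filter) application are our simplifications for illustration.

```python
import numpy as np

def ternarize_twn(w: np.ndarray):
    """Hard-threshold ternarization with a closed-form scale (TWN-style sketch).

    Returns the assignments in {-1, 0, +1}, the scale alpha, and the threshold
    delta; alpha * assignments approximates w in the least-squares sense.
    """
    # Heuristic threshold from TWN: delta ~= 0.7 * mean(|w|).
    delta = 0.7 * np.mean(np.abs(w))

    # Hard assignment to the three levels.
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0

    # Closed-form scale: mean magnitude of the weights that survive the threshold.
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return t, alpha, delta

# Example: quantize a random filter bank (per-tensor here for simplicity;
# TWN applies the same rule per filter).
w = np.random.randn(3, 3, 64, 64).astype(np.float32)
t, alpha, delta = ternarize_twn(w)
w_hat = alpha * t  # ternary approximation of w
```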
Advanced operators include:
- Support equalization (TQuant): Thresholds are chosen so that the three quantization bins split the dynamic range into equal-length intervals (Yvinec et al., 2023).
- Mass equalization (MQuant): Thresholds are set so each bin contains equal probability mass under a reference distribution, minimizing mean squared error (Yvinec et al., 2023).
- Soft-assignment projections: Sigmoid/tanh/Gumbel-softmax relaxations enable differentiable training, improving gradient behavior at quantization boundaries (Liu et al., 2023).
For ternarizing activations, the operator is typically analogous, $Q_{\Delta_a}(x) = \alpha_a \cdot \operatorname{sign}(x) \cdot \mathbb{1}[\,|x| > \Delta_a\,]$, with the threshold $\Delta_a$ and scale $\alpha_a$ chosen analogously from the activation statistics (Xu et al., 2022, Li et al., 2019).
2. Representative Algorithms and Training Strategies
Notable ternary quantization algorithms include:
Ternary Weight Networks (TWN): Direct thresholding and closed-form scale computation per filter, with STE for backward propagation. Achieves roughly 16× compression while staying within 2–3% top-1 accuracy of the full-precision baseline on ImageNet (Li et al., 2016).
Trained Ternary Quantization (TTQ): Jointly learns positive and negative scaling factors and assignments with a fixed threshold; employs STE and per-layer learnable scales. Outperforms prior ternary methods and, in some settings (ResNet-32/44/56 on CIFAR-10), even the full-precision baseline (Zhu et al., 2016); a minimal sketch of this learnable-scale scheme appears after this list.
Soft Threshold Ternary Networks (STTN): Abandons hard thresholding in favor of a dual-binary kernel decomposition, enabling "soft" ternarization of both weights and activations and automatic interval learning, yielding new state-of-the-art accuracy for full-ternary ResNet-18 (68.2% top-1 ImageNet) (Xu et al., 2022).
Hyperspherical Quantization (HQ/HLATQ): Hyperspherical constraints during pre-training, iterative pruning, and loss-aware regularization minimize angular discrepancy before ternary quantization, thereby mitigating gradient bias and enabling 30–50× compression with minimal accuracy drop (Liu et al., 2022, Liu et al., 2022).
Fine-Grained Quantization (FGQ): Groups weights into blocks that share a scaling factor and threshold, dramatically reducing the number of multiplications in inference pipelines and supporting sub-8-bit full-network quantization (Mellempudi et al., 2017).
Adaptive Binary-Ternary (Smart Quantization, SQ): Per-layer learned regularization adaptively determines whether a layer should be binary or ternary, optimizing the trade-off between memory saving and accuracy (Razani et al., 2019).
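To make the learnable-scale scheme concrete (cf. TTQ above), the following hedged PyTorch sketch keeps a full-precision latent weight, assigns it to the three levels with a fixed threshold, scales the positive and negative levels with learnable factors, and uses the straight-through estimator in the backward pass; the class and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Linear layer with TTQ-style ternary weights (illustrative sketch)."""

    def __init__(self, in_features: int, out_features: int, threshold: float = 0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.w_p = nn.Parameter(torch.tensor(1.0))   # learnable scale for the +1 level
        self.w_n = nn.Parameter(torch.tensor(1.0))   # learnable scale for the -1 level
        self.threshold = threshold                   # fixed assignment threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        pos = (w > self.threshold).float()           # weights assigned to +w_p
        neg = (w < -self.threshold).float()          # weights assigned to -w_n
        w_q = self.w_p * pos - self.w_n * neg        # ternary weight tensor
        # Straight-through trick: the forward pass uses w_q, while the backward
        # pass sends dL/dw_q unchanged to the latent full-precision weight (STE)
        # and, through w_q, aggregates gradients onto the scales w_p and w_n.
        w_eff = w_q + w - w.detach()
        return F.linear(x, w_eff)

layer = TernaryLinear(128, 64)
out = layer(torch.randn(8, 128))
out.sum().backward()   # gradients reach layer.weight, layer.w_p, and layer.w_n
```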
3. Optimization Methods and Backward Propagation
Optimization during ternary quantized training relies chiefly on the straight-through estimator (STE), which for hard assignments treats the quantizer as the identity in the backward pass, $\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial \hat w}$ with $\hat w = Q_\Delta(w)$ (Li et al., 2016, Zhu et al., 2016, Liu et al., 2023). STE enables gradient flow through the non-differentiable quantization step but introduces bias, which is mitigated in hyperspherical or soft-threshold schemes (Liu et al., 2022, Xu et al., 2022).
Other approaches:
- Proximal-gradient (ProxQuant): Iterative optimization using a regularizer enforcing proximity to the ternary grid (Liu et al., 2023).
- ADMM/Alternating minimization: Separates continuous and discrete variables, alternately projecting onto the ternary set and optimizing the loss (Liu et al., 2023).
- Temperature-based soft quantization: Gradually sharpens relaxed quantizers during training, improving assignment fidelity (Liu et al., 2023, Liu et al., 2022).
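As a minimal illustration of the temperature idea (a generic relaxation, not any single paper's exact formulation), the sketch below smooths the hard-threshold operator with two tanh terms; as the temperature shrinks, the soft quantizer approaches the hard ternary map from Section 1.

```python
import torch

def soft_ternary(w: torch.Tensor, alpha: float, delta: float, temperature: float) -> torch.Tensor:
    """Differentiable relaxation of the hard ternary quantizer.

    As temperature -> 0 this converges to:
        +alpha for w > delta,   0 for |w| <= delta,   -alpha for w < -delta.
    """
    return 0.5 * alpha * (torch.tanh((w - delta) / temperature)
                          + torch.tanh((w + delta) / temperature))

# During training the temperature is annealed so the relaxed quantizer
# gradually sharpens toward the hard assignment.
w = torch.randn(5, requires_grad=True)
for T in [1.0, 0.3, 0.1, 0.03]:
    w_q = soft_ternary(w, alpha=1.0, delta=0.5, temperature=T)
    loss = (w_q ** 2).sum()   # placeholder loss; gradients flow through the tanh terms
    loss.backward()
    w.grad = None             # reset before the next (sharper) temperature
```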
Backpropagation updates treat scaling factors (and sometimes thresholds) as learnable network parameters, with gradients computed by aggregation over the assigned sets, e.g.
$$\frac{\partial \mathcal{L}}{\partial W_l^p} = \sum_{i \in I_l^p} \frac{\partial \mathcal{L}}{\partial \hat w_l^{(i)}}, \qquad \frac{\partial \mathcal{L}}{\partial W_l^n} = \sum_{i \in I_l^n} \frac{\partial \mathcal{L}}{\partial \hat w_l^{(i)}},$$
where $I_l^p$ and $I_l^n$ index the weights of layer $l$ assigned to the positive and negative levels (Zhu et al., 2016, Li et al., 2016).
Regularization terms, such as cosine-similarity in TNT (Zhang et al., 2019) or hyperspherical alignment loss (Liu et al., 2022), further improve quantization fidelity.
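A simple way to instantiate such a penalty, offered as a generic sketch rather than the exact TNT or hyperspherical loss, is to penalize the angle between the latent full-precision weights and their ternary projection:

```python
import torch
import torch.nn.functional as F

def cosine_alignment_penalty(w: torch.Tensor, w_q: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Penalty that grows as the ternary weights rotate away from the
    full-precision weights: lam * (1 - cos(w, w_q))."""
    cos = F.cosine_similarity(w.flatten(), w_q.flatten(), dim=0)
    return lam * (1.0 - cos)

# Added to the task loss during quantization-aware training:
#   loss = task_loss + cosine_alignment_penalty(w, w_q)
```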
4. Hardware Implications and Inference Efficiency
Ternary quantization yields substantial advantages for hardware deployment:
- Storage reduction: Each weight in $\{-\alpha, 0, +\alpha\}$ can be encoded in 2 bits, reducing model size by up to roughly 16× relative to 32-bit floating point (Li et al., 2016, Zhu et al., 2016, Mellempudi et al., 2017).
- Computation: Multiplies in MACs are replaced by conditional sign-operations and additions; skip-zero masks induce sparsity, reducing both memory bandwidth and energy (Chen et al., 2020, Gope et al., 2019, Zhu et al., 2016).
- Specialized kernels: Bitwise engines exploit simple encodings and popcount primitives for dot-products; FATNN, for example, reduces the number of bit-operations required per ternary inner product (Chen et al., 2020). A minimal multiply-free dot-product sketch appears at the end of this section.
- Group-wise scaling: FGQ and hybrid filter-bank designs allow block-wise computation, reducing multiply load by 75–99% and enabling high-throughput fixed-point pipelines (Mellempudi et al., 2017, Gope et al., 2019).
- Edge deployment: Ternary LLMs (BitNet b1.58, LLaVaOLMoBitnet1B) with lookup-table and scaled int2 kernels realize inference speedups over FP16 and substantially smaller weight storage (Wang et al., 17 Feb 2025, Sundaram et al., 2024).
Energy consumption per operation can be reduced by a large factor compared to full-precision inference on custom hardware (FPGA/ASIC) (Li et al., 2019).
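To make the multiply-free arithmetic concrete, the sketch below (an illustration of the general idea, not a kernel from the cited papers) treats the ternary assignments as positive/negative masks and accumulates a dot product using only additions and subtractions, followed by a single multiplication by the shared scale.

```python
import numpy as np

def ternary_dot(acts: np.ndarray, assignments: np.ndarray, alpha: float) -> float:
    """Dot product <acts, alpha * assignments> without multiplications
    inside the accumulation.

    assignments holds values in {-1, 0, +1}; on real hardware the two masks
    below would be packed 2 bits per weight and combined with popcount-style
    primitives.
    """
    pos_mask = assignments > 0   # weights quantized to +alpha
    neg_mask = assignments < 0   # weights quantized to -alpha
    # Zero weights are simply skipped, which is where the sparsity savings come from.
    acc = acts[pos_mask].sum() - acts[neg_mask].sum()
    return alpha * acc           # one multiply by the shared scale

acts = np.random.randn(1024).astype(np.float32)
assignments = np.random.choice([-1, 0, 1], size=1024).astype(np.int8)
out = ternary_dot(acts, assignments, alpha=0.07)
# Matches the dense reference: np.dot(acts, 0.07 * assignments)
```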
5. Empirical Performance and Accuracy Trade-offs
Across benchmarks, ternary quantization can achieve performance close to full precision:
- ImageNet: TWN/TTQ incur only a few percent top-1 accuracy drop (ResNet-18); HQ narrows the gap further while delivering 30–50× compression (Liu et al., 2022, Zhu et al., 2016, Li et al., 2016).
- Edge LLMs: BitNet b1.58 and LLaVaOLMoBitnet1B provide memory savings of roughly 10× or more, at the cost of absolute drops of ten or more points on QA benchmarks, while remaining competitive on VQA tasks (Sundaram et al., 2024, Wang et al., 17 Feb 2025).
- MobileNets: Per-layer hybrid filter banks halve model size and energy with little accuracy loss (Gope et al., 2019).
- CIFAR-10 / MNIST: TTQ and TWN often match or beat full-precision for medium-depth networks (Zhu et al., 2016, Li et al., 2016, Zhang et al., 2019).
Statistical analysis indicates that, for certain sparse feature spaces, ternary quantization can improve feature discrimination and classification accuracy over unquantized data, providing “free” denoising and signal selection (Lu et al., 18 Apr 2025, Lu et al., 2022).
Typical compression ratios range from roughly 16× (2-bit weight encoding) to 30–50× when combined with pruning, with inference speedups of 2× or more, and accuracy drops contingent on the architecture, depth, and quantization methodology (Li et al., 2016, Mellempudi et al., 2017, Liu et al., 2023).
6. Extensions, Variants, and Limitations
Several extensions and refinements exist:
- Mixed Precision / Adaptive Depth: Smart Quantization adaptively determines per-layer binary or ternary depth, balancing memory savings and accuracy (Razani et al., 2019).
- Group-wise, per-channel, or per-filter scaling: Improves representation and accuracy in heterogeneous layers (Liu et al., 2023, Gope et al., 2019).
- Non-retraining post-hoc quantization (TNT): Rapid, theoretically optimal ternary mapping by cosine similarity without retraining, at the cost of some accuracy for large networks (Zhang et al., 2019).
- Hyperspherical methods: Regularization prior to quantization improves gradient matching and mitigates bias (Liu et al., 2022, Liu et al., 2022).
- Federated learning: FTTQ and T-FedAvg exploit ternary quantization for ultra-low-cost communication, with theoretical unbiasedness and reduced weight divergence (Xu et al., 2020).
- Fine-Grained Quantization: Enables almost full-precision accuracy at extreme speedups by block-wise scaling (Mellempudi et al., 2017).
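As a hedged sketch of such group-wise scaling (the function name, group size, and the 0.7 threshold heuristic borrowed from TWN are our choices for illustration), the code below splits a weight tensor into fixed-size blocks, each with its own threshold and scale:

```python
import numpy as np

def ternarize_groupwise(w: np.ndarray, group_size: int = 64):
    """Block-wise ternarization: each group of `group_size` weights gets its
    own threshold and scale, as in fine-grained / per-group schemes."""
    flat = w.reshape(-1)
    pad = (-len(flat)) % group_size                  # pad so groups divide evenly
    flat = np.pad(flat, (0, pad))
    groups = flat.reshape(-1, group_size)

    deltas = 0.7 * np.abs(groups).mean(axis=1, keepdims=True)    # per-group threshold
    t = np.where(groups > deltas, 1.0, np.where(groups < -deltas, -1.0, 0.0))

    mask = np.abs(groups) > deltas
    counts = mask.sum(axis=1, keepdims=True)
    # Per-group scale: mean magnitude of surviving weights (0 if a group is all zeros).
    alphas = np.where(counts > 0,
                      (np.abs(groups) * mask).sum(axis=1, keepdims=True)
                      / np.maximum(counts, 1),
                      0.0)
    w_hat = (alphas * t).reshape(-1)[:w.size].reshape(w.shape)
    return w_hat, t, alphas

w = np.random.randn(256, 256).astype(np.float32)
w_hat, t, alphas = ternarize_groupwise(w, group_size=64)
```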
Limitations typically relate to:
- Accuracy degradation in large or high-capacity models at extreme compression levels without fine-tuning (Zhang et al., 2019, Mellempudi et al., 2017).
- Hardware execution dependency: Realized speedup/memory savings depend on software kernels and accelerator logic (e.g., SIMD, DSP, bitwise LUT) (Wang et al., 17 Feb 2025, Chen et al., 2020).
- STE-induced bias, especially near quantizer boundaries, unless mitigated by regularization or sophisticated projection (Liu et al., 2022, Liu et al., 2022, Xu et al., 2022).
- Ternary networks generally outperform binary for comparable compression ratios but require more memory and compute than strict binarization (Li et al., 2016, Razani et al., 2019).
7. Recent Directions and Practical Guidelines
Contemporary research pursues:
- Multimodal ternary LLM deployment: Optimizing training, quantization, and inference for edge devices and multimodal input scenarios (Sundaram et al., 2024, Wang et al., 17 Feb 2025).
- Hard vs. soft thresholding trade-off: STTN and hyperspherical models show learned interval adaptation can close the quantization-accuracy gap (Xu et al., 2022, Liu et al., 2022).
- Statistical operator design: Support and mass equalization operators establish strong baselines for quantization-aware training (QAT), post-training quantization (PTQ), and data-free quantization (DFQ), outperforming naive rounding in deep networks (Yvinec et al., 2023).
Recommended best practices include:
- Pretraining a full-precision model and retaining a full-precision copy of the weights during ternary SGD or QAT (Liu et al., 2023, Zhu et al., 2016).
- Layer-wise or group-wise choice of scaling factors and thresholds, optimized for local statistics (Yvinec et al., 2023, Mellempudi et al., 2017, Gope et al., 2019).
- Monitoring the fraction of weights assigned to zero to avoid output collapse or excessive sparsity (Liu et al., 2023); a minimal monitoring snippet follows this list.
- Preferring ternary over binary quantization for tasks that need accuracy close to full precision or that benefit from exploitable sparsity (Li et al., 2016, Zhu et al., 2016).
- Empirically tuning quantization parameters for each architecture and dataset; leveraging hyperspherical regularization when feasible (Liu et al., 2022, Liu et al., 2022).
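For the zero-assignment monitoring practice above, a minimal check (ours, not from the cited work) can log the fraction of weights mapped to the zero level per layer:

```python
import torch

def zero_fraction(ternary_weights: torch.Tensor) -> float:
    """Fraction of weights assigned to the zero level; values near 1.0 signal
    that the threshold is too aggressive and the layer may collapse."""
    return (ternary_weights == 0).float().mean().item()

# Example: warn if a layer becomes almost entirely zero.
w_q = torch.tensor([0.0, 0.5, -0.5, 0.0, 0.0])
if zero_fraction(w_q) > 0.95:
    print("Warning: excessive sparsity; consider lowering the threshold.")
```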
Ternary quantization provides a robust, flexible framework for neural compression, federated learning, multimodal LLM deployment, and efficient edge inference, offering a favorable balance between bit-width, memory savings, inference throughput, and task accuracy across a range of settings (Liu et al., 2023, Wang et al., 17 Feb 2025, Liu et al., 2022, Sundaram et al., 2024).