Ternary Weight Quantization
- Ternary weight quantization is a method that constrains network weights to the set {−1, 0, +1} using adaptive thresholds and scaling factors for efficient inference.
- It integrates quantization-aware training with straight-through estimators and optimized thresholding to maintain accuracy during compression.
- This scheme achieves significant hardware efficiency and energy savings by reducing multiplications and compressing models up to 49× while supporting diverse architectures.
A ternary weight quantization scheme constrains neural network weights to the discrete set $\{-1, 0, +1\}$, typically combined with one or more scaling factors for representational fidelity. This systematic quantization dramatically reduces model size (generally to 2 bits per weight), eliminates most multiplications in inference, and yields hardware-friendly sparsity and energy savings. Modern ternary schemes optimize the quantizer and quantization threshold using minimum mean-square error criteria, distributional matching, layer-wise statistics, or direct training via straight-through estimators. Variant schemes integrate pruning, hyperspherical norm constraints, hybrid filter banks, and quantization-aware fine-tuning. Ternary quantization is empirically validated for classification, detection, segmentation, generative diffusion transformers, spiking neural networks, and transformer LLMs.
1. Mathematical Formulation of Ternary Quantization
Ternary quantization is typically expressed as mapping each scalar weight $w$ in a full-precision tensor $\mathbf{W}$ to a quantized value $w_q \in \{-s, 0, +s\}$. The mapping uses a symmetric threshold $\Delta > 0$ and a positive scaling factor $s > 0$:

$$
w_q = \begin{cases} +s, & w > \Delta, \\ 0, & |w| \le \Delta, \\ -s, & w < -\Delta. \end{cases}
$$

Equivalently, $w_q = s \,\mathrm{sign}(w)\, \mathbb{1}(|w| > \Delta)$, with $\mathbb{1}(\cdot)$ the indicator function (Liu et al., 2021, Li et al., 2016).
The threshold $\Delta$ and scaling $s$ are derived under a mean-square-error criterion and optimized either in closed form or numerically. Common choices:
- $\Delta = \alpha \cdot \mathbb{E}[|w|]$ with hyperparameter $\alpha$ (commonly around $0.7$).
- $s = \dfrac{\sum_{i:\,|w_i| > \Delta} |w_i|}{\left|\{\, i : |w_i| > \Delta \,\}\right|}$, i.e., the mean magnitude of the weights that exceed the threshold.
Some approaches admit per-group, per-channel, or per-layer adaptation, and allow for asymmetric positive/negative scale parameters (Hou et al., 2018, Zhu et al., 2016).
Alternative formulations—such as fine-grained (group-wise) quantization, truncated Gaussian optimization, cosine-similarity-based assignment, and two-branch binary decomposition—yield additional flexibility, regularization, or training stability (Mellempudi et al., 2017, He et al., 2018, Zhang et al., 2019, Xu et al., 2022).
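As a concrete illustration of the basic formulation above, the following NumPy sketch implements a layer-wise quantizer with the heuristic threshold $\Delta = \alpha \cdot \mathrm{mean}(|w|)$ and the mean-magnitude scale; the function name and the default $\alpha = 0.7$ are illustrative choices, not a reference implementation.

```python
import numpy as np

def ternarize(w: np.ndarray, alpha: float = 0.7):
    """Quantize a weight tensor to {-s, 0, +s} with a layer-wise threshold.

    Returns the quantized tensor, the scale s, and the threshold delta.
    """
    delta = alpha * np.mean(np.abs(w))      # symmetric threshold
    mask = np.abs(w) > delta                # weights that survive thresholding
    # Mean magnitude of surviving weights; fall back to 0 if all were pruned.
    s = np.abs(w[mask]).mean() if mask.any() else 0.0
    w_q = s * np.sign(w) * mask             # values in {-s, 0, +s}
    return w_q, s, delta

# Example: quantize a random 256x256 layer and report the induced sparsity.
w = np.random.randn(256, 256).astype(np.float32)
w_q, s, delta = ternarize(w)
print(f"scale={s:.4f}, threshold={delta:.4f}, sparsity={(w_q == 0).mean():.2%}")
```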
2. Quantization-Aware Training and Optimization Methods
In contrast to static post-training quantization, quantization-aware training (QAT) integrates the quantizer into the forward pass and propagates gradients using a straight-through estimator (STE), which approximates the derivative of the piecewise-constant quantizer as $1$ inside the clipped range $|w| \le 1$ and $0$ outside (Liu et al., 2021, Lu et al., 23 May 2024). The loss typically combines the primary task objective with regularization terms:

$$
\mathcal{L} = \mathcal{L}_{\text{task}} + \eta\, \|\mathbf{w} - \mathbf{w}_q\|_2^2 + \lambda\, \|\mathbf{w}\|_2^2,
$$

where $\mathcal{L}_{\text{task}}$ is the training loss, $\eta$ controls pruning pressure, and $\lambda$ is weight decay (Liu et al., 2021).
Variations include:
- Simultaneous optimization of quantizer thresholds with truncated Gaussian approximations, allowing back-propagation into threshold parameters (He et al., 2018).
- Pruning and re-initialization cycles to drive weights toward angular (cosine) alignment with ternary codebooks, minimizing the bias induced by the STE (Liu et al., 2022).
- Integration with complex models (e.g., diffusion transformers, spiking neural networks) by replacing all linear and projection layers with ternary quantized counterparts, sometimes with architectural adjustments such as RMS-norm layers for robust training (Lu et al., 23 May 2024, Deckers et al., 24 Sep 2024).
Sparsity induced by ternary quantization can be further controlled by scheduling the threshold parameter $\Delta$, which may be held fixed, learned, or gradually increased during training (Liu et al., 2021).
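A minimal PyTorch sketch of this STE pattern, reusing the layer-wise threshold and scale from Section 1; the class name, the default $\alpha = 0.7$, and the $|w| \le 1$ clip in the backward pass are illustrative choices rather than any one paper's exact recipe.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Forward: ternarize the weights; backward: pass gradients straight
    through for weights inside the clipped range |w| <= 1."""

    @staticmethod
    def forward(ctx, w, alpha=0.7):
        delta = alpha * w.abs().mean()
        mask = (w.abs() > delta).float()
        s = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
        ctx.save_for_backward(w)
        return s * torch.sign(w) * mask

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Straight-through: identity gradient inside the clip range, 0 outside.
        return grad_out * (w.abs() <= 1.0).float(), None

# Usage: the full-precision tensor w is the trained (latent) parameter.
w = torch.randn(128, 64, requires_grad=True)
w_q = TernarySTE.apply(w)
loss = w_q.pow(2).mean()   # stand-in for the task loss
loss.backward()            # gradients flow back to the latent weights
```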
3. Algorithmic Schemes and Implementation Pipelines
Below is a general skeleton for ternary QAT, applicable to convolutional, transformer, or feed-forward architectures:
```python
import numpy as np

# Skeleton only: `model`, `loss_fn`, `backprop`, `minibatches`, and the
# hyperparameters (alpha, eta, lam, learning_rate) are placeholders.
for epoch in range(num_epochs):
    for x, y in minibatches:
        mu = np.mean(np.abs(w))                  # mean magnitude of latent weights
        delta = alpha * mu                       # layer-wise threshold
        mask = np.abs(w) > delta
        s = np.sum(np.abs(w) * mask) / max(mask.sum(), 1)   # scale of surviving weights
        w_q = s * np.sign(w) * mask              # ternary proxy in {-s, 0, +s}

        out = model(x, w_q)                      # forward pass with quantized weights
        loss = loss_fn(out, y) + eta * np.sum((w - w_q) ** 2) + lam * np.sum(w ** 2)

        grad_wq = backprop(loss, w_q)            # gradient w.r.t. quantized weights
        # Straight-through estimator plus gradients of the regularizers:
        grad_w = grad_wq * (np.abs(w) <= 1) * s + 2 * eta * (w - w_q) + 2 * lam * w
        w -= learning_rate * grad_w              # update latent full-precision weights

    # Optionally adjust alpha, eta, lam per layer or per epoch.
```
Classic post-training approaches (e.g., TNT, FGQ) replace each weight (or group of weights) by its optimal ternary proxy according to closed-form statistics, sometimes with per-group scaling factors or cosine similarity maximization (Zhang et al., 2019, Mellempudi et al., 2017).
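As a rough illustration of such post-hoc, group-wise conversion, the sketch below quantizes each group of weights by searching a small grid of thresholds for the minimum squared error; the group size, threshold grid, and helper names are illustrative assumptions, not the exact FGQ or TNT procedures.

```python
import numpy as np

def ternarize_group(g: np.ndarray) -> np.ndarray:
    """Pick the threshold/scale pair minimizing ||g - s*t||^2 over a small
    grid of candidate thresholds, with t in {-1, 0, +1}^len(g)."""
    best_err, best_gq = np.inf, np.zeros_like(g)
    for delta in np.linspace(0.05, 1.0, 20) * np.abs(g).max():
        mask = np.abs(g) > delta
        if not mask.any():
            continue
        s = np.abs(g[mask]).mean()        # optimal scale for this support
        g_q = s * np.sign(g) * mask
        err = np.sum((g - g_q) ** 2)
        if err < best_err:
            best_err, best_gq = err, g_q
    return best_gq

def ternarize_groupwise(w: np.ndarray, group_size: int = 64) -> np.ndarray:
    """Post-training conversion: quantize each contiguous group independently."""
    flat = w.reshape(-1)
    out = np.concatenate([ternarize_group(flat[i:i + group_size])
                          for i in range(0, flat.size, group_size)])
    return out.reshape(w.shape)

w = np.random.randn(512, 512).astype(np.float32)
w_t = ternarize_groupwise(w)
print("relative L2 error:", np.linalg.norm(w - w_t) / np.linalg.norm(w))
```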
4. Pruning, Hyperspherical, and Hybrid Techniques
Recent advances combine ternary quantization with additional constraints, regularizers, and computational constructs:
- Pruning Ternary Quantization (PTQ): Embeds L2-norm regularization and pruning penalties, reducing the weight discrepancy seen by the gradient estimator and offering substantial compression ratios with modest top-1 accuracy drops on ResNet-18/ImageNet (Liu et al., 2021).
- Hyperspherical Quantization (HQ, HLA): Constrains weights to lie on the unit sphere ($\|\mathbf{w}\|_2 = 1$), incorporates iterative column-wise pruning and angular discrepancy penalties, and enables extreme compression ratios with accuracy retention superior to prior work (Liu et al., 2022, Liu et al., 2022).
- Hybrid Filter Banks: Layer-wise assignment of full-precision and ternary filters, retaining the most sensitive filters in float while quantizing the rest, delivers adjustable model-size and energy savings in MobileNets (Gope et al., 2019).
- Twin Network Augmentation (TNA): For spiking neural networks, co-training a full-precision "twin" model alongside the ternary-quantized base with a logit-matching loss enhances performance of the compressed SNN, often matching or exceeding full-precision accuracy (Deckers et al., 24 Sep 2024).
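The PyTorch fragment below sketches the twin-training idea: an MSE logit-matching term and a weighting factor `beta` are assumptions standing in for the paper's exact formulation, and the optimizer is assumed to hold the parameters of both networks.

```python
import torch
import torch.nn.functional as F

def twin_training_step(ternary_model, twin_fp_model, x, y, optimizer, beta=1.0):
    """One co-training step: both models see the batch; their logits are
    pulled together while each also minimizes its own task loss."""
    logits_t = ternary_model(x)      # forward with ternarized weights (STE inside)
    logits_f = twin_fp_model(x)      # full-precision twin

    task_loss = F.cross_entropy(logits_t, y) + F.cross_entropy(logits_f, y)
    match_loss = F.mse_loss(logits_t, logits_f)   # logit-matching term
    loss = task_loss + beta * match_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```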
5. Scaling, Thresholding, and Adaptation Mechanisms
Robust quantization depends critically on threshold selection, scaling, and adaptation strategy:
- Fixed heuristics, e.g., setting $\Delta$ to roughly $0.7$–$0.8$ times the mean absolute weight $\mathbb{E}[|w|]$ (Liu et al., 2021, Li et al., 2016).
- Adaptive learning of thresholds via truncated Gaussian matching, cosine similarity, or mean/worst-case error minimization (He et al., 2018, Yvinec et al., 2023).
- Group-wise scale and threshold optimization in post-training conversion (FGQ), trading compute savings for fine control over accuracy (Mellempudi et al., 2017).
- Loss-aware (LAT) and trained ternary quantization (TTQ) methods use per-layer or per-sign scaling, explicitly optimizing for network-level loss during assignment (Hou et al., 2018, Zhu et al., 2016).
- Ternary adaptation for fine-tuning quantized LLMs (LoTA-QAF) aligns ternary weights with the quantization grid for lossless merging and efficient inference (Chen et al., 24 May 2025).
Empirical analysis favors worst-case error minimization (TQuant) for robustness in data-free/QAT settings and mean-error minimization (MQuant) for PTQ with limited data (Yvinec et al., 2023).
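The contrast between per-layer and per-channel adaptation of the threshold and scale can be sketched as follows; the axis convention (one threshold/scale per output channel) and the default $\alpha = 0.7$ are illustrative.

```python
import numpy as np

def ternarize(w, alpha=0.7):
    """Layer-wise ternarization: one threshold/scale for the whole tensor."""
    delta = alpha * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    s = np.abs(w[mask]).mean() if mask.any() else 0.0
    return s * np.sign(w) * mask

def ternarize_per_channel(w, alpha=0.7):
    """Per-output-channel ternarization: one threshold/scale per row of a
    (out_channels, in_features) weight matrix."""
    return np.stack([ternarize(row, alpha) for row in w])

w = np.random.randn(64, 256).astype(np.float32)
err_layer = np.linalg.norm(w - ternarize(w))
err_chan = np.linalg.norm(w - ternarize_per_channel(w))
print(f"per-layer error {err_layer:.2f} vs per-channel error {err_chan:.2f}")
```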
6. Hardware Efficiency, Sparsity, and Energy Savings
Ternary quantization offers compelling hardware advantages:
- Storage: $2$ bits/weight gives roughly $16\times$ model compression over $32$-bit float baselines, and more when combined with pruning (Liu et al., 2021, Li et al., 2016).
- Compute: Inference eliminates nearly all multiplications, relying largely on additions and sign operations; custom tensor accelerators, FPGAs, and ASICs exploit this for energy savings of $3\times$ or more and throughput gains of $4\times$ or more (Gope et al., 2019, Li et al., 2016, Wang et al., 17 Feb 2025).
- Packing: Efficient runtime packing and unpacking (2-bit/word schemes) further shrink memory bandwidth requirements, as in ternary LLMs and diffusion transformers (Lu et al., 23 May 2024, Wang et al., 17 Feb 2025); see the packing sketch after this list.
- Sparsity: Induced by thresholding, a large fraction of weights can be driven to zero in typical architectures (e.g., AlexNet), or sparsity can be scheduled per layer or group for an optimal energy/accuracy trade-off (Zhu et al., 2016, Mellempudi et al., 2017).
- Bitwise Operations: Dedicated computation patterns (bitwise XNOR, popcount, lookup-tables) replace floating-point MACs, scaling with activation bit-width and packing strategy (Li et al., 2019, Wang et al., 17 Feb 2025).
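As an illustration of 2-bit packing, the NumPy sketch below stores four ternary weights per byte using the code $t + 1 \in \{0, 1, 2\}$; the encoding and helper names are illustrative, not any specific runtime's on-disk or in-memory format.

```python
import numpy as np

def pack_ternary(t: np.ndarray) -> np.ndarray:
    """Pack a flat array of ternary values {-1, 0, +1} into bytes,
    4 values per byte, using the 2-bit code (t + 1) in {0, 1, 2}."""
    codes = (t.astype(np.int8) + 1).astype(np.uint8)   # {-1,0,1} -> {0,1,2}
    pad = (-len(codes)) % 4                            # pad to a multiple of 4
    codes = np.pad(codes, (0, pad)).reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary; n is the original number of weights."""
    shifts = np.array([0, 2, 4, 6])
    codes = (packed[:, None] >> shifts) & 0b11         # shape (num_bytes, 4)
    return codes.reshape(-1)[:n].astype(np.int8) - 1   # back to {-1, 0, +1}

t = np.random.choice([-1, 0, 1], size=1000).astype(np.int8)
packed = pack_ternary(t)
assert np.array_equal(unpack_ternary(packed, t.size), t)
print(f"{t.size} weights -> {packed.nbytes} bytes (4 weights/byte)")
```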
7. Empirical Performance, Benchmark Results, and Limitations
Representative task performance for ternary models is as follows:
- ResNet-18/ImageNet: PTQ compresses the model to $2.75$ MB with only a small drop in top-1 accuracy (Liu et al., 2021); Hyperspherical Quantization reaches top-1 accuracy around $67\%$ at large size reductions (Liu et al., 2022).
- Mask R-CNN/COCO: PTQ compresses the $170$ MB model to $5$ MB (roughly $34\times$ smaller) with only a minor drop in AP (Liu et al., 2021).
- MobileNets: Hybrid filter banks retain baseline accuracy while delivering substantial model-size and energy savings (Gope et al., 2019).
- LLMs: LoTA-QAF recovers, and sometimes exceeds, the accuracy of full-precision LoRA on quantized Llama-3.1/Qwen-2.5, with inference speedups over low-bit adapters (Chen et al., 24 May 2025). Bitnet.cpp achieves multi-fold speedups with sub-2-bit lossless inference over the full-precision baseline (Wang et al., 17 Feb 2025).
- Fine-grained quantization (FGQ): Post-hoc conversion with small group sizes preserves accuracy close to the full-precision baseline on ImageNet while yielding a compute speedup (Mellempudi et al., 2017).
- Spiking Neural Networks (TNA): The ternary SNN outperforms binary variants and matches or exceeds full precision on several benchmarks (CIFAR-10, CIFAR-100, Fashion-MNIST, CIFAR10-DVS), with sparse, energy-efficient inference (Deckers et al., 24 Sep 2024).
Performance gaps remain most pronounced for ultra-large models and for the difficult-to-quantize first and last layers, where mixed-precision or layer-wise sensitivity scheduling is recommended. Post-training quantization may require light retraining for the largest group sizes or non-Gaussian weight distributions (Mellempudi et al., 2017, Yvinec et al., 2023). Block fitting, lookup-table bandwidth, and inference-stage optimization remain active development targets in emerging accelerators (Wang et al., 17 Feb 2025).
This entry consolidates mathematical formalism, algorithmics, engineering practices, and empirical findings in ternary weight quantization. The referenced schemes can be implemented per layer or architecture, adapted to domain-specific accuracy constraints, and deployed across diverse hardware with predictable benefits in compression, latency, and power consumption.