Trained Ternary Quantization (TTQ)
- Trained Ternary Quantization (TTQ) is a method that quantizes network weights to {-Wⁿ, 0, +Wᵖ} using per-layer learnable scaling factors for improved representation.
- TTQ reduces 32-bit weights to 2 bits, achieving nearly 16× compression with strong empirical performance across benchmarks like CIFAR-10 and ImageNet.
- TTQ offers practical benefits including hardware efficiency, lower latency, and enhanced communication in federated learning through sparse representations.
Trained Ternary Quantization (TTQ) is a neural network quantization technique that constrains weights to ternary values and introduces learnable, per-layer scaling factors for positive and negative weights. The TTQ methodology reduces model size and inference complexity, while preserving—or in some cases even improving—model accuracy, especially for deep convolutional architectures. TTQ has become an influential reference for quantization-aware training, hardware-efficient inference, and communication-efficient federated learning (Zhu et al., 2016, Liu et al., 2023, Xu et al., 2020).
1. Formulation of Trained Ternary Quantization
In TTQ, each layer ℓ maintains full-precision latent weights $\tilde{w}_\ell$ and two trainable scaling factors: $W^p_\ell$ for positive ternary weights and $W^n_\ell$ for negative ones. A per-layer threshold is set as a fraction of the maximum absolute latent weight,

$$\Delta_\ell = t \cdot \max_i |\tilde{w}_\ell(i)|,$$

with $t = 0.05$ in empirical settings.
The ternary quantization function assigns each latent weight to

$$w^t_\ell(i) = \begin{cases} +W^p_\ell, & \tilde{w}_\ell(i) > \Delta_\ell \\ 0, & |\tilde{w}_\ell(i)| \le \Delta_\ell \\ -W^n_\ell, & \tilde{w}_\ell(i) < -\Delta_\ell. \end{cases}$$

This produces a per-layer codebook $\{-W^n_\ell, 0, +W^p_\ell\}$ (Zhu et al., 2016, Liu et al., 2023, Xu et al., 2020).
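A minimal NumPy sketch of this quantization step, following the formulation above, is shown below; the function name and the example scale values are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def ternarize(w_latent, w_pos, w_neg, t=0.05):
    """Quantize full-precision latent weights to {-w_neg, 0, +w_pos}.

    w_latent : full-precision latent weight tensor for one layer
    w_pos, w_neg : learned per-layer scaling factors (positive scalars)
    t : threshold fraction of the maximum absolute latent weight
    """
    delta = t * np.max(np.abs(w_latent))      # per-layer threshold
    w_ternary = np.zeros_like(w_latent)
    w_ternary[w_latent > delta] = w_pos       # positive codeword
    w_ternary[w_latent < -delta] = -w_neg     # negative codeword
    return w_ternary, delta

# Example: quantize a toy 3x3 conv kernel tensor with illustrative scales.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(16, 3, 3, 3))
wq, delta = ternarize(w, w_pos=0.08, w_neg=0.07)
print("codebook:", np.unique(wq), "threshold:", delta)
```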
2. Optimization and Training Algorithm
Standard network losses (such as cross-entropy) are minimized with the quantized weights used in the forward pass. TTQ augments the loss with L2 decay on the latent weights, i.e. a penalty proportional to $\sum_\ell \|\tilde{w}_\ell\|_2^2$. Gradient-based updates are performed for both scaling factors and latent weights:
- Scaling factor gradients: each scale accumulates the loss gradients with respect to the quantized weights over its index set, $\frac{\partial L}{\partial W^p_\ell} = \sum_{i \in I^p_\ell} \frac{\partial L}{\partial w^t_\ell(i)}$, and analogously for $W^n_\ell$ with a sign change, since the weights indexed by $I^n_\ell$ take the value $-W^n_\ell$.
- Latent weight gradients (using a modified Straight-Through Estimator): $\frac{\partial L}{\partial \tilde{w}_\ell(i)} = \begin{cases} W^p_\ell \cdot \frac{\partial L}{\partial w^t_\ell(i)}, & \tilde{w}_\ell(i) > \Delta_\ell \\ 1 \cdot \frac{\partial L}{\partial w^t_\ell(i)}, & |\tilde{w}_\ell(i)| \le \Delta_\ell \\ W^n_\ell \cdot \frac{\partial L}{\partial w^t_\ell(i)}, & \tilde{w}_\ell(i) < -\Delta_\ell. \end{cases}$
All parameters are updated using SGD or Adam. Pseudocode for both forward and backward passes is provided in (Zhu et al., 2016) and (Liu et al., 2023).
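This forward/backward structure can be expressed with a custom autograd function. The sketch below is a PyTorch rendering of the gradient rules above; the class name, the 0-dim scale parameters, and the toy training step are assumptions made for illustration, not the authors' released code.

```python
import torch

class TTQQuantize(torch.autograd.Function):
    """Ternary quantization with the modified straight-through estimator."""

    @staticmethod
    def forward(ctx, w_latent, w_pos, w_neg, t):
        delta = t * w_latent.abs().max()
        pos_mask = w_latent > delta            # weights mapped to +W^p
        neg_mask = w_latent < -delta           # weights mapped to -W^n
        ctx.save_for_backward(pos_mask, neg_mask, w_pos, w_neg)
        w_q = torch.zeros_like(w_latent)
        w_q[pos_mask] = w_pos
        w_q[neg_mask] = -w_neg
        return w_q

    @staticmethod
    def backward(ctx, grad_out):
        pos_mask, neg_mask, w_pos, w_neg = ctx.saved_tensors
        # Scaling-factor gradients: sum the output gradients over each region
        # (the minus sign for W^n follows from w_q = -W^n in that region).
        grad_w_pos = grad_out[pos_mask].sum()
        grad_w_neg = -grad_out[neg_mask].sum()
        # Latent-weight gradients: scale by W^p / 1 / W^n in the three regions.
        grad_latent = grad_out.clone()
        grad_latent[pos_mask] = grad_out[pos_mask] * w_pos
        grad_latent[neg_mask] = grad_out[neg_mask] * w_neg
        return grad_latent, grad_w_pos, grad_w_neg, None

# Toy training step: latent weights and both scales all receive gradients.
w_latent = torch.nn.Parameter(0.1 * torch.randn(64, 32))
w_pos = torch.nn.Parameter(torch.tensor(0.1))
w_neg = torch.nn.Parameter(torch.tensor(0.1))
x = torch.randn(8, 32)
y = x @ TTQQuantize.apply(w_latent, w_pos, w_neg, 0.05).t()
loss = y.pow(2).mean()
loss.backward()
```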
3. Inference and Model Compression
After training, all latent weights are discarded. Each layer stores:
- A ternary (2-bit-per-weight) tensor $w^t_\ell$,
- Two 32-bit scaling factors $W^p_\ell$ and $W^n_\ell$.
This results in significant model compression: from 32 bits per weight to 2 bits (a 16× reduction). Inference computes inner products/convolutions by accumulating activations over the positive and negative index sets and applying the two scales:

$$x^\top w^t_\ell = W^p_\ell \sum_{i \in I^p_\ell} x_i \;-\; W^n_\ell \sum_{i \in I^n_\ell} x_i.$$

Zero weights are skipped, reducing memory and computation requirements, especially on hardware with sparse-computation capabilities (Zhu et al., 2016).
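A small NumPy sketch of the deployed representation and the decomposed inner product is given below; the 2-bit packing scheme and helper names are illustrative assumptions about one way to realize the storage format, not a specific hardware design.

```python
import numpy as np

def encode_ternary(w_ternary):
    """Map {-W^n, 0, +W^p} to 2-bit codes {2, 0, 1} and pack four codes per byte."""
    codes = np.zeros(w_ternary.shape, dtype=np.uint8)
    codes[w_ternary > 0] = 1
    codes[w_ternary < 0] = 2
    flat = codes.ravel()
    flat = np.concatenate([flat, np.zeros((-len(flat)) % 4, dtype=np.uint8)])
    packed = flat[0::4] | (flat[1::4] << 2) | (flat[2::4] << 4) | (flat[3::4] << 6)
    return codes, packed  # stored payload: packed codes + two float32 scales per layer

def ternary_dot(x, codes, w_pos, w_neg):
    """Inner product as two accumulations and two scale multiplications; zeros are skipped."""
    return w_pos * x[codes == 1].sum() - w_neg * x[codes == 2].sum()

# Storage check: 2 bits/weight vs. 32 bits/weight gives the 16x reduction,
# plus a negligible 8 bytes per layer for the two scales.
```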
4. Empirical Evaluation
TTQ demonstrates strong empirical performance across multiple benchmarks:
- CIFAR-10 (ResNet-32/44/56): TTQ slightly outperforms full-precision baselines; e.g., ResNet-56 achieves a 0.36% lower error rate when quantized with TTQ.
- ImageNet (AlexNet): TTQ matches or exceeds the accuracy of full-precision AlexNet, surpassing previous binary and ternary quantization methods by up to 3% absolute in Top-1 accuracy.
- ResNet-18 on ImageNet: TTQ achieves higher accuracy and less degradation than prior TWN/BWN methods, though it remains roughly 3% behind the full-precision baseline in Top-1 error (see the table below).
- Compression: TTQ consistently achieves a 16× reduction in model size (Zhu et al., 2016, Liu et al., 2023).
Table: Typical TTQ results (drawn from (Zhu et al., 2016, Liu et al., 2023)):
| Model | Full-precision Top-1 error | TTQ (2-bit) Top-1 error | Δ (TTQ − FP) |
|---|---|---|---|
| ResNet-32 (CIFAR-10) | 7.67% | 7.63% | −0.04% |
| AlexNet (ImageNet) | 42.8% | 42.5% | −0.3% |
| ResNet-18 (ImageNet) | 30.4% | 33.4% | +3.0% |
5. Mechanisms and Theoretical Properties
TTQ offers several key advantages:
- Expressivity: Layer-wise, independent positive/negative scaling factors ($W^p_\ell$, $W^n_\ell$) improve representational flexibility versus symmetric ternary or binary schemes.
- Adaptive Regularization: The ternarization and sparsification act as regularizers, mitigating overfitting and sometimes improving generalization, especially in deeper networks.
- Sparsity: Typical TTQ models feature 30–50% zero weights; convolutional layer sparsity can reach 60–70%. This sparsity directly translates to lower arithmetic and memory fetch costs during inference.
- Hardware benefits: With only two inner-product accumulations and two scaling multiplications per output channel, energy and latency are reduced significantly on custom hardware. Empirical evidence suggests TTQ models can run in less than 30% of the wall-clock inference time of full-precision models on dedicated accelerators (Zhu et al., 2016).
In federated settings, FTTQ, a TTQ variant, enables significant communication efficiency (approximately 16× reduction in uplink/downlink bandwidth) while retaining convergence and accuracy guarantees. Under certain symmetry assumptions on weight distributions, the two learned scales in TTQ converge towards equality, suggesting a further compression opportunity with a single scaling factor (Xu et al., 2020).
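The sketch below illustrates that single-scale idea applied to a flattened client update before upload, and reports the resulting sparsity and payload size; the thresholding rule and the choice of the shared scale as the mean magnitude of the retained entries are assumptions made for illustration, not the FTTQ specification.

```python
import numpy as np

def ternarize_update_single_scale(update, t=0.05):
    """Quantize a flattened client update to {-s, 0, +s} with one shared scale s."""
    delta = t * np.abs(update).max()
    mask = np.abs(update) > delta
    # Assumption: shared scale = mean magnitude of the retained entries.
    s = np.abs(update[mask]).mean() if mask.any() else 0.0
    q = np.sign(update) * mask * s
    sparsity = 1.0 - mask.mean()              # fraction of zeroed entries
    payload_bits = 2 * update.size + 32       # 2-bit codes + one float32 scale
    return q, s, sparsity, payload_bits

update = np.random.default_rng(1).normal(size=100_000).astype(np.float32)
q, s, sparsity, bits = ternarize_update_single_scale(update)
print(f"scale={s:.4f}  sparsity={sparsity:.2%}  compression={32 * update.size / bits:.1f}x")
```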
6. Comparison to Related Quantization Methods
TTQ generalizes earlier ternary approaches such as Ternary Weight Networks (TWN) by introducing two independently learnable scales instead of a single one, enabling asymmetric quantization. In comparison to binary quantization or fixed symmetric ternarization, TTQ reliably achieves higher accuracy at a comparable bit budget (Liu et al., 2023). Extensions of the TTQ framework (e.g., truncated Gaussian-based ternarization) further expand the joint learning paradigm by optimizing both quantizer thresholds and scaling factors (He et al., 2018).
7. Extensions and Variants
TTQ has motivated numerous extensions. In federated learning, the Federated Trained Ternary Quantization (FTTQ) and Ternary Federated Averaging (T-FedAvg) protocols adapt TTQ for edge settings, theoretically proving unbiasedness, convergence, and reduced weight divergence for non-IID data distributions (Xu et al., 2020). Meanwhile, further research on quantizer optimization—such as simultaneous learning of quantizer thresholds and weights using truncated Gaussian approximations—broadens the range of ternary quantization methods deployable in hardware and resource-constrained contexts (He et al., 2018, Liu et al., 2023).
References:
- (Zhu et al., 2016): https://arxiv.org/abs/1612.01064
- (Liu et al., 2023): https://arxiv.org/abs/2303.01505
- (Xu et al., 2020): https://arxiv.org/abs/2003.03564
- (He et al., 2018): https://arxiv.org/abs/1810.01018