
Trained Ternary Quantization (TTQ)

Updated 30 December 2025
  • Trained Ternary Quantization (TTQ) is a method that quantizes network weights to {-Wⁿ, 0, +Wᵖ} using per-layer learnable scaling factors for improved representation.
  • TTQ reduces 32-bit weights to 2 bits, achieving nearly 16× compression with strong empirical performance across benchmarks like CIFAR-10 and ImageNet.
  • TTQ offers practical benefits including hardware efficiency, lower latency, and enhanced communication in federated learning through sparse representations.

Trained Ternary Quantization (TTQ) is a neural network quantization technique that constrains weights to ternary values and introduces learnable, per-layer scaling factors for positive and negative weights. The TTQ methodology reduces model size and inference complexity, while preserving—or in some cases even improving—model accuracy, especially for deep convolutional architectures. TTQ has become an influential reference for quantization-aware training, hardware-efficient inference, and communication-efficient federated learning (Zhu et al., 2016, Liu et al., 2023, Xu et al., 2020).

1. Formulation of Trained Ternary Quantization

In TTQ, each layer $\ell$ maintains full-precision latent weights $\tilde w_{\ell,i}$ and two trainable scaling factors, $W^p_{\ell} > 0$ for positive ternary weights and $W^n_{\ell} > 0$ for negative ones. A per-layer threshold $\Delta_{\ell}$ is set as a fraction of the maximum absolute latent weight, $\Delta_{\ell} = t \cdot \max_i |\tilde w_{\ell,i}|$, with $t \approx 0.05$ in empirical settings.

The ternary quantization function assigns each latent weight to

$$
w^t_{\ell,i} =
\begin{cases}
+W^p_{\ell} & \text{if } \tilde w_{\ell,i} > \Delta_{\ell} \\
0 & \text{if } |\tilde w_{\ell,i}| \le \Delta_{\ell} \\
-W^n_{\ell} & \text{if } \tilde w_{\ell,i} < -\Delta_{\ell}
\end{cases}
$$

This produces a per-layer codebook $\{-W^n_{\ell}, 0, +W^p_{\ell}\}$ (Zhu et al., 2016, Liu et al., 2023, Xu et al., 2020).
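
A minimal NumPy sketch of this forward quantization rule follows; the function name, shapes, and scale values are illustrative rather than taken from the cited papers.

```python
import numpy as np

def ttq_quantize(w_latent, W_p, W_n, t=0.05):
    """Ternary-quantize one layer's latent weights (TTQ forward rule).

    w_latent : full-precision latent weights, any shape
    W_p, W_n : positive per-layer scales (learned during training)
    t        : threshold fraction of the max absolute latent weight
    Returns the quantized weights and the {-1, 0, +1} codes.
    """
    delta = t * np.max(np.abs(w_latent))      # per-layer threshold
    codes = np.zeros_like(w_latent)
    codes[w_latent > delta] = 1.0             # mapped to +W_p
    codes[w_latent < -delta] = -1.0           # mapped to -W_n
    w_quant = W_p * (codes > 0) - W_n * (codes < 0)
    return w_quant, codes

# Illustrative usage:
w = 0.1 * np.random.randn(64, 32)
w_q, codes = ttq_quantize(w, W_p=0.8, W_n=0.6)
```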

2. Optimization and Training Algorithm

Standard network losses $L(\cdot)$ (such as cross-entropy) are minimized with respect to the quantized weights. TTQ augments the loss with L2 decay on the latent weights:

$$
J = L(\ldots, w^t_\ell, \ldots) + \frac{\lambda}{2} \sum_{\ell,i} (\tilde w_{\ell,i})^2
$$

Gradient-based updates are performed for both the scaling factors and the latent weights:

  • Scaling factor gradients:

$$
\frac{\partial J}{\partial W^p_\ell} = \sum_{i \in I^p_\ell} \frac{\partial L}{\partial w^t_{\ell,i}},
\qquad
\frac{\partial J}{\partial W^n_\ell} = -\sum_{i \in I^n_\ell} \frac{\partial L}{\partial w^t_{\ell,i}}
$$

where $I^p_\ell$, $I^0_\ell$, and $I^n_\ell$ index the weights of layer $\ell$ quantized to $+W^p_\ell$, $0$, and $-W^n_\ell$, respectively.

  • Latent weight gradients (using a modified Straight-Through Estimator):

$$
\frac{\partial J}{\partial \tilde w_{\ell,i}} =
\begin{cases}
W^p_\ell \cdot \dfrac{\partial L}{\partial w^t_{\ell,i}} + \lambda \tilde w_{\ell,i}, & i \in I^p_\ell \\[4pt]
\dfrac{\partial L}{\partial w^t_{\ell,i}} + \lambda \tilde w_{\ell,i}, & i \in I^0_\ell \\[4pt]
W^n_\ell \cdot \dfrac{\partial L}{\partial w^t_{\ell,i}} + \lambda \tilde w_{\ell,i}, & i \in I^n_\ell
\end{cases}
$$

All parameters are updated using SGD or Adam. Pseudocode for both forward and backward passes is provided in (Zhu et al., 2016) and (Liu et al., 2023).
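
For concreteness, the per-layer backward pass can be sketched as follows; this is a hypothetical NumPy helper that applies the gradient formulas stated above, not the authors' reference implementation.

```python
import numpy as np

def ttq_backward(grad_wq, w_latent, codes, W_p, W_n, weight_decay=0.0):
    """Per-layer TTQ backward pass, applying the gradient formulas above.

    grad_wq : dL/dw^t, gradient w.r.t. the quantized weights (same shape)
    codes   : {-1, 0, +1} assignment from the forward quantization
    Returns gradients for W_p, W_n and the latent weights.
    """
    pos, neg = codes > 0, codes < 0

    # Scale-factor gradients: sum the upstream gradient over each index set.
    grad_Wp = grad_wq[pos].sum()
    grad_Wn = -grad_wq[neg].sum()   # minus sign since w^t = -W_n on I^n

    # Latent weights: modified straight-through estimator, rescaled by the
    # matching factor (1 in the zero region), plus the L2-decay term.
    ste_scale = np.where(pos, W_p, np.where(neg, W_n, 1.0))
    grad_latent = ste_scale * grad_wq + weight_decay * w_latent

    return grad_Wp, grad_Wn, grad_latent

# One plain SGD step (the learning rate is illustrative):
# W_p -= 0.01 * grad_Wp; W_n -= 0.01 * grad_Wn; w_latent -= 0.01 * grad_latent
```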

3. Inference and Model Compression

After training, all latent weights are discarded. Each layer stores:

  • A ternary (2-bit-per-weight) tensor $w^t_\ell \in \{-1, 0, +1\}^{d_\ell}$,
  • Two 32-bit scaling factors $W^p_\ell$ and $W^n_\ell$.

This results in significant model compression: from 32 bits per weight to 2 bits (a ~16× reduction). Inference computes inner products and convolutions as

$$
y = W^p_\ell \sum_{i:\, w^t_{\ell,i}=+1} x_i \;-\; W^n_\ell \sum_{i:\, w^t_{\ell,i}=-1} x_i
$$

Zero weights are skipped, reducing memory and computation requirements, especially on hardware with sparse computation capabilities (Zhu et al., 2016).
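
A minimal sketch of this decomposition for a fully connected layer, assuming the ternary codes are stored as a {-1, 0, +1} matrix (names and shapes are illustrative):

```python
import numpy as np

def ttq_linear(x, codes, W_p, W_n):
    """Inference for one ternary fully connected layer.

    x     : input vector, shape (d_in,)
    codes : stored 2-bit weight codes in {-1, 0, +1}, shape (d_out, d_in)
    Only two masked accumulations and two scalar multiplies are needed;
    zero-coded weights contribute nothing and can be skipped entirely.
    """
    pos_sum = (codes > 0).astype(x.dtype) @ x   # inputs hitting +W_p weights
    neg_sum = (codes < 0).astype(x.dtype) @ x   # inputs hitting -W_n weights
    return W_p * pos_sum - W_n * neg_sum
```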

4. Empirical Evaluation

TTQ demonstrates strong empirical performance across multiple benchmarks:

  • CIFAR-10 (ResNet-32/44/56): TTQ slightly outperforms full-precision baselines; e.g., ResNet-56 achieves a 0.36% lower error rate when quantized with TTQ.
  • ImageNet (AlexNet): TTQ matches or exceeds the accuracy of full-precision AlexNet, surpassing previous binary and ternary quantization methods by up to 3% absolute in Top-1 accuracy.
  • ResNet-18 on ImageNet: TTQ stays within a few percentage points of the full-precision baseline (about 3% Top-1 and 2% Top-5 error), with higher accuracy and less degradation than prior TWN/BWN methods.
  • Compression: TTQ consistently achieves a ~16× reduction in model size (Zhu et al., 2016, Liu et al., 2023).

Table: Typical TTQ results, drawn from (Zhu et al., 2016, Liu et al., 2023):

Model                 | Full-precision     | TTQ (2-bit)        | Δ (TTQ − FP)
ResNet-32 (CIFAR-10)  | 7.67% error        | 7.63% error        | −0.04%
AlexNet (ImageNet)    | 42.8% Top-1 error  | 42.5% Top-1 error  | −0.3%
ResNet-18 (ImageNet)  | 30.4% Top-1 error  | 33.4% Top-1 error  | +3.0%

5. Mechanisms and Theoretical Properties

TTQ offers several key advantages:

  • Expressivity: Layer-wise, independent positive/negative scaling factors (WpW^p_\ell, WnW^n_\ell) improve representational flexibility versus symmetric ternary or binary schemes.
  • Adaptive Regularization: The binarization and sparsification act as regularizers, mitigating overfitting and sometimes increasing generalization performance, especially in deeper networks.
  • Sparsity: Typical TTQ models feature 30–50% zero weights, and convolutional-layer sparsity can reach 60–70%. This sparsity directly translates to lower arithmetic and memory-fetch costs during inference (a rough per-layer estimate is sketched after this list).
  • Hardware benefits: With only two inner-product accumulations and two scaling multiplications per output channel, energy and latency are reduced significantly on custom hardware. Empirical evidence suggests TTQ models can run in less than 30% of the wall-clock inference time of full-precision models on dedicated accelerators (Zhu et al., 2016).
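
A back-of-the-envelope helper for these sparsity and storage figures (hypothetical, assuming 2-bit codes plus two 32-bit scales per layer):

```python
import numpy as np

def ttq_layer_stats(codes, scale_bits=32):
    """Rough storage and sparsity figures for one TTQ layer.

    codes : stored ternary codes in {-1, 0, +1}
    Assumes 2 bits per weight plus two full-precision scales per layer.
    """
    n = codes.size
    sparsity = float(np.mean(codes == 0))   # fraction of skippable weights
    ttq_bits = 2 * n + 2 * scale_bits       # 2-bit codes + W_p, W_n
    fp32_bits = 32 * n                      # full-precision baseline
    return {"sparsity": sparsity, "compression": fp32_bits / ttq_bits}
```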

In federated settings, FTTQ, a TTQ variant, enables significant communication efficiency (a ~16× reduction in uplink/downlink bandwidth) while retaining convergence and accuracy guarantees. Under certain symmetry assumptions on weight distributions, the two learned scales in TTQ converge towards equality, suggesting a further compression opportunity with a single scaling factor (Xu et al., 2020).

6. Comparison with Related Quantization Methods

TTQ generalizes earlier ternary approaches such as Ternary Weight Networks (TWN) by introducing two learnable scales instead of a single one, enabling asymmetric quantization. Compared with binary quantization or fixed symmetric ternarization, TTQ reliably achieves higher accuracy at the same bit budget (Liu et al., 2023). Extensions of the TTQ framework (e.g., truncated Gaussian-based ternarization) further expand the joint learning paradigm by optimizing both quantizer thresholds and scaling factors (He et al., 2018).

7. Extensions and Variants

TTQ has motivated numerous extensions. In federated learning, the Federated Trained Ternary Quantization (FTTQ) and Ternary Federated Averaging (T-FedAvg) protocols adapt TTQ for edge settings, theoretically proving unbiasedness, convergence, and reduced weight divergence for non-IID data distributions (Xu et al., 2020). Meanwhile, further research on quantizer optimization—such as simultaneous learning of quantizer thresholds and weights using truncated Gaussian approximations—broadens the range of ternary quantization methods deployable in hardware and resource-constrained contexts (He et al., 2018, Liu et al., 2023).


References:

(Zhu et al., 2016): https://arxiv.org/abs/1612.01064
(Liu et al., 2023): https://arxiv.org/abs/2303.01505
(Xu et al., 2020): https://arxiv.org/abs/2003.03564
(He et al., 2018): https://arxiv.org/abs/1810.01018
