
Trained Ternary Quantization (1612.01064v3)

Published 4 Dec 2016 in cs.LG

Abstract: Deep neural networks are widely used in machine learning applications. However, large neural network models can be difficult to deploy on mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values. This method incurs very little accuracy degradation and can even improve the accuracy of some models (32-, 44-, and 56-layer ResNet) on CIFAR-10 and AlexNet on ImageNet. Our AlexNet model is trained from scratch, which means it is as easy to train as a normal full-precision model. We highlight that our trained quantization method learns both the ternary values and the ternary assignment. During inference, only the ternary values (2-bit weights) and scaling factors are needed, so our models are nearly 16x smaller than full-precision models. Our ternary models can also be viewed as sparse binary weight networks, which can potentially be accelerated with custom circuits. Experiments on CIFAR-10 show that ternary models obtained by the trained quantization method outperform full-precision ResNet-32, -44, and -56 models by 0.04%, 0.16%, and 0.36%, respectively. On ImageNet, our model outperforms the full-precision AlexNet model by 0.3% Top-1 accuracy and outperforms previous ternary models by 3%.

Overview of "Trained Ternary Quantization"

Introduction

The research presented in "Trained Ternary Quantization" addresses one of the fundamental challenges in deploying deep neural networks (DNNs) on edge devices: the substantial resource requirements associated with large model sizes. Traditional DNNs, with millions of parameters, are computationally intensive and memory-demanding, making their deployment on mobile devices with limited power budgets difficult. The paper proposes Trained Ternary Quantization (TTQ), which mitigates this issue by reducing the precision of network weights to ternary values: zero plus a per-layer positive and negative value that are themselves learned during training.

Methodology

The TTQ approach diverges from conventional quantization methods by introducing two key innovations:

  1. Learned Scaling Factors: Instead of using fixed scaling factors or symmetric thresholds, TTQ employs two trainable full-precision scaling coefficients, $W^p$ and $W^n$, for the positive and negative weights of each layer, respectively. These coefficients are optimized during training, enabling the model to learn the optimal ternary values for the weights.
  2. Gradient Backpropagation: During the training phase, gradients are backpropagated to both the scaling coefficients and the latent full-precision weights (see the schematic rule after this list). This allows the ternary assignments to be adjusted dynamically, enhancing the model's capacity to accurately represent the learned features.
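
Schematically, if $I_p$, $I_n$, and $I_z$ denote the index sets of weights quantized to $+W^p$, $-W^n$, and $0$, with $w^t$ the ternary weights and $\tilde{w}$ the latent full-precision weights, the gradient routing described above can be sketched as follows (a paraphrase of the paper's update rule rather than a verbatim reproduction; the analogous accumulation over $I_n$ gives the gradient for $W^n$):

$$\frac{\partial L}{\partial W^p} = \sum_{i \in I_p} \frac{\partial L}{\partial w^t_i}, \qquad \frac{\partial L}{\partial \tilde{w}_i} = \begin{cases} W^p \cdot \dfrac{\partial L}{\partial w^t_i}, & i \in I_p \\[4pt] 1 \cdot \dfrac{\partial L}{\partial w^t_i}, & i \in I_z \\[4pt] W^n \cdot \dfrac{\partial L}{\partial w^t_i}, & i \in I_n \end{cases}$$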

The quantization process involves normalizing the full-precision weights, applying a threshold factor to determine ternary values, and then training the model using both the scaling coefficients and the ternary weights.
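A minimal NumPy sketch of this forward quantization step is given below. It is an illustration under stated assumptions rather than a reproduction of the authors' code: the function name and toy weights are invented, and the threshold fraction `t` is treated as a hyperparameter.

```python
import numpy as np

def ttq_quantize(w_latent, w_p, w_n, t=0.05):
    """Sketch of TTQ forward quantization for one layer.

    w_latent : latent full-precision weights (trained as usual)
    w_p, w_n : positive scalar scaling coefficients, learned per layer
    t        : threshold fraction of the largest weight magnitude
               (a hyperparameter; 0.05 is an illustrative choice)
    """
    # Normalize latent weights so a single threshold fraction works across layers.
    w = w_latent / np.max(np.abs(w_latent))

    # Layer-wise threshold: a fixed fraction of the maximum magnitude (1 after normalization).
    delta = t * np.max(np.abs(w))

    # Ternary assignment: +W^p above the threshold, -W^n below, 0 in between.
    return np.where(w > delta, w_p,
           np.where(w < -delta, -w_n, 0.0))

# Toy usage with a random weight matrix and made-up scaling coefficients.
rng = np.random.default_rng(0)
w_latent = rng.normal(size=(4, 4)).astype(np.float32)
print(ttq_quantize(w_latent, w_p=1.2, w_n=0.8))
```

During training, the latent weights and the two coefficients are what get updated; at inference time only the 2-bit ternary codes and the two scaling factors per layer need to be stored.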

Experimental Results

CIFAR-10 Dataset

Experiments on the CIFAR-10 dataset demonstrate the efficacy of the TTQ method. When applied to ResNet architectures with varying depths (32, 44, and 56 layers), TTQ often surpasses the accuracy of full-precision models. For instance:

  • ResNet-32: TTQ improves accuracy by 0.04%.
  • ResNet-44: TTQ improves accuracy by 0.16%.
  • ResNet-56: TTQ improves accuracy by 0.36%.

These results indicate that the deeper the network, the greater the potential benefit of TTQ, likely due to its ability to balance model capacity and regularization.

ImageNet Dataset

The TTQ method also shows strong performance on the ImageNet dataset, a more challenging and large-scale benchmark:

  • AlexNet: The TTQ model reaches 42.5% Top-1 error, outperforming the full-precision model by 0.3% Top-1 accuracy and previous ternary networks by roughly 3% (TWN reports 45.5% Top-1 error). The TTQ model also shrinks weight storage by approximately 16x, resulting in a far lighter yet highly accurate model suited for edge deployment.

Practical and Theoretical Implications

Practical Implications

  1. Model Compression: The roughly 16x reduction in model size greatly facilitates deployment on resource-constrained devices, enabling advanced DNN architectures in settings such as autonomous driving and mobile apps, where over-the-air updates and storage limitations are significant constraints (a back-of-the-envelope size estimate follows this list).
  2. Energy Efficiency: With fewer parameters to load from memory and fewer computations needed, the TTQ method reduces both computational and memory bandwidth requirements, prolonging battery life for mobile devices.
  3. Custom Hardware: The sparsity introduced through ternary quantization suggests potential acceleration using custom circuits designed to exploit these sparse operations, further enhancing inference efficiency.
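
The 16x figure follows directly from the weight bit widths (32-bit floats versus 2-bit ternary codes). A back-of-the-envelope estimate, using the commonly cited ~61M AlexNet parameter count purely as an illustrative assumption:

```python
# Rough storage comparison: full-precision vs. ternary (2-bit) weights.
params = 61_000_000          # approximate AlexNet parameter count (illustrative assumption)
full_precision_bits = 32     # float32 weights
ternary_bits = 2             # two bits are enough to encode {-W^n, 0, +W^p}

full_mb = params * full_precision_bits / 8 / 1e6
ternary_mb = params * ternary_bits / 8 / 1e6
print(f"full precision: {full_mb:.0f} MB, ternary: {ternary_mb:.0f} MB, "
      f"ratio: {full_precision_bits // ternary_bits}x")
# The two per-layer scaling coefficients add only a handful of floats per layer,
# which is why the overall compression is "nearly" rather than exactly 16x.
```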

Theoretical Implications

  1. Quantization Strategies: The success of learned scaling coefficients highlights the importance of adaptable quantization strategies over static or heuristic approaches, suggesting new directions for research in quantization-aware training.
  2. Regularization Effects: The improvement in accuracy over full-precision models for deeper networks hints at an intrinsic regularization effect provided by ternary weights, which may prevent overfitting and enhance generalization.

Future Developments

Future research in AI and DNN deployment could explore extending TTQ to other architectures such as Transformers, which are increasingly used for tasks requiring high computational power. Additionally, integrating TTQ with other model compression techniques like pruning and knowledge distillation could yield further improvements in both model size and performance. Custom hardware implementations leveraging the sparsity of ternary weights also represent a promising avenue for achieving real-time inference in extremely resource-constrained environments.

Conclusion

The TTQ method offers a significant advancement in neural network quantization, effectively combining model compression with minimal loss in accuracy. By learning both the scaling factors and ternary assignments during training, TTQ achieves state-of-the-art results on challenging datasets, demonstrating its potential for practical deployment on edge devices and providing insights for future research in efficient neural network quantization.

Authors (4)
  1. Chenzhuo Zhu (3 papers)
  2. Song Han (155 papers)
  3. Huizi Mao (13 papers)
  4. William J. Dally (21 papers)
Citations (1,014)